Model Forecasting

What can we forecast?

Forecasting is difficult


Which is easiest to forecast?

  1. daily electricity demand in 3 days' time
  2. timing of next Halley’s comet appearance
  3. time of sunrise this day next year
  4. Google stock price tomorrow
  5. Google stock price in 6 months' time
  6. maximum temperature tomorrow
  7. exchange rate of $US/AUS next week
  8. total sales of drugs in Australian pharmacies next month
  9. debit card usage in Iceland (in million ISK)
  • how do we measure “easiest”?
  • what makes something easy/difficult to forecast?

Electricity demand

Debit Card Usage in Iceland

Factors affecting forecastability

Something is easier to forecast if:

  • we have a good understanding of the factors that contribute to it;
  • there is lots of data available;
  • the forecasts cannot affect the thing we are trying to forecast;
  • there is relatively low natural/unexplainable random variation;
  • the future is somewhat similar to the past.

Time series data

  • Daily IBM stock prices
  • Monthly rainfall
  • Quarterly Amazon Revenues
  • Monthly milk production per cow

Quarterly Amazon Revenues

Monthly milk production per cow

The statistical forecasting perspective

Sample futures

Forecast intervals

Statistical forecasting

  • Thing to be forecast: a random variable, \(y_t\).
  • Forecast distribution: If \({\cal I}\) is all observations, then \(y_{t} |{\cal I}\) means “the random variable \(y_{t}\) given what we know in \({\cal I}\)”.
  • The “point forecast” is the mean (or median) of \(y_{t} |{\cal I}\).
  • The “forecast variance” is \(\text{var}[y_{t} |{\cal I}]\).
  • A prediction interval or “interval forecast” is a range of values of \(y_t\) with high probability of containing the actual value.
  • With time series, \(y_{t|t-1} = y_t | \{y_1, y_2, \dots, y_{t-1}\}\).
  • \(\hat{y}_{T+h|T} =\text{E}[y_{T+h} | y_1,\dots,y_T]\) (an \(h\)-step forecast taking account of all observations up to time \(T\)).

Forecasting methods implemented in forecast

The forecast package by Rob Hyndman is currently the most feature-complete R package for time series forecasting. The most important methods it implements for (univariate) time series forecasting are listed below; a short usage sketch follows the list.

  • RW-DRIFT: Random Walk with Drift.
  • NAIVE: The naive method, using the last observation of the series as the forecasts.
  • SEASONAL NAIVE: The forecasts are the last observed value of the same season.
  • ETS: Exponential smoothing state space model.
  • STLM-AR: Seasonal and Trend decomposition using Loess with AR modeling of the seasonally adjusted series
  • THETAF: The theta method of Assimakopoulos and Nikolopoulos (2000).
  • ARIMA: The ARIMA model.
  • NNETAR: Feed-forward neural network
  • TBATS: Exponential smoothing state space model with Trigonometric seasonality, Box-Cox transformation, ARMA errors, Trend and Seasonal components.

See also https://otexts.com/fpp2/arima-r.html for the forecasting book by Rob Hyndman.
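Most of these methods share the same workflow: fit a model, forecast h steps ahead, and plot. A minimal sketch, assuming the monthly debitcards series from the fpp2 package (the Icelandic debit card usage shown in the figures below):

library(forecast)
library(fpp2)                       # assumed source of the debitcards series

h <- 24                             # forecast horizon in months
fc_naive  <- naive(debitcards, h = h)            # NAIVE
fc_snaive <- snaive(debitcards, h = h)           # SEASONAL NAIVE
fc_theta  <- thetaf(debitcards, h = h)           # THETAF
fc_ets    <- forecast(ets(debitcards), h = h)    # ETS
fc_tbats  <- forecast(tbats(debitcards), h = h)  # TBATS

autoplot(fc_snaive)                 # works for any of the forecast objects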

RW-DRIFT

Random walk model with drift using forecast::rwf(debitcards, h = 24, drift = TRUE).
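A runnable sketch of this call, assuming the packages and the debitcards series from the overview sketch above are loaded:

fc_rwf <- rwf(debitcards, h = 24, drift = TRUE)  # random walk with drift
autoplot(fc_rwf)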

ARIMA

Using the algorithm described in https://otexts.com/fpp2/arima-r.html via forecast::auto.arima(debitcards, stepwise=FALSE, approximation=FALSE).
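A runnable sketch, forecasting 24 months ahead (same assumed debitcards series as above):

fit_aa <- auto.arima(debitcards, stepwise = FALSE, approximation = FALSE)
autoplot(forecast(fit_aa, h = 24))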

NNETAR

Feed-forward neural network using lagged inputs: NNAR(\(p\),\(k\)), where \(p\) is the number of lagged inputs and \(k\) is the number of nodes in the single hidden layer.
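A runnable sketch; nnetar() chooses \(p\) and \(k\) automatically if they are not supplied (same assumed debitcards series as above):

fit_nn <- nnetar(debitcards)
autoplot(forecast(fit_nn, h = 24))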

The ts object

A time series can be thought of as a list of numbers, along with some information about what times those numbers were recorded. This information can be stored as a ts object in R.

Suppose you have annual observations for the last few years:

Year   Observation
2012   123
2013   39
2014   78
2015   52
2016   110

We turn this into a ts object using the ts() function:
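For example, using the observations from the table above:

y <- ts(c(123, 39, 78, 52, 110), start = 2012)  # annual data: only the start year is needed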

If you have annual data, with one observation per year, you only need to provide the starting year (or the ending year).

For observations that are more frequent than once per year, you simply add a frequency argument. For example, if your monthly data is already stored as a numerical vector z, then it can be converted to a ts object like this:

y <- ts(z, start=2003, frequency=12)

Almost all of the data used in this book is already stored as ts objects. But if you want to work with your own data, you will need to use the ts() function before proceeding with the analysis.

Frequency of a time series

The “frequency” is the number of observations before the seasonal pattern repeats (see the note below the table). When using the ts() function in R, the following choices should be used.

Data        frequency
Annual      1
Quarterly   4
Monthly     12
Weekly      52

Actually, there are not \(52\) weeks in a year, but \(365.25/7=52.18\) on average, allowing for a leap year every fourth year. But most functions which use ts objects require integer frequency.

If the frequency of observations is greater than once per week, then there is usually more than one way of handling the frequency. For example, data with daily observations might have a weekly seasonality (\(frequency = 7\)) or an annual seasonality (\(frequency = 365.25\)). Similarly, data that are observed every minute might have

  • an hourly seasonality (\(frequency = 60\)),
  • a daily seasonality (\(frequency = 24 \times 60 = 1440\)),
  • a weekly seasonality (\(frequency = 24 \times 60 \times 7 = 10080\)), and
  • an annual seasonality (\(frequency = 24 \times 60 \times 365.25 = 525960\)).

If you want to use a ts object, then you need to decide which of these is the most important.
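For example, a sketch where x is assumed to be a numeric vector of daily observations:

y_week <- ts(x, frequency = 7)       # model the weekly seasonal pattern
y_year <- ts(x, frequency = 365.25)  # model the annual seasonal pattern instead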

Classical Time Series Decomposition

Time Series Components

If we assume an additive decomposition, then we can write

\[y_t = S_t + T_t + R_t\] where \(y_t\) is the data, \(S_t\) is the seasonal component, \(T_t\) is the trend-cycle component, and \(R_t\) is the remainder component, all at period \(t\). Alternatively, a multiplicative decomposition would be written as

\[y_t = S_t \times T_t \times R_t\] The additive decomposition is the most appropriate if the magnitude of the seasonal fluctuations, or the variation around the trend-cycle, does not vary with the level of the time series. When the variation in the seasonal pattern, or the variation around the trend-cycle, appears to be proportional to the level of the time series, then a multiplicative decomposition is more appropriate. Multiplicative decompositions are common with economic time series.

An alternative to using a multiplicative decomposition is to first transform the data until the variation in the series appears to be stable over time, then use an additive decomposition. When a log transformation has been used, this is equivalent to using a multiplicative decomposition because

\(y_t = S_t \times T_t \times R_t\) is equivalent to \(\log y_t = \log S_t + \log T_t + \log R_t\).

Classical Decomposition

The classical decomposition method originated in the 1920s. It is a relatively simple procedure, and forms the starting point for most other methods of time series decomposition. There are two forms of classical decomposition: an additive decomposition and a multiplicative decomposition. These are described below for a time series with seasonal period \(m\) (e.g., \(m=4\) for quarterly data, \(m=12\) for monthly data, \(m=7\) for daily data with a weekly pattern).

Additive decomposition

Step 1

If \(m\) is an even number, compute the trend-cycle component \(\hat{T}_t\) using a \(2 \times m\)-MA (an \(m\)-term moving average followed by a 2-term moving average to centre it). If \(m\) is an odd number, compute the trend-cycle component \(\hat{T}_t\) using an \(m\)-MA (a centred moving average of order \(m\)).

Step 2

Calculate the detrended series: \(y_t - \hat{T}_t\).

Step 3

To estimate the seasonal component for each season, simply average the detrended values for that season. For example, with monthly data, the seasonal component for March is the average of all the detrended March values in the data. These seasonal component values are then adjusted to ensure that they add to zero. The seasonal component is obtained by stringing together these monthly values, and then replicating the sequence for each year of data. This gives \(S_t\).

Step 4

The remainder component is calculated by subtracting the estimated seasonal and trend-cycle components: \(\hat{R}_t = y_t - \hat{T}_t - \hat{S}_t\).
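These four steps are what base R's decompose() carries out. A quick sketch on the monthly co2 series that ships with R:

dec <- decompose(co2, type = "additive")  # trend via centred MA, seasonal averages, remainder
plot(dec)                                 # panels: observed, trend, seasonal, random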

Exercise: Amazon Revenues

For the next exercise, we try to decompose the quarterly Amazon revenues. Follow the steps below:

  1. Decompose the time series object amzn_rev holding the Amazon revenue data using the additive type.
  2. Plot the resulting decomposed object. What do you see looking at the random/residual component?
  3. Try to improve the decomposition by either using log to preprocess the time series or using a multiplicative decomposition.

Now it should be easy to see

  1. The trend and the seasonal component.
  2. That additive decomposition (with log()) and multiplicative decomposition are equivalent.
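One possible sketch of the exercise above, assuming amzn_rev has been loaded from amzn_rev.rda as a quarterly ts object:

plot(decompose(amzn_rev, type = "additive"))        # remainder grows with the level
plot(decompose(amzn_rev, type = "multiplicative"))
plot(decompose(log(amzn_rev), type = "additive"))   # equivalent to the multiplicative form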

Exercise: Calendar adjustments

Some of the variation seen in seasonal data may be due to simple calendar effects. In such cases, it is usually much easier to remove the variation before fitting a forecasting model. The monthdays() function will compute the number of days in each month or quarter.

For example, if you are studying the monthly milk production on a farm, there will be variation between the months simply because of the different numbers of days in each month, in addition to the seasonal variation across the year.

For the next exercise, plot the monthly milk production. As a next step, divide the time series milk by monthdays(milk) to transform it into an average daily production series. How do the two series compare?
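A sketch, assuming the monthly milk series from the fma package (loaded together with fpp2):

autoplot(milk)                    # total production per month
autoplot(milk / monthdays(milk))  # average daily production: calendar effect removed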

Autoregressive models

In a multiple regression model, we forecast the variable of interest using a linear combination of predictors. In an autoregression model, we forecast the variable of interest using a linear combination of past values of the variable. The term autoregression indicates that it is a regression of the variable against itself.

Thus, an autoregressive model of order \(p\) can be written as

\[y_t = c + \phi_1 y_{t-1} + \phi_2 y_{t-2} + \dots + \phi_p y_{t-p} + \epsilon_t,\]

where \(\epsilon_t\) is white noise. This is like a multiple regression but with lagged values of \(y_t\) as predictors. We refer to this as an AR(\(p\)) model, an autoregressive model of order \(p\).
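An AR(\(p\)) model can be fitted with forecast::Arima() by setting the differencing and MA orders to zero. A sketch using the quarterly uschange consumption series, assumed to come from the fpp2 package:

library(fpp2)    # assumed source of the uschange data
fit_ar2 <- Arima(uschange[,"Consumption"], order = c(2, 0, 0))
fit_ar2          # reports the estimated mean, phi_1, phi_2 and residual variance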

Differencing

Note that the plot of the Google stock price above is non-stationary since the price level seems to be heavily dependent on time. However, the daily changes seem to be stationary:

This shows one way to make a non-stationary time series stationary — compute the differences between consecutive observations. This is known as differencing.
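A sketch, assuming goog, the daily Google closing prices from the fpp2 package:

autoplot(goog)         # price level: non-stationary
autoplot(diff(goog))   # daily changes: roughly stable mean and variance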

Moving Average Models

Rather than using past values of the forecast variable in a regression, a moving average model uses past forecast errors in a regression-like model.

\[y_t = c + \epsilon_t + \theta_1 \epsilon_{t-1} + \theta_2 \epsilon_{t-2} + \dots + \theta_q \epsilon_{t-q}\]

where \(\epsilon_t\) is white noise. We refer to this as an MA(\(q\)) model, a moving average model of order \(q\). Of course, we do not observe the values of \(\epsilon_t\), so it is not really a regression in the usual sense.

Notice that each value of \(y_t\) can be thought of as a weighted moving average of the past few forecast errors. However, moving average models should not be confused with the moving average smoothing as discussed previously. A moving average model is used for forecasting future values, while moving average smoothing is used for estimating the trend-cycle of past values.
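An MA(\(q\)) model can likewise be fitted with forecast::Arima() by setting the AR and differencing orders to zero (a sketch, using the same assumed uschange series as in the AR example above):

fit_ma1 <- Arima(uschange[,"Consumption"], order = c(0, 0, 1))
fit_ma1   # reports the estimated mean, theta_1 and the residual variance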

Forecasting using ARIMA

ARIMA models provide another approach to time series forecasting. Exponential smoothing and ARIMA models are the two most widely used approaches to time series forecasting, and provide complementary approaches to the problem. While exponential smoothing models are based on a description of the trend and seasonality in the data, ARIMA models aim to describe the autocorrelations in the data.

If we combine differencing with autoregression and a moving average model, we obtain a non-seasonal ARIMA model. ARIMA is an acronym for AutoRegressive Integrated Moving Average (in this context, “integration” is the reverse of differencing). The full model can be written as

\[y'_t = c + \phi_1 y'_{t-1} + \dots + \phi_p y'_{t-p} + \theta_1 \epsilon_{t-1} + \dots + \theta_q \epsilon_{t-q} + \epsilon_t,\]

where \(y'_t\) is the differenced series (it may have been differenced more than once). The “predictors” on the right hand side include both lagged values of \(y_t\) and lagged errors. We call this an ARIMA(p, d, q) model, where

  • \(p \dots\) order of the autoregressive part;
  • \(d \dots\) degree of first differencing involved;
  • \(q \dots\) order of the moving average part.

Special Cases of ARIMA Models

Many of the models we have already discussed are special cases of the ARIMA model, as shown below:

  • White noise: \(ARIMA(0,0,0)\)
  • Random walk: \(ARIMA(0,1,0)\) with no constant
  • Random walk with drift: \(ARIMA(0,1,0)\) with a constant
  • Autoregression: \(ARIMA(p,0,0)\)
  • Moving average: \(ARIMA(0,0,q)\)
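These correspondences can be checked directly with forecast::Arima(). For example, the drift method from earlier corresponds to an ARIMA(0,1,0) model with a constant (a sketch on the assumed debitcards series):

fit_drift <- Arima(debitcards, order = c(0, 1, 0), include.drift = TRUE)
autoplot(forecast(fit_drift, h = 24))   # same model family as rwf(debitcards, h = 24, drift = TRUE)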

US consumption

Example: US consumption

The plot below shows quarterly percentage changes in US consumption expenditure. Although it is a quarterly series, there does not appear to be a seasonal pattern, so we will fit a non-seasonal ARIMA model.

The following R code was used to select a model automatically.

fit <- auto.arima(uschange[,"Consumption"], 
                  seasonal=FALSE)

We can also plot the forecast using the pipe operator. Set \(h\) to specify the forecast horizon.
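A sketch (the %>% pipe is re-exported by the forecast package; h = 10 is an arbitrary horizon):

uschange[,"Consumption"] %>%
  auto.arima(seasonal = FALSE) %>%
  forecast(h = 10) %>%
  autoplot()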

Manual ARIMA

Typically, to fit ARIMA models manually we need to follow multiple steps (in this example using Amazon revenue data):

  • Load the data set of Amazon (AMZN) revenue data from amzn_rev.rda.
  • Plot the data and determine whether the series is stationary.
  • Should a log scale be used?
    • Plot decompose() of the time series and determine the right type parameter to use.
  • Run forecast::auto.arima() on the time series. Inspect the resulting model and determine whether
    • the resulting parameters make sense,
    • the model$residuals are normally distributed and not time-dependent,
    • the plotted forecast looks reasonable.
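One possible sketch of these steps, assuming amzn_rev.rda contains the quarterly ts object amzn_rev:

load("amzn_rev.rda")                     # assumed to contain the quarterly ts amzn_rev
autoplot(amzn_rev)                       # trend plus growing seasonal swings -> log scale
plot(decompose(amzn_rev, type = "multiplicative"))

fit <- auto.arima(amzn_rev, lambda = 0)  # lambda = 0 applies a log transformation
fit                                      # do the selected orders make sense?
checkresiduals(fit)                      # residual diagnostics: white noise? time-dependent?
autoplot(forecast(fit, h = 12))          # forecasts are back-transformed automatically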

Auto ARIMA

  • Try to use auto.arima() to automatically fit a forecasting model on Amazon sales data amzn_rev and forecast 12 months ahead.
  • Finally plot the resulting ARIMA forecast.
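A possible sketch, assuming amzn_rev is the quarterly series from the previous exercise, so four quarters correspond to 12 months:

fc_amzn <- forecast(auto.arima(amzn_rev), h = 4)  # 4 quarters, i.e. 12 months ahead
autoplot(fc_amzn)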

M4 Competition

Comparing automatic forecasting methods as evaluated in the most recent M-competition, M4.

History of M-Competitions

Purpose: Focus attention on what models produced good forecasts, rather than on the mathematical properties of those models.

  • M1 (1982): 1001 time series and 15 forecasting methods
  • M2 (1993): 29 time series
  • M3 (2000): 3003 time series (yearly, quarterly, monthly, daily, and other)
  • M4 (2018): 100,000 time series

Leaderboard M4

Evaluation Metrics

As an overview, the evaluation metrics used in M4 are listed below.

  • Symmetric Mean Absolute Percentage Error (sMAPE)
  • Mean Absolute Scaled Error (MASE)
  • Overall Weighted Average (OWA)

\[ sMAPE = \frac{1}{h}\sum_{t=1}^{h}\frac{2 |Y_t - \hat{Y_t}|}{|Y_t| + |\hat{Y_t}|} \] \[ MASE = \frac{1}{h}\frac{\sum_{t=1}^{h} |Y_t - \hat{Y_t}|}{\frac{1}{n-m}\sum_{t=m+1}^{n}|Y_t - Y_{t-m}|} \] where \(Y_t\) is the post-sample value of the time series at point \(t\), \(\hat{Y_t}\) the estimated forecast, \(h\) the forecasting horizon, \(n\) the number of in-sample observations, and \(m\) the frequency of the data (i.e., 12 for monthly series).
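A direct transcription of these two formulas into R (a sketch; y holds the post-sample actuals, yhat the forecasts, insample the training series, and m the frequency):

smape <- function(y, yhat) {
  mean(2 * abs(y - yhat) / (abs(y) + abs(yhat)))
}

mase <- function(y, yhat, insample, m) {
  scale <- mean(abs(diff(insample, lag = m)))  # in-sample seasonal naive error
  mean(abs(y - yhat)) / scale
}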

Conclusion from M4

  • Combination of methods better than single ones.
  • Pure Machine Learning approaches did not work well.
  • A hybrid method combining LSTMs with exponential smoothing worked best.

Outlook for M5: Use external regressors for forecasting, focus on causality.

Second Place M4: FFORMA, a combination of methods, see also https://robjhyndman.com/papers/fforma.pdf.

Plot Forecast comparison from M4

SMAPE by Method

SMAPE by Method and Period

Execution time