12 May 2020 - Abhishek Nandekar
Auto-Regressive $(AR)$ Processes
An Auto-Regressive process/model is a stochastic process that is widely used in time-series analysis.
- regressive $\implies$ like regression
- auto $\implies$ predict something based on its own past values.
Hence, the output variable depends linearly on its own past values and on a stochastic term (an imperfectly predictable term).
For example, stock prices or temperature (on a particular day) are usually based on values they took in the past.
As another example, consider a milk distributor who wants to know how much milk they should load on the supply truck for the upcoming month, based upon the selling trend of the previous months.
Naturally, finding optimal lags for the model is essential. Optimal lags:
- should avoid overfitting;
- can be estimated using the partial autocorrelation function (PACF), which measures the direct effect of a particular past state on the current state.
For the above example, a good model might look like:
\(m_t = \beta_0+\beta_1m_{t-1}+\beta_4m_{t-4}+\beta_{12}m_{t-12}+\epsilon_t\)
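As a quick illustration of picking such lags, here is a minimal sketch of inspecting the PACF with `statsmodels`; the monthly sales array is a hypothetical stand-in for real data.

```python
import numpy as np
from statsmodels.tsa.stattools import pacf

# hypothetical stand-in for monthly milk sales; use real data in practice
rng = np.random.default_rng(0)
sales = rng.normal(100, 10, size=120)  # 10 years of monthly values

# partial autocorrelation at lag k: the *direct* effect of the state
# k months ago on the current value
for k, v in enumerate(pacf(sales, nlags=12)):
    print(f"lag {k:2d}: pacf = {v:+.3f}")
```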
$AR(p)$ process
Here, $p$ is the order of the $AR$ process.
\(\begin{align*} X_t &= \mu + \phi_1X_{t-1}+ \phi_2X_{t-2}+ \phi_3X_{t-3} + \dots + \phi_pX_{t-p} + \epsilon_t \\ \implies X_t &= \mu + \sum_{i=1}^p \phi_iX_{t-i} + \epsilon_t \end{align*}\) where $\epsilon_t$ is white noise: $\mathbb E[\epsilon_t] = 0$, $\mathbb E[\epsilon_t^2] = \sigma^2_{\epsilon}$, and $\mathbb E[\epsilon_t\epsilon_s] = 0 ~~ \forall~ t \neq s$.
Not all values of the $\phi$'s are permitted in an $AR$ model. For an $AR$ series not to diverge, some constraints are required on the $\phi$ values. For that, the model is assumed to be wide-sense stationary.
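To see the divergence issue concretely, here is a small simulation sketch in plain NumPy, with $\phi$ values chosen only for illustration: one set keeps the series bounded, the other makes it explode.

```python
import numpy as np

def simulate_ar(phis, mu=0.0, sigma_eps=1.0, n=1000, seed=0):
    """Simulate X_t = mu + sum_i phi_i * X_{t-i} + eps_t with Gaussian white noise."""
    rng = np.random.default_rng(seed)
    p = len(phis)
    x = np.zeros(n + p)                 # first p entries are zero start-up values
    for t in range(p, n + p):
        x[t] = mu + sum(ph * x[t - i - 1] for i, ph in enumerate(phis)) \
               + rng.normal(0.0, sigma_eps)
    return x[p:]

x_ok = simulate_ar([0.5, 0.3])    # inside the stationary region: stays bounded
x_bad = simulate_ar([0.9, 0.3])   # outside the stationary region: diverges
print(np.abs(x_ok).max(), np.abs(x_bad).max())
```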
Stationarity
A desirable property of the series for which the forecasting is being done.
Types of stationarity:
- Weak/Wide-Sense Stationarity (WSS), or Covariance Stationarity
  - The mean, variance and autocovariance remain the same over time:
  \(\mu_X(t) = \mu_X \\ \sigma^2_X(t) = \sigma^2_X \\ c_X(t_1, t_2) = c_X(t_1 - t_2, 0)\)
- Strong/Strict-Sense Stationarity (SSS)
  - For a stochastic process $\{X_t\}$, if all $N$-point joint CDFs are invariant to arbitrary time shifts $\zeta$, the process is SSS:
  \(F_X(x_{t_1 + \zeta},~x_{t_2 + \zeta},\dots, x_{t_N + \zeta} ) = F_X(x_{t_1},~x_{t_2},\dots,x_{t_N})\)
Note: the Covariance Stationarity assumption is required to solve the AR model equations for the $\phi$'s.
Thus, for a process $\{Y_t\}$,
\(\mathbb E[Y_t] = \mu ~~ \forall~t \\ \mathbb E[(Y_t-\mu)(Y_{t-k} - \mu)] = \gamma_k~~\forall~t, k\)
$\implies$ the covariance depends only on $k$. When $k=0$, we get the variance: $\gamma_0 = \sigma_Y^2$.
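A minimal sketch of estimating $\gamma_k$ from data (the biased sample autocovariance, assuming a covariance-stationary series):

```python
import numpy as np

def autocovariance(y, k):
    """Biased sample estimate of gamma_k = E[(Y_t - mu)(Y_{t-k} - mu)]."""
    y = np.asarray(y, dtype=float)
    mu, n = y.mean(), len(y)
    return np.sum((y[k:] - mu) * (y[: n - k] - mu)) / n

# gamma_0 is then just the (biased) sample variance:
# autocovariance(y, 0) == np.var(y)
```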
$AR(1)$ process
\[\begin{align*} Y_t &= \phi_0 + \phi_1Y_{t-1} + \epsilon_t \\ \mathbb E[Y_t] &= \phi_0 + \phi_1\mathbb E[Y_{t-1}] + 0 \\ \mu &= \phi_0 + \phi_1\mu \\ \mu(1 - \phi_1) &= \phi_0 \\ \mu &= \frac{\phi_0}{1-\phi_1} \end{align*}\]If the series is centred about its mean, $\phi_0 = 0$, i.e. there is no constant term. Taking a series $Y_t$ and centring it about its mean:
\[\begin{align*} (Y_t - \mu) &= \phi_1(Y_{t-1} - \mu) + \epsilon_t \\ \mathbb E[(Y_{t} - \mu)(Y_{t-k} - \mu)] &= \phi_1 \mathbb E[(Y_{t-1} - \mu)(Y_{t-k} - \mu)] +\underbrace{\mathbb E[\epsilon_t(Y_{t-k} - \mu)]}_{=~0} \\ \implies &\fbox{ $\gamma_k = \phi_1\gamma_{k-1}\\ \gamma_1 = \phi_1\gamma_0 $} \\ \mathbb E[Y_tY_t] &= \mathbb E[(\phi_1Y_{t-1} + \epsilon_t)^2] \\ &= \mathbb E[\phi_1^2Y_{t-1}Y_{t-1} + \epsilon_t^2 + 2\phi_1 Y_{t-1}\epsilon_t] \\ \sigma_Y^2 &= \phi_1^2\sigma_Y^2 + \sigma_\epsilon^2 + 0 \\ &\fbox{$\sigma_Y^2 = \frac{\sigma_\epsilon^2}{1 - \phi_1^2} = \gamma_0$ (variance)} \end{align*}\]
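Both boxed results are easy to check numerically; a small sketch with an assumed $\phi_1 = 0.7$:

```python
import numpy as np

# check gamma_1 = phi_1 * gamma_0  and  gamma_0 = sigma_eps^2 / (1 - phi_1^2)
rng = np.random.default_rng(1)
phi1, sigma_eps, n = 0.7, 1.0, 200_000
y = np.zeros(n)
for t in range(1, n):
    y[t] = phi1 * y[t - 1] + rng.normal(0.0, sigma_eps)

y -= y.mean()
g0 = (y * y).mean()            # sample gamma_0 (variance)
g1 = (y[1:] * y[:-1]).mean()   # sample gamma_1

print(g1 / g0)                            # ~ 0.7  (= phi_1)
print(g0, sigma_eps**2 / (1 - phi1**2))   # both ~ 1.96
```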
$AR(p)$ process
(Assuming that the series is centred about its mean.)
\[Y_t = \phi_1Y_{t-1} + \phi_2Y_{t-2} + \dots + \phi_pY_{t-p} + \epsilon_t\]Multiplying both sides by $Y_{t-k}$ ($k \geq 1$) and taking $\mathbb E$: \[\mathbb E[Y_{t}Y_{t-k}] = \phi_1\mathbb E[Y_{t-1}Y_{t-k}] + \dots + \phi_p\mathbb E[Y_{t-p}Y_{t-k}] + \underbrace{\mathbb E[Y_{t-k}\epsilon_t]}_{=~0}\] \[\begin{equation*} \tag{$\ast$} \gamma_k = \phi_1\gamma_{k-1} + \phi_2\gamma_{k-2}+\ldots + \phi_p\gamma_{k-p} \label{e1} \end{equation*}\]Multiplying on both sides by $Y_t$ instead and taking $\mathbb E~\rightarrow$
\[\begin{equation*} \tag{$\star$} \gamma_0 = \phi_1\gamma_{1} + \phi_2\gamma_{2}+\ldots + \phi_p\gamma_{p} + \sigma_\epsilon^2 \label{e2} \end{equation*}\]For $\eqref{e1}$, suppose we take $p=3$ and write it out for $k = 3, 2, 1$:
\[\begin{align*} \gamma_3 &= \phi_1\gamma_{2} + \phi_2\gamma_{1}+ \phi_3\gamma_{0} \\ \gamma_2 &= \phi_1\gamma_{1} + \phi_2\gamma_{0}+ \underbrace{\phi_3\gamma_{-1}}_{=~0} \\ \gamma_1 &= \phi_1\gamma_{0} + \underbrace{\phi_2\gamma_{-1}}_{=~0}+ \underbrace{\phi_3\gamma_{-2}}_{=~0} \end{align*}\]The negative-lag terms in the expansions of $\gamma_2$ and $\gamma_1$ are taken to be $0$, as any time value is uncorrelated with its future. Hence,
\[\begin{bmatrix}\gamma_0& 0 & 0 \\ \gamma_1 & \gamma_0 & 0 \\ \gamma_2 & \gamma_1 & \gamma_0 \end{bmatrix} \begin{bmatrix}\phi_1 \\ \phi_2 \\ \phi_3 \end{bmatrix} = \begin{bmatrix} \gamma_1 \\ \gamma_2 \\ \gamma_3 \end{bmatrix}\]These are known as the Yule-Walker equations.
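A sketch of solving this triangular system with NumPy, given estimated autocovariances (the function name and layout are my own):

```python
import numpy as np

def yule_walker_phis(gammas):
    """Solve the triangular system above for phi_1..phi_p, given
    autocovariances gammas = [gamma_0, gamma_1, ..., gamma_p]."""
    p = len(gammas) - 1
    G = np.zeros((p, p))
    for k in range(p):              # row k encodes the equation for gamma_{k+1}
        for j in range(k + 1):
            G[k, j] = gammas[k - j]
    return np.linalg.solve(G, np.asarray(gammas[1:], dtype=float))
```

For example, `yule_walker_phis([autocovariance(y, k) for k in range(4)])` recovers $\phi_1, \phi_2, \phi_3$ under this convention, using `autocovariance` from the earlier sketch.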
Akaike Information Criterion (AIC)
Provides a means for model selection; it is used to choose the order of the $AR$ process.
\[AIC(k) = \ln\hat{\sigma}_{\epsilon, k}^2 + \frac{2k}{n}\]where $k$ is the assumed lag and $n$ is the total number of data points. The variance of the error term, as found from $\eqref{e2}$, is
\[\hat{\sigma}_\epsilon^2 = \frac{1}{n} \sum_{t=1}^n \hat{\epsilon}_t^2\]
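A sketch of scanning $AIC(k)$ over candidate lags; here the $\phi$'s are estimated by ordinary least squares on lagged values rather than via the Yule-Walker equations, which is just one of several possible choices:

```python
import numpy as np

def aic_ar(y, k_max=10):
    """AIC(k) = ln(sigma_hat^2) + 2k/n for AR(k) fits, k = 1..k_max."""
    y = np.asarray(y, dtype=float)
    y = y - y.mean()                  # centre the series
    n = len(y)
    aics = {}
    for k in range(1, k_max + 1):
        # column j of the design matrix holds the values lagged by j+1
        X = np.column_stack([y[k - j - 1:n - j - 1] for j in range(k)])
        target = y[k:]
        phi, *_ = np.linalg.lstsq(X, target, rcond=None)
        sigma2 = np.mean((target - X @ phi) ** 2)
        aics[k] = np.log(sigma2) + 2 * k / n
    return aics

# usage: aics = aic_ar(y); best_k = min(aics, key=aics.get)
```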
Vector Autoregressive Process (VAR)
A VAR(1) process:
\[\begin{align*} \begin{bmatrix}X_t^{(1)} \\ X_t^{(2)} \\ \vdots \\ X_t^{(m)}\end{bmatrix} &= \begin{bmatrix} \phi_{11} & \phi_{12} & \dots & \phi_{1m} \\ \phi_{21} & \phi_{22} & \dots & \phi_{2m} \\ \vdots & & \ddots & \vdots \\ \phi_{m1} & \phi_{m2} & \dots & \phi_{mm} \end{bmatrix} \begin{bmatrix} X_{t-1}^{(1)} \\ X_{t-1}^{(2)} \\ \vdots \\ X_{t-1}^{(m)} \end{bmatrix} + \overrightarrow{\textbf{E}_t} \\ \overrightarrow{X_t}&= \phi_{1_{m \times m}}\overrightarrow{X_{t-1}} + \overrightarrow{\textbf{E}_t} \\ \mathbb E[\overrightarrow{X_{t}}~\overrightarrow{X_{t-1}'}] &= \phi_{1_{m \times m}}\mathbb E[\overrightarrow{X_{t-1}}~\overrightarrow{X_{t-1}'}] \end{align*}\]where $\overrightarrow{X_{t-1}'}$ is the transpose of $\overrightarrow{X_{t-1}}$, i.e. each of the $m$ series delayed by 1 time step; the last line follows by multiplying on the right by $\overrightarrow{X_{t-1}'}$ and taking expectations (the noise term drops out).
\[\gamma_1~(m \times m \text{ matrix}) = \phi_{m \times m}~\gamma_{0_{m \times m}}\]
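Since $\gamma_1 = \phi\,\gamma_0$, the coefficient matrix of a VAR(1) can be recovered as $\gamma_1 \gamma_0^{-1}$; a simulation sketch with an assumed $2 \times 2$ $\phi$:

```python
import numpy as np

# estimate phi for a VAR(1) from sample cross-covariance matrices:
# Gamma_1 = Phi @ Gamma_0  =>  Phi = Gamma_1 @ inv(Gamma_0)
rng = np.random.default_rng(2)
Phi_true = np.array([[0.5, 0.2],
                     [0.1, 0.4]])   # stable: eigenvalues 0.6 and 0.3
n, m = 100_000, 2
X = np.zeros((n, m))
for t in range(1, n):
    X[t] = Phi_true @ X[t - 1] + rng.normal(0.0, 0.1, size=m)

X = X - X.mean(axis=0)
Gamma0 = X[:-1].T @ X[:-1] / (n - 1)   # E[X_{t-1} X_{t-1}']
Gamma1 = X[1:].T @ X[:-1] / (n - 1)    # E[X_t X_{t-1}']
Phi_hat = Gamma1 @ np.linalg.inv(Gamma0)
print(np.round(Phi_hat, 2))            # close to Phi_true
```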
To solve a VAR(3) process:
\[\overrightarrow{X_{t}} = \phi_1\overrightarrow{X_{t-1}} + \phi_2\overrightarrow{X_{t-2}}+ \phi_3\overrightarrow{X_{t-3}}+ \overrightarrow{E_t}\]obtain the equations
\[\begin{align*} \gamma_1 &= \phi_1 \gamma_0 \\ \gamma_2 &= \phi_1 \gamma_1 + \phi_2 \gamma_0 \\ \gamma_3 &= \phi_1 \gamma_2 +\phi_2 \gamma_1 + \phi_3 \gamma_0 \end{align*}\]Note that each of the $\gamma$'s and $\phi$'s here is an $(m \times m)$ matrix, so we need to be careful when assembling these matrices into a bigger matrix in order to solve the system quickly:
\[\begin{bmatrix}\gamma_1 & \gamma_2 & \gamma_3\end{bmatrix} = \begin{bmatrix} \phi_1 & \phi_2 & \phi_3 \end{bmatrix} \begin{bmatrix} \gamma_0 & \gamma_1 & \gamma_2 \\ 0 & \gamma_0 & \gamma_1 \\ 0 & 0 & \gamma_0 \end{bmatrix}\]
AIC for VAR
\[\Sigma(k) = \mathbb E[\overrightarrow{E_t}~\overrightarrow{E_t'}]\]which plays the role of $\sigma_\epsilon^2$ in the simple $AR$ case. This is an $(m \times m)$ matrix found using $\eqref{e2}$, the only difference being that the $\gamma$'s and $\phi$'s are now $(m \times m)$ matrices as well.
Then,
\[AIC(k) = \ln (\det{(\Sigma(k))}) + \frac{2km^2}{n}\]where $k$ is the lag order, $m$ is the number of series in the vector, and $n$ is the number of data points in each series. When the noise terms are independent, $\Sigma(k)$ is diagonal, so $\det(\Sigma(k))$ is simply the product of its diagonal entries.
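In code, the log-determinant is best computed with `np.linalg.slogdet` for numerical stability; a minimal sketch (function name is my own):

```python
import numpy as np

def var_aic(Sigma, k, m, n):
    """AIC(k) = ln det(Sigma(k)) + 2*k*m^2 / n for a VAR(k) on m series."""
    _, logdet = np.linalg.slogdet(Sigma)   # stable ln(det(...))
    return logdet + 2 * k * m**2 / n
```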
Granger Causality
Sometimes we may not be interested in how the values of just a single time series depend upon their past (lagged) values; instead we may be interested in multiple time series and their interactions (e.g. property rates in neighbouring counties).
Consider, for example, goods exported by two cities $R$ and $P$ over a period of time. Suppose $R$ is a rich city whose exports in year $t$ are $r_t$, and $P$ is a poor city whose exports in year $t$ are $p_t$.
Granger Causality gives an idea of whether one series helps predict another.
Steps involved:
- Best AR model for $p_t$:
\(p_t = \phi_1p_{t-1} + \phi_3p_{t-3} + \epsilon_{t,1}\)
- Best VAR model for $p_t$ (with $r_t$ terms):
\(p_t = \phi_1p_{t-1} + \phi_3p_{t-3} + \psi_1r_{t-1} + \psi_5r_{t-5} + \epsilon_{t,2}\)
- Conclude based on which model is better.
In general, construct two models of $X(t)$ as follows:
- using VAR:
\(X(t) = \sum_{\zeta = 1}^\infty a_\zeta X(t-\zeta) + \sum_{\zeta = 1}^\infty c_\zeta Y(t-\zeta) + \epsilon_c\)
- using simple AR:
\(X(t) = \sum_{\zeta = 1}^\infty b_\zeta X(t-\zeta) + \epsilon\)
Then, the Granger F-statistic is defined as:
\[F_{Y \to X} = \ln \frac{var(\epsilon)}{var(\epsilon_c)}\]If $Y$ causes $X$, then, $F_{Y \to X} > 0$, otherwise, $F_{Y \to X} = 0$.
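A sketch of computing $F_{Y \to X}$ by fitting both models with least squares, using $p$ lags of each series (the function layout is my own):

```python
import numpy as np

def granger_f(x, y, p):
    """F_{Y->X} = ln(var(eps) / var(eps_c)) using p lags of each series."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    lags_x = np.column_stack([x[p - j - 1:n - j - 1] for j in range(p)])
    lags_y = np.column_stack([y[p - j - 1:n - j - 1] for j in range(p)])
    target = x[p:]

    def resid_var(design):
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        return np.mean((target - design @ coef) ** 2)

    var_plain = resid_var(lags_x)                         # AR: x's own past only
    var_coupled = resid_var(np.hstack([lags_x, lags_y]))  # VAR: x's and y's past
    return np.log(var_plain / var_coupled)
```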
To estimate the causality from $Y$ to $X$ when there are other variables in the system, construct VAR models of $X$ along with the other variables, including $Y$ in one case and excluding it in the other. Then check which model is better by computing the Granger F-statistic.
Exercise
Simulate coupled AR processes defined as
\[X(t) = aX(t-1) + cY(t-1) + \epsilon_{X,t} \\ Y(t) = bY(t-1)+ \epsilon_{Y, t}\]where $a=0.9$, $b=0.8$, $c=0.8$ and the length of each series is 1000. Simulate the noise terms as $\epsilon_X, \epsilon_Y = \eta\nu$, where $\eta$ is drawn from the standard normal distribution and $\nu = 0.01$ is the intensity of the noise. Estimate the AR coefficients of the simulated signal $X$, assuming two models of $X$:
- $X(t)$ depends only on past $X$ values
- $X(t)$ depends on past $X$ and $Y$ values
Assume the maximum lag $k$ in both cases to be 10. By estimating the variance of the error terms for each $k$, estimate $AIC(k)$ for both models. Select models 1 and 2 with the optimal $AIC$. Using these models, compute the Granger causality F-statistic from $Y$ to $X$.
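One possible solution sketch, using plain NumPy least squares to fit both models (the $2k$ parameter count in the AIC penalty for the joint model is my own choice):

```python
import numpy as np

# simulate the coupled processes
rng = np.random.default_rng(42)
a, b, c, nu, n = 0.9, 0.8, 0.8, 0.01, 1000
X, Y = np.zeros(n), np.zeros(n)
for t in range(1, n):
    Y[t] = b * Y[t - 1] + nu * rng.standard_normal()
    X[t] = a * X[t - 1] + c * Y[t - 1] + nu * rng.standard_normal()

def resid_var(target, cols):
    """Least-squares fit of target on the given regressor columns."""
    D = np.column_stack(cols)
    coef, *_ = np.linalg.lstsq(D, target, rcond=None)
    return np.mean((target - D @ coef) ** 2)

# AIC(k) for both models, k = 1..10
aic1, aic2 = {}, {}
for k in range(1, 11):
    lx = [X[k - j - 1:n - j - 1] for j in range(k)]
    ly = [Y[k - j - 1:n - j - 1] for j in range(k)]
    aic1[k] = np.log(resid_var(X[k:], lx)) + 2 * k / n
    aic2[k] = np.log(resid_var(X[k:], lx + ly)) + 2 * (2 * k) / n  # 2k coefficients

k1 = min(aic1, key=aic1.get)   # optimal lag, model 1 (X's past only)
k2 = min(aic2, key=aic2.get)   # optimal lag, model 2 (X's and Y's past)
v1 = resid_var(X[k1:], [X[k1 - j - 1:n - j - 1] for j in range(k1)])
v2 = resid_var(X[k2:], [X[k2 - j - 1:n - j - 1] for j in range(k2)]
                     + [Y[k2 - j - 1:n - j - 1] for j in range(k2)])
print("optimal lags:", k1, k2)
print("F_{Y->X} =", np.log(v1 / v2))   # > 0: Y Granger-causes X
```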