12 May 2020 - Abhishek Nandekar
Auto-Regressive $(AR)$ Processes
An Auto-Regressive process/model is a stochastic process that is widely used in time-series analysis.
- regressive $\implies$ like regression
- auto $\implies$ predict something based on its own past values.
Hence, the output variable depends linearly on its own past values and on a stochastic term (an imperfectly predictable term).
For example, stock prices or temperature (on a particular day) are usually based on values they took in the past.
As another example, consider a milk distributor who wants to know how much milk they should load on the supply truck for the upcoming month, based upon the selling trend of the previous months.
Naturally, finding optimal lags for the model is essential. Optimal lags:
- should avoid overfitting;
- can be estimated using the partial autocorrelation function (PACF), which measures the direct effect of a particular past state on the current state.
For the above example, a good model might look like:
\(m_t = \beta_0+\beta_1m_{t-1}+\beta_4m_{t-4}+\beta_{12}m_{t-12}+\epsilon_t\)
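As a quick illustration of picking such lags, here is a minimal sketch of inspecting the PACF with `statsmodels`; the monthly sales array is a hypothetical stand-in for real data.

```python
import numpy as np
from statsmodels.tsa.stattools import pacf

# hypothetical stand-in for monthly milk sales; use real data in practice
rng = np.random.default_rng(0)
sales = rng.normal(100, 10, size=120)  # 10 years of monthly values

# partial autocorrelation at lag k: the *direct* effect of the state
# k months ago on the current value
for k, v in enumerate(pacf(sales, nlags=12)):
    print(f"lag {k:2d}: pacf = {v:+.3f}")
```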
$AR(p)$ process
Here, $p$ is the order of the $AR$ process.
\(\begin{align*} X_t &= \mu + \phi_1X_{t-1}+ \phi_2X_{t-2}+ \phi_3X_{t-3} + \dots + \phi_pX_{t-p} + \epsilon_t \\ \implies X_t &= \mu + \sum_{i=1}^p \phi_iX_{t-i} + \epsilon_t \end{align*}\) where $\epsilon_t$ is white noise: $\mathbb E[\epsilon_t] = 0$, $\mathbb E[\epsilon_t^2] = \sigma^2_{\epsilon}$, and $\mathbb E[\epsilon_t\epsilon_s] = 0 ~~ \forall~ t \neq s$.
Not all values of the $\phi$'s are permitted in an $AR$ model. For an $AR$ series not to diverge, some constraints are required on the $\phi$ values. For that, the model is assumed to be wide-sense stationary.
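To see the divergence issue concretely, here is a small simulation sketch in plain NumPy, with $\phi$ values chosen only for illustration: one set keeps the series bounded, the other makes it explode.

```python
import numpy as np

def simulate_ar(phis, mu=0.0, sigma_eps=1.0, n=1000, seed=0):
    """Simulate X_t = mu + sum_i phi_i * X_{t-i} + eps_t with Gaussian white noise."""
    rng = np.random.default_rng(seed)
    p = len(phis)
    x = np.zeros(n + p)                 # first p entries are zero start-up values
    for t in range(p, n + p):
        x[t] = mu + sum(ph * x[t - i - 1] for i, ph in enumerate(phis)) \
               + rng.normal(0.0, sigma_eps)
    return x[p:]

x_ok = simulate_ar([0.5, 0.3])    # inside the stationary region: stays bounded
x_bad = simulate_ar([0.9, 0.3])   # outside the stationary region: diverges
print(np.abs(x_ok).max(), np.abs(x_bad).max())
```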
Stationarity
A desirable property of the series for which the forecasting is being done.
Types of stationarity:
- Weak/Wide-Sense Stationarity (WSS), or Covariance Stationarity
  - The mean, variance and autocovariance remain the same over time:
  \(\mu_X(t) = \mu_X \\ \sigma^2_X(t) = \sigma^2_X \\ c_X(t_1, t_2) = c_X(t_1 - t_2, 0)\)
- Strong/Strict-Sense Stationarity (SSS)
  - For a stochastic process $\{X_t\}$, if all $N$-point joint CDFs are invariant to arbitrary time shifts $\zeta$, the process is SSS:
  \(F_X(x_{t_1 + \zeta},~x_{t_2 + \zeta},\dots, x_{t_N + \zeta} ) = F_X(x_{t_1},~x_{t_2},\dots,x_{t_N})\)
Note: the Covariance Stationarity assumption is required to solve the AR model equations for the $\phi$'s.
Thus, for a process $\{Y_t\}$,
\(\mathbb E[Y_t] = \mu ~~ \forall~t \\ \mathbb E[(Y_t-\mu)(Y_{t-k} - \mu)] = \gamma_k~~\forall~t, k\)
$\implies$ the covariance depends only on $k$. When $k=0$, we get the variance: $\gamma_0 = \sigma_Y^2$.
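A minimal sketch of estimating $\gamma_k$ from data (the biased sample autocovariance, assuming a covariance-stationary series):

```python
import numpy as np

def autocovariance(y, k):
    """Biased sample estimate of gamma_k = E[(Y_t - mu)(Y_{t-k} - mu)]."""
    y = np.asarray(y, dtype=float)
    mu, n = y.mean(), len(y)
    return np.sum((y[k:] - mu) * (y[: n - k] - mu)) / n

# gamma_0 is then just the (biased) sample variance:
# autocovariance(y, 0) == np.var(y)
```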
$AR(1)$ process
\[\begin{align*} Y_t &= \phi_0 + \phi_1Y_{t-1} + \epsilon_t \\ \mathbb E[Y_t] &= \phi_0 + \phi_1\mathbb E[Y_{t-1}] + 0 \\ \mu &= \phi_0 + \phi_1\mu \\ \mu(1 - \phi_1) &= \phi_0 \\ \mu &= \frac{\phi_0}{1-\phi_1} \end{align*}\]If the series is centred about its mean, $\phi_0 = 0$, i.e. there is no constant term. Taking a series $Y_t$ and centring it about its mean:
\[\begin{align*} (Y_t - \mu) &= \phi_1(Y_{t-1} - \mu) + \epsilon_t \\ \mathbb E[(Y_{t} - \mu)(Y_{t-k} - \mu)] &= \phi_1 \mathbb E[(Y_{t-1} - \mu)(Y_{t-k} - \mu)] +\underbrace{\mathbb E[\epsilon_t(Y_{t-k} - \mu)]}_{=~0} \\ \implies &\fbox{ $\gamma_k = \phi_1\gamma_{k-1}\\ \gamma_1 = \phi_1\gamma_0 $} \\ \mathbb E[Y_tY_t] &= \mathbb E[(\phi_1Y_{t-1} + \epsilon_t)^2] \\ &= \mathbb E[\phi_1^2Y_{t-1}Y_{t-1} + \epsilon_t^2 + 2\phi_1 Y_{t-1}\epsilon_t] \\ \sigma_Y^2 &= \phi_1^2\sigma_Y^2 + \sigma_\epsilon^2 + 0 \\ &\fbox{$\sigma_Y^2 = \frac{\sigma_\epsilon^2}{1 - \phi_1^2} = \gamma_0$ (variance)} \end{align*}\]
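Both boxed results are easy to check numerically; a small sketch with an assumed $\phi_1 = 0.7$:

```python
import numpy as np

# check gamma_1 = phi_1 * gamma_0  and  gamma_0 = sigma_eps^2 / (1 - phi_1^2)
rng = np.random.default_rng(1)
phi1, sigma_eps, n = 0.7, 1.0, 200_000
y = np.zeros(n)
for t in range(1, n):
    y[t] = phi1 * y[t - 1] + rng.normal(0.0, sigma_eps)

y -= y.mean()
g0 = (y * y).mean()            # sample gamma_0 (variance)
g1 = (y[1:] * y[:-1]).mean()   # sample gamma_1

print(g1 / g0)                            # ~ 0.7  (= phi_1)
print(g0, sigma_eps**2 / (1 - phi1**2))   # both ~ 1.96
```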
$AR(p)$ process
(Assuming that the series is centred about its mean.)
\[Y_t = \phi_1Y_{t-1} + \phi_2Y_{t-2} + \dots + \phi_pY_{t-p} + \epsilon_t\]Multiplying both sides by $Y_{t-k}$ ($k \geq 1$) and taking $\mathbb E$: \[\mathbb E[Y_{t}Y_{t-k}] = \phi_1\mathbb E[Y_{t-1}Y_{t-k}] + \dots + \phi_p\mathbb E[Y_{t-p}Y_{t-k}] + \underbrace{\mathbb E[Y_{t-k}\epsilon_t]}_{=~0}\] \[\begin{equation*} \tag{$\ast$} \gamma_k = \phi_1\gamma_{k-1} + \phi_2\gamma_{k-2}+\ldots + \phi_p\gamma_{k-p} \label{e1} \end{equation*}\]Multiplying on both sides by $Y_t$ instead and taking $\mathbb E~\rightarrow$
\[\begin{equation*} \tag{$\star$} \gamma_0 = \phi_1\gamma_{1} + \phi_2\gamma_{2}+\ldots + \phi_p\gamma_{p} + \sigma_\epsilon^2 \label{e2} \end{equation*}\]For $\eqref{e1}$, suppose we take $p=3$ and write it out for $k = 3, 2, 1$:
\[\begin{align*} \gamma_3 &= \phi_1\gamma_{2} + \phi_2\gamma_{1}+ \phi_3\gamma_{0} \\ \gamma_2 &= \phi_1\gamma_{1} + \phi_2\gamma_{0}+ \underbrace{\phi_3\gamma_{-1}}_{=~0} \\ \gamma_1 &= \phi_1\gamma_{0} + \underbrace{\phi_2\gamma_{-1}}_{=~0}+ \underbrace{\phi_3\gamma_{-2}}_{=~0} \end{align*}\]The negative-lag terms in the expansions of $\gamma_2$ and $\gamma_1$ are taken to be $0$, as any time value is uncorrelated with its future. Hence,
\[\begin{bmatrix}\gamma_0& 0 & 0 \\ \gamma_1 & \gamma_0 & 0 \\ \gamma_2 & \gamma_1 & \gamma_0 \end{bmatrix} \begin{bmatrix}\phi_1 \\ \phi_2 \\ \phi_3 \end{bmatrix} = \begin{bmatrix} \gamma_1 \\ \gamma_2 \\ \gamma_3 \end{bmatrix}\]These are known as the Yule-Walker equations.
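A sketch of solving this triangular system with NumPy, given estimated autocovariances (the function name and layout are my own):

```python
import numpy as np

def yule_walker_phis(gammas):
    """Solve the triangular system above for phi_1..phi_p, given
    autocovariances gammas = [gamma_0, gamma_1, ..., gamma_p]."""
    p = len(gammas) - 1
    G = np.zeros((p, p))
    for k in range(p):              # row k encodes the equation for gamma_{k+1}
        for j in range(k + 1):
            G[k, j] = gammas[k - j]
    return np.linalg.solve(G, np.asarray(gammas[1:], dtype=float))
```

For example, `yule_walker_phis([autocovariance(y, k) for k in range(4)])` recovers $\phi_1, \phi_2, \phi_3$ under this convention, using `autocovariance` from the earlier sketch.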
Akaike Information Criterion (AIC)
Provides a means for model selection; it is used to choose the order of the $AR$ process.
\[AIC(k) = \ln\hat{\sigma}_{\epsilon, k}^2 + \frac{2k}{n}\]where $k$ is the assumed lag and $n$ is the total number of data points. The variance of the error term, as found from $\eqref{e2}$, is
\[\hat{\sigma}_\epsilon^2 = \frac{1}{n} \sum_{t=1}^n \hat{\epsilon}_t^2\]
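A sketch of scanning $AIC(k)$ over candidate lags; here the $\phi$'s are estimated by ordinary least squares on lagged values rather than via the Yule-Walker equations, which is just one of several possible choices:

```python
import numpy as np

def aic_ar(y, k_max=10):
    """AIC(k) = ln(sigma_hat^2) + 2k/n for AR(k) fits, k = 1..k_max."""
    y = np.asarray(y, dtype=float)
    y = y - y.mean()                  # centre the series
    n = len(y)
    aics = {}
    for k in range(1, k_max + 1):
        # column j of the design matrix holds the values lagged by j+1
        X = np.column_stack([y[k - j - 1:n - j - 1] for j in range(k)])
        target = y[k:]
        phi, *_ = np.linalg.lstsq(X, target, rcond=None)
        sigma2 = np.mean((target - X @ phi) ** 2)
        aics[k] = np.log(sigma2) + 2 * k / n
    return aics

# usage: aics = aic_ar(y); best_k = min(aics, key=aics.get)
```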
Vector Autoregressive Process (VAR)
A VAR(1) process:
\[\begin{align*} \begin{bmatrix}X_t^{(1)} \\ X_t^{(2)} \\ \vdots \\ X_t^{(m)}\end{bmatrix} &= \begin{bmatrix} \phi_{11} & \phi_{12} & \dots & \phi_{1m} \\ \phi_{21} & \phi_{22} & \dots & \phi_{2m} \\ \vdots & & \ddots & \vdots \\ \phi_{m1} & \phi_{m2} & \dots & \phi_{mm} \end{bmatrix} \begin{bmatrix} X_{t-1}^{(1)} \\ X_{t-1}^{(2)} \\ \vdots \\ X_{t-1}^{(m)} \end{bmatrix} + \overrightarrow{\textbf{E}_t} \\ \overrightarrow{X_t}&= \phi_{1_{m \times m}}\overrightarrow{X_{t-1}} + \overrightarrow{\textbf{E}_t} \\ \mathbb E[\overrightarrow{X_{t}}~\overrightarrow{X_{t-1}'}] &= \phi_{1_{m \times m}}\mathbb E[\overrightarrow{X_{t-1}}~\overrightarrow{X_{t-1}'}] \end{align*}\]where $\overrightarrow{X_{t-1}'}$ is the transpose of $\overrightarrow{X_{t-1}}$, i.e. each of the $m$ series delayed by 1 time step; the last line follows by multiplying on the right by $\overrightarrow{X_{t-1}'}$ and taking expectations (the noise term drops out).
\[\gamma_1~(m \times m \text{ matrix}) = \phi_{m \times m}~\gamma_{0_{m \times m}}\]
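Since $\gamma_1 = \phi\,\gamma_0$, the coefficient matrix of a VAR(1) can be recovered as $\gamma_1 \gamma_0^{-1}$; a simulation sketch with an assumed $2 \times 2$ $\phi$:

```python
import numpy as np

# estimate phi for a VAR(1) from sample cross-covariance matrices:
# Gamma_1 = Phi @ Gamma_0  =>  Phi = Gamma_1 @ inv(Gamma_0)
rng = np.random.default_rng(2)
Phi_true = np.array([[0.5, 0.2],
                     [0.1, 0.4]])   # stable: eigenvalues 0.6 and 0.3
n, m = 100_000, 2
X = np.zeros((n, m))
for t in range(1, n):
    X[t] = Phi_true @ X[t - 1] + rng.normal(0.0, 0.1, size=m)

X = X - X.mean(axis=0)
Gamma0 = X[:-1].T @ X[:-1] / (n - 1)   # E[X_{t-1} X_{t-1}']
Gamma1 = X[1:].T @ X[:-1] / (n - 1)    # E[X_t X_{t-1}']
Phi_hat = Gamma1 @ np.linalg.inv(Gamma0)
print(np.round(Phi_hat, 2))            # close to Phi_true
```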
To solve a VAR(3) process:
\[\overrightarrow{X_{t}} = \phi_1\overrightarrow{X_{t-1}} + \phi_2\overrightarrow{X_{t-2}}+ \phi_3\overrightarrow{X_{t-3}}+ \overrightarrow{E_t}\]obtain the equations
\[\begin{align*} \gamma_1 &= \phi_1 \gamma_0 \\ \gamma_2 &= \phi_1 \gamma_1 + \phi_2 \gamma_0 \\ \gamma_3 &= \phi_1 \gamma_2 +\phi_2 \gamma_1 + \phi_3 \gamma_0 \end{align*}\]Note that each of the $\gamma$'s and $\phi$'s here is an $(m \times m)$ matrix, so we need to be careful when assembling these matrices into a bigger matrix in order to solve the system quickly:
\[\begin{bmatrix}\gamma_1 & \gamma_2 & \gamma_3\end{bmatrix} = \begin{bmatrix} \phi_1 & \phi_2 & \phi_3 \end{bmatrix} \begin{bmatrix} \gamma_0 & \gamma_1 & \gamma_2 \\ 0 & \gamma_0 & \gamma_1 \\ 0 & 0 & \gamma_0 \end{bmatrix}\]
AIC for VAR
\[\Sigma(k) = \mathbb E[\overrightarrow{E_t}~\overrightarrow{E_t'}]\]which plays the role of $\sigma_\epsilon^2$ in the simple $AR$ case. This is an $(m \times m)$ matrix found using $\eqref{e2}$, the only difference being that the $\gamma$'s and $\phi$'s are now $(m \times m)$ matrices as well.
Then,
\[AIC(k) = \ln (\det{(\Sigma(k))}) + \frac{2km^2}{n}\]where $k$ is the lag order, $m$ is the number of series in the vector, and $n$ is the number of data points in each series. When the noise terms are independent, $\Sigma(k)$ is diagonal, so $\det(\Sigma(k))$ is simply the product of its diagonal entries.
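In code, the log-determinant is best computed with `np.linalg.slogdet` for numerical stability; a minimal sketch (function name is my own):

```python
import numpy as np

def var_aic(Sigma, k, m, n):
    """AIC(k) = ln det(Sigma(k)) + 2*k*m^2 / n for a VAR(k) on m series."""
    _, logdet = np.linalg.slogdet(Sigma)   # stable ln(det(...))
    return logdet + 2 * k * m**2 / n
```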
Granger Causality
Sometimes we may not be interested in how the values of just a single time series depend upon their past (lagged) values; instead we may be interested in multiple time series and their interactions (e.g. property rates in neighbouring counties).
Consider, for example, goods exported by two cities $R$ and $P$ over a period of time. Suppose $R$ is a rich city whose exports in year $t$ are $r_t$, and $P$ is a poor city whose exports in year $t$ are $p_t$.
Granger Causality gives an idea of whether one series helps predict another.
Steps involved:
- Best AR model for $p_t$:
\(p_t = \phi_1p_{t-1} + \phi_3p_{t-3} + \epsilon_{t,1}\)
- Best VAR model for $p_t$ (with $r_t$ terms):
\(p_t = \phi_1p_{t-1} + \phi_3p_{t-3} + \psi_1r_{t-1} + \psi_5r_{t-5} + \epsilon_{t,2}\)
- Conclude based on which model is better.
In general, construct two models of $X(t)$ as follows:
- using VAR:
\(X(t) = \sum_{\zeta = 1}^\infty a_\zeta X(t-\zeta) + \sum_{\zeta = 1}^\infty c_\zeta Y(t-\zeta) + \epsilon_c\)
- using simple AR:
\(X(t) = \sum_{\zeta = 1}^\infty b_\zeta X(t-\zeta) + \epsilon\)
Then, the Granger F-statistic is defined as:
\[F_{Y \to X} = \ln \frac{var(\epsilon)}{var(\epsilon_c)}\]If $Y$ causes $X$, then, $F_{Y \to X} > 0$, otherwise, $F_{Y \to X} = 0$.
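A sketch of computing $F_{Y \to X}$ by fitting both models with least squares, using $p$ lags of each series (the function layout is my own):

```python
import numpy as np

def granger_f(x, y, p):
    """F_{Y->X} = ln(var(eps) / var(eps_c)) using p lags of each series."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    lags_x = np.column_stack([x[p - j - 1:n - j - 1] for j in range(p)])
    lags_y = np.column_stack([y[p - j - 1:n - j - 1] for j in range(p)])
    target = x[p:]

    def resid_var(design):
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        return np.mean((target - design @ coef) ** 2)

    var_plain = resid_var(lags_x)                         # AR: x's own past only
    var_coupled = resid_var(np.hstack([lags_x, lags_y]))  # VAR: x's and y's past
    return np.log(var_plain / var_coupled)
```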
To estimate the causality from $Y$ to $X$ when there are other variables in the system, construct VAR models of $X$ along with the other variables, including $Y$ in one case and excluding it in the other. Then check which model is better by computing the Granger F-statistic.
Exercise
Simulate coupled AR processes defined as
\[X(t) = aX(t-1) + cY(t-1) + \epsilon_{X,t} \\ Y(t) = bY(t-1)+ \epsilon_{Y, t}\]where $a=0.9$, $b=0.8$, $c=0.8$ and the length of each series is 1000. Simulate the noise terms as $\epsilon_X, \epsilon_Y = \eta\nu$, where $\eta$ is drawn from the standard normal distribution and $\nu = 0.01$ is the intensity of the noise. Estimate the AR coefficients of the simulated signal $X$, assuming two models of $X$:
- $X(t)$ depends only on past $X$ values
- $X(t)$ depends on past $X$ and $Y$ values
Assume the maximum lag $k$ in both cases to be 10. By estimating the variance of the error terms for each $k$, estimate $AIC(k)$ for both models. Select models 1 and 2 with the optimal $AIC$. Using these models, compute the Granger causality F-statistic from $Y$ to $X$.
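One possible solution sketch, using plain NumPy least squares to fit both models (the $2k$ parameter count in the AIC penalty for the joint model is my own choice):

```python
import numpy as np

# simulate the coupled processes
rng = np.random.default_rng(42)
a, b, c, nu, n = 0.9, 0.8, 0.8, 0.01, 1000
X, Y = np.zeros(n), np.zeros(n)
for t in range(1, n):
    Y[t] = b * Y[t - 1] + nu * rng.standard_normal()
    X[t] = a * X[t - 1] + c * Y[t - 1] + nu * rng.standard_normal()

def resid_var(target, cols):
    """Least-squares fit of target on the given regressor columns."""
    D = np.column_stack(cols)
    coef, *_ = np.linalg.lstsq(D, target, rcond=None)
    return np.mean((target - D @ coef) ** 2)

# AIC(k) for both models, k = 1..10
aic1, aic2 = {}, {}
for k in range(1, 11):
    lx = [X[k - j - 1:n - j - 1] for j in range(k)]
    ly = [Y[k - j - 1:n - j - 1] for j in range(k)]
    aic1[k] = np.log(resid_var(X[k:], lx)) + 2 * k / n
    aic2[k] = np.log(resid_var(X[k:], lx + ly)) + 2 * (2 * k) / n  # 2k coefficients

k1 = min(aic1, key=aic1.get)   # optimal lag, model 1 (X's past only)
k2 = min(aic2, key=aic2.get)   # optimal lag, model 2 (X's and Y's past)
v1 = resid_var(X[k1:], [X[k1 - j - 1:n - j - 1] for j in range(k1)])
v2 = resid_var(X[k2:], [X[k2 - j - 1:n - j - 1] for j in range(k2)]
                     + [Y[k2 - j - 1:n - j - 1] for j in range(k2)])
print("optimal lags:", k1, k2)
print("F_{Y->X} =", np.log(v1 / v2))   # > 0: Y Granger-causes X
```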