
18 May 2020 - Abhishek Nandekar

Relative Entropy (KL Divergence) and Mutual Information

Shannon Entropy:

\[H(I) = -\sum_i p(i) \log p(i)\]

where the $i$’s denote the possible states. The Shannon Entropy is the average number of bits needed to optimally encode independent draws of a discrete random variable.
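
As a quick numerical illustration, here is a minimal sketch of the formula above, assuming NumPy and a made-up four-state distribution:

```python
import numpy as np

# Toy distribution over 4 states (made-up numbers, just for illustration)
p = np.array([0.5, 0.25, 0.125, 0.125])

# Shannon entropy in bits: H = -sum_i p(i) log2 p(i)
H = -np.sum(p * np.log2(p))
print(H)  # 1.75 -> an optimal code needs 1.75 bits per draw on average
```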

To construct the optimal encoding, we need to know the $p(i)$’s. The excess number of bits used when a distribution $q(i)$ is assumed instead of the true $p(i)$ is given by the KL Divergence:

\[\tag{1}\label{eq_KL} KL(I) = \sum_i p(i) \log\frac{p(i)}{q(i)}\]

The KL Divergence can also be seen as a measure of the distance between two distributions (although it is not symmetric, so it is not a true metric).
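
A small sketch of Eq. \eqref{eq_KL}, again with made-up `p` and `q`, showing that the KL divergence is exactly the gap between the average code length under the wrong code (built for `q`) and the entropy of `p`:

```python
import numpy as np

p = np.array([0.5, 0.25, 0.125, 0.125])   # true distribution (toy values)
q = np.array([0.25, 0.25, 0.25, 0.25])    # assumed distribution (toy values)

# KL(p || q) = sum_i p(i) log2( p(i)/q(i) )
kl = np.sum(p * np.log2(p / q))

# Cross-entropy: average code length if the code is built for q but data come from p
cross_entropy = -np.sum(p * np.log2(q))
entropy = -np.sum(p * np.log2(p))

print(kl)                       # 0.25 bits
print(cross_entropy - entropy)  # the same 0.25 bits of excess code
```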

KL Divergence can also be defined for conditional probabilities.

Conditioning $I$ on a single state $j = j_1$ of $J$:

\[KL(I|J=j_1) = \sum_i p(i|j_1)\log\frac{p(i|j_1)}{q(i|j_1)}\]

Summing over all the states from $J$:

\[\begin{aligned} KL(I|J) &= \sum_j p(j) \sum_i p(i|j) \log\frac{p(i|j)}{q(i|j)} \\ &= \sum_{i, j} p(i, j) \log\frac{p(i|j)}{q(i|j)} \end{aligned}\]
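
A sketch of this conditional KL divergence, using a made-up joint distribution `p_ij` and a made-up model `q_i_given_j` of the conditionals (NumPy assumed):

```python
import numpy as np

# Toy joint distribution p(i, j): rows index i (2 states), columns index j (2 states)
p_ij = np.array([[0.3, 0.1],
                 [0.2, 0.4]])

p_j = p_ij.sum(axis=0)        # marginal p(j)
p_i_given_j = p_ij / p_j      # conditional p(i|j), each column sums to 1

# Made-up model q(i|j) (each column sums to 1)
q_i_given_j = np.array([[0.5, 0.5],
                        [0.5, 0.5]])

# KL(I|J) = sum_{i,j} p(i,j) log2( p(i|j) / q(i|j) )
kl_cond = np.sum(p_ij * np.log2(p_i_given_j / q_i_given_j))
print(kl_cond)   # non-negative excess code length
```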

Mutual Information

Let $I$ and $J$ have joint distribution $p(i, j)$ and marginal distributions $p(i)$ and $p(j)$ respectively. Then,

Mutual Information between $I$ and $J$, $M(I; J)$:
the relative entropy between the joint distribution and the product of the marginal distributions.
\[\tag{2}\label{eq_MI} M(I; J) = \sum_{i, j} p(i, j) \log\frac{p(i, j)}{p(i)~\cdot~p(j)}\]

It can be seen as the excess code length produced by erroneously assuming that the two systems are independent, i.e. using $q(i, j) = p(i)~\cdot~p(j)$ instead of $p(i, j)$.
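
A sketch of Eq. \eqref{eq_MI}, using the same made-up joint distribution as above: MI is the KL divergence between `p(i, j)` and the product of the marginals.

```python
import numpy as np

# Toy joint distribution p(i, j)
p_ij = np.array([[0.3, 0.1],
                 [0.2, 0.4]])

p_i = p_ij.sum(axis=1, keepdims=True)   # marginal p(i), column vector
p_j = p_ij.sum(axis=0, keepdims=True)   # marginal p(j), row vector

# M(I; J) = sum_{i,j} p(i,j) log2( p(i,j) / (p(i) p(j)) )
mi = np.sum(p_ij * np.log2(p_ij / (p_i * p_j)))
print(mi)   # > 0, so I and J are not independent
```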

Check Lecture 4

Transfer Entropy

Transfer Entropy was first proposed by T. Schreiber in the year 2000, in his paper Measuring Information Transfer.

As seen before, the Mutual Information quantifies information overlap. However, it does not contain any dynamical or directional information. To give the Mutual Information a directional sense, introduce a time lag $\zeta$ between the states of $I$ and $J$:

\[M(I; J, \zeta) = \sum_{i_n, j_{n-\zeta}} p(i_n, j_{n-\zeta}) \log \frac{p(i_n, j_{n-\zeta})}{p(i_n)~\cdot~p(j_{n-\zeta})}\]
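
A sketch of this time-lagged Mutual Information, estimated with plug-in (histogram) counts from two made-up binary series (NumPy assumed; the helper name `lagged_mi` is mine). Here `y` is a noisy one-step-delayed copy of `x`:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up coupled binary series: y copies x with a one-step delay, ~10% of bits flipped
x = rng.integers(0, 2, size=10_000)
y = np.roll(x, 1) ^ (rng.random(10_000) < 0.1)

def lagged_mi(i_series, j_series, zeta):
    """Plug-in estimate of M(I; J, zeta) from the pairs (i_n, j_{n-zeta})."""
    i_n = i_series[zeta:]
    j_lag = j_series[:-zeta] if zeta > 0 else j_series
    # Joint histogram of (i_n, j_{n-zeta}), normalised to probabilities (binary data -> 2 bins)
    joint = np.histogram2d(i_n, j_lag, bins=2)[0]
    p_ij = joint / joint.sum()
    p_i = p_ij.sum(axis=1, keepdims=True)
    p_j = p_ij.sum(axis=0, keepdims=True)
    mask = p_ij > 0
    return np.sum(p_ij[mask] * np.log2(p_ij[mask] / (p_i * p_j)[mask]))

print(lagged_mi(y, x, zeta=1))   # large: x one step ago predicts y
print(lagged_mi(x, y, zeta=1))   # near 0: y one step ago says little about x
```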

Conditional Mutual Information is also used as a measure of Causality.

Transition probabilities allow us to consider the same system at different times, as well as different systems at different times. Using transition probabilities instead of static probabilities also gives the measure a dynamical and directional structure.

Assumption - The given process is a stationary Markov Process of order $k$.
\(\implies p(i_{n+1}\big|i_{n}, i_{n-1}, i_{n-2}, \ldots, i_1) = p(i_{n+1}\big|i_{n}, i_{n-1}, i_{n-2}, \ldots, i_{n+1-k})\)

Define $i_n^{(k)} := (i_{n}, i_{n-1}, i_{n-2}, \ldots, i_{n+1-k})$
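
This block construction is just a delay embedding of the series; a minimal sketch (NumPy assumed, the helper name `embed` is mine):

```python
import numpy as np

def embed(series, k):
    """Return the blocks i_n^{(k)} = (i_n, i_{n-1}, ..., i_{n+1-k}), one per row."""
    n = len(series)
    return np.stack([series[k - 1 - s : n - s] for s in range(k)], axis=1)

x = np.array([3, 1, 4, 1, 5, 9, 2, 6])
print(embed(x, 3))
# [[4 1 3]
#  [1 4 1]
#  [5 1 4]
#  [9 5 1]
#  [2 9 5]
#  [6 2 9]]
```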

Entropy Rate:

\[h{'}_{X} = \lim_{n \to \infty} \frac{1}{n} H(X_{t_1}, X_{t_2}, \ldots, X_{t_n})\]

This is the limit of the per-symbol entropy of $n$ random variables.

\[h_X = \lim_{n \to \infty} H(X_{t_n}\big| X_{t_{n-1}}, X_{t_{n-2}}, \ldots, X_{t_{1}})\]

For stationary stochastic processes, $h_X = h{'}_X$.

For a stationary Markov chain of order $k$:

\[\begin{aligned} h{'}_X = h_X &= \lim_{n \to \infty} H(X_{t_n}\big| X_{t_{n-1}}, X_{t_{n-2}}, \ldots, X_{t_{1}}) \\ &= \lim_{n \to \infty} H(X_{t_n}\big| X_{t_{n-1}}, X_{t_{n-2}}, \ldots, X_{t_{n-k}}) \\ &= H(X_{t_n}\big| X_{t_{n-1}}, X_{t_{n-2}}, \ldots, X_{t_{n-k}})\\ &\text{(for finite $n$ as well)} \end{aligned}\]

Going back to our system:

Since,

\[\begin{aligned} p(i_{n+1}|i_n^{(k)}) &= p(i_{n+1}^{(k+1)})/p(i_n^{(k)}) \\ \implies (h_{I}{)}_{1} &= H(I^{(k+1)}) - H(I^{(k)}) \\ &= H(i_{n+1}|i_n^{(k)}) \\ &= -\sum_{i_{n+1}, i_n^{(k)}} p(i_{n+1}, i_n^{(k)}) \log p(i_{n+1}|i_n^{(k)}) \end{aligned}\]
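
A sketch of how $(h_I)_1 = H(I^{(k+1)}) - H(I^{(k)})$ can be estimated by plug-in counts of length-$(k+1)$ and length-$k$ blocks, on a made-up binary order-1 Markov chain (NumPy assumed; `block_entropy` is my own helper name):

```python
import numpy as np
from collections import Counter

def block_entropy(series, k):
    """Plug-in estimate of H(I^{(k)}) from counts of length-k blocks."""
    blocks = Counter(tuple(series[n - k + 1 : n + 1]) for n in range(k - 1, len(series)))
    p = np.array(list(blocks.values()), dtype=float)
    p /= p.sum()
    return -np.sum(p * np.log2(p))

rng = np.random.default_rng(0)

# Made-up order-1 binary Markov chain: stay in the same state with probability 0.9
x = [0]
for _ in range(50_000):
    x.append(x[-1] if rng.random() < 0.9 else 1 - x[-1])
x = np.array(x)

k = 1
h1 = block_entropy(x, k + 1) - block_entropy(x, k)   # (h_I)_1 = H(I^{(k+1)}) - H(I^{(k)})
print(h1)                                            # ~0.469 bits, the binary entropy of 0.9
```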

When there is another process $J$ in the system, a second entropy rate is constructed by conditioning on the history of $J$ as well:

\[\begin{aligned} (h_I{)}_2 &= H(i_{n+1}|i_n^{(k)}, j_n^{(l)}) \\ &= -\sum_{i_{n+1}, i_n^{(k)}, j_n^{(l)}} p(i_{n+1}, i_n^{(k)}, j_n^{(l)}) \log p(i_{n+1}|i_n^{(k)}, j_n^{(l)}) \end{aligned}\]

The Transfer Entropy ($T_{J\to I}$) is the difference between these two entropy rates.

\[\begin{aligned} T_{J \to I} &= (h_I{)}_1 - (h_I{)}_2 \\ &= -\sum_{i_{n+1}, i_n^{(k)}} p(i_{n+1}, i_n^{(k)}) \log p(i_{n+1}|i_n^{(k)})\\ &+ \sum_{i_{n+1}, i_n^{(k)}, j_n^{(l)}} p(i_{n+1}, i_n^{(k)}, j_n^{(l)}) \log p(i_{n+1}|i_n^{(k)}, j_n^{(l)}) \end{aligned}\]

The first term can be written as:

\[\begin{aligned} &-\sum_{i_{n+1}, i_n^{(k)}} \log p(i_{n+1}|i_n^{(k)})\sum_{ j_n^{(l)}} p(i_{n+1}, i_n^{(k)}, j_n^{(l)}) \\ =& -\sum_{i_{n+1}, i_n^{(k)}, j_n^{(l)}} p(i_{n+1}, i_n^{(k)}, j_n^{(l)}) \log p(i_{n+1}|i_n^{(k)}) \\ \implies T_{J\to I} =& \sum_{i_{n+1}, i_n^{(k)}, j_n^{(l)}} p(i_{n+1}, i_n^{(k)}, j_n^{(l)}) \log p(i_{n+1}|i_n^{(k)}, j_n^{(l)}) \\ &-\sum_{i_{n+1}, i_n^{(k)}, j_n^{(l)}} p(i_{n+1}, i_n^{(k)}, j_n^{(l)}) \log p(i_{n+1}|i_n^{(k)}) \\ =& \sum_{i_{n+1}, i_n^{(k)}, j_n^{(l)}} p(i_{n+1}, i_n^{(k)}, j_n^{(l)}) \log \frac{p(i_{n+1}|i_n^{(k)}, j_n^{(l)})}{p(i_{n+1}|i_n^{(k)})} \end{aligned}\]

Hence,

\[\fbox{$ T_{J \to I} = \sum_{i_{n+1}, i_n^{(k)}, j_n^{(l)}} p(i_{n+1}, i_n^{(k)}, j_n^{(l)}) \log \frac{p(i_{n+1}|i_n^{(k)}, j_n^{(l)})}{p(i_{n+1}|i_n^{(k)})} $}\]

This can also be seen as the KL Divergence between the distributions $p(i_{n+1}| i_n^{(k)}, j_n^{(l)})$ and $p(i_{n+1}| i_n^{(k)})$: it measures the error made by assuming the latter distribution when the former is the true one.
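
Finally, a minimal plug-in sketch of the boxed formula (not Schreiber's own implementation, just a naive histogram estimator; the function name `transfer_entropy` and the toy data are mine). Two made-up binary series are used, where $J$ drives $I$ with a one-step delay, so $T_{J\to I}$ should come out clearly larger than $T_{I\to J}$:

```python
import numpy as np
from collections import Counter

def transfer_entropy(i_series, j_series, k=1, l=1):
    """Plug-in estimate of T_{J->I} with history lengths k (for I) and l (for J)."""
    start = max(k, l)
    counts = Counter()
    for n in range(start, len(i_series)):
        i_next = i_series[n]
        i_hist = tuple(i_series[n - k:n])
        j_hist = tuple(j_series[n - l:n])
        counts[(i_next, i_hist, j_hist)] += 1
    total = sum(counts.values())

    # Marginal counts needed for the two conditional probabilities
    c_ij  = Counter()   # counts of (i_next, i_hist)
    c_i   = Counter()   # counts of i_hist
    c_ihj = Counter()   # counts of (i_hist, j_hist)
    for (i_next, i_hist, j_hist), c in counts.items():
        c_ij[(i_next, i_hist)] += c
        c_i[i_hist] += c
        c_ihj[(i_hist, j_hist)] += c

    te = 0.0
    for (i_next, i_hist, j_hist), c in counts.items():
        p_full      = c / total                              # p(i_{n+1}, i_hist, j_hist)
        p_cond_both = c / c_ihj[(i_hist, j_hist)]            # p(i_{n+1} | i_hist, j_hist)
        p_cond_i    = c_ij[(i_next, i_hist)] / c_i[i_hist]   # p(i_{n+1} | i_hist)
        te += p_full * np.log2(p_cond_both / p_cond_i)
    return te

rng = np.random.default_rng(0)
j = rng.integers(0, 2, size=20_000)
i = np.roll(j, 1) ^ (rng.random(20_000) < 0.1)   # I copies J with a one-step delay, plus noise

print(transfer_entropy(i, j))   # large: J drives I
print(transfer_entropy(j, i))   # near 0: no information flows from I back to J
```

A serious estimator would need finite-sample bias correction, but the directional asymmetry is already visible in this toy example.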