29 Jan 2020 - Abhishek Nandekar
A few notes on prime numbers:
- Example: Cicadas are small insects that have survived over time by hibernating underground for a prime number of years (7, 11, …). That way they avoid predator life cycles for longer, since a prime-length cycle drastically reduces the probability of a collision. Suppose a certain predator has a life cycle of 4 years and a cicada’s cycle were ~8 years; then a collision would happen every 8 years, making life difficult for the cicadas. But if the cicada’s cycle is 7 years instead of 8, a collision only occurs every 28 years (the least common multiple of 4 and 7), making it a bit easier for them to prevail (see the small sketch after this list).
- Matiyasevich’s Theorem:
- Diophantine Set: A set \(\mathbb M \) of non-negative integers for which there exists a polynomial \( P(a, x_1, \ldots, x_m)\) with integer coefficients such that the equation \begin{equation} \tag{DS} P(a, x_1, \ldots, x_m) = 0 \label{eq: ds} \end{equation} has a solution in the unknowns \(x_1, \ldots, x_m \) if and only if the value of the parameter \(a\) belongs to \( \mathbb M \).
- Effectively Enumerable Set: A set \( \mathbb M\) is called effectively enumerable if there exists an algorithm which, working potentially infinitely long, outputs all elements of the set \(\mathbb M\) in some order and only those.
- Matiyasevich’s Theorem: A set is Diophantine if and only if it is effectively enumerable. Since the set of primes is effectively enumerable, it is Diophantine; concretely, there exists a system of 14 polynomial equations in 26 variables, with a parameter \(k\), that has a solution in non-negative integers if and only if the number \( k+2 \) is prime.
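As a quick check on the cicada arithmetic, here is a minimal Python sketch (the function name is just illustrative) that computes the collision period of two life cycles as their least common multiple:

```python
from math import gcd

def collision_period(predator_cycle: int, cicada_cycle: int) -> int:
    """Years until the two life cycles coincide again (their LCM)."""
    return predator_cycle * cicada_cycle // gcd(predator_cycle, cicada_cycle)

print(collision_period(4, 8))  # 8  -> collisions every 8 years
print(collision_period(4, 7))  # 28 -> the prime cycle pushes collisions out to 28 years
```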
Introduction to Probability Theory
Stochastic Experiment: An experiment that can have multiple outcomes, with no deterministic rule governing which outcome occurs; even if such a rule exists, there is still some randomness. E.g. tossing a coin, rolling a die, etc.
We’ll use the coin toss as the running example of a stochastic experiment while studying the concepts of probability theory.
Sample Space: A set \( \mathbb S \) of all possible outcomes in the stochastic experiment. Here, \( \mathbb S = \{H, T\} \)
Event: An element of the set of all possible subsets (the power set) of \( \mathbb S\), i.e. a subset of \(\mathbb S\). Suppose \( |\mathbb S| = n \), where \( |\cdot|\) denotes the cardinality (number of elements) of a set; then a total of \(2^n\) events are possible.
Here, the power set(the set of all possible subsets) of \( \mathbb S \) is :
\[\textbf{P} = \{\phi, \{H\}, \{T\}, \{H, T\}\}\]where \(\phi\) is the null set; it stands for the impossible event. Notice the presence of \(\mathbb S = \{H, T\}\) in the power set \(\textbf{P}\). This element stands for the certain event.
Let \( A, B \) be events of a stochastic experiment. It follows that \( A, B \in \textbf{P} \) and \( A, B \subseteq \mathbb S \).
Then,
- \( A \cup B \in \textbf{P} \): The event that the event \(A\) OR event \(B\) occurs.
- \( A \cap B \in \textbf{P} \): The event that both events \(A\) AND \(B\) occur.
- \( A^c \in \textbf{P} \): The event that the event \(A\) does NOT occur.
- If \(A, B\) are mutually exclusive, then \(A \cap B = \phi \).
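A minimal Python sketch of these definitions for the coin-toss sample space; it enumerates the power set and applies the usual set operations (variable names are illustrative):

```python
from itertools import chain, combinations

S = {"H", "T"}                            # sample space of a single coin toss

def power_set(s):
    """All subsets of s, i.e. every possible event."""
    items = list(s)
    return [frozenset(c) for c in chain.from_iterable(
        combinations(items, r) for r in range(len(items) + 1))]

P = power_set(S)
print(len(P), P)                          # 2^2 = 4 events: {}, {H}, {T}, {H, T}

A, B = frozenset({"H"}), frozenset({"T"})
print(A | B)                              # union: the event "A or B"
print(A & B)                              # intersection: empty, so A and B are mutually exclusive
print(frozenset(S) - A)                   # complement of A
```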
Venn diagrams can be useful for visualising sets; they are used extensively in set theory.
Axioms of Probability
- For any event \( A \), \( P(A) \geq 0\)
- \(P(\mathbb S) = 1\)
- If \(A, B\) are mutually exclusive, then \( P(A \cup B) = P(A) + P(B) \)
where \(\mathbb S\) is finite and all outcomes are equally likely, so that \(P(\text{event}) = \frac{\text{number of favourable outcomes}}{\text{total number of outcomes}} \)
Example: A number is chosen at random from the set \(\{1, 2, 3, \ldots, 25\}\). Let the selected number be \(n\).
In the bullets below, the symbol \(|\) stands for “divides” (e.g. \(2|n\) means 2 divides \(n\)). Note that this symbol is usually reserved for conditional probability, so unless specified otherwise, read \(|\) as conditional probability.
- For even numbers, \( P(2|n) = \frac{|\{2, 4, 6, \ldots, 24\}|}{|\{1, 2, 3, \ldots, 25\}|} = \frac{12}{25} \)
- \( P(3|n) = \frac{|\{3, 6, 9, \ldots, 24\}|}{|\{1, 2, 3, \ldots, 25\}|} = \frac{8}{25} \)
- \( P(2|n \cap 3|n) = \frac{|\{6, 12, 18, 24\}|}{|\{1, 2, 3, \ldots, 25\}|} = \frac{4}{25}\)
- \( P(2|n \cup 3|n) = P(2|n) + P(3|n) - P(2|n \cap 3|n) = \frac{12}{25} + \frac{8}{25} - \frac{4}{25} = \frac{16}{25} \)
- \( P(n \text{ ends in digit 2}) = \frac{|\{2, 12, 22\}|}{|\{1, 2, 3, \ldots, 25\}|} = \frac{3}{25} \)
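These counts are easy to verify by brute-force enumeration; a small Python sketch (the helper `prob` is just illustrative):

```python
S = range(1, 26)                                   # the sample space {1, 2, ..., 25}

def prob(event):
    """P(event) = number of favourable outcomes / total number of outcomes."""
    return sum(1 for n in S if event(n)) / len(S)

print(prob(lambda n: n % 2 == 0))                  # 12/25 = 0.48
print(prob(lambda n: n % 3 == 0))                  # 8/25  = 0.32
print(prob(lambda n: n % 2 == 0 and n % 3 == 0))   # 4/25  = 0.16
print(prob(lambda n: n % 2 == 0 or n % 3 == 0))    # 16/25 = 0.64
print(prob(lambda n: n % 10 == 2))                 # 3/25  = 0.12
```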
Conditional Probability
Given events \(A, E \subseteq \mathbb S \) with \(P(E) > 0 \), the probability of \(A\) given that the event \(E\) has occurred is
\[P(A|E) = \frac{P(A \cap E)}{P(E)}\]\(P(E)\) is known as the normalisation factor. In particular, if \(A = E\), then \(P(A|E) = 1\).
Essentially, we are shrinking the sample space. \(P(A) = P(A|\mathbb S) \).
For example, in a deck of 52 cards, the probability that a selected card is the King, given that it is a spade is
\[P(K|\spadesuit) = \frac{P(K \cap \spadesuit)}{P(\spadesuit)} = \frac{1/52}{13/52} = \frac{1}{13}\]If \(A, E\) are disjoint, i.e. mutually exclusive events, then \(P(A|E) = 0\). For example, in a coin toss, given that tails has occurred, what is the probability that it is a head? Zero, since now our sample space has shrunk to \(\{T\}\).
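The same answer falls out of enumerating a 52-card deck; a rough Python sketch, assuming a simple (rank, suit) representation of cards:

```python
from itertools import product

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["spades", "hearts", "diamonds", "clubs"]
deck = list(product(ranks, suits))                  # 52 equally likely outcomes

def p(event):
    return sum(1 for card in deck if event(card)) / len(deck)

p_spade = p(lambda c: c[1] == "spades")                           # 13/52
p_king_and_spade = p(lambda c: c[0] == "K" and c[1] == "spades")  # 1/52
print(p_king_and_spade / p_spade)                   # P(K | spade) = 1/13 ≈ 0.0769
```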
If \(A, E\) are independent events i.e. if the occurrence of the event \(E\) has no effect on the probability of the occurrence of the event \(A\), then,
\[P(A \cap E) = P(A) \cdot P(E)\]In such a case, \( P(A|E) = P(A) \).
For example, consider a roll of a die. The sample space is \(\mathbb S = \{1, 2, 3, 4, 5, 6\} \). Let \(n\) be the number on the die after the roll. Consider the event \(A\) that \(n\) is a multiple of 3, and another event \(E\) that \(n\) is an even number. Then \(P(A \cap E) = P(\{6\}) = \frac{1}{6} = \frac{1}{3} \cdot \frac{1}{2} = P(A) \cdot P(E) \). Notice that divisibility by 3 and divisibility by 2 are two independent properties.
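A quick enumeration of the die roll confirms the product rule for these two events; a minimal Python sketch:

```python
S = [1, 2, 3, 4, 5, 6]                           # sample space of one die roll

def p(event):
    return sum(1 for n in S if event(n)) / len(S)

p_A = p(lambda n: n % 3 == 0)                    # multiple of 3 -> 1/3
p_E = p(lambda n: n % 2 == 0)                    # even number   -> 1/2
p_AE = p(lambda n: n % 3 == 0 and n % 2 == 0)    # only n = 6    -> 1/6
print(p_AE, p_A * p_E)                           # both 0.1666..., so A and E are independent
```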
Introduction to Information Theory
Consider the above image. The constructs in it are distinguished by the abstract nature of the properties in the left block versus the concrete nature of the properties in the right block. The properties on the right are measurable, quantitative and objective; the properties on the left are immeasurable, qualitative and subjective. Most information in the real world falls in the left block. Claude Shannon tried to concretise information using an axiomatic approach. According to Shannon, “Information is what Information does”.
Self Information
Self Information of an event \(A\) is defined with the following axioms:
- \(i(P(A)) \geq 0\)
Information is by definition non-negative
- \(P(A) \leq P(B) \iff i(P(A)) \geq i(P(B)) \)
If \(A = \phi\) (so \(P(A) = 0\)), then \(i(P(A)) \to \infty \)
\(i(P(\mathbb S)) = 0 \), since \(P(\mathbb S) = 1 \)
This axiom is trying to state that probability and uncertainty are inversely related. Something which is more probable to occur is less surprising and vice versa. Surprising events offer more information.
- If \(A, B \) are independent events, then \(i(P(A \cap B)) = i(P(A)) + i(P(B)) \)
From the above axioms it can be inferred that \(i(P) = -\log P\): the additivity axiom forces a logarithm (up to a constant factor, the logarithm is the only continuous function satisfying \(f(P_A P_B) = f(P_A) + f(P_B)\)), and the first two axioms fix the negative sign.
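A direct translation of this definition into Python, just to put numbers on a few events (base-2 logarithm, so the results are in bits):

```python
from math import log2

def self_information(p: float) -> float:
    """i(P) = -log2(P); rarer events carry more information."""
    return -log2(p)

print(self_information(1.0))    # 0.0 bits: the certain event is unsurprising
print(self_information(0.5))    # 1.0 bit:  a fair coin flip
print(self_information(1/8))    # 3.0 bits: a 1-in-8 event
```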
Average Self Information/Shannon entropy
\begin{align}
H(\textbf{X}) &= \sum_{j=1}^n P_j \cdot i(P_j) \\
&= \sum_{j=1}^n -P_j \cdot \log_2(P_j)
\end{align}
where \(\textbf{X}\) is a discrete random variable. A discrete random variable maps the outcomes of the sample space to \(\mathbb R \).
Let \(S = \{a_1, a_2, \ldots, a_n\}\), where the elements have probabilities \(p_1, p_2, \ldots, p_n\) respectively. The self information of each element is then \(-\log_2 p_1, -\log_2 p_2, \ldots, -\log_2 p_n \), so
\[H(S) = -\sum_{k=1}^n p_k \log_2 p_k\]Units of Information: bits, shannons (when base of the logarithm is 2)
1 bit = 1 shannon = the average uncertainty revealed by a sample space of two equally likely outcomes; more generally, a sample space of \(2^n\) equally likely outcomes has an entropy of \(n\) bits.
When the base of the logarithm is \(e\), the unit is the “nat”; similarly, if the base of the logarithm is 3, the unit is the “trit”. But usually, \(\log_2\) is considered the standard.
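A short Python sketch of the entropy formula, applied to a fair coin, a biased coin and a fair die (base 2, so the results are in bits):

```python
from math import log2

def entropy(probs):
    """H = -sum(p * log2(p)), skipping zero-probability outcomes."""
    return -sum(p * log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))          # 1.0 bit: a fair coin
print(entropy([0.9, 0.1]))          # ~0.469 bits: a biased coin is less surprising
print(entropy([1/6] * 6))           # ~2.585 bits: a fair die roll
```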
Claude Shannon, in his original 1948 paper on information theory, posits the axioms for the average self information of a set of events, instead of the approach used here.
Grouping Axiom
Let \(S = \{1, 2, 3, \ldots, 8\}\). A person is asked to choose a number at random and keep it a secret. Then the minimum number of yes/no questions we need to ask that person to correctly determine the number they chose is the ceiling of the average self information of \(S\):
\[H(S) = \sum_{k=1}^8 -\frac{1}{8} \log_2 \frac{1}{8} = -\log_2 \frac{1}{8} = 3 \text{ bits}\]No matter how you ask the questions, you can never beat the entropy.
Now let’s say we know that the person is more likely to choose 1 \((p_1 = 1/2) \) or 2 \((p_2 = 1/8)\) than any of the other numbers \((p_k = 1/16, k \in \{3, \ldots, 8\})\). Then the average self information is:
\begin{align}
H(S) &= -\frac{1}{2} \log_2 \frac{1}{2} - \frac{1}{8} \log_2 \frac{1}{8} - 6\left(\frac{1}{16} \log_2 \frac{1}{16}\right) \\
&= \frac{1}{2} + \frac{3}{8} + \frac{6}{4} = \frac{19}{8} = 2.375 \text{ bits}
\end{align}
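The same numbers can be reproduced in a couple of lines of Python; since every probability here is a power of \(1/2\), an optimal yes/no question scheme (a Huffman-style one) achieves the entropy exactly, which illustrates the “you can never beat the entropy” remark above:

```python
from math import log2

def entropy(probs):
    return -sum(p * log2(p) for p in probs if p > 0)

uniform = [1/8] * 8                       # every number equally likely
skewed  = [1/2, 1/8] + [1/16] * 6         # 1 and 2 are more likely

print(entropy(uniform))                   # 3.0 bits -> at least 3 yes/no questions on average
print(entropy(skewed))                    # 2.375 bits

# With these dyadic probabilities, giving number k a codeword of length -log2(p_k)
# (1 question for "1", 3 for "2", 4 for each of 3..8) is a valid prefix code,
# so the expected number of questions equals the entropy:
print(sum(p * -log2(p) for p in skewed))  # 2.375 questions on average
```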