11 Feb 2020 - Gulshan
Some assumptions for source coding:
We assume there is no noise, because we only want to look at compression and decompression alone, so we do not worry about noise. The other assumption is that channel coding and source coding are treated independently: by Shannon's separation theorem, under certain conditions they can be done independently, so you do not have to design both of them together.
”I am the Alpha to Zeta
of systems - Micro to Meta,
I grow daily by the Quanta”
Who am I?
Student: Information.
Teacher: Not information; information is a very technical term.
Student: Data.
Teacher: Data. Right.
Whether information increases or entropy increases, we don't know, but data definitely does.
Showing slide: **Sarvam Khalvidam Data** (a play on a mahavakya), i.e. data is everywhere.
These are some notations
Kilo (\(10^3\)) < Mega (\(10^6\)) < Giga (\(10^9\)) < Tera (\(10^{12}\)) < Peta (\(10^{15}\)) < Exa (\(10^{18}\)) < Zetta (\(10^{21}\)) < Yotta (\(10^{24}\)) …
All these are old stuff. I guess these numbers will change now.
The Need for Data Compression
For storage and transmission.
Mobile devices have limited storage capacity.
Medical data is increasing in resolution and dimensions (eg. 4D Ultrasound). Data storage and transmission needs to be “lossless”.
The Solution:
Text Compression (e.g. Gzip, Tar)
Image Compression (e.g. JPEG)
Audio/Video Compression (e.g. MPEG)
Codes for Lossless Data Compression
Start with a sequence of symbols \(X = X_1, X_2, X_3, \dots, X_n\) from a finite source alphabet:
\(A_X = \{a_1, a_2, a_3, \dots\}\)
Examples: \(A_X = \{A, B, \dots, Z, \_\}\);
Every one of them comes from some alphabet. In the case of the English language, it will be from A to Z.
If you have an image, every pixel takes a value from 0 to 255, so \(X_1, X_2, \dots\) are the pixels of the image and \(X\) is the image.
\(A_X = \{0, 1, 2, \dots, 255\}\);
\(A_X = \{C, G, T, A\}\);
\(A_X = \{0, 1\}\): here it would be a binary file, binary sequence, or binary image.
So what does the encoder do?
Encoder: outputs a new sequence \(Z = Z_1, Z_2, \dots, Z_M\),
using a possibly different code alphabet. Typically the alphabet is {0, 1}.
Decoder: tries to convert Z back to X.
What is the Code C?
When we use the word code, we might use a small c or a big C.
Small c means a codeword.
Big C means the codebook.
The codebook is the collection of all codewords.
A codeword is the image of a single input: say \(X_i\) is the input and it is mapped to \(Z_i\):
\(X_i \xrightarrow{C} Z_i\)
It could be in a table form.
Input | Output
---|---
\(X_1\) | \(Z_1\)
\(X_2\) | \(Z_2\)
… | …
\(X_n\) | \(Z_n\)
Think of C as a function that maps inputs to outputs.
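As a minimal sketch in Python (using, as a hypothetical codebook, the C/G/T/A code that appears in the example further below), the codebook is literally a lookup table and encoding is table lookup plus concatenation:

```python
# Hypothetical symbol codebook C: each input symbol maps to one codeword.
codebook = {"C": "0", "G": "01", "T": "001", "A": "111"}

def encode(message, codebook):
    """Symbol code: replace every symbol by its codeword and concatenate."""
    return "".join(codebook[symbol] for symbol in message)

print(encode("CGT", codebook))  # '001001'
```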
There are two kinds of codes: symbol codes and stream codes.
A symbol code maps a single symbol (a single letter of the alphabet)
to a single codeword.
A stream code can take multiple symbols from the message and map them to a single codeword,
like \(X_1, X_2, X_3 \rightarrow Z_1\)
\(X_4, X_5 \rightarrow Z_2\)
In the extreme case the entire message is one codeword.
Block ciphers are like symbol codes; they are not exactly symbol codes, but similar.
Stream ciphers are the analogue of stream codes.
Morse code: in Morse code the letter 'e' is assigned '.' and 'o' is assigned '- - -'. One dash equals three dots in duration. Morse code basically works as an electrical telegraph system; it works by making or breaking a circuit.
The idea is to assign shorter codewords for frequently occurring symbols
Compression ratio = original size /compressed file size
Morse code is better heard than seen; people found it is faster when you just hear it.
Interestingly, this uses one of the senses: here it is hearing, but you can also use vision or touch.
An example of touch is Braille.
Every touch sensation is uniquely mapped to a single input symbol.
So, what does alphabet mean?
The alphabet is the set out of which the symbols of the output space are made.
What is the alphabet of the output, when the input alphabet is A to Z?
Braille is a binary code; 'binary' here is an attribute of the alphabet, which has two possibilities.
Each position in a cell is either a depression or a mound, so 0 or 1.
How many possible codewords are there?
A cell has 6 positions, so \(2^6 = 64\) possible codewords.
Examples of Codes.
Let us assign a code to each symbol of the alphabet (Symbol Code).
Example: \(A_X\) = {C,G,T,A}
Assign: C = 0, G = 01, T = 001, A = 111
Therefore the sequence ‘CGT’ is encoded as ‘001001’. (I)
This shows the ENCODING process. The Decoding has to ensure that we recover the message uniquely always.
A code C is uniquely decodable if every sequence of codewords can be uniquely decoded without any ambiguity.
Example: 'CGT' is encoded as '001001', which is NOT uniquely decodable. Why?
Because the received sequence ‘001001’ could be decoded as ‘CGT’ or as ‘TT’. There is no way to distinguish between the two cases.
How to determine whether a Code is uniquely decodable or not?
Sardinas-Patterson Theorem (refer to Jones & Jones).
What does this theorem do? It takes a code, whatever the code may be,
and determines whether the code is uniquely decodable or not.
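A rough Python sketch of the Sardinas-Patterson test (the function names are my own); it repeatedly generates "dangling suffixes" and declares the code not uniquely decodable if any suffix equals a codeword:

```python
def dangling_suffixes(prefixes, words):
    """Nonempty suffixes w such that p + w is in `words` for some p in `prefixes`."""
    return {w[len(p):] for p in prefixes for w in words
            if w != p and w.startswith(p)}

def is_uniquely_decodable(code):
    """Sardinas-Patterson test for a set of codewords given as strings."""
    C = set(code)
    S = dangling_suffixes(C, C)                  # initial dangling suffixes
    seen = set()
    while S and not (S & C):
        if frozenset(S) in seen:                 # suffix sets repeat: safe to stop
            return True
        seen.add(frozenset(S))
        S = dangling_suffixes(S, C) | dangling_suffixes(C, S)
    return not (S & C)

print(is_uniquely_decodable(["0", "01", "001", "111"]))   # False (the example above)
print(is_uniquely_decodable(["0", "10", "110", "1110"]))  # True
```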
Prefix-free Codes
No codeword is a ‘prefix’ of another
Recall: In our previous example,
C =0 and G = 01
Here, the codeword ’0’ is a prefix of ‘01’.
This created the ambiguity.
Example
Take:
‘C’ = ’0’
‘G’ =’10’
‘T’ = ’110’
‘A’ = ‘1110’
The above Code is a Prefix-free code.
Hence, it is uniquely decodable.
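Checking the prefix-free property is a simple pairwise comparison of codewords; a minimal sketch in Python (the function name is my own):

```python
def is_prefix_free(codewords):
    """True if no codeword is a prefix of a different codeword."""
    return not any(u != v and v.startswith(u)
                   for u in codewords for v in codewords)

print(is_prefix_free(["0", "01", "001", "111"]))   # False: '0' is a prefix of '01'
print(is_prefix_free(["0", "10", "110", "1110"]))  # True: the code above
```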
Example
A = 0
B = 01
C = 11
Here A is a prefix of B, so the code is not instantaneous.
If we receive "01111111111111111…..", we can't decode until we get the last bit.
If we receive "0111", the message is "BC".
If we receive "01111", the message is "ACC".
This code is not prefix-free, but it is uniquely decodable. It is not instantaneous; there is a delay in the decoding.
Morse code is not instantaneous.
Prefix-free codes are always instantaneous.
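Because no codeword is a prefix of another, a decoder can emit a symbol the moment a complete codeword has been read, without looking ahead. A minimal sketch in Python, using the prefix-free C/G/T/A code above (function name my own):

```python
def decode_prefix_free(bits, codebook):
    """Instantaneous decoding: output a symbol as soon as a full codeword is read."""
    inverse = {codeword: symbol for symbol, codeword in codebook.items()}
    decoded, buffer = [], ""
    for bit in bits:
        buffer += bit
        if buffer in inverse:        # complete codeword: no need to look ahead
            decoded.append(inverse[buffer])
            buffer = ""
    return "".join(decoded)

prefix_code = {"C": "0", "G": "10", "T": "110", "A": "1110"}
print(decode_prefix_free("010110", prefix_code))  # 'CGT'
```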
Suppose we receive the word "bomb", but this is not the full transmission; the actual word is "bombay" instead of "bomb".
In both cases, we act differently.
The prefix is the problem here: "bomb" is a prefix of "bombay".
Theorem: Prefix-free codes are always uniquely decodable
Instantaneous codes = prefix-free codes = self-punctuating codes (all of these are equivalent).
That means you don't need to punctuate: just by ensuring that no codeword is a prefix of another codeword, you don't need commas, full stops, or colons.
Symbol | Code A | Code B | Code C | Code D
---|---|---|---|---
a | 10 | 0 | 0 | 0
b | 11 | 10 | 01 | 01
c | 111 | 110 | 011 | 11
Code A: Not uniquely decodable
Both bbb and cc encode as 111111
Code B: Instantaneously decodable
End of codeword marked by 0
Code C: Decodable with one symbol delay
End of codeword marked by the following 0
Code D: Uniquely decodable, but with unbounded delay:
011111111111111 decodes as accccccc
01111111111111 decodes as bcccccc
Symbol | Code E | Code F | Code G
---|---|---|---
a | 100 | 0 | 0
b | 101 | 001 | 01
c | 010 | 010 | 011
d | 011 | 100 | 1110
Code E: Instantaneously decodable
All codewords have the same length
Code F: Not uniquely decodable
e.g. baa, aca, aad all encode as 00100
Code G: Decodable with a six-symbol delay
McMillan Inequality
There is a binary uniquely decodable code C with codeword lengths \(\{l_1, l_2, l_3, \dots, l_n\}\) if and only if:
\(\sum_{i=1}^{n} 2^{-l_i} \le 1\)
It does not give an explicit construction of the code, only the existence of a uniquely decodable code.
There is a uniquely decodable code with codeword lengths 1,2,3,3 since ½ + ¼ + 1/8 + 1/8 =1
Code for the above case: {0, 01, 011, 111}
There is no uniquely decodable code with codeword lengths 2, 2, 2, 2, 2 since ¼ + ¼ + ¼ + ¼ + ¼ > 1
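These two sums can be checked numerically; a small sketch in Python (the function name is my own):

```python
from fractions import Fraction

def kraft_sum(lengths):
    """Kraft-McMillan sum over the proposed codeword lengths."""
    return sum(Fraction(1, 2 ** l) for l in lengths)

print(kraft_sum([1, 2, 3, 3]))       # 1    -> a uniquely decodable code exists
print(kraft_sum([2, 2, 2, 2, 2]))    # 5/4  -> no uniquely decodable code exists
```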
Kraft’s Inequality
There is a binary prefix-free code C for codeword lengths \(\{l_1, l_2, l_3, \dots, l_n\}\) if and only if:
\(\sum_{i=1}^{n} 2^{-l_i} \le 1\)
The same condition as that of McMillan’s inequality => no point in going for uniquely decodable codes which are not prefix-free.
Example: {0, 10, 110, 111} for lengths 1,2,3,3
Shannon-Fano, Arithmetic code, Huffman’s codes: all are prefix-free codes.
How to design Good Codes:
- We want small code-word lengths.
- But the Kraft-McMillan (KM) inequality says we cannot make all codewords short: some can be short, others have to be long.
- Best is to assign smaller codes to more frequently occurring symbols and longer codes to less frequently (rare) occurring symbols.
- This is what Morse Code does !!
- Take Morse code: it assigns shorter codes to more frequently occurring symbols.
- One major assumption is that any two symbols are independent.
- Hence the need to determine probabilities (frequencies) of the symbols of the source message.
- Assume: Each symbol in the message is independent of others (not true in reality!!)
The average codeword length:
\(L(C) = \sum_{i=1}^{n} P_i l_i\) bits/symbol
The Goal is to minimize L(C)
Symbol | Codeword | \(l_i\)
---|---|---
A | 0 | 1
B | 10 | 2
C | 110 | 3
D | 111 | 3
There are 4 symbols; assuming each is equally likely (\(P_i = 1/4\)), \(L(C) = (1+2+3+3)/4 = 9/4\) bits/symbol.
Our interest is the average codeword length.
If there are N symbols in the message, the average length of the compressed file will be
\(N \cdot L(C)\) bits.
Optimization Problem
Once the symbol probabilities are determined, we need to find code-word lengths such that L(C) is minimum, but subject to KM inequality:
Minimize \(\sum_{i=1}^{n} P_i l_i\) subject to \(\sum_{i=1}^{n} 2^{-l_i} \le 1\)
Once $l_i $ are determined, we need to find the actual codewords which are prefix-free and have these codeword lengths.
\(L(C) \ge H(X)\)
\(L(C) \ge H(P_1, P_2, P_3, \dots, P_n)\)
\(L = \sum_{i=1}^{n} P_i l_i \ge \sum_i P_i \log \frac{1}{P_i} = H\)
We can approach the entropy limit by increasing the complexity of the algorithm.
L(C) should be minimum and the code should be prefix-free.
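As a quick numeric check of this bound, using the code from the table above and the equal probabilities assumed there:

```python
from math import log2

lengths = {"A": 1, "B": 2, "C": 3, "D": 3}     # codeword lengths from the table
probs = {symbol: 0.25 for symbol in lengths}   # assumed equal probabilities

L = sum(probs[s] * lengths[s] for s in lengths)     # average codeword length
H = sum(p * log2(1 / p) for p in probs.values())    # source entropy

print(L, H)  # 2.25 >= 2.0 bits/symbol
```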
Shannon-Fano codes (1948), used in the ZIP file format.
The codewords are terminating leaves, and the prefix-free nature comes from the tree structure.
If we choose one node as a codeword, we can't pick its descendants.
Huffman coding (1952)
Huffman codes are optimal symbol codes (each symbol is mapped to a unique codeword). They are prefix-free.
Huffman coding is very popular and easy to implement, and hence coding/decoding is very fast.
Used in JPEG – international standard for still image compression.
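A minimal sketch of Huffman's construction in Python (the function name and the example probabilities are mine, not from the lecture); it repeatedly merges the two least probable subtrees and prepends one bit to every codeword in the merged subtree:

```python
import heapq
from itertools import count

def huffman_code(probs):
    """Build a prefix-free symbol code from symbol probabilities."""
    order = count()  # tie-breaker so equal probabilities never compare dicts
    heap = [(p, next(order), {symbol: ""}) for symbol, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p0, _, left = heapq.heappop(heap)    # two least probable subtrees
        p1, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (p0 + p1, next(order), merged))
    return heap[0][2]

print(huffman_code({"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}))
# {'a': '0', 'b': '10', 'c': '110', 'd': '111'} (up to swapping 0s and 1s)
```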
Kraft's Inequality (binary): Proof
Given codeword lengths
\(n_1 \le n_2 \le \dots \le n_L\),
there exists a prefix-free code with these lengths if and only if
\(\sum_{k=1}^{L} 2^{-n_k} \le 1\)   (1)
There are two things to prove.
Sufficient condition: given equation (1), we shall construct a prefix-free code.
e.g. take lengths \(n_1 = 1, n_2 = 2, n_3 = 3\),
so L = 3.
Leaves lost at the deepest level \(n_L = 3\) due to the selection of each codeword:
When \(C_1 = 0\): 4 leaves lost
\(C_2 = 10\): 2 leaves lost
\(C_3 = 110\): 1 leaf lost
Total: 7
\(7 \le 8 = 2^3\)
In general, \(2^{n_L - n_1}\) is the number of leaves lost at depth \(n_L\) when \(C_1\) is assigned at depth \(n_1\), so the construction works as long as
\(\sum_{i=1}^{L} 2^{n_L - n_i} \le 2^{n_L}\)
i.e. \(\sum_{i=1}^{L} 2^{-n_i} \le 1\)
The last depth, the \(n_L\)-th depth, is the heart of the algorithm.
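The sufficiency argument is constructive; here is a sketch of the same leaf-counting idea in Python (function name mine), assigning codewords in order of increasing length:

```python
def code_from_lengths(lengths):
    """Construct a prefix-free code for lengths satisfying the Kraft inequality.
    Each codeword takes the next free node at its depth, skipping all
    descendants of previously assigned codewords (the 'lost leaves')."""
    assert sum(2 ** -l for l in lengths) <= 1, "Kraft inequality violated"
    codewords, value, depth = [], 0, 0
    for l in sorted(lengths):
        value <<= (l - depth)                 # descend to depth l in the binary tree
        codewords.append(format(value, f"0{l}b"))
        value += 1                            # next unused node at this depth
        depth = l
    return codewords

print(code_from_lengths([1, 2, 3]))  # ['0', '10', '110'], as in the example above
```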
Necessary condition:
We start with codewords which already form a prefix-free code.
Count the leaves left, and keep a count of the leaves lost;
we have to arrive at the relation
\(\sum_{i=1}^{L} 2^{n_L - n_i} \le 2^{n_L}\)
i.e. \(\sum_{i=1}^{L} 2^{-n_i} \le 1\)
1) Verify that it is prefix-free (string/substring match).
2) Count the leaves left after each codeword assignment.
We land up with
\(\sum_{i=1}^{L} 2^{n_L - n_i} \le 2^{n_L}\), i.e. \(\sum_{i=1}^{L} 2^{-n_i} \le 1\)
For symbol codes, Huffman is the best, but stream codes can do better than symbol codes.
Blocking groups symbols into blocks; this increases the complexity of the algorithm.