11 Feb 2020 - Gulshan
Some assumptions for source coding:
We assume there is no noise, because we only want to look at compression and decompression alone, so we do not worry about noise. The other assumption is that channel coding and source coding are treated independently: by Shannon's separation theorem, under certain conditions they can be done independently, so you do not have to design both of them together.
”I am the Alpha to Zeta
of systems - Micro to Meta,
I grow daily by the Quanta”
Who am I?
Student: Information.
Teacher: Not information; information is a very technical term.
Student: Data.
Teacher: Data. Right.
Whether information increases or entropy increases, we don't know, but data definitely does.
Showing slide: **Sarvam Khalvidam Data** (a play on a mahavakya), i.e. data is everywhere.
These are some notations
Kilo (\(10^3\)) < Mega (\(10^6\)) < Giga (\(10^9\)) < Tera (\(10^{12}\)) < Peta (\(10^{15}\)) < Exa (\(10^{18}\)) < Zetta (\(10^{21}\)) < Yotta (\(10^{24}\)) …
All these are old stuff. I guess these numbers will change now.
The Need for Data Compression
For storage and transmission.
Mobile devices have limited storage capacity.
Medical data is increasing in resolution and dimensions (eg. 4D Ultrasound). Data storage and transmission needs to be “lossless”.
The Solution:
Text Compression (e.g. Gzip, Tar)
Image Compression (e.g. JPEG)
Audio/Video Compression (e.g. MPEG)
Codes for Lossless Data Compression
Start with a sequence of symbols \(X = X_1, X_2, X_3, \dots, X_n\) from a finite source alphabet:
\(A_X = \{a_1, a_2, a_3, \dots\}\)
Examples: \(A_X = \{A, B, \dots, Z, \_\}\);
Every one of them comes from some alphabet. In the case of the English language, it will be from A to Z.
If you have an image, every pixel takes a value from 0 to 255, so \(X_1, X_2, \dots\) are the pixels of the image and \(X\) is the image.
\(A_X = \{0, 1, 2, \dots, 255\}\);
\(A_X = \{C, G, T, A\}\);
\(A_X = \{0, 1\}\): here it would be a binary file, binary sequence, or binary image.
So what does the encoder do?
Encoder: outputs a new sequence \(Z = Z_1, Z_2, \dots, Z_M\),
using a possibly different code alphabet. Typically the alphabet is {0, 1}.
Decoder: tries to convert Z back to X.
What is the Code C?
When we use the word code, we might use a small c or a big C.
Small c means a codeword.
Big C means the codebook.
The codebook is the collection of all codewords.
A codeword is the image of a single input: say \(X_i\) is the input and it is mapped to \(Z_i\):
\(X_i \xrightarrow{C} Z_i\)
It could be in a table form.
Input | Output
---|---
\(X_1\) | \(Z_1\)
\(X_2\) | \(Z_2\)
… | …
\(X_n\) | \(Z_n\)
Think of C as a function that maps inputs to outputs.
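As a minimal sketch in Python (using, as a hypothetical codebook, the C/G/T/A code that appears in the example further below), the codebook is literally a lookup table and encoding is table lookup plus concatenation:

```python
# Hypothetical symbol codebook C: each input symbol maps to one codeword.
codebook = {"C": "0", "G": "01", "T": "001", "A": "111"}

def encode(message, codebook):
    """Symbol code: replace every symbol by its codeword and concatenate."""
    return "".join(codebook[symbol] for symbol in message)

print(encode("CGT", codebook))  # '001001'
```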
There are two kinds of codes: symbol codes and stream codes.
A symbol code maps a single symbol (a single letter of the alphabet)
to a single codeword.
A stream code can take multiple symbols from the message and map them to a single codeword,
like \(X_1, X_2, X_3 \rightarrow Z_1\)
\(X_4, X_5 \rightarrow Z_2\)
In the extreme case the entire message is one codeword.
Block ciphers are like symbol codes; they are not exactly symbol codes, but similar.
Stream ciphers are the analogue of stream codes.
Morse code: in Morse code the letter 'e' is assigned '.' and 'o' is assigned '- - -'. One dash equals three dots in duration. Morse code basically works as an electrical telegraph system; it works by making or breaking a circuit.
The idea is to assign shorter codewords for frequently occurring symbols
Compression ratio = original size /compressed file size
Morse code is better heard than seen; people found it is faster when you just hear it.
Interestingly, this uses one of the senses: here it is hearing, but you can also use vision or touch.
An example of touch is Braille.
Every touch sensation is uniquely mapped to a single input symbol.
So, what does alphabet mean?
The alphabet is the set out of which the symbols of the output space are made.
What is the alphabet of the output, when the input alphabet is A to Z?
Braille is a binary code; 'binary' here is an attribute of the alphabet, which has two possibilities.
Each position in a cell is either a depression or a mound, so 0 or 1.
How many possible codewords are there?
A cell has 6 positions, so \(2^6 = 64\) possible codewords.
Examples of Codes.
Let us assign a code to each symbol of the alphabet (Symbol Code).
Example: \(A_X\) = {C,G,T,A}
Assign: C = 0, G = 01, T = 001, A = 111
Therefore the sequence ‘CGT’ is encoded as ‘001001’. (I)
This shows the ENCODING process. The Decoding has to ensure that we recover the message uniquely always.
A code C is uniquely decodable if every sequence of codewords can be uniquely decoded without any ambiguity.
Example: 'CGT' is encoded as '001001', which is NOT uniquely decodable. Why?
Because the received sequence ‘001001’ could be decoded as ‘CGT’ or as ‘TT’. There is no way to distinguish between the two cases.
How to determine whether a Code is uniquely decodable or not?
Sardinas-Patterson Theorem (refer to Jones & Jones).
What does this theorem do? It takes a code, whatever the code may be,
and determines whether the code is uniquely decodable or not.
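A rough Python sketch of the Sardinas-Patterson test (the function names are my own); it repeatedly generates "dangling suffixes" and declares the code not uniquely decodable if any suffix equals a codeword:

```python
def dangling_suffixes(prefixes, words):
    """Nonempty suffixes w such that p + w is in `words` for some p in `prefixes`."""
    return {w[len(p):] for p in prefixes for w in words
            if w != p and w.startswith(p)}

def is_uniquely_decodable(code):
    """Sardinas-Patterson test for a set of codewords given as strings."""
    C = set(code)
    S = dangling_suffixes(C, C)                  # initial dangling suffixes
    seen = set()
    while S and not (S & C):
        if frozenset(S) in seen:                 # suffix sets repeat: safe to stop
            return True
        seen.add(frozenset(S))
        S = dangling_suffixes(S, C) | dangling_suffixes(C, S)
    return not (S & C)

print(is_uniquely_decodable(["0", "01", "001", "111"]))   # False (the example above)
print(is_uniquely_decodable(["0", "10", "110", "1110"]))  # True
```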
Prefix-free Codes
No codeword is a ‘prefix’ of another
Recall: In our previous example,
C =0 and G = 01
Here, the codeword ’0’ is a prefix of ‘01’.
This created the ambiguity.
Example
Take:
‘C’ = ’0’
‘G’ =’10’
‘T’ = ’110’
‘A’ = ‘1110’
The above Code is a Prefix-free code.
Hence, it is uniquely decodable.
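Checking the prefix-free property is a simple pairwise comparison of codewords; a minimal sketch in Python (the function name is my own):

```python
def is_prefix_free(codewords):
    """True if no codeword is a prefix of a different codeword."""
    return not any(u != v and v.startswith(u)
                   for u in codewords for v in codewords)

print(is_prefix_free(["0", "01", "001", "111"]))   # False: '0' is a prefix of '01'
print(is_prefix_free(["0", "10", "110", "1110"]))  # True: the code above
```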
Example
A = 0
B = 01
C = 11
Here A is a prefix of B, so the code is not instantaneous.
If we receive "01111111111111111…..", we can't decode until we get the last bit.
If we receive "0111", the message is "BC".
If we receive "01111", the message is "ACC".
This code is not prefix-free, but it is uniquely decodable. It is not instantaneous; there is a delay in the decoding.
Morse code is not instantaneous.
Prefix-free codes are always instantaneous.
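Because no codeword is a prefix of another, a decoder can emit a symbol the moment a complete codeword has been read, without looking ahead. A minimal sketch in Python, using the prefix-free C/G/T/A code above (function name my own):

```python
def decode_prefix_free(bits, codebook):
    """Instantaneous decoding: output a symbol as soon as a full codeword is read."""
    inverse = {codeword: symbol for symbol, codeword in codebook.items()}
    decoded, buffer = [], ""
    for bit in bits:
        buffer += bit
        if buffer in inverse:        # complete codeword: no need to look ahead
            decoded.append(inverse[buffer])
            buffer = ""
    return "".join(decoded)

prefix_code = {"C": "0", "G": "10", "T": "110", "A": "1110"}
print(decode_prefix_free("010110", prefix_code))  # 'CGT'
```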
Suppose we receive the word "bomb", but this is not the full transmission; the actual word is "bombay" instead of "bomb".
In both cases, we act differently.
The prefix is the problem here: "bomb" is a prefix of "bombay".
Theorem: Prefix-free codes are always uniquely decodable
Instantaneous codes = prefix-free codes = self-punctuating codes (all of these are equivalent).
That means you don't need to punctuate: just by ensuring that no codeword is a prefix of another codeword, you don't need commas, full stops, or colons.
Symbol | Code A | Code B | Code C | Code D
---|---|---|---|---
a | 10 | 0 | 0 | 0
b | 11 | 10 | 01 | 01
c | 111 | 110 | 011 | 11
Code A: Not uniquely decodable
Both bbb and cc encode as 111111
Code B: Instantaneously decodable
End of codeword marked by 0
Code C: Decodable with one symbol delay
End of codeword marked by the following 0
Code D: Uniquely decodable, but with unbounded delay:
011111111111111 decodes as accccccc
01111111111111 decodes as bcccccc
Symbol | Code E | Code F | Code G
---|---|---|---
a | 100 | 0 | 0
b | 101 | 001 | 01
c | 010 | 010 | 011
d | 011 | 100 | 1110
Code E: Instantaneously decodable
All codewords have the same length
Code F: Not uniquely decodable
e.g. baa, aca, aad all encode as 00100
Code G: Decodable with a six-symbol delay
McMillan Inequality
There is a binary uniquely decodable code C with codeword lengths \(\{l_1, l_2, l_3, \dots, l_n\}\) if and only if:
\(\sum_{i=1}^{n} 2^{-l_i} \le 1\)
It does not give an explicit construction of the code, only the existence of a uniquely decodable code.
There is a uniquely decodable code with codeword lengths 1,2,3,3 since ½ + ¼ + 1/8 + 1/8 =1
Code for the above case: {0, 01, 011, 111}
There is no uniquely decodable code with codeword lengths 2, 2, 2, 2, 2 since ¼ + ¼ + ¼ + ¼ + ¼ > 1
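These two sums can be checked numerically; a small sketch in Python (the function name is my own):

```python
from fractions import Fraction

def kraft_sum(lengths):
    """Kraft-McMillan sum over the proposed codeword lengths."""
    return sum(Fraction(1, 2 ** l) for l in lengths)

print(kraft_sum([1, 2, 3, 3]))       # 1    -> a uniquely decodable code exists
print(kraft_sum([2, 2, 2, 2, 2]))    # 5/4  -> no uniquely decodable code exists
```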
Kraft’s Inequality
There is a binary prefix-free code C for codeword lengths \(\{l_1, l_2, l_3, \dots, l_n\}\) if and only if:
\(\sum_{i=1}^{n} 2^{-l_i} \le 1\)
The same condition as that of McMillan’s inequality => no point in going for uniquely decodable codes which are not prefix-free.
Example: {0, 10, 110, 111} for lengths 1,2,3,3
Shannon-Fano, Arithmetic code, Huffman’s codes: all are prefix-free codes.
How to design Good Codes:
- We want small code-word lengths.
- But the Kraft-McMillan (KM) inequality says we cannot make all codewords short: some can be short, others have to be long.
- Best is to assign smaller codes to more frequently occurring symbols and longer codes to less frequently (rare) occurring symbols.
- This is what Morse Code does !!
- Take Morse code: it assigns shorter codes to more frequently occurring symbols.
- One major assumption is that any two symbols are independent.
- Hence the need to determine probabilities (frequencies) of the symbols of the source message.
- Assume: Each symbol in the message is independent of others (not true in reality!!)
The average codeword length:
\(L(C) = \sum_{i=1}^{n} P_i l_i\) bits/symbol
The Goal is to minimize L(C)
Symbol | Codeword | \(l_i\)
---|---|---
A | 0 | 1
B | 10 | 2
C | 110 | 3
D | 111 | 3
There are 4 symbols; assuming each is equally likely (\(P_i = 1/4\)), \(L(C) = (1+2+3+3)/4 = 9/4\) bits/symbol.
Our interest is the average codeword length.
If there are N symbols in the message, the average length of the compressed file will be
\(N \cdot L(C)\) bits.
Optimization Problem
Once the symbol probabilities are determined, we need to find code-word lengths such that L(C) is minimum, but subject to KM inequality:
Minimize \(\sum_{i=1}^{n} P_i l_i\) subject to \(\sum_{i=1}^{n} 2^{-l_i} \le 1\)
Once $l_i $ are determined, we need to find the actual codewords which are prefix-free and have these codeword lengths.
\(L(C) \ge H(X)\)
\(L(C) \ge H(P_1, P_2, P_3, \dots, P_n)\)
\(L = \sum_{i=1}^{n} P_i l_i \ge \sum_i P_i \log \frac{1}{P_i} = H\)
We can approach the entropy limit by increasing the complexity of the algorithm.
L(C) should be minimum and the code should be prefix-free.
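As a quick numeric check of this bound, using the code from the table above and the equal probabilities assumed there:

```python
from math import log2

lengths = {"A": 1, "B": 2, "C": 3, "D": 3}     # codeword lengths from the table
probs = {symbol: 0.25 for symbol in lengths}   # assumed equal probabilities

L = sum(probs[s] * lengths[s] for s in lengths)     # average codeword length
H = sum(p * log2(1 / p) for p in probs.values())    # source entropy

print(L, H)  # 2.25 >= 2.0 bits/symbol
```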
Shannon-Fano codes (1948), used in the ZIP file format.
The codewords are terminating leaves, and the prefix-free nature comes from the tree structure.
If we choose one node as a codeword, we can't pick its descendants.
Huffman coding (1952)
Huffman codes are optimal symbol codes (each symbol is mapped to a unique codeword). They are prefix-free.
Huffman coding is very popular and easy to implement, and hence coding/decoding is very fast.
Used in JPEG – international standard for still image compression.
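A minimal sketch of Huffman's construction in Python (the function name and the example probabilities are mine, not from the lecture); it repeatedly merges the two least probable subtrees and prepends one bit to every codeword in the merged subtree:

```python
import heapq
from itertools import count

def huffman_code(probs):
    """Build a prefix-free symbol code from symbol probabilities."""
    order = count()  # tie-breaker so equal probabilities never compare dicts
    heap = [(p, next(order), {symbol: ""}) for symbol, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p0, _, left = heapq.heappop(heap)    # two least probable subtrees
        p1, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (p0 + p1, next(order), merged))
    return heap[0][2]

print(huffman_code({"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}))
# {'a': '0', 'b': '10', 'c': '110', 'd': '111'} (up to swapping 0s and 1s)
```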
Kraft's Inequality (binary): Proof
Given codeword lengths
\(n_1 \le n_2 \le \dots \le n_L\),
there exists a prefix-free code with these lengths if and only if
\(\sum_{k=1}^{L} 2^{-n_k} \le 1\)   (1)
There are two things to prove.
Sufficient condition: given equation (1), we shall construct a prefix-free code.
e.g. take lengths \(n_1 = 1, n_2 = 2, n_3 = 3\),
so L = 3.
Leaves lost at the deepest level \(n_L = 3\) due to the selection of each codeword:
When \(C_1 = 0\): 4 leaves lost
\(C_2 = 10\): 2 leaves lost
\(C_3 = 110\): 1 leaf lost
Total: 7
\(7 \le 8 = 2^3\)
In general, \(2^{n_L - n_1}\) is the number of leaves lost at depth \(n_L\) when \(C_1\) is assigned at depth \(n_1\), so the construction works as long as
\(\sum_{i=1}^{L} 2^{n_L - n_i} \le 2^{n_L}\)
i.e. \(\sum_{i=1}^{L} 2^{-n_i} \le 1\)
The last depth, the \(n_L\)-th depth, is the heart of the algorithm.
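The sufficiency argument is constructive; here is a sketch of the same leaf-counting idea in Python (function name mine), assigning codewords in order of increasing length:

```python
def code_from_lengths(lengths):
    """Construct a prefix-free code for lengths satisfying the Kraft inequality.
    Each codeword takes the next free node at its depth, skipping all
    descendants of previously assigned codewords (the 'lost leaves')."""
    assert sum(2 ** -l for l in lengths) <= 1, "Kraft inequality violated"
    codewords, value, depth = [], 0, 0
    for l in sorted(lengths):
        value <<= (l - depth)                 # descend to depth l in the binary tree
        codewords.append(format(value, f"0{l}b"))
        value += 1                            # next unused node at this depth
        depth = l
    return codewords

print(code_from_lengths([1, 2, 3]))  # ['0', '10', '110'], as in the example above
```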
Necessary condition:
We start with codewords which already form a prefix-free code.
Count the leaves left, and keep a count of the leaves lost;
we have to arrive at the relation
\(\sum_{i=1}^{L} 2^{n_L - n_i} \le 2^{n_L}\)
i.e. \(\sum_{i=1}^{L} 2^{-n_i} \le 1\)
1) Verify that it is prefix-free (string/substring match).
2) Count the leaves left after each codeword assignment.
We land up with
\(\sum_{i=1}^{L} 2^{n_L - n_i} \le 2^{n_L}\), i.e. \(\sum_{i=1}^{L} 2^{-n_i} \le 1\)
For symbol codes, Huffman is the best, but stream codes can do better than symbol codes.
Blocking groups symbols into blocks; this increases the complexity of the algorithm.