
06 Mar 2020 - Dr. Nithin Nagaraj

Lecture No. 6 & 7: Verifying Shannon’s Source Coding Theorem and Entropy of English

[Deadline: March 6, 2020]

Instructions: Do not assume anything over and above what was defined in class (unless the assignment explicitly asks you to do so). If you use any identity/theorem beyond what was covered in class, you must prove it; it is better to avoid this as much as possible. Show your calculations in their entirety. Justify all your answers with appropriate reasoning and arguments.

  1. Entropy of the English language. Write a program (in MATLAB/Python/C/C++ or any other programming language of your choice) to compute the Shannon entropy of an input sequence of symbols. Take at least 5 different articles/texts/essays in English, each with at least 1000 words, and compute the entropy of each of them. Repeat the exercise for block sizes of 1, 2 and 3 symbols. Suggestion: you can ignore all special characters in the document and convert all uppercase letters to lowercase, so that you have just 27 symbols: ‘a’ to ‘z’ and whitespace. (A minimal entropy-computation sketch follows this list.)

  2. For the same documents that you have chosen for the above task, perform a Gzip compression (or use any equivalent lossless compression program on Windows/Linux) and report the compressed file sizes. Note: both MATLAB and Python have built-in support for Zip/Gzip compression. (See the second sketch below.)

  3. Empirically verify Shannon’s lossless source coding theorem using the results from (1) and (2) above, and report any observations or analysis you care to carry out on this exercise. (See the third sketch below for one way to set up the comparison.)
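
As a starting point for task (1), here is a minimal Python sketch of block-entropy computation; the file name `article1.txt` and the preprocessing choices are assumptions for illustration, not part of the assignment.

```python
import collections
import math
import re

def block_entropy(text, block_size):
    """Shannon entropy in bits per block, estimated from the empirical
    distribution of overlapping blocks of the given size."""
    blocks = [text[i:i + block_size] for i in range(len(text) - block_size + 1)]
    counts = collections.Counter(blocks)
    total = len(blocks)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Hypothetical input file; reduce the text to the 27-symbol alphabet by
# lowercasing and collapsing every run of non-letters to a single space.
with open("article1.txt") as f:
    raw = f.read().lower()
text = re.sub(r"[^a-z]+", " ", raw)

for n in (1, 2, 3):
    h = block_entropy(text, n)
    print(f"block size {n}: H = {h:.4f} bits/block = {h / n:.4f} bits/symbol")
```

Dividing the block entropy by the block size gives bits per symbol, which should decrease as the block size grows, reflecting inter-symbol dependencies in English.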
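
For task (2), a sketch of Gzip compression using Python's standard library; again, the file name is a placeholder:

```python
import gzip
import os
import shutil

src = "article1.txt"            # hypothetical file name
dst = src + ".gz"

# Stream the file through gzip and write the compressed output.
with open(src, "rb") as f_in, gzip.open(dst, "wb") as f_out:
    shutil.copyfileobj(f_in, f_out)

print(f"original:   {os.path.getsize(src)} bytes")
print(f"compressed: {os.path.getsize(dst)} bytes")
```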
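
For task (3), Shannon's source coding theorem says that no lossless code can use fewer bits per symbol on average than the source entropy; under an i.i.d. model of the text, N·H/8 bytes is the corresponding bound for an N-symbol document. One way to set up the comparison, compressing the cleaned 27-symbol text so that the bound and the compressed size refer to the same data, is sketched below (file name again a placeholder):

```python
import collections
import gzip
import math
import re

with open("article1.txt") as f:          # hypothetical file name
    text = re.sub(r"[^a-z]+", " ", f.read().lower())

# Single-symbol (block size 1) entropy in bits per symbol.
counts = collections.Counter(text)
n = len(text)
h = -sum((c / n) * math.log2(c / n) for c in counts.values())

bound = n * h / 8                        # i.i.d. lower bound, in bytes
actual = len(gzip.compress(text.encode("ascii")))

print(f"entropy bound (block size 1): {bound:.0f} bytes")
print(f"gzip output:                  {actual} bytes")
```

Note that Gzip may compress below the block-1 bound, since it exploits inter-symbol dependencies that an i.i.d. model ignores; the tighter reference is the entropy rate, which the block-2 and block-3 per-symbol entropies from task (1) approximate more closely.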