
Exploring Shannon's "Prediction and Entropy of Printed English" Paper

I tried, and mostly failed, to replicate the results from Shannon's paper, using War and Peace as a stand-in for "the English language".

I say mostly because, while I'm fairly sure the entropy numbers I measured for different-length n-grams are correct, it turns out I was calculating the entropy of isolated n-grams while Shannon was calculating conditional n-gram entropy.
This helpful SO answer explains the difference.
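For concreteness, here's a rough sketch of the distinction in Python (the function names and the plain lowercased-string corpus are assumptions for illustration, not the code linked below): Shannon's F_n is the block entropy of n-grams minus the block entropy of (n-1)-grams, whereas I was reporting the block entropy on its own.

```python
import math
from collections import Counter

def block_entropy(text, n):
    """Entropy of isolated n-grams, H(X1..Xn), in bits per n-gram."""
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def conditional_entropy(text, n):
    """Shannon's F_n: H(Xn | X1..Xn-1) = H(n-gram block) - H((n-1)-gram block)."""
    if n == 1:
        return block_entropy(text, 1)
    return block_entropy(text, n) - block_entropy(text, n - 1)
```

For n = 1 the two measures coincide, which is why the single-character results still line up with Shannon's; the divergence only shows up from bigrams onward.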

I did get interesting results for the top 10 unigrams and bigrams.
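The tally itself is simple; a minimal sketch of the kind of counting involved (assuming the corpus is loaded as one lowercased string; the filename is a placeholder, not the actual path in the repo):

```python
from collections import Counter

# Placeholder filename; substitute the actual War and Peace text file.
with open("war_and_peace.txt", encoding="utf-8") as f:
    text = f.read().lower()

# Count single characters and adjacent character pairs.
unigrams = Counter(text)
bigrams = Counter(text[i:i + 2] for i in range(len(text) - 1))

print(unigrams.most_common(10))
print(bigrams.most_common(10))
```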

Source Code