I tried, and mostly failed, to replicate the results from Shannon’s paper using War and Peace as a stand-in for “the English language”.
I say mostly because I’m fairly sure I am measuring the entropy of different-length n-grams in English text; it turns out, though, that I’m calculating the entropy of n-grams treated as isolated symbols, while Shannon calculates conditional n-gram entropy (the entropy of the next symbol given the preceding n−1).
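A minimal sketch of the distinction (not my actual code, just an illustration): the isolated version takes the entropy of the n-gram distribution directly, while the conditional version subtracts the entropy of the (n−1)-gram distribution, following the chain rule H(Xₙ | X₁…Xₙ₋₁) = H(n-gram) − H((n−1)-gram). The function names here are my own.

```python
from collections import Counter
from math import log2

def ngram_entropy(text, n):
    """Entropy in bits of n-grams treated as isolated symbols."""
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values())

def conditional_entropy(text, n):
    """Approximate Shannon's F_n: entropy of the n-th symbol given
    the preceding n-1 symbols, via H(n-gram) - H((n-1)-gram)."""
    if n == 1:
        return ngram_entropy(text, 1)
    return ngram_entropy(text, n) - ngram_entropy(text, n - 1)
```

On a perfectly periodic string like `"ababab…"`, the isolated bigram entropy stays near 1 bit, but the conditional entropy collapses toward 0, since each symbol is fully determined by the one before it.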
Helpful SO answer
I did get interesting results for the top 10 unigrams and bigrams: