
Efficient Analysis with SQL

The main part of the meetup was a presentation on tally tables, cross apply and normalization in SQL.

  • Tally tables are used to replace while loops and fill in missing data (a rough sketch of the idea follows this list). Examples are shift tables, fiscal day tables, and holiday tables.
  • When creating tally tables, set the fill factor to 100%, as you won’t be adding rows to or removing rows from them.
  • Examples of using cross apply to pivot and unpivot (fold) data.
  • Activity-based costing (one row with task, material, labor, and burden columns) is denormalized, whereas a debit-credit view is normalized (separate material, labor, and burden rows).
  • Interesting idea that normalization is about ‘fidelity’ and accurately modeling the real world. I’ve always felt normalization was more a way of storing data in a general, application-agnostic way so it doesn’t need to change as the application evolves.
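
The talk’s examples were in SQL; here’s a minimal Python sketch of the tally-table idea, using a hypothetical sparse sales series, where joining against a pre-generated date range fills in the missing days instead of looping:

```python
from datetime import date, timedelta

# Hypothetical sparse daily sales: days with no activity have no entry at all.
sales = {date(2016, 3, 1): 120.0, date(2016, 3, 4): 80.0}

# The "tally": every date in the reporting range, generated up front
# instead of advancing a cursor in a while loop.
start, end = date(2016, 3, 1), date(2016, 3, 7)
tally = [start + timedelta(days=i) for i in range((end - start).days + 1)]

# Left-join the sparse data onto the tally so missing days show up as zero.
for day in tally:
    print(day, sales.get(day, 0.0))
```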

This was followed by a short talk on scaling.

  • Scaling means multiplying all the weights by the same factor, so the ratios between them remain unchanged.
  • Scaling weights helps pull them towards the unit circle: if all the points are large values they get reduced, and if they are all small they get increased. This prevents numerical stability problems.
  • If one axis has a small scale and the other a large one, clustering algorithms can accidentally merge clusters along the small-scale axis.
  • Other benefits of scaling are that gradient descent converges faster and learning rates can be higher; it’s easier to find the minimum of a circular error cone than an elongated one.
  • When scaling, the mean and variance can be approximated using a random sample of the data points (see the sketch after this list).
  • Interesting idea on finding anomalous events by treating them like n-grams and finding improbable chains of events. For example, if event C has a high probability given events A and B, then its absence could be considered anomalous.
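
A minimal Python sketch of the scaling points above, using made-up data where one axis spans a range a thousand times larger than the other and the mean and standard deviation are estimated from a random sample:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: one feature spans 0-1, the other 0-1000, so unscaled
# distances (and therefore clustering) are dominated by the second axis.
points = np.column_stack([rng.uniform(0, 1, 10_000),
                          rng.uniform(0, 1_000, 10_000)])

# Approximate the mean and variance from a random sample of the points.
sample = points[rng.choice(len(points), size=500, replace=False)]
mean, std = sample.mean(axis=0), sample.std(axis=0)

scaled = (points - mean) / std  # both axes now contribute comparably to distances

print("std before scaling:", points.std(axis=0))
print("std after scaling: ", scaled.std(axis=0))
```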

Meetup Event

Exploring Shannon's Prediction And Entropy Of Printed English Paper

Tried and mostly failed to replicate the results from Shannon’s paper, using War and Peace as a stand-in for “The English Language”.

I say mostly because, while I’m pretty sure I am measuring the entropy of different-length n-grams in the English language, it turns out I’m calculating isolated-symbol entropy whereas Shannon is calculating conditional n-gram entropy.
Helpful SO answer
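
A rough Python sketch of the difference (not the linked source code), using character n-grams on a toy string: the isolated n-gram entropy is H(X1..Xn), while Shannon’s F_n is the conditional entropy H(X1..Xn) - H(X1..Xn-1).

```python
from collections import Counter
from math import log2

def entropy(counts):
    total = sum(counts.values())
    return -sum(c / total * log2(c / total) for c in counts.values())

def ngram_entropies(text, n):
    """Isolated n-gram entropy vs Shannon's conditional entropy F_n."""
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    prefixes = Counter(text[i:i + n - 1] for i in range(len(text) - n + 2))
    isolated = entropy(grams)                    # H(X1..Xn)
    conditional = isolated - entropy(prefixes)   # F_n = H(X1..Xn) - H(X1..Xn-1)
    return isolated, conditional

# Toy stand-in for the novel's text; with real text the conditional entropy
# drops well below the isolated bigram entropy.
sample = "the quick brown fox jumps over the lazy dog " * 50
print(ngram_entropies(sample, n=2))
```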

I did get interesting results for the top 10 unigrams and bigrams:

Source Code

Left Pad Liberation

It looks like npm removed a module to avoid possible brand infringement, leading to the author removing all of his modules from npm. Among the modules removed was a widely used one called ‘left-pad’, which triggered a cascade of build failures in some of the most popular projects on npm.

There was an interesting point made that this may be a consequence of having a global namespace for package names instead of an origin.package style namespace. The argument being that it would make it more obvious that Azer.kik didn’t come from Kik (the company), removing possible confusion and weakening any brand-infringement claims.

Author’s Post
Register UK