A blog about software and making.

Exploring Probabilistic Data Structures

The Bloom filter and Count-Min Sketch data structures keep cropping up while I’m reading, so I figured I would try to implement them. They are both part of a family of probabilistic data structures that give up accuracy in order to store a lot of information in a tiny amount of space.

Bloom Filter - Could I have seen this before?
Count–Min Sketch - What’s the maximum number of times I could have seen this?
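
The Bloom filter idea fits in a few lines of Python. This is a toy sketch, separate from the linked implementation; the SHA-256-based index derivation and the default sizes are illustrative choices, not a tuned design:

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: k hash functions each set one bit in a fixed-size bit array."""

    def __init__(self, size=1024, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = [False] * size

    def _indexes(self, item):
        # Derive k indexes by salting one digest with the hash number (illustrative).
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for idx in self._indexes(item):
            self.bits[idx] = True

    def might_contain(self, item):
        # False means the item was definitely never added;
        # True only means "maybe" — false positives are possible.
        return all(self.bits[idx] for idx in self._indexes(item))
```

The one-way error is the whole trick: a `False` answer is guaranteed correct, while a `True` answer can be a false positive when unrelated items happen to have set the same bits.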

It’s also worth looking at:
HyperLogLog - Approximately how many unique elements have I seen?
Great post on probabilistic data structures

Bloom Filter code
Count-Min Sketch code
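
The Count-Min Sketch can be sketched the same way. Again, this is a toy version rather than the linked code; the hashing scheme and dimensions are illustrative assumptions:

```python
import hashlib

class CountMinSketch:
    """Toy Count-Min Sketch: d rows of w counters; estimates never undercount."""

    def __init__(self, width=256, depth=4):
        self.width = width
        self.depth = depth
        self.rows = [[0] * width for _ in range(depth)]

    def _index(self, row, item):
        # One salted digest per row (illustrative, not tuned).
        digest = hashlib.sha256(f"{row}:{item}".encode()).hexdigest()
        return int(digest, 16) % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.rows[row][self._index(row, item)] += count

    def estimate(self, item):
        # Take the minimum over rows: hash collisions can only inflate a
        # counter, never deflate it, so the smallest counter is the best
        # upper bound on the true count.
        return min(self.rows[row][self._index(row, item)]
                   for row in range(self.depth))
```

This is why the question it answers is "what’s the *maximum* number of times I could have seen this?" — the estimate is always greater than or equal to the true count.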

Working Out Bayes' Theorem

This is mostly an excuse to play around with the MathJax library.

$Num(A) = 50$
$Num(B) = 15$
$Num(A∩B) = 5$
$Num(A∪B) = 65$

Get probabilities from counts
$P(A) = \frac{50}{65} \approx 0.769$
$P(B) = \frac{15}{65} \approx 0.231$
$P(A∩B) = \frac{5}{65} \approx 0.0769$

Get conditional probabilities from counts
$P(A|B) = \frac{Num(A∩B)}{Num(B)} = \frac{5}{15} \approx 0.333$
$P(B|A) = \frac{Num(A∩B)}{Num(A)} = \frac{5}{50} = 0.1$

Use the above to get counts from probabilities
$Num(A) = P(A) \times Num(A∪B) = 0.769 \times 65 \approx 50$
$Num(B) = P(B) \times Num(A∪B) = 0.231 \times 65 \approx 15$
$Num(A∩B) = P(A|B) \times Num(B) = 0.333 \times 15 \approx 5$
$Num(A∩B) = P(B|A) \times Num(A) = 0.1 \times 50 = 5$

Turn counts into probabilities to get the conditional probability from other probabilities
$P(A|B) = \frac{Num(A∩B)}{Num(B)}$
$P(A|B) = \frac{P(B|A) \times Num(A)}{P(B) \times Num(A∪B)}$
$P(A|B) = \frac{P(B|A) \times P(A) \times Num(A∪B)}{P(B) \times Num(A∪B)}$

Bayes’ Theorem: $P(A|B) = \frac{P(B|A) \times P(A)}{P(B)}$
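
The arithmetic above can be double-checked with exact fractions. A quick sanity script (not part of the worked example itself):

```python
from fractions import Fraction

# Counts from the worked example above
num_a, num_b, num_ab, num_union = 50, 15, 5, 65

p_a = Fraction(num_a, num_union)       # P(A)  = 50/65
p_b = Fraction(num_b, num_union)       # P(B)  = 15/65
p_b_given_a = Fraction(num_ab, num_a)  # P(B|A) = 5/50

# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
p_a_given_b = p_b_given_a * p_a / p_b

print(p_a_given_b)  # 1/3 — matches the direct count Num(A∩B)/Num(B) = 5/15
```

Using `Fraction` avoids the rounding in the decimal approximations, so the two routes to $P(A|B)$ agree exactly.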

Heterogeneous Parallel Programming

  • About two weeks in, I realized I had bitten off more than I could chew. I passed (with distinction, even), but I burned way more hours on this than I had budgeted for. The course says 6–8 hours/week, but I was easily doubling that. I really should have brushed up on my C and explored CUDA on my own first.
  • Changed how I look at processors and memory and I’m not sure that I’m better off after seeing how the sausage is made. It’s amazing computers work half as well as they do.

Certificate
Course Link

Odds & Ends - March 2014

Thoughts, terms, and ideas I’ve come across over the last few months.

  • 5 questions to ask when creating tests:
    • What if null?
    • What if zero?
    • What if one item?
    • What if three items? (Bugs often show up on middle rows not the first/last)
    • What if many items?
  • Don’t add cross-cutting concerns via inheritance. If you extend a class to add support for logging/caching you are adding a new responsibility (violating SRP).
  • Update performance suffers from denormalization. When you update a duplicated value on a 1NF table, you will have to update every row that has that value instead of just one row in 3NF. Solutions:
    • Don’t update - It’s not a problem if you are only inserting into an OLAP database.
    • Normalize your schema (3NF) - Why are you updating a denormalized data warehouse style database?
    • Accept inconsistent data - Update a few rows at a time in a commit.
    • Accept locking penalty - Page/row locks will have to be taken out for the rows being updated, potentially blocking other sessions.
  • Normalized Schema
    • Less storage used
    • Affect fewer records when updating
  • Denormalized Schema
    • Possibly better at PK lookups as you may only need to hit one table.
    • Suffers from concurrency and deadlock issues if updating.
  • Distributed Consensus Algorithm - Agreement on a single value by multiple nodes.
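
The denormalization update penalty above is easy to demonstrate with an in-memory SQLite database. The table names and data here are hypothetical, just to show the row counts:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Denormalized: the customer's city is duplicated onto every order row.
cur.execute("CREATE TABLE orders_denorm (id INTEGER, customer TEXT, city TEXT)")
cur.executemany(
    "INSERT INTO orders_denorm VALUES (?, ?, ?)",
    [(1, "acme", "Boston"), (2, "acme", "Boston"), (3, "acme", "Boston")],
)

# Changing the duplicated value has to touch every matching row.
cur.execute("UPDATE orders_denorm SET city = 'Austin' WHERE customer = 'acme'")
denorm_rows = cur.rowcount

# Normalized (3NF): the city lives in exactly one place.
cur.execute("CREATE TABLE customers (name TEXT PRIMARY KEY, city TEXT)")
cur.execute("INSERT INTO customers VALUES ('acme', 'Boston')")
cur.execute("UPDATE customers SET city = 'Austin' WHERE name = 'acme'")
norm_rows = cur.rowcount

print(denorm_rows, norm_rows)  # 3 1
```

Three rows locked and rewritten versus one — and the gap grows with every order the customer places.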