A blog about software and making.

Data Structures for Web Devs Presentation

I did a half-presentation, half-discussion on data structures for our web developer team.

Where I work, developers come from different backgrounds and not everyone has a computer science vocabulary. It was interesting trying to talk about data structures without cheating by using CS jargon. It was surprising how much I depend on having a shared specialized language when discussing algorithms and data structures. They weren’t joking when they said the easiest way to learn something is to explain it to someone else.

I only had 1.5 hours, so it was tricky fitting in everything I wanted to cover with enough time to discuss each slide. My goal was to persuade people to use ES6’s new Map() and Set() data structures instead of arrays when appropriate. We are also planning to use Immutable.js more with React, so I attempted to give a hand-wavy explanation of persistent data structures.
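
To give a flavour of the pitch, the comparison boils down to something like this (a minimal sketch, not a slide from the talk):

```js
// Membership checks against an array are O(n) per lookup.
const seenArray = [];
if (!seenArray.includes('user-42')) {
  seenArray.push('user-42');
}

// A Set does the same de-duplication with constant-time lookups.
const seenSet = new Set();
seenSet.add('user-42');
seenSet.has('user-42'); // true

// A Map accepts any value (including objects) as a key, keeps insertion
// order, and exposes its size directly, unlike a plain object lookup table.
const sessionsByUser = new Map();
sessionsByUser.set('user-42', { startedAt: Date.now() });
sessionsByUser.get('user-42'); // { startedAt: ... }
sessionsByUser.size; // 1
```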

I should have spent more time on the persistent data structure section. I think people got that we can check for changes in constant time (by reference equality), but I don’t think I did a great job of giving people a rough idea of how the internal bookkeeping works.
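
For the record, the intuition I was trying to convey is roughly path copying: an update copies only the nodes on the path to the change and shares everything else by reference. Here is a hand-rolled sketch (not how Immutable.js is actually implemented internally, which uses tries):

```js
// Return a new version of `node` with the value at `path` replaced.
// Only the objects along the path are copied; untouched subtrees are
// shared by reference with the previous version.
function setIn(node, [key, ...rest], value) {
  return {
    ...node,
    [key]: rest.length === 0 ? value : setIn(node[key], rest, value),
  };
}

const v1 = { user: { name: 'Ada' }, settings: { theme: 'dark' } };
const v2 = setIn(v1, ['user', 'name'], 'Grace');

v2 === v1;                   // false - something changed somewhere
v2.user === v1.user;         // false - this path was copied
v2.settings === v1.settings; // true  - untouched subtree is shared
```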

An infinite amount of time to fact check my slides would also have been nice :-).

Presentation

Card Sorting Meetup

The session centered on using the card sorting technique to develop intuitive layouts, for example a website navigation structure that is obvious to users. Card sorting is considered a brainstorming step, so it’s important that users focus on grouping into categories and not on designing the site.

Card sorting involves sorting cards that describe features and content into groups. The number of cards ranges from 30 cards for a general user to 60 cards for a domain expert. Users should aim for 5+ groupings.

In the first half, we did an open card sort in groups of 2-4 people. In an open card sort, users generate their own categories when grouping the cards. It’s a generative tool used to discover the categories and vocabulary that users use.

The discussions that occur while a group is doing the sort are a valuable source of information you won’t get from an individual card sort. Disagreements can be a good thing, allowing the best ideas to rise to the top, but you want to be careful to prevent an individual group member from dominating the conversation.

In the second half, we did a closed card sort as individuals using Optimal Workshop. Closed card sorts have pre-defined categories and are best for validating your design.

The presenter pointed out that prototypes are the best way of validating that users can find stuff. They use card sorting to help develop a prototype and then use the prototype to do their validation. As an example, they showed a series of paper screen mock-ups they asked users to navigate while performing different tasks.

Meetup Event

Exploring Classifying Documents Using Distribution Of Term Probabilities

I’m looking for a way to classify product listings as either a product or a product accessory. My current idea is to classify listings based on how their term probabilities are distributed.

The idea is to find the probability of each term in a listing and use them to build a histogram to get an idea of how the common and unique terms are distributed for a listing.

Examples of common words: camera, digital, zoom, optical, with, lcd, megapixel, lens, canon, black, and, mp, digitalkamera, cm, 27

Examples of unique words: rings, eb575152vu, i9000, galaxys, 1080mah, funtionality, bp511a, zs7, enel5, 1100mah, mll3, 228825, np20, negative, scanner

I’m assuming that a typical product listing generally has one model number (unique) and a bunch of common terms, while an accessory listing usually has multiple model numbers. If this is true, it should be possible to classify a listing as either a product or a product accessory from the distribution of term probabilities.

From what I’m seeing so far, it seems to be possible to classify accessory listings that have a high ratio of unique terms.
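
Stripped of the plumbing, what I’m running is roughly this (a sketch with made-up names and arbitrary thresholds, not the actual code linked below):

```js
// Corpus-wide term probabilities: count of each term divided by the
// total number of term occurrences across all listings.
function termProbabilities(listings) {
  const counts = new Map();
  let total = 0;
  for (const listing of listings) {
    for (const term of listing.split(/\s+/)) {
      counts.set(term, (counts.get(term) || 0) + 1);
      total += 1;
    }
  }
  return new Map([...counts].map(([term, count]) => [term, count / total]));
}

// Guess "accessory" when a large fraction of a listing's terms are rare
// across the corpus. Both cutoffs here are arbitrary placeholders.
function looksLikeAccessory(listing, probs, rareCutoff = 1e-4, ratioCutoff = 0.5) {
  const terms = listing.split(/\s+/);
  const rare = terms.filter((term) => (probs.get(term) || 0) < rareCutoff);
  return rare.length / terms.length > ratioCutoff;
}
```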

Note: I wanted to use a histogram but I couldn’t get the hexo google charts plugin to make a histogram with a custom scale 😕

Examples of product listings:

samsung sh100 142mp wifi digital camera with 5x optical zoom in silver 8gb accessory kit

canon eos rebel t3i 18 mp cmos digital slr camera and digic 4 imaging body only

Examples of accessory listings:

310 digital camera video mask now rated to 65 feet)

optekas extreme travelers essentials kit by opteka package inlcudes excursion series c900 fullsize waterproof canvas bag 6501300mm and 500mm telephoto lenses heavy duty tripod and monopod and much more for pentax k10d k20d k100d k110d k200d ist digital slr cameras

I had to use a nonlinear scale (I used powers of 2) for the histogram buckets or all the unique (small probability) words ended up in the same bucket.
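
The bucketing itself is just a log2 of each probability, something along these lines (a sketch of the idea rather than the plotting code):

```js
// Bucket term probabilities by floor(log2(p)) so that rare terms spread
// across several buckets instead of all piling into one linear bin.
function probabilityHistogram(probabilities) {
  const buckets = new Map();
  for (const p of probabilities) {
    const bucket = Math.floor(Math.log2(p)); // e.g. p = 0.01 -> bucket -7
    buckets.set(bucket, (buckets.get(bucket) || 0) + 1);
  }
  return buckets;
}
```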

There is a high rate of false positives (products being identified as accessories) when the listings are in languages other than English. With so few non-English listings, every word in these listings is unique across all listings. It may be possible to figure out which language a listing is in by identifying listings that have unusual character n-gram distributions, but I don’t know if there will be enough text per listing to do this reliably.

jendigital jd 5200 z3 digitalkamera 50 2560 x 1920 32mb
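
If I do try the language idea, the starting point would probably be character trigram counts per listing, compared against the corpus-wide distribution (an untested sketch):

```js
// Count overlapping character trigrams; listings whose trigram
// distribution diverges a lot from the corpus-wide distribution are
// candidates for being non-English.
function trigramCounts(text) {
  const counts = new Map();
  for (let i = 0; i + 3 <= text.length; i += 1) {
    const gram = text.slice(i, i + 3);
    counts.set(gram, (counts.get(gram) || 0) + 1);
  }
  return counts;
}
```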

The messy code

Designing Data-Intensive Applications Review

This book is still in pre-release, so I’ve only been able to go through the first nine chapters. It breaks a confusing field into logical pieces and then shows you how they fit together. I’d recommend it to anyone who is interested in databases or distributed systems.

  • Does an amazing job of explaining the algorithms and data structures used to implement systems and what problems they are best at solving.
  • Focuses on concepts and avoids the mistake of only talking about particular products or implementations.
  • Explains different guarantees and the tradeoffs made in providing them.
  • Lots of references and examples. The author has pulled together a lot of material for this book.

Chapter Notes

  1. Reliable, Scalable, and Maintainable Applications
    • Think of response time not as a single number but as a distribution of values you can measure.
    • The mean is not a good metric if you want to know your “typical” response time because it doesn’t tell you how many users actually experience that delay.
    • The architecture of large scale systems is usually highly specific to the application. There is no such thing as magic scaling sauce (a generic one-size-fits-all scalable architecture).
  2. The Battle of the Data Models
    • Many-to-one relationships - Prefer relational model
    • One-to-many (hierarchy) relationships - Prefer document model
    • Many-to-many relationships - Prefer graph model
    • The advantage of using an ID is that it has no reason to change even if the information it identifies changes.
    • Document databases don’t need joins to work with one-to-many tree structures, and their support for joins is often weak.
  3. Storage and Retrieval
    • Storage engines fall into two broad categories:
      • Transaction processing systems - Many queries each accessing a small number of records using some kind of key. Disk seek time is often the bottleneck.
      • Analytic systems - Few queries each accessing a large number of records. Disk bandwidth is often the bottleneck. Column databases work best for computing aggregated statistics.
    • Two major implementations of storage engines are:
      • log-structured - Only ever appends to files (Cassandra); a toy sketch of this approach appears after these chapter notes.
      • update-in-place - Treats the disk as fixed-size pages that can be overwritten (traditional RDBMS/B-trees).
    • The performance advantage of in-memory databases comes more from avoiding the overhead of encoding data into structures that can be written to disk than from avoiding the actual reads from disk.
    • It is easier to maintain sorted order in memory with Red-Black/AVL trees, which allow keys to be inserted in any order and read back in sorted order.
    • The star schema divides data into facts and dimensions. Each row in a fact table represents an event that occurred at a particular time. Each row in a dimension table represents the who, what, where, when, how, and why of the event. Visually, the fact table is in the middle, surrounded by its foreign keys to dimension tables like rays of a star.
    • The key idea of log-structured storage engines is to turn random-access writes into sequential writes on disk which are faster on HDD/SSD.
    • When queries involve sequentially scanning across a large number of rows, indexes are much less relevant. Instead, it becomes important to encode data compactly, to minimize the amount of data the query needs to read from disk.
  4. Encoding and Evolution
    • Data flow:
      • Databases - Different applications read/write to a shared database.
      • REST/RPC - Clients connect to a server that exposes an API (a service). The server may talk to other servers to handle a client request (microservices).
      • Message Passing - Asynchronous message passing systems (Message brokers or distributed actors) where nodes send each other messages.
    • During rolling upgrades different nodes will be running different versions of code so we need the data flowing in the system to have both backward compatibility and forward compatibility.
  5. Replication
    • Reasons to replicate:
      • Keep working even when machines go down (availability)
      • Continue to work during network partitions (availability)
      • Place data geographically close to users (latency)
      • Handle higher volume of writes than possible with one machine (scalability).
    • Replication types:
      • Single-leader - Clients send all writes to a single leader node, which streams changes to followers. Can read from any node, but followers may be stale.
      • Multi-leader - Clients send each write to one of several leader nodes. Leaders send streams of changes to each other and followers. Need to handle write concurrency conflicts.
      • Leaderless replication - Clients read and write to/from several nodes in parallel. Need to handle write concurrency conflicts.
    • Consistency models:
      • Read-after-write - User should see the latest copy of any data they submitted themselves.
      • Monotonic reads - After user sees data from one point in time they should never see data from an earlier point in time.
      • Consistent prefix reads - Users see data in causal order. Should not see the answer before the question.
  6. Partitioning
    • Partitioning Strategies:
      • Key Range - assign a contiguous range of keys to each partition. Allows efficient range queries but increases the risk of hot spots.
      • Hash - assign based on the hash of the key. Distributes skewed data uniformly, preventing hot spots, but makes range queries inefficient.
      • Compound Key - Hybrid approach. The first part of the key is hashed to determine the partition and the second part is used to sort the data within the partition. Allows efficient range queries on the columns in the second part of the key.
  7. Transactions
    • An abstraction layer that allows applications to pretend certain types of concurrency problems and faults don’t exist.
  8. The Trouble with Distributed Systems
    • Typical partial failures include:
      • Both the sent packet and its reply can be lost or delayed.
      • Node clocks being out of sync with each other.
      • Clocks jumping forward or back in time due to NTP.
      • Processes pausing for long periods and not realizing they have been paused when resumed.
    • To tolerate faults you first need to detect them, and the only tool we have is timeouts, which tell us nothing about the nature of the fault.
  9. Consistency and Consensus
    • Shows how the CAP theorem is misleading when applied to real-world systems. The author also has a Blog Post. Before this, I thought that quorum-style systems with a majority were consistent, but it turns out I didn’t even understand what the word “consistent” means.
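
To make the log-structured idea from the chapter 3 notes concrete, here is a toy append-only key-value store in the spirit of the book’s Bitcask example; a sketch, not anything production-ready. Writes only ever append to the log file, and an in-memory hash index maps each key to the byte offset of its latest record.

```js
const fs = require('fs');

class LogStore {
  constructor(path) {
    this.path = path;
    this.index = new Map(); // key -> byte offset of the latest record
    this.offset = 0;
    fs.writeFileSync(path, ''); // start with an empty log
  }

  set(key, value) {
    const record = JSON.stringify({ key, value }) + '\n';
    fs.appendFileSync(this.path, record); // sequential write only
    this.index.set(key, this.offset);
    this.offset += Buffer.byteLength(record);
  }

  get(key) {
    if (!this.index.has(key)) return undefined;
    // Seek straight to the recorded offset; assumes a record fits in 4 KB.
    const fd = fs.openSync(this.path, 'r');
    const buf = Buffer.alloc(4096);
    const bytes = fs.readSync(fd, buf, 0, buf.length, this.index.get(key));
    fs.closeSync(fd);
    const line = buf.slice(0, bytes).toString().split('\n')[0];
    return JSON.parse(line).value;
  }
}

const store = new LogStore('toy.log');
store.set('user:1', { name: 'Ada' });
store.set('user:1', { name: 'Grace' }); // newer record appended, index repointed
store.get('user:1'); // { name: 'Grace' }
```

Old versions of a key pile up in the log; real engines reclaim that space with compaction, and the SSTable/LSM-tree variants keep log segments sorted so merging and range scans stay cheap.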

Goodreads