A blog about software and making.

Record Linkage Pipeline

I’ve been working away on a coding challenge to match records from two different data sources, and I think I’m close to being done. I haven’t done much work processing text documents, so it’s been a learning experience. Given more time I’d like to look into classifying listings based on their sentence structure, but the weather is getting nice enough that I’m going to have to put side projects on the back burner for a bit ☺.

GitHub Repo

Exploring Classifying Listings Using A Naive Bayes Classifier

I built a Naive Bayes classifier to label listings as either cameras or accessories. To generate the training set, I’m using a heuristic classifier I developed earlier and then going through the results manually.

Steps to run:
1) Create a preliminary training set using a heuristic-based classifier. Only a randomly selected subset of the listings should be used to build the training set. Since there are far more camera listings than accessory listings, it’s important to reflect that ratio in the training set.
2) Manually go through the training data to remove any incorrectly classified entries. Try to remove incorrect classifications in pairs to keep the ratio of camera listings to accessory listings the same.
3) Run the Bayes classifier using the training set you created. A listing is rejected as a camera if it contains more than 2 “accessory” words; otherwise, more than 90% of its words need to be “camera” words for it to be classified as a camera.
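
A rough sketch of the per-term scoring, assuming a standard multinomial Naive Bayes setup. The word counts here are made up for illustration, and the real classifier’s smoothing and thresholds may differ:

```python
import math
from collections import Counter

# Toy per-class word counts standing in for the hand-checked training set.
camera_counts = Counter({"camera": 50, "digital": 40, "zoom": 30, "mp": 25})
accessory_counts = Counter({"strap": 15, "case": 12, "for": 10, "camera": 5})

vocab = set(camera_counts) | set(accessory_counts)

def term_score(word):
    # Laplace-smoothed log-likelihood ratio for a single term; its sign
    # corresponds to the +/- prefixes shown in the listings below.
    p_camera = (camera_counts[word] + 1) / (sum(camera_counts.values()) + len(vocab))
    p_accessory = (accessory_counts[word] + 1) / (sum(accessory_counts.values()) + len(vocab))
    return math.log(p_camera / p_accessory)

def classify(title):
    # Sum the per-term contributions and pick the class by the sign.
    score = sum(term_score(w) for w in title.lower().split())
    return "Camera" if score > 0 else "Accessory"
```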

The +/- sign prepended to each term shows how it contributed to a listing being classified as a camera.

Positives:

Listing Classification
+fujifilm +finepix +z70 +12 +mp +digital +camera +with +5x +optical +zoom +and +2 +7 +inch +lcd +bronze Camera
+pentax +k +x +12 +4 +mp +digital +slr +with +2 +7 +inch +lcd +and +18 +55mm +f +3 +5 +5 +6 +al +and +50 +200mm +f +4 +5 +6 +ed +lenses +black Camera
+vivitar -vivicam vx029bl +10 +1mp +digital +camera +blue Camera
+sony dsct2b +digital +camera +black +8 +1mp +3x +optical +zoom +2 +7 +lcd +4gb internal +memory Camera
+agfaphoto +precisa 107 +digitalkamera +12 +megapixel +5 +fach +opt +zoom +6 +8 +cm +2 +7 +zoll +display +bildstabilisiert +schwarz Camera
lenco +dc 511 +digitalkamera +12 +megapixel +8 +fach +digital +zoom +6 +cm +2 +4 +zoll +tft +lcd +orange Camera

Negatives:

Listing Classification
-duragadget premium wrist +camera +carrying -strap +with +2 +year +warranty -for +panasonic +lumix -fh27 -fh25 -fp5 -fp7 -fh5 -fh2 -s3 +s1 Accessory
cushioned neo absorption +camera -strap -for +nikon +canon +pentax +panasonic +olympus +fujifilm +kodak +sony +and +more +digital +slr -cameras +card +reader +included Accessory
-sigmatek -ds 740 +7 +digital +photo -frame +black -ds -240 +2 +4 +mini +digital +photo -frame +2 -covers Accessory
biostek -ds -240 +2 +4 +mini +digital +photo -frame +2 -covers +cleaning -applicator -for +digital +photo -frames Accessory
-fototasche -kameratasche -typ hardbox hellblau -set +mit +4 +gb +sd -karte -für +samsung st60 es55 es60 +es65 es70 Accessory

False positives:

Listing Classification
kingston kingston valueram 512 mo ddr sdram pc3200 cas3 wet wipe dispenser +100 wipes dust removal spray +250 ml foam -cleaner -for screens +and keyboards +150 ml Camera
-duragadget +deluxe +mini flat folding +camera +camcorder +tripod stand -for +canon +ixus 1000hs +ixus 300hs +ixus +210 +ixus 200is +ixus 130 +ixus 120is Camera

Source Code

Exploring Finding Entity Aliases Using N-Gram Similarity

This is my second attempt at generating entity aliases from a set of product listings. In my first attempt I tried to use MinHash to group similar product listings, but it didn’t work out. This time, I’m going to be joining on n-grams of the manufacturer name and model number and then scoring the joins by the inverse term probability of the model n-grams.

I’m planning to use this in a pre-processing stage so that when I block by canonical manufacturer names I can include listings that have different variations of the canonical name.

For each listing, I’m attempting to find a similar canonical product record. The criteria I’m using for something being similar are:

  1. Similar manufacturer, using character n-gram similarity
  2. Similar model using shingles made from model parts

If I find enough instances where a listing is similar to a canonical product, I’ll consider the listing’s manufacturer name to be an alias for the canonical product’s manufacturer name.
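
The manufacturer comparison can be sketched roughly like this, using Jaccard similarity over character n-gram sets. The n-gram size and padding are my illustrative choices, not necessarily what the final code uses:

```python
def char_ngrams(text, n=3):
    # Character n-grams, padded with spaces so short strings still
    # produce boundary grams.
    padded = f" {text} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}

def ngram_similarity(a, b, n=3):
    # Jaccard similarity: shared grams over total distinct grams.
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a or not grams_b:
        return 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)
```

With this measure, “fuji” shares its leading trigrams with “fujifilm”, so the pair scores well above an unrelated name like “sony”.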

Here are some aliases I was able to find:

Canonical   Alias
fujifilm    fuji
kodak       eastman kodak company
canon       canon canada
fujifilm    fujifilm electronic imaging europe gmbh firstorder
panasonic   panasonic deutschland gmbh
sony        sony uk consumer electronics instock account
fujifilm    fuji photo film europe gmbh
fujifilm    fujifilm imaging systems
kodak       kodak stock account
fujifilm    fujifilm canada
canon       canon uk ltd
olympus     olympus canada

My main goal was to relate listings with a manufacturer of fuji to fujifilm, so it looks like this is going to work. I had to change the scoring method to use the inverse term probability because some of the model names have common numbers or words (e.g. “zoom”), which caused lots of false positives.
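
The inverse-term-probability weighting works like IDF: rare model tokens count for much more than common ones like “zoom”. A minimal sketch, with a made-up toy corpus and my own choice of weighting formula:

```python
import math
from collections import Counter

# Toy corpus of tokenized model strings (illustrative only).
listings = [
    ["finepix", "z70", "zoom"],
    ["finepix", "z90", "zoom"],
    ["powershot", "a495", "zoom"],
    ["coolpix", "s3000"],
]

token_counts = Counter(t for listing in listings for t in listing)
total = sum(token_counts.values())

def inverse_term_probability(token):
    # Rare tokens like "z70" get a high weight; frequent ones like
    # "zoom" contribute very little.
    return -math.log(token_counts[token] / total)

def match_score(tokens_a, tokens_b):
    # Score a candidate join by summing the weights of shared tokens.
    return sum(inverse_term_probability(t) for t in set(tokens_a) & set(tokens_b))
```

Under this scoring, a match on a distinctive token like “z70” outweighs a match on “zoom” alone, which is the behaviour that cut down the false positives.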

Messy Source Code

Text Mining with Twitter and R Meetup

A follow along workshop on using R to do data mining on tweets.

Notes

  • Data pipeline: Extract → Clean → Transform → Analyze
  • Clean
    • Remove: Unicode emoji, punctuation, links, stop words, numbers, dates, URLs, white space
    • Normalize to lowercase
    • Stemming - Truncate words to their radicals
      • Examples: cats → cat, ponies → poni
  • Analysis
    • Part of speech tagging - Tag each word with its part of speech in a sentence using definition and context.
      • Example: They refuse to permit → [pronoun] [verb] [to] [verb]
    • Word association - The dot product of two word columns in a term-document matrix counts their co-occurrences; normalizing it gives a correlation coefficient.
    • Clustering - Grouping similar tweets (docs) together
      • K-means - Centroidal model, each cluster is represented by a single mean vector.
        • Algorithm: create random clusters → assign points to nearest cluster centroid → recalculate cluster centroids to the average of assigned data points → repeat.
      • Hierarchical - Connectivity model
      • Use cosine similarity as the distance function. Cosine similarity normalizes for tweet (doc) length during the comparison.
    • Topic Mining - Use probability of terms to discover information from documents. Documents may belong to multiple topics.
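
The k-means loop from the notes can be sketched like this, using cosine distance as the comparison. This is a toy implementation I wrote to check my understanding, not code from the workshop:

```python
import math
import random

def cosine_distance(a, b):
    # 1 - cosine similarity; guards against zero-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - (dot / norm if norm else 0.0)

def kmeans(vectors, k, iterations=10, seed=0):
    rng = random.Random(seed)
    # Start from k randomly chosen points as the initial centroids.
    centroids = rng.sample(vectors, k)
    for _ in range(iterations):
        # Assign each vector to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for v in vectors:
            nearest = min(range(k), key=lambda i: cosine_distance(v, centroids[i]))
            clusters[nearest].append(v)
        # Recalculate each centroid as the mean of its assigned vectors.
        for i, cluster in enumerate(clusters):
            if cluster:
                centroids[i] = [sum(dim) / len(cluster) for dim in zip(*cluster)]
    return clusters
```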

Word association example:

Docs   word 1   word 2   word 3   word 4
1      0        0        0        0
2      1        0        0        0
3      1        1        0        0
4      1        1        1        1

Correlation(word 2, word 3) = DotProduct([0,0,1,1], [0,0,0,1]) = 1
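
Checking the word-association example in code: each word column records which docs contain that word, and the dot product of two columns counts the documents where both appear.

```python
# Term-document matrix from the table above: rows are docs 1-4,
# columns are words 1-4 (0-indexed in code).
matrix = [
    [0, 0, 0, 0],
    [1, 0, 0, 0],
    [1, 1, 0, 0],
    [1, 1, 1, 1],
]

def column(m, j):
    return [row[j] for row in m]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# word 2 appears in docs 3 and 4; word 3 only in doc 4, so they
# share exactly one document.
shared = dot(column(matrix, 1), column(matrix, 2))
```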

Meetup Event