A blog about software and making.

Record Linkage Pipeline

I’ve been working away on a coding challenge to match records from two different data sources, and I think I’m close to being done. I haven’t done much work processing text documents, so it’s been a learning experience. Given more time I’d like to look into classifying listings based on their sentence structure, but the weather is getting nice enough that I’m going to have to put side projects on the back burner for a bit ☺.

GitHub Repo

Exploring Classifying Listings Using A Naive Bayes Classifier

I built a Naive Bayes classifier to label listings as either cameras or accessories. To generate the training set, I’m using a heuristic classifier I developed earlier and then going through the results manually.

Steps to run:
1) Create a preliminary training set using a heuristic-based classifier. Only a randomly selected subset of the listings should be used to build the training set. Since there are far more camera listings than accessory listings, it’s important to reflect that ratio in the training set.
2) Manually go through the training data to remove any incorrectly classified entries. Try to remove incorrect classifications in pairs to keep the ratio of camera listings to accessory listings the same.
3) Run the Bayes classifier using the training set you created. A listing is rejected as a camera if it contains more than 2 “accessory” words; otherwise, more than 90% of its words need to be “camera” words for it to be classified as a camera.
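
A rough sketch of the per-term scoring, assuming a standard multinomial Naive Bayes setup. The word counts here are made up for illustration, and the real classifier’s smoothing and thresholds may differ:

```python
import math
from collections import Counter

# Toy per-class word counts standing in for the hand-checked training set.
camera_counts = Counter({"camera": 50, "digital": 40, "zoom": 30, "mp": 25})
accessory_counts = Counter({"strap": 15, "case": 12, "for": 10, "camera": 5})

vocab = set(camera_counts) | set(accessory_counts)

def term_score(word):
    # Laplace-smoothed log-likelihood ratio for a single term; its sign
    # corresponds to the +/- prefixes shown in the listings below.
    p_camera = (camera_counts[word] + 1) / (sum(camera_counts.values()) + len(vocab))
    p_accessory = (accessory_counts[word] + 1) / (sum(accessory_counts.values()) + len(vocab))
    return math.log(p_camera / p_accessory)

def classify(title):
    # Sum the per-term contributions and pick the class by the sign.
    score = sum(term_score(w) for w in title.lower().split())
    return "Camera" if score > 0 else "Accessory"
```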

The +/- sign prepended to each term shows how it contributed to a listing being classified as a camera.

Positives:

Listing Classification
+fujifilm +finepix +z70 +12 +mp +digital +camera +with +5x +optical +zoom +and +2 +7 +inch +lcd +bronze Camera
+pentax +k +x +12 +4 +mp +digital +slr +with +2 +7 +inch +lcd +and +18 +55mm +f +3 +5 +5 +6 +al +and +50 +200mm +f +4 +5 +6 +ed +lenses +black Camera
+vivitar -vivicam vx029bl +10 +1mp +digital +camera +blue Camera
+sony dsct2b +digital +camera +black +8 +1mp +3x +optical +zoom +2 +7 +lcd +4gb internal +memory Camera
+agfaphoto +precisa 107 +digitalkamera +12 +megapixel +5 +fach +opt +zoom +6 +8 +cm +2 +7 +zoll +display +bildstabilisiert +schwarz Camera
lenco +dc 511 +digitalkamera +12 +megapixel +8 +fach +digital +zoom +6 +cm +2 +4 +zoll +tft +lcd +orange Camera

Negatives:

Listing Classification
-duragadget premium wrist +camera +carrying -strap +with +2 +year +warranty -for +panasonic +lumix -fh27 -fh25 -fp5 -fp7 -fh5 -fh2 -s3 +s1 Accessory
cushioned neo absorption +camera -strap -for +nikon +canon +pentax +panasonic +olympus +fujifilm +kodak +sony +and +more +digital +slr -cameras +card +reader +included Accessory
-sigmatek -ds 740 +7 +digital +photo -frame +black -ds -240 +2 +4 +mini +digital +photo -frame +2 -covers Accessory
biostek -ds -240 +2 +4 +mini +digital +photo -frame +2 -covers +cleaning -applicator -for +digital +photo -frames Accessory
-fototasche -kameratasche -typ hardbox hellblau -set +mit +4 +gb +sd -karte -für +samsung st60 es55 es60 +es65 es70 Accessory

False positives:

Listing Classification
kingston kingston valueram 512 mo ddr sdram pc3200 cas3 wet wipe dispenser +100 wipes dust removal spray +250 ml foam -cleaner -for screens +and keyboards +150 ml Camera
-duragadget +deluxe +mini flat folding +camera +camcorder +tripod stand -for +canon +ixus 1000hs +ixus 300hs +ixus +210 +ixus 200is +ixus 130 +ixus 120is Camera

Source Code

Exploring Finding Entity Aliases Using N-Gram Similarity

This is my second attempt at generating entity aliases from a set of product listings. In my first attempt I tried to use MinHash to group similar product listings, but it didn’t work out. This time, I’m going to be joining on n-grams of the manufacturer name and model number and then scoring the joins by the inverse term probability of the model n-grams.

I’m planning to use this in a pre-processing stage so that when I block by canonical manufacturer names I can include listings that have different variations of the canonical name.

For each listing, I’m attempting to find a similar canonical product record. The criteria I’m using for something being similar are:

  1. Similar manufacturer, using character n-gram similarity
  2. Similar model using shingles made from model parts

If I find enough instances where a listing is similar to a canonical product, I’ll consider the listing’s manufacturer name to be an alias for the canonical product’s manufacturer name.
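
The manufacturer comparison can be sketched roughly like this, using Jaccard similarity over character n-gram sets. The n-gram size and padding are my illustrative choices, not necessarily what the final code uses:

```python
def char_ngrams(text, n=3):
    # Character n-grams, padded with spaces so short strings still
    # produce boundary grams.
    padded = f" {text} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}

def ngram_similarity(a, b, n=3):
    # Jaccard similarity: shared grams over total distinct grams.
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a or not grams_b:
        return 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)
```

With this measure, “fuji” shares its leading trigrams with “fujifilm”, so the pair scores well above an unrelated name like “sony”.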

Here are some aliases I was able to find:

Canonical   Alias
fujifilm    fuji
kodak       eastman kodak company
canon       canon canada
fujifilm    fujifilm electronic imaging europe gmbh firstorder
panasonic   panasonic deutschland gmbh
sony        sony uk consumer electronics instock account
fujifilm    fuji photo film europe gmbh
fujifilm    fujifilm imaging systems
kodak       kodak stock account
fujifilm    fujifilm canada
canon       canon uk ltd
olympus     olympus canada

My main goal was to relate listings with a manufacturer of fuji to fujifilm, so it looks like this is going to work. I had to change the scoring method to use the inverse term probability because some of the model names have common numbers or words (e.g. “zoom”), which caused lots of false positives.
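
The inverse-term-probability weighting works like IDF: rare model tokens count for much more than common ones like “zoom”. A minimal sketch, with a made-up toy corpus and my own choice of weighting formula:

```python
import math
from collections import Counter

# Toy corpus of tokenized model strings (illustrative only).
listings = [
    ["finepix", "z70", "zoom"],
    ["finepix", "z90", "zoom"],
    ["powershot", "a495", "zoom"],
    ["coolpix", "s3000"],
]

token_counts = Counter(t for listing in listings for t in listing)
total = sum(token_counts.values())

def inverse_term_probability(token):
    # Rare tokens like "z70" get a high weight; frequent ones like
    # "zoom" contribute very little.
    return -math.log(token_counts[token] / total)

def match_score(tokens_a, tokens_b):
    # Score a candidate join by summing the weights of shared tokens.
    return sum(inverse_term_probability(t) for t in set(tokens_a) & set(tokens_b))
```

Under this scoring, a match on a distinctive token like “z70” outweighs a match on “zoom” alone, which is the behaviour that cut down the false positives.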

Messy Source Code

Text Mining with Twitter and R Meetup

A follow along workshop on using R to do data mining on tweets.

Notes

  • Data pipeline: Extract → Clean → Transform → Analyze
  • Clean
    • Remove: Unicode emoji, punctuation, links, stop words, numbers, dates, URLs, white space
    • Normalize to lowercase
    • Stemming - Truncate words to their radicals
      • Examples: cats → cat, ponies → poni
  • Analysis
    • Part of speech tagging - Tag each word with its part of speech in a sentence using definition and context.
      • Example: They refuse to permit → [pronoun] [verb] [to] [verb]
    • Word association - The dot product of two word columns in a term-document matrix counts their co-occurrences; normalizing it gives a correlation coefficient.
    • Clustering - Grouping similar tweets (docs) together
      • K-means - Centroidal model, each cluster is represented by a single mean vector.
        • Algorithm: create random clusters → assign points to nearest cluster centroid → recalculate cluster centroids to the average of assigned data points → repeat.
      • Hierarchical - Connectivity model
      • Use cosine similarity as the distance function. Cosine similarity normalizes for tweet (doc) length during the comparison.
    • Topic Mining - Use probability of terms to discover information from documents. Documents may belong to multiple topics.
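
The k-means loop from the notes can be sketched like this, using cosine distance as the comparison. This is a toy implementation I wrote to check my understanding, not code from the workshop:

```python
import math
import random

def cosine_distance(a, b):
    # 1 - cosine similarity; guards against zero-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - (dot / norm if norm else 0.0)

def kmeans(vectors, k, iterations=10, seed=0):
    rng = random.Random(seed)
    # Start from k randomly chosen points as the initial centroids.
    centroids = rng.sample(vectors, k)
    for _ in range(iterations):
        # Assign each vector to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for v in vectors:
            nearest = min(range(k), key=lambda i: cosine_distance(v, centroids[i]))
            clusters[nearest].append(v)
        # Recalculate each centroid as the mean of its assigned vectors.
        for i, cluster in enumerate(clusters):
            if cluster:
                centroids[i] = [sum(dim) / len(cluster) for dim in zip(*cluster)]
    return clusters
```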

Word association example:

Docs   word 1   word 2   word 3   word 4
1      0        0        0        0
2      1        0        0        0
3      1        1        0        0
4      1        1        1        1

Correlation(word 2, word 3) = DotProduct([0,0,1,1], [0,0,0,1]) = 1
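
Checking the word-association example in code: each word column records which docs contain that word, and the dot product of two columns counts the documents where both appear.

```python
# Term-document matrix from the table above: rows are docs 1-4,
# columns are words 1-4 (0-indexed in code).
matrix = [
    [0, 0, 0, 0],
    [1, 0, 0, 0],
    [1, 1, 0, 0],
    [1, 1, 1, 1],
]

def column(m, j):
    return [row[j] for row in m]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# word 2 appears in docs 3 and 4; word 3 only in doc 4, so they
# share exactly one document.
shared = dot(column(matrix, 1), column(matrix, 2))
```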

Meetup Event