A blog about software and making.

Exploring Classifying Documents Using Distribution Of Term Probabilities

I’m looking for a way to classify product listings as either a product or a product accessory. My current idea is to classify listings based on how their term probabilities are distributed.

The idea is to find the probability of each term in a listing and use those probabilities to build a histogram that shows how the common and unique terms are distributed within the listing.
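As a rough sketch of the first step, term probabilities could be estimated as relative frequencies over the whole set of listings. That definition is an assumption (the probability could just as well be based on how many listings contain a term), and the `term_probabilities` helper and the sample listings are made up for illustration.

```python
from collections import Counter


def term_probabilities(listings):
    """Map each term to its share of all term occurrences in the corpus.

    Assumption: "probability" means relative frequency across all listings.
    """
    counts = Counter(term for listing in listings for term in listing.split())
    total = sum(counts.values())
    return {term: count / total for term, count in counts.items()}


# Made-up sample listings, already lowercased and stripped of punctuation.
sample_listings = [
    "canon digital camera with optical zoom",
    "digital camera with lcd",
    "battery for canon zs7 bp511a enel5",
]
probs = term_probabilities(sample_listings)

# Common terms get a higher probability than one-off model numbers.
print(probs["digital"], probs["zs7"])
```

With a real data set, the common terms (camera, digital, zoom) would end up several orders of magnitude more probable than the model numbers.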

Examples of common words: camera, digital, zoom, optical, with, lcd, megapixel, lens, canon, black, and, mp, digitalkamera, cm, 27

Examples of unique words: rings, eb575152vu, i9000, galaxys, 1080mah, funtionality, bp511a, zs7, enel5, 1100mah, mll3, 228825, np20, negative, scanner

I’m assuming that a typical product listing has one model number (a unique term) and a bunch of common terms, while an accessory listing usually has multiple model numbers. If this is true, it should be possible to classify a listing as either a product or a product accessory from the distribution of its term probabilities.

From what I’m seeing so far, it seems to be possible to classify accessory listings that have a high ratio of unique terms.
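A minimal sketch of that kind of rule, assuming a listing is flagged as an accessory when the share of unique terms passes a cutoff. The names `unique_term_ratio` and `looks_like_accessory`, the thresholds `unique_below` and `max_ratio`, and the toy probabilities are all hypothetical and would need tuning against real data.

```python
def unique_term_ratio(listing, term_probs, unique_below=0.001):
    """Fraction of a listing's terms whose probability is below the cutoff.

    Terms missing from term_probs are treated as probability 0 (unique).
    """
    terms = listing.split()
    if not terms:
        return 0.0
    unique = sum(1 for term in terms if term_probs.get(term, 0.0) < unique_below)
    return unique / len(terms)


def looks_like_accessory(listing, term_probs, unique_below=0.001, max_ratio=0.3):
    """Hypothetical rule: too many unique terms suggests an accessory."""
    return unique_term_ratio(listing, term_probs, unique_below) > max_ratio


# Toy probabilities for illustration only, not measured values.
toy_probs = {"canon": 0.02, "digital": 0.03, "camera": 0.04,
             "battery": 0.005, "for": 0.01}

product = "canon digital camera t3i"           # one unique term out of four
accessory = "battery for zs7 bp511a enel5"     # three unique terms out of five

print(looks_like_accessory(product, toy_probs))
print(looks_like_accessory(accessory, toy_probs))
```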

Note: I wanted to use a histogram but I couldn’t get the hexo google charts plugin to make a histogram with a custom scale 😕

Examples of product listings:

samsung sh100 142mp wifi digital camera with 5x optical zoom in silver 8gb accessory kit

canon eos rebel t3i 18 mp cmos digital slr camera and digic 4 imaging body only

Examples of accessory listings:

310 digital camera video mask now rated to 65 feet)

optekas extreme travelers essentials kit by opteka package inlcudes excursion series c900 fullsize waterproof canvas bag 6501300mm and 500mm telephoto lenses heavy duty tripod and monopod and much more for pentax k10d k20d k100d k110d k200d ist digital slr cameras

I had to use a nonlinear scale (powers of 2) for the histogram buckets; otherwise all the unique (low-probability) words ended up in the same bucket.
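One way to get power-of-2 buckets is to take the base-2 logarithm of each probability. This is only a sketch: the exact bucket boundaries used for the charts are not stated here, and `bucket_index` and `probability_histogram` are illustrative names.

```python
import math
from collections import Counter


def bucket_index(probability):
    """Return k such that 2**-(k + 1) < probability <= 2**-k."""
    return math.floor(-math.log2(probability))


def probability_histogram(term_probs):
    """Count how many probabilities fall into each power-of-2 bucket."""
    return Counter(bucket_index(p) for p in term_probs)


# Two fairly common terms and two rare ones (made-up values).
histogram = probability_histogram([0.125, 0.1, 0.001, 0.0011])

for k in sorted(histogram):
    print(f"(2^-{k + 1}, 2^-{k}]: {histogram[k]}")
```

With linear buckets, 0.001 and 0.0011 would share the lowest bucket with every other rare term; on the logarithmic scale the rare terms spread out over many buckets.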

There is a high rate of false positives (products being identified as accessories) when the listings are in languages other than English. Because there are so few non-English listings, every word in them is unique across all listings. It may be possible to detect the language of a listing by identifying listings with unusual character n-gram distributions, but I don’t know if there will be enough text per listing to do this reliably.

Example of a non-English listing:

jendigital jd 5200 z3 digitalkamera 50 2560 x 1920 32mb
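The character n-gram idea could be sketched along these lines: build a trigram model from listings assumed to be English, then score each listing by its average log-likelihood under that model. Everything here is an assumption for illustration (the function names, trigrams as the n-gram size, add-one smoothing, the tiny training set), and it says nothing about whether short listings carry enough text for this to work in practice.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Character n-grams of the text, padded with a space on each side."""
    padded = f" {text} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def train_ngram_model(listings, n=3):
    """Count character n-grams over listings assumed to be English."""
    return Counter(g for listing in listings for g in char_ngrams(listing, n))


def average_log_likelihood(listing, model, n=3):
    """Mean add-one-smoothed log-probability of the listing's n-grams."""
    grams = char_ngrams(listing, n)
    if not grams:
        return 0.0
    total = sum(model.values())
    vocab = len(model) + 1  # one extra slot for unseen n-grams
    return sum(math.log((model[g] + 1) / (total + vocab)) for g in grams) / len(grams)


# Tiny made-up training set; a real model would use the whole corpus.
model = train_ngram_model([
    "canon digital camera with optical zoom",
    "samsung digital camera with lcd",
])

english_score = average_log_likelihood("digital camera with zoom", model)
other_score = average_log_likelihood("jendigital digitalkamera", model)

# A lower score means the n-grams look less like the training listings.
print(english_score > other_score)
```

Listings scoring well below the rest could be set aside before classification, though where to put that cutoff would have to be found by experiment.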

The messy code