This is my second attempt at generating entity aliases from a set of product listings. In my first attempt I tried to use MinHash to group similar product listings, but it didn’t work out. This time, I’m going to be joining on n-grams of the manufacturer name and model number and then scoring the joins by the inverse term probability of the model n-grams.
I’m planning to use this in a pre-processing stage so that when I block by canonical manufacturer names I can include listings that have different variations of the canonical name.
For each listing, I’m attempting to find a similar canonical product record. The criteria I’m using for something being similar are:
- Similar manufacturer using character n-gram similarity
- Similar model using shingles made from model parts
If I find enough instances where a listing is similar to a canonical product, I’ll consider the listing’s manufacturer name to be an alias for the canonical product’s manufacturer name.
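The matching and alias-counting steps described above could be sketched roughly as follows. This is a minimal illustration rather than the actual implementation: the function names, the record layout (`manufacturer` and `model` keys), the use of Jaccard similarity over padded character trigrams, the thresholds, and the sample records are all assumptions made for the example.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Return the set of character n-grams of a lowercased, space-padded string."""
    padded = f" {text.lower().strip()} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def model_shingles(model, k=2):
    """Split a model string into parts and return the set of k-part shingles."""
    parts = model.lower().replace("-", " ").split()
    if len(parts) < k:
        return {tuple(parts)} if parts else set()
    return {tuple(parts[i:i + k]) for i in range(len(parts) - k + 1)}


def find_aliases(listings, products, manu_threshold=0.2, min_matches=2):
    """Return (canonical, alias) pairs supported by at least min_matches listings.

    manu_threshold and min_matches are hypothetical values, not tuned ones.
    """
    votes = Counter()
    for listing in listings:
        l_manu = char_ngrams(listing["manufacturer"])
        l_model = model_shingles(listing["model"])
        for product in products:
            p_manu = char_ngrams(product["manufacturer"])
            if jaccard(l_manu, p_manu) < manu_threshold:
                continue  # manufacturer names are too different
            if l_model & model_shingles(product["model"]):
                pair = (product["manufacturer"].lower(),
                        listing["manufacturer"].lower())
                votes[pair] += 1
    return {pair for pair, count in votes.items()
            if count >= min_matches and pair[0] != pair[1]}


# Made-up records, for illustration only.
listings = [
    {"manufacturer": "Fuji", "model": "FinePix S1000"},
    {"manufacturer": "Fuji", "model": "FinePix Z30"},
    {"manufacturer": "Sony", "model": "FinePix S1000"},
]
products = [
    {"manufacturer": "Fujifilm", "model": "FinePix S1000"},
    {"manufacturer": "Fujifilm", "model": "FinePix Z30"},
]
aliases = find_aliases(listings, products)
```

In this sketch, a listing only votes for an alias when both criteria hold, and the alias is accepted only after enough votes accumulate, so a single mislabeled listing does not create an alias by itself.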
Here are some aliases I was able to find:
| Canonical | Alias |
| --- | --- |
| fujifilm | fuji |
| kodak | eastman kodak company |
| canon | canon canada |
| fujifilm | fujifilm electronic imaging europe gmbh firstorder |
| panasonic | panasonic deutschland gmbh |
| sony | sony uk consumer electronics instock account |
| fujifilm | fuji photo film europe gmbh |
| fujifilm | fujifilm imaging systems |
| kodak | kodak stock account |
| fujifilm | fujifilm canada |
| canon | canon uk ltd |
| olympus | olympus canada |
My main goal was to relate listings with a manufacturer of fuji to fujifilm, so it looks like this is going to work. I had to change the scoring method to use the inverse term probability because some of the model names contain common numbers or words (e.g., “zoom”), which caused lots of false positives.
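To show why weighting by inverse term probability suppresses matches on common words, here is a small sketch. The exact formula isn't spelled out here, so this assumes the simplest reading: P(term) is the fraction of model strings containing the term, and a match scores the sum of 1 / P(term) over shared terms. It also uses whole whitespace-separated terms instead of model n-grams to keep the example short. The function names and sample model strings are hypothetical.

```python
from collections import Counter


def term_probabilities(models):
    """Estimate P(term) as the fraction of model strings that contain the term."""
    doc_counts = Counter()
    for model in models:
        doc_counts.update(set(model.lower().split()))
    total = len(models)
    return {term: count / total for term, count in doc_counts.items()}


def match_score(listing_model, product_model, probs):
    """Sum 1 / P(term) over the terms shared by the two model strings."""
    shared = set(listing_model.lower().split()) & set(product_model.lower().split())
    return sum(1.0 / probs[term] for term in shared if term in probs)


# Made-up model strings, for illustration only.
models = [
    "finepix s1000 zoom",
    "powershot a480 zoom",
    "cybershot w220 zoom",
    "finepix z30",
]
probs = term_probabilities(models)

# Sharing only the common word "zoom" earns a low score...
common_only = match_score("coolpix zoom", "finepix s1000 zoom", probs)
# ...while sharing the rarer terms "finepix" and "s1000" earns a higher one.
rare_terms = match_score("finepix s1000", "finepix s1000 zoom", probs)
```

With a score threshold set between those two values, the pair that shares only “zoom” would be dropped while the pair that shares a specific model number would be kept.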
Messy Source Code