This is my second attempt at generating entity aliases from a set of product listings. In my first attempt I tried to use MinHash to group similar product listings, but it didn’t work out. This time, I’m joining on n-grams of the manufacturer name and model number, then scoring the joins by the inverse term probability of the model n-grams.
I’m planning to use this in a pre-processing stage so that when I block by canonical manufacturer names I can include listings that have different variations of the canonical name.
For each listing, I’m attempting to find a similar canonical product record. The criteria I’m using for similarity are:
- Similar manufacturer using character n-gram similarity
- Similar model using shingles made from model parts
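
As a rough sketch of those two criteria (not my exact code): the function names, the trigram size, the padding, the choice of Jaccard as the similarity measure, and the regex used to split a model into parts are all assumptions made for illustration.

```python
import re

def char_ngrams(text, n=3):
    """Set of character n-grams of a lowercased, space-padded string."""
    padded = f" {text.lower().strip()} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}

def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def model_shingles(model, k=2):
    """Split a model string into letter and digit runs, then return
    each part plus every run of k adjacent parts joined together."""
    parts = re.findall(r"[a-z]+|[0-9]+", model.lower())
    shingles = set(parts)
    shingles.update("".join(parts[i:i + k]) for i in range(len(parts) - k + 1))
    return shingles

# Manufacturer similarity between a listing name and a canonical name.
similarity = jaccard(char_ngrams("fuji"), char_ngrams("fujifilm"))

# Model shingles for a hypothetical model string.
shingles = model_shingles("FinePix S1000fd")
```

With these settings, "fuji" and "fujifilm" share three of nine distinct trigrams, so a fairly low manufacturer threshold is needed for that pair to be considered at all.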
If I find enough instances where a listing is similar to a canonical product, I’ll consider the listing’s manufacturer name to be an alias for the canonical product’s manufacturer name.
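
The "enough instances" rule could look something like the sketch below. It assumes the matching step has already produced one `(canonical manufacturer, listing manufacturer)` pair per similar listing; `find_aliases` and the `min_count` threshold of 3 are hypothetical, not the values I actually used.

```python
from collections import Counter

def find_aliases(matches, min_count=3):
    """Return (canonical, alias) pairs seen at least min_count times.

    matches: iterable of (canonical_manufacturer, listing_manufacturer)
    pairs, one per listing judged similar to a canonical product.
    """
    counts = Counter(
        (canonical.lower().strip(), listed.lower().strip())
        for canonical, listed in matches
    )
    return sorted(
        (canonical, alias)
        for (canonical, alias), n in counts.items()
        # Skip pairs where the listing already uses the canonical name.
        if n >= min_count and canonical != alias
    )

# Hypothetical match data for illustration only.
example_matches = (
    [("fujifilm", "fuji")] * 3
    + [("kodak", "kodak")] * 5
    + [("canon", "canon canada")] * 2
)
aliases = find_aliases(example_matches)
```

Here only the fujifilm/fuji pair clears the threshold; the canon pair has too few supporting listings and the kodak pair is not an alias at all.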
Here are some aliases I was able to find:
Canonical | Alias |
---|---|
fujifilm | fuji |
kodak | eastman kodak company |
canon | canon canada |
fujifilm | fujifilm electronic imaging europe gmbh firstorder |
panasonic | panasonic deutschland gmbh |
sony | sony uk consumer electronics instock account |
fujifilm | fuji photo film europe gmbh |
fujifilm | fujifilm imaging systems |
kodak | kodak stock account |
fujifilm | fujifilm canada |
canon | canon uk ltd |
olympus | olympus canada |
My main goal was to relate listings with a manufacturer of fuji to fujifilm, so it looks like this is going to work. I had to change the scoring method to use the inverse term probability because some of the model names contain common numbers or words (e.g. “zoom”), which caused lots of false positives.
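
To show why this helps, here is a minimal sketch of one way to score by inverse term probability. It assumes a term’s probability is the fraction of canonical models containing it and that a join’s score is the sum of 1/p over the shared terms; the exact formula and the names `term_probabilities` and `match_score` are illustrative assumptions.

```python
from collections import Counter

def term_probabilities(documents):
    """Map each term to the fraction of documents that contain it."""
    doc_count = len(documents)
    doc_freq = Counter(term for doc in documents for term in set(doc))
    return {term: n / doc_count for term, n in doc_freq.items()}

def match_score(listing_terms, product_terms, probs):
    """Sum of 1/p over the terms shared by a listing and a product."""
    shared = set(listing_terms) & set(product_terms)
    return sum(1.0 / probs[term] for term in shared if term in probs)

# Hypothetical model terms: "zoom" appears in every product.
products = [
    {"zoom", "s1000"},
    {"zoom", "a700"},
    {"zoom", "z10"},
    {"zoom", "w55"},
]
probs = term_probabilities(products)

weak = match_score({"zoom", "x5"}, products[0], probs)       # shares only "zoom"
strong = match_score({"zoom", "s1000"}, products[0], probs)  # shares a rare term too
```

A listing that only shares “zoom” with a product gets the minimum possible weight for that term, while sharing a rare model number dominates the score, so a threshold on the score filters out most of the false positives.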