A follow-along workshop on using R to do data mining on tweets.
Notes
- Data pipeline: Extract → Clean → Transform → Analyze
- Clean (see the first R sketch after this list)
    - Remove: Unicode emoji, punctuation, URLs/links, stop words, numbers, dates, extra white space
    - Normalize to lowercase
    - Stemming - truncate words to their root form (radical)
        - Examples: cats → cat, ponies → poni
- Analysis
    - Part-of-speech tagging - tag each word with its part of speech, using both its definition and its context in the sentence (sketch below)
        - Example: They refuse to permit → [pronoun] [verb] [to] [verb]
    - Word association - in a term-document matrix, the dot product of two word columns counts how many documents the words share; normalizing that count gives an association score (worked example and sketch below)
    - Clustering - grouping similar tweets (documents) together
        - K-means - centroidal model; each cluster is represented by a single mean vector (sketch below)
            - Algorithm: pick k random centroids → assign each point to the nearest centroid → recompute each centroid as the mean of its assigned points → repeat until assignments stop changing
        - Hierarchical - connectivity model (sketch below)
            - Use cosine similarity as the distance function; cosine similarity normalizes for tweet (document) length during the comparison
    - Topic mining - use term probabilities to discover what documents are about; a document may belong to multiple topics (sketch below)
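These notes don't reproduce the workshop's code, so the sketches below are minimal reconstructions rather than the workshop's own implementation. First, the cleaning and stemming steps using the tm and SnowballC packages; the sample tweets and the removeURLs/removeEmoji helpers are made up for illustration:

```r
library(tm)        # text-mining framework: corpus, transformations, term matrices
library(SnowballC) # Porter stemmer backing stemDocument()

tweets <- c("Check out http://example.com CATS and ponies!!! 123",
            "My cat refuses to permit baths",
            "Ponies and cats are the best pets")

corpus <- VCorpus(VectorSource(tweets))

# Custom transformations for steps tm has no built-in for: URLs and emoji.
removeURLs  <- content_transformer(function(x) gsub("http\\S+|www\\.\\S+", " ", x))
removeEmoji <- content_transformer(function(x) iconv(x, to = "ASCII", sub = " "))

corpus <- tm_map(corpus, removeURLs)
corpus <- tm_map(corpus, removeEmoji)
corpus <- tm_map(corpus, content_transformer(tolower))       # normalize to lowercase
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, stopwords("english"))  # drop stop words
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, stemDocument)                       # cats -> cat, ponies -> poni

sapply(corpus, as.character)
```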
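The notes don't say which tagger the workshop used; one common R option is openNLP (it needs Java plus the openNLPmodels.en model package). A sketch that reproduces the example sentence:

```r
library(NLP)
library(openNLP)
# plus: install.packages("openNLPmodels.en", repos = "http://datacube.wu.ac.at")

s <- as.String("They refuse to permit")

# Annotate sentences and words first; the POS tagger builds on those annotations.
a <- annotate(s, list(Maxent_Sent_Token_Annotator(),
                      Maxent_Word_Token_Annotator()))
a <- annotate(s, Maxent_POS_Tag_Annotator(), a)

words <- subset(a, type == "word")
sapply(words$features, `[[`, "POS")
# Penn Treebank tags: "PRP" "VBP" "TO" "VB"  (pronoun, verb, to, verb)
```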
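Next, k-means on a document-term matrix built from the cleaned corpus above. The choice of k = 2 and the length normalization (so that Euclidean distance on unit vectors tracks cosine distance) are assumptions, not from the notes:

```r
dtm <- DocumentTermMatrix(corpus)  # corpus from the cleaning sketch
m   <- as.matrix(dtm)
m   <- m[rowSums(m) > 0, ]         # drop empty documents to avoid divide-by-zero
m   <- m / sqrt(rowSums(m^2))      # unit-length rows: Euclidean ~ cosine

set.seed(42)                       # k-means starts from random centroids
km <- kmeans(m, centers = 2)
km$cluster                         # cluster assignment for each tweet
```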
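Hierarchical clustering with the cosine distance the notes call for, reusing the unit-normalized matrix m from the k-means sketch; the ward.D2 linkage is an arbitrary choice:

```r
cos_sim <- m %*% t(m)              # rows are unit length, so this is cosine similarity
d  <- as.dist(1 - cos_sim)         # turn similarity into a distance
hc <- hclust(d, method = "ward.D2")

plot(hc)                           # dendrogram of the tweets
cutree(hc, k = 2)                  # flatten the tree into 2 clusters
```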
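The notes don't name a topic-mining method; latent Dirichlet allocation via the topicmodels package is a common choice in R. The per-document topic distribution is what lets one document belong to several topics; k = 2 topics is arbitrary:

```r
library(topicmodels)

dtm <- DocumentTermMatrix(corpus)          # corpus from the cleaning sketch
dtm <- dtm[rowSums(as.matrix(dtm)) > 0, ]  # LDA requires non-empty documents

lda <- LDA(dtm, k = 2, control = list(seed = 42))

terms(lda, 5)            # top 5 terms per topic
posterior(lda)$topics    # per-document topic probabilities (rows sum to 1)
```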
Word association example, as a term-document matrix (rows are documents, columns are words):

| Docs | word 1 | word 2 | word 3 | word 4 |
|---|---|---|---|---|
| 1 | 0 | 0 | 0 | 0 |
| 2 | 1 | 0 | 0 | 0 |
| 3 | 1 | 1 | 0 | 0 |
| 4 | 1 | 1 | 1 | 1 |
Association(word 2, word 3): the columns are [0, 0, 1, 1] and [0, 0, 0, 1], so DotProduct = 1 shared document; dividing by the 2 documents that contain word 2 normalizes this to 1/2 = 0.5.
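A sketch reproducing the worked example in base R (the doc/word labels are the hypothetical ones from the table), plus the tm helper that computes correlation-based associations on a real term-document matrix:

```r
# Term-document matrix from the example: rows = docs, columns = words
tdm <- matrix(c(0, 0, 0, 0,
                1, 0, 0, 0,
                1, 1, 0, 0,
                1, 1, 1, 1),
              nrow = 4, byrow = TRUE,
              dimnames = list(paste("doc", 1:4), paste("word", 1:4)))

w2 <- tdm[, "word 2"]    # (0, 0, 1, 1)
w3 <- tdm[, "word 3"]    # (0, 0, 0, 1)

sum(w2 * w3)             # dot product = 1 shared document
sum(w2 * w3) / sum(w2)   # normalized by word 2's document count = 0.5

# On a real tm TermDocumentMatrix, findAssocs() reports correlation-based scores:
# findAssocs(TermDocumentMatrix(corpus), "cat", corlimit = 0.25)
```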