A follow-along workshop on using R to do data mining on tweets.
Notes
- Data pipeline: Extract → Clean → Transform → Analyze
- Clean (see the first R sketch after this list)
    - Remove: Unicode emoji, punctuation, URLs/links, stop words, numbers, dates, extra white space
    - Normalize to lowercase
    - Stemming - truncate words to their root form (radical)
        - Examples: cats → cat, ponies → poni
- Analysis
    - Part-of-speech tagging - tag each word with its part of speech, using both its definition and its context in the sentence (sketch below)
        - Example: They refuse to permit → [pronoun] [verb] [to] [verb]
    - Word association - in a term-document matrix, the dot product of two word columns counts how many documents the words share; normalizing that count gives an association score (worked example and sketch below)
    - Clustering - grouping similar tweets (documents) together
        - K-means - centroidal model; each cluster is represented by a single mean vector (sketch below)
            - Algorithm: pick k random centroids → assign each point to the nearest centroid → recompute each centroid as the mean of its assigned points → repeat until assignments stop changing
        - Hierarchical - connectivity model (sketch below)
            - Use cosine similarity as the distance function; cosine similarity normalizes for tweet (document) length during the comparison
    - Topic mining - use term probabilities to discover what documents are about; a document may belong to multiple topics (sketch below)
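These notes don't reproduce the workshop's code, so the sketches below are minimal reconstructions rather than the workshop's own implementation. First, the cleaning and stemming steps using the tm and SnowballC packages; the sample tweets and the removeURLs/removeEmoji helpers are made up for illustration:

```r
library(tm)        # text-mining framework: corpus, transformations, term matrices
library(SnowballC) # Porter stemmer backing stemDocument()

tweets <- c("Check out http://example.com CATS and ponies!!! 123",
            "My cat refuses to permit baths",
            "Ponies and cats are the best pets")

corpus <- VCorpus(VectorSource(tweets))

# Custom transformations for steps tm has no built-in for: URLs and emoji.
removeURLs  <- content_transformer(function(x) gsub("http\\S+|www\\.\\S+", " ", x))
removeEmoji <- content_transformer(function(x) iconv(x, to = "ASCII", sub = " "))

corpus <- tm_map(corpus, removeURLs)
corpus <- tm_map(corpus, removeEmoji)
corpus <- tm_map(corpus, content_transformer(tolower))       # normalize to lowercase
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, stopwords("english"))  # drop stop words
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, stemDocument)                       # cats -> cat, ponies -> poni

sapply(corpus, as.character)
```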
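The notes don't say which tagger the workshop used; one common R option is openNLP (it needs Java plus the openNLPmodels.en model package). A sketch that reproduces the example sentence:

```r
library(NLP)
library(openNLP)
# plus: install.packages("openNLPmodels.en", repos = "http://datacube.wu.ac.at")

s <- as.String("They refuse to permit")

# Annotate sentences and words first; the POS tagger builds on those annotations.
a <- annotate(s, list(Maxent_Sent_Token_Annotator(),
                      Maxent_Word_Token_Annotator()))
a <- annotate(s, Maxent_POS_Tag_Annotator(), a)

words <- subset(a, type == "word")
sapply(words$features, `[[`, "POS")
# Penn Treebank tags: "PRP" "VBP" "TO" "VB"  (pronoun, verb, to, verb)
```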
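Next, k-means on a document-term matrix built from the cleaned corpus above. The choice of k = 2 and the length normalization (so that Euclidean distance on unit vectors tracks cosine distance) are assumptions, not from the notes:

```r
dtm <- DocumentTermMatrix(corpus)  # corpus from the cleaning sketch
m   <- as.matrix(dtm)
m   <- m[rowSums(m) > 0, ]         # drop empty documents to avoid divide-by-zero
m   <- m / sqrt(rowSums(m^2))      # unit-length rows: Euclidean ~ cosine

set.seed(42)                       # k-means starts from random centroids
km <- kmeans(m, centers = 2)
km$cluster                         # cluster assignment for each tweet
```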
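Hierarchical clustering with the cosine distance the notes call for, reusing the unit-normalized matrix m from the k-means sketch; the ward.D2 linkage is an arbitrary choice:

```r
cos_sim <- m %*% t(m)              # rows are unit length, so this is cosine similarity
d  <- as.dist(1 - cos_sim)         # turn similarity into a distance
hc <- hclust(d, method = "ward.D2")

plot(hc)                           # dendrogram of the tweets
cutree(hc, k = 2)                  # flatten the tree into 2 clusters
```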
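The notes don't name a topic-mining method; latent Dirichlet allocation via the topicmodels package is a common choice in R. The per-document topic distribution is what lets one document belong to several topics; k = 2 topics is arbitrary:

```r
library(topicmodels)

dtm <- DocumentTermMatrix(corpus)          # corpus from the cleaning sketch
dtm <- dtm[rowSums(as.matrix(dtm)) > 0, ]  # LDA requires non-empty documents

lda <- LDA(dtm, k = 2, control = list(seed = 42))

terms(lda, 5)            # top 5 terms per topic
posterior(lda)$topics    # per-document topic probabilities (rows sum to 1)
```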
Word association example, as a term-document matrix (rows are documents, columns are words):

| Docs | word 1 | word 2 | word 3 | word 4 |
|---|---|---|---|---|
| 1 | 0 | 0 | 0 | 0 |
| 2 | 1 | 0 | 0 | 0 |
| 3 | 1 | 1 | 0 | 0 |
| 4 | 1 | 1 | 1 | 1 |
Association(word 2, word 3): the columns are [0, 0, 1, 1] and [0, 0, 0, 1], so DotProduct = 1 shared document; dividing by the 2 documents that contain word 2 normalizes this to 1/2 = 0.5.
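A sketch reproducing the worked example in base R (the doc/word labels are the hypothetical ones from the table), plus the tm helper that computes correlation-based associations on a real term-document matrix:

```r
# Term-document matrix from the example: rows = docs, columns = words
tdm <- matrix(c(0, 0, 0, 0,
                1, 0, 0, 0,
                1, 1, 0, 0,
                1, 1, 1, 1),
              nrow = 4, byrow = TRUE,
              dimnames = list(paste("doc", 1:4), paste("word", 1:4)))

w2 <- tdm[, "word 2"]    # (0, 0, 1, 1)
w3 <- tdm[, "word 3"]    # (0, 0, 0, 1)

sum(w2 * w3)             # dot product = 1 shared document
sum(w2 * w3) / sum(w2)   # normalized by word 2's document count = 0.5

# On a real tm TermDocumentMatrix, findAssocs() reports correlation-based scores:
# findAssocs(TermDocumentMatrix(corpus), "cat", corlimit = 0.25)
```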