A blog about software and making.

Text Mining with Twitter and R Meetup

A follow-along workshop on using R to do data mining on tweets.

Notes

  • Data pipeline: Extract → Clean → Transform → Analyze
  • Clean
    • Remove: Unicode emoji, punctuation, links, stop words, numbers, dates, URLs, white space
    • Normalize to lowercase
    • Stemming - Truncate words to their stems (root forms)
      • Examples: cats → cat, ponies → poni
  • Analysis
    • Part of speech tagging - Tag each word with its part of speech in a sentence using definition and context.
      • Example: They refuse to permit → [pronoun] [verb] [to] [verb]
    • Word association - The dot product of two word columns in a term-document matrix counts how many documents the words share; normalizing it (e.g., with cosine similarity or Pearson correlation) turns it into a correlation-like score.
    • Clustering - Grouping similar tweets (docs) together
      • K-means - Centroidal model, each cluster is represented by a single mean vector.
        • Algorithm: create random clusters → assign points to nearest cluster centroid → recalculate cluster centroids to the average of assigned data points → repeat.
      • Hierarchical - Connectivity model
      • Use cosine similarity as the distance function. Cosine similarity normalizes for tweet (doc) length during the comparison.
    • Topic Mining - Use probability of terms to discover information from documents. Documents may belong to multiple topics.
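
The cleaning and stemming steps above can be sketched in Python (a toy sketch; the workshop used R, and the suffix rules below only roughly approximate a real stemmer such as Porter's — the stop-word list is a tiny sample, not a real one):

```python
import re

def clean(text):
    """Apply the cleaning steps: strip URLs, punctuation, numbers,
    emoji, and extra white space, then normalize to lowercase."""
    text = re.sub(r"https?://\S+", " ", text)   # links / URLs
    text = re.sub(r"[^A-Za-z\s]", " ", text)    # punctuation, numbers, emoji
    return re.sub(r"\s+", " ", text).strip().lower()

STOP_WORDS = {"a", "an", "the", "to", "of", "and", "is", "are"}  # tiny sample list

def stem(word):
    """Toy suffix stripping: cats -> cat, ponies -> poni."""
    if word.endswith("ies"):
        return word[:-2]                        # ponies -> poni
    if word.endswith("s") and not word.endswith("ss"):
        return word[:-1]                        # cats -> cat
    return word

def preprocess(tweet):
    return [stem(w) for w in clean(tweet).split() if w not in STOP_WORDS]

print(preprocess("The cats & ponies ate 3 apples! https://example.com"))
# -> ['cat', 'poni', 'ate', 'apple']
```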
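
The part-of-speech example can be illustrated with a toy tagger (a hypothetical sketch, not a real tagging model): a lexicon supplies candidate tags per word ("definition"), and a single rule disambiguates using the previous tag ("context"):

```python
# Candidate tags per word; "refuse" and "permit" are ambiguous noun/verb.
LEXICON = {
    "they": ["pronoun"],
    "refuse": ["verb", "noun"],
    "to": ["to"],
    "permit": ["verb", "noun"],
}

def tag(words):
    tags = []
    for i, w in enumerate(words):
        candidates = LEXICON[w.lower()]
        # Context rule: after "to" or a pronoun, prefer the verb reading.
        if len(candidates) > 1 and i > 0 and tags[-1] in ("to", "pronoun"):
            tags.append("verb")
        else:
            tags.append(candidates[0])
    return tags

print(tag("They refuse to permit".split()))
# -> ['pronoun', 'verb', 'to', 'verb']
```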
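
The k-means loop described above, using cosine distance, can be sketched in pure Python (a simplification: recomputing centroids as plain averages alongside cosine distance is the spherical-k-means-style shortcut, and the toy vectors are made-up term counts):

```python
import math, random

def cosine_distance(a, b):
    """1 - cosine similarity; normalizes for document length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1 - dot / norm if norm else 1.0

def kmeans(points, k, iters=20, seed=0):
    random.seed(seed)
    centroids = random.sample(points, k)        # start from random points
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                        # assign to nearest centroid
            i = min(range(k), key=lambda i: cosine_distance(p, centroids[i]))
            clusters[i].append(p)
        for i, c in enumerate(clusters):        # recompute centroids as averages
            if c:
                centroids[i] = [sum(dim) / len(c) for dim in zip(*c)]
    return clusters

# Two obvious groups of toy term-count vectors
docs = [[3, 0, 1], [4, 0, 0], [0, 5, 2], [0, 4, 1]]
print(kmeans(docs, k=2))
```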

Word association example:

Docs  word 1  word 2  word 3  word 4
 1      0       0       0       0
 2      1       0       0       0
 3      1       1       0       0
 4      1       1       1       1

DotProduct(word 2, word 3) = DotProduct([0,0,1,1], [0,0,0,1]) = 1 shared document; normalizing by vector lengths gives a cosine similarity of 1 / (√2 × 1) ≈ 0.71.
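
The example can be checked in Python (a sketch; the word vectors are the columns of the table, reading docs 1-4 top to bottom):

```python
import math

# Term-document matrix from the table above: rows are docs, columns are words 1-4
tdm = [
    [0, 0, 0, 0],  # doc 1
    [1, 0, 0, 0],  # doc 2
    [1, 1, 0, 0],  # doc 3
    [1, 1, 1, 1],  # doc 4
]

def column(matrix, j):
    return [row[j] for row in matrix]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

w2, w3 = column(tdm, 1), column(tdm, 2)
print(dot(w2, w3))               # -> 1 (one shared document)
print(round(cosine(w2, w3), 2))  # -> 0.71
```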

Meetup Event