Previously I showed how to use twitteR to mine twitter topic streams for text. Here I will build on what I learned, take this type of analysis deeper, and try to answer the following questions:

Is the frequency of words different in twitter posts containing the words #NPR and #NRA?
Is the sentiment of tweets containing #NPR and #NRA different?

NPR and NRA differ by only one letter, but they represent very different entities, and I would like to know how tweets containing them differ.

(Side note - you can find all of this code on my Github: https://github.com/JTDean123)

To answer these questions I can use many of the libraries and techniques described in my previous posting, but before I get into that I would like to address a few points/concerns. First of all, it is important to point out that this is an observational study, so I cannot make any claims about causation. This means that I will be able to make claims such as:

* tweets containing ‘#NRA’ have a higher/lower/the same frequency of a given word compared to those containing ‘#NPR’.

I will not be able to make claims such as:

* people who support the NRA like/dislike/etc. something, based on word frequency, compared to people who support NPR.

To make claims rooted in causation I would have to perform a controlled experiment with control and experimental groups to eliminate bias and confounding variables as much as possible. There are potentially many biases and confounding variables in the study I am embarking on here, and I would love to hear any that you can identify.


Pull Text From Twitter Topic Streams

To start this analysis we need two libraries: twitteR and ROAuth. I described these in my previous article.

library(twitteR)
library(ROAuth)
## [1] "Using direct authentication"

I connected to twitter as described before and now I am ready to start pulling data. First I will collect 2000 tweets containing each of #NRA, #NPR, and #sunshine.
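For reference, the connection step looks roughly like the sketch below. The credential strings are placeholders rather than real keys; you would substitute the consumer key/secret and access token/secret from your own Twitter application.

# placeholder credentials - substitute your own application keys
consumer_key <- "YOUR_CONSUMER_KEY"
consumer_secret <- "YOUR_CONSUMER_SECRET"
access_token <- "YOUR_ACCESS_TOKEN"
access_secret <- "YOUR_ACCESS_SECRET"

# authenticate with the twitter API via twitteR
setup_twitter_oauth(consumer_key, consumer_secret, access_token, access_secret)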

Wait, why am I pulling tweets containing the word #sunshine? This will serve as a positive control of sorts, as I suspect that tweets containing #sunshine may be enriched with positive words.

nra <- searchTwitter("#NRA", n=2000, lang='en')
npr <- searchTwitter("#NPR", n=2000, lang='en')
sunshine <- searchTwitter("#sunshine", n=2000, lang='en')
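Note that searchTwitter can return fewer tweets than requested if there are not enough matching results, so it is worth a quick check of how many tweets actually came back:

# number of tweets retrieved for each search term
length(nra)
length(npr)
length(sunshine)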


Great, now I have 2000 tweets containing each of our search terms. Before I do any type of analysis I need to do some text pre-processing. This is done by first extracting the text from the tweets and creating a Corpus. Also, I found that non-ASCII characters in the tweets created a lot of problems when attempting to stem and convert the text to lowercase, so before moving forward I removed these problematic characters with the iconv function. The stemming and stopword processing in the next steps operate on lowercase words only, so I also converted the text to lowercase. Finally, I noticed that a lot of tweets contain links, so I stripped out anything starting with ‘http’ using the gsub function.
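That pre-processing code isn't shown above, so here is a rough sketch of it for the #NRA tweets (the other two sets are handled the same way). It assumes the tm package and is one reasonable way to implement the steps just described:

library(tm)

# pull the raw text out of the status objects returned by searchTwitter
nra.text <- sapply(nra, function(x) x$getText())

# drop non-ASCII characters that break stemming and case conversion
nra.text <- iconv(nra.text, from = "latin1", to = "ASCII", sub = "")

# lowercase everything so stemming and stopword removal behave as expected
nra.text <- tolower(nra.text)

# strip URLs
nra.text <- gsub("http\\S+", "", nra.text)

# build the corpus used for the term-document matrix below
nra.corp <- Corpus(VectorSource(nra.text))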

Next I created a term-document matrix for each set of tweets and cleaned up the text by stemming, removing punctuation, removing stop words (including the search term itself), removing numbers, and stripping extra whitespace:

nra.data <- TermDocumentMatrix(nra.corp, control = list(
  stemming = TRUE,
  removePunctuation = TRUE,
  stopwords = c("the", "nra", stopwords("english")),
  removeNumbers = TRUE,
  stripWhitespace = TRUE))

npr.data <- TermDocumentMatrix(npr.corp, control = list(
  stemming = TRUE,
  removePunctuation = TRUE,
  stopwords = c("the", "npr", stopwords("english")),
  removeNumbers = TRUE,
  stripWhitespace = TRUE))

sunshine.data <- TermDocumentMatrix(sunshine.corp, control = list(
  stemming = TRUE,
  removePunctuation = TRUE,
  stopwords = c("the", "sunshine", "sunshin", stopwords("english")),
  removeNumbers = TRUE,
  stripWhitespace = TRUE))

Now that we have the text data from the tweets in a TermDocumentMatrix, we can extract word frequencies. Before quantifying them, we can visualize the data with a wordcloud. I really like wordclouds because they are intuitive and easy to understand.
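As a quick sanity check before plotting, tm's findFreqTerms function lists the terms that appear at least a given number of times; the threshold of 25 below is arbitrary:

# terms that appear at least 25 times in the #NRA tweets
findFreqTerms(nra.data, lowfreq = 25)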


NRA wordcloud

library(wordcloud)
library(RColorBrewer)

# convert the term-document matrix to a sorted word frequency table
nra.matrix <- as.matrix(nra.data)
nra.word_freqs <- sort(rowSums(nra.matrix), decreasing = TRUE)
nra.df <- data.frame(word = names(nra.word_freqs), freq = nra.word_freqs)

wordcloud(nra.df$word, nra.df$freq, scale = c(5, 0.5), random.order = FALSE,
          rot.per = 0.35, use.r.layout = FALSE, colors = brewer.pal(8, "Dark2"))