Previously I showed how to use twitteR to mine twitter topic streams for text. Here I will build on what I learned, take that type of analysis deeper, and try to answer the following questions:

Is the frequency of words different in twitter posts containing the words #NPR and #NRA?
Is the sentiment of tweets containing #NPR and #NRA different?

NPR and NRA differ by one letter, but they represent very different entities, and I would like to know how tweets containing them differ.

(Side note - you can find all of this code on my Github: https://github.com/JTDean123)

To answer these questions I can use many of the libraries and techniques described in my previous posting, but before I get into that I would like to address a few points. First, it is important to point out that this is an observational study, so I cannot make any claims about causation. This means that I will be able to make claims such as:

* tweets containing ‘#NRA’ have a higher/lower/the same frequency of a given word compared to those containing ‘#NPR’.

I will not be able to make claims such as:

* people who support the NRA like/dislike something, based on word frequency, compared to those who support NPR.

To make claims rooted in causation I would have to perform a controlled experiment with control and experimental groups to eliminate bias and confounding variables as much as possible. There are potentially many biases and confounding variables in the study I am embarking on here, and I would love to hear any that you can identify.


Pull Text From Twitter Topic Streams

To start with this analysis we need two libraries: twitteR and ROAuth. I described these in my previous article.

library(twitteR)
library(ROAuth)
## [1] "Using direct authentication"

I connected to twitter as described in my previous article, and now I am ready to start pulling data. First I will collect 2000 tweets for each of three hashtags: #NRA, #NPR, and #sunshine.

Wait, why am I pulling tweets containing the word #sunshine? This set will serve as a positive control of sorts, as I suspect that tweets containing #sunshine may be enriched with positive words.

nra <- searchTwitter("#NRA", n=2000, lang='en')
npr <- searchTwitter("#NPR", n=2000, lang='en')
sunshine <- searchTwitter("#sunshine", n=2000, lang='en')


Great, now I have 2000 tweets for each of the search terms. Before I do any type of analysis I need to do some text pre-processing. This starts with extracting the text from the tweets and creating a Corpus. Also, I found that non-ASCII characters in the tweets created a lot of problems when attempting to stem and convert the text to lowercase, so before moving forward I removed these problematic characters with the iconv function. The stemming and stopword processing in the next steps search for lowercase words only, so I also converted the text to lowercase. Finally, I noticed that a lot of tweets contain websites, so I stripped out anything starting with ‘http’ using the gsub function. A sketch of these steps is shown below.
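The pre-processing code is not shown here, so the following is a minimal sketch of the steps just described, using the NRA set (the same steps apply to npr and sunshine); the resulting nra.corp is what the term-document matrix code below expects:

library(tm)

# extract the raw text from each tweet object
nra.text <- sapply(nra, function(x) x$getText())
# drop non-ASCII characters that break stemming and tolower
nra.text <- iconv(nra.text, "latin1", "ASCII", sub = "")
# convert everything to lowercase
nra.text <- tolower(nra.text)
# strip out URLs
nra.text <- gsub("http\\S+", "", nra.text)
# build a corpus for the tm package
nra.corp <- Corpus(VectorSource(nra.text))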

Next I created a term document matrix for each set, cleaning up the text along the way by stemming, removing punctuation, removing stopwords (including the search terms themselves), removing numbers, and stripping whitespace:

nra.data <- TermDocumentMatrix(nra.corp,
                               control = list(stemming = TRUE,
                                              removePunctuation = TRUE,
                                              stopwords = c("the", "nra", stopwords("english")),
                                              removeNumbers = TRUE,
                                              stripWhitespace = TRUE))

npr.data <- TermDocumentMatrix(npr.corp,
                               control = list(stemming = TRUE,
                                              removePunctuation = TRUE,
                                              stopwords = c("the", "npr", stopwords("english")),
                                              removeNumbers = TRUE,
                                              stripWhitespace = TRUE))

sunshine.data <- TermDocumentMatrix(sunshine.corp,
                                    control = list(stemming = TRUE,
                                                   removePunctuation = TRUE,
                                                   stopwords = c("the", "sunshine", "sunshin", stopwords("english")),
                                                   removeNumbers = TRUE,
                                                   stripWhitespace = TRUE))
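As a quick sanity check (my addition, not in the original post), tm can confirm the shape of the result: rows are stemmed terms and columns are individual tweets.

# terms are rows, tweets are columns
dim(nra.data)
# peek at a small corner of the matrix
inspect(nra.data[1:5, 1:5])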

Now that we have the text data from the tweets in a TermDocumentMatrix we can extract word frequencies. Before quantifying this we can visualize the data with a wordcloud. I really like wordclouds because they are intuitive and easy to understand.


NRA wordcloud

library(wordcloud)
library(RColorBrewer)

nra.matrix <- as.matrix(nra.data)
nra.word_freqs <- sort(rowSums(nra.matrix), decreasing=TRUE)
nra.df <- data.frame(word=names(nra.word_freqs), freq=nra.word_freqs)

wordcloud(nra.df$word, nra.df$freq, scale=c(5,0.5), random.order=FALSE,
          rot.per=0.35, use.r.layout=FALSE, colors=brewer.pal(8, "Dark2"))


NPR wordcloud

npr.matrix <- as.matrix(npr.data)
npr.word_freqs <- sort(rowSums(npr.matrix), decreasing=TRUE)
npr.df <- data.frame(word=names(npr.word_freqs), freq=npr.word_freqs)

wordcloud(npr.df$word, npr.df$freq, scale=c(5,0.5), random.order=FALSE,
          rot.per=0.35, use.r.layout=FALSE, colors=brewer.pal(8, "Dark2"))


sunshine wordcloud

sunshine.matrix <- as.matrix(sunshine.data)
sunshine.word_freqs <- sort(rowSums(sunshine.matrix), decreasing=TRUE)
sunshine.df <- data.frame(word=names(sunshine.word_freqs), freq=sunshine.word_freqs)

wordcloud(sunshine.df$word, sunshine.df$freq, scale=c(5,0.5), random.order=FALSE,
          rot.per=0.35, use.r.layout=FALSE, colors=brewer.pal(8, "Dark2"))


Word Frequency Analysis and Association

As shown above, the words appearing in these three sets of tweets are very different. Next I can look at the most frequent terms appearing in each twitter feed.

# NRA
library(knitr)
kable(head(as.data.frame(nra.word_freqs)), format="html", align = 'c')
nra.word_freqs
america 778
lapierr 758
wayn 758
cpac 725
media 682
campaign 657
# NPR
kable(head(as.data.frame(npr.word_freqs)), format="html", align = 'c')
npr.word_freqs
trump 417
new 264
cnn 177
stori 171
msnbc 134
nyt 98
# sunshine
kable(head(as.data.frame(sunshine.word_freqs)), format="html", align = 'c')
sunshine.word_freqs
come 462
soon 444
fbb 437
kerala 437
varundvn 437
buffalo 436


Next I looked for words that are ‘correlated’ with each other. The TermDocumentMatrix created above is essentially a matrix containing words (rows) and tweets (columns), so the value at position (i,j) is the number of times that word i occurs in tweet j. By calculating correlations between the word rows I can determine quantitatively how words appear (or don’t appear) together in different tweets. To do this I used the findAssocs function. To start, I searched for terms in the three sets of tweets that are associated with the term ‘america’.

nra.america <- findAssocs(nra.data, 'america', 0.25)
npr.america <- findAssocs(npr.data, 'america', 0.25)
sunshine.america <- findAssocs(sunshine.data, 'america', 0.25)

# NRA
kable(head(as.data.frame(nra.america)), format="html", align = 'c')
america
carpetbomb 0.84
sieg 0.84
countri 0.83
campaign 0.82
know 0.82
media 0.80
# NPR
kable(head(as.data.frame(npr.america)), format="html", align = 'c')
america
adequ 0.28
americorp 0.28
angri 0.28
baldwin 0.28
beltway 0.28
beto 0.28
# sunshine
kable(head(as.data.frame(sunshine.america)), format="html", align = 'c')
america


As shown above, I found very different words associated with ‘america’ in these three sets of tweets! I next looked for words that are associated with the word ‘love’.

nra.love <- findAssocs(nra.data, 'love', 0.30)
npr.love <- findAssocs(npr.data, 'love', 0.30)
sunshine.love <- findAssocs(sunshine.data, 'love', 0.30)

# NRA
kable(head(as.data.frame(nra.love)), format="html", align = 'c')
love
reason 0.41
ads 0.35
againdoesnt 0.35
almost 0.35
argue 0.35
believable 0.35
# NPR
kable(head(as.data.frame(npr.love)), format="html", align = 'c')
love
slavery 0.39
search 0.36
dog 0.30
# sunshine
kable(head(as.data.frame(sunshine.love)), format="html", align = 'c')
love


Again, the set of words associated with ‘love’ in each set of tweets depends on the topic stream that the tweets were obtained from. Last, I looked for words that are associated with the word ‘trump’.

nra.trump <- findAssocs(nra.data, 'trump', 0.2)
npr.trump <- findAssocs(npr.data, 'trump', 0.2)
sunshine.trump <- findAssocs(sunshine.data, 'trump', 0.2)

# NRA
kable(head(as.data.frame(nra.trump)), format="html", align = 'c')
trump
american 0.65
cruz 0.61
usa 0.47
pjnet 0.44
congress 0.43
smartvalueblog 0.43
# NPR
kable(head(as.data.frame(npr.trump)), format="html", align = 'c')
trump
hotel 0.32
fbi 0.31
campaign 0.30
govern 0.30
conflict 0.29
rais 0.28
# sunshine
kable(head(as.data.frame(sunshine.trump)), format="html", align = 'c')
trump


Interestingly, there were no words associated with trump with a correlation > 0.2 in tweets containing the word sunshine. These results will obviously vary day to day as different news topics come and go, so I encourage you to re-do this analysis and let me know what you find.

To find word associations above I had to search using a specific word. It might be more useful to find two words that are most associated. Quantitatively, two words are most positively associated if they have the highest correlation coefficient. As I mentioned above, a correlation matrix of a TermDocumentMatrix allows us to calculate exactly this. I will perform this analysis for both the NRA and NPR data sets.

The correlation of two variables is the covariance normalized by the product of the standard deviations, so first I removed any words that have a zero standard deviation across tweets.
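As a quick sanity check on that definition (my addition, not part of the original analysis), R’s cor, cov, and sd agree with it exactly:

x <- c(1, 0, 2, 0)
y <- c(2, 0, 3, 1)
# correlation = covariance / (sd(x) * sd(y));
# a zero standard deviation would make this ratio undefined
all.equal(cor(x, y), cov(x, y) / (sd(x) * sd(y)))
## [1] TRUE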

# NRA
nra2 <- as.data.frame(nra.matrix)
# calculate the standard deviation of each word across tweets
nra.stdev <- as.numeric(apply(nra2, 1, sd))
# filter out words that have a standard deviation equal to zero
# (base R subsetting keeps the words as row names, which we need later)
nra2 <- nra2[nra.stdev > 0, ]

# NPR
npr2 <- as.data.frame(npr.matrix)
# calculate the standard deviation of each word across tweets
npr.stdev <- as.numeric(apply(npr2, 1, sd))
# filter out words that have a standard deviation equal to zero
npr2 <- npr2[npr.stdev > 0, ]

The next step is to calculate a correlation matrix and find the maximum values. Two details matter here: cor correlates columns, and our words are rows, so the matrix has to be transposed first; and the diagonal values of a correlation matrix are equal to one (the correlation of a variable with itself), so I set these equal to zero. I also ignored anything above 0.99, since near-perfect correlations mostly come from duplicated text such as retweets. With that done, I determined the pair of words with the highest remaining correlation coefficient.

# NRA
# transpose so that words become columns, then correlate them across tweets
nra.corr <- cor(t(nra2))
nra.corr[is.na(nra.corr)] <- 0
nra.corr[nra.corr == 1] <- 0
# find the highest correlation coefficient below the 0.99 cutoff
nra.max <- sort(nra.corr[nra.corr > 0.95 & nra.corr < 0.99], decreasing = TRUE)
nra.maximum <- nra.max[1]
# find where this maximum occurs in the correlation matrix
nra.loc <- which(nra.corr == nra.maximum, arr.ind = TRUE)
# and last find what words this correlation coefficient is calculated from
nra.words <- row.names(nra2)
nra.top2 <- c(nra.words[nra.loc[1,1]], nra.words[nra.loc[1,2]])

# NPR
# transpose so that words become columns, then correlate them across tweets
npr.corr <- cor(t(npr2))
npr.corr[is.na(npr.corr)] <- 0
npr.corr[npr.corr == 1] <- 0
# find the highest correlation coefficient below the 0.99 cutoff
npr.max <- sort(npr.corr[npr.corr > 0.95 & npr.corr < 0.99], decreasing = TRUE)
npr.maximum <- npr.max[1]
# find where this maximum occurs in the correlation matrix
npr.loc <- which(npr.corr == npr.maximum, arr.ind = TRUE)
# and last find what words this correlation coefficient is calculated from
npr.words <- row.names(npr2)
npr.top2 <- c(npr.words[npr.loc[1,1]], npr.words[npr.loc[1,2]])


Now I can display the results:

# "Top two associated words in the NRA tweet data set""
nra.top2
## [1] "holder"  "handgun"
# "Top two associated words in the NPR tweet data set""
npr.top2
## [1] "brotherhood" "bromodrosi"


As shown above, the most highly associated word pair was different in each of the two tweet sets.


Sentiment Analysis

Next I was interested in performing sentiment analysis on these three sets of tweets. Sentiment analysis, in this context, aims to determine the attitude or emotions of the author of a particular tweet. For this I used lexicon-based sentiment analysis, which requires a dictionary of words classified as positive or negative; I found such a list from this research group:
https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html#lexicon

Two text files, one containing positive words and one containing negative words, were downloaded for sentiment analysis.

positive <- readLines("positive-words.txt")
negative <- readLines("negative-words.txt")
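One caveat, and this is an assumption about the downloaded files rather than something from the original post: the Hu and Liu lexicon files typically begin with a commented header whose lines start with ‘;’. readLines pulls those lines in as if they were words, so if your copies have the header it is worth stripping it:

# drop commented header lines and blank lines from the lexicons (if present)
positive <- positive[!grepl("^;", positive) & positive != ""]
negative <- negative[!grepl("^;", negative) & negative != ""]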


For this analysis I want to determine the frequency of positive and negative words in each of the twitter streams. It is important to note that a negative word in a tweet does not necessarily indicate a negative overall sentiment, but this analysis should give us a good big-picture view of the sentiment of the tweets we have captured. I started by looking for words in each twitter stream that were present in the positive word list.

nra.df$positive <- match(nra.df$word, positive)
npr.df$positive <- match(npr.df$word, positive)
sunshine.df$positive <- match(sunshine.df$word, positive)


Next I looked for words in each set that were present in the negative word list.

nra.df$negative <- match(nra.df$word, negative)
npr.df$negative <- match(npr.df$word, negative)
sunshine.df$negative <- match(sunshine.df$word, negative)


Many of the words present in these twitter streams will not be identified as positive or negative (these are currently ‘NA’ in the dataframe), so I set their ‘positive’ or ‘negative’ value equal to zero. Also, the match function returns the index of a matching word in the queried word list, but for this analysis I only need to know if it is present (1) or not present (0). Therefore I assigned all non-zero matches to one and all others to zero. (I bet there is a way to do this in one line; one possibility is sketched after the code below.)

nra.df[is.na(nra.df)] <- 0
nra.df$positive[nra.df$positive != 0] <- 1
nra.df$negative[nra.df$negative != 0] <- 1

npr.df[is.na(npr.df)] <- 0
npr.df$positive[npr.df$positive != 0] <- 1
npr.df$negative[npr.df$negative != 0] <- 1

sunshine.df[is.na(sunshine.df)] <- 0
sunshine.df$positive[sunshine.df$positive != 0] <- 1
sunshine.df$negative[sunshine.df$negative != 0] <- 1
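On the parenthetical above: the match step, the NA replacement, and the binarizing can indeed be collapsed into one line per column using %in%, which returns a logical vector that as.numeric converts to ones and zeros. For example, for the NRA set:

# one-line equivalent of match + NA handling + binarizing
nra.df$positive <- as.numeric(nra.df$word %in% positive)
nra.df$negative <- as.numeric(nra.df$word %in% negative)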


Now the only thing I have to do is calculate the frequency of positive and negative words occurring in the sets of tweets.

nra.positive <- sum((nra.df$positive*nra.df$freq))/sum(nra.df$freq)
nra.negative <- sum((nra.df$negative*nra.df$freq))/sum(nra.df$freq)

npr.positive <- sum((npr.df$positive*npr.df$freq))/sum(npr.df$freq)
npr.negative <- sum((npr.df$negative*npr.df$freq))/sum(npr.df$freq)

sunshine.positive <- sum((sunshine.df$positive*sunshine.df$freq))/sum(sunshine.df$freq)
sunshine.negative <- sum((sunshine.df$negative*sunshine.df$freq))/sum(sunshine.df$freq)

# format the data for plotting
library(reshape2)
library(ggplot2)

nra.sents <- data.frame(positive = nra.positive, negative = nra.negative)
npr.sents <- data.frame(positive = npr.positive, negative = npr.negative)
sunshine.sents <- data.frame(positive = sunshine.positive, negative = sunshine.negative)
sentiments <- rbind(nra.sents, npr.sents, sunshine.sents)
sentiments$tweets <- c("#NRA", "#NPR", "#sunshine")
# reshape to long format for ggplot
sentiments.m <- melt(sentiments, id.vars = "tweets")
colnames(sentiments.m) <- c("tweets", "sentiment", "fraction")

# plot the data
ggplot(data=sentiments.m, aes(tweets, fraction, fill=sentiment)) +
  geom_bar(stat='identity') +
  ylab("fraction of words") + xlab("") + theme_bw()


As shown above, fewer than 10% of the words in the tweets were classified as positive or negative. I have a hard time believing that more than 90% of the words we use in normal language are neutral, so how do I explain this? I think that the sentiment of a word in a tweet is not independent of the other words in the tweet. An example to explain: the word fish is not classified as positive or negative in this analysis, yet in the sentence “your feet smell like fish!” it clearly carries negative sentiment. This type of sentiment will not be captured in the analysis, and re-evaluating this data set with that in mind would be interesting but is beyond the scope of this project. Another key thing that I found is that the ratio of positive to negative words seems to be much higher in the sunshine set. We can quantify this as shown below.

sentiments$ratio <- sentiments$positive/sentiments$negative
head(sentiments[,3:4])
##      tweets    ratio
## 1      #NRA 1.192846
## 2      #NPR 1.080851
## 3 #sunshine 4.391509

Wow! I found that, at least on Feb 26, 2017, tweets containing #sunshine had more than four times as many positive words as negative words. I also found that tweets containing #NRA and #NPR had about the same ratio of positive to negative words, though I suspect this will change based on daily news events.

Conclusion

In conclusion, I found that tweets containing #NRA, #NPR, and #sunshine contained different words occurring at different frequencies. I also found that tweets containing #sunshine had more than four times as many positive words as negative words, a far higher ratio than tweets containing #NRA or #NPR. But tweets are dynamic! They change on a daily, or really a minute-by-minute, basis, and these trends could be reversed by the time you run this script. Good luck and let me know what you find!