Here I will try to capture Twitter topic streams and perform some basic text mining. This can be done using the Twitter R package (twitteR).


Create Twitter Account and Set Up twitteR

We will need two libraries for this analysis: twitteR and ROAuth. twitteR is an interface between R and the Twitter API. ROAuth allows users to authenticate via OAuth to the server of their choice.
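If these packages are not already installed, they can be pulled from CRAN first (a one-time step):

install.packages(c("twitteR", "ROAuth"))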

library(twitteR)
library(ROAuth)


A Twitter account is needed for this analysis, so I had to bite the bullet and create one. 20 million followers here I come. Next, I created a new application (https://apps.twitter.com) and obtained the keys needed to allow R to access Twitter. Specifically, I needed:

  1. Consumer Key (API Key)

  2. Consumer Secret (API Secret)

Then click the “Create my access token” button and obtain:

  1. Access Token

  2. Access Token Secret


Now I am ready to set up twitteR. This is done by storing the keys and tokens from above as follows:

consumer_key <- "your_consumer_key"
consumer_secret <- "your_consumer_secret"
access_token <- "your_access_token"
access_secret <- "your_access_secret"

setup_twitter_oauth(consumer_key, consumer_secret, access_token, access_secret)
## [1] "Using direct authentication"
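As a quick sanity check that the authentication actually works, you can look up a public account (a minimal sketch; the handle and accessor shown here are just illustrative):

# hypothetical check: fetch a public user to confirm the API responds
test_user <- getUser("twitter")
test_user$getScreenName()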


That is it! Now I am ready to pull data from Twitter.


Pull Data From Twitter Using R

Now that we have a connection to Twitter set up, we are ready to do some searching. Searching is done with the searchTwitter function; as a first pass I will search for 2000 tweets containing the word ‘Seattle’.

goose <- searchTwitter("Seattle", n=2000, lang='en')
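Before doing anything fancy, a quick sanity check (a small sketch using the same getText accessor as later in this post) confirms the pull worked:

# how many tweets came back, and what does the first one look like?
length(goose)
goose[[1]]$getText()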

Great, now we have 2000 tweets containing the term “Seattle”. What do we do with all of this data? That is always the question! A good place to start is with a wordcloud. A wordcloud is an easy, intuitive way to visualize word frequency in text data. For this we will use the tm and wordcloud packages.

library(tm)
## Loading required package: NLP
library(wordcloud)
## Loading required package: RColorBrewer
library(RColorBrewer)


It is important to do some text pre-processing and cleanup before analysis, and this starts with extracting the text from the tweets and creating a Corpus. I found that non-ASCII characters in the tweets created a lot of problems when attempting to stem and convert the text to lowercase, so before moving forward I removed these problematic characters with the iconv function. The stemming and stopword processing in the next steps match lowercase words only, so I also converted the text to lowercase. Finally, I noticed that a lot of tweets contain links, so I used the gsub function to strip everything from the first ‘http’ to the end of each tweet.

# extract the text from each status object
goose.text <- sapply(goose, function(x) x$getText())

# remove non-ASCII characters
goose.text <- iconv(goose.text, "latin1", "ASCII", sub="")

# convert to lowercase so stemming and stopword removal match
goose.text <- tolower(goose.text)

# strip links: drop everything from the first 'http' to the end of the tweet
goose.text <- gsub('http.* *', '', goose.text)

goose.corp <- Corpus(VectorSource(goose.text))


Next I created a term-document matrix, cleaning up the text along the way: stemming, removing punctuation, dropping numbers and extra whitespace, and filtering out English stopwords plus ‘seattle’ and its stem ‘seattl’ (which would otherwise dominate every result):

goose.data <- TermDocumentMatrix(goose.corp,
                                 control = list(stemming = TRUE,
                                                removePunctuation = TRUE,
                                                stopwords = c("seattle", "seattl", "the", stopwords("english")),
                                                removeNumbers = TRUE,
                                                stripWhitespace = TRUE))
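Before plotting, it can be useful to peek at the matrix. For example, tm's findFreqTerms lists the most common terms (the cutoff of 50 below is an arbitrary choice):

# terms that occur at least 50 times across all tweets
findFreqTerms(goose.data, lowfreq = 50)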


The last step is to extract word frequencies and create a wordcloud plot.

goose.matrix <- as.matrix(goose.data)
word_freqs <- sort(rowSums(goose.matrix), decreasing=TRUE)
goose.df <- data.frame(word=names(word_freqs), freq=word_freqs)

wordcloud(goose.df$word, goose.df$freq, scale=c(5,0.5), random.order=FALSE, rot.per=0.35, use.r.layout=FALSE, colors=brewer.pal(8, "Dark2"))
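To save the cloud to disk instead of the plot window, the same call can be wrapped in a graphics device (a minimal sketch; the filename and dimensions are arbitrary):

png("seattle_wordcloud.png", width = 800, height = 800)
wordcloud(goose.df$word, goose.df$freq, scale=c(5,0.5), random.order=FALSE, rot.per=0.35, use.r.layout=FALSE, colors=brewer.pal(8, "Dark2"))
dev.off()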


The top five words pulled from these 2000 tweets:

head(word_freqs, 5)
##     atlutd washington    sounder        job        mls 
##        166        114         98         90         88

Interestingly, the top word is ‘atlutd’ and the third most frequent word is ‘sounder’, the stemmed form of ‘Sounders’. It turns out that ‘atlutd’ refers to Atlanta United, Atlanta's professional soccer team, and the Sounders are the home soccer team here in Seattle. Last night (2/22) these two teams played each other, so this result makes sense!
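As a quick follow-up, tm's findAssocs shows which terms tend to co-occur with a given word (a sketch; the 0.2 correlation cutoff is arbitrary):

# terms correlated with 'atlutd' across tweets, at a cutoff of 0.2
findAssocs(goose.data, "atlutd", 0.2)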


Anyway, this was an introduction to text mining Twitter streams in R. Hopefully I can use these new skills to do some more interesting things in the future.