Here I will try to capture Twitter topic streams and perform some basic text mining. This can be done using the Twitter R package (twitteR).


Create Twitter Account and Set Up twitteR

We will need two libraries for this analysis: twitteR and ROAuth. twitteR is an interface between R and the Twitter API, and ROAuth allows users to authenticate via OAuth to the server of their choice.

library(twitteR)
library(ROAuth)


A Twitter account is needed for this analysis, so I had to bite the bullet and create one. 20 million followers, here I come. Next, I created a new application (https://apps.twitter.com) and obtained the keys needed to allow R to access Twitter. Specifically, I needed:

  1. Consumer Key (API Key)

  2. Consumer Secret (API Secret)

Then click the “Create my access token” button to obtain:

  1. Access Token

  2. Access Token Secret


Now I am ready to set up twitteR. This is done by storing the keys and tokens from above as follows:

consumer_key <- "your_consumer_key"
consumer_secret <- "your_consumer_secret"
access_token <- "your_access_token"
access_secret <- "your_access_secret"

setup_twitter_oauth(consumer_key, consumer_secret, access_token, access_secret)
## [1] "Using direct authentication"


That is it! Now I am ready to pull data from Twitter.


Pull Data From Twitter Using R

Now that we have a connection to Twitter set up, we are ready to do some searching. Searching is done with the searchTwitter function, and as a first pass I will search for 2000 tweets containing the word ‘Seattle’.

goose <- searchTwitter("Seattle", n=2000, lang='en')
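Before going further, it helps to peek at what searchTwitter actually returned: a list of status objects. As a quick, optional check, the sketch below uses twitteR's twListToDF helper to flatten them into a data frame and print the text of the first few tweets; the name goose.raw is just something I made up for this inspection step.

# flatten the list of status objects into a data frame and peek at the text
goose.raw <- twListToDF(goose)
head(goose.raw$text, 3)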

Great, now we have 2000 tweets containing the term “Seattle”. What do we do with all of this data? That is always the question! A good place to start is with a wordcloud. A wordcloud is an easy, intuitive way to visualize word frequency in text data. For this we will use the tm and wordcloud packages.

library(tm)
## Loading required package: NLP
library(wordcloud)
## Loading required package: RColorBrewer
library(RColorBrewer)


It is important to do some text pre-processing and cleanup before analysis. This starts with extracting the text from the tweets and creating a Corpus. I found that non-ASCII characters in the tweets caused a lot of problems when stemming and converting the text to lowercase, so I first removed these problematic characters with the iconv function. The stemming and stopword removal in the next steps match lowercase words only, so I also converted the text to lowercase. Finally, I noticed that a lot of tweets contain links, so I stripped out everything from ‘http’ onward using the gsub function.

# extract the tweet text
goose.text <- sapply(goose, function(x) x$getText())

# remove non-ASCII characters
goose.text <- iconv(goose.text, "latin1", "ASCII", sub="")
goose.text <- tolower(goose.text)

# strip URLs (everything from 'http' onward)
goose.text <- gsub('http.* *', '', goose.text)

# build a corpus from the cleaned text
goose.corp <- Corpus(VectorSource(goose.text))
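As a quick sanity check (entirely optional), printing a few of the cleaned strings confirms that the links and odd characters are gone:

# inspect a few cleaned tweets
head(goose.text, 3)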


Next I created a term-document matrix, cleaning up the text along the way by stemming, removing punctuation and numbers, stripping whitespace, and dropping stopwords (including “seattle” and its stem “seattl”, which would otherwise dominate the counts):

goose.data <- TermDocumentMatrix(goose.corp,
                                 control = list(stemming = TRUE,
                                                removePunctuation = TRUE,
                                                stopwords = c("seattle", "seattl", "the", stopwords("english")),
                                                removeNumbers = TRUE,
                                                stripWhitespace = TRUE))
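To get a feel for the cleaned-up matrix before plotting, tm's findFreqTerms function can list the terms that occur at least some number of times; the cutoff of 25 below is just an arbitrary choice for illustration.

# terms with an overall frequency of at least 25
findFreqTerms(goose.data, lowfreq = 25)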


The last step is to extract word frequencies and create a wordcloud plot.

goose.matrix <- as.matrix(goose.data)
word_freqs <- sort(rowSums(goose.matrix), decreasing=TRUE)
goose.df <- data.frame(word=names(word_freqs), freq=word_freqs)

wordcloud(goose.df$word, goose.df$freq, scale=c(5,0.5), random.order=FALSE,
          rot.per=0.35, use.r.layout=FALSE, colors=brewer.pal(8, "Dark2"))
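If the cloud gets too crowded, the same frequency table can drive a simpler view. As a rough sketch using base R graphics, a bar chart of the ten most frequent terms looks like this:

# bar chart of the ten most frequent terms
barplot(goose.df$freq[1:10], names.arg = goose.df$word[1:10],
        las = 2, col = brewer.pal(8, "Dark2"))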