Recurrent Neural Networks

Recurrent neural networks (RNNs) are a variant of feed forward artificial neural networks that have become exceedingly popular and have been used for many learning tasks related to genomics, NLP, and image classification. The distinguishing feature of RNNs is that they possess memory, and here I will attempt to explain what this means and why it is useful.

To understand RNNs it is first useful to revisit standard feed forward ANNs. This type of neural network feeds feature information (via matrix multiplication) through a network of layers consisting of nodes. Each node receives input from the nodes in the previous layer and passes an output to all of the nodes in the next layer, never moving backwards or sideways within a layer (hence the description feed-forward). When each node in each layer is connected to each node in the next layer we call this a fully connected network. This is illustrated in the diagram below.

On a side note, you can find all of the TensorFlow code used for this project on GitHub:
https://github.com/JTDean123/tolstoyLSTM

(check out a previous post for more details: http://jasontdean.com/python/ann.html)

In [4]:
from IPython.display import Image
In [11]:
Image("ANN.png")
Out[11]:
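
As a quick refresher, a small fully connected feed forward network like the one in the diagram can be written in a few lines with tf.keras. The layer sizes below are arbitrary and just for illustration; this is a sketch, not the network in the figure or the code from the repo.

In [ ]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# a toy fully connected feed forward network: every node in a layer connects
# to every node in the next layer, and information only moves forward
ffn = Sequential([
    Dense(4, activation='relu', input_shape=(3,)),   # hidden layer (4 nodes)
    Dense(2, activation='softmax')                   # output layer (2 nodes)
])
ffn.summary()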

The model above learns via the inputs that are fed to it at a particular moment in time. This seems obvious, but the implication is that when the model is tasked with making a prediction it does not 'remember' the data that it has just seen. Put another way, a decision made at time (t) does not influence a decision made at time (t+1), and this is suboptimal when classifying observations that occur in sequences (like time-dependent data). For example, a feed forward ANN trying to classify an apple will not know whether or not the previous image was an orange. Additionally, ANNs operate over a single input and generate a single output.

Here I will show that the architecture of RNNs allows us to overcome these limitations: RNNs allow both for operations over multiple inputs in a sequence-dependent fashion and for a network that retains memory of previous events. A figure will make this clearer (at least to me!). Take, for example, the potential RNN network structures shown below.

In [3]:
# figure from http://karpathy.github.io/2015/05/21/rnn-effectiveness/
Image('rnn.png')
Out[3]:

A standard ANN is shown on the far left (one to one). This model structure generates a fixed-size output from a fixed-size input. Next turn your attention to the four networks shown on the right, particularly to the fact that the hidden layers, shown in green, are connected and that the three networks on the right take multiple inputs. The usage of the many-to-many model is better explained (like pretty much everything) with an example. Imagine the first input, the red box on the bottom left, as the letter 'A', the second as 'B', and the third as 'C'. Since the model first sees 'A' and passes this information along to the hidden node receiving 'B' and then 'C', the model is able to learn the sequence A-B-C, rather than just the individual letters, and predict (if its training has included the alphabet) 'D-E-F'.
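
To make the many-to-many picture a bit more concrete, here is a minimal tf.keras sketch of a network that emits an output at every timestep of an input sequence. The vocabulary size and layer width are made up for illustration, and this is not the model used later in the post.

In [ ]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import SimpleRNN, Dense, TimeDistributed

vocab_size = 27   # e.g. 26 letters plus a space (illustrative)

# many-to-many: return_sequences=True makes the RNN carry its hidden state
# forward in time AND emit an output at every timestep
many_to_many = Sequential([
    SimpleRNN(16, return_sequences=True, input_shape=(None, vocab_size)),
    TimeDistributed(Dense(vocab_size, activation='softmax'))
])
many_to_many.summary()
# input:  (batch, timesteps, vocab_size) one-hot characters, e.g. A-B-C
# output: (batch, timesteps, vocab_size) a prediction at every position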

In theory a large RNN has access to, or memory of, long sequences of inputs. However, in practice a standard RNN will not be able to reach far back enough to (for example) recall sequences of words that occurred seven paragraphs ago. Thus the usage of RNNs was substantially limited until a type of RNN termed Long Short-Term Memory (LSTM) was developed. I will refer you to this excellent post about the underlying mechanics of an LSTM, as the author explains it much better than I ever could!: http://colah.github.io/posts/2015-08-Understanding-LSTMs/

Here I will build an LSTM model to generate text at the single-character level. Specifically, I will train an LSTM flavor of an RNN on Tolstoy's classic Anna Karenina and use this model to generate new text. I chose this text for this project because:

- this book is free and easily accessible
- neural networks require lots of data to train, and this book is long: 1,957,361 characters
- I really like this book and thought it would be fun to experiment with

So, how exactly do we teach a model to learn how Tolstoy writes? Rather than jumping straight into the mathematics of recurrent neural networks, it is best to start with a high-level view of the model architecture, so let's start with a simple example. Consider the 'sentence' below:

i love you

This simple sentence contains ten characters (including spaces). To construct this sentence, a model must learn patterns and sequences at the single-character level. We want a trained model that is able to make the following predictions:

In [10]:
Image('fig1.png')
Out[10]:

So, how do we train a model to learn that 'u' follows the characters 'i love yo'? As with all machine learning problems, how the input data is structured is key. For this application we will build a model that accepts one character in each cell of the LSTM, as shown below.

In [13]:
Image('fig2.png')
Out[13]:

In the example above we are feeding the model a sequence of length nine and generating a single output. We next need to convert characters to numeric form via one-hot encoding. A one-hot encoded observation is a vector consisting of zeros, with the exception of a single one at the index corresponding to the class. In the example above a character can be a letter of the alphabet or a space, giving 27 possibilities. This means that a one-hot encoded character (for this example) can be represented as a 27x1 column vector. One-hot encoding of the letter 'a' is shown below.

In [14]:
Image('fig3.png')
Out[14]:
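
In code, one-hot encoding the toy example might look something like the sketch below. The index assigned to each character is arbitrary; here I simply use 'a'-'z' followed by a space.

In [ ]:
import numpy as np

# a 27-character 'vocabulary' for the toy example: the alphabet plus a space
vocab = list('abcdefghijklmnopqrstuvwxyz') + [' ']
char_to_idx = {c: i for i, c in enumerate(vocab)}

def one_hot(ch):
    # a vector of zeros with a single one at the character's index
    v = np.zeros(len(vocab))
    v[char_to_idx[ch]] = 1.0
    return v

# the training pair from the figure: the sequence 'i love yo' and the target 'u'
X = np.stack([one_hot(c) for c in 'i love yo'])   # shape (9, 27)
y = one_hot('u')                                  # shape (27,)
print(X.shape, y.shape)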

Once each character is one-hot encoded it can be fed to an individual cell of an LSTM in a similar fashion to how data is fed to a standard feed forward neural network. The character 'a', after one-hot encoding, is fed to a mini two-node feed forward neural network as shown below.

In [9]:
Image('fig4.png')
Out[9]:
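
Numerically, that mini two-node network is just a matrix multiplication plus a nonlinearity. A tiny sketch with made-up weights, using the one_hot helper from the cell above:

In [ ]:
# feed the one-hot encoded 'a' through a mini two-node layer
# (weights are random here; during training they would be learned)
W = np.random.randn(2, 27) * 0.01   # 2 nodes, each connected to all 27 inputs
b = np.zeros(2)

x = one_hot('a')                    # the 27x1 one-hot vector from above
h = np.tanh(W @ x + b)              # output of the two nodes
print(h)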

In essence, a single RNN cell, depicted by an orange box in figure 2, processes a single character as shown above. In contrast to a single feed forward neural network, however, the outputs from each LSTM cell feed into an adjacent cell, allowing the network to learn sequence-dependent information. For this project we add an additional layer of complexity and stack multiple LSTM layers.

In [10]:
Image('fig5.png')
Out[10]:

That's it! With this, we have the framework and the tools to build a stacked LSTM for text prediction. For this project, generating text from a model trained on Anna Karenina, I used a sentence length of 50, a 3-layer stacked LSTM with 500 nodes per cell, 70% dropout during training, and the Adam optimizer. Additionally, I found that I only needed to train on ~300k characters for a few hundred epochs (with a batch size of 128) to begin generating somewhat coherent text.
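
The full TensorFlow code is in the GitHub repo linked above; as a rough tf.keras sketch (not the exact project code), the architecture described in the last paragraph could look like the cell below. Here I read '70% dropout' as keeping 70% of activations, i.e. a dropout rate of 0.3, and X/y stand for the one-hot encoded 50-character sequences and next-character targets.

In [ ]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dropout, Dense

seq_length = 50    # characters per input 'sentence'
vocab_size = 49    # unique characters in the training text
n_nodes = 500      # nodes per LSTM cell

model = Sequential([
    LSTM(n_nodes, return_sequences=True, input_shape=(seq_length, vocab_size)),
    Dropout(0.3),
    LSTM(n_nodes, return_sequences=True),
    Dropout(0.3),
    LSTM(n_nodes),                       # final layer returns only the last output
    Dropout(0.3),
    Dense(vocab_size, activation='softmax')
])

model.compile(loss='categorical_crossentropy', optimizer='adam')
# model.fit(X, y, batch_size=128, epochs=200)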

These 352,356 training characters contained 49 unique characters. To generate text, a random 50-character sentence was selected from the text and fed to the model as a seed; a sketch of this sampling loop is shown after the seed below. It was interesting to watch the generated text improve as training progressed. For example, consider the following random seed (remember we are defining a sentence as 50 characters):

ppiness, he walked with a slight swing on each leg
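
The sampling loop itself is straightforward: one-hot encode the 50-character seed, predict the next character, append it, slide the window forward by one, and repeat. A sketch, assuming the model above and one-hot helpers built from the full 49-character vocabulary of the book (not the 27-character toy example):

In [ ]:
import numpy as np

def generate(seed, n_chars=1000):
    # seed must be exactly seq_length (50) characters long
    window, generated = seed, []
    for _ in range(n_chars):
        x = np.stack([one_hot(c) for c in window])[np.newaxis, :, :]  # (1, 50, vocab)
        probs = model.predict(x, verbose=0)[0]
        next_char = vocab[np.argmax(probs)]   # greedy; sampling from probs also works
        generated.append(next_char)
        window = window[1:] + next_char       # slide the window forward by one
    return seed + ''.join(generated)

# print(generate('ppiness, he walked with a slight swing on each leg'))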

sample of predicted text after epoch 1:

tond sot he tav ot an wad tit pore wod ti tar the to sis and not tot an te te toter on ter an the sor tev won tit nre tor tor nor at ton fao tin te des no no she tet tor te min dhe tor the der tit ter an se tos on tov op she inde to ter ter hin bite tar tot and tos tot nhe tou larite tot tir cors an tor te te lhe wa tar nhe let te the tin tee tet nhes ande tar te ton the an shet an on tin tot the in tor te tov nhe on not hee heun ter the ner nhe sher ser soter not no nhat teet an tas he at sit an tir tors ans sin ther or sis to ner de nhe no tin thi ther toss so nte te to the tins sar teut tit rint to thas tit tho lo dhe te te der an tin te tou het whe tor nis te ne in ton ter the tit son tore anish fare in on tot and to tin to ne ter nhe tou to nher note in sar tom tan to ne tted an tat tir tir the ser an ter and tir an to cor tore sir tass as the non te ines wote lre nhe nite tot ton thr tor tites the tit an te har bo ter tor ons mon an tot tot the dit she tiun se dire sir to the he

Although the model has not learned proper spelling or grammar, it is amazing (to me at least) that it has learned spacing and is generating words that appear to be a suitable length.

sample of predicted text after epoch 50:

"i have even to say and more than ever we must be aware that it was a man whos fond of it.

"what a force wont your position is in that about it. i am forgive it to the same time, and you mistel go away, and i want to do it all the thing i forget that its a minuses, the only just the better than she had not been the feeling of a man whos took that he was saying, and was her face.

"anna! well, what i cant be one of the favorite, most letting in her mother, and began to concolled a hand with him. but it was impossible, and that he was consisted in the light in the conversation.

anna had never come in. "though it doess the companions is that bed myself in

chapter 13

came in to her.

"i dont want to be disasted and see how i was to be absorbed as he asked, bent herepely simple and sociehis favilatina.

"like me?"

"i dont believe it, and i shall be no interesting in the mazurka in her mother.

"i dont say to her."

sample of predicted text after epoch 200:

alexey alexandrovitch was already in dissciesce, and he had to go out with her simple fan. "but what am i to be doing for a bad for the feeling, i see your wife and i may they are presenting together."

"what is it?" said stepan arkadyevitch, "youre weary for him."

"i dont understand. i expect his mothers are always dull".

"good-bye till at once, and short you though i made up of light how i could never the world", said levin, prince stepan arkadyevitch smiled.

"oh, what a gift for him, a list from the oblonskys house."

By this point the model has learned rather complex names, and its grammar is likely better than mine.

Conclusions

In conclusion, this stacked LSTM was able to generate (semi) coherent text after just 200 epochs. Impressively, the model learned names and some grammar, and generated sentences with correct spacing and formatting. The generated text is admittedly not perfect, and a straightforward way to address this is to train for more epochs.