Language Modelling Example of RNN


In order to illustrate the workings of the RNN, we will use a toy example of a single sequence defined on a vocabulary of four words. Consider the sentence:

The cat chased the mouse.

In this case, we have a lexicon of four words, which is {“the,” “cat,” “chased,” “mouse”}. In the figure below, we have shown the probabilistic prediction of the next word at each of the time-stamps from 1 to 4. Ideally, we would like the probability of the next word to be predicted correctly from the probabilities of the previous words. Each one-hot encoded input vector $\bar{x}_t$ has length four, in which only one bit is 1 and the remaining bits are 0s. The main flexibility here is in the dimensionality $p$ of the hidden representation, which we set to 2 in this case. As a result, the matrix $W_{xh}$ will be a 2 × 4 matrix, so that it maps a one-hot encoded input vector into a hidden vector $\bar{h}_t$ of size 2. As a practical matter, each column of $W_{xh}$ corresponds to one of the four words, and one of these columns is copied by the expression $W_{xh}\bar{x}_t$. This expression is added to $W_{hh}\bar{h}_{t-1}$ and then transformed with the tanh function to produce the hidden state $\bar{h}_t$. The final output $\bar{y}_t$ is given by $W_{hy}\bar{h}_t$. Note that the matrices $W_{hh}$ and $W_{hy}$ are of sizes 2 × 2 and 4 × 2, respectively.
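To make the shapes concrete, here is a minimal NumPy sketch of one forward pass of this RNN over the toy sentence. The weight values are random placeholders rather than the trained weights behind the figure, and names such as `rnn_step` and `one_hot` are ours, used only for illustration.

```python
import numpy as np

# Toy lexicon and one-hot encoding
vocab = ["the", "cat", "chased", "mouse"]
word_to_index = {w: i for i, w in enumerate(vocab)}

def one_hot(word, size=4):
    x = np.zeros(size)
    x[word_to_index[word]] = 1.0
    return x

# Hidden dimensionality p = 2, vocabulary size d = 4
p, d = 2, 4
rng = np.random.default_rng(0)
W_xh = rng.standard_normal((p, d))   # 2 x 4: maps one-hot input to hidden space
W_hh = rng.standard_normal((p, p))   # 2 x 2: recurrent hidden-to-hidden weights
W_hy = rng.standard_normal((d, p))   # 4 x 2: maps hidden state to output scores

def rnn_step(x_t, h_prev):
    # h_t = tanh(W_xh x_t + W_hh h_{t-1}); since x_t is one-hot,
    # W_xh @ x_t simply copies the column of W_xh for the current word
    h_t = np.tanh(W_xh @ x_t + W_hh @ h_prev)
    y_t = W_hy @ h_t                 # continuous scores, one per vocabulary word
    return h_t, y_t

# Forward pass over the inputs at time-stamps 1 to 4: "the cat chased the"
h = np.zeros(p)
for word in ["the", "cat", "chased", "the"]:
    h, y = rnn_step(one_hot(word), h)
    print(word, "->", y)             # scores for predicting the next word
```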

In this case, the outputs are continuous values (not probabilities) in which larger values indicate a greater likelihood of presence. These continuous values are eventually converted to probabilities with the softmax function, and therefore one can treat them as surrogates for log probabilities. The word “cat” is predicted in the first time-stamp with a value of 1.3, although this value seems to be (incorrectly) outstripped by “mouse,” for which the corresponding value is 1.7. However, the word “chased” seems to be predicted correctly at the next time-stamp. As in all learning algorithms, one cannot hope to predict every value exactly, and such errors are more likely in the early iterations of the backpropagation algorithm. However, as the network is repeatedly trained over multiple iterations, it makes fewer errors over the training data.
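As a small illustration of this softmax conversion, the sketch below turns output scores into next-word probabilities. Only the values 1.3 for “cat” and 1.7 for “mouse” come from the description above; the scores for “the” and “chased” are assumed purely to complete the example.

```python
import numpy as np

def softmax(scores):
    # Subtract the max for numerical stability before exponentiating
    z = np.exp(scores - np.max(scores))
    return z / z.sum()

vocab = ["the", "cat", "chased", "mouse"]
# First time-stamp scores: 1.3 ("cat") and 1.7 ("mouse") as described above;
# the remaining two values are placeholders.
scores = np.array([-1.0, 1.3, -0.5, 1.7])

probs = softmax(scores)
for w, s, pr in zip(vocab, scores, probs):
    print(f"{w:7s} score={s:5.2f} prob={pr:.3f}")
```

Running this assigns the highest probability to “mouse,” which matches the (incorrect) prediction described at the first time-stamp.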
