The Architecture of Recurrent Neural Networks
The simplest recurrent neural network is shown in Figure (a) below. A key point is the presence of the self-loop in Figure (a), which causes the hidden state of the neural network to change after the input of each word in the sequence. In practice, one only works with sequences of finite length, and it makes sense to unfold the loop into a “time-layered” network that looks more like a feed-forward network. This network is shown in Figure (b). Note that in this case we have a different node for the hidden state at each time-stamp, and the self-loop has been unfurled into a feed-forward network. This representation is mathematically equivalent to Figure (a), but it is much easier to comprehend because of its similarity to a traditional network. The weight matrices in different temporal layers are shared to ensure that the same function is used at each time-stamp. The annotations $W_{xh}$, $W_{hh}$, and $W_{hy}$ of the weight matrices in Figure (b) make the sharing evident.
It is noteworthy that the figure above shows a case in which each time-stamp has an input, output, and hidden unit. In practice, it is possible for either the input or the output units to be missing at any particular time-stamp. Examples of cases with missing inputs and outputs are shown in the figure below. The choice of missing inputs and outputs depends on the specific application at hand. For example, in a time-series forecasting application, we might need outputs at each time-stamp in order to predict the next value in the time series. On the other hand, in a sequence-classification application, we might only need a single output label at the end of the sequence corresponding to its class. In general, it is possible for any subset of inputs or outputs to be missing in a particular application.
The particular architecture shown in the figure above is well suited to language modeling. A language model is a well-known concept in natural language processing that predicts the next word, given the previous history of words. Given a sequence of words, their one-hot encodings are fed one at a time to the neural network in Figure (a). This temporal process is equivalent to feeding the individual words to the inputs at the relevant time-stamps in Figure (b). A time-stamp corresponds to the position in the sequence, which starts at 0 (or 1) and increases by 1 as one moves forward in the sequence by one unit. In the setting of language modeling, the output is a vector of probabilities predicted for the next word in the sequence. For example, consider the sentence:
The cat chased the mouse.
When the word “The” is input, the output will be a vector of probabilities over the entire lexicon that includes the word “cat,” and when the word “cat” is input, we will again get a vector of probabilities predicting the next word. This is, of course, the classical definition of a language model in which the probability of a word is estimated based on the immediate history of previous words. In general, the input vector at time $t$ (e.g., the one-hot encoded vector of the $t$th word) is $\bar{x}_t$, the hidden state at time $t$ is $\bar{h}_t$, and the output vector at time $t$ (e.g., the predicted probabilities of the $(t+1)$th word) is $\bar{y}_t$. Both $\bar{x}_t$ and $\bar{y}_t$ are $d$-dimensional for a lexicon of size $d$. The hidden vector $\bar{h}_t$ is $p$-dimensional, where $p$ regulates the complexity of the embedding. For the purpose of discussion, we will assume that all these vectors are column vectors. In many applications, such as classification, the output is not produced at each time unit but is only triggered at the last time-stamp at the end of the sentence. Although output and input units may be present only at a subset of the time-stamps, we examine the simple case in which they are present at all time-stamps. Then, the hidden state at time $t$ is given by a function of the input vector at time $t$ and the hidden vector at time $(t-1)$:
$\bar{h}_t=f(\bar{h}_{t-1},\bar{x}_t)$
This function is defined with the use of weight matrices and activation functions (as used by all neural networks for learning), and the same weights are used at each time-stamp. Therefore, even though the hidden state evolves over time, the weights and the underlying function $f(\cdot, \cdot)$ remain fixed over all time-stamps (i.e., sequential elements) after the neural network has been trained. A separate function $\bar{y}_t = g(\bar{h}_t)$ is used to learn the output probabilities from the hidden states.
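Before making $f(\cdot, \cdot)$ and $g(\cdot)$ concrete, the following minimal sketch (in Python with NumPy; the toy lexicon built from the example sentence is an illustrative assumption, not something defined in the text) shows how the words can be turned into the one-hot column vectors $\bar{x}_t$ that are fed to the network, one per time-stamp.

```python
import numpy as np

# Illustrative lexicon built from the example sentence (lowercased for
# simplicity); in a real language model the lexicon would cover the corpus.
sentence = ["the", "cat", "chased", "the", "mouse"]
vocab = sorted(set(sentence))
word_to_index = {w: i for i, w in enumerate(vocab)}
d = len(vocab)                       # lexicon size d

def one_hot(word):
    """Return the d-dimensional one-hot column vector x_t for a word."""
    x = np.zeros((d, 1))
    x[word_to_index[word]] = 1.0
    return x

# One input column vector per time-stamp t = 1 ... 5.
inputs = [one_hot(w) for w in sentence]
```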
Next, we describe the functions $f(\cdot, \cdot)$ and $g(\cdot)$ more concretely. We define a $p \times d$ input-hidden matrix $W_{xh}$, a $p \times p$ hidden-hidden matrix $W_{hh}$, and a $d \times p$ hidden-output matrix $W_{hy}$. Then, one can expand the above equation and also write the condition for the outputs as follows:
$\bar{h}_t=\tanh(W_{xh}\bar{x}_t+W_{hh}\bar{h}_{t-1})$
$\bar{y}_t=W_{hy}\bar{h}_t$
Here, the “tanh” notation is used in a relaxed way, in the sense that the function is applied to the $p$-dimensional column vector in an element-wise fashion to create a $p$-dimensional vector with each element in $[-1, 1]$. Throughout this section, this relaxed notation will be used for several activation functions such as tanh and sigmoid. At the very first time-stamp, $\bar{h}_{t-1}$ is assumed to be some default constant vector (such as 0), because there is no input from the hidden layer at the beginning of a sentence. One can also learn this vector, if desired. Although the hidden states change at each time-stamp, the weight matrices stay fixed over the various time-stamps. Note that the output vector $\bar{y}_t$ is a set of continuous values with the same dimensionality as the lexicon. A softmax layer is applied on top of $\bar{y}_t$ so that the results can be interpreted as probabilities. The $p$-dimensional output $\bar{h}_t$ of the hidden layer at the end of a text segment of $t$ words yields its embedding, and the $p$-dimensional columns of $W_{xh}$ yield the embeddings of individual words. The latter provides an alternative to word2vec embeddings.
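The two equations above translate almost directly into code. The sketch below is a minimal illustration, assuming NumPy, small randomly initialized weight matrices, and toy sizes $d = 4$ (matching the lexicon of the earlier sketch) and $p = 3$; it computes a single forward step followed by the softmax that converts $\bar{y}_t$ into probabilities.

```python
import numpy as np

d, p = 4, 3                                # illustrative lexicon and hidden sizes
rng = np.random.default_rng(0)
W_xh = rng.standard_normal((p, d)) * 0.1   # p x d input-hidden matrix
W_hh = rng.standard_normal((p, p)) * 0.1   # p x p hidden-hidden matrix
W_hy = rng.standard_normal((d, p)) * 0.1   # d x p hidden-output matrix

def softmax(v):
    """Convert the d-dimensional output vector into probabilities."""
    e = np.exp(v - v.max())
    return e / e.sum()

def rnn_step(x_t, h_prev):
    """One time-stamp: h_t = tanh(W_xh x_t + W_hh h_{t-1}), y_t = W_hy h_t."""
    h_t = np.tanh(W_xh @ x_t + W_hh @ h_prev)
    y_t = W_hy @ h_t
    return h_t, softmax(y_t)

h0 = np.zeros((p, 1))                 # default constant initial hidden state
x1 = np.zeros((d, 1)); x1[0] = 1.0    # a one-hot input for the first word
h1, probs1 = rnn_step(x1, h0)         # probabilities over the lexicon for word 2
```

In a trained model the three matrices would be learned; they are left at random initial values here only to keep the sketch runnable.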
Because of the recursive nature of the above recurrence, the recurrent network has the ability to compute a function of variable-length inputs. In other words, one can expand the recurrence to define the function for $\bar{h}_t$ in terms of the $t$ inputs. For example, starting at $\bar{h}_0$, which is typically fixed to some constant vector (such as the zero vector), we have $\bar{h}_1 = f(\bar{h}_0, \bar{x}_1)$ and $\bar{h}_2 = f(f(\bar{h}_0, \bar{x}_1), \bar{x}_2)$. Note that $\bar{h}_1$ is a function of only $\bar{x}_1$, whereas $\bar{h}_2$ is a function of both $\bar{x}_1$ and $\bar{x}_2$. In general, $\bar{h}_t$ is a function of $\bar{x}_1 \ldots \bar{x}_t$. Since the output $\bar{y}_t$ is a function of $\bar{h}_t$, these properties are inherited by $\bar{y}_t$ as well. In general, we can write the following:
$\bar{y}_t = F_t(\bar{x}_1, \bar{x}_2, \ldots, \bar{x}_t)$
Note that the function $F_t(\cdot)$ varies with the value of $t$, although its relationship to the immediately previous state is always the same. Such an approach is particularly useful for variable-length inputs. This setting occurs often in domains like text, in which the sentences are of variable length. For example, in a language modeling application, the function $F_t(\cdot)$ indicates the probability of the next word, taking into account all the previous words in the sentence.
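As a usage note, the same step function can simply be applied in a loop, which is all that the unrolled network in Figure (b) does. The snippet below continues the two sketches above (it reuses `one_hot`, `rnn_step`, and the zero initial state `h0`, so it is not standalone) and shows that identical weights handle sequences of different lengths, making $\bar{y}_t$ a function $F_t$ of all inputs $\bar{x}_1 \ldots \bar{x}_t$.

```python
def forward(words):
    """Unroll the recurrence over a variable-length word sequence."""
    h = h0
    outputs = []
    for w in words:                      # one time-stamp per word
        h, y = rnn_step(one_hot(w), h)   # h_t depends on x_1 ... x_t
        outputs.append(y)                # y_t = F_t(x_1, ..., x_t)
    return outputs

probs_short = forward(["the", "cat"])                            # length 2
probs_full = forward(["the", "cat", "chased", "the", "mouse"])   # length 5
```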