Multilayer Recurrent Networks (Deep Recurrent Networks)


In all the aforementioned applications, a single-layer RNN architecture is used for ease in understanding. However, in practical applications, a multilayer architecture is used in order to build models of greater complexity. Furthermore, this multilayer architecture can be used in combination with advanced variations of the RNN, such as the LSTM architecture or the gated recurrent unit. These advanced architectures are introduced in later sections.

An example of a deep network containing three layers is shown in the figure below. Note that nodes in higher-level layers receive input from those in lower-level layers. The relationships among the hidden states can be generalized directly from the single-layer network. First, we rewrite the recurrence equation of the hidden layer (for single-layer networks) in a form that can be adapted easily to multilayer networks:

$\bar{h}_t = \tanh\left( W \begin{bmatrix} \bar{x}_t \\ \bar{h}_{t-1} \end{bmatrix} \right)$

Here, we have put together a larger matrix $W = [W_{xh},W_{hh}]$ that includes the columns of $W_{xh}$ and $W_{hh}$. Similarly, we have created a larger column vector that stacks up the state vector in the first hidden layer at time $t − 1$ and the input vector at time $t$. In order to distinguish between the hidden nodes for the upper-level layers, let us add an additional superscript to the hidden state and denote the vector for the hidden states at time-stamp $t$ and layer $k$ by $\bar{h}_t^{(k)}$.
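The following is a minimal NumPy sketch of this stacked form of the single-layer recurrence. The dimensions ($d = 4$, $p = 3$) and the random weights are purely illustrative assumptions chosen for the example:

```python
import numpy as np

# Sketch of the stacked single-layer recurrence
#   h_t = tanh( W [x_t ; h_{t-1}] ),  with W = [W_xh, W_hh]
# Illustrative sizes: d = input size, p = hidden-state size.
d, p = 4, 3
rng = np.random.default_rng(0)

W_xh = rng.standard_normal((p, d))         # input-to-hidden weights
W_hh = rng.standard_normal((p, p))         # hidden-to-hidden weights
W = np.concatenate([W_xh, W_hh], axis=1)   # stacked p x (d + p) matrix

x_t = rng.standard_normal(d)               # input vector at time-stamp t
h_prev = np.zeros(p)                       # hidden state at time-stamp t - 1

# Stack the input and the previous hidden state into one vector,
# then apply the single matrix W followed by tanh.
h_t = np.tanh(W @ np.concatenate([x_t, h_prev]))
print(h_t.shape)   # (p,)
```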

Similarly, let the weight matrix for the $k$th hidden layer be denoted by $W^{(k)}$. It is noteworthy that the weights are shared across different time-stamps (as in the single-layer recurrent network), but they are not shared across different layers. Therefore, the weights are superscripted by the layer index $k$ in $W^{(k)}$. The first hidden layer is special because it receives input both from the input layer at the current time-stamp and from the adjacent hidden state at the previous time-stamp. Therefore, the matrix $W^{(k)}$ has size $p \times (d+p)$ only for the first layer (i.e., $k = 1$), where $d$ is the size of the input vector $\bar{x}_t$ and $p$ is the size of the hidden vector $\bar{h}_t$. Note that $d$ will typically not be the same as $p$. The recurrence condition for the first layer has already been shown above by setting $W^{(1)} = W$. Therefore, let us focus on the hidden layers $k$ for $k \geq 2$. It turns out that the recurrence condition for the layers with $k \geq 2$ has a very similar form to the equation shown above:
$\bar{h}_t^{(k)} = \tanh\left( W^{(k)} \begin{bmatrix} \bar{h}_t^{(k-1)} \\ \bar{h}_{t-1}^{(k)} \end{bmatrix} \right)$

In this case, the size of the matrix $W^{(k)}$ is $p \times (p + p) = p \times 2p$. The transformation from the hidden layer to the output layer remains the same as in single-layer networks. It is easy to see that this approach is a straightforward multilayer generalization of the single-layer case. It is common to use two or three layers in practical applications. In order to use a larger number of layers, it is important to have access to more training data in order to avoid overfitting.
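To make the layer-wise recurrences concrete, here is a short NumPy sketch of a forward pass through such a deep recurrent network. The function name deep_rnn_forward and the sizes ($T = 5$, $d = 4$, $p = 3$, $K = 3$ layers) are illustrative assumptions, not part of any particular library:

```python
import numpy as np

def deep_rnn_forward(X, weights, p):
    """Sketch of a forward pass through a K-layer recurrent network.

    X       : array of shape (T, d), one input vector per time-stamp
    weights : list of K matrices; weights[0] has shape (p, d + p),
              weights[k] for k >= 1 has shape (p, 2p)
    p       : hidden-state size (the same for all layers in this sketch)
    """
    T, d = X.shape
    K = len(weights)
    # h[k] holds the hidden state of layer k at the previous time-stamp.
    h = [np.zeros(p) for _ in range(K)]
    outputs = []
    for t in range(T):
        # First layer: combines the current input with its own previous state.
        h[0] = np.tanh(weights[0] @ np.concatenate([X[t], h[0]]))
        # Higher layers: combine the current state of the layer below
        # with their own previous state.
        for k in range(1, K):
            h[k] = np.tanh(weights[k] @ np.concatenate([h[k - 1], h[k]]))
        outputs.append(h[-1].copy())   # top-layer state feeds the output layer
    return np.stack(outputs)

# Example usage with illustrative sizes: T = 5 time-stamps, d = 4, p = 3, K = 3 layers.
rng = np.random.default_rng(0)
d, p, K, T = 4, 3, 3, 5
weights = [rng.standard_normal((p, d + p))] + \
          [rng.standard_normal((p, 2 * p)) for _ in range(K - 1)]
X = rng.standard_normal((T, d))
print(deep_rnn_forward(X, weights, p).shape)   # (5, 3)
```

Note how the first layer concatenates the current input with its own previous state (a $(d+p)$-dimensional vector), while every higher layer concatenates the current state of the layer below with its own previous state (a $2p$-dimensional vector), matching the matrix sizes discussed above.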
