Gated Recurrent Units (GRUs)

The Gated Recurrent Unit (GRU) can be viewed as a simplification of the LSTM that does not use an explicit cell state. Another difference is that the LSTM directly controls the amount of information changed in the hidden state using separate forget and output gates, whereas the GRU uses a single reset gate to achieve the same goal. However, the basic idea in the GRU is quite similar to that of the LSTM in terms of how it partially resets the hidden states. It was introduced by Kyunghyun Cho et al. in 2014.


A GRU does not have a separate cell state ($C_t$); it only has a hidden state ($H_t$). Due to this simpler architecture, GRUs are faster to train.

GRUs are quite similar to LSTMs, except that they have only two gates: a reset gate and an update gate. The reset gate determines how to combine the new input with the previous memory, while the update gate determines how much of the previous state to keep. The update gate in a GRU plays the combined role of the input and forget gates in an LSTM. A GRU has no output gate and does not apply a second non-linearity before producing the output.
Update Gate ($z$): It determines how much of the past knowledge needs to be passed along into the future. It plays the combined role of the input and forget gates of an LSTM recurrent unit.
Reset Gate ($r$): It determines how much of the past knowledge to forget when forming the new candidate state.
Current Memory Gate ($\tilde{H}_t$): It is often overlooked during a typical discussion of the Gated Recurrent Unit network. It is computed with the help of the reset gate, just as the input modulation gate is a sub-part of the input gate in an LSTM, and it applies a tanh non-linearity that keeps the candidate zero-centered. Routing it through the reset gate also reduces the effect that previous information has on the current information being passed into the future.
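
To make the difference in state concrete, the following is a minimal sketch contrasting the cell interfaces (assuming PyTorch is available; the tensor sizes are arbitrary illustrative choices): a GRU cell carries only a hidden state, while an LSTM cell additionally carries a cell state.

```python
import torch
import torch.nn as nn

x = torch.randn(8, 10)                  # a batch of 8 inputs of size 10
h = torch.zeros(8, 20)                  # hidden state only

gru_cell = nn.GRUCell(input_size=10, hidden_size=20)
h_next = gru_cell(x, h)                 # GRU returns just the new hidden state

lstm_cell = nn.LSTMCell(input_size=10, hidden_size=20)
c = torch.zeros(8, 20)                  # the LSTM additionally carries a cell state
h_next_lstm, c_next = lstm_cell(x, (h, c))
```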

Mathematically, for a given time step $t$, suppose that the input is a minibatch $X_t∈R^{n×d}$ (number of examples: $n$, number of inputs: $d$) and the hidden state of the previous time step is $H_{t−1}∈R^{n×h}$ (number of hidden units: $h$). Then, the reset gate $R_t∈R^{n×h}$ and update gate $Z_t∈R^{n×h}$ are computed as follows:

$R_t=\sigma(X_tW_{xr}+H_{t-1}W_{hr})$
$Z_t=\sigma(X_tW_{xz}+H_{t-1}W_{hz})$

where $W_{xr},W_{xz}∈R^{d×h}$ and $W_{hr},W_{hz}∈R^{h×h}$ are weight parameters (bias terms are omitted here for brevity).
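
As a quick check of shapes, the gate equations can be evaluated directly; here is a small NumPy sketch with arbitrary sizes (the variable names simply mirror the symbols above):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

n, d, h = 4, 3, 5                      # batch size, input size, hidden size
rng = np.random.default_rng(0)
X_t = rng.normal(size=(n, d))          # minibatch input at time step t
H_prev = np.zeros((n, h))              # H_{t-1}
W_xr, W_hr = rng.normal(size=(d, h)), rng.normal(size=(h, h))
W_xz, W_hz = rng.normal(size=(d, h)), rng.normal(size=(h, h))

R_t = sigmoid(X_t @ W_xr + H_prev @ W_hr)   # reset gate, entries in (0, 1)
Z_t = sigmoid(X_t @ W_xz + H_prev @ W_hz)   # update gate, entries in (0, 1)
print(R_t.shape, Z_t.shape)                 # both (4, 5), i.e. (n, h)
```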

Next, the candidate hidden state $\tilde{H}_t∈R^{n×h}$ at time step $t$ is computed by applying the reset gate to the previous state:
$\tilde{H}_t=\tanh(X_t W_{xh}+(R_t\odot H_{t−1})W_{hh})$
where $W_{xh}∈R^{d×h}$ and $W_{hh}∈R^{h×h}$ are weight parameters.

Finally, we need to incorporate the effect of the update gate $Z_t$. This determines the extent to which the new hidden state $H_t∈R^{n×h}$ matches the old state $H_{t−1}$ versus how much it resembles the new candidate state $\tilde{H}_t$. The update gate $Z_t$ can be used for this purpose, simply by taking elementwise convex combinations of $H_{t−1}$ and $\tilde{H}_t$. This leads to the final update equation for the GRU:
$H_t=Z_t⊙H_{t−1}+(1−Z_t)⊙\tilde{H}_t$.

Whenever the update gate $Z_t$ is close to 1, we simply retain the old state. In this case the information from $X_t$ is ignored, effectively skipping time step $t$ in the dependency chain. In contrast, whenever $Z_t$ is close to 0, the new latent state $H_t$ approaches the candidate latent state $\tilde{H}_t$.
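
Putting the three equations together, here is a minimal NumPy sketch of one complete GRU time step (the function name gru_step and the random initialization are illustrative choices, not from any particular library; biases are again omitted):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(X_t, H_prev, W_xr, W_hr, W_xz, W_hz, W_xh, W_hh):
    """One GRU update following the equations above (biases omitted)."""
    R_t = sigmoid(X_t @ W_xr + H_prev @ W_hr)             # reset gate
    Z_t = sigmoid(X_t @ W_xz + H_prev @ W_hz)             # update gate
    H_cand = np.tanh(X_t @ W_xh + (R_t * H_prev) @ W_hh)  # candidate state
    return Z_t * H_prev + (1 - Z_t) * H_cand              # convex combination

n, d, h = 4, 3, 5
rng = np.random.default_rng(0)
shapes = [(d, h), (h, h), (d, h), (h, h), (d, h), (h, h)]
params = [rng.normal(scale=0.1, size=s) for s in shapes]
H_t = gru_step(rng.normal(size=(n, d)), np.zeros((n, h)), *params)
print(H_t.shape)   # (4, 5)
```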


Let $\bar{h}_t^{(k)}$ represent the hidden state of the $k$th layer for $k \ge 1$. For notational convenience, we also assume that the input layer $\bar{x}_t$ can be denoted by $\bar{h}_t^{(0)}$ (although this layer is obviously not hidden). As in the case of the LSTM, we assume that the input vector $\bar{x}_t$ is $d$-dimensional, whereas the hidden states are $p$-dimensional. The sizes of the transformation matrices in the first layer are adjusted accordingly to account for this fact. In the case of the GRU, we use two matrices $W^{(k)}$ and $V^{(k)}$ of sizes $2p \times 2p$ and $p \times 2p$, respectively. Pre-multiplying a vector of size $2p$ with $W^{(k)}$ results in a vector of size $2p$, which is passed through the sigmoid activation to create two intermediate $p$-dimensional vectors $\bar{z}_t$ and $\bar{r}_t$, respectively. The intermediate variables $\bar{z}_t$ and $\bar{r}_t$ are referred to as the update and reset gates. The determination of the hidden state vector $\bar{h}_t^{(k)}$ uses a two-step process of first computing these gates, and then using them to decide how much to change the hidden vector with the weight matrix $V^{(k)}$:

$\begin{bmatrix} \bar{z}_t \\ \bar{r}_t \end{bmatrix} = \text{sigm}\left(W^{(k)} \begin{bmatrix} \bar{h}_t^{(k-1)} \\ \bar{h}_{t-1}^{(k)} \end{bmatrix}\right)$

$\bar{h}_t^{(k)} = \bar{z}_t \odot \bar{h}_{t-1}^{(k)} + (1-\bar{z}_t) \odot \tanh\left(V^{(k)} \begin{bmatrix} \bar{h}_t^{(k-1)} \\ \bar{r}_t \odot \bar{h}_{t-1}^{(k)} \end{bmatrix}\right)$

Here, the element-wise product of vectors is denoted by $“\odot”$ and the notation “sigm” denotes a sigmoid operation. For the very first layer (i.e., $k = 1$), the notation $\bar{h}_t^{(k−1)}$ in the above equation should be replaced with $\bar{x}_t$. Furthermore, the matrices $W^{(1)}$ and $V^{(1)}$ are of sizes $2p × (p + d)$ and $p × (p + d)$, respectively. We have also omitted the mention of biases here, but they are usually included in practical implementations. In the following, we provide a further explanation of these updates and contrast them with those of the LSTM.
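
The following is a minimal NumPy sketch of this stacked-matrix form of the update for a single layer $k$ (the function name and shapes are illustrative; biases are omitted as in the text):

```python
import numpy as np

def sigm(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_layer_step(h_below, h_prev, W, V):
    """One GRU update for layer k at time t in the stacked-matrix form above.
    h_below : (q,)      hidden state of layer k-1 at time t (x_t when k = 1)
    h_prev  : (p,)      hidden state of layer k at time t-1
    W       : (2p, p+q) produces the stacked update and reset gates
    V       : (p, p+q)  produces the candidate state
    """
    p = h_prev.shape[0]
    zr = sigm(W @ np.concatenate([h_below, h_prev]))   # [z; r], each of size p
    z, r = zr[:p], zr[p:]
    candidate = np.tanh(V @ np.concatenate([h_below, r * h_prev]))
    return z * h_prev + (1 - z) * candidate

p, d = 4, 3                                            # hidden and input sizes
rng = np.random.default_rng(0)
h_t = gru_layer_step(rng.normal(size=d), np.zeros(p),
                     rng.normal(scale=0.1, size=(2 * p, p + d)),
                     rng.normal(scale=0.1, size=(p, p + d)))
print(h_t.shape)   # (4,)
```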

Just as the LSTM uses input, output, and forget gates to decide how much of the information from the previous time-stamp to carry over to the next step, the GRU uses the update and reset gates. The GRU does not have a separate internal memory and requires fewer gates to perform the update from one hidden state to another. Therefore, a natural question arises about the precise role of the update and reset gates. The reset gate $\bar{r}$ decides how much of the hidden state to carry over from the previous time-stamp for the matrix-based update (as in a vanilla recurrent neural network). The update gate $\bar{z}$ decides the relative strength of the contributions of this matrix-based update and a more direct contribution from the hidden vector $\bar{h}_{t-1}^{(k)}$ at the previous time-stamp. By allowing a direct (partial) copy of the hidden state from the previous time-stamp, the gradient flow becomes more stable during backpropagation. The update gate of the GRU simultaneously performs the role of the input and forget gates in the LSTM in the form of $\bar{z}$ and $1 − \bar{z}$, respectively. However, the mapping between the GRU and the LSTM is not precise, because the GRU performs these updates directly on the hidden state (and there is no cell state). Like the input, output, and forget gates in the LSTM, the update and reset gates are intermediate “scratch-pad” variables.

In order to understand why GRUs provide better performance than vanilla RNNs, let us examine a GRU with a single layer, a single state dimension $p = 1$, and scalar inputs. In such a case, the update equation of the GRU can be written as follows:

$h_t = z \cdot h_{t-1} + (1-z) \cdot \tanh[v_1 \cdot x_t + v_2 \cdot r \cdot h_{t-1}]$
Note that layer superscripts are omitted in this single-layer case. Here, $v_1$ and $v_2$ are the two elements of the $1 \times 2$ matrix $V$. Then, it is easy to see the following:

$\frac{\partial h_t}{\partial h_{t-1}} = z + \text{(additive terms)}$

The backward gradient flow is multiplied by this factor. Here, the term $z ∈ (0, 1)$ helps in passing unimpeded gradient flow and makes computations more stable. Furthermore, since the additive terms heavily depend on $(1−z)$, the overall multiplicative factor tends to be closer to 1 even when $z$ is small. Another point is that the value of $z$ and the multiplicative factor $\frac{∂h_t}{∂h_{t−1}}$ are different for each time stamp, which tends to reduce the propensity for vanishing or exploding gradients.
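
As a small numerical illustration, the following sketch evaluates the single-unit update with some hypothetical scalar weights (all parameter values below are arbitrary choices for illustration) and compares a finite-difference estimate of $\frac{\partial h_t}{\partial h_{t-1}}$ with the update gate $z$:

```python
import numpy as np

# Hypothetical scalar parameters for the single-unit GRU discussed above.
w_z, u_z = 0.5, -0.3      # update-gate weights for x_t and h_{t-1}
w_r, u_r = 0.2, 0.4       # reset-gate weights
v_1, v_2 = 0.7, 0.6       # the two elements of V
x_t = 1.0

def sigm(a):
    return 1.0 / (1.0 + np.exp(-a))

def step(h_prev):
    z = sigm(w_z * x_t + u_z * h_prev)
    r = sigm(w_r * x_t + u_r * h_prev)
    return z * h_prev + (1 - z) * np.tanh(v_1 * x_t + v_2 * r * h_prev)

h_prev, eps = 0.1, 1e-6
grad = (step(h_prev + eps) - step(h_prev - eps)) / (2 * eps)  # finite difference
z = sigm(w_z * x_t + u_z * h_prev)
print(f"dh_t/dh_(t-1) = {grad:.4f}, update gate z = {z:.4f}")
```

The difference between the two printed values corresponds to the additive terms in the equation above.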

Although the GRU is a closely related simplification of the LSTM, it should not be seen as a special case of the LSTM. The two models have been shown to be roughly similar in performance, and the relative performance seems to depend on the task at hand. The GRU is simpler and enjoys the advantage of greater ease of implementation and efficiency. It might generalize slightly better with less data because of its smaller parameter footprint, although the LSTM would be preferable with an increased amount of data. The LSTM has been more extensively tested than the GRU, simply because it is an older architecture and enjoys widespread popularity. As a result, it is generally seen as a safer option, particularly when working with longer sequences and larger data sets. It has also been shown that none of the variants of the LSTM can reliably outperform it in a consistent way. This is because of the explicit internal memory and the greater gate-centric control in updating the LSTM.
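
As one concrete illustration of the smaller parameter footprint, the following sketch (assuming PyTorch is available; the layer sizes are arbitrary) counts the parameters of a GRU layer and an LSTM layer of the same size:

```python
import torch.nn as nn

gru = nn.GRU(input_size=64, hidden_size=128)
lstm = nn.LSTM(input_size=64, hidden_size=128)

gru_params = sum(p.numel() for p in gru.parameters())
lstm_params = sum(p.numel() for p in lstm.parameters())
print(f"GRU parameters:  {gru_params}")   # 3 gate/candidate blocks
print(f"LSTM parameters: {lstm_params}")  # 4 gate/candidate blocks, roughly 4/3 as many
```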
