Long Short-Term Memory (LSTM)

Recurrent neural networks suffer from vanishing and exploding gradients. This is a common problem in neural network updates in which successive multiplication by the matrix $W^{(k)}$ is inherently unstable; the gradient either disappears during backpropagation or blows up to large values in an unstable way. This type of instability is the direct result of successive multiplication with the (recurrent) weight matrix at various time-stamps. One way of viewing this problem is that a neural network that uses only multiplicative updates is good only at learning over short sequences, and is therefore inherently endowed with good short-term memory but poor long-term memory. To address this problem, the LSTM changes the recurrence equation for the hidden vector by introducing a long-term memory. The operations of the LSTM are designed to have fine-grained control over the data written into this long-term memory.
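
To see why repeated multiplicative updates are unstable, here is a small illustrative NumPy sketch (not part of the original notes); the matrix size, scale factors, and step count are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)

def repeated_multiply(scale, steps=50, dim=20):
    """Mimic the recurrent update h_t = W h_{t-1} applied `steps` times
    and return the norm of the final vector."""
    W = scale * rng.standard_normal((dim, dim)) / np.sqrt(dim)
    h = rng.standard_normal(dim)
    for _ in range(steps):
        h = W @ h
    return np.linalg.norm(h)

print(repeated_multiply(scale=0.5))  # norm collapses toward 0 (vanishing)
print(repeated_multiply(scale=1.5))  # norm blows up (exploding)
```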

LSTM (Long Short-Term Memory) networks are, loosely speaking, fancy recurrent neural networks with some additional features.

Just like an RNN, an LSTM processes the sequence in time steps, but each LSTM cell carries an extra piece of information, called the “memory”, at every time step.

So the LSTM cell contains the following components:
Forget gate “f” (a neural network with sigmoid)
Candidate layer “C`” (a neural network with tanh)
Input gate “I” (a neural network with sigmoid)
Output gate “O” (a neural network with sigmoid)
Hidden state “H” (a vector)
Memory state “C” (a vector)

Here is the diagram for the LSTM cell at time step $t$.
The inputs to the LSTM cell at any step are $X_t$ (the current input), $H_{t-1}$ (the previous hidden state), and $C_{t-1}$ (the previous memory state).
The outputs from the LSTM cell are $H_t$ (the current hidden state) and $C_t$ (the current memory state).

Here is the diagram for an LSTM cell at time step $T$.


The forget gate (f), candidate layer (C`), input gate (I), and output gate (O) are single-layered neural networks with the sigmoid activation function, except for the candidate layer, which uses tanh as its activation function.

Each of these gates takes the current input multiplied by $U$ and the previous hidden state multiplied by $W$, adds the two results together (plus a bias in practice), and applies its activation function. Each gate therefore produces a vector (with values between 0 and 1 for sigmoid, and between -1 and 1 for tanh), so we get four vectors f, C`, I, and O at every time step.
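
As a rough NumPy sketch of these gate computations, assuming per-gate parameter names such as U_f, W_f, and b_f (these names are my own, and biases are included even though the text above treats them implicitly):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_gates(x_t, h_prev, params):
    """Compute the four gate/candidate vectors for one time step.

    x_t: current input (d-dimensional); h_prev: previous hidden state (p-dimensional).
    params holds per-gate input weights U_* (d x p), recurrent weights W_* (p x p),
    and biases b_* (length p)."""
    f = sigmoid(x_t @ params["U_f"] + h_prev @ params["W_f"] + params["b_f"])       # forget gate, values in (0, 1)
    i = sigmoid(x_t @ params["U_i"] + h_prev @ params["W_i"] + params["b_i"])       # input gate, values in (0, 1)
    o = sigmoid(x_t @ params["U_o"] + h_prev @ params["W_o"] + params["b_o"])       # output gate, values in (0, 1)
    c_cand = np.tanh(x_t @ params["U_c"] + h_prev @ params["W_c"] + params["b_c"])  # candidate layer, values in (-1, 1)
    return f, i, o, c_cand
```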

There is an important piece called the memory state C.

This is where the memory (context) of the input is stored.

Ex: Mady walks into the room, and Monica also walks into the room. Mady said “hi” to ____?

In order to predict correctly, the cell stores “Monica” in the memory C.

This state can be modified; that is, the LSTM cell can add or remove information.

Ex: Mady and Monica walk into the room together; later, Richard walks into the room. Mady said “hi” to ____?

The assumption I am making is that the memory might now change from Monica to Richard.

So the LSTM cell takes the previous memory state $C_{t-1}$ and performs element-wise multiplication with the forget gate (f):

$C_t = C_{t-1}*f_t$

If the forget gate value is 0, the previous memory state is completely forgotten; if the forget gate value is 1, the previous memory state is passed on unchanged (remember, the f gate gives values between 0 and 1).

Now we complete the current memory state $C_t$ by adding new information from the input gate and the candidate layer:

$C_t= C_t + (I_t*C`_t)$
$C_t$ is the current memory state at time step $t$, and it is passed on to the next time step.
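
As a small, made-up numerical example with $p = 2$: suppose $C_{t-1} = [0.4, -0.6]$, $f_t = [1, 0]$, $I_t = [0, 1]$, and $C`_t = [0.9, 0.3]$. The forget step gives $C_{t-1}*f_t = [0.4, 0]$ (the second memory component is erased), and adding $I_t*C`_t = [0, 0.3]$ yields $C_t = [0.4, 0.3]$: the first component is carried forward unchanged while the second is overwritten with new information.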

Finally, we need to calculate what we are going to output. This output is based on the cell state $C_t$, but it is a filtered version: we apply tanh to $C_t$ and then perform element-wise multiplication with the output gate $O$. That result is the current hidden state $H_t$:

$H_t = O_t*Tanh(C_t)$

We pass these two vectors, $C_t$ and $H_t$, to the next time step and repeat the same process.
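
Putting the steps together, here is a minimal sketch of one full LSTM cell step; it reuses the hypothetical `lstm_gates` helper from the earlier sketch:

```python
import numpy as np

def lstm_cell_step(x_t, h_prev, c_prev, params):
    """One LSTM time step: returns the current hidden state H_t and memory state C_t."""
    f, i, o, c_cand = lstm_gates(x_t, h_prev, params)  # four gate/candidate vectors (helper sketched above)
    c_t = c_prev * f + i * c_cand   # forget part of the old memory, then add new candidate content
    h_t = o * np.tanh(c_t)          # a filtered view of the memory becomes the hidden state
    return h_t, c_t                 # both are passed on to time step t + 1
```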

As in the previous sections, the notation $\bar{h}_t^{(k)}$ represents the hidden states of the $k$th layer of a multi-layer LSTM. For notational convenience, we also assume that the input layer $\bar{x}_t$ can be denoted by $\bar{h}_t^{(0)}$ (although this layer is obviously not hidden). As in the case of the recurrent network, the input vector $\bar{x}_t$ is $d$-dimensional, whereas the hidden states are $p$-dimensional. The LSTM is an enhancement of the recurrent neural network architecture in which we change the recurrence conditions of how the hidden states $\bar{h}_t^{(k)}$ are propagated. In order to achieve this goal, we have an additional hidden vector of $p$ dimensions, which is denoted by $\bar{c}_t^{(k)}$ and referred to as the cell state. One can view the cell state as a kind of long-term memory that retains at least a part of the information in earlier states by using a combination of partial “forgetting” and “increment” operations on the previous cell states. It has been shown that the nature of the memory in $\bar{c}_t^{(k)}$ is occasionally interpretable when it is applied to text data such as literary pieces. For example, one of the $p$ values in $\bar{c}_t^{(k)}$ might change in sign after an opening quotation and then revert back only when that quotation is closed. The upshot of this phenomenon is that the resulting neural network is able to model long-range dependencies in the language or even a specific pattern (like a quotation) extended over a large number of tokens. This is achieved by using a gentle approach to update these cell states over time, so that there is greater persistence in information storage. Persistence in state values avoids the kind of instability that occurs in the case of the vanishing and exploding gradient problems. One way of understanding this intuitively is that if the states in different temporal layers share a greater level of similarity (through long-term memory), it is harder for the gradients with respect to the incoming weights to be drastically different.


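The updates being discussed in the next few paragraphs can be written compactly as follows, where the matrix $W^{(k)}$ is of size $4p \times 2p$ for $k > 1$:

$\begin{bmatrix} \bar{i} \\ \bar{f} \\ \bar{o} \\ \bar{c} \end{bmatrix} = \begin{pmatrix} \text{sigm} \\ \text{sigm} \\ \text{sigm} \\ \tanh \end{pmatrix} W^{(k)} \begin{bmatrix} \bar{h}_t^{(k-1)} \\ \bar{h}_{t-1}^{(k)} \end{bmatrix}$   [Set up intermediate variables]

$\bar{c}_t^{(k)} = \bar{f} \odot \bar{c}_{t-1}^{(k)} + \bar{i} \odot \bar{c}$   [Selectively forget and add to long-term memory]

$\bar{h}_t^{(k)} = \bar{o} \odot \tanh(\bar{c}_t^{(k)})$   [Selectively leak long-term memory to hidden state]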

Here, the element-wise product of vectors is denoted by “$\odot$”, and the notation “sigm” denotes the sigmoid operation. For the very first layer (i.e., $k = 1$), the notation $\bar{h}_t^{(k−1)}$ in the above equation should be replaced with $\bar{x}_t$, and the matrix $W^{(1)}$ is of size $4p × (p + d)$. In practical implementations, biases are also used in the above updates, although they are omitted here for simplicity. The aforementioned update seems rather cryptic, and therefore it requires further explanation.

The first step in the above sequence of equations is to set up the intermediate variable vectors $\bar{i}, \bar{f}, \bar{o}$, and $\bar{c}$, of which the first three should conceptually be considered binary values, although they are continuous values in (0, 1). Multiplying a pair of binary values is like using an AND gate on a pair of boolean values. We will henceforth refer to this operation as gating. The vectors $\bar{i}, \bar{f},$ and $\bar{o}$ are referred to as input, forget, and output gates. In particular, these vectors are conceptually used as boolean gates for deciding (i) whether to add to a cell state, (ii) whether to forget a cell state, and (iii) whether to allow leakage into a hidden state from a cell state. The use of the binary abstraction for the input, forget, and output variables helps in understanding the types of decisions being made by the updates. In practice, a continuous value in (0, 1) is contained in these variables, which can enforce the effect of the binary gate in a probabilistic way if the output is seen as a probability. In the neural network setting, it is essential to work with continuous functions in order to ensure the differentiability required for gradient updates. The vector $\bar{c}$ contains the newly proposed contents of the cell state, although the input and forget gates regulate how much it is allowed to change the previous cell state (to retain long-term memory).

The four intermediate variables $\bar{i}, \bar{f}, \bar{o}$, and $\bar{c}$ are set up using the weight matrices $W^{(k)}$ for the $k$th layer in the first equation above. Let us now examine the second equation, which updates the cell state with the use of some of these intermediate variables:

$\bar{c}_t^{(k)} = \bar{f} \odot \bar{c}_{t-1}^{(k)} + \bar{i} \odot \bar{c}$

This equation has two parts. The first part uses the $p$ forget bits in $\bar{f}$ to decide which of the $p$ cell states from the previous time-stamp to reset to 0, and it uses the $p$ input bits in $\bar{i}$ to decide whether to add the corresponding components from $\bar{c}$ to each of the cell states. Note that such updates of the cell states are in additive form, which is helpful in avoiding the vanishing gradient problem caused by multiplicative updates. One can view the cell-state vector as a continuously updated long-term memory, where the forget and input bits respectively decide (i) whether to reset the cell states from the previous time-stamp and forget the past, and (ii) whether to increment the cell states from the previous time-stamp to incorporate new information into long-term memory from the current word. The vector $\bar{c}$ contains the $p$ amounts with which to increment the cell states, and these are values in [−1, +1] because they are all outputs of the tanh function.
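
A minimal NumPy sketch of this layer-$k$ update in the compact notation above, assuming a single weight matrix `W_k` whose rows are split into the four intermediate variables (the function and variable names are my own):

```python
import numpy as np

def sigm(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_layer_step(h_below, h_prev, c_prev, W_k):
    """One time step of layer k in the compact notation above.

    h_below: hidden state of layer k-1 at time t (or the input x_t when k = 1)
    h_prev, c_prev: hidden and cell states of layer k at time t-1 (both p-dimensional)
    W_k: weight matrix of shape (4p, p + len(h_below))."""
    p = h_prev.shape[0]
    z = W_k @ np.concatenate([h_below, h_prev])  # one matrix product sets up all intermediate variables
    i = sigm(z[:p])                              # input gate
    f = sigm(z[p:2 * p])                         # forget gate
    o = sigm(z[2 * p:3 * p])                     # output gate
    c_bar = np.tanh(z[3 * p:])                   # proposed new cell contents
    c_t = f * c_prev + i * c_bar                 # additive cell-state update (long-term memory)
    h_t = o * np.tanh(c_t)                       # leak the cell state into the hidden state
    return h_t, c_t
```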

Finally, the hidden states $\bar{h}_t^{(k)}$ are updated using leakages from the cell state. The hidden state is updated as follows:

$\bar{h}_t^{(k)} = \bar{o} \odot \tanh(\bar{c}_t^{(k)})$
Here, we are copying a functional form of each of the $p$ cell states into each of the $p$ hidden states, depending on whether the output gate (defined by $\bar{o}$) is 0 or 1. Of course, in the continuous setting of neural networks, partial gating occurs and only a fraction of the signal is copied from each cell state to the corresponding hidden state. It is noteworthy that the final equation does not always use the tanh activation function. The following alternative update may be used:

$\bar{h}_t^{(k)}=\bar{o} \odot \bar{c}_t^{(k)}$

As in the case of all neural networks, the backpropagation algorithm is used for training purposes.

In order to understand why LSTMs provide better gradient flows than vanilla RNNs, let us examine the update for a simple LSTM with a single layer and $p = 1$. In such a case, the cell update can be simplified to the following:

$c_t = c_{t-1}*f + i*c$

Therefore, the partial derivative of $c_t$ with respect to $c_{t-1}$ is $f$, which means that the backward gradient flows for $c_t$ are multiplied with the value of the forget gate $f$. Because of element-wise operations, this result generalizes to arbitrary values of the state dimensionality $p$. The biases of the forget gates are often set to high values initially, so that the gradient flows decay relatively slowly. The forget gate $f$ can also be different at different time-stamps, which reduces the propensity of the vanishing gradient problem. The hidden states can be expressed in terms of the cell states as $h_t = o*\tanh(c_t)$, so that one can compute the partial derivative with respect to $h_t$ with the use of a single tanh derivative. In other words, the long-term cell states function as gradient super-highways, which leak into hidden states.
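
As a quick numerical check of this claim, here is a finite-difference sketch with $p = 1$ and made-up gate values; the gates are held fixed, so only the direct dependence of $c_t$ on $c_{t-1}$ is measured:

```python
# Made-up scalar gate values for one time step (p = 1).
f, i, c_bar = 0.9, 0.3, 0.5
c_prev = 0.2

def cell_update(c_prev):
    # simplified cell update: c_t = c_{t-1} * f + i * c_bar
    return c_prev * f + i * c_bar

eps = 1e-6
numeric_grad = (cell_update(c_prev + eps) - cell_update(c_prev - eps)) / (2 * eps)
print(numeric_grad)  # prints approximately 0.9, i.e. the value of the forget gate f
```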




