Gradient-Descent Strategies




The most common method for parameter learning in neural networks is the steepest-descent method, in which the gradient of the loss function is used to make parameter updates. As discussed earlier, the steepest-descent method can sometimes behave unexpectedly, because it does not always point in the best direction of improvement when steps of finite size are considered. The steepest-descent direction is the optimal direction only from the perspective of infinitesimal steps; it can even become an ascent direction after a small update of the parameters. As a result, many course corrections are needed. A specific example of this phenomenon arises when minor differences in sensitivity to different features cause a steepest-descent algorithm to oscillate. The problem of oscillation and zigzagging is quite ubiquitous whenever the steepest-descent direction moves along a direction of high curvature in the loss function. The most extreme manifestation of this problem occurs in the case of extreme ill-conditioning, in which the partial derivatives of the loss are wildly different with respect to the different optimization variables. In this section, we will discuss several clever learning strategies that work well in these ill-conditioned settings.

Learning Rate Decay
A constant learning rate is not desirable because it poses a dilemma to the analyst. The dilemma is as follows. A low learning rate used early on will cause the algorithm to take too long to come even close to an optimal solution. On the other hand, a large initial learning rate will allow the algorithm to come reasonably close to a good solution at first; however, the algorithm will then oscillate around that point for a very long time, or diverge in an unstable way, if the high rate of learning is maintained. In either case, maintaining a constant learning rate is not ideal. Allowing the learning rate to decay over time naturally achieves the desired learning-rate adjustment and avoids these challenges. The two most common decay functions are exponential decay and inverse decay. The learning rate $\alpha_t$ can be expressed in terms of the initial learning rate $\alpha_0$ and epoch $t$ as follows:

$\alpha_t = \alpha_0 \exp(-k \cdot t)$ [Exponential Decay]
$\alpha_t = \frac{\alpha_0}{1 + k \cdot t}$ [Inverse Decay]

The parameter $k$ controls the rate of the decay. Another approach is to use step decay in which the learning rate is reduced by a particular factor every few epochs. For example, the learning rate might be multiplied by 0.5 every 5 epochs. A common approach is to track the loss on a held-out portion of the training data set, and reduce the learning rate whenever this loss stops improving. In some cases, the analyst might even babysit the learning process, and use an implementation in which the learning rate can be changed manually depending on the progress. This type of approach can be used with simple implementations of gradient descent, although it does not address many of the other problematic issues.
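
To make the schedules concrete, the following is a minimal sketch (in Python/NumPy) of the three decay rules discussed above; the values of $\alpha_0$, $k$, and the step-decay settings are illustrative choices, not prescriptions.

```python
import numpy as np

alpha0 = 0.1   # initial learning rate (illustrative)
k = 0.01       # decay parameter (illustrative)

def exponential_decay(t):
    # alpha_t = alpha_0 * exp(-k * t)
    return alpha0 * np.exp(-k * t)

def inverse_decay(t):
    # alpha_t = alpha_0 / (1 + k * t)
    return alpha0 / (1.0 + k * t)

def step_decay(t, drop=0.5, epochs_per_drop=5):
    # multiply the learning rate by `drop` every `epochs_per_drop` epochs
    return alpha0 * (drop ** (t // epochs_per_drop))

for t in [0, 5, 10, 50]:
    print(t, exponential_decay(t), inverse_decay(t), step_decay(t))
```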

Momentum-Based Learning

In mini-batch gradient descent, the model parameters are updated after iterating through the data points of a given batch. The direction of each update therefore has some variance, which leads to oscillations; these oscillations make convergence harder to reach and slow down the process of attaining it. Momentum is used to combat this problem.

Momentum dampens the directions of movement that do not lead toward convergence.

In other words, a fraction of the parameter update from the previous step is added to the current gradient step.

Momentum-based techniques recognize that zigzagging is a result of highly contradictory steps that cancel one another out and reduce the effective size of the steps in the correct (long-term) direction. An example of this scenario is illustrated in Figure (b) below. Simply increasing the size of the step in order to obtain greater movement in the correct direction might actually move the current solution even further away from the optimum. From this point of view, it makes a lot more sense to move in an “averaged” direction of the last few steps, so that the zigzagging is smoothed out. In order to understand this point, consider a setting in which one is performing gradient descent with respect to the parameter vector $\overline{W}$. The normal updates for gradient descent with respect to a loss function $L$ (defined over a mini-batch of instances) are as follows:

$\overline{V}=-\alpha \frac{\partial L }{\partial \overline{W}}$
$\overline{W}=\overline{W}+\overline{V}$

Here, $\alpha$ is the learning rate. In momentum-based descent, the vector $\overline{V}$ is modified with exponential smoothing, where $\beta \in (0, 1)$ is a smoothing parameter:

$\overline{V}=\beta \overline{V}-\alpha \frac{\partial L }{\partial \overline{W}}$
$\overline{W}=\overline{W}+\overline{V}$

Larger values of $\beta$ help the approach pick up a consistent velocity $\overline{V}$ in the correct direction. Setting $\beta = 0$ specializes to straightforward mini-batch gradient descent. The parameter $\beta$ is also referred to as the momentum parameter or the friction parameter. The word “friction” is derived from the fact that small values of $\beta$ act as “brakes,” much like friction.
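
The following is a minimal sketch of the momentum update above; the quadratic loss, its gradient `grad_fn`, and the values of $\alpha$ and $\beta$ are illustrative assumptions rather than part of the method itself.

```python
import numpy as np

def grad_fn(W):
    # gradient of the illustrative loss L(W) = ||W||^2
    return 2.0 * W

alpha, beta = 0.01, 0.9        # learning rate and momentum (friction) parameter
W = np.array([5.0, -3.0])      # parameter vector
V = np.zeros_like(W)           # velocity, initialized to zero

for step in range(100):
    V = beta * V - alpha * grad_fn(W)   # V <- beta*V - alpha * dL/dW
    W = W + V                           # W <- W + V

print(W)   # approaches the minimum at the origin
```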

With momentum-based descent, learning is accelerated, because one is generally moving in a direction that often points closer to the optimal solution, and the useless “sideways” oscillations are muted. The basic idea is to give greater preference to consistent directions over multiple steps, since these have greater importance in the descent. This allows the use of larger steps in the correct direction without causing overflows or “explosions” in the sideways direction. As a result, learning is accelerated. An example of the use of momentum is illustrated in the figure below. It is evident from panel (a) that momentum increases the relative component of the gradient in the correct direction. The corresponding effects on the updates are illustrated in panels (b) and (c). It is evident that momentum-based updates can reach the optimal solution in fewer updates.

The use of momentum will often cause the solution to slightly overshoot in the direction where velocity is picked up, just as a marble will overshoot when it is allowed to roll down a bowl. However, with the appropriate choice of $\beta$, it will still perform better than a situation in which momentum is not used. The momentum-based method will generally perform better because the marble gains speed as it rolls down the bowl; the quicker arrival at the optimal solution more than compensates for the overshooting of the target. Overshooting is desirable to the extent that it helps avoid local optima. The figure below, which shows a marble rolling down a complex loss surface (picking up speed as it rolls down), illustrates this concept. The marble’s gathering of speed helps it efficiently navigate flat regions of the loss surface. The parameter $\beta$ controls the amount of friction that the marble encounters while rolling down the loss surface. While increased values of $\beta$ help in avoiding local optima, they might also increase oscillation at the end. In this sense, the momentum-based method has a neat interpretation in terms of the physics of a marble rolling down a complex loss surface.


Nesterov Momentum
Nesterov momentum is a modification of the traditional momentum method in which the gradient is computed at the point that would be reached by executing a $\beta$-discounted version of the previous step again (i.e., the momentum portion of the current step). This point is obtained by multiplying the previous update vector $\overline{V}$ with the friction parameter $\beta$ and then computing the gradient at $\overline{W} + \beta\overline{V}$. The idea is that this corrected gradient uses a better understanding of how the gradients will change because of the momentum portion of the update, and incorporates this information into the gradient portion of the update. Therefore, one is using a certain amount of lookahead in computing the updates. Let us denote the loss function by $L(\overline{W})$ at the current solution $\overline{W}$. In this case, it is important to explicitly denote the argument of the loss function because of the way in which the gradient is computed. Therefore, the update may be computed as follows:

$\overline{V}=\beta\overline{V}-\alpha \frac{\partial L (\overline{W}+\beta \overline{V})}{\partial \overline{W}}$
$\overline{W}=\overline{W}+\overline{V}$

Note that the only difference from the standard momentum method is in terms of where the gradient is computed. Using the value of the gradient a little further along the previous update can lead to faster convergence. In the previous analogy of the rolling marble, such an approach will start applying the “brakes” on the gradient-descent procedure when the marble starts reaching near the bottom of the bowl, because the lookahead will “warn” it about the reversal in gradient direction. The Nesterov method works well only with mini-batch gradient descent of modest batch sizes; using very small batches is a bad idea. In such cases, it can be shown that the Nesterov method reduces the error to $O(1/t^2)$ after $t$ steps, as compared to an error of $O(1/t)$ in the momentum method.
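
A minimal sketch of the Nesterov update follows, reusing the same illustrative quadratic loss and parameter values as in the momentum example; the only change is that the gradient is evaluated at the lookahead point $\overline{W} + \beta\overline{V}$.

```python
import numpy as np

def grad_fn(W):
    # gradient of the illustrative loss L(W) = ||W||^2
    return 2.0 * W

alpha, beta = 0.01, 0.9
W = np.array([5.0, -3.0])
V = np.zeros_like(W)

for step in range(100):
    # the gradient is evaluated at the lookahead point W + beta*V
    V = beta * V - alpha * grad_fn(W + beta * V)
    W = W + V

print(W)
```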



Parameter-Specific Learning Rates
The basic idea in the momentum methods of the previous section is to leverage the consistency in the gradient direction of certain parameters in order to speed up the updates. This goal can also be achieved more explicitly by using different learning rates for different parameters. The idea is that parameters with large partial derivatives are often oscillating and zigzagging, whereas parameters with small partial derivatives tend to be more consistent but move in the same direction. An early method that was proposed in this direction was the delta-bar-delta method. This approach tracks whether the sign of each partial derivative changes or stays the same. If the sign of a partial derivative stays consistent, then this indicates that the direction is correct; in such a case, the learning rate in that direction increases. On the other hand, if the sign of the partial derivative keeps flipping, then the learning rate in that direction decreases. A simplified sketch of this sign-tracking idea is shown below. However, this kind of approach is designed for gradient descent rather than stochastic gradient descent, because the errors in stochastic gradient descent can get magnified. Therefore, a number of methods have been proposed that work well even when the mini-batch method is used.
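
As a rough illustration of the sign-tracking idea, the following simplified sketch maintains one learning rate per parameter, increasing it additively when the gradient sign stays consistent and decreasing it multiplicatively when the sign flips. The constants `kappa` and `phi`, the quadratic loss, and the use of only the previous step's sign are simplifying assumptions, not the exact delta-bar-delta rule.

```python
import numpy as np

def grad_fn(w):
    # gradient of the illustrative loss L(w) = ||w||^2
    return 2.0 * w

kappa, phi = 0.005, 0.5          # additive increase / multiplicative decrease (assumed values)
w = np.array([5.0, -3.0])
rates = np.full_like(w, 0.01)    # one learning rate per parameter
prev_g = np.zeros_like(w)

for step in range(100):
    g = grad_fn(w)
    same_sign = np.sign(g) == np.sign(prev_g)
    # consistent sign: raise the rate; flipped sign: cut it down
    rates = np.where(same_sign, rates + kappa, rates * (1.0 - phi))
    w = w - rates * g
    prev_g = g

print(w)
```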

AdaGrad (short for adaptive gradient)

In the case of deep learning, we have many model parameters (weights) and many layers to train, and our goal is to find the optimal values for all of these weights. In all of the previous methods, the learning rate was a constant value for all the parameters of the network. AdaGrad, however, adapts the learning rate separately for each parameter, hence the name adaptive gradient.

In the AdaGrad algorithm, one keeps track of the aggregated squared magnitude of the partial derivative with respect to each parameter over the course of the algorithm. The square root of this value is proportional to the root-mean-square slope for that parameter (although the absolute value will increase with the number of epochs because of successive aggregation). Let $A_i$ be the aggregate value for the $i^{th}$ parameter. Then, in each iteration, the following update is performed:
$A_i=A_i+\left(\frac{\partial L }{\partial w_i}\right)^2 \quad \quad \forall_i$

The update for the $i^{th}$ parameter $w_i$ is as follows:

$w_i=w_i - \frac{\alpha}{\sqrt{A_i}} \left(\frac{\partial L }{\partial w_i}\right); \quad \forall_i$

If desired, one can use $\sqrt{A_i + \epsilon}$ in the denominator instead of $\sqrt{A_i}$ to avoid ill-conditioning. Here, $\epsilon$ is a small positive value such as $10^{−8}$.

In this update, the denominator accumulates the squares of the past gradients for the given parameter. Notice that this denominator effectively scales the learning rate:

– When the sum of the squared past gradients is large, the learning rate is divided by a large value, so the effective learning rate becomes small.

– Similarly, when the sum of the squared past gradients is small, the learning rate is divided by a small value, so the effective learning rate becomes large.

Scaling the derivative inversely with $\sqrt{A_i}$ is a kind of “signal-to-noise” normalization, because $A_i$ only measures the historical magnitude of the gradient rather than its sign; it encourages faster relative movements along gently sloping directions with a consistent gradient sign. If the gradient component along the $i^{th}$ direction keeps wildly fluctuating between +100 and −100, this type of magnitude-centric normalization will penalize that component far more than another gradient component that consistently takes on a value in the vicinity of 0.1 (but with a consistent sign). In other words, movements along oscillating directions are de-emphasized, and movements along consistent directions are emphasized. However, absolute movements along all components will tend to slow down over time, which is the main problem with the approach. The slowing down is caused by the fact that $A_i$ is the aggregate value of the entire history of partial derivatives, which leads to diminishing values of the scaled derivative. As a result, the progress of AdaGrad might prematurely become too slow, and it will eventually (almost) stop making progress. Another problem is that the aggregate scaling factors depend on ancient history, which can eventually become stale; the use of stale scaling factors can increase inaccuracy. As we will see later, most of the other methods use exponential averaging, which solves both problems.
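
A minimal sketch of the AdaGrad update on the same illustrative quadratic loss is given below; the learning rate and iteration count are assumptions for illustration, and the accumulating $A_i$ is what slows the steps over time.

```python
import numpy as np

def grad_fn(w):
    # gradient of the illustrative loss L(w) = ||w||^2
    return 2.0 * w

alpha, eps = 0.1, 1e-8
w = np.array([5.0, -3.0])
A = np.zeros_like(w)             # per-parameter aggregate of squared gradients

for step in range(100):
    g = grad_fn(w)
    A = A + g ** 2                           # A_i <- A_i + (dL/dw_i)^2
    w = w - alpha * g / np.sqrt(A + eps)     # per-parameter scaled step

print(w)   # progress slows as A_i keeps growing
```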

RMSProp
The RMSProp algorithm uses a similar motivation as AdaGrad for performing the “signal-to-noise” normalization with the absolute magnitude $\sqrt{A_i}$ of the gradients. However, instead of simply adding the squared gradients to estimate $A_i$, it uses exponential averaging. Since one uses averaging to normalize rather than aggregated values, the progress is not slowed prematurely by a constantly increasing scaling factor $A_i$. The basic idea is to use a decay factor $\rho \in (0, 1)$, and weight the squared partial derivatives occurring $t$ updates ago by $\rho^t$. Note that this can be achieved easily by multiplying the current squared aggregate (i.e., running estimate) by $\rho$ and then adding $(1 − \rho)$ times the current (squared) partial derivative. The running estimate is initialized to 0. This causes some (undesirable) bias in early iterations, which disappears over the longer term. Therefore, if $A_i$ is the exponentially averaged squared gradient for the $i^{th}$ parameter $w_i$, we have the following way of updating $A_i$:

$A_i= \rho A_i +(1-\rho) \left(\frac{\partial L }{\partial w_i}\right)^2 \quad \quad \forall_i$

The square-root of this value for each parameter is used to normalize its gradient. Then, the following update is used for (global) learning rate $α$:

$w_i=w_i - \frac{\alpha}{\sqrt{A_i}} \left(\frac{\partial L }{\partial w_i}\right); \quad \forall_i$

If desired, one can use $\sqrt{A_i + \epsilon}$ in the denominator instead of $\sqrt{A_i}$ to avoid ill-conditioning. Here, $\epsilon$ is a small positive value such as $10^{−8}$.

The central idea of RMSProp is to keep a moving average of the squared gradients for each weight, and then divide the gradient by the square root of this mean square. This is why it is called RMSProp (root mean square propagation).

Another advantage of RMSProp over AdaGrad is that the importance of ancient (i.e., stale) gradients decays exponentially with time. Furthermore, it can benefit from incorporating concepts of momentum within the computational algorithm. The drawback of RMSProp is that the running estimate $A_i$ of the second-order moment is biased in early iterations, because it is initialized to 0.
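
A minimal sketch of the RMSProp update is given below; the decay factor $\rho$, the learning rate, and the quadratic loss are illustrative choices rather than part of the algorithm's definition.

```python
import numpy as np

def grad_fn(w):
    # gradient of the illustrative loss L(w) = ||w||^2
    return 2.0 * w

alpha, rho, eps = 0.01, 0.9, 1e-8
w = np.array([5.0, -3.0])
A = np.zeros_like(w)             # exponentially averaged squared gradients

for step in range(500):
    g = grad_fn(w)
    A = rho * A + (1.0 - rho) * g ** 2       # A_i <- rho*A_i + (1-rho)(dL/dw_i)^2
    w = w - alpha * g / np.sqrt(A + eps)

print(w)
```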


Adam (Adaptive Moment Estimation)

Adam is one of the most widely used optimizers in deep learning. It combines the advantages of RMSProp (per-parameter scaling of the learning rate) and momentum (fewer oscillations).

In Adam, we compute a running average of the squared gradients, and, in addition, a running average of the gradients themselves.

The Adam algorithm uses a similar “signal-to-noise” normalization as AdaGrad and RMSProp; however, it also exponentially smooths the first-order gradient in order to incorporate momentum into the update. It also directly addresses the bias inherent in exponential smoothing when the running estimate of a smoothed value is unrealistically initialized to 0. As in the case of RMSProp, let $A_i$ be the exponentially averaged squared gradient for the $i^{th}$ parameter $w_i$. This value is updated in the same way as in RMSProp with the decay parameter $\rho \in (0, 1)$:

$A_i= \rho A_i +(1-\rho) \left(\frac{\partial L }{\partial w_i}\right)^2 \quad \quad \forall_i$

At the same time, an exponentially smoothed value of the gradient is maintained, for which the $i^{th}$ component is denoted by $F_i$. This smoothing is performed with a different decay parameter $\rho_f$:

$F_i=ρ_f F_i+(1-ρ_f)\left(\frac{\partial L }{\partial w_i}\right) ; \forall_i$

This type of exponential smoothing of the gradient with $\rho_f$ is a variation of the momentum method discussed earlier (which is parameterized by a friction parameter $\beta$ instead of $\rho_f$). Then, the following update is used at learning rate $\alpha_t$ in the $t^{th}$ iteration:

$w_i=w_i - \frac{\alpha_t}{\sqrt{A_i}}F_i; \forall_i$

There are two key differences from the RMSProp algorithm. First, the gradient is replaced with its exponentially smoothed value in order to incorporate momentum. Second, the learning rate $α_t$ now depends on the iteration index $t$, and is defined as follows:

$\alpha_t=\alpha \left( \frac{\sqrt{1-\rho^t}}{1-\rho_f^t}\right)$

Technically, the adjustment to the learning rate is actually a bias-correction factor that is applied to account for the unrealistic initialization of the two exponential smoothing mechanisms, and it is particularly important in early iterations. Both $F_i$ and $A_i$ are initialized to 0, which causes bias in early iterations. The two quantities are affected differently by the bias, which accounts for the ratio in the equation above. It is noteworthy that each of $\rho^t$ and $\rho_f^t$ converges to 0 for large $t$ because $\rho, \rho_f \in (0, 1)$. As a result, the initialization bias-correction factor converges to 1, and $\alpha_t$ converges to $\alpha$. The default suggested values of $\rho_f$ and $\rho$ are 0.9 and 0.999, respectively, according to the original Adam paper. Like the other methods, Adam uses $\sqrt{A_i + \epsilon}$ (instead of $\sqrt{A_i}$) in the denominator of the update for better conditioning. The Adam algorithm is extremely popular because it incorporates most of the advantages of the other algorithms, and often performs competitively with respect to the best of them.
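
A minimal sketch of the Adam update, including the bias-corrected learning rate $\alpha_t$, is shown below; the quadratic loss, the learning rate, and the number of iterations are illustrative assumptions, while $\rho_f = 0.9$ and $\rho = 0.999$ follow the defaults mentioned in the text.

```python
import numpy as np

def grad_fn(w):
    # gradient of the illustrative loss L(w) = ||w||^2
    return 2.0 * w

alpha, rho_f, rho, eps = 0.01, 0.9, 0.999, 1e-8
w = np.array([5.0, -3.0])
F = np.zeros_like(w)             # smoothed first moment (gradient)
A = np.zeros_like(w)             # smoothed second moment (squared gradient)

for t in range(1, 1001):
    g = grad_fn(w)
    F = rho_f * F + (1.0 - rho_f) * g        # F_i update
    A = rho * A + (1.0 - rho) * g ** 2       # A_i update
    alpha_t = alpha * np.sqrt(1.0 - rho ** t) / (1.0 - rho_f ** t)   # bias correction
    w = w - alpha_t * F / np.sqrt(A + eps)

print(w)
```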

The algorithm was described in the 2014 paper by Diederik Kingma and Jimmy Lei Ba titled “Adam: A Method for Stochastic Optimization.”


