When building neural networks, we compute gradients with the backpropagation algorithm. These gradients are then used to perform the parameter updates. The default way of doing this is to update the parameters along the negative gradient direction, i.e. using the gradient descent optimizer.
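As a minimal sketch of this vanilla update (the names x, dx and learning_rate are illustrative, not from any particular library):

def sgd_step(x, dx, learning_rate=1e-2):
    # step along the negative gradient of the loss with respect to x
    return x - learning_rate * dx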
There are certain problems with SGD. If the loss function has a local minimum or a saddle point, the gradient becomes zero and gradient descent gets stuck.
But there are many methods that improve on this. A few of these optimizers, momentum, Nesterov momentum, Adagrad, RMSProp and Adam, are discussed below.
Here’s a popular story about momentum: gradient descent is a man walking down a hill. He follows the steepest path downwards; his progress is slow, but steady. Momentum is a heavy ball rolling down the same hill. The added inertia acts both as a smoother and an accelerator, dampening oscillations and causing us to barrel through narrow valleys, small humps and local minima.
— Distill.pub (magazine)
Here, instead of directly influencing the position, the gradient influences the velocity, which in turn has an effect on the position. The momentum update builds up velocity in directions that have a gentle but consistent gradient.
Here, the effect of the gradient is to increment the previous velocity. velocity is initialized to zero, and mu is the factor by which the velocity decays, typically taken slightly less than 1 (e.g. 0.9 or 0.99). The momentum method has a better convergence rate than the basic update.
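A minimal sketch of this update, assuming numpy-style arrays and the illustrative names x (parameters) and dx (gradient):

def momentum_step(x, dx, velocity, learning_rate=1e-2, mu=0.9):
    # the gradient increments the decayed previous velocity ...
    velocity = mu * velocity - learning_rate * dx
    # ... and the velocity, not the raw gradient, moves the parameters
    x = x + velocity
    return x, velocity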
The standard momentum method first computes the gradient at the current position and then takes a big jump in the direction of the updated accumulated gradient. Nesterov momentum, in contrast, first makes a big jump in the direction of the previous accumulated gradient, then measures the gradient where it ends up and makes a correction. It converges better than standard momentum.
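A sketch of this look-ahead-then-correct update (illustrative names; grad is assumed to be a function returning the gradient at a given point):

def nesterov_step(x, velocity, grad, learning_rate=1e-2, mu=0.9):
    x_ahead = x + mu * velocity                            # jump along the previous accumulated velocity
    dx_ahead = grad(x_ahead)                               # measure the gradient where we end up
    velocity = mu * velocity - learning_rate * dx_ahead    # make the correction
    x = x + velocity
    return x, velocity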
Sometimes, even a reasonably chosen learning rate can only decrease the error up to a certain value, as shown in the figure. In this scenario, it's useful to decrease the learning rate over time. This is accomplished with learning rate decay, e.g. the learning rate can be halved every 7 epochs.
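A step-decay schedule of that kind can be sketched as follows (base_lr, the drop factor of 0.5 and the 7-epoch interval are illustrative values, not prescriptions):

def step_decay_lr(base_lr, epoch, drop=0.5, epochs_per_drop=7):
    # halve the learning rate every epochs_per_drop epochs
    return base_lr * (drop ** (epoch // epochs_per_drop))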
All the above methods manipulate the learning rate globally and equally for all parameters. Adagrad is an adaptive learning rate method that tunes the learning rate per parameter, increasing or decreasing it as required.
cache keeps track of the per-parameter sum of squared gradients. The effective learning rate is reduced for weights that receive high gradients and increased for weights that receive small updates.
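A sketch of the Adagrad update (assuming numpy; eps is a small smoothing term, e.g. 1e-8, to avoid division by zero):

import numpy as np

def adagrad_step(x, dx, cache, learning_rate=1e-2, eps=1e-8):
    cache = cache + dx**2                                  # per-parameter sum of squared gradients
    x = x - learning_rate * dx / (np.sqrt(cache) + eps)    # larger cache means a smaller effective step
    return x, cache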
RMSProp keeps a moving average of the squared gradients and divides the current gradient by the root of this average (hence the name, root mean square). Because the moving average forgets old gradients, the effective learning rate does not shrink towards zero the way it does with Adagrad's ever-growing cache, which makes the learning process much better behaved.
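A sketch, again with illustrative names (decay_rate is typically around 0.9 to 0.99):

import numpy as np

def rmsprop_step(x, dx, cache, learning_rate=1e-3, decay_rate=0.9, eps=1e-8):
    cache = decay_rate * cache + (1 - decay_rate) * dx**2   # moving average of squared gradients
    x = x - learning_rate * dx / (np.sqrt(cache) + eps)     # divide the gradient by its root
    return x, cache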
Adam is very similar to RMSProp. Whereas RMSProp adapts the per-parameter learning rate using only the second moment of the gradients (the uncentered variance), Adam additionally uses the first moment (the mean). That is, it keeps an exponential moving average of both the gradient (the mean) and the squared gradient (the variance).
The Adam update also includes a bias correction to compensate for the fact that m and v are initialized at zero and are therefore biased towards zero during the first steps.
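A sketch of the full update with bias correction (t is the iteration number starting at 1; beta1, beta2 and eps are the usual hyperparameters, typically 0.9, 0.999 and 1e-8):

import numpy as np

def adam_step(x, dx, m, v, t, learning_rate=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * dx          # moving average of the gradient (mean)
    v = beta2 * v + (1 - beta2) * dx**2       # moving average of the squared gradient (variance)
    m_hat = m / (1 - beta1**t)                # bias correction: m and v start at zero
    v_hat = v / (1 - beta2**t)
    x = x - learning_rate * m_hat / (np.sqrt(v_hat) + eps)
    return x, m, v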
Adam is recommended as the default choice of optimization algorithm. However, it's also worth trying RMSProp and SGD with Nesterov momentum.
References: