Optimizers in Deep Learning — In short! (without Maths)

Sanket Kangle
7 min read · Jun 1, 2021



In general, deep learning follows the steps shown in the following image. First, we feed the dataset forward through multiple hidden layers and activation functions to get an output value. We compare this output to the actual output and calculate the loss function; the cost is then calculated from the individual losses. Based on this cost, the optimizer tunes the values of all the weights and biases via backpropagation to reduce the loss. Once the loss/cost is reduced to an acceptable value, our model is trained.
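To make that loop concrete, here is a minimal NumPy sketch of one training run, assuming a single linear layer, a sigmoid activation, and mean squared error; all names, shapes, and values are illustrative, not from a specific framework.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                    # 100 records, 3 features
y = (X.sum(axis=1, keepdims=True) > 0).astype(float)

W = rng.normal(size=(3, 1)) * 0.1                # weights
b = np.zeros((1, 1))                             # bias
lr = 0.1                                         # learning rate

for epoch in range(100):
    # Forward pass: linear layer + activation -> output
    z = X @ W + b
    y_hat = 1.0 / (1.0 + np.exp(-z))             # sigmoid activation

    # Loss per record and cost (mean over the dataset)
    loss = (y_hat - y) ** 2
    cost = loss.mean()

    # Backpropagation: gradient of the cost w.r.t. W and b
    grad_z = 2 * (y_hat - y) * y_hat * (1 - y_hat) / len(X)
    grad_W = X.T @ grad_z
    grad_b = grad_z.sum(axis=0, keepdims=True)

    # Optimizer step: tune weights and biases to reduce the cost
    W -= lr * grad_W
    b -= lr * grad_b
```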

Want to learn about activation functions and loss and cost functions before optimizers? You can refer to the articles below.

Gradient descent

Gradient descent is the most common method used by optimizers. Our main aim is to decrease the loss. The loss depends on the output, the output depends on the input of the output neuron, which in turn depends on the output of the previous hidden layer, and so on. The optimizer calculates how the cost function changes with respect to the parameters (weights and biases) and changes the parameter values in the direction that reduces the cost. Gradient descent is like moving down a hill from any given point. Let's see an example: assume we are currently at point A (in the following figure). We can move in two directions, towards point B or towards point C. By calculating the gradient, we ensure that we move towards point C and not point B. (Mathematically we can see this clearly in the equation, but as we are not going into that, just trust me on this one.)

So, we take a step in the direction of point C. How big or small this step is depends on a hyperparameter called the learning rate. If we take a large learning rate, we overshoot and miss the minimum; if we take too small a learning rate, the steps are so tiny that it takes a lot of iterations and time to reach the minimum. This is depicted in the following image.
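A tiny sketch of this effect, using an illustrative one-dimensional cost of w² so the gradient is easy to write down:

```python
# Gradient descent on cost(w) = w**2, whose gradient is 2*w.
# The function and the learning rates are illustrative only.
def gradient(w):
    return 2 * w

for lr in (0.1, 1.1):       # a reasonable rate vs. one that overshoots
    w = 5.0                 # starting point ("point A")
    for step in range(10):
        w = w - lr * gradient(w)
    print(f"learning rate {lr}: w after 10 steps = {w:.3f}")
# With lr=0.1, w shrinks towards the minimum at 0; with lr=1.1 it diverges.
```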

There are three main approaches used when applying gradient descent, depending on how much data goes into each update: the complete training dataset, a subset (a mini-batch), or a single record. Let's discuss each one briefly.

1. Batch Gradient Descent (BGD)

Steps to follow

  • Pass all the records through the network one by one.
  • Calculate the loss for each record.
  • Calculate the cost function, which is basically just the summation of the losses of all records.
  • Use this cost in the optimizer to backpropagate and update the weights and biases.

Advantages

  • As we consider the complete training dataset for the gradient, the step we take to reduce the cost is optimal.

Disadvantages

  • Memory consumption is high, as extra memory is needed to store the loss of every record.
  • Calculating the cost function over the complete dataset for every epoch takes more time, so optimization with batch gradient descent is slower.
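A minimal sketch of batch gradient descent, assuming a hypothetical compute_gradient(X, y, w) helper that returns the gradient of the cost over whichever records it is given:

```python
# Batch gradient descent: one update per epoch, computed over the
# complete training dataset. X, y, w and compute_gradient are placeholders.
def batch_gd(X, y, w, compute_gradient, lr=0.01, epochs=100):
    for _ in range(epochs):
        grad = compute_gradient(X, y, w)   # uses ALL records at once
        w = w - lr * grad                  # single parameter update per epoch
    return w
```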

2. Mini-batch Gradient Descent (MGD or MBGD)

Steps to follow

  • Randomly select a batch of records from the training dataset, typically between 50 and 256 (this is not a hard and fast rule; other batch sizes work as well).
  • Calculate the loss for each record and, by summing them, the cost for the mini-batch.
  • Send this cost to the optimizer to update the weights and biases during backpropagation.

Advantages

  • As we do not send the entire dataset through in each update, it is memory-efficient.
  • Each update requires less computation than in BGD.

Disadvantages

  • There is no guarantee that it will converge as well as BGD.
  • If the learning rate is too small, it will take a long time to converge.
  • If the learning rate is too high, the cost will fluctuate a lot and convergence will not be smooth.

If the batch size is equal to the total number of records, BGD = MGD
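A minimal sketch of mini-batch gradient descent, reusing the same hypothetical compute_gradient helper; the batch size of 64 is illustrative:

```python
import numpy as np

def minibatch_gd(X, y, w, compute_gradient, lr=0.01, batch_size=64, epochs=100):
    n = len(X)
    rng = np.random.default_rng(0)
    for _ in range(epochs):
        idx = rng.permutation(n)                   # shuffle the records
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]  # one mini-batch
            grad = compute_gradient(X[batch], y[batch], w)
            w = w - lr * grad                      # one update per mini-batch
    return w
# If batch_size == n, this reduces to the batch gradient descent sketch above.
```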

3. Stochastic Gradient Descent (SGD)

Steps to follow

  • Randomly select one record.
  • Find the loss associated with that record.
  • Pass this loss to the optimizer in backpropagation to update the weights and biases.

Advantages

  • It is fast, as each update uses only a single record.
  • The memory requirement is also very low.

Disadvantages

  • It oscillates very frequently, as each step follows the gradient of a single record.
  • It can end up bouncing between multiple minima rather than settling smoothly.
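A minimal sketch of stochastic gradient descent with the same hypothetical compute_gradient helper:

```python
import numpy as np

# Each update uses a single randomly chosen record: noisy but cheap.
def sgd(X, y, w, compute_gradient, lr=0.01, steps=1000):
    rng = np.random.default_rng(0)
    for _ in range(steps):
        i = rng.integers(len(X))                    # pick one record at random
        grad = compute_gradient(X[i:i + 1], y[i:i + 1], w)
        w = w - lr * grad
    return w
```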

Momentum Based Optimizers

Until now, we considered the gradient of only the current iteration when updating the weights and biases in a given backpropagation cycle. In contrast, adaptive optimization algorithms take into account statistics from all previous iterations along with the current one. These algorithms are more robust and provide smoother convergence. Momentum-based algorithms use an exponentially weighted sum of the gradients of previous iterations along with the current one, resulting in quicker convergence than plain gradient descent. Let's look at some of the available algorithms.
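As a reference point, a minimal sketch of a plain momentum update; the coefficient beta = 0.9 is a common but illustrative choice:

```python
# The step uses an exponentially weighted average of past gradients
# (the "velocity") rather than the current gradient alone.
def momentum_step(w, grad, velocity, lr=0.01, beta=0.9):
    velocity = beta * velocity + (1 - beta) * grad  # running average of gradients
    w = w - lr * velocity
    return w, velocity
```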

Adagrad

It adapts the learning rate to the parameters: it uses a lower learning rate for parameters associated with frequently occurring features and a higher learning rate for parameters associated with infrequent features. Hence it is well suited for dealing with sparse data.

GloVe word embeddings are trained with Adagrad, where infrequent words get larger updates and frequent words get smaller updates.

Adagrad eliminates the need to manually tune the learning rate.

Disadvantages of Adagrad

  • As the squared gradients keep accumulating, the learning rate decreases monotonically and becomes very small late in training.
  • It still requires manually setting a global initial learning rate.
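A minimal sketch of a single Adagrad update, with illustrative hyperparameter values:

```python
import numpy as np

# The squared gradients are accumulated, and each parameter's effective
# learning rate shrinks as its accumulator grows.
def adagrad_step(w, grad, accum, lr=0.01, eps=1e-8):
    accum = accum + grad ** 2                       # grows monotonically
    w = w - lr * grad / (np.sqrt(accum) + eps)      # per-parameter learning rate
    return w, accum
```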

Adadelta

Adadelta is an extension of Adagrad that tries to reduce Adagrad's aggressive, monotonically decreasing learning rate. It does this by restricting the window of accumulated past gradients to some fixed size w: the running average at time t then depends only on the previous average and the current gradient. In Adadelta we do not need to set a default learning rate, because the step size is taken as the ratio of the running average of past parameter updates to the running average of past gradients.
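A minimal sketch of a single Adadelta update; rho and eps are illustrative values:

```python
import numpy as np

# Decaying averages replace Adagrad's ever-growing accumulator, and the
# ratio of the RMS of past updates to the RMS of gradients takes the
# place of a hand-set learning rate.
def adadelta_step(w, grad, avg_sq_grad, avg_sq_update, rho=0.95, eps=1e-6):
    avg_sq_grad = rho * avg_sq_grad + (1 - rho) * grad ** 2
    update = -np.sqrt(avg_sq_update + eps) / np.sqrt(avg_sq_grad + eps) * grad
    avg_sq_update = rho * avg_sq_update + (1 - rho) * update ** 2
    return w + update, avg_sq_grad, avg_sq_update
```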

RMSProp

RMSProp stands for Root Mean Square Propagation; it is an adaptive learning rate optimization algorithm proposed by Geoffrey Hinton. RMSProp tries to resolve Adagrad's radically diminishing learning rates by using a moving average of the squared gradients: it utilizes the magnitude of recent gradients to normalize the current one. Where Adagrad accumulates all previous squared gradients, RMSprop keeps only a decaying average, which alleviates the problem of Adagrad's learning rate dropping too quickly. In RMSProp the learning rate is adjusted automatically, with a different effective learning rate for each parameter: the base learning rate is divided by an exponentially decaying average of squared gradients.
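A minimal sketch of a single RMSProp update, with illustrative hyperparameter values:

```python
import numpy as np

# Like Adagrad, but the squared gradients are kept as a decaying average,
# so the effective learning rate does not collapse to zero.
def rmsprop_step(w, grad, avg_sq_grad, lr=0.001, rho=0.9, eps=1e-8):
    avg_sq_grad = rho * avg_sq_grad + (1 - rho) * grad ** 2
    w = w - lr * grad / (np.sqrt(avg_sq_grad) + eps)
    return w, avg_sq_grad
```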

Adam

Adaptive Moment Estimation (Adam) is another method that computes adaptive learning rates for each parameter. In addition to storing an exponentially decaying average of past squared gradients like Adadelta and RMSprop, Adam also keeps an exponentially decaying average of past gradients, similar to momentum. Adam can therefore be viewed as a combination of Adagrad, which works well on sparse gradients, and RMSProp, which works well in online and non-stationary settings. Adam uses these exponential moving averages of the gradients to scale the learning rate, instead of the simple accumulation used in Adagrad. It is computationally efficient, has very small memory requirements, and is one of the most popular gradient descent optimization algorithms.
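A minimal sketch of a single Adam update; beta1 = 0.9 and beta2 = 0.999 are the commonly cited defaults, used here illustratively:

```python
import numpy as np

# Decaying averages of the gradients (m) and of the squared gradients (v),
# with bias correction for the early steps (t counts updates from 1).
def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad          # first moment (momentum-like)
    v = beta2 * v + (1 - beta2) * grad ** 2     # second moment (RMSProp-like)
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```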

Choosing an Optimizer

  • If the data is sparse, use one of the adaptive methods: Adagrad, Adadelta, RMSprop, or Adam.
  • RMSprop, Adadelta, and Adam behave similarly in many cases.
  • Adam essentially adds bias correction and momentum on top of RMSprop.
  • As gradients become sparse, Adam tends to perform better than RMSprop.

Overall, Adam is the best choice.

Plain SGD, without momentum, is used in many papers. Although SGD can reach a minimum, in practice it takes longer than the other algorithms and may get trapped in a saddle point.

Thanks for reading the article! Kindly give claps if you gained something from this article. Wanna connect with me? Here is a link to my Linkedin Profile

