Optimizers in Deep Learning — In short! (without Maths)

In general, deep learning follows the steps as shown in the following image. First, we feed forward the dataset using multiple hidden layers and activation functions to get some output value, we compare this output to actual output and calculate the loss function. The cost is calculated using loss. According to the cost calculated, the optimizer tunes the value of all the weights and biases to reduce loss function via backpropagation. Once the loss/cost is reduced to an acceptable value, our model is trained.

Want to learn about activation functions and loss and cost functions before optimizers? you can refer below articles.

Gradient descent is the most common method used by optimizers. Our main aim is to decrease loss. the loss depends on output, output depends on the input of the output neuron, which depends on the output of the previous hidden layer, and so on. Optimizer calculates how the cost function changes with respect to the parameters(weights & biases) and changes the value of the parameter in the direction such that the cost is reduced. Gradient descent is moving down the hill from any given point. Let’s see an example, assume we are currently at point A(in the following figure), we can move in two directions, we can go to point B or point C. By calculating the gradient, we ensure that we move towards point C and not point B.(Mathematically we can see this clearly in the equation but as we are not going into that, just trust me on this one).

So, we take a step in the direction of point C. But how bigger or smaller this step depends on one parameter called the learning parameter. If we take a large learning parameter, then we overshoot and miss the minima and if we take too small a learning rate then we take so small steps that it takes a lot of iterations and time to reach minima. This is depicted in the following image.

There are three main approaches used when applying gradient descent. We can use a complete training dataset in one epoch, a subset, or a single data per epoch. Let’s discuss each one briefly.

Steps to follow

• Pass all the records one by one.
• Calculate the loss function for each record.
• Calculate the cost function which is basically just the summation of the loss of each record.
• Use this cost function in the optimizer to backpropagate and update the weights and biases.

• As we consider a complete training dataset for gradient descent, the step that we take to reduce cost is optimal.

• Memory consumption is high as extra memory is needed to store loss obtained from each record.
• Calculating cost function with a complete dataset for each epoch will take more time, hence the optimization using batch gradient descent is slower.

2. Mini-batch Gradient Descent (MGD or MBGD)

Steps to follow

• Select a batch of size between 50 to 256(this is not a hard and fast rule, you can take other sized batches as well) records randomly from the training dataset.
• calculate the loss for each record and bu summation fo it, the cost for the mini-batch.
• Send this cost to the optimizer to update weights and biases in backward propagation.

• As we do not send the entire dataset for each epoch, memory-wise its efficient.
• Due to mini-batch, less computation is required as compared to BGD

• No guarantee that it will give better convergence like BGD.
• In case the learning rate is too small, it will also take a lot of time to converge.
• In case the learning rate is high, it will fluctuate a lot, convergence will not be smooth

If the batch size is equal to the total number of records, BGD = MGD

Steps to follow

• Randomly select one record.
• Find the loss associated with that record.
• Pass this loss to the optimizer in backpropagation to update the weights and biases.

• It is faster as one epoch consists of only one record
• Memory requirement is also very less

• Oscillates very frequently as we are trying to find local minima for each record
• Ends up getting multiple minima

Momentum Based Optimizers

Until now, we were considering a gradient of only one iteration to update weights and baises in a particular backpropagation cycle. In contrast to this, Adaptive optimization algorithms take into consideration statistics from all previous iterations along with the current one to update the weights and biases. These algorithms are more robust and provide smooth convergence. Momentum-based adaptive algorithms use an exponentially weighted sum of gradients of previous iterations along with the current in backpropagation, resulting in quicker convergence than conventional methods. Let’s look at some such algorithms available.

It adapts the learning rate to the parameters. It uses a low learning rate for frequently occurring features and a high learning rate for the parameters associated with infrequent features. Hence it is well suited for dealing with sparse data.

GloVe word embedding uses adagrad where infrequent words a greater update and frequent words smaller update

• As the learning rate is monotonically decreasing, the learning rate in the late training period is very small.
• It requires manually setting a global initial learning rate