Warning: This article contains only one mathematical equation which can be understood even if you have only passed high school. No other mathematical formulas are present. Reader discretion is advised.
If you have ever taken a Machine Learning course, or even read a bit about regression, you will inevitably have come across the term Gradient Descent. The name captures the whole logic of the algorithm: descend down a slope. Gradient Descent minimizes a differentiable function by determining its slope at the current point and then taking a small step in the opposite direction, that is, a step downhill. Over many iterations, we end up in a valley (a minimum, though not necessarily the lowest one).
The equation for the algorithm is:
θ = θ − η · ∇J(θ)    (1)
Here ∇J(θ) is the gradient of the function J(θ), that is, its partial derivatives (slopes) with respect to each parameter. We multiply it by a learning rate η, which determines how big a step we take, and then move the parameter θ in the opposite direction.
The image above should make it clearer.
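To make equation (1) concrete, here is a minimal sketch in Python. The objective J(θ) = (θ − 3)^2, the function names and the values of eta and steps are all assumptions picked for illustration; they are not from any particular library.

```python
def gradient_descent(grad, theta, eta=0.1, steps=100):
    """Repeatedly apply equation (1): theta = theta - eta * grad(theta)."""
    for _ in range(steps):
        theta = theta - eta * grad(theta)
    return theta


# Hypothetical objective J(theta) = (theta - 3)^2, whose slope is 2 * (theta - 3).
def grad_J(theta):
    return 2.0 * (theta - 3.0)


theta_min = gradient_descent(grad_J, theta=0.0)
print(theta_min)  # ends up very close to 3, the bottom of the valley
```

With eta = 0.1, every step shrinks the distance to the minimum by a factor of 0.8. If eta is too large for the function at hand, the steps overshoot the valley instead of settling into it.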
Now, this gradient calculation and update is a resource-intensive step. By some estimates, if an objective function takes n steps to compute, its gradient takes 3n steps. We also have lots of data, and gradient descent has to pass over it many times: the step is repeated for every θ and every row of the data set. All this requires a huge amount of computing power.
But we can cheat. Instead of computing the exact objective or loss function, we compute an estimate of it, and a very bad one at that. We compute the loss for a small random sample of the training data, compute the gradient for that sample only, and pretend that this gradient points in the right direction.
So now each step is a very small one, and the price we pay is that reaching the minimum takes many more steps than it would with larger, exact steps.
However, computationally, we win by a huge margin overall. This technique of using sampling for the gradient update is called Stochastic Gradient Descent. It scales well with both data and model size, which is great since we want both big data and big models.
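The sampling trick can be sketched on a toy problem: fitting a straight line y ≈ w·x + b by minimizing mean squared error. Everything here (the function name sgd_linear_fit, the noise-free toy data, the batch size and the learning rate) is an assumption made for illustration, not a recommended setup.

```python
import random


def sgd_linear_fit(xs, ys, eta=0.05, steps=2000, batch_size=8, seed=0):
    """Fit y ~ w * x + b by stochastic gradient descent on mean squared error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Estimate the gradient from a small random sample, not the full data set.
        batch = rng.sample(range(n), batch_size)
        grad_w = 0.0
        grad_b = 0.0
        for i in batch:
            error = (w * xs[i] + b) - ys[i]
            grad_w += 2.0 * error * xs[i] / batch_size
            grad_b += 2.0 * error / batch_size
        # Same update as equation (1), applied with the noisy gradient.
        w -= eta * grad_w
        b -= eta * grad_b
    return w, b


# Hypothetical toy data drawn from y = 2x + 1 with no noise.
xs = [i / 100.0 for i in range(-100, 100)]
ys = [2.0 * x + 1.0 for x in xs]
w, b = sgd_linear_fit(xs, ys)
print(w, b)  # should land near 2 and 1
```

Each update touches only 8 of the 200 rows, so a single step costs a small fraction of a full pass over the data, at the price of a noisier direction.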
SGD is, however, a pretty bad optimizer and comes with a lot of issues in practice. I would suggest Sebastian Ruder’s blog for more detailed explanations, variations and implementations.
Some tips to help Stochastic Gradient Descent: normalize inputs to zero mean and equal variances; use random weights with zero mean and equal variances as starting points.
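Both tips can be sketched in a few lines of plain Python. The helper names standardize and init_weights, the sample data and the choice of sigma = 0.1 are assumptions for illustration; unit variance is just one convenient way of giving every input the same variance, and the sketch assumes the column is not constant (otherwise the standard deviation would be zero).

```python
import random
import statistics


def standardize(column):
    """Rescale one feature column to zero mean and unit variance."""
    mean = statistics.mean(column)
    std = statistics.pstdev(column)
    return [(value - mean) / std for value in column]


def init_weights(count, sigma=0.1, seed=0):
    """Draw starting weights from a zero-mean Gaussian with a shared standard deviation."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, sigma) for _ in range(count)]


# Hypothetical raw feature on an awkward scale.
heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]
scaled = standardize(heights_cm)
weights = init_weights(3)
print(scaled)
print(weights)
```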