Gradient Descent

Stochastic Gradient Descent – for beginners

Warning: This article contains only one mathematical equation, and it can be understood with nothing more than high-school math. No other mathematical formulas are present. Reader discretion is advised.

If you have ever taken a Machine Learning course, or even tried to read a bit about regression, it is inevitable that you will come across the term Gradient Descent. The name captures all the logic behind the algorithm: descend a slope. Gradient Descent minimizes a function by determining the slope of the function and then taking a small step in the opposite direction of the slope, i.e. a step downhill. After many iterations, we reach a valley.

The equation for the algorithm is:

θ = θ – η·∇J(θ)                                                                              equation (1)

Here ∇J(θ) is the gradient, or slope, of the function J(θ), and η is a learning-rate parameter that determines how big a step we are going to take. We then adjust our parameter θ in the opposite direction of this product.

[Figure: gradient descent taking successive steps downhill towards the minimum of a curve]

The image above should make it clearer.
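To make equation (1) concrete, here is a minimal Python sketch of my own (a toy illustration, not from any library) that minimizes the simple function J(θ) = θ², whose slope is 2θ:

```python
# Minimal gradient descent on J(theta) = theta^2, whose slope is 2*theta.
def grad_J(theta):
    return 2.0 * theta              # the gradient of our toy J(theta)

theta = 5.0                         # starting guess
eta = 0.1                           # learning rate: how big a step we take
for step in range(100):
    theta = theta - eta * grad_J(theta)   # equation (1)

print(theta)                        # ends up very close to 0, the bottom of the valley
```

Try changing eta: a tiny value crawls towards the valley, while a value above 1.0 overshoots and bounces around.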

Now this gradient calculation and update is a resource-intensive step. By some estimates, if an objective function takes n operations to compute, its gradient takes roughly three times that. We also have lots of data, and gradient descent has to go over it many times. The update has to be repeated for all the θs and all the rows of the data-set. All this requires a huge amount of computing power.

But we can cheat. Instead of computing the exact objective or loss function, we compute an estimate of it, a very rough estimate. We compute the loss for a small random sample of the training data, compute the gradient only for that sample, and pretend that this direction is the right direction to go.

So now each step is a very small step, but the price we pay is taking many more of these small steps instead of a few large ones to reach the minimum.

However, computationally, we win by a huge margin overall. This technique of using sampling for the gradient update is called Stochastic Gradient Descent. It scales well with both the data and the model size, which is great since we want both big data and big models.
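Here is a rough sketch of the idea in Python, using made-up data and a plain squared-error loss just to show the sampling step (my own illustration, not a reference implementation):

```python
import numpy as np

# Toy data: y is roughly 3*x plus noise; we fit a single weight w with SGD.
rng = np.random.default_rng(0)
X = rng.normal(size=1000)
y = 3.0 * X + rng.normal(scale=0.1, size=1000)

w = 0.0
eta = 0.01
batch_size = 32

for step in range(500):
    # Pick a small random sample of the training data...
    idx = rng.integers(0, len(X), size=batch_size)
    xb, yb = X[idx], y[idx]
    # ...compute the gradient of the squared error on that sample only...
    grad = 2.0 * np.mean((w * xb - yb) * xb)
    # ...and pretend this noisy direction is the right direction to go.
    w = w - eta * grad

print(w)                            # ends up close to 3.0
```

Each update only touches 32 rows instead of all 1000, which is exactly where the computational win comes from.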

SGD is, however, a fairly crude optimizer and comes with a lot of issues in practice. I would suggest Sebastian Ruder's blog for more detailed explanations, variations and implementations.

Some tips to help Stochastic Gradient Descent: normalize the inputs to zero mean and equal variances, and use random weights with zero mean and equal variances as starting points.
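Both tips are one-liners in practice. A small sketch, with a made-up feature matrix and a small variance chosen just for illustration:

```python
import numpy as np

# Normalize inputs: zero mean and equal (unit) variance per feature.
X = np.random.rand(200, 3)                       # some raw feature matrix
X_norm = (X - X.mean(axis=0)) / X.std(axis=0)

# Initialize weights randomly with zero mean and a small, equal variance.
sigma = 0.01
weights = np.random.normal(loc=0.0, scale=sigma, size=X.shape[1])
```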

 


#2 – Linear Regression with One Variable

In my last post I talked about two types of machine learning algorithms – supervised and unsupervised. The linear regression model comes under the first category. A regression model is used to predict a real-valued output. So suppose you have a data-set that contains the prices of houses of different sizes, and you want to predict the price of a house of some particular size that is not present in the data-set.

Let's assume that the data-set looks like the one below, and you wish to predict the price of a house of size 1250 sq. ft (I am using the lecture example here).

[Figure: scatter plot of house prices against house size]

The most logical and simple way to do so would be to draw a straight line, preferably through the middle of all the points on the graph. We could certainly do this by hand if there were only a few points on the graph, perhaps four or five. But for a larger number of data points, this task becomes too difficult to solve by trial and error.

Thankfully, it turns out there is an algorithm that can do this efficiently for us: the "Gradient Descent" algorithm. But before I go into its details, let me talk about another concept, the "Cost Function", which will be used in the gradient descent discussion.

The Cost Function lets us figure out how to fit the best possible straight line to our data. To draw a straight line you basically need two parameters – the intercept on the y-axis and the slope of the line. So let's denote our straight line by

h = t0 + t1*x                                                         equation 1

where t0 is the y-intercept and t1 is the slope of the line (t stands for theta and h stands for hypothesis). Now we take each point on the x-axis and calculate the corresponding value of the hypothesis 'h' using the above equation. These are the values we predict for some particular choice of t0 and t1, but they may not give us a line that correctly represents our data.
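In code, the hypothesis is just a line evaluated at each x. A tiny sketch with made-up house sizes and a made-up guess for t0 and t1:

```python
# Hypothesis h = t0 + t1 * x, evaluated for every house size in the data-set.
sizes = [1000, 1250, 1500, 2000]    # x values in sq. ft (made up for illustration)
t0, t1 = 50.0, 0.1                  # one particular guess for the parameters
predictions = [t0 + t1 * x for x in sizes]
print(predictions)                  # the prices this particular line predicts
```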

We solve this problem by using the concept of a "Cost Function". We denote our cost function by

J = 1/(2m) * sum[(h(i) – y(i))^2]  over all points            equation 2

where h(i) is the value of h for the i-th x value, y(i) is the original y value corresponding to the i-th x value, and m is the total number of points on the graph, i.e. the number of training examples.
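A short sketch of equation 2, reusing the made-up sizes from above and adding made-up prices, just to show how the sum is computed:

```python
# Cost J = (1 / 2m) * sum((h(i) - y(i))^2) over all m training examples.
xs = [1000, 1250, 1500, 2000]       # house sizes (made up)
ys = [200.0, 240.0, 280.0, 360.0]   # actual prices (made up)

def cost(t0, t1):
    m = len(xs)
    total = sum((t0 + t1 * x - y) ** 2 for x, y in zip(xs, ys))
    return total / (2 * m)

print(cost(0.0, 0.2))               # one guess for the line
print(cost(0.0, 0.18))              # a different guess with a lower cost
```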

This cost function basically tells us how good our line is at representing the data-set. If the points predicted by our line are very far away from the actual data-set values, the cost function will be very high and we will have to vary the values of t0 and t1 to get a new line. In the end we want to find the values of t0 and t1 that minimize our cost function. This is where gradient descent comes into the picture.

Gradient Descent is our algorithm for minimizing the cost function J.

The way it does this is that it starts with some values of t0 and t1 and then keeps changing them until we hopefully reach a minimum. The equation to calculate the subsequent values of t0 and t1 is,

t(j) = t(j) – alpha * ∂J/∂t(j),  for j = 0 and 1, both updated simultaneously            equation 3

The alpha in the above equation controls how big a step you take downhill. If alpha is too small, the algorithm will be very slow and take a long time to converge; if alpha is too large, the algorithm may overshoot the minimum and fail to converge, or may even diverge. Thus a good value for alpha must be selected by trial, and usually it becomes clear what that value is after running the algorithm once.
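Putting the pieces together, the update loop might look like the sketch below, reusing the made-up sizes and prices from the cost-function sketch. The two gradients are the partial derivatives of equation 2 with respect to t0 and t1, and the very small alpha is a trial-and-error choice that works for these un-normalized sizes:

```python
# Gradient descent for the two parameters t0 (intercept) and t1 (slope).
xs = [1000, 1250, 1500, 2000]       # house sizes (made up)
ys = [200.0, 240.0, 280.0, 360.0]   # actual prices (made up)
m = len(xs)

t0, t1 = 0.0, 0.0
alpha = 0.0000001                   # chosen by trial; tiny because x is in the thousands

for iteration in range(1000):
    errors = [t0 + t1 * x - y for x, y in zip(xs, ys)]
    grad_t0 = sum(errors) / m                                    # dJ/dt0
    grad_t1 = sum(e * x for e, x in zip(errors, xs)) / m         # dJ/dt1
    # Update both parameters simultaneously, stepping against the slope.
    t0, t1 = t0 - alpha * grad_t0, t1 - alpha * grad_t1

print(t0, t1)                       # the fitted intercept and slope
```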

As an example (the one used in the lecture), this is the model when random values of t0 and t1 are chosen,

[Figure: initial fit with randomly chosen t0 and t1]

And this is the result after running through four iterations of varying t0 and t1. The circular contour lines denote points that have the same cost.

[Figure: fit after four iterations, with the cost-function contours]

And this is the final result. The red crosses in the right-hand figure show the path followed by gradient descent to reach the minimum.

[Figure: final fitted line and the path of gradient descent on the contour plot]

In this way we end up with a final prediction model for our data-set, and we can now hopefully predict the correct price for any given house size.

This was Linear Regression with one variable. In my next post I will talk about Linear Regression with Multiple Variables.

Stay tuned….