In the previous post I talked about the simple case in which we were able to fit our data-set with the help of only one variable. But life is not so simple. In most real-life scenarios the data-set will have more than one feature, i.e. the output may depend on more than one variable. Taking the previous example of housing prices, the price of a house may depend not only on its size but also on, say, its age, the number of rooms, etc. In such cases simple linear regression (h = t0 + t1*x) may not work.
Rather, we must formulate a new equation which takes care of all the features. It turns out this is also quite simple. Suppose you have a data-set with, say, 4 features. We can then model it using the equation:
h = t0 + x1*t1 + x2*t2 + x3*t3 + x4*t4     (equation 1)
where x1 is the value of feature 1 (say age of the house), x2 is the value of the second feature (say number of rooms in the house) and so on, and the t's are the parameters we want to learn.
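Equation 1 is just an intercept plus a dot product, so it can be computed for many houses at once. Below is a minimal sketch, assuming NumPy; the function name predict, the feature values and the parameter values are all made up for illustration.

```python
import numpy as np

def predict(X, theta):
    """Return h = t0 + x1*t1 + ... + xn*tn for every row of X.

    X     : (m, n) array, one row per house, one column per feature
    theta : (n + 1,) array, theta[0] is the intercept t0
    """
    X = np.asarray(X, dtype=float)
    theta = np.asarray(theta, dtype=float)
    return theta[0] + X @ theta[1:]

# Hypothetical data: two houses, four made-up features each
# (size, age, rooms, floors) and made-up parameters t0..t4.
X = np.array([[2104.0, 45.0, 5.0, 1.0],
              [1416.0, 40.0, 3.0, 2.0]])
theta = np.array([80.0, 0.1, -0.5, 10.0, 5.0])

h = predict(X, theta)  # one predicted price per house
print(h)
```

Note that h has one entry per house, while theta has one entry per feature plus the intercept.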
So now our parameters (t0, t1, ..., tn) form an (n+1)-dimensional vector if there are n features in the training set, while the hypothesis (h) is still a single number for each training example. The cost function remains the same (J = 1/(2m) * sum[(h(i) - y(i))^2] over all m training points) with h calculated using equation 1. To optimize it we may use our loyal gradient descent algorithm with only a slight modification. The update rule, following the lecture slides (by Prof. Andrew Ng of Stanford University), is:
tj := tj - alpha * (1/m) * sum[(h(i) - y(i)) * xj(i)]     for j = 0, 1, ..., n, with x0(i) = 1
All we are doing in this case is updating the values of all the parameters simultaneously, i.e. every new tj is computed from the old values before any of them is overwritten.
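Here is a rough sketch of how the cost function and the simultaneous update might look in code, assuming NumPy and a feature matrix that already has a leading column of ones (so that t0 is handled like any other parameter). The function names, the toy data and the values of alpha and num_iters are my own choices for illustration, not something from the lectures.

```python
import numpy as np

def compute_cost(X, y, theta):
    """J = 1/(2m) * sum((h(i) - y(i))^2); X has a leading column of ones."""
    m = len(y)
    errors = X @ theta - y
    return (errors @ errors) / (2 * m)

def gradient_descent(X, y, alpha=0.1, num_iters=1000):
    """Batch gradient descent; returns the fitted theta and the cost history."""
    m, num_params = X.shape
    theta = np.zeros(num_params)
    history = []
    for _ in range(num_iters):
        errors = X @ theta - y            # h(i) - y(i) for every example
        gradient = (X.T @ errors) / m     # one partial derivative per parameter
        # Simultaneous update: every parameter uses the same old theta.
        theta = theta - alpha * gradient
        history.append(compute_cost(X, y, theta))
    return theta, history

# Toy data generated from y = 1 + 2*x1 + 3*x2 (made up for this sketch).
features = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
X = np.column_stack([np.ones(len(features)), features])
y = 1 + 2 * features[:, 0] + 3 * features[:, 1]

theta, history = gradient_descent(X, y, alpha=0.1, num_iters=5000)
print(theta)        # should end up close to the true values 1, 2, 3
print(history[-1])  # final cost, should be close to zero
```

Because the whole gradient is computed before theta is reassigned, the update is simultaneous by construction. Updating the parameters one by one inside a loop, each using the already-changed values of the others, would be a different algorithm.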
Now as you might guess, this process will become computationally a little expensive if we have a large data-set with a lot of features. In such cases usually there are two ways to speed up the process: 1) Feature Scaling using Mean Normalization; 2) Varying the learning rate.
1) Feature Scaling using Mean Normalization simply means that we replace each feature value (x1, x2, …) with (xj - mean(j))/range(j), where xj is the value of the jth feature, mean(j) is the mean of the jth feature column and range(j) is either the standard deviation or the max - min value of that feature column. This brings all the features onto a similar scale, which helps gradient descent converge faster.
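A minimal sketch of mean normalization, assuming NumPy. The function name mean_normalize, the use_std switch and the sample numbers are hypothetical; the guard against constant columns is my own addition to avoid dividing by zero.

```python
import numpy as np

def mean_normalize(X, use_std=False):
    """Scale each feature column to (x - mean) / range.

    range is max - min by default, or the standard deviation if use_std=True.
    Returns the scaled data plus the mean and scale used for each column.
    """
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    scale = X.std(axis=0) if use_std else X.max(axis=0) - X.min(axis=0)
    scale = np.where(scale == 0, 1.0, scale)  # leave constant columns unscaled
    return (X - mean) / scale, mean, scale

# Made-up example: house size and age live on very different scales.
X = np.array([[2104.0, 45.0],
              [1416.0, 40.0],
              [852.0, 36.0]])

X_scaled, mean, scale = mean_normalize(X)
print(X_scaled)
```

The returned mean and scale should be kept around: a new house has to be scaled with the same values before its price is predicted.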
2) Varying the learning rate means tuning the value of alpha in the update rule above. As stated in my last post, if alpha is too small, the algorithm may take a lot of time to converge, while if alpha is too large, the cost function may begin to diverge. So a suitable value of alpha must be chosen by trial and error.
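The trial and error can be as simple as running a fixed number of iterations for a few candidate values of alpha and comparing the final cost. The sketch below assumes NumPy; the toy data, the candidate values and the iteration count are arbitrary choices for illustration.

```python
import numpy as np

def final_cost(X, y, alpha, num_iters=50):
    """Run batch gradient descent from theta = 0 and return the final cost J."""
    m = len(y)
    theta = np.zeros(X.shape[1])
    for _ in range(num_iters):
        theta = theta - alpha * (X.T @ (X @ theta - y)) / m
    errors = X @ theta - y
    return (errors @ errors) / (2 * m)

# Toy one-feature data generated from y = 1 + 2*x (made up).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
X = np.column_stack([np.ones_like(x), x])
y = 1 + 2 * x

start_cost = (y @ y) / (2 * len(y))  # cost at theta = 0
costs = {alpha: final_cost(X, y, alpha) for alpha in (0.001, 0.01, 0.1, 1.0)}

for alpha, cost in costs.items():
    print(alpha, cost)
```

On this toy data the three smaller values all reduce the cost, with larger alpha getting further in the same number of steps, while alpha = 1.0 overshoots and the cost blows up. Which values work depends entirely on the data and on whether the features have been scaled.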
Now Gradient Descent is not the only algorithm that can help us optimize our cost function (J). As Prof. Ng mentioned, the Normal Equation is another great way to optimize our cost function. In this method we neither have to choose alpha nor iterate over multiple steps. This method gives the optimum point in one step. However, it is not suitable if the number of features is very large, because it involves inverting a matrix whose size grows with the number of features, which can be very slow. In such cases gradient descent works well even with a large number of features. Since this is a slightly complex algorithm with a lot of matrix notation, I will not discuss it here for fear of my post becoming too large and complex. But if it really interests you, then please go through the Week 2 lecture videos 6 and 7. However, for most of our cases, gradient descent works just fine.
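For the curious, the one-step solution is theta = (X^T X)^(-1) X^T y, where X is the feature matrix with a leading column of ones. Without going into the derivation, a minimal sketch assuming NumPy could look like this; the function name and the toy data are made up, and I use the pseudo-inverse so that the sketch still runs when X^T X is not invertible.

```python
import numpy as np

def normal_equation(X, y):
    """Solve for theta in one step: theta = pinv(X^T X) X^T y.

    X must already include the leading column of ones.
    """
    return np.linalg.pinv(X.T @ X) @ X.T @ y

# Toy data generated from y = 1 + 2*x1 + 3*x2 (made up for this sketch).
features = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
X = np.column_stack([np.ones(len(features)), features])
y = 1 + 2 * features[:, 0] + 3 * features[:, 1]

theta = normal_equation(X, y)
print(theta)  # should be close to the true values 1, 2, 3
```

There is no alpha and no loop here, but the matrix X^T X has one row and one column per parameter, which is where the cost comes from when there are many features.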
In the next post I will talk about Logistic Regression and Classification.