Machine Learning

Random Forest – The Evergreen Classifier

Disclaimer: Some of the terms used in this article may seem too advanced for an absolute novice in the fields of machine learning or statistics. I have tried to include supplementary resources as links which can be used for better understanding. All in all, I hope that this article motivates you to try solving a problem of your own with random forest.

In the last few weeks I have been working on some classification problems involving multiple classes. My first approach, after cleaning the data-set and pre-processing it for categorical outputs, was to go with the simplest classification algorithm that I knew – Logistic Regression. Logistic regression is a very simple classifier that uses the sigmoid function's output to classify the labels. It is very well suited to a binary classification problem in which there are only two possible outcomes. However, it can also be tweaked to classify multiple classes by using one-vs-one or one-vs-all approaches. Similar approaches can be taken with Support Vector Machines as well. The accuracy I got was around 88% on the training set and about 89% on my cross-validation set after a few hours of parameter tuning. This was good, but as I researched more, I came across Decision Trees and their bootstrap-aggregated version, Random Forest. A few minutes into the algorithm's documentation (by the person who coined the term bagging, Prof. Breiman), I was amazed by its robustness and functionality. It was like an all-in-one algorithm for classification, regression, clustering and even filling in missing values in the data-set. No other machine learning algorithm caught my attention as much as it did. In this article I will try to explain the working of the algorithm and the features which make it an evergreen algorithm.

A random forest works by creating multiple classification trees. Each tree is grown as follows:

  1. If the number of cases in the training set is N, sample N cases at random – but with replacement, from the original data. This sample will be the training set for growing the tree.
  2. If there are M input variables, a number m<<M is specified such that at each node, m variables are selected at random out of the M and the best split on these m is used to split the node. The value of m is held constant during the forest growing.
  3. Each tree is grown to the largest extent possible. There is no pruning.

To classify a new object from an input vector, put the input vector down each of the trees in the forest. Each tree gives a classification, and we say the tree “votes” for that class. The forest chooses the classification having the most votes (over all the trees in the forest).
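To make the voting idea concrete, here is a minimal sketch using scikit-learn's RandomForestClassifier. The library, the iris data-set and the hyper-parameter values are my own illustrative choices, not part of Breiman's description:

```python
# A minimal sketch of a random forest classifier with scikit-learn.
# The data-set and hyper-parameter values below are illustrative choices.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

forest = RandomForestClassifier(
    n_estimators=100,      # number of trees, each grown on a bootstrap sample
    max_features="sqrt",   # m << M variables tried at each split
    oob_score=True,        # keep the out-of-bag estimate (see below)
    random_state=42,
)
forest.fit(X_train, y_train)

# Each tree "votes"; predict() and score() use the majority class per sample.
print("Test accuracy:", forest.score(X_test, y_test))
print("Out-of-bag estimate:", forest.oob_score_)
```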

One of the great features of this is that it eliminates the need for a separate cross-validation or test set to estimate the error, since each tree is constructed using a different bootstrap sample from the original data. About one-third of the cases are left out of the bootstrap sample and not used in the construction of the kth tree; these "out-of-bag" cases can be run down that tree to get an unbiased estimate of the error.

The algorithm also gives you an idea about the importance of various features in the data-set. As this article mentions, “In every tree grown in the forest, put down the out-of-bag cases and count the number of votes cast for the correct class. Now randomly permute the values of variable m in the out-of-bag cases and put these cases down the tree. Subtract the number of votes for the correct class in the variable-m-permuted out-of-bag data from the number of votes for the correct class in the untouched out-of-bag data. The average of this number over all trees in the forest is the raw importance score for variable m.”
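Here is a rough sketch of that permutation idea, simplified so that it works on a single held-out set rather than per-tree out-of-bag cases; the variable names reuse the forest, X_test and y_test from the sketch above:

```python
# Simplified permutation importance: shuffle one feature on a held-out set
# and record the drop in accuracy. This is a simplification of Breiman's
# per-tree out-of-bag procedure, kept short for illustration.
import numpy as np

rng = np.random.default_rng(0)
baseline = forest.score(X_test, y_test)

for m in range(X_test.shape[1]):
    X_permuted = X_test.copy()
    rng.shuffle(X_permuted[:, m])          # destroy the information carried by feature m
    drop = baseline - forest.score(X_permuted, y_test)
    print(f"feature {m}: importance ~ {drop:.3f}")
```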

Do check out the page by Berkeley to learn more about the other great features of the algorithm, such as:

  • Outlier Detection
  • Proximity Measure
  • Missing Value replacement for training and test sets
  • Scaling
  • Modelling quantitative outputs (regression), and more

But one thing is undisputed: Random Forest is among the most powerful algorithms out there for classification, and there are off-the-shelf versions that can be used for many typical problems.

As a last note, do check out the photo gallery of Prof. Breiman to get a better idea of his life and his work. I could not help but feel motivated after going through it.


Stochastic Gradient Descent – for beginners

Warning: This article contains only one mathematical equation which can be understood even if you have only passed high school. No other mathematical formulas are present. Reader discretion is advised.

If you have ever taken any Machine Learning course or even tried to read a bit about regression, it is inevitable that you will come across the term Gradient Descent. The name contains all the logic behind the algorithm: descend down a slope. Gradient Descent is a way to minimize a function by determining the slope of the function and then taking a small step in the opposite direction of that slope, i.e. going a step downhill. As we go through multiple iterations, we reach a valley (a minimum).

The equation for the algorithm is:

θ = θ – η · ∇J(θ)                                                                              equation (1)

Here ∇J(θ) is the gradient (the slope) of the function J(θ). We multiply it by a learning rate parameter η, which determines how big a step we are going to take, and then adjust our parameter θ in the opposite direction of this gradient.

[Image: grad_desc – gradient descent stepping downhill along the slope towards the minimum]

The image above should make it clearer.
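To see equation (1) in action, here is a tiny sketch on a made-up one-dimensional example; the function J(θ) = (θ − 3)² and the learning rate are my own illustrative choices:

```python
# A tiny sketch of equation (1) in one dimension.
# J(theta) = (theta - 3)^2 and eta = 0.1 are made-up illustrative choices.
def grad_J(theta):
    return 2 * (theta - 3)            # derivative of (theta - 3)^2

theta = 0.0                           # arbitrary starting point
eta = 0.1                             # learning rate
for _ in range(50):
    theta = theta - eta * grad_J(theta)   # equation (1)

print(theta)   # approaches 3, the bottom of the "valley"
```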

Now this gradient calculation and update is a resource-intensive step. By some estimates, if an objective function takes n steps to compute, its gradient takes roughly three times as many. We also have lots of data, and our gradient descent has to go over it many times. This step has to be repeated for all the θs and all the rows of the data-set. All this requires a huge amount of computing power.

But we can cheat. Instead of computing the exact objective or loss function, we will compute an estimate of it, a very bad estimate. We will compute the loss for some random sample of the training data, compute the gradient only for that sample, and pretend that this gradient is the right direction to go.

So now each step is a very small step, but the price we pay is a higher number of steps to reach the minimum instead of one larger step.

However, computationally, we win by a huge margin overall. This technique of using sampling for the gradient update is called Stochastic Gradient Descent. It scales well with both the data and the model size, which is great since we want both big data and big models.

SGD is, however, a pretty bad optimizer on its own and comes with a lot of issues in practice. I would suggest Sebastian Ruder's blog for more detailed explanations, variations and implementations.

Some tips to help Stochastic Gradient Descent: normalize inputs to zero mean and equal variances; use random weights with zero mean and equal variances as starting points.
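Putting the pieces together, here is a rough sketch of mini-batch stochastic gradient descent for a linear model in NumPy. The synthetic data, batch size and learning rate are illustrative assumptions; note that the inputs already have zero mean and the starting weights are small and random, as suggested above:

```python
# A rough sketch of mini-batch SGD for a linear model with a squared loss.
# The synthetic data, batch size and learning rate are illustrative choices.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 3))                  # inputs, already zero mean / unit variance
true_theta = np.array([2.0, -1.0, 0.5])
y = X @ true_theta + rng.normal(scale=0.1, size=1000)

theta = rng.normal(scale=0.01, size=3)          # small random starting weights
eta, batch_size = 0.05, 32

for _ in range(2000):
    idx = rng.integers(0, len(X), size=batch_size)    # random sample of the data
    Xb, yb = X[idx], y[idx]
    grad = 2 * Xb.T @ (Xb @ theta - yb) / batch_size  # gradient of the loss on the sample only
    theta -= eta * grad                               # equation (1) with the estimated gradient

print(theta)   # should land close to [2.0, -1.0, 0.5]
```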

 

#3 Linear Regression with Multiple Variables

In the previous post I talked about the simple case in which we were able to fit our data-set with the help of only one variable. But life is not so simple. In most real-life scenarios the data-set will have more than one feature, i.e. the output may depend on more than one variable. Taking the case of the previous example of housing prices, the price of the house may not only depend on the size of the house but also on, say, the age of the house, the number of rooms in the house, etc. In such cases using the simple linear regression (h = t0 + t*x) may not work.

Rather we must formulate a new equation which takes care of all the parameters. It turns out this is also quite simple. So suppose you have a data-set with, say, 4 features. We can then model it using the equation:

h = t0 + x1*t1 + x2*t2 + x3*t3 + x4*t4                                                                        equation 1

where x1 is the value of feature 1 (say age of the house), x2 is the value of the second feature (say number of rooms in the house) and so on.

So now our parameter vector (the t's) is an (n+1)-dimensional vector if there are n features in the training set. The cost function remains the same (J = 1/(2m) * sum( (h(i) – y(i))^2 ) over all points) with h calculated using equation 1. To optimize it we may use our loyal gradient descent algorithm with only a slight modification. Shown below are the equations taken directly from the lecture slides (by Prof. Andrew Ng of Stanford University)

[Image: mutivar_gd – gradient descent update equations for multiple variables, from the lecture slides]

What we are simply doing for this case is updating the values for all the parameters simultaneously.
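Here is a minimal sketch of that simultaneous update in NumPy. The housing numbers are made up, and a column of ones is prepended so that t0 (the intercept) is updated along with the other t's:

```python
# A minimal sketch of the simultaneous update for all parameters.
# The housing numbers are made up; the leading column of ones lets t0
# (the intercept) be updated together with the other t's.
import numpy as np

X = np.array([[1, 2104, 3],
              [1, 1416, 2],
              [1, 1534, 3],
              [1,  852, 2]], dtype=float)       # [1, size, number of rooms]
y = np.array([460.0, 232.0, 315.0, 178.0])      # prices (illustrative values)
m = len(y)

t = np.zeros(X.shape[1])    # [t0, t1, t2]
alpha = 1e-7                # tiny learning rate because the features are unscaled

for _ in range(1000):
    h = X @ t                           # hypothesis for every training example
    gradient = X.T @ (h - y) / m        # partial derivative of J for each t
    t = t - alpha * gradient            # update every parameter simultaneously

print(t)
```

Notice how tiny the learning rate has to be when the features are left unscaled; this is exactly the problem the next two points address.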

Now as you might guess, this process will become computationally a little expensive if we have a large data-set with a lot of features. In such cases usually there are two ways to speed up the process: 1) Feature Scaling using Mean Normalization; 2) Varying the learning rate.

1) Feature Scaling using Mean Normalization simply means that we replace each feature (x1, x2, …) with (x(i) – mean(i))/range, where x(i) is the ith feature, mean(i) is the mean of the ith feature column and range is either the standard deviation or the max-min value of that feature column (a quick sketch appears after point 2 below).

2) Varying the learning rate means adjusting the value of alpha in the above equation. As stated in my last post, if alpha is too small, the algorithm may take a lot of time to converge, while if alpha is too high, the cost function may begin to diverge. So an optimum value of alpha must be chosen by trial and error.
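Here is a quick sketch of point 1 in NumPy, using the standard deviation as the range; the feature values are the same made-up housing numbers from the earlier sketch. After scaling like this, a much larger learning rate usually works:

```python
# A quick sketch of mean normalization, using the standard deviation
# as the "range". The feature values are made-up housing numbers.
import numpy as np

X = np.array([[2104, 3],
              [1416, 2],
              [1534, 3],
              [ 852, 2]], dtype=float)          # [size, number of rooms]

X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
print(X_scaled)   # every column now has zero mean and comparable spread
```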

 

Now Gradient Descent is not the only algorithm that can help us optimize our cost function (J). As Prof. Andrew mentioned, the Normal Equation is another great way to optimize our cost function. In this method we neither have to choose alpha nor iterate over multiple steps; it gives the optimum point in one step. However, it is not suitable if the number of features is very large, because it might be very slow. In such cases gradient descent works well even with a large number of features. Since this is a slightly more involved method with a lot of matrix notation, I will not discuss it here for fear of my post becoming too long and complex. But if it really interests you, then please go through the Week 2 lecture videos 6 and 7. However, for most of our cases, gradient descent works just fine.
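For completeness, here is what that one-step solution looks like in code. This is the standard normal equation, theta = (XᵀX)⁻¹Xᵀy, applied to the same made-up housing arrays used earlier:

```python
# A sketch of the normal equation: theta = (X^T X)^(-1) X^T y.
# X and y are the same made-up housing arrays used in the earlier sketch;
# np.linalg.pinv is used so the code also works when X^T X is singular.
import numpy as np

X = np.array([[1, 2104, 3],
              [1, 1416, 2],
              [1, 1534, 3],
              [1,  852, 2]], dtype=float)
y = np.array([460.0, 232.0, 315.0, 178.0])

theta = np.linalg.pinv(X.T @ X) @ X.T @ y
print(theta)   # the optimum parameters in one step, with no alpha and no iterations
```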

In the next post I will talk about Logistic Regression and Classification.

Stay tuned….

 

#2 – Linear Regression with One Variable

In my last post I talked about two types of machine learning algorithms – supervised and unsupervised. The linear regression model comes under the first category. A regression model is basically used to predict a real-valued output. So suppose you have a data-set that contains the prices of houses of different sizes, and you want to predict the price for a house of some particular size not present in the data-set.

Let's assume that the data-set looks like the one below, and you wish to predict the price for a house of size 1250 sq. ft (I am using the lecture example here).

[Image: vlcsnap-2014-07-24-21h22m53s219 – housing prices plotted against house size, from the lecture]

The most logical and simple way to do so would be to draw a straight line preferably through the middle of all the points on the graph. We can certainly do this easily if there were only a few points on the graph, perhaps four or maybe five. But for a larger number of data points, this task becomes too difficult to solve by trial and error.

But thankfully it turns out that there is an algorithm that can do this efficiently for us. It is the “Gradient Descent” algorithm. But before I go into its details, let me talk about another concept of “Cost Function” which will be used in the gradient descent discussion.

The Cost Function lets us figure out how to fit the best possible straight line to our data. To draw a straight line you basically need two parameters – the intercept on the y-axis and the slope of the line. So let's denote our straight line by

h = t0 + t*x                                                         equation 1

where t0 is the y-intercept and t is the slope of the line (t stands for theta and h stands for hypothesis). Now we take each point on the x-axis and calculate the corresponding value of the hypothesis 'h' using the above equation. These are the values that we predict for some particular choice of t0 and t. But they may not give us a line that correctly represents our data.

We solve this problem by using the concept of “Cost Function”. We denote our cost function by

J = 1/(2m) * sum( (h(i) – y(i))^2 )  over all points            equation 2

where h(i) is the value of h for the ith x point and y(i) is the original value of the point on the y-axis corresponding to the ith x value. m is the total number of points on the graph, or the number of training set examples.

This cost function basically tells us how good our line is at representing the data-set. If the points predicted by our line are very far away from the actual data-set values, then the cost function will be very high and we will have to vary the values of t0 and t to get a new line. In the end we basically want to find those values of t0 and t that minimize our cost function. This is where gradient descent comes into the picture.
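As a quick illustration, here is equation 2 computed directly in code; the four data points and the two candidate lines are made-up values:

```python
# A tiny sketch of equation 2: the cost of a candidate line h = t0 + t*x.
# The data points and the candidate values of t0 and t are made up.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])        # e.g. house sizes (in some unit)
y = np.array([2.5, 4.0, 6.0, 7.5])        # e.g. prices

def cost(t0, t):
    h = t0 + t * x                         # predictions of the candidate line
    return np.sum((h - y) ** 2) / (2 * len(x))

print(cost(0.0, 1.0))   # a poor line gives a high cost
print(cost(0.5, 1.8))   # a line close to the data gives a much lower cost
```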

Gradient Descent is our algorithm for minimizing the cost function J.

The way it does this is that it assumes starting values of t0 and t and then keeps changing them until we hopefully reach a minimum value. The equation to calculate the subsequent values of t0 and t is,

[Image: gradient_des – the gradient descent update equations for t0 and t, from the lecture slides]

The alpha in the above equation basically controls how big a step you take downwards. If alpha is too small, the algorithm will be very slow and take a long time to converge; if alpha is too large, the algorithm may overshoot the minimum and fail to converge, or may even diverge. Thus an optimal value for alpha must be selected by trial. Usually it becomes clear what the optimal value is after running the algorithm once.
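Here is a short sketch of that update loop, reusing x, y and cost() from the snippet above. The two derivative expressions are the standard partial derivatives of equation 2, and alpha and the iteration count are arbitrary choices:

```python
# A short sketch of the gradient descent loop for equation 2, reusing
# x, y and cost() from the snippet above. alpha and the number of
# iterations are arbitrary illustrative choices.
t0, t = 0.0, 0.0
alpha = 0.05
m = len(x)

for _ in range(2000):
    h = t0 + t * x
    t0_new = t0 - alpha * np.sum(h - y) / m          # step for the intercept
    t_new  = t  - alpha * np.sum((h - y) * x) / m    # step for the slope
    t0, t = t0_new, t_new                            # update both simultaneously

print(t0, t, cost(t0, t))   # the cost is now close to its minimum
```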

As an example (the one used in lecture), this is the model when random values of t0 and t are chosen,

[Image: initial – the fitted line and cost contours for a random initial choice of t0 and t]

And this is the result after running through four iterations of varying t0 and t. The contour lines denote points which have the same cost.

[Image: four iter – the fitted line and cost contours after four iterations]

And this is the final result. The red crosses in the right figures show the path followed by the gradient descent to reach the minimum.

[Image: final_iter – the final fitted line and the gradient descent path on the cost contours]

In this way we have a final prediction model for our data-set, and now we can hopefully predict the correct price for any given house size.

This was Linear Regression with one variable. In my next post I will talk about Linear Regression with Multiple Variables.

Stay tuned….

 

 

How to teach a metal box?

Machine learning is one of the most exciting recent technologies. Every time you Google something or Facebook recommends a photo tag, machine learning comes into the picture. If there is an application which you just cannot program by hand, say writing a computer program to make a helicopter fly, we make use of machine learning by letting the computer learn how to fly the helicopter by itself. Most of natural language processing and most of computer vision today is applied machine learning.

So how do we define Machine Learning?

Arthur Samuel, an American pioneer in the field of computer gaming and artificial intelligence, defines it as:

“the field of study that gives computers the ability to learn without being explicitly programmed”

Here is a slightly more recent definition of an ML algorithm by Tom Mitchell, who is currently the Chair of the Machine Learning Department at Carnegie Mellon University,

“a computer program is said to learn from experience E, with respect to some task T, and some performance measure P, if its performance on T as measured by P improves with experience E”

So for example, take a spam filter. Its task T would be to categorize mail as spam or not spam; the experience E would be watching you mark mails as spam or not spam; and its performance P would be the number of mails that it is able to correctly classify as spam. If you are using an efficient algorithm, then your filter should make more and more correct choices over time.

There are several different types of learning algorithms. The two main types are supervised learning and unsupervised learning.

In supervised learning we give the algorithm a data set and also tell it what the correct answers are. All the algorithm has to do is try to find a pattern in the given values so that it can predict what the outcome will be for the next set of inputs. When the output is a real-valued number, this is known as a regression problem. To explain this I will use the example given by Prof. Andrew in class.

[Image: vlcsnap-2014-07-20-17h10m57s84 – housing prices plotted against house size, from the lecture]

So suppose you have a data set with the housing prices for houses of different sizes. You want your program to predict the price of a house given its size. To do this you will first tell the algorithm the prices for the different sizes and then let it predict the price for any other input size. You then tell the algorithm whether its prediction was correct, or the difference between the correct price and the predicted value. This way, the algorithm will try to correct itself and learn to draw a line that fits the data, eventually giving correct predictions.

So for each example in Supervised Learning, we are explicitly told what the so-called right answer is.

In Unsupervised Learning, we're given data that doesn't have any labels. So we're given the data set, we're not told what to do with it, and we're not told what each data point is. Instead we're just told: here's a bunch of data, I don't know what's in it or what the different types are, but can you automatically find some structure in the data? An example of this is Google News. The algorithm automatically decides and creates clusters of similar news headlines on various topics by itself.

In my next post I will start to talk about more specific learning algorithms, how these algorithms work and how we can go about implementing them.

Hello World – customary first post

So I recently started to delve into the interesting topic of Machine Learning. I have started the Machine Learning course on Coursera which is being taught by Prof. Andrew from Stanford University. Pretty soon I realized that the topic is vast and, due to the weekly nature of the updates, recalling old lectures and concepts was becoming more and more difficult.

So here it is! My first blog ever.

In this blog I will not only try to keep a record of what is being taught in the Machine Learning class but also of all the other courses that I am taking through various MOOC platforms like edX, Udacity and, of course, Coursera. I will also try to put up some guides on certain other topics like LabVIEW, since I am also an intern at National Instruments – R&D, Bangalore. There might even be some posts and updates on topics like Statistics and R Programming and maybe some posts on my life (well, it's my blog after all :P).

Stay tuned for more …