Machine Learning

Learning Artificial Intelligence with Udacity

Recently I wrote about my experience with Udacity's Self-Driving Car Nanodegree (SDCND).

While pursuing this Nanodegree, I was so thrilled by the course material that I decided to enroll in another Nanodegree from Udacity at the end of Term 2 of the SDCND: the Artificial Intelligence Nanodegree. The first two terms of the SDCND had helped me master the basics of Deep Learning, and I wanted to explore applications of Deep Learning in other domains like Natural Language Processing (think IBM Watson) and Voice User Interfaces (think Amazon Alexa). The AI-ND seemed like the perfect place to do this, partly because of my fantastic experience with the previous Udacity Nanodegrees.

The Artificial Intelligence ND is a bit different from the other NDs. There are a total of four terms, and you need to pay for and complete two of them in order to graduate. If you wish, you can enroll in and complete the other modules as well.

The first term is common and compulsory for all. It teaches the foundations of AI: Game Playing, Search, Optimization, Probabilistic AI, and Hidden Markov Models. The topics are taught by some of the pioneers of AI, like Prof. Sebastian Thrun, Prof. Peter Norvig, and Prof. Thad Starner. All the topics are covered in detail, with links to research papers and book chapters for further study.

The course begins with an interesting project: writing a program to solve Sudoku using the concepts of Search and Constraint Propagation. You get an opportunity to play with various heuristics as you try to design an optimal strategy for the puzzle.
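To give a flavour of what constraint propagation looks like in code, here is a minimal sketch of the "elimination" step. It assumes a hypothetical representation of the board as a dict mapping each box label to a string of candidate digits, plus a precomputed peers mapping; the actual project layers further strategies (only-choice, naked twins, search) on top of this.

```python
def eliminate(values, peers):
    """One constraint-propagation pass: remove each solved box's digit
    from the candidate strings of all its peers.

    values: dict mapping a box label (e.g. 'A1') to a string of candidate digits
    peers:  dict mapping a box label to the set of boxes sharing its row/column/square
    (both structures are assumptions made for this sketch)
    """
    solved = [box for box, digits in values.items() if len(digits) == 1]
    for box in solved:
        digit = values[box]
        for peer in peers[box]:
            values[peer] = values[peer].replace(digit, "")
    return values
```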

Game Playing example

The next project builds on this by implementing an adversarial search agent to play the game of Isolation. The topics covered included Minimax, Alpha-Beta pruning, Iterative Deepening, etc. The project also required an analysis of a research paper; I reviewed the famous AlphaGo paper, and my review can be found on my GitHub project page.
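To show what the agent's core looks like, here is a hedged sketch of depth-limited minimax with alpha-beta pruning. The game object and its methods (legal_moves, apply, is_terminal, evaluate) are hypothetical stand-ins for the project's Isolation board API, not the actual interface.

```python
import math

def alphabeta(state, depth, alpha, beta, maximizing, game):
    """Depth-limited minimax with alpha-beta pruning (sketch only)."""
    if depth == 0 or game.is_terminal(state):
        return game.evaluate(state)          # heuristic value of the position
    if maximizing:
        value = -math.inf
        for move in game.legal_moves(state):
            value = max(value, alphabeta(game.apply(state, move),
                                         depth - 1, alpha, beta, False, game))
            alpha = max(alpha, value)
            if alpha >= beta:                # the minimizer will avoid this branch
                break
        return value
    else:
        value = math.inf
        for move in game.legal_moves(state):
            value = min(value, alphabeta(game.apply(state, move),
                                         depth - 1, alpha, beta, True, game))
            beta = min(beta, value)
            if alpha >= beta:                # the maximizer will avoid this branch
                break
        return value
```

Iterative deepening simply calls this with depth = 1, 2, 3, … until the move-time budget runs out, keeping the best move found so far.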

From game-playing agents we moved on to the domain of planning problems. I experimented with various automatically generated heuristics, including planning-graph heuristics, to solve the problems. Like the previous project, this one also required a research review.

From planning, we moved to the domain of probabilistic inference. The final project of Term 1 required an understanding of Hidden Markov Models to design a sign-language recognizer. You also get an understanding of different model-selection techniques such as log likelihood with cross-validation folds, the Bayesian Information Criterion (BIC) and the Discriminative Information Criterion (DIC).
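As an illustration of how BIC-based selection can look in practice, here is a hedged sketch using the hmmlearn package (an assumption for this sketch, not necessarily what the project used). The parameter count is the usual one for a diagonal-covariance Gaussian HMM; lower BIC is better.

```python
import numpy as np
from hmmlearn.hmm import GaussianHMM  # assumed dependency for this sketch

def bic_score(X, lengths, n_states):
    """Fit a Gaussian HMM with n_states hidden states and return its BIC."""
    model = GaussianHMM(n_components=n_states, covariance_type="diag",
                        n_iter=1000, random_state=0).fit(X, lengths)
    log_likelihood = model.score(X, lengths)
    n_features = X.shape[1]
    # free parameters: transitions n(n-1) + start probs (n-1) + means n*d + variances n*d
    n_params = n_states * (n_states - 1) + (n_states - 1) + 2 * n_states * n_features
    return -2 * log_likelihood + n_params * np.log(len(X))

# pick the number of hidden states with the lowest BIC, e.g.:
# best_n = min(range(2, 6), key=lambda n: bic_score(X, lengths, n))
```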

The next term focused on the concepts and applications of Deep Learning. It covered the basics, like Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs) and semi-supervised learning, and then moved on to the latest developments in the field, like Generative Adversarial Networks (GANs). At the end of the module there was an option to choose a specialization from three tracks: Computer Vision, Natural Language Processing and Voice User Interfaces. Since the SDCND had already exposed me to computer vision, and I had worked on some NLP projects and gone through Stanford's CS224d to some extent, I decided to pursue the Voice User Interfaces specialization. The project involved building a deep neural network that functions as part of an end-to-end automatic speech recognition (ASR) pipeline: it accepts raw audio as input and returns a predicted transcription of the spoken language. Some of the network architectures I experimented with were RNN; RNN + TimeDistributed Dense; CNN + RNN + TimeDistributed Dense; Deeper RNN + TimeDistributed Dense; and Bidirectional RNN + TimeDistributed Dense.
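For a sense of what the "CNN + RNN + TimeDistributed Dense" architecture looks like, here is a hedged Keras sketch. The input dimension (161 spectrogram features) and output dimension (29 characters) are illustrative assumptions, and the real pipeline also needs a CTC loss and a decoder on top of the softmax outputs.

```python
from tensorflow.keras.layers import (Activation, BatchNormalization, Conv1D,
                                     Dense, GRU, Input, TimeDistributed)
from tensorflow.keras.models import Model

def cnn_rnn_model(input_dim=161, filters=200, units=200, output_dim=29):
    """Acoustic model sketch: 1-D convolution, recurrent layer, per-step classifier."""
    inputs = Input(shape=(None, input_dim))                  # variable-length spectrograms
    x = Conv1D(filters, kernel_size=11, strides=2,
               padding="valid", activation="relu")(inputs)   # local feature extraction
    x = BatchNormalization()(x)
    x = GRU(units, return_sequences=True)(x)                 # temporal modelling
    x = BatchNormalization()(x)
    x = TimeDistributed(Dense(output_dim))(x)                # character scores per time step
    outputs = Activation("softmax")(x)
    return Model(inputs=inputs, outputs=outputs)
```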

One of the major features of the projects was the research component. To pass any project you had to provide detailed scientific reasoning and empirical evidence for your implementations. This helped me develop critical thinking and efficient problem solving. As with any Nanodegree, the course was full of interactions with people from around the world and from all parts of the industry. It was also heavily focused on applications, which kept me excited for the entire six months.

I have continued my learning from this course by following the books "Artificial Intelligence: A Modern Approach" by Stuart Russell and Peter Norvig and "Deep Learning" by Ian Goodfellow, Yoshua Bengio and Aaron Courville. I still have a long way to go before I master this fascinating field, but the Nanodegree has definitely shown me the way forward.


Udacity's Self-Driving Car Engineer Nanodegree

Around September 2016, Udacity announced a one-of-its-kind program. It spanned almost 10 months and promised to teach you the basics of one of the most interesting and exciting technologies in the industry. It was designed by some of the pioneers in the field, like Prof. Sebastian Thrun, and was offered online, in the comfort and convenience of your home. The course had also bagged industry partnerships with Nvidia and Mercedes, among others. The program was the Self-Driving Car Engineer Nanodegree, and it required proficiency in the basics of programming and machine learning to be eligible for enrollment.

A snapshot from my final capstone project

Without wasting a minute, I logged into my Udacity account and registered for the course. I had already completed a lot of online courses on topics of my interest, and the Nanodegree seemed like a great place not only to learn about the amazing technologies behind autonomous vehicles, but also to gain experience designing my own self-driving car. The course promised students an opportunity to run their final project on a real vehicle, implementing functionalities like Drive-by-Wire, Traffic Light Detection and Classification, Steering, Path Planning, etc. I was selected for the November cohort of the course and officially received my access on November 29, 2016.

My Advanced Lane Detection Project from Term 1

Today, three months after completing the Nanodegree, I look back at the course as one of the best investments of my time and money. The lectures were very well designed and structured, and the three terms were meticulously planned. The first term introduced the concepts of Computer Vision and Deep Learning. The projects involved a lot of scripting with Python and TensorFlow to solve problems like lane and curvature detection, vehicle detection and steering-angle prediction. The application-oriented nature of the projects made it even more interesting.

My Vehicle Detection Project from Term 1

Term 2 focused on the control side of things, covering Sensor Fusion, Localization and Control. This term was heavily dominated by C++ and algebra. The projects included implementing Extended and Unscented Kalman Filters for tracking non-linear motion, localization using Markov Localization and a Particle Filter, and Model Predictive Control to drive the vehicle around a track. I learnt many new things in this term, from C++ programming, to the mathematics behind the Kalman Filter, Particle Filter and MPC, to their algorithmic implementations.

My Model Predictive Controller project from Term 2

The final term focused on stitching together the various topics taught so far and applying them to create your own autonomous vehicle. The topics included path planning, semantic segmentation (scene understanding), functional safety and, finally, the capstone project.

My Path Planning project from Term 3

What set the entire Nanodegree apart from other courses was its novelty. There is no other course out there that can teach you so much, in so much depth, in such a short time. The course also provided me with a collated set of learning resources. Apart from the well-designed lecture videos, quizzes and projects, one of the most rewarding experiences was interacting with people from around the world. Everyone taking the course was excited and eager to share their knowledge and help others. The Slack channels and the Udacity discussion forums were full of activity. I interacted with people from the USA to Germany to Japan, and discussed the projects and lectures with people from different academic and professional backgrounds, from a freshman to a Vice President of Engineering. These interactions not only helped me build a worldwide network but also opened my eyes to the opportunities around me. I also got to explore some open courses like Stanford's CS231n, the materials for which are freely available online. The amazing support of my peers and mentors played a huge role in helping me master the material.

The Nanodegree took a lot of time and effort to complete. Since I also pursued the optional material, which was mostly research papers, it took me longer than average to finish. However, the effect of the course was so profound that I still go back to the material for revision, interact with new students on Slack and discuss the projects over WhatsApp. The course changed the way I approach problems and provided me with a solid base for future research. I hope that Udacity launches a more advanced version of the course soon.

My implementation for one of the Term 3 optional projects — Object Detection with R-FCN


Becoming a better programmer

You have just learnt the basics of a programming language at school, at college or through an online course. You know the building blocks of a program and can solve some simple problems with small amounts of code.

But somehow, when you write code in a professional capacity, you find yourself constantly reworking it and spending long meetings discussing it. Maybe you are not as good a programmer as you thought you were?

I faced exactly this problem a couple of years back, when I started my career as a software developer working on large code bases and building software that would run in production and impact thousands of systems. Fortunately, I had the support of extremely patient peers and colleagues who were kind enough to spend some of their valuable time guiding me. These were people with 15 to 20 years of experience writing programs that were efficient, easy to debug and easy to modify as requirements changed.

In this post I will list a few resources recommended by these seasoned programmers and explain why every programmer should have a look at them. Going through these resources changed the way I approached problems and made me realize how much knowledge there still is to gain.

Here are some of the recommended readings for anyone who wants to program for a living:

  1. The Pragmatic Programmer by Andrew Hunt and David Thomas
  2. Head First Design Patterns by Eric Freeman and Elisabeth Robson
  3. Structure and Interpretation of Computer Programs by Gerald Jay Sussman and Hal Abelson
  4. Introduction to Algorithms by Cormen, Leiserson, Rivest and Stein
  5. Modern C++ Programming with Test-Driven Development by Jeff Langr

If you are working on analytics solutions, using machine learning in your work, and want a better understanding of the algorithms you use, then you should also have a look at these books:

  1. Machine Learning by Tom Mitchell
  2. An Introduction to Statistical Learning by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani
  3. The Elements of Statistical Learning by Jerome H. Friedman, Robert Tibshirani, and Trevor Hastie
  4. Deep Learning by Ian Goodfellow,‎ Yoshua Bengio and Aaron Courville
  5. Principles of Data Mining by David J. Hand, Heikki Mannila, and Padhraic Smyth
  6. Modeling the Internet and the Web by Pierre Baldi, Paolo Frasconi and Padhraic Smyth

The second list of books is not light reading and requires a serious amount of devotion. I am still reading some of them two years on, going back multiple times for a better understanding. However, they are all worth your time and will reward you over the coming years as you work on more complex problems and design more sophisticated systems.

Good luck on your journey to becoming a better programmer 🙂



Random Forest – The Evergreen Classifier

Disclaimer: Some of the terms used in this article may seem too advanced for an absolute novice in machine learning or statistics. I have tried to include supplementary resources as links for better understanding. All in all, I hope this article motivates you to try solving a problem of your own with random forest.

In the last few weeks I have been working on some classification problems involving multiple classes. After cleaning the data-set and pre-processing it for categorical outputs, my first approach was to go with the simplest classification algorithm I knew: Logistic Regression. Logistic regression is a very simple classifier that uses the sigmoid output to assign labels. It is well suited to binary classification, where there are only two possible outcomes, but it can be extended to multiple classes using one-vs-one or one-vs-all approaches (the same approaches work for Support Vector Machines). After a few hours of parameter tuning I got around 88% accuracy on the training set and about 89% on my cross-validation set. This was good, but as I researched more I came across Decision Trees and their bootstrap-aggregated ("bagged") version, Random Forest. A few minutes into the algorithm's documentation (by Prof. Breiman, who coined the term bagging), I was amazed by its robustness and functionality. It is like an all-in-one algorithm for classification, regression, clustering and even filling in missing values in the data-set. No other machine learning algorithm has caught my attention as much. In this article I will try to explain how the algorithm works and the features that make it an evergreen algorithm.
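For context, here is a small hedged sketch of the kind of baseline comparison I describe, using scikit-learn and a stand-in multi-class data-set (iris). The accuracy figures above came from my own data-set, not from this snippet.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)  # stand-in multi-class data-set for illustration

log_reg = LogisticRegression(max_iter=1000)          # multi-class handled internally
forest = RandomForestClassifier(n_estimators=200, random_state=0)

print("logistic regression:", cross_val_score(log_reg, X, y, cv=5).mean())
print("random forest:      ", cross_val_score(forest, X, y, cv=5).mean())
```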

A random forest works by creating multiple classification trees. Each tree is grown as follows:

  1. If the number of cases in the training set is N, sample N cases at random – but with replacement, from the original data. This sample will be the training set for growing the tree.
  2. If there are M input variables, a number m<<M is specified such that at each node, m variables are selected at random out of the M and the best split on these m is used to split the node. The value of m is held constant during the forest growing.
  3. Each tree is grown to the largest extent possible. There is no pruning.

To classify a new object from an input vector, put the input vector down each of the trees in the forest. Each tree gives a classification, and we say the tree “votes” for that class. The forest chooses the classification having the most votes (over all the trees in the forest).
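The recipe above translates almost directly into code. Here is a hedged sketch that builds a toy forest out of scikit-learn decision trees: each tree gets its own bootstrap sample, considers a random subset of about sqrt(M) features at every split (max_features="sqrt"), is grown without pruning, and the forest predicts by majority vote. In practice you would simply use RandomForestClassifier, which does all of this for you.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def grow_forest(X, y, n_trees=100, seed=0):
    """Grow a toy random forest: bootstrap samples + random feature subsets per split."""
    rng = np.random.default_rng(seed)
    n_samples = len(X)
    trees = []
    for _ in range(n_trees):
        idx = rng.integers(0, n_samples, size=n_samples)    # sample N cases with replacement
        tree = DecisionTreeClassifier(max_features="sqrt")  # m ~ sqrt(M) features per split, no pruning
        tree.fit(X[idx], y[idx])
        trees.append(tree)
    return trees

def forest_predict(trees, X):
    """Each tree votes; the forest returns the majority class for every sample."""
    votes = np.stack([tree.predict(X) for tree in trees])   # shape: (n_trees, n_samples)
    return np.array([np.bincount(col.astype(int)).argmax() for col in votes.T])
```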

One of the great features of this approach is that it removes the need for a separate cross-validation set to estimate the generalization error: each tree is constructed from a different bootstrap sample of the original data, and about one-third of the cases are left out of the bootstrap sample used to build the kth tree (the "out-of-bag" cases).

The algorithm also gives you an idea about the importance of various features in the data-set. As this article mentions, “In every tree grown in the forest, put down the out-of-bag cases and count the number of votes cast for the correct class. Now randomly permute the values of variable m in the out-of-bag cases and put these cases down the tree. Subtract the number of votes for the correct class in the variable-m-permuted out-of-bag data from the number of votes for the correct class in the untouched out-of-bag data. The average of this number over all trees in the forest is the raw importance score for variable m.”
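scikit-learn ships a generic version of this idea as permutation_importance; it permutes each feature on a held-out set rather than on the out-of-bag cases Breiman describes, but the spirit is the same. A hedged sketch:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)  # stand-in data-set for illustration
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)
result = permutation_importance(forest, X_test, y_test, n_repeats=10, random_state=0)
for i, score in enumerate(result.importances_mean):
    print(f"feature {i}: accuracy drop when permuted = {score:.3f}")
```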

Do check out the Berkeley page to learn more about the algorithm's other strong points, such as:

  • Outlier Detection
  • Proximity Measure
  • Missing Value replacement for training and test sets
  • Scaling
  • Modelling quantitative outputs (regression), and more

But one thing is undisputed: random forest is among the most powerful classification algorithms out there, and off-the-shelf implementations are available for many typical problems.

As a last note, do check out Prof. Breiman's photo gallery to get a better idea of his life and his work. I could not help but feel motivated after going through it.


Stochastic Gradient Descent – for beginners

Warning: This article contains only one mathematical equation which can be understood even if you have only passed high school. No other mathematical formulas are present. Reader discretion is advised.

If you have ever taken a Machine Learning course, or even tried to read a bit about regression, you will inevitably come across a term called Gradient Descent. The name contains all the logic behind the algorithm: descend down a slope. Gradient Descent minimizes a function by determining the slope of the function and then taking a small step in the opposite direction of the slope, i.e. a step downhill. Over multiple iterations, we reach a valley.

The equation for the algorithm is:

θ = θ - η ∇J(θ)                                                                              equation (1)

Here ∇J(θ) is the gradient (the slope) of the function J(θ). We multiply it by a learning-rate parameter η, which determines how big a step we take, and then adjust our parameter θ in the opposite direction.


The image above should make it clearer.

Now, this gradient calculation and update is a resource-intensive step. By some estimates, if an objective function takes n operations to compute, its gradient takes about three times as many. We also have lots of data, and gradient descent has to go over it many times: the step has to be repeated for all the θs and all the rows of the data-set. All this requires a huge amount of computing power.

But we can cheat. Instead of computing the exact objective (loss) function, we compute an estimate of it, a very bad estimate. We compute the loss for a small random sample of the training data, compute the gradient only for that sample, and pretend that this gradient is the right direction to go.

So now each step is a very small step, and the price we pay is that it takes many more steps to reach the minimum, instead of a few larger ones.

However, computationally we win by a huge margin overall. This technique of using sampling for the gradient update is called Stochastic Gradient Descent (SGD). It scales well with both data size and model size, which is great, since we want both big data and big models.
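Here is a hedged numpy sketch of the idea for a plain linear model: each update computes the gradient on a small random mini-batch only, then applies equation (1) with that noisy estimate.

```python
import numpy as np

def sgd_linear(X, y, lr=0.01, epochs=10, batch_size=32, seed=0):
    """Mini-batch stochastic gradient descent for least-squares linear regression (toy sketch)."""
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    theta = np.zeros(n_features)
    for _ in range(epochs):
        order = rng.permutation(n_samples)
        for start in range(0, n_samples, batch_size):
            batch = order[start:start + batch_size]       # a small random sample of the data
            error = X[batch] @ theta - y[batch]
            grad = X[batch].T @ error / len(batch)        # gradient computed on the sample only
            theta -= lr * grad                            # theta = theta - eta * gradient
    return theta
```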

SGD is, however, a pretty bad optimizer in practice and comes with a lot of issues. I would suggest Sebastian Ruder's blog for more detailed explanations, variations and implementations.

Some tips that help Stochastic Gradient Descent: normalize inputs to zero mean and equal variances, and initialize the weights randomly with zero mean and equal variances.



#3 Linear Regression with Multiple Variables

In the previous post I talked about the simple case in which we were able to fit our data-set with the help of only one variable. But life is not so simple. In most real-life scenarios the data-set will have more than one feature, i.e. the output may depend on more than one variable. Taking the housing-prices example from the previous post, the price of a house may depend not only on its size but also on, say, its age, the number of rooms, and so on. In such cases the simple linear regression (h = t0 + t1*x) may not work.

Rather, we must formulate a new equation that takes care of all the parameters. It turns out this is also quite simple. Suppose you have a data-set with, say, 4 features. We can then model it using the equation:

h = t0 + t1*x1 + t2*x2 + t3*x3 + t4*x4                                                                        equation 1

where x1 is the value of feature 1 (say, the age of the house), x2 is the value of the second feature (say, the number of rooms), and so on.

So now our parameter vector is (n+1)-dimensional if there are n features in the training set. The cost function remains the same (J = (1/2m) * sum((h(i) - y(i))^2) over all points), with h calculated using equation 1. To optimize it we may use our loyal gradient descent algorithm with only a slight modification. Shown below are the equations taken directly from the lecture slides (by Prof. Andrew Ng of Stanford University).


What we are doing here is simply updating the values of all the parameters simultaneously.

Now, as you might guess, this process becomes computationally expensive if we have a large data-set with many features. In such cases there are usually two ways to speed it up: 1) feature scaling using mean normalization; 2) tuning the learning rate.

1) Feature scaling using mean normalization simply means that we replace each feature value (x1, x2, …) with (x(i) - mean(i)) / range(i), where x(i) is the ith feature, mean(i) is the mean of the ith feature column and range(i) is either the standard deviation or the max - min value of that column.

2) Tuning the learning rate means choosing a suitable value of alpha in the above equation. As stated in my last post, if alpha is too small the algorithm may take a long time to converge, while if alpha is too high the cost function may begin to diverge. So an optimal value of alpha must be chosen by trial and error. A small sketch combining both of these points follows.
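Here is the promised sketch: a small numpy implementation (my own toy version, not the course code) that mean-normalizes the features and then runs the simultaneous gradient-descent update with a chosen alpha.

```python
import numpy as np

def scale_features(X):
    """Mean normalization: (x - mean) / range for every feature column."""
    mean = X.mean(axis=0)
    spread = X.max(axis=0) - X.min(axis=0)   # the standard deviation also works here
    return (X - mean) / spread

def gradient_descent(X, y, alpha=0.1, iterations=500):
    """Batch gradient descent for linear regression with several features."""
    m = len(X)
    Xb = np.hstack([np.ones((m, 1)), X])     # prepend a column of ones for t0
    theta = np.zeros(Xb.shape[1])
    for _ in range(iterations):
        h = Xb @ theta                                # hypothesis for every training example
        theta -= (alpha / m) * (Xb.T @ (h - y))       # simultaneous update of all parameters
    return theta

# typical usage: theta = gradient_descent(scale_features(X), y, alpha=0.1)
```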


Now, gradient descent is not the only algorithm that can optimize our cost function J. As Prof. Andrew Ng mentioned, the Normal Equation is another great way to do it. With this method we neither have to choose alpha nor iterate over multiple steps: it gives the optimal parameters in one shot. However, it is not suitable when the number of features is very large, because it becomes very slow; in such cases gradient descent still works well. Since this method involves a lot of matrix notation, I will not discuss it in detail here for fear of my post becoming too long and complex. If it really interests you, please go through the Week 2 lecture videos 6 and 7. For most of our cases, gradient descent works just fine.
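For the curious, the closed form itself is tiny; here is a hedged numpy sketch (using the pseudo-inverse for numerical safety):

```python
import numpy as np

def normal_equation(X, y):
    """theta = (X^T X)^{-1} X^T y, with a column of ones prepended for t0."""
    Xb = np.hstack([np.ones((len(X), 1)), X])
    return np.linalg.pinv(Xb.T @ Xb) @ Xb.T @ y   # no alpha, no iterations
```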

In the next post I will talk about Logistic Regression and Classification.

Stay tuned….



#2 – Linear Regression with One Variable

In my last post I talked about the two types of machine learning algorithms: supervised and unsupervised. The linear regression model comes under the first category. A regression model is used to predict a real-valued output. So suppose you have a data-set with the prices of houses of different sizes, and you want to predict the price of a house of some particular size that is not present in the data-set.

Let's assume that the data-set looks like the one below, and you wish to predict the price of a house of size 1250 sq. ft (I am using the lecture example here).


The most logical and simple way to do so would be to draw a straight line, preferably through the middle of all the points on the graph. We could certainly do this by hand if there were only a few points, perhaps four or five, but for a larger number of data points this task becomes too difficult to solve by trial and error.

But thankfully, it turns out there is an algorithm that can do this efficiently for us: Gradient Descent. Before I go into its details, let me talk about another concept, the Cost Function, which will be used in the gradient descent discussion.

The Cost Function lets us figure out how to fit the best possible straight line to our data. To draw a straight line you need two parameters: the intercept on the y-axis and the slope of the line. So let's denote our straight line by

h = t0 + t1*x                                                         equation 1

where t0 is the y-intercept and t1 is the slope of the line (t stands for theta and h stands for hypothesis). Now we take each point on the x-axis and calculate the corresponding value of the hypothesis h using the above equation. These are the values we predict for some particular choice of t0 and t1, but they may not give us a line that correctly represents our data.

We solve this problem by using the concept of “Cost Function”. We denote our cost function by

J = (1/2m) * sum((h(i) - y(i))^2)  over all points            equation 2

where h(i) is the value of h for the ith x value, y(i) is the original y value corresponding to the ith x value, and m is the total number of points on the graph, i.e. the number of training examples.

This cost function tells us how good our line is at representing the data-set. If the points predicted by our line are very far away from the actual values, the cost function will be very high and we will have to vary t0 and t1 to get a new line. In the end we want to find the values of t0 and t1 that minimize the cost function. This is where gradient descent comes into the picture.

Gradient Descent is our algorithm for minimizing the cost function J.

The way it does this is to assume initial values of t0 and t1 and then keep changing them until we hopefully reach a minimum. The equation to calculate the subsequent values of t0 and t1 is,


The alpha in the above equation controls how big a step you take downhill. If alpha is too small, the algorithm will be very slow and take a long time to converge; if alpha is too large, the algorithm may overshoot the minimum and fail to converge, or may even diverge. Thus an optimal value of alpha must be selected by trial. Usually it becomes clear what the optimal value is after running the algorithm once.
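To make the update loop concrete, here is a toy sketch for the one-variable case (my own illustration, not the lecture code), written in plain Python:

```python
def gradient_descent_1var(x, y, alpha=0.01, iterations=1000):
    """Fit h = t0 + t1*x by repeatedly stepping downhill on the cost function J."""
    t0, t1 = 0.0, 0.0
    m = len(x)
    for _ in range(iterations):
        h = [t0 + t1 * xi for xi in x]                               # current predictions
        d0 = sum(hi - yi for hi, yi in zip(h, y)) / m                # dJ/dt0
        d1 = sum((hi - yi) * xi for hi, yi, xi in zip(h, y, x)) / m  # dJ/dt1
        t0, t1 = t0 - alpha * d0, t1 - alpha * d1                    # simultaneous update
    return t0, t1
```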

As an example (the one used in the lecture), this is the model when random values of t0 and t1 are chosen,


And this is the result after four iterations of varying t0 and t1. The circular lines denote points that have the same cost (contours).

After four iterations

And this is the final result. The red crosses in the right-hand figure show the path followed by gradient descent to reach the minimum.


In this way we have a final prediction model for our data-set, and we can now hopefully predict the correct price for any given house size.

This was Linear Regression with one variable. In my next post I will talk about Linear Regression with Multiple Variables.

Stay tuned….