Data Science

Becoming a better Data Scientist and Programmer

You have just learnt the basics of a programming language at school, at college or through an online course. You now know the basic components of a program, and you can solve some basic problems with small amounts of code.

But somehow, when you write code in a professional capacity, you find yourself constantly making changes to it and discussing it over long meetings. Maybe you are not as good a programmer as you thought you were?

I faced exactly the same kind of problem a couple of years back, when I started my career as a software developer working on large code bases and developing software that would run in production and impact thousands of systems. Fortunately for me, I had the support of extremely patient peers and colleagues who were kind enough to spend some of their valuable time guiding me. These were people with 15 to 20 years of experience writing programs that were efficient, easy to debug and easy to modify in the face of frequently changing requirements.

In this post I will list a few resources that were recommended by these seasoned programmers and explain why every programmer should have a look at them. Going through these resources changed the way I approach problems and made me realize how much knowledge there still is to gain.

Here are some of the recommended readings for anyone who wants to program for a living:

  1. The Pragmatic Programmer by Andrew Hunt and David Thomas
  2. Head First Design Patterns by Eric Freeman and Elisabeth Robson
  3. Structure and Interpretation of Computer Programs by Gerald Jay Sussman and Hal Abelson
  4. Introduction to Algorithms by Cormen, Leiserson, Rivest and Stein
  5. Modern C++ Programming with Test-Driven Development by Jeff Langr

If you are also developing analytics solutions, using machine learning in your work, and looking to get a better understanding of the algorithms you work with, then you should also have a look at these books:

  1. Machine Learning by Tom Mitchell – A good introduction to the basic concepts of Machine Learning. Best studied in parallel with Andrew Ng's Machine Learning course. Recommended for beginner to advanced level learners.
  2. An Introduction to Statistical Learning by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani – A good introduction for anyone entering at a beginner/junior level as a Data Analyst or a Data Scientist. Provides a really good introduction to the basic machine learning concepts along with their implementation in R. Recommended for beginner to intermediate level learners.
  3. The Elements of Statistical Learning by Trevor Hastie, Robert Tibshirani and Jerome H. Friedman – A really great book for understanding the concepts of machine learning and the mathematical and statistical properties behind them. Recommended for intermediate to advanced level learners.
  4. Deep Learning by Ian Goodfellow, Yoshua Bengio and Aaron Courville – A comprehensive collection of various aspects of deep learning. Includes an introduction to linear algebra and statistics, followed by current deep learning research and future work in the area. Can be read by beginner to advanced level learners.
  5. Data Science for Business by Foster Provost and Tom Fawcett – A great read for anyone at any level. It helps you really understand some aspects of the Data Science field which might not be very intuitive for a beginner or someone coming in from a purely computer science background.
  6. Modeling the Internet and the Web by Pierre Baldi, Paolo Frasconi and Padhraic Smyth

The second list of books is not light reading and requires a fair amount of devotion. I am still reading some of them after two years, going back to them multiple times for better understanding. However, they are all worth your time and will reward you over the next couple of years as you work on more complex problems and design more sophisticated systems.

Good luck on your journey to becoming a better programmer 🙂

 


Random Forest – The Evergreen Classifier

Disclaimer: Some of the terms used in this article may seem too advanced for an absolute novice in the fields of machine learning or statistics. I have tried to include supplementary resources as links which can be used for better understanding. All in all, I hope that this article motivates you to try solving a problem of your own with a random forest.

In the last few weeks I have been working on some classification problems involving multiple classes. My first approach, after cleaning the data-set and pre-processing it for categorical outputs, was to go with the simplest classification algorithm I knew – logistic regression. Logistic regression is a very simple classifier that uses the output of the sigmoid function to assign labels. It is very well suited to binary classification problems, in which there are only two possible outcomes, but it can also be tweaked to handle multiple classes using one-vs-one or one-vs-all approaches. Similar approaches work for Support Vector Machines as well. After a few hours of parameter tuning, the accuracy I got was around 88% on the training set and about 89% on my cross-validation set.

This was good, but as I researched more, I came across decision trees and their bootstrap-aggregated ("bagged") version, the random forest. A few minutes into the algorithm's documentation (written by Prof. Breiman, the person who coined the term bagging), I was amazed by its robustness and functionality. It is like an all-in-one algorithm for classification, regression, clustering and even filling in missing values in the data-set. No other machine learning algorithm has caught my attention as much. In this article I will try to explain how the algorithm works and the features that make it an evergreen algorithm.
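Before getting into random forests, here is what the one-vs-all baseline described above can look like in practice. This is only a minimal sketch using base R's glm() on the built-in iris data-set – a stand-in, not the data-set or the code from this post:

```r
# One-vs-all logistic regression sketch on iris (a stand-in data-set).
data(iris)
classes <- levels(iris$Species)

# Fit one binary (sigmoid-output) logistic regression per class: class k vs. the rest.
models <- lapply(classes, function(k) {
  df <- cbind(iris[, 1:4], y = as.integer(iris$Species == k))
  glm(y ~ ., data = df, family = binomial())
})

# Each model outputs a probability; predict the class with the highest one.
probs <- sapply(models, function(m) predict(m, iris[, 1:4], type = "response"))
pred  <- classes[max.col(probs)]
mean(pred == iris$Species)   # training accuracy of the baseline
```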

A random forest works by creating multiple classification trees. Each tree is grown as follows:

  1. If the number of cases in the training set is N, sample N cases at random – but with replacement – from the original data. This sample will be the training set for growing the tree.
  2. If there are M input variables, a number m << M is specified such that at each node, m variables are selected at random out of the M and the best split on these m is used to split the node. The value of m is held constant while the forest is grown.
  3. Each tree is grown to the largest extent possible. There is no pruning.

To classify a new object from an input vector, put the input vector down each of the trees in the forest. Each tree gives a classification, and we say the tree “votes” for that class. The forest chooses the classification having the most votes (over all the trees in the forest).
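To make this concrete, here is a minimal sketch using the randomForest R package (based on Breiman and Cutler's original code) on the built-in iris data-set; the parameter names below are that package's, not something prescribed by this post:

```r
# Growing a forest of classification trees and predicting by majority vote.
library(randomForest)
data(iris)

set.seed(42)
rf <- randomForest(Species ~ ., data = iris,
                   ntree = 500,  # number of trees, each grown on a bootstrap sample
                   mtry  = 2)    # m: variables tried at each split (m << M = 4 here)

# Each of the 500 trees votes; predict() returns the class with the most votes.
predict(rf, newdata = iris[c(1, 51, 101), ])
```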

One of the great features of this approach is that it eliminates the need for a separate cross-validation set, since each tree is constructed using a different bootstrap sample from the original data. About one-third of the cases are left out of the bootstrap sample and not used in the construction of the kth tree; these "out-of-bag" cases act as a built-in test set for that tree and provide an internal, unbiased estimate of the classification error.
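With the randomForest package, this out-of-bag (OOB) estimate is reported directly on the fitted object – a short illustration, again on iris rather than the data-set from this post:

```r
# Reading the out-of-bag error estimate from a fitted forest.
library(randomForest)
set.seed(42)
rf <- randomForest(Species ~ ., data = iris, ntree = 500, mtry = 2)

print(rf)                     # the summary includes the OOB error estimate and a confusion matrix
rf$err.rate[rf$ntree, "OOB"]  # OOB error rate after the final tree has been added
```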

The algorithm also gives you an idea about the importance of various features in the data-set. As this article mentions, “In every tree grown in the forest, put down the out-of-bag cases and count the number of votes cast for the correct class. Now randomly permute the values of variable m in the out-of-bag cases and put these cases down the tree. Subtract the number of votes for the correct class in the variable-m-permuted out-of-bag data from the number of votes for the correct class in the untouched out-of-bag data. The average of this number over all trees in the forest is the raw importance score for variable m.”
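The randomForest package computes this permutation-based measure when the forest is grown with importance = TRUE; the following is only a brief sketch on iris:

```r
# Permutation (out-of-bag) variable importance.
library(randomForest)
set.seed(42)
rf <- randomForest(Species ~ ., data = iris, ntree = 500, mtry = 2,
                   importance = TRUE)

# type = 1: mean decrease in accuracy when each variable is permuted in the
# out-of-bag cases, averaged over all trees (the raw importance score above).
importance(rf, type = 1)
varImpPlot(rf)   # quick visual comparison of the variables
```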

Do check out the Berkeley page to learn more about the algorithm's other great features (a brief sketch follows the list), such as:

  • Outlier Detection
  • Proximity Measure
  • Missing Value replacement for training and test sets
  • Scaling
  • Modelling quantitative outputs (regression), and more
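Some of these extras can be tried straight from the randomForest package; the snippet below is only a hypothetical illustration on iris, not an exhaustive demo:

```r
# Proximities, outlier measure, scaling plot and missing-value replacement.
library(randomForest)
set.seed(42)

# Proximity: the fraction of trees in which two cases land in the same terminal node.
rf   <- randomForest(Species ~ ., data = iris, proximity = TRUE)
prox <- rf$proximity

# Outlier measure derived from the proximities (large values suggest potential outliers).
out <- outlier(prox, iris$Species)
head(sort(out, decreasing = TRUE))

# Metric scaling of 1 - proximity, for a quick 2-D view of the data.
MDSplot(rf, iris$Species)

# Missing-value replacement: rfImpute() fills NAs using proximity-weighted values.
iris_na <- iris
iris_na[1, "Sepal.Length"] <- NA   # introduce an artificial missing value
iris_filled <- rfImpute(Species ~ ., data = iris_na)
```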

But one thing is undisputed: the random forest is among the most powerful classification algorithms out there, and off-the-shelf implementations are available for many typical problems.

As a last note, do check out Prof. Breiman's photo gallery to learn more about his life and his work. I could not help but feel motivated after going through his work.

Found a few amazing blogs for R and Data Science enthusiasts

Today I had a bit of free time, and since I had not opened my R console in a really long time, I decided to try a few scripts that could do something interesting. Going through my R-bloggers emails, I found quite a few interesting posts. I am putting them here so that I don't lose them. Hope you enjoy them too. 🙂