Stochastic Gradient Descent – for beginners

Warning: This article contains only one mathematical equation which can be understood even if you have only passed high school. No other mathematical formulas are present. Reader discretion is advised.

If you have ever taken any Machine Learning course or even tried to read a bit about regression, it is inevitable that you will come across a term called Gradient Descent. The name has all the logic behind the algorithm, descend down a slope. Gradient Descent is a way to minimize any function by determining the slope of the function and then taking a small step in the opposite direction of the slope or going a step downhill. As we go through multiple iterations, we reach a valley.

The equation for the algorithm is:

θ = θ − η · ∇J(θ)        (1)

The term ∇J(θ) is the gradient of the function J(θ), i.e. its partial derivatives (slopes) with respect to each parameter. We multiply it by a learning rate parameter, η, that determines how big a step we are going to take, and then adjust our parameter θ in the opposite direction.


The image above should make it clearer.
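As a minimal sketch of equation (1), assume a toy objective J(θ) = (θ − 3)², whose gradient is 2(θ − 3). The function name, the toy objective, the learning rate and the number of steps are all illustrative choices, not part of the original article.

```python
def gradient_descent(grad, theta, eta=0.1, steps=100):
    """Repeatedly apply theta = theta - eta * grad(theta)."""
    for _ in range(steps):
        theta = theta - eta * grad(theta)
    return theta


# Toy objective J(theta) = (theta - 3)^2, so grad J(theta) = 2 * (theta - 3).
theta_min = gradient_descent(lambda t: 2 * (t - 3), theta=0.0)
print(theta_min)  # ends up very close to 3, the bottom of the valley
```

With a learning rate that is too large the steps overshoot the valley, which is why η is usually kept small.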

Now this gradient calculation and update is a resource-intensive step. By some estimates, if an objective function takes n steps to compute, its gradient takes about 3n steps. We also have lots of data, and our gradient descent has to go over it many times. This step has to be repeated for all the θs and all the rows of the data-set. All this requires a huge amount of computing power.

But we can cheat. Instead of computing the exact objective or loss function, we will compute an estimate of it, a very bad estimate. We will compute the loss for some random sample of the training data, then compute the gradient only for that sample and pretend that this gradient points in the right direction to go.

So now each step is small and cheap, and the price we pay is that we need many more of these steps, instead of one larger step, to reach the minimum.

However, computationally, we win by a huge margin overall. This technique of using sampling for the gradient update is called Stochastic Gradient Descent. It scales well with both the data and the model size, which is great since we want both big data and big models.
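The sampling idea can be sketched in a few lines. This is only an illustration under stated assumptions: a one-variable linear model y = w·x + b with a squared-error loss, noise-free synthetic data, and a hypothetical function name, batch size and learning rate chosen for the example.

```python
import random


def sgd_linear_fit(xs, ys, eta=0.05, batch_size=8, steps=2000, seed=0):
    """Fit y = w * x + b using gradients estimated from small random samples."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    for _ in range(steps):
        # Estimate the gradient from a random sample instead of all rows.
        batch = rng.sample(range(len(xs)), batch_size)
        grad_w = grad_b = 0.0
        for i in batch:
            err = (w * xs[i] + b) - ys[i]
            grad_w += 2 * err * xs[i] / batch_size
            grad_b += 2 * err / batch_size
        # Same update as equation (1), applied to each parameter.
        w -= eta * grad_w
        b -= eta * grad_b
    return w, b


xs = [i / 100 for i in range(-100, 100)]  # inputs with roughly zero mean
ys = [2 * x + 1 for x in xs]              # synthetic data: true w = 2, b = 1
w, b = sgd_linear_fit(xs, ys)
print(w, b)
```

Each step looks at only 8 of the 200 rows, so it is far cheaper than a full pass over the data, at the cost of needing many such steps.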

SGD is, however, a pretty bad optimizer and comes with a lot of issues in practice. I would suggest Sebastian Ruder’s blog for more detailed explanations, variations and implementations.

Some tips to help Stochastic Gradient Descent: normalize inputs to zero mean and equal variances; use random weights with zero mean and equal variances as starting points.
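The first tip can be sketched as follows, assuming each input column is rescaled to zero mean and unit variance (one common way of getting "equal variances"). The helper name standardize is hypothetical.

```python
import statistics


def standardize(column):
    """Rescale one input column to zero mean and unit variance."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)
    if std == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in column]
    return [(value - mean) / std for value in column]


scaled = standardize([1, 2, 3, 4, 5])
print(scaled)
```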



Convert ‘csv’ format files to ‘libsvm’ data format

A few days ago I started doing some predictive analytics using Apache Spark’s MLlib. MLlib is a machine learning library that provides support for a large number of popular machine learning algorithms in Scala, Python and Java.

However, as is the case while running many ML programs, the input data format has to be different for different cases. I wanted to classify data into different categories and I decided to use MLlib’s Multilayer Perceptron Classifier, which is a classifier based on a feedforward artificial neural network. You can read more about it here.

This algorithm requires its input data to be in ‘libsvm’ format. The format looks something like this:

5 9:0.0127418 10:0.06200549 11:1 12:1 13:0.02847017 14:0.05982561
3 3:0.001177284 4:0.01679315 7:1 8:1 9:0.0233416 10:0.08687227 11:0.007628717 12:0.01832714 13:0.003491035 14:0.01856935
2 1:0.01250612 2:0.05098133 5:1 6:1 9:0.01482266 10:0.01268549 11:0.0142893 12:0.02920057 13:0.1376151 14:0.183461
5 5:0.001757722 6:0.01785289 7:0.002907001 8:0.01801159 9:0.01303587 10:0.07466476 11:1 12:1 13:0.02893818 14:0.0608585

The values are in the following format:

label col1:val1 col2:val2 ………. colN:valN

The label is simply the class or category of the record, and the space-separated index:value pairs give the non-zero values in the various columns of the data-set. So for the example data-set, the category label of the first record is ‘5’, and columns 9, 10, 11, 12, 13 and 14 have non-zero values, which are given after the colon (:).
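To make the layout concrete, here is a small sketch that reads one line of this format back into a label and a column-to-value mapping. The function name parse_libsvm_line is hypothetical, and the label is assumed to be numeric.

```python
def parse_libsvm_line(line):
    """Split 'label col:val col:val ...' into (label, {col: val})."""
    parts = line.split()
    label = float(parts[0])
    features = {}
    for item in parts[1:]:
        index, value = item.split(":")
        features[int(index)] = float(value)
    return label, features


label, features = parse_libsvm_line(
    "5 9:0.0127418 10:0.06200549 11:1 12:1 13:0.02847017 14:0.05982561"
)
print(label, sorted(features))
```

Any column that does not appear in the mapping is implicitly zero.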

We basically want to use a compressed row storage (CRS) format which puts the subsequent non-zeros of the matrix rows in contiguous memory locations.

Now for my case, I had a comma separated values (*.csv) file in the following format:

1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15
0, 0, 0, 0, 0, 0, 0, 0, 0.012741798, 0.06200549, 1, 1, 0.028470168, 0.05982561, 5
0, 0, 0.001177284, 0.016793154, 0, 0, 1, 1, 0.023341597, 0.086872275, 0.007628717, 0.018327144, 0.003491035, 0.018569352, 3
0.01250612, 0.050981331, 0, 0, 1, 1, 0, 0, 0.014822657, 0.012685495, 0.014289297, 0.029200574, 0.137615081, 0.183460986, 2
0, 0, 0, 0, 0.0017572, 0.017852892, 0.002907001, 0.018011585, 0.013035873, 0.074664762, 1, 1, 0.028938184, 0.060858526, 5

The first line has the column headers. The last value on each line, the 15th value, is the class label. The simplest way to convert this csv file to the libsvm format is to use two R packages – e1071 and SparseM.

The following code takes the csv file as input and converts it into a txt file in libsvm format:

# download the e1071 library
install.packages("e1071")

# download the SparseM library
install.packages("SparseM")

# load the libraries
library(e1071)
library(SparseM)

# load the csv dataset into memory
train <- read.csv('inputFilePath.csv')

# convert labels into numeric format
train$X15 <- as.numeric(train$X15)

# convert from data.frame to matrix format
x <- as.matrix(train[,1:14])

# put the labels in a separate vector
y <- train[,15]

# convert to compressed sparse row format
xs <- as.matrix.csr(x)

# write the output libsvm format file 
write.matrix.csr(xs, y=y, file="out.txt")

With these 10 lines we now have a text file in the desired libsvm format, ready to be loaded into Spark for further number crunching.
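For readers without R at hand, the same conversion can be sketched in plain Python. This is not the e1071/SparseM approach used above but a hand-rolled equivalent; it assumes a header row, the label in the last column, and a hypothetical function name csv_to_libsvm.

```python
import csv
import io


def csv_to_libsvm(csv_text):
    """Convert CSV text (header row, label in last column) to libsvm lines."""
    reader = csv.reader(io.StringIO(csv_text), skipinitialspace=True)
    next(reader)  # skip the column headers
    lines = []
    for row in reader:
        if not row:
            continue
        values = [value.strip() for value in row]
        label = values.pop()  # the last value is the class label
        # Keep only the non-zero columns, numbered from 1.
        pairs = [
            f"{index}:{value}"
            for index, value in enumerate(values, start=1)
            if float(value) != 0.0
        ]
        lines.append(" ".join([label] + pairs))
    return lines


sample = (
    "1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15\n"
    "0, 0, 0, 0, 0, 0, 0, 0, 0.012741798, 0.06200549, 1, 1, 0.028470168, 0.05982561, 5\n"
)
converted = csv_to_libsvm(sample)
print(converted[0])
# 5 9:0.012741798 10:0.06200549 11:1 12:1 13:0.028470168 14:0.05982561
```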

Hope it was helpful.

Stay tuned 🙂

Found a few amazing blogs for R and Data Science enthusiasts

Today I had a bit of free time and since I had not opened up my R console in a really long long time, I decided to try a few of the scripts that could do something interesting. Going through my R-bloggers mails, I found quite a few interesting posts. I thought of putting these here so that I don’t lose them. Hope you enjoy it too. 🙂

Wordclouds with R! – as simple as it can get

Recently I started a wonderful course titled “MITx-15.071X – The Analytics Edge” on edX. In my experience it is the best course for getting quick hands-on experience with real-world data science applications. If you have already done the Machine Learning course by Stanford on Coursera, then I would say that it’s a great follow-up course for learning to apply the algorithms in R.

Now coming to the main point at hand – the wordcloud. Visualizations are a great way to present information in layman’s terms to people who might not be too scientifically or mathematically oriented. Imagine you have to find the most important words in a text and present them. You could create a table of them, but it would be dull and might not appeal to everyone. Wordclouds are a great way to overcome this issue. R provides an extremely simple way to create wordclouds with just 10 lines of code. So let’s dive into it.

Step 1: Save your text in a simple notepad text file. For this post I will use an excerpt from the Military-Industrial Complex Speech by Dwight D. Eisenhower, in 1961, which can be found here:

Save the text in a simple .txt file and add an empty line at the end. The reason for this will become clear in the next step.

Step 2: Open the file in R using the command

speech = readLines("Eisenhower.txt")

If you had not added an empty line there would be a warning message saying that

incomplete final line found on 'Eisenhower.txt'

This is because readLines() expects the last line of the file to end with a newline character; if it does not, R still reads the line but warns that it is incomplete.

Step 3: Now we need to download and install 3 packages in R.




Then load these packages using:

library(tm) … and so on

Step 4: This is one of the most important steps in the process. We will use the text-mining package that we just loaded and use it to modify and clean out our text.

First we convert our text to a specific class of R which provides infrastructure for natural language text called Corpus.

eisen = Corpus(VectorSource(speech))

Then we strip the extra whitespace from the text, so that runs of spaces collapse into a single space.

eisen = tm_map(eisen, stripWhitespace)

Next we convert all the letters to lowercase and remove all punctuation.

eisen = tm_map(eisen, tolower)

eisen = tm_map(eisen, removePunctuation)

A speech will contain many typical English words like “I”, “me”, “my”, “and”, “to”, etc. We don’t want these to clutter our cloud and so we must remove them. Fortunately for us, R has a list of some typical English words that can be accessed using stopwords("english"). We will use this directly.

eisen = tm_map(eisen, removeWords, stopwords("english"))

Looking at the speech I decided to remove three more words using

eisen = tm_map(eisen, removeWords, c("must", "will", "also"))

Next we convert our eisen variable into a plain text format, which is necessary in the newer versions of the tm package.

eisen = tm_map(eisen, PlainTextDocument)

Now we will convert this to a nice table-like format which will help us get all the words and their frequencies.

dtmEisen = DocumentTermMatrix(eisen)

eisenFinal =

You can see the count of various words in the table by using the colnames() and colSums() functions.

table(colnames(eisenFinal), colSums(eisenFinal))

Here the words are given in rows and their counts in the columns.
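The cleaning and counting steps above are done with R’s tm package. Purely as an illustration of what they compute, here is a rough plain-Python analogue; the tiny stop-word set and the sample sentence are made up for the example and are not what stopwords("english") returns.

```python
import string
from collections import Counter

# Small illustrative stop-word set (an assumption, not tm's actual list).
STOPWORDS = {"i", "me", "my", "and", "to", "the", "of", "must", "will", "also"}


def word_counts(text):
    """Lowercase, drop punctuation and stop words, then count the rest."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return Counter(word for word in cleaned.split() if word not in STOPWORDS)


counts = word_counts("Peace and security. Peace, also, will endure; security must too!")
print(counts.most_common(2))  # [('peace', 2), ('security', 2)]
```

These word-to-frequency pairs are the kind of input a wordcloud needs: the more frequent a word, the larger it is drawn.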

Now let us plot this using a simple wordcloud.

wordcloud(colnames(eisenFinal), colSums(eisenFinal))

You will get a very basic wordcloud as such:


We can use the other parameters of the wordcloud function by looking at the documentation.


Let’s use them:

wordcloud(colnames(eisenFinal), colSums(eisenFinal),scale=c(4,.5),min.freq=1,max.words=Inf, random.order=FALSE, random.color=FALSE, rot.per=.5, colors=brewer.pal(12, "Paired"), ordered.colors=FALSE, fixed.asp=TRUE)

To find out what each of these parameters does, please refer to the documentation. It’s extremely simple.

Our new plot looks something like this:


You can also type


to view the different color combinations to give to “colors” parameter and experiment with various combinations.

Well there you go. You can now create and publish exciting wordclouds within seconds using R.

Have fun!!!

Add Horizontal Scroll Bar for IDLE

For the past week, I have been spending a lot of time scripting in Python, and one of the most difficult things that I found was going through long lines of code that extend beyond my screen width. I realized that the absence of a horizontal scroll bar was a big problem. Luckily I found a solution online for adding a horizontal scroll bar to IDLE by modifying the file located in the “….\Python34\Lib\idlelib” directory (check the directory where Python was installed).

To make the changes in IDLE, open that file and search for ‘vbar’, which is in the __init__ method of the EditorWindow class.
Add the lines that have ### appended to them and then restart IDLE.

self.vbar = vbar = Scrollbar(top, name='vbar')
self.hbar = hbar = Scrollbar(top, orient=HORIZONTAL, name='hbar')   ###

vbar['command'] = text.yview
vbar.pack(side=RIGHT, fill=Y)
hbar['command'] = text.xview ###
hbar.pack(side=BOTTOM, fill=X) ###

text['yscrollcommand'] = vbar.set
text['xscrollcommand'] = hbar.set ###


Creating and Reading XML files with Python

A few days ago my mentor gave me the task of reading an XML file and then creating a new XML file in a different format by using the data of the read XML file. Since I was given the full freedom of using any programming language, I decided to go with C as taking CS50x has now given me enough confidence to approach any programming dragon with my shining C sword and armor. However after a few hours, I realized that I was getting nowhere. So I decided to look into Python. It turns out that working with XML files is an extremely simple task using Python.

For my task I made use of ‘xml.etree’ and ‘xml.dom.minidom’.

I will now give you a step-by-step guide on how to create a beautiful (pretty-printed) XML file as shown below using Python, and also some tips on how to read an XML file.


Step 1: Importing the libraries.

import xml.dom.minidom
from xml.etree import ElementTree
from xml.etree.ElementTree import Element
from xml.etree.ElementTree import SubElement

Step 2: Creating the root element.

# <membership/>
membership = Element( 'membership' )

Step 3: Creating children for the root. I will create two children for the root in this example.

# <membership><users/>
users = SubElement( membership, 'users' )
# <membership><groups/>
groups = SubElement( membership, 'groups' )

Step 4: Creating nodes inside the children.

# <membership><users><user/>
SubElement( users, 'user', name='john' )
SubElement( users, 'user', name='charles' )
SubElement( users, 'user', name='peter' )
# <membership><groups><group/>
# (the first group must exist before users are added to it;
#  its name 'users' is an assumption)
group = SubElement( groups, 'group', name='users' )
# <membership><groups><group><user/>
SubElement( group, 'user', name='john' )
SubElement( group, 'user', name='charles' )
# <membership><groups><group/>
group = SubElement( groups, 'group', name='administrators' )
# <membership><groups><group><user/>
SubElement( group, 'user', name='peter' )

Step 5: Converting to string and then pretty-print.

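# serialize the element tree, then re-parse it with minidom to pretty-print
string = ElementTree.tostring(membership)
xmls = xml.dom.minidom.parseString(string)
pretty_xml_as_string = xmls.toprettyxml()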

Step 6: Writing to a file.

with open('membership.xml', 'w') as output_file:
    output_file.write(pretty_xml_as_string)

This will create a nice XML file for us. If you also want to add a text node, just use the following command:
group.text = "This is John"
on any element or sub-element (here the one assigned to group), and it will add a text node to the file like below.
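As a small self-contained sketch of the text-node step, the snippet below builds a tiny tree, sets .text on one element and serializes it. The element names reuse the ones from this post; the variable xml_text is just for illustration.

```python
from xml.etree import ElementTree
from xml.etree.ElementTree import Element, SubElement

groups = Element("groups")
group = SubElement(groups, "group", name="administrators")

# Setting .text adds a text node inside the element.
group.text = "This is John"

xml_text = ElementTree.tostring(groups, encoding="unicode")
print(xml_text)
# <groups><group name="administrators">This is John</group></groups>
```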


While reading an XML file we must proceed in a hierarchical manner.

from xml.etree import ElementTree
document = ElementTree.parse( 'membership.xml' )

document now holds an ElementTree object. It is not exactly a node in the XML structure, but it provides a handful of functions to consume the contents of the element hierarchy parsed from the file. Which one you choose is largely a matter of taste and is probably influenced by the task at hand. The following are examples:

users = document.find( 'users')

is equivalent to:

membership = document.getroot()
users = membership.find( 'users' )

Finding specific elements

XML is a hierarchical structure. Depending on what you do, you may want to enforce certain hierarchy of elements when consuming the contents of the file. For example, we know that the membership.xml file expects users to be defined like membership -> users -> user. You can quickly get all the user nodes by doing this:

for user in document.findall( 'users/user' ):
    print( user.attrib[ 'name' ] )

Likewise, you can quickly get all the groups by doing this:

for group in document.findall( 'groups/group' ):
    print( group.attrib[ 'name' ] )

Iterating elements

Even after finding specific elements or entry points in the hierarchy, you will normally need to iterate the children of a given node. This can be done like this:

for group in document.findall( 'groups/group' ):
    print( 'Group:', group.attrib[ 'name' ] )
    print( 'Users:' )
    # iterating an element yields its direct children
    for node in group:
        if node.tag == 'user':
            print( '-', node.attrib[ 'name' ] )

Other times, you may need to visit every single element in the hierarchy from any given starting point. There are two ways of doing it, one includes the starting element in the iteration, the other only its children. Subtle, but important difference, i.e.:

Iterate nodes including starting point:

users = document.find( 'users' )
# iter() walks the starting element and everything below it
for node in users.iter():
    print( node.tag, node.attrib, node.text, node.tail )

Produces this output:

users {} None None
user {'name': 'john'} None None
user {'name': 'charles'} None None
user {'name': 'peter'} None None

Iterate only the children:

users = document.find( 'users' )
# iterating the element itself yields only its direct children
for node in users:
    print( node.tag, node.attrib, node.text, node.tail )

Produces this output:

user {'name': 'john'} None None
user {'name': 'charles'} None None
user {'name': 'peter'} None None


I am now a forensic expert !

Okay, so the title is a bit of an exaggeration, but still the task that I managed to do today was something that I had never believed in my wildest dreams that I would be able to accomplish. I just learnt how to recover deleted files using just about a hundred lines of C code. Yes! You heard (or rather read) it right. All this was made possible by the amazing CS50x MOOC that I am taking on edX, an online learning platform.

CS50x is the edX version of the popular CS50 course of Harvard University, taught by Prof. David J. Malan, who is an exceptionally great teacher and one of the best instructors that I have ever had the pleasure of learning from. So as part of the 5th Problem Set (more on the first five later), we (who we? the people who are taking the course, duh!) were required to retrieve about 50 JPEG images, which had supposedly been deleted “by mistake”. To solve this problem, some basic knowledge of the JPEG file format, the FAT file system and how images are stored on a CompactFlash (CF) card is required.

So we all know (well at least most of us) that an image is nothing but a large number of pixel values that are stored in some specific order with a specific format. It turns out that the JPEGs too have patterns of bytes that distinguish them from other file formats. In fact, most JPEGs begin with one of two sequences of bytes. Specifically, the first four bytes of most JPEGs are either

0xff 0xd8 0xff 0xe0

or

0xff 0xd8 0xff 0xe1

from first byte to fourth byte, left to right. Odds are, if you find one of these patterns of bytes on a disk known to store photos, it marks the start of a JPEG.
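The signature check itself is tiny. The assignment was written in C; the following is only a Python sketch of the same test, with a hypothetical helper name starts_jpeg.

```python
# The two four-byte sequences that most JPEGs begin with.
JPEG_SIGNATURES = (
    bytes([0xFF, 0xD8, 0xFF, 0xE0]),
    bytes([0xFF, 0xD8, 0xFF, 0xE1]),
)


def starts_jpeg(block):
    """Return True if the block begins with one of the JPEG signatures."""
    return block[:4] in JPEG_SIGNATURES


print(starts_jpeg(b"\xff\xd8\xff\xe0" + b"\x00" * 508))  # True
print(starts_jpeg(b"\x00" * 512))                         # False
```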

Also it turns out that when you delete a file in a FAT file system, it is not actually erased from the disk. What actually happens is this. First, the system modifies the filename’s first character in the file’s directory entry to signal that the file has been deleted and that the directory entry can be recycled. Second, the system moves all of the file’s FAT clusters to the hard drive’s list of free clusters. Thus, as you might guess, we can often recover deleted files by scavenging through the file-system for some specific byte pattern.

You can read more about it from here :

Furthermore, digital cameras tend to store photographs contiguously on CF cards, whereby each photo is stored immediately after the previously taken photo. Accordingly, the start of a JPEG usually marks the end of another. However, digital cameras generally initialize CF cards with a FAT file system whose “block size” is 512 bytes (B). The implication is that these cameras only write to those cards in units of 512 B. A photo that’s 1 MB (i.e., 1,048,576 B) thus takes up 1048576 ÷ 512 = 2048 “blocks” on a CF card. But so does a photo that’s, say, one byte smaller (i.e., 1,048,575 B)! The wasted space on disk is called “slack space.” Forensic investigators often look at slack space for remnants of suspicious data.
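The block arithmetic above can be checked with a short sketch. The helper name blocks_and_slack is hypothetical; the 512-byte block size comes from the text.

```python
BLOCK_SIZE = 512  # bytes per block, as described above


def blocks_and_slack(size_bytes):
    """Return (blocks used, slack bytes) for a file of the given size."""
    blocks = -(-size_bytes // BLOCK_SIZE)  # ceiling division
    slack = blocks * BLOCK_SIZE - size_bytes
    return blocks, slack


print(blocks_and_slack(1048576))  # (2048, 0)
print(blocks_and_slack(1048575))  # (2048, 1)
```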

So now let’s come to our problem. According to the problem statement, we must read a *.raw format file, which is nothing but a “forensic image” of the card, and store its contents, byte after byte. We open the file using the fopen(…) function and then read through it using the fread(…) function. We read 512 bytes into the buffer at a time (because a block is stored as 512 bytes in the FAT file-system) and then check whether the first 4 bytes represent the start of a JPEG file. If they do, then we have found an image: we write the block’s data into a file using fwrite(…), and continue writing the following blocks into the same file until we encounter another block which starts with the JPEG-specific 4-byte pattern. Since our CF card stores images contiguously, as mentioned earlier, we need not worry about random files and garbage values coming in between two images. When we find another pattern that signifies a new JPEG image, we close our present file, create a new file with a new name, and start writing the data into it until we encounter another JPEG start pattern.

Continuing this way, we will reach the end of the file, at which point we must close our input *.raw file and our newly written *.jpeg file.
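The loop described above was written in C with fopen, fread and fwrite for the assignment. As a rough Python sketch of the same logic (not the original solution), the function below collects each recovered image as a bytes object instead of writing files; the function name recover_jpegs and the fake in-memory "card" are assumptions made for the example.

```python
import io

BLOCK_SIZE = 512
JPEG_SIGNATURES = (b"\xff\xd8\xff\xe0", b"\xff\xd8\xff\xe1")


def recover_jpegs(card):
    """Scan a binary file-like object block by block and split it into JPEGs."""
    images = []
    current = None  # bytes of the image being recovered, if any
    while True:
        block = card.read(BLOCK_SIZE)
        if not block:
            break  # reached the end of the card image
        if block[:4] in JPEG_SIGNATURES:
            # A new JPEG starts here, so the previous one (if any) is done.
            if current is not None:
                images.append(bytes(current))
            current = bytearray()
        if current is not None:
            current.extend(block)  # blocks before the first JPEG are skipped
    if current is not None:
        images.append(bytes(current))
    return images


# Fake card: one junk block, a two-block JPEG, then a one-block JPEG.
fake_card = io.BytesIO(
    b"\x00" * 512
    + b"\xff\xd8\xff\xe0" + b"a" * 508 + b"b" * 512
    + b"\xff\xd8\xff\xe1" + b"c" * 508
)
images = recover_jpegs(fake_card)
print([len(image) for image in images])  # [1024, 512]
```

To work on a real card image you would open the *.raw file in binary mode and write each recovered bytes object to its own .jpg file.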

If you do everything right (as I did :P), you will hopefully end up recovering your deleted images. Doesn’t it sound too easy? Well, it actually is. It took me less than two hours to go through the entire problem statement for the Forensics assignment and complete it. Also, it was the most interesting piece of code that I had ever written (I am not a Computer Science student, and no, I don’t code all day). And having done that, I am now going to insert my camera memory card into my computer, delete all the images and then hopefully recover them successfully. So have fun trying this yourself and, as they say, “This is CS50”.