Stochastic Gradient Descent – for beginners

Warning: This article contains only one mathematical equation which can be understood even if you have only passed high school. No other mathematical formulas are present. Reader discretion is advised.

If you have ever taken any Machine Learning course or even tried to read a bit about regression, it is inevitable that you will come across a term called Gradient Descent. The name has all the logic behind the algorithm, descend down a slope. Gradient Descent is a way to minimize any function by determining the slope of the function and then taking a small step in the opposite direction of the slope or going a step downhill. As we go through multiple iterations, we reach a valley.

The equation for the algorithm is:

θ = θ – η. ∇J(θ)                                                                              equation (1)

The ∇J(θ) finds the partial derivative or slope of the function J(θ) and then we multiply it with a learning rate parameter, η that determines how big a step we are going to take. We then adjust our parameter θ in the opposite direction of this.


The image above should make it clearer.

Now this gradient calculation and update is a resource intensive step. By some estimates, if an objective function takes n steps to compute, its gradient takes 3steps. We also have lots of data and our gradient descent has to go over it lots of time. This step has to be repeated for all the θs and all the rows of the data-set. All this requires a huge amount of computing power.

But we can cheat. Instead of computing the exact objective or loss function, we will compute an estimate of it,  a very bad estimate. We will compute the loss for some random sample of the training data, and then compute the gradient only for that sample and pretend that the derivative is the right direction to go.

So now, each step is a very small step, but the price we pay is a higher number of steps instead of one larger step to reach the minima.

However, computationally, we win by a huge margin overall. This technique of using sampling for gradient update is called Stochastic Gradient Descent. It scales well with both the data and the model size which is great since we want both big data and big model.

SGD is however a pretty bad optimizer and comes with a lot of issues in practice. I would suggest Sebastian Ruder’s blog  for more detailed explanations, variations and implementations.

Some tips to help Stochastic Gradient Descent: normalize inputs to zero mean and equal variances; use random weights with zero mean and equal variances as starting points.


Wordclouds with R! – as simple as it can get

Recently I started with a wonderful course titled “MITx-15.071X – The Analytics Edge” on edX. In my experience it is the best course for getting a quick hands on experience with the real world data science applications. If you have already done the course on Machine Learning by Stanford on Coursera, then I would say that its a great follow up course to learn and apply the algorithms on R by doing this course.

Now coming to the main point at hand – Wordcloud. Visualizations are a great way to present information in layman’s term to people who might not be too scientifically or mathematically oriented. Imagine you have to find the most important words in a text and present them. You could create a table of it, but it would be too dull and might not be too appealing to everyone. Wordclouds are a great way to overcome this issue. R provides an extremely simple way to create wordclouds with just 10 lines of code. So lets dive into it.

Step 1: Save your text in a simple notepad text file. For this post I will use an excerpt from the Military-Industrial Complex Speech by Dwight D. Eisenhower, in 1961, which can be found here:

Save the text in a simple .txt file and add an empty line at the end. The reason for this will become clear in the next step.

Step 2: Open the file in R using the command

speech = readLines(“Eisenhower.txt”)

If you had not added an empty line there would be a warning message saying that

incomplete final line found on 'Eisnehower.txt'

This is because readLines() requires an empty line at the end of the file to detect the end.

Step 3: Now we need to download and install 3 packages in R.




Then load these packages using:

library(tm) … and so on

Step 4: This is one of the most important steps in the process. We will use the text-mining package that we just loaded and use it to modify and clean out our text.

First we convert our text to a specific class of R which provides infrastructure for natural language text called Corpus.

eisen = Corpus(VectorSource(speech))

Then we remove all the whitespaces from the text.

eisen = tm_map(eisen, stripWhitespace)

Next we convert all the letters to their lowercase and remove all punctuations.

eisen = tm_map(eisen, tolower)

eisen = tm_map(eisen, removePunctuation)

A speech will contain many typical english words like “I”, “me”, “my”, “and”, “to”, etc. We don’t want these to clutter our cloud and so we must remove them. Fortunately for us R has a list of some typical english words that can be accessed using stopwords(“english”). We will use this directly.

eisen = tm_map(eisen, removeWords, stopwords(“english”))

Looking at the speech I decided to remove three more words using

eisen = tm_map(eisen, removeWords, c(“must”,”will”,”also”))

Next we convert our eisen variable into a plain test format which is necessary in the newer versions of the tm package.

eisen = tm_map(eisen, PlainTextDocument)

Now we will convert this to a nice table like format which will help us get all the words and their frequencies.

dtmEisen = DocumentTermMatrix(eisen)

eisenFinal =

You can see the count of various words in the table by using the colnames() and colSums() functions.

table(colnames(eisenFinal), colSums(eisenFinal))

Here the words are given in rows and their counts in the columns.

Now lets us plot this using a simple wordcloud.

wordcloud(colnames(eisenFinal), colSums(eisenFinal))

You will get a very basic wordcloud as such:


We can use the other parameters of the wordcloud function by looking at the doucumentation.


Lets use them

wordcloud(colnames(eisenFinal), colSums(eisenFinal),scale=c(4,.5),min.freq=1,max.words=Inf, random.order=FALSE, random.color=FALSE, rot.per=.5, colors=brewer.pal(12, "Paired"), ordered.colors=FALSE, fixed.asp=TRUE)

To find out what each of these parameters do, please refer to its documentation. Its extremely simple.

Our new plot looks something like this:


You can also type


to view the different color combinations to give to “colors” parameter and experiment with various combinations.

Well there you go. You can now create and publish exciting wordclouds within seconds using R.

Have fun!!!