Wordclouds with R! – as simple as it can get

Recently I started a wonderful course titled “MITx-15.071X – The Analytics Edge” on edX. In my experience it is the best course for getting quick hands-on experience with real-world data science applications. If you have already done Stanford's Machine Learning course on Coursera, then this is a great follow-up course for learning to apply the algorithms in R.

Now coming to the main point at hand – wordclouds. Visualizations are a great way to present information in layman's terms to people who might not be too scientifically or mathematically oriented. Imagine you have to find the most important words in a text and present them. You could create a table, but that would be dull and might not appeal to everyone. Wordclouds are a great way to overcome this issue, and R provides an extremely simple way to create them in about 10 lines of code. So let's dive into it.

Step 1: Save your text in a simple notepad text file. For this post I will use an excerpt from Dwight D. Eisenhower's 1961 Military-Industrial Complex speech, which can be found here: http://coursesa.matrix.msu.edu/~hst306/documents/indust.html

Save the text in a simple .txt file and add an empty line at the end. The reason for this will become clear in the next step.

Step 2: Open the file in R using the command

speech = readLines("Eisenhower.txt")

If you had not added the empty line, you would get a warning message saying:

incomplete final line found on 'Eisenhower.txt'

This is because readLines() expects the file to end with a newline character; adding the empty line guarantees one is there.

Step 3: Now we need to download and install 3 packages in R.

install.packages("tm")

install.packages("RColorBrewer")

install.packages("wordcloud")

Then load these packages using:

library(tm)

library(RColorBrewer)

library(wordcloud)

Step 4: This is one of the most important steps in the process. We will use the text-mining package (tm) that we just loaded to modify and clean up our text.

First we convert our text to a Corpus, an R class that provides infrastructure for natural-language text.

eisen = Corpus(VectorSource(speech))

Then we strip the extra whitespace from the text (stripWhitespace collapses runs of whitespace into single spaces).

eisen = tm_map(eisen, stripWhitespace)

Next we convert all the letters to lowercase and remove all punctuation. (Note: in newer versions of tm you may need to wrap base functions like tolower, i.e. tm_map(eisen, content_transformer(tolower)).)

eisen = tm_map(eisen, tolower)

eisen = tm_map(eisen, removePunctuation)

A speech will contain many common English words like “I”, “me”, “my”, “and”, “to”, etc. We don't want these to clutter our cloud, so we must remove them. Fortunately for us, the tm package provides a list of such common English stopwords that can be accessed using stopwords("english"). We will use this directly.

eisen = tm_map(eisen, removeWords, stopwords("english"))

Looking at the speech, I decided to remove three more words using

eisen = tm_map(eisen, removeWords, c("must", "will", "also"))

Next we convert our eisen variable into a plain text document, which is necessary in newer versions of the tm package.

eisen = tm_map(eisen, PlainTextDocument)

Now we will convert this to a nice table-like format which will give us all the words and their frequencies.

dtmEisen = DocumentTermMatrix(eisen)

eisenFinal = as.data.frame(as.matrix(dtmEisen))

You can see the count of various words in the table by using the colnames() and colSums() functions.

table(colnames(eisenFinal), colSums(eisenFinal))

Here the words are given in rows and their counts in the columns.

Now let's plot this as a simple wordcloud.

wordcloud(colnames(eisenFinal), colSums(eisenFinal))

You will get a very basic wordcloud like this:

[Image: basic wordcloud]

We can use the other parameters of the wordcloud function by looking at the documentation.

?wordcloud

Let's use them:

wordcloud(colnames(eisenFinal), colSums(eisenFinal), scale=c(4, .5), min.freq=1, max.words=Inf, random.order=FALSE, random.color=FALSE, rot.per=.5, colors=brewer.pal(12, "Paired"), ordered.colors=FALSE, fixed.asp=TRUE)

To find out what each of these parameters does, please refer to the documentation. It's extremely simple.

Our new plot looks something like this:

[Image: wordcloud with custom colors and parameters]

You can also type

display.brewer.all()

to view the different color palettes you can pass to the "colors" parameter and experiment with various combinations.

Well there you go. You can now create and publish exciting wordclouds within seconds using R.

Have fun!!!


Add Horizontal Scroll Bar for IDLE

Over the last week I have been spending a lot of time scripting in Python, and one of the most difficult things I found was going through long lines of code that extend beyond my screen width. I realized that the absence of a horizontal scroll bar was a big problem. Luckily I found a solution online for adding a horizontal scroll bar to IDLE by modifying the EditorWindow.py file located in the “….\Python34\Lib\idlelib” directory (check the directory where Python was installed).

To make the changes in IDLE, open EditorWindow.py and search for ‘vbar’, which appears in the EditorWindow class's __init__ method.
Add the lines marked with ### below and then restart IDLE.

self.vbar = vbar = Scrollbar(top, name='vbar')
self.hbar = hbar = Scrollbar(top, orient=HORIZONTAL, name='hbar')   ###

vbar['command'] = text.yview
vbar.pack(side=RIGHT, fill=Y)
hbar['command'] = text.xview   ###
hbar.pack(side=BOTTOM, fill=X)   ###

text['yscrollcommand'] = vbar.set
text['xscrollcommand'] = hbar.set   ###

VOILA!!!

Creating and Reading XML files with Python

A few days ago my mentor gave me the task of reading an XML file and then creating a new XML file in a different format from the data of the first. Since I was given full freedom to use any programming language, I decided to go with C, as taking CS50x has now given me enough confidence to approach any programming dragon with my shining C sword and armor. However, after a few hours I realized I was getting nowhere, so I decided to look into Python. It turns out that working with XML files is an extremely simple task in Python.

For my task I made use of the xml.etree and xml.dom.minidom modules.

I will now give you a step-by-step guide on how to create a beautiful (pretty-printed) XML file as shown below using Python, and also some tips on how to read an XML file.

[Image: pretty-printed XML output]

Step 1: Importing the libraries.

import xml.dom.minidom
from xml.etree import ElementTree
from xml.etree.ElementTree import Element
from xml.etree.ElementTree import SubElement

Step 2: Creating the root element.

# <membership/>
membership = Element( 'membership' )

Step 3: Creating children for the root. I will create two children for the root in this example.

# <membership><users/>
users = SubElement( membership, 'users' )
# <membership><groups/>
groups = SubElement( membership, 'groups' )

Step 4: Creating nodes inside the children.

# <membership><users><user/>
SubElement( users, 'user', name='john' )
SubElement( users, 'user', name='charles' )
SubElement( users, 'user', name='peter' )
# <membership><groups><group/>
group = SubElement( groups, 'group', name='administrators' )
# <membership><groups><group><user/>
SubElement( group, 'user', name='john' )
SubElement( group, 'user', name='charles' )
SubElement( group, 'user', name='peter' )

Step 5: Converting to string and then pretty-print.

string = ElementTree.tostring( membership )
xmls = xml.dom.minidom.parseString( string )
pretty_xml_as_string = xmls.toprettyxml()

Step 6: Writing to a file.

output_file = open( 'membership.xml', 'w' )
output_file.write( pretty_xml_as_string )
output_file.close()

This will create a nice XML file for us. If you also want to add a text node, just use a command like

group.text = "This is John"

after creating any element or sub-element, and it will add a text node to the file like below.

[Image: XML output with a text node]

While reading an XML file we must proceed in a hierarchical manner.

from xml.etree import ElementTree
document = ElementTree.parse( 'membership.xml' )

document now holds an object that is not exactly a node in the XML structure, but it provides a handful of functions to consume the contents of the element hierarchy parsed from the file. Which way you choose is largely a matter of taste and probably influenced by the task at hand. For example:

users = document.find( 'users' )

is equivalent to:

membership = document.getroot()
users = membership.find( 'users' )


Finding specific elements

XML is a hierarchical structure. Depending on what you do, you may want to enforce a certain hierarchy of elements when consuming the contents of the file. For example, we know that the membership.xml file expects users to be defined like membership -> users -> user. You can quickly get all the user nodes by doing this:

for user in document.findall( 'users/user' ):
    print( user.attrib[ 'name' ] )

Likewise, you can quickly get all the groups by doing this:

for group in document.findall( 'groups/group' ):
    print( group.attrib[ 'name' ] )

Iterating elements

Even after finding specific elements or entry points in the hierarchy, you will normally need to iterate the children of a given node. This can be done like this:

for group in document.findall( 'groups/group' ):
    print( 'Group:', group.attrib[ 'name' ] )
    print( 'Users:' )
    for node in group.getchildren():
        if node.tag == 'user':
            print( '-', node.attrib[ 'name' ] )

Other times, you may need to visit every single element in the hierarchy from any given starting point. There are two ways of doing it, one includes the starting element in the iteration, the other only its children. Subtle, but important difference, i.e.:

Iterate nodes including starting point:

users = document.find( 'users' )
for node in users.getiterator():
    print( node.tag, node.attrib, node.text, node.tail )

Produces this output:

users {} None None
user {'name': 'john'} None None
user {'name': 'charles'} None None
user {'name': 'peter'} None None

Iterate only the children:

users = document.find( 'users' )
for node in users.getchildren():
    print( node.tag, node.attrib, node.text, node.tail )

Produces this output:

user {'name': 'john'} None None
user {'name': 'charles'} None None
user {'name': 'peter'} None None


I am now a forensic expert!

Okay, so the title is a bit of an exaggeration, but still, the task I managed to do today was something I had never believed, in my wildest dreams, I would be able to accomplish. I just learnt how to recover deleted files using about a hundred lines of C code. Yes! You heard (or rather read) that right. All this was made possible by the amazing CS50x MOOC that I am taking on edX, an online learning platform.

CS50x is the edX version of Harvard University's popular CS50 course, taught by Prof. David J. Malan, who is an exceptionally great teacher and one of the best instructors I have ever had the pleasure of learning from. So as part of the 5th problem set (more on the first five later), we (who's we? the people taking the course, duh!) were required to retrieve about 50 JPEG images which had supposedly been deleted “by mistake”. Solving this problem requires some basic knowledge of the JPEG file format, the FAT file system, and how images are stored on a CompactFlash (CF) card.

So we all know (well, at least most of us) that an image is nothing but a large number of pixel values stored in a specific order with a specific format. It turns out that JPEGs too have patterns of bytes that distinguish them from other file formats. In fact, most JPEGs begin with one of two sequences of bytes. Specifically, the first four bytes of most JPEGs are either

0xff 0xd8 0xff 0xe0

or

0xff 0xd8 0xff 0xe1

reading from the first byte to the fourth, left to right. Odds are, if you find one of these byte patterns on a disk known to store photos, it marks the start of a JPEG.

Also, it turns out that when you delete a file on a FAT file system, it is not actually erased from the disk. What actually happens is that, first, the system modifies the filename's first character in the file's directory entry to signal that the file has been deleted and that the directory entry can be recycled; second, it moves all of the file's FAT clusters to the drive's list of free clusters. Thus, as you might guess, we can easily recover deleted files by scavenging through the file system for some specific byte pattern.

You can read more about it from here : http://d2o9nyf4hwsci4.cloudfront.net/2013/fall/psets/5/garfinkel.pdf

Furthermore, digital cameras tend to store photographs contiguously on CF cards, whereby each photo is stored immediately after the previously taken photo. Accordingly, the start of a JPEG usually demarks the end of another. However, digital cameras generally initialize CF cards with a FAT file system whose “block size” is 512 bytes (B). The implication is that these cameras only write to those cards in units of 512 B. A photo that’s 1 MB (i.e., 1,048,576 B) thus takes up 1048576 ÷ 512 = 2048 “blocks” on a CF card. But so does a photo that’s, say, one byte smaller (i.e., 1,048,575 B)! The wasted space on disk is called “slack space.” Forensic investigators often look at slack space for remnants of suspicious data.

So now let's come to our problem. According to the problem statement, we must read a *.raw file, which is nothing but a “forensic image” of the card, and go through its contents byte after byte. We open the file using the fopen(…) function and read through it using the fread(…) function, pulling 512 bytes into a buffer at a time (because a block is stored as 512 bytes on a FAT file system) and checking whether the first 4 bytes mark the start of a JPEG. If they do, we have found an image, and we write the block's data into a file using fwrite(…), continuing to write subsequent blocks into the same file until we hit another block that starts with the JPEG-specific 4-byte pattern. Since our CF card stores images contiguously, as mentioned earlier, we need not worry about random files and garbage values coming in between two blocks. When we find another pattern signifying a new JPEG image, we close the current file, create a new file with a new name, and start writing the data into it until we encounter the next JPEG-specific byte pattern.

Continuing this way, we will reach the end of the file, at which point we must close our input *.raw file and the last *.jpeg file we were writing.
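
The assignment itself is meant to be solved in C, but just to make the logic concrete, here is a rough Python sketch of the same block-scanning idea (the file name card.raw and the 000.jpg output naming scheme are my own assumptions for illustration, not part of the problem set):

# Rough Python sketch of the recovery logic described above (the actual
# assignment is solved in C). "card.raw" and the numbered output names
# are assumptions for illustration.
BLOCK_SIZE = 512                       # FAT block size in bytes

count = 0                              # JPEGs recovered so far
out = None                             # photo file currently being written

with open("card.raw", "rb") as card:
    while True:
        block = card.read(BLOCK_SIZE)
        if not block:                  # reached the end of the forensic image
            break
        # Check for the two JPEG signatures: ff d8 ff e0 / ff d8 ff e1
        if len(block) >= 4 and block[:3] == b"\xff\xd8\xff" and block[3] in (0xe0, 0xe1):
            if out is not None:        # a new header ends the previous photo
                out.close()
            out = open("%03d.jpg" % count, "wb")
            count += 1
        if out is not None:            # keep appending blocks to the open photo
            out.write(block)

if out is not None:
    out.close()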

If you do everything right (as I did :P), you will hopefully end up recovering your deleted images. Doesn't it sound too easy? Well, it actually is. It took me less than two hours to go through the entire problem statement of the forensics assignment and complete it. It was also the most interesting piece of code I have ever written (I am not a Computer Science student, and no, I don't code all day). And having done that, I am now going to insert my camera's memory card into my computer, delete all the images, and then hopefully recover them successfully. So have fun trying this yourself and, as they say, “This is CS50”.


#3 Linear Regression with Multiple Variables

In the previous post I talked about the simple case in which we were able to fit our data-set with the help of only one variable. But life is not so simple: in most real-life scenarios the data-set will have more than one feature, i.e. the output may depend on more than one variable. Taking the housing-prices example from the previous post, the price of a house may depend not only on its size but also on, say, the age of the house, the number of rooms, etc. In such cases the simple linear regression (h = t0 + t1*x) may not work.

Rather, we must formulate a new equation that takes all the parameters into account. It turns out this is also quite simple. Suppose you have a data-set with, say, 4 features. We can then model it using the equation:

h = t0 + t1*x1 + t2*x2 + t3*x3 + t4*x4                                                      equation 1

where x1 is the value of feature 1 (say, the age of the house), x2 is the value of the second feature (say, the number of rooms in the house) and so on.

So now our parameter vector (the t's) is an (n+1)-dimensional vector if there are n features in the training set. The cost function remains the same (J = 1/(2m) * sum((h(i) – y(i))^2) over all points), with h calculated using equation 1. To optimize it we may use our loyal gradient descent algorithm with only a slight modification. Shown below are the equations taken directly from the lecture slides (by Prof. Andrew Ng of Stanford University).

[Image: gradient descent update rule for multiple variables, from the lecture slides]

What we are simply doing in this case is updating the values of all the parameters simultaneously.
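
To make this concrete, here is a minimal NumPy sketch of the loop (my own illustration, not course code); X is the m x (n+1) matrix of training examples with a leading column of ones, y is the vector of outputs and t is the parameter vector:

import numpy as np

def gradient_descent(X, y, alpha=0.01, iterations=1500):
    # X: m x (n+1) design matrix with a leading column of ones
    # y: length-m vector of observed outputs
    m = len(y)
    t = np.zeros(X.shape[1])                 # start with all parameters at zero
    for _ in range(iterations):
        h = X @ t                            # hypothesis for every example (equation 1)
        t = t - alpha * (X.T @ (h - y)) / m  # simultaneous update of all the t's
    return t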

Now, as you might guess, this process becomes computationally a little expensive if we have a large data-set with a lot of features. In such cases there are usually two ways to speed up the process: 1) feature scaling using mean normalization; 2) varying the learning rate.

1) Feature scaling using mean normalization simply means that we replace each feature (x1, x2, …) with (x(i) – mean(i))/range, where x(i) is the ith feature, mean(i) is the mean of the ith feature column and range is the standard deviation or the max – min value of that column (see the sketch after these two points).

2) Varying the learning rate means tuning the value of alpha in the above equation. As stated in my last post, if alpha is too small the algorithm may take a lot of time to converge, while if alpha is too high the cost function may begin to diverge. So an optimal value of alpha must be chosen by trial and error.
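
As promised, here is a minimal sketch of mean normalization (my own illustration; it assumes X holds only the raw feature columns, with the column of ones added afterwards):

import numpy as np

def normalize(X):
    # Replace each feature column with (x - mean) / range; the standard
    # deviation is used as the range here (max - min also works)
    mean = X.mean(axis=0)
    rng = X.std(axis=0)
    return (X - mean) / rng, mean, rng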


Now gradient descent is not the only algorithm that can help us optimize our cost function (J). As Prof. Andrew mentioned, the Normal Equation method is another great way to do it. With this method we neither have to choose alpha nor iterate over multiple steps: it gives the optimum point in one step. However, it is not suitable when the number of features is very large, because it can be very slow; in such cases gradient descent works well even with a large number of features. Since this method involves a lot of matrix notation, I will not discuss it here for fear of my post becoming too long and complex. But if it really interests you, please go through the Week 2 lecture videos 6 and 7. For most of our cases, gradient descent works just fine.
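
For reference, the closed-form solution fits in one line (again just a sketch of my own; the pseudo-inverse guards against a singular X^T X):

import numpy as np

def normal_equation(X, y):
    # t = (X^T X)^(-1) X^T y, computed in a single step with no alpha
    return np.linalg.pinv(X.T @ X) @ X.T @ y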

In the next post I will talk about Logistic Regression and Classification.

Stay tuned….


#2 – Linear Regression with One Variable

In my last post I talked about two types of machine learning algorithms – supervised and unsupervised. The linear regression model comes under the first category. A regression model is basically used to predict real-valued output. So suppose you have a data-set with the prices of houses of different sizes, and you want to predict the price of a house of some particular size not present in the data-set.

Let's assume that the data-set looks like the one below, and you wish to predict the price of a house of size 1250 sq. ft (I am using the lecture example here).

[Image: housing prices vs. house size, from the lecture]

The most logical and simple way to do so would be to draw a straight line, preferably through the middle of all the points on the graph. We could certainly do this by hand if there were only a few points, perhaps four or five. But for a larger number of data points, the task becomes too difficult to solve by trial and error.

But thankfully it turns out that there is an algorithm that can do this efficiently for us: the “Gradient Descent” algorithm. Before I go into its details, let me talk about another concept, the “Cost Function”, which will be used in the gradient descent discussion.

The cost function lets us figure out how to fit the best possible straight line to our data. To draw a straight line you basically need two parameters – the intercept on the y-axis and the slope of the line. So let's denote our straight line by

h = t0 + t1*x                                                         equation 1

where t0 is the y-intercept and t1 is the slope of the line (t stands for theta and h stands for hypothesis). Now we take each point on the x-axis and calculate the corresponding value of the hypothesis ‘h’ using the above equation. These are the values we predict for some particular choice of t0 and t1. But they may not give us a line that correctly represents our data.

We solve this problem by using the concept of “Cost Function”. We denote our cost function by

J = 1/(2m) * sum((h(i) – y(i))^2)  over all points            equation 2

where h(i) is the value of h for the ith x point and y(i) is the actual y value corresponding to the ith x value. m is the total number of points on the graph, i.e. the number of training examples.
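
In code, equation 2 is just a couple of lines. A quick sketch (my own illustration, with x and y as NumPy arrays of the training points):

import numpy as np

def cost(t0, t1, x, y):
    # J = 1/(2m) * sum((h(i) - y(i))^2), with h = t0 + t1*x (equation 1)
    m = len(x)
    h = t0 + t1 * x                   # predicted value for every x
    return np.sum((h - y) ** 2) / (2 * m)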

This cost function basically tells us how good our line is at representing the data-set. If the points predicted by our line are very far from the actual data-set values, the cost function will be very high, and we will have to vary the values of t0 and t1 to get a new line. In the end we want to find the values of t0 and t1 that minimize our cost function. This is where gradient descent comes into the picture.

Gradient Descent is our algorithm for minimizing the cost function J.

The way it does this is that it assumes starting values of t0 and t1 and then keeps changing them until we hopefully reach a minimum. The equation to calculate the subsequent values of t0 and t1 is,

[Image: gradient descent update equations, from the lecture slides]

The alpha in the above equation basically controls how big a step you take downhill. If alpha is too small, the algorithm will be very slow and take a long time to converge; if alpha is too large, the algorithm may overshoot the minimum and fail to converge, or may even diverge. Thus an optimal value of alpha must be selected by trial. Usually it becomes clear what the optimal value is after running the algorithm once.
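
To see what this looks like in practice, here is a sketch of a single update step (again my own illustration, not lecture code). Note that both parameters are computed from the old values and updated together:

import numpy as np

def gradient_descent_step(t0, t1, x, y, alpha):
    # Partial derivatives of J (equation 2) with respect to t0 and t1
    m = len(x)
    h = t0 + t1 * x                   # current predictions
    d0 = np.sum(h - y) / m
    d1 = np.sum((h - y) * x) / m
    # Simultaneous update: both use the same old t0 and t1
    return t0 - alpha * d0, t1 - alpha * d1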

As an example (the one used in the lecture), this is the model when random values of t0 and t1 are chosen,

[Image: initial fit with randomly chosen parameters]

And this is the state after four iterations of varying t0 and t1. The circular contour lines denote points that have the same cost.

[Image: fit after four iterations, with cost-function contours]

And this is the final result. The red crosses in the right-hand figure show the path followed by gradient descent to reach the minimum.

[Image: final fit and the path taken by gradient descent]

In this way we end up with a final prediction model for our data-set, and we can now hopefully predict the correct price for any given house size.

This was Linear Regression with one variable. In my next post I will talk about Linear Regression with Multiple Variables.

Stay tuned….


Create a simple program using NI LabVIEW

The first program my mentor asked me to develop here at NI was a simple calculator that could perform basic mathematical operations. This was basically done to help the new interns get an idea of the LabVIEW environment and its various features. Developing programs in LV is quite different from the traditional method of writing lines of code and compiling them. LV uses an interactive graphical approach to building complex software: you basically drag and drop blocks that perform specific functions and connect them together with wires to get the desired output.

Now, there may be a lot of tutorials and VIs (programs in LV are saved with a *.vi extension and are known as Virtual Instruments) on the internet that will let you do this. But I am sure the calculator I am going to show you is different from them; at least I was not able to find any similar program on the internet.

So lets begin!

The first thing I did was create a while loop with a nested event structure. All the button clicks and the inputs the user enters are handled by this event structure. I went for a simple, minimalist interface with the following components on the screen:

  1. A numeric control that is used for taking input from the user.
  2. A numeric indicator that displays the result.
  3. A string indicator that displays the sequence of operations.
  4. Seven buttons for performing basic mathematical operations.
  5. An Enter button to perform calculations.
  6. A Refresh button to clear the previous operations.
  7. An Exit button that will end the program.

To perform the operations I used a case structure. All the operation buttons were connected to a ‘Build Array’ function, which was further connected to a ‘Boolean Array to Number’ function, which finally connected to the case selector. In this way I could control the case structure with the Boolean operation buttons. Below is an image which shows what I mean.

[Image: case structure with the Boolean Array to Number selector wiring]

To implement the Refresh button I simply used a Select operation which took the output of the Refresh button as the selection criterion (s). Its implementation is shown below.

[Image: Refresh button implementation using a Select operation]

This means that whenever I pressed the Refresh button, zeros were taken as input and the output window would show 0. I used a similar approach for displaying the sequence of operations. The only difference was that, since it was a string indicator, it had to select between the space character and the output of a Concatenate Strings function, which kept appending each successive operation to the present string. I used shift registers to keep track of the previous values.

I also used a shift register to feed my result back into the case structure as the second operand. The bottom-most yellow wire in the last image actually comes from a shift register on the left. Here is an image to bring some clarity:

[Image: shift register feeding the result back as the second operand]

This is pretty much all there is to the operations part.

Next I decided to add a nice touch to my calculator with a custom wallpaper. To do this, right-click on the front panel scroll bar and select the Properties option; then, in the Background tab, you can select a custom image as your wallpaper.


This got done quite fast, so to explore even more features I decided to create a standalone application that could run without first starting LV on your system. This part requires that you have the LV Run-Time Engine installed.

To do this, select the ‘Build Application (EXE) from VI’ option under the Tools menu. I could go on telling you step by step how to do this, but luckily I found a great video on YouTube that covers it. Here is the link to it:

By the end of this video you can easily create a working standalone application.

Here is the final image of how my calculator turned out:

[Image: final calculator front panel]


And here is the block diagram:

[Image: calculator block diagram]

I also made use of some custom buttons for my Exit button which I downloaded from ni.com (https://decibel.ni.com/content/docs/DOC-16709). You can look around on the internet for some more options and features.

You can access and download my program from https://decibel.ni.com/content/docs/DOC-38860

Stay tuned for more…