R Programming

Convert ‘csv’ format files to ‘libsvm’ data format

A few days ago I started doing some predictive analytic using Apache Spark’s MLlib. The MLlib is a machine learning library and provides support for a large number of popular machine learning algorithms in Scala, Python and Java.

However, as is the case while running many ML programs, the input data format has to be different for different cases. I wanted to do a classification of data into different categories and I decided to use the MLlib’s Multilayer Perceptron Classifier which is a classifier based on feedforward artificial neural network. You can read more about it here.

The input data format to run analysis using this algorithm required data to be in ‘libsvm’ format. The format looks something like this:

5 9:0.0127418 10:0.06200549 11:1 12:1 13:0.02847017 14:0.05982561
3 3:0.001177284 4:0.01679315 7:1 8:1 9:0.0233416 10:0.08687227 11:0.007628717 12:0.01832714 13:0.003491035 14:0.01856935
2 1:0.01250612 2:0.05098133 5:1 6:1 9:0.01482266 10:0.01268549 11:0.0142893 12:0.02920057 13:0.1376151 14:0.183461
5 5:0.001757722 6:0.01785289 7:0.002907001 8:0.01801159 9:0.01303587 10:0.07466476 11:1 12:1 13:0.02893818 14:0.0608585

The values are in the following format:

label col1:val1 col2:val2 ………. colN:valN

The label is simply the class or category of the value and the tab separated values are the non-zero values in the various columns of the data-set. So for the example data-set, we have the category label for first record as ‘5’ and the columns 9,10,11,12,13 and 14 have non-zero values in them which are given after the colon (:).

We basically want to use a compressed row storage (CRS) format which puts the subsequent non-zeros of the matrix rows in contiguous memory locations.

Now for my case, I had a comma separated values (*.csv) file in the following format:

1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15
0, 0, 0, 0, 0, 0, 0, 0, 0.012741798, 0.06200549, 1, 1, 0.028470168, 0.05982561, 5
0, 0, 0.001177284, 0.016793154, 0, 0, 1, 1, 0.023341597, 0.086872275, 0.007628717, 0.018327144, 0.003491035, 0.018569352, 3
0.01250612, 0.050981331, 0, 0, 1, 1, 0, 0, 0.014822657, 0.012685495, 0.014289297, 0.029200574, 0.137615081, 0.183460986, 2
0, 0, 0, 0, 0.0017572, 0.017852892, 0.002907001, 0.018011585, 0.013035873, 0.074664762, 1, 1, 0.028938184, 0.060858526, 5

The first line has the column headers. The last value for each line or the 15th value is the class label. The simplest way to convert this csv file to a libsvm format is to user two R packages – e1071 and SparseM.

The following code takes the csv file as input and converts it into a txt file in libsvm format:

# download the e1071 library

# download the SparseM library

# load the libraries

# load the csv dataset into memory
train <- read.csv('inputFilePath.csv')

# convert labels into numeric format
train$X15 <- as.numeric(train$X15)

# convert from data.frame to matrix format
x <- as.matrix(train[,1:14])

# put the labels in a separate vector
y <- train[,15]

# convert to compressed sparse row format
xs <- as.matrix.csr(x)

# write the output libsvm format file 
write.matrix.csr(xs, y=y, file="out.txt")

With these 10 lines we now have text file in the desired libsvm format ready to be loaded into Spark for further number crunching.

Hope it was helpful.

Stay tuned 🙂

Found a few amazing blogs for R and Data Science enthusiasts

Today I had a bit of free time and since I had not opened up my R console in a really long long time, I decided to try a few of the scripts that could do something interesting. Going through my R-bloggers mails, I found quite a few interesting posts. I thought of putting these here so that I don’t lose them. Hope you enjoy it too. 🙂

Wordclouds with R! – as simple as it can get

Recently I started with a wonderful course titled “MITx-15.071X – The Analytics Edge” on edX. In my experience it is the best course for getting a quick hands on experience with the real world data science applications. If you have already done the course on Machine Learning by Stanford on Coursera, then I would say that its a great follow up course to learn and apply the algorithms on R by doing this course.

Now coming to the main point at hand – Wordcloud. Visualizations are a great way to present information in layman’s term to people who might not be too scientifically or mathematically oriented. Imagine you have to find the most important words in a text and present them. You could create a table of it, but it would be too dull and might not be too appealing to everyone. Wordclouds are a great way to overcome this issue. R provides an extremely simple way to create wordclouds with just 10 lines of code. So lets dive into it.

Step 1: Save your text in a simple notepad text file. For this post I will use an excerpt from the Military-Industrial Complex Speech by Dwight D. Eisenhower, in 1961, which can be found here: http://coursesa.matrix.msu.edu/~hst306/documents/indust.html

Save the text in a simple .txt file and add an empty line at the end. The reason for this will become clear in the next step.

Step 2: Open the file in R using the command

speech = readLines(“Eisenhower.txt”)

If you had not added an empty line there would be a warning message saying that

incomplete final line found on 'Eisnehower.txt'

This is because readLines() requires an empty line at the end of the file to detect the end.

Step 3: Now we need to download and install 3 packages in R.




Then load these packages using:

library(tm) … and so on

Step 4: This is one of the most important steps in the process. We will use the text-mining package that we just loaded and use it to modify and clean out our text.

First we convert our text to a specific class of R which provides infrastructure for natural language text called Corpus.

eisen = Corpus(VectorSource(speech))

Then we remove all the whitespaces from the text.

eisen = tm_map(eisen, stripWhitespace)

Next we convert all the letters to their lowercase and remove all punctuations.

eisen = tm_map(eisen, tolower)

eisen = tm_map(eisen, removePunctuation)

A speech will contain many typical english words like “I”, “me”, “my”, “and”, “to”, etc. We don’t want these to clutter our cloud and so we must remove them. Fortunately for us R has a list of some typical english words that can be accessed using stopwords(“english”). We will use this directly.

eisen = tm_map(eisen, removeWords, stopwords(“english”))

Looking at the speech I decided to remove three more words using

eisen = tm_map(eisen, removeWords, c(“must”,”will”,”also”))

Next we convert our eisen variable into a plain test format which is necessary in the newer versions of the tm package.

eisen = tm_map(eisen, PlainTextDocument)

Now we will convert this to a nice table like format which will help us get all the words and their frequencies.

dtmEisen = DocumentTermMatrix(eisen)

eisenFinal = as.data.frame(as.matrix(dtmEisen))

You can see the count of various words in the table by using the colnames() and colSums() functions.

table(colnames(eisenFinal), colSums(eisenFinal))

Here the words are given in rows and their counts in the columns.

Now lets us plot this using a simple wordcloud.

wordcloud(colnames(eisenFinal), colSums(eisenFinal))

You will get a very basic wordcloud as such:


We can use the other parameters of the wordcloud function by looking at the doucumentation.


Lets use them

wordcloud(colnames(eisenFinal), colSums(eisenFinal),scale=c(4,.5),min.freq=1,max.words=Inf, random.order=FALSE, random.color=FALSE, rot.per=.5, colors=brewer.pal(12, "Paired"), ordered.colors=FALSE, fixed.asp=TRUE)

To find out what each of these parameters do, please refer to its documentation. Its extremely simple.

Our new plot looks something like this:


You can also type


to view the different color combinations to give to “colors” parameter and experiment with various combinations.

Well there you go. You can now create and publish exciting wordclouds within seconds using R.

Have fun!!!