Convert ‘csv’ format files to ‘libsvm’ data format

A few days ago I started doing some predictive analytic using Apache Spark’s MLlib. The MLlib is a machine learning library and provides support for a large number of popular machine learning algorithms in Scala, Python and Java.

However, as is the case while running many ML programs, the input data format has to be different for different cases. I wanted to do a classification of data into different categories and I decided to use the MLlib’s Multilayer Perceptron Classifier which is a classifier based on feedforward artificial neural network. You can read more about it here.

The input data format to run analysis using this algorithm required data to be in ‘libsvm’ format. The format looks something like this:

5 9:0.0127418 10:0.06200549 11:1 12:1 13:0.02847017 14:0.05982561
3 3:0.001177284 4:0.01679315 7:1 8:1 9:0.0233416 10:0.08687227 11:0.007628717 12:0.01832714 13:0.003491035 14:0.01856935
2 1:0.01250612 2:0.05098133 5:1 6:1 9:0.01482266 10:0.01268549 11:0.0142893 12:0.02920057 13:0.1376151 14:0.183461
5 5:0.001757722 6:0.01785289 7:0.002907001 8:0.01801159 9:0.01303587 10:0.07466476 11:1 12:1 13:0.02893818 14:0.0608585

The values are in the following format:

label col1:val1 col2:val2 ………. colN:valN

The label is simply the class or category of the value and the tab separated values are the non-zero values in the various columns of the data-set. So for the example data-set, we have the category label for first record as ‘5’ and the columns 9,10,11,12,13 and 14 have non-zero values in them which are given after the colon (:).

We basically want to use a compressed row storage (CRS) format which puts the subsequent non-zeros of the matrix rows in contiguous memory locations.

Now for my case, I had a comma separated values (*.csv) file in the following format:

1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15
0, 0, 0, 0, 0, 0, 0, 0, 0.012741798, 0.06200549, 1, 1, 0.028470168, 0.05982561, 5
0, 0, 0.001177284, 0.016793154, 0, 0, 1, 1, 0.023341597, 0.086872275, 0.007628717, 0.018327144, 0.003491035, 0.018569352, 3
0.01250612, 0.050981331, 0, 0, 1, 1, 0, 0, 0.014822657, 0.012685495, 0.014289297, 0.029200574, 0.137615081, 0.183460986, 2
0, 0, 0, 0, 0.0017572, 0.017852892, 0.002907001, 0.018011585, 0.013035873, 0.074664762, 1, 1, 0.028938184, 0.060858526, 5

The first line has the column headers. The last value for each line or the 15th value is the class label. The simplest way to convert this csv file to a libsvm format is to user two R packages – e1071 and SparseM.

The following code takes the csv file as input and converts it into a txt file in libsvm format:

# download the e1071 library
install.packages("e1071")

# download the SparseM library
install.packages("SparseM")

# load the libraries
library(e1071)
library(SparseM)

# load the csv dataset into memory
train <- read.csv('inputFilePath.csv')

# convert labels into numeric format
train$X15 <- as.numeric(train$X15)

# convert from data.frame to matrix format
x <- as.matrix(train[,1:14])

# put the labels in a separate vector
y <- train[,15]

# convert to compressed sparse row format
xs <- as.matrix.csr(x)

# write the output libsvm format file 
write.matrix.csr(xs, y=y, file="out.txt")

With these 10 lines we now have text file in the desired libsvm format ready to be loaded into Spark for further number crunching.

Hope it was helpful.

Stay tuned 🙂

Advertisements

2 comments

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s