Read/Write Data in Libsvm Format

How to prepare data into a LibSVM format from DataFrame?

The issue you are facing can be divided into the following steps:

  • Converting your ratings (I believe) into LabeledPoint data X.
  • Saving X in libsvm format.

1. Converting your ratings into LabeledPoint data X

Let's consider the following raw ratings:

val rawRatings: Seq[String] = Seq("0,1,1.0", "0,3,3.0", "1,1,1.0", "1,2,0.0", "1,3,3.0", "3,3,4.0", "10,3,4.5")

You can handle those raw ratings as a coordinate list matrix (COO).

Spark implements a distributed matrix backed by an RDD of its entries: CoordinateMatrix, where each entry is a tuple of (i: Long, j: Long, value: Double).

Note: A CoordinateMatrix should be used only when both dimensions of the matrix are huge and the matrix is very sparse (which is usually the case with user/item ratings).

import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}
import org.apache.spark.rdd.RDD

val data: RDD[MatrixEntry] =
  sc.parallelize(rawRatings).map { line =>
    val fields = line.split(",")
    val i = fields(0).toLong
    val j = fields(1).toLong
    val value = fields(2).toDouble
    MatrixEntry(i, j, value)
  }

Now let's convert that RDD[MatrixEntry] to a CoordinateMatrix and extract the indexed rows:

val df = new CoordinateMatrix(data) // Convert the RDD to a CoordinateMatrix
  .toIndexedRowMatrix().rows        // Extract the indexed rows (RDD[IndexedRow])
  .toDF("label", "features")        // Convert to a DataFrame with label/features columns

2. Saving LabeledPoint data in libsvm format

Since Spark 2.0, you can do that using the DataFrameWriter. Let's create a small example with some dummy LabeledPoint data (you can also use the DataFrame we created earlier):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
val pos = LabeledPoint(1.0, Vectors.dense(1.0, 0.0, 3.0))
val neg = LabeledPoint(0.0, Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0)))

val df = Seq(neg, pos).toDF("label", "features")

Unfortunately, we can't use the DataFrameWriter directly just yet: while most pipeline components support backward compatibility for loading, DataFrames and pipelines created in Spark versions prior to 2.0 that contain vector or matrix columns may need to be migrated to the new spark.ml vector and matrix types.

Utilities for converting DataFrame columns from mllib.linalg to ml.linalg types (and vice versa) can be found in org.apache.spark.mllib.util.MLUtils. In our case we need to do the following (for both the dummy data and the DataFrame from step 1):

import org.apache.spark.mllib.util.MLUtils
// convert DataFrame columns
val convertedVecDF = MLUtils.convertVectorColumnsToML(df)

Now let's save the DataFrame:

convertedVecDF.write.format("libsvm").save("data/foo")

And we can check the files' contents:

$ cat data/foo/part*
0.0 1:1.0 3:3.0
1.0 1:1.0 2:0.0 3:3.0

EDIT:
In the current version of Spark (2.1.0) there is no need to use the mllib package. You can simply save LabeledPoint data in libsvm format as below:

import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.feature.LabeledPoint
val pos = LabeledPoint(1.0, Vectors.dense(1.0, 0.0, 3.0))
val neg = LabeledPoint(0.0, Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0)))

val df = Seq(neg, pos).toDF("label", "features")
df.write.format("libsvm").save("data/foo")

LIBSVM Data Preparation: Excel data to LIBSVM format

The LIBSVM data format is given by:

<label> <index1>:<value1> <index2>:<value2> ...
...
...

As you can see, this forms a matrix of [(IndexCount + 1) columns, LineCount rows], more precisely a sparse matrix. If you specify a value for each index, you have a dense matrix, but if you only specify a few indices like <label> 5:<value> 8:<value>, only indices 5 and 8 (and of course the label) will have a custom value; all other values are set to 0. This is just for notational simplicity and to save space, since datasets can be huge.
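To make "all other values are set to 0" concrete, here is a minimal sketch in plain Scala (no Spark needed; the helper name `parseLibsvmLine` is my own, not part of any library) that expands a sparse row into a dense feature array:

```scala
// Expand a sparse libsvm row into (label, dense feature array).
// Indices in the file are 1-based; absent indices default to 0.0.
def parseLibsvmLine(line: String, numFeatures: Int): (Double, Array[Double]) = {
  val tokens = line.trim.split("\\s+")
  val label = tokens.head.toDouble
  val dense = Array.fill(numFeatures)(0.0)
  tokens.tail.foreach { tok =>
    val parts = tok.split(":")
    dense(parts(0).toInt - 1) = parts(1).toDouble // shift 1-based index to 0-based array slot
  }
  (label, dense)
}

val (label, features) = parseLibsvmLine("1.0 1:1.0 3:3.0", 4)
// label == 1.0; features == Array(1.0, 0.0, 3.0, 0.0): omitted indices 2 and 4 became 0.0
```

The sketch assumes well-formed `index:value` tokens; a real parser would also validate the input.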

For the meaning of the tags, I cite the README file:

<label> is the target value of the training data. For classification,
it should be an integer which identifies a class (multi-class
classification is supported). For regression, it's any real
number. For one-class SVM, it's not used so can be any number.
<index> is an integer starting from 1, <value> is a real number. The indices
must be in an ascending order.

As you can see, the label is the data you want to predict. The index marks a feature of your data and its value. A feature is simply an indicator to associate or correlate your target value with, so a better prediction can be made.

Totally fictional story time: Gabriel Luna (a totally fictional character) wants to predict his energy consumption for the next few days. He found out that the outside temperature from the day before is a good indicator for that, so he selects temperature as the feature with index 1. Important: indices always start at one; zero can sometimes cause strange LIBSVM behaviour. Then he surprisingly notices that the day of the week (Monday to Sunday, or 0 to 6) also affects his load, so he selects it as a second feature with index 2. A matrix row for LIBSVM now has the following format:

<myLoad_Value> <1:outsideTemperatureFromYesterday_Value> <2:dayOfTheWeek_Value>

Gabriel Luna (he is Batman at night) now captures these data over a few weeks, which could look something like this (load in kWh, temperature in °C, day as mentioned above):

0.72 1:25 2:0
0.65 1:21 2:1
0.68 1:29 2:2
...

Notice that we could leave out 2:0 because of the sparse matrix format. This would be your training data to train a LIBSVM model. Then we predict the load of tomorrow as follows: you know the temperature of today, let us say 23 °C, and today is Tuesday, which is 1, so tomorrow is 2. So this is the line or vector to use with the model:

0 1:23 2:2

Here, you can set the <label> value arbitrarily. It will be overwritten with the predicted value. I hope this helps.
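The row-building logic from the story above can be sketched in plain Scala (the helper name `toLibsvmRow` and the parameter layout are my own, purely for illustration); note how a zero-valued feature is dropped, as the sparse format allows:

```scala
// Build one libsvm row from (load, temperature, dayOfWeek),
// dropping zero-valued features as the sparse format permits.
def toLibsvmRow(load: Double, temp: Double, day: Int): String = {
  val features = Seq(1 -> temp, 2 -> day.toDouble) // (index, value) pairs, 1-based
  val kept = features.collect { case (idx, v) if v != 0.0 => s"$idx:$v" }
  (load.toString +: kept).mkString(" ")
}

toLibsvmRow(0.72, 25, 0) // "0.72 1:25.0" -- 2:0 omitted
toLibsvmRow(0.65, 21, 1) // "0.65 1:21.0 2:1.0"
```

Whether you actually omit explicit zeros is a space trade-off; both forms are valid LIBSVM input.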

R: Reading libsvm files with library(e1071)

OK, the short of it:

m <- read.matrix.csr(filename)$x

because read.matrix.csr returns a list with two elements: the matrix and a vector.
In other words, the target/label/class is separated out from the features matrix.

NOTE for fellow R neophytes: in CRAN documents, the "Value" subheading refers to the return values of the function.

Value

If the data file includes no y variable, read.matrix.csr returns
an object of class matrix.csr,

else a list with components:

x object of class matrix.csr

y vector of numeric values or factor levels, depending on fac

How to understand the format type of libsvm of Spark MLlib?

The LibSVM format is quite simple. Each row starts with the class label, in this case 0 or 1. Following that are the features; each one has two parts: the first is the feature index (i.e. which feature it is) and the second is the actual value.

The feature indices start from 1 (there is no index 0) and must be in ascending order. Indices not present on a row have the value 0.

In summary, each row looks like this:

<label> <index1>:<value1> <index2>:<value2> ... <indexN>:<valueN>

This format is advantageous when the data is sparse and contains lots of zeroes: zero values are not saved, which makes the files both smaller and easier to read.
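The format rules above (a numeric label, then 1-based indices in strictly ascending order) can be checked with a small sketch in plain Scala (the validator name is mine, and it assumes well-formed `index:value` tokens):

```scala
// Check a libsvm row: a numeric label followed by index:value pairs
// with 1-based, strictly ascending indices.
def isValidLibsvmRow(line: String): Boolean = {
  val tokens = line.trim.split("\\s+")
  val labelOk = tokens.headOption.exists(_.toDoubleOption.isDefined)
  val indices = tokens.tail.map(_.split(":")(0).toInt)
  labelOk &&
    indices.forall(_ >= 1) &&                       // no index 0
    indices.sameElements(indices.sorted.distinct)   // strictly ascending
}

isValidLibsvmRow("1.0 1:1.0 3:3.0") // true
isValidLibsvmRow("1.0 3:3.0 1:1.0") // false -- indices not ascending
```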


