How to prepare data into a LibSVM format from DataFrame?
The issue you are facing can be divided into the following steps:
1. Converting your ratings (I believe) into LabeledPoint data X.
2. Saving X in libsvm format.
1. Converting your ratings into LabeledPoint data X
Let's consider the following raw ratings:
val rawRatings: Seq[String] = Seq("0,1,1.0", "0,3,3.0", "1,1,1.0", "1,2,0.0", "1,3,3.0", "3,3,4.0", "10,3,4.5")
You can handle those raw ratings as a coordinate list matrix (COO).
Spark implements a distributed matrix backed by an RDD of its entries: CoordinateMatrix, where each entry is a tuple of (i: Long, j: Long, value: Double).
Note: A CoordinateMatrix should be used only when both dimensions of the matrix are huge and the matrix is very sparse (which is usually the case for user/item ratings).
import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}
import org.apache.spark.rdd.RDD
val data: RDD[MatrixEntry] =
  sc.parallelize(rawRatings).map { line =>
    val fields = line.split(",")
    val i = fields(0).toLong
    val j = fields(1).toLong
    val value = fields(2).toDouble
    MatrixEntry(i, j, value)
  }
Now let's convert that RDD[MatrixEntry] to a CoordinateMatrix and extract the indexed rows:
val df = new CoordinateMatrix(data) // Convert the RDD to a CoordinateMatrix
  .toIndexedRowMatrix().rows        // Extract indexed rows (an RDD[IndexedRow])
  .toDF("label", "features")        // Convert to a DataFrame (needs the SQLContext implicits in scope)
2. Saving LabeledPoint data in libsvm format
Since Spark 2.0, you can do that using the DataFrameWriter. Let's create a small example with some dummy LabeledPoint data (you can also use the DataFrame we created earlier):
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
val pos = LabeledPoint(1.0, Vectors.dense(1.0, 0.0, 3.0))
val neg = LabeledPoint(0.0, Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0)))
val df = Seq(neg, pos).toDF("label", "features")
Unfortunately we still can't use the DataFrameWriter directly: while most pipeline components support backward compatibility for loading, DataFrames from Spark versions prior to 2.0 that contain vector or matrix columns first need to be migrated to the new spark.ml vector and matrix types.
Utilities for converting DataFrame columns from mllib.linalg to ml.linalg types (and vice versa) can be found in org.apache.spark.mllib.util.MLUtils. In our case we need to do the following (for both the dummy data and the DataFrame from step 1):
import org.apache.spark.mllib.util.MLUtils
// convert DataFrame columns
val convertedVecDF = MLUtils.convertVectorColumnsToML(df)
Now let's save the DataFrame :
convertedVecDF.write.format("libsvm").save("data/foo")
And we can check the files contents :
$ cat data/foo/part*
0.0 1:1.0 3:3.0
1.0 1:1.0 2:0.0 3:3.0
EDIT: In the current version of Spark (2.1.0) there is no need to use the mllib package. You can simply save LabeledPoint data in libsvm format like below:
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.feature.LabeledPoint
val pos = LabeledPoint(1.0, Vectors.dense(1.0, 0.0, 3.0))
val neg = LabeledPoint(0.0, Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0)))
val df = Seq(neg, pos).toDF("label", "features")
df.write.format("libsvm").save("data/foo")
LIBSVM Data Preparation: Excel data to LIBSVM format
The LIBSVM data format is given by:
<label> <index1>:<value1> <index2>:<value2> ...
...
...
As you can see, this forms a matrix [(IndexCount + 1) columns, LineCount rows], more precisely a sparse matrix. If you specify a value for each index, you have a dense matrix, but if you only specify a few indices like <label> <5:value> <8:value>, only the indices 5 and 8 (and of course the label) will have a custom value; all other values are set to 0. This is just for notational simplicity and to save space, since datasets can be huge.
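To make the dense-versus-sparse distinction concrete, here is a minimal sketch in Python (a language-neutral illustration; the helper name `to_libsvm_line` is mine, not part of any library) that encodes one row in the sparse form by dropping zero values:

```python
def to_libsvm_line(label, features):
    """Encode one row as a LIBSVM line.

    `features` maps 1-based feature indices to values; zero values
    are omitted, which is exactly what makes the format sparse.
    """
    # Indices must be unique and in ascending order, so sort them.
    pairs = sorted((i, v) for i, v in features.items() if v != 0)
    return " ".join([str(label)] + [f"{i}:{v}" for i, v in pairs])

print(to_libsvm_line(1.0, {1: 1.0, 2: 0.0, 3: 3.0}))  # -> 1.0 1:1.0 3:3.0
```

Note how the zero-valued feature 2 simply disappears from the line; a reader of the file must treat every missing index as 0.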
For the meaning of the tags, I cite the README file:
<label> is the target value of the training data. For classification,
it should be an integer which identifies a class (multi-class
classification is supported). For regression, it's any real
number. For one-class SVM, it's not used so can be any number.
<index> is an integer starting from 1, <value> is a real number. The indices
must be in an ascending order.
As you can see, the label is the data you want to predict. The index marks a feature of your data, together with its value. A feature is simply an indicator to associate or correlate your target value with, so that a better prediction can be made.
Totally fictional story time: Gabriel Luna (a totally fictional character) wants to predict his energy consumption for the next few days. He found out that the outside temperature from the day before is a good indicator for that, so he selects Temperature with index 1 as a feature. Important: indices always start at one; zero can sometimes cause strange LIBSVM behaviour. Then, he surprisingly notices that the day of the week (Monday to Sunday, or 0 to 6) also affects his load, so he selects it as a second feature with index 2. A matrix row for LIBSVM now has the following format:
<myLoad_Value> <1:outsideTemperatureFromYesterday_Value> <2:dayOfTheWeek_Value>
Gabriel Luna (he is Batman at night) now captures these data over a few weeks, which could look something like this (load in kWh, temperature in °C, day as mentioned above):
0.72 1:25 2:0
0.65 1:21 2:1
0.68 1:29 2:2
...
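The rows above can be generated programmatically. A small Python sketch, purely illustrative, with the column layout from the story (load in kWh, yesterday's temperature, day of week):

```python
# Each observation: (load_kwh, yesterday_temp_c, day_of_week)
observations = [(0.72, 25, 0), (0.65, 21, 1), (0.68, 29, 2)]

# Feature 1 is the temperature, feature 2 is the day of the week.
lines = [f"{load} 1:{temp} 2:{day}" for load, temp, day in observations]
print("\n".join(lines))
# -> 0.72 1:25 2:0
#    0.65 1:21 2:1
#    0.68 1:29 2:2
```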
Notice that we could leave out 2:0 because of the sparse matrix format. This would be your training data to train a LIBSVM model. Then, we predict the load of tomorrow as follows: you know the temperature of today, let us say 23 °C, and today is Tuesday, which is 1, so tomorrow is 2. So, this is the line or vector to use with the model:
0 1:23 2:2
Here, you can set the <label> value arbitrarily; it will be overwritten with the predicted value. I hope this helps.
R: Reading libsvm files with library(e1071)
OK, the short of it:
m = read.matrix.csr(filename)$x
because read.matrix.csr returns a list with two elements: the matrix and a vector. In other words, the target/label/class is separated out from the features matrix.
NOTE for fellow R neophytes: in CRAN documents, the "Value" subheading refers to the return values of the function:
Value
If the data file includes no y variable, read.matrix.csr returns
an object of class matrix.csr, else a list with components:
x    object of class matrix.csr
y    vector of numeric values or factor levels, depending on fac
How to understand the format type of libsvm of Spark MLlib?
The LibSVM format is quite simple. The first entry on each row is the class label, in this case 0 or 1. Following that are the features; each one is a pair where the first part is the feature index (i.e. which feature it is) and the second part is the actual value. The feature indices start from 1 (there is no index 0) and are in ascending order. Indices not present on a row have the value 0.
In summary, each row looks like this;
<label> <index1>:<value1> <index2>:<value2> ... <indexN>:<valueN>
This format is advantageous when the data is sparse and contains lots of zeroes: zero values are not saved, which makes the files both smaller and easier to read.
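Reading the format back follows the same rules in reverse. A minimal Python sketch (the helper name `parse_libsvm_line` is mine, for illustration only) that turns one line into a label and a dense row, filling unmentioned indices with 0:

```python
def parse_libsvm_line(line, num_features):
    """Parse one LIBSVM line into (label, dense_row).

    Indices are 1-based; features absent from the line default to 0.0.
    """
    parts = line.split()
    label = float(parts[0])
    row = [0.0] * num_features
    for pair in parts[1:]:
        index, value = pair.split(":")
        row[int(index) - 1] = float(value)  # shift to 0-based position
    return label, row

print(parse_libsvm_line("0.0 1:1.0 3:3.0", 3))  # -> (0.0, [1.0, 0.0, 3.0])
```

Note that `num_features` must be supplied (or scanned from the whole file), since a sparse line alone does not tell you how many trailing zero features exist.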