How to Save/Load Models in Spark/PySpark

In Spark MLlib, How to save the BisectingKMeansModel with Python to HDFS?

It may be your Spark version. For bisecting k-means, Spark 2.1.0 or above is recommended.

You can find a complete example in the documentation for the pyspark.ml.clustering.BisectingKMeans class; hope it helps:

https://spark.apache.org/docs/2.1.0/api/python/pyspark.ml.html#pyspark.ml.clustering.BisectingKMeans

The last part of the example code includes a model save/load:

model_path = temp_path + "/bkm_model"
model.save(model_path)
model2 = BisectingKMeansModel.load(model_path)

It works for HDFS as well, but make sure that the temp_path/bkm_model folder does not exist before saving the model, or it will give you an error:

(java.io.IOException: Path <temp_path>/bkm_model already exists)
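
If you would rather not delete the folder by hand, the ML writer API can overwrite it for you. A minimal sketch, assuming model is the fitted BisectingKMeansModel from the example above:

# overwrite() replaces an existing directory instead of raising the IOException
model.write().overwrite().save(model_path)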

Load model in PySpark

You almost have it. Here is a snippet showing how to load your trained model back and use it to make predictions on new data.

print(spark.version)
# 2.4.3

# fit the CrossValidator (here, cv_grid) on the training data
cvModel = cv_grid.fit(train_df)

# save best model to specified path
mPath = "/path/to/model/folder"
cvModel.bestModel.write().overwrite().save(mPath)

# load the persisted model back via the Pipeline API
from pyspark.ml.pipeline import PipelineModel
persistedModel = PipelineModel.load(mPath)

# predict on new data
predictionsDF = persistedModel.transform(test_df)
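
Note that PipelineModel.load works here because the best model from a CrossValidator fitted over a Pipeline estimator is itself a PipelineModel; if the estimator were a bare classifier, you would load it with that model's own class instead.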

How to save and load an MLlib model in Apache Spark?

You can save your model by using the save method of MLlib models.

# let lrm be a trained LogisticRegressionModel (pyspark.mllib)
lrm.save(sc, "lrm_model.model")

After storing it, you can load it in another application:

from pyspark.mllib.classification import LogisticRegressionModel

sameModel = LogisticRegressionModel.load(sc, "lrm_model.model")
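
For reference, here is a self-contained sketch of the full round trip; the toy data and path are illustrative assumptions:

from pyspark.mllib.classification import LogisticRegressionWithLBFGS, LogisticRegressionModel
from pyspark.mllib.regression import LabeledPoint

# toy training data, just to make the sketch runnable
data = sc.parallelize([
    LabeledPoint(0.0, [0.0, 1.0]),
    LabeledPoint(1.0, [1.0, 0.0]),
])
lrm = LogisticRegressionWithLBFGS.train(data)

lrm.save(sc, "lrm_model.model")  # writes a directory; it must not already exist
sameModel = LogisticRegressionModel.load(sc, "lrm_model.model")
print(sameModel.predict([1.0, 0.0]))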

As @zero323 stated before, there is another way to achieve this: the Predictive Model Markup Language (PMML).

PMML is an XML-based file format developed by the Data Mining Group to provide a way for applications to describe and exchange models produced by data mining and machine learning algorithms.
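
For completeness, a sketch of exporting to PMML from PySpark. This relies on the third-party pyspark2pmml package (a JPMML wrapper that also needs the JPMML-SparkML jar on the classpath), not on anything built into Spark itself; pipeline_model and the output file name are illustrative assumptions:

# pyspark2pmml is a third-party package and must be installed separately
from pyspark2pmml import PMMLBuilder

PMMLBuilder(sc, df_training_data, pipeline_model).buildFile("model.pmml")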

Save and load two ML models in pyspark

I figured out a way to do it, simply by saving them together in one folder. Then the user only needs to know the path to that folder.

import sys
import os
from pyspark.ml.classification import RandomForestClassifier

# df_training_data is assumed to be an existing training DataFrame
trainer_1 = RandomForestClassifier(featuresCol="features_1")
trainer_2 = RandomForestClassifier(featuresCol="features_2")
model_1 = trainer_1.fit(df_training_data)
model_2 = trainer_2.fit(df_training_data)

# the user-supplied folder (sys.argv[1]) will hold both models
path = sys.argv[1]
os.mkdir(path)
model_1.save(os.path.join(path, 'model_1'))
model_2.save(os.path.join(path, 'model_2'))

The names model_1 and model_2 are hardcoded and do not need to be known by the user.

import sys
import os
from pyspark.ml.classification import RandomForestClassificationModel

model_1 = RandomForestClassificationModel.load(os.path.join(sys.argv[1], 'model_1'))
model_2 = RandomForestClassificationModel.load(os.path.join(sys.argv[1], 'model_2'))

This should solve the problem. Is this the best way to do it or could there be an even better way to bundle the models together using functionality from the Spark library?
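
One alternative that stays entirely within the Spark library is to wrap both estimators in a single Pipeline, so the bundle is one saved artifact. A sketch, assuming both models train on the same DataFrame; the distinct output column names are required so the two classifiers don't clash when transforming the same rows:

from pyspark.ml import Pipeline, PipelineModel
from pyspark.ml.classification import RandomForestClassifier

trainer_1 = RandomForestClassifier(
    featuresCol="features_1", predictionCol="prediction_1",
    rawPredictionCol="rawPrediction_1", probabilityCol="probability_1")
trainer_2 = RandomForestClassifier(
    featuresCol="features_2", predictionCol="prediction_2",
    rawPredictionCol="rawPrediction_2", probabilityCol="probability_2")

pipeline = Pipeline(stages=[trainer_1, trainer_2])
bundled = pipeline.fit(df_training_data)

bundled.save(sys.argv[1])                  # one folder, both models inside
restored = PipelineModel.load(sys.argv[1])
model_1, model_2 = restored.stages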

How to save Spark model as a file

Nothing is wrong with your code. It is correct that models are saved as a directory; specifically, it contains data and metadata subdirectories. This makes sense because Spark is a distributed system: just as writes back to HDFS or S3 happen in parallel across partitions, so does saving the model.
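
As a rough illustration (exact file names vary by Spark version and model type), a saved model directory looks something like:

bkm_model/
    metadata/    <- JSON with the class name, params, and Spark version
        part-00000
    data/        <- the model parameters, stored as Parquet
        part-00000-<uuid>.snappy.parquet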


