In Spark MLlib, How to save the BisectingKMeansModel with Python to HDFS?
It may be your Spark version. For bisecting k-means, version 2.1.0 or above is recommended.
You can find a complete example in the docs for the class pyspark.ml.clustering.BisectingKMeans; hope it helps:
https://spark.apache.org/docs/2.1.0/api/python/pyspark.ml.html#pyspark.ml.clustering.BisectingKMeans
The last part of the example code includes a model save/load:
model_path = temp_path + "/bkm_model"
model.save(model_path)
model2 = BisectingKMeansModel.load(model_path)
It works for HDFS as well, but make sure that the temp_path/bkm_model folder does not exist before saving the model, or you will get an error:
(java.io.IOException: Path <temp_path>/bkm_model already exists)
Load model pyspark
You almost have it ... Here is a snippet showing how to load your trained model back and use it to make predictions on new data.
print(spark.version)
# 2.4.3
# fit model
cvModel = cv_grid.fit(train_df)
# save best model to specified path
mPath = "/path/to/model/folder"
cvModel.bestModel.write().overwrite().save(mPath)
# load the persisted model back via the Pipeline API
from pyspark.ml.pipeline import PipelineModel
persistedModel = PipelineModel.load(mPath)
# predict
predictionsDF = persistedModel.transform(test_df)
How to save and load MLLib model in Apache Spark?
You can save your model by using the save method of MLlib models.
# let lrm be a fitted LogisticRegressionModel (pyspark.mllib)
lrm.save(sc, "lrm_model.model")
After storing it you can load it in another application.
from pyspark.mllib.classification import LogisticRegressionModel
sameModel = LogisticRegressionModel.load(sc, "lrm_model.model")
As @zero323 stated before, there is another way to achieve this: using the Predictive Model Markup Language (PMML). PMML is an XML-based file format developed by the Data Mining Group to provide a way for applications to describe and exchange models produced by data mining and machine learning algorithms.
Save and load two ML models in pyspark
I figured out a way to do it: just place them together in a folder. Then the user only needs to know the path to this folder.
import sys
import os
from pyspark.ml.classification import RandomForestClassifier
trainer_1 = RandomForestClassifier(featuresCol="features_1")
trainer_2 = RandomForestClassifier(featuresCol="features_2")
model_1 = trainer_1.fit(df_training_data)
model_2 = trainer_2.fit(df_training_data)
path = sys.argv[1]  # the one folder the user provides
model_1.save(os.path.join(path, 'model_1'))
model_2.save(os.path.join(path, 'model_2'))
The names model_1 and model_2 are hardcoded and do not need to be known by the user.
import sys
import os
from pyspark.ml.classification import RandomForestClassificationModel
model_1 = RandomForestClassificationModel.load(os.path.join(sys.argv[1], 'model_1'))
model_2 = RandomForestClassificationModel.load(os.path.join(sys.argv[1], 'model_2'))
This should solve the problem. Is this the best way to do it or could there be an even better way to bundle the models together using functionality from the Spark library?
How to save Spark model as a file
Nothing is wrong with your code. It is correct that models are saved as a directory; specifically, there is a data directory and a metadata directory. This makes sense because Spark is a distributed system: just as saving data back to HDFS or S3 happens in parallel across partitions, the same is done with the model.
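You can see this layout yourself by walking the saved directory. A sketch, assuming a model was previously saved to a local path named "my_model" (a placeholder, and layout as of Spark 2.x):

```python
# Sketch: inspect the directory a saved Spark ML model produces.
# Assumes a model was previously saved to the local path "my_model".
import os

for root, dirs, files in os.walk("my_model"):
    print(root, dirs, files)
# Typically there is a "metadata" subdirectory (JSON parameters) and a
# "data" subdirectory (Parquet part-files with the fitted parameters).
```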