Save classifier to disk in scikit-learn
Classifiers are just objects that can be pickled and dumped like any other. To continue your example:
import pickle  # on Python 2 this was cPickle; Python 3 has only pickle

# save the classifier
with open('my_dumped_classifier.pkl', 'wb') as fid:
    pickle.dump(gnb, fid)

# load it again
with open('my_dumped_classifier.pkl', 'rb') as fid:
    gnb_loaded = pickle.load(fid)
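In the snippet above, `gnb` is assumed to be an already fitted classifier from the question. A minimal self-contained round trip looks like this; a plain stand-in class is used here instead of a fitted estimator so the sketch runs on its own, but the save/load calls are identical for any picklable object:

```python
import pickle
import os
import tempfile

# Stand-in for a fitted classifier; any picklable object round-trips the same way.
class StubClassifier:
    def __init__(self, classes):
        self.classes_ = classes

    def predict(self, X):
        # Trivial "model": always predict the first class.
        return [self.classes_[0] for _ in X]

clf = StubClassifier(classes=["spam", "ham"])
path = os.path.join(tempfile.mkdtemp(), "my_dumped_classifier.pkl")

# save the classifier
with open(path, "wb") as fid:
    pickle.dump(clf, fid)

# load it again and check it behaves identically
with open(path, "rb") as fid:
    clf_loaded = pickle.load(fid)

print(clf_loaded.classes_)             # ['spam', 'ham']
print(clf_loaded.predict([[0], [1]]))  # ['spam', 'spam']
```

The loaded object is a full copy of the original, so anything learned before dumping (here, `classes_`) survives the round trip.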
Edit: if you are using an sklearn Pipeline that contains custom transformers which cannot be serialized by pickle (nor by joblib), Neuraxle's custom ML pipeline saving is one solution: it lets you define your own step savers on a per-step basis. On save, each step's saver is called if one is defined; steps without a saver fall back to joblib by default.
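To see why plain pickle fails on such steps: pickle serializes functions by importable name, so a transformer that holds a lambda (or any locally defined function) cannot be dumped, while the same transformer holding a module-level function can. A minimal sketch with a hypothetical `FuncTransformer`:

```python
import pickle

# A transformer-like object that stores a callable. Pickle records functions
# by qualified name, and a lambda has no importable name, so dumping fails.
class FuncTransformer:
    def __init__(self, func):
        self.func = func

    def transform(self, X):
        return [self.func(x) for x in X]

step = FuncTransformer(lambda x: x * 2)

try:
    pickle.dumps(step)
    failed = False
except (pickle.PicklingError, AttributeError, TypeError):
    failed = True

print(failed)  # True: this step would need a custom saver (or a named function)

# The same transformer with a module-level (named) function pickles fine:
def double(x):
    return x * 2

restored = pickle.loads(pickle.dumps(FuncTransformer(double)))
print(restored.transform([1, 2]))  # [2, 4]
```

This is the situation per-step savers are designed for: the unpicklable step gets a custom saver, and every other step keeps the default serialization.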
How to save classifier in sklearn with Countvectorizer() and TfidfTransformer()
Following MaximeKan's suggestion, I found a way to save all three objects together.
Saving the model and the vectorizers:

import pickle

with open(filename, 'wb') as fout:
    pickle.dump((movieVzer, movieTfmer, clf), fout)
Loading the model and the vectorizers for use:

import pickle

with open('finalized_model.pkl', 'rb') as f:
    movieVzer, movieTfmer, clf = pickle.load(f)
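The trick above is that a single dump/load call can carry any picklable container. A self-contained sketch of the same pattern, with plain dicts standing in for the vectorizer, transformer, and classifier (the calls are identical for real fitted objects):

```python
import os
import pickle
import tempfile

# Stand-ins for CountVectorizer, TfidfTransformer, and the classifier.
movieVzer = {"vocab": ["good", "bad"]}
movieTfmer = {"idf": [1.0, 2.0]}
clf = {"coef": [0.5, -0.5]}

filename = os.path.join(tempfile.mkdtemp(), "finalized_model.pkl")

# One dump call stores all three objects as a single tuple...
with open(filename, "wb") as fout:
    pickle.dump((movieVzer, movieTfmer, clf), fout)

# ...and one load call unpacks them again, in the same order.
with open(filename, "rb") as f:
    vzer2, tfmer2, clf2 = pickle.load(f)

print(vzer2 == movieVzer, tfmer2 == movieTfmer, clf2 == clf)  # True True True
```

The only thing to keep consistent is the order of the tuple: you must unpack on load in the same order you packed on save.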
Save scikit-learn model without datasets
The persistent representation of Scikit-Learn estimators DOES NOT include any training data.
Speaking of decision trees and their ensembles (such as random forests), the size of the estimator object scales quadratically with the depth of the decision trees (i.e. the max_depth parameter). This is because each decision tree's configuration is represented using (max_depth, max_depth) matrices of float64 values.
You can make your random forest objects smaller by limiting the max_depth parameter. If you're worried about a potential loss of predictive performance, you can increase the number of child estimators to compensate.
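A quick way to compare candidate models' persisted sizes before committing one to disk is len(pickle.dumps(...)). A sketch below; since a fitted forest is not available here, nested lists stand in for the per-tree depth-by-depth matrices, but the measurement technique is the same for real estimators:

```python
import pickle

def pickled_size(obj):
    """Size in bytes of the object's pickle: a proxy for on-disk model size."""
    return len(pickle.dumps(obj))

# Stand-ins for per-tree (max_depth, max_depth) float matrices:
# a "forest" of 10 square matrices at two different depths.
def fake_forest(n_trees, depth):
    return [[[0.0] * depth for _ in range(depth)] for _ in range(n_trees)]

shallow = fake_forest(10, depth=8)
deep = fake_forest(10, depth=32)

# The deeper "trees" serialize to a much larger payload.
print(pickled_size(shallow) < pickled_size(deep))  # True
```

Running this comparison on two candidate forests (say, max_depth=8 vs. unlimited) makes the size/accuracy trade-off concrete before you pick a persistence strategy.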
Longer term, you may wish to explore alternative representations for Scikit-Learn models, for example converting them to the PMML data format using the SkLearn2PMML package.
Export sklearn classifier to reference it in other scripts
This code snippet will work for you:
import pickle
# save the model to disk
filename = 'finalized_model.sav'
pickle.dump(clf, open(filename, 'wb'))
# some time later...
# load the model from disk
loaded_model = pickle.load(open(filename, 'rb'))
result = loaded_model.score(X_test, Y_test)
print(result)
From this source. Your question is a duplicate.
How to use pickle to save sklearn model
Save:
import pickle
with open("model.pkl", "wb") as f:
    pickle.dump(model, f)
Load:
with open("model.pkl", "rb") as f:
    model = pickle.load(f)
In the specific case of scikit-learn, it may be better to use joblib’s
replacement of pickle (dump & load), which is more efficient on
objects that carry large numpy arrays internally as is often the case
for fitted scikit-learn estimators:
Save:
import joblib
joblib.dump(model, "model.joblib")
Load:
model = joblib.load("model.joblib")