How to one-hot encode variable-length features?
You can use scikit-learn's MultiLabelBinarizer, which handles exactly this case: each sample is a list of labels, and the lists may differ in length.
Code for your example:
features = [
    ['f1', 'f2', 'f3'],
    ['f2', 'f4', 'f5', 'f6'],
    ['f1', 'f2']
]
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
new_features = mlb.fit_transform(features)
Output:
array([[1, 1, 1, 0, 0, 0],
       [0, 1, 0, 1, 1, 1],
       [1, 1, 0, 0, 0, 0]])
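The columns follow the order of mlb.classes_ (here f1 through f6). The encoded matrix can then be passed on to other utilities, such as those in sklearn.feature_selection. Note that MultiLabelBinarizer's fit and transform accept a single argument, so it cannot be dropped into a Pipeline step as-is; a thin wrapper transformer is needed for that.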
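Here is a minimal sketch of one way to get this into a pipeline. It assumes a small wrapper class (ListBinarizer is a hypothetical name, not part of scikit-learn) that gives MultiLabelBinarizer the usual fit(X, y) signature, and it uses VarianceThreshold purely as a stand-in for a feature_selection step:

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_selection import VarianceThreshold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MultiLabelBinarizer


class ListBinarizer(BaseEstimator, TransformerMixin):
    """Hypothetical wrapper: exposes MultiLabelBinarizer as a regular transformer."""

    def fit(self, X, y=None):
        # MultiLabelBinarizer.fit takes only the label lists, so y is ignored.
        self.mlb_ = MultiLabelBinarizer().fit(X)
        return self

    def transform(self, X):
        return self.mlb_.transform(X)


features = [
    ['f1', 'f2', 'f3'],
    ['f2', 'f4', 'f5', 'f6'],
    ['f1', 'f2'],
]

pipe = Pipeline([
    ("binarize", ListBinarizer()),
    ("select", VarianceThreshold()),  # drops columns that never vary
])
selected = pipe.fit_transform(features)
print(selected.shape)
```

In this toy data f2 occurs in every sample, so its column is constant and VarianceThreshold removes it, leaving five of the six columns.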
Multi-Feature One-Hot-Encoder with varying amount of feature instances
I solved it for now by recasting it as a CountVectorizer problem, thanks to David Maspis answer on the Data Science Stack Exchange.
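The referenced answer is not reproduced here, so the following is only one plausible reading of the CountVectorizer approach. It assumes each sample is already a list of feature names and passes a pass-through analyzer (identity_analyzer is a hypothetical helper) so that no text tokenization takes place:

```python
from sklearn.feature_extraction.text import CountVectorizer


def identity_analyzer(tokens):
    # Hypothetical helper: each sample is already a list of feature names,
    # so hand it back unchanged instead of tokenizing text.
    return tokens


features = [
    ['f1', 'f2', 'f3'],
    ['f2', 'f4', 'f5', 'f6'],
    ['f1', 'f2'],
]

# binary=True records presence/absence rather than counts.
vectorizer = CountVectorizer(analyzer=identity_analyzer, binary=True)
encoded = vectorizer.fit_transform(features).toarray()

print(sorted(vectorizer.vocabulary_))
print(encoded)
```

With binary=True, repeated occurrences of a feature within one sample still yield 1, which matches one-hot style output. Leaving it at the default would give counts instead.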
How to elegantly one hot encode a series of lists in pandas
MultiLabelBinarizer from the sklearn library is more efficient for these problems, and it can be applied directly to a pd.Series of lists. Here's a demo:
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer
test = pd.Series([['a', 'b', 'e'], ['c', 'a'], ['d'], ['d'], ['e']])
mlb = MultiLabelBinarizer()
res = pd.DataFrame(mlb.fit_transform(test),
                   columns=mlb.classes_,
                   index=test.index)
Result
   a  b  c  d  e
0  1  1  0  0  1
1  1  0  1  0  0
2  0  0  0  1  0
3  0  0  0  1  0
4  0  0  0  0  1
How to handle unseen categorical variables with one hot encoding in sklearn
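When you first fit the encoder on the training set, save the categories that OneHotEncoder learns.
from sklearn.preprocessing import OneHotEncoder
oh = OneHotEncoder()
encoded = oh.fit_transform(categorical_attribute)
attribute_cats = oh.categories_  # one array of categories per input column
Then pass those categories to the encoder used on the test samples, so the output columns match the training layout.
oh = OneHotEncoder(categories=attribute_cats)
test_encoded = oh.fit_transform(test.iloc[:3])
Any category oh.categories_[0][i] that does not occur in the test set simply gets an all-zero column i. Note that with the default handle_unknown='error', a test value that is missing from attribute_cats still raises an error.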
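For the opposite case, where the test set contains a value the encoder never saw during training, OneHotEncoder has a handle_unknown='ignore' option. Here is a minimal sketch, assuming a single categorical column with made-up colour values (the data is illustrative and not from the original answer):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Illustrative data: one categorical column.
train = np.array([["red"], ["green"], ["blue"]])
test = np.array([["green"], ["purple"]])  # "purple" was never seen in training

# Fit on the training data only; unknown test values are encoded as all zeros.
enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(train)

test_encoded = enc.transform(test).toarray()
print(enc.categories_[0])
print(test_encoded)
```

The encoder is fitted once and only transform is called on the test data, so the column layout stays tied to the training categories.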