How to One-Hot Encode Variable-Length Features

How do you one-hot encode variable-length features?

You can use MultiLabelBinarizer from scikit-learn, which is designed specifically for this.

Code for your example:

from sklearn.preprocessing import MultiLabelBinarizer

features = [
    ['f1', 'f2', 'f3'],
    ['f2', 'f4', 'f5', 'f6'],
    ['f1', 'f2']
]

mlb = MultiLabelBinarizer()
new_features = mlb.fit_transform(features)

Output:

array([[1, 1, 1, 0, 0, 0],
       [0, 1, 0, 1, 1, 1],
       [1, 1, 0, 0, 0, 0]])

This can also be used in a pipeline, along with other feature_selection utilities.
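For instance, here is a minimal sketch of pairing it with a feature_selection step; the target labels y and k=3 are made up purely for illustration.

from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.feature_selection import SelectKBest, chi2

features = [
    ['f1', 'f2', 'f3'],
    ['f2', 'f4', 'f5', 'f6'],
    ['f1', 'f2']
]
y = [0, 1, 0]  # hypothetical target labels, one per sample

# binarize the variable-length feature lists, then keep the 3 columns
# most associated with y according to the chi-squared test
X = MultiLabelBinarizer().fit_transform(features)
X_selected = SelectKBest(chi2, k=3).fit_transform(X, y)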

Multi-Feature One-Hot-Encoder with a varying number of feature instances

For now I solved it by reframing it as a CountVectorizer problem, thanks to David Maspis' answer on the Data Science Stack Exchange.
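In case it helps someone, a minimal sketch of that idea (reusing the feature lists from the first example): give CountVectorizer a pass-through analyzer so it accepts the already-tokenized lists, and set binary=True so counts become 0/1 indicators.

from sklearn.feature_extraction.text import CountVectorizer

features = [
    ['f1', 'f2', 'f3'],
    ['f2', 'f4', 'f5', 'f6'],
    ['f1', 'f2']
]

# the analyzer receives each "document" (already a list of tokens) and returns it unchanged
cv = CountVectorizer(analyzer=lambda tokens: tokens, binary=True)
encoded = cv.fit_transform(features)   # sparse 0/1 matrix
print(cv.get_feature_names_out())      # column names; requires scikit-learn >= 1.0
print(encoded.toarray())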

How to elegantly one hot encode a series of lists in pandas

MultiLabelBinarizer from scikit-learn is the more efficient choice for this kind of problem, and it works directly on a pd.Series of lists. Here's a demo:

import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

test = pd.Series([['a', 'b', 'e'], ['c', 'a'], ['d'], ['d'], ['e']])

mlb = MultiLabelBinarizer()

res = pd.DataFrame(mlb.fit_transform(test),
                   columns=mlb.classes_,
                   index=test.index)

Result

   a  b  c  d  e
0  1  1  0  0  1
1  1  0  1  0  0
2  0  0  0  1  0
3  0  0  0  1  0
4  0  0  0  0  1
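If the list column lives inside a larger DataFrame, the encoded columns can be joined back onto it. A minimal sketch, assuming a hypothetical df with a 'tags' column of lists:

import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

df = pd.DataFrame({'id': [1, 2, 3],
                   'tags': [['a', 'b'], ['b'], ['a', 'c']]})

mlb = MultiLabelBinarizer()
encoded = pd.DataFrame(mlb.fit_transform(df['tags']),
                       columns=mlb.classes_,
                       index=df.index)

# replace the original list column with its one-hot columns
df = df.drop(columns='tags').join(encoded)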

How to handle unseen categorical variables with one hot encoding in sklearn

When you first fit the encoder on the training set, save the categories that OneHotEncoder learns.

oh = OneHotEncoder()
encoded = oh.fit_transform(categorical_attribute)  # fit on the training data
attribute_cats = oh.categories_  # one array of learned categories per column

Then you can use those categories when transforming the test samples.

oh = OneHotEncoder(categories=attribute_cats)
test_encoded = oh.fit_transform(test.iloc[:3])  # categories are fixed, so columns match the training encoding

Categories that never appear in the test set still get their own columns in the encoded output; those columns are simply all zeros.
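A self-contained sketch of the whole flow, with made-up train/test frames; note that OneHotEncoder(handle_unknown='ignore') covers the reverse situation, where the test data contains categories the training data never had.

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({'color': ['red', 'blue', 'green']})
test = pd.DataFrame({'color': ['blue', 'red']})  # 'green' never appears here

oh = OneHotEncoder()
oh.fit(train[['color']])
cats = oh.categories_  # one array of category values per encoded column

# fix the categories so the test encoding has the same column layout as training
oh_test = OneHotEncoder(categories=cats)
test_encoded = oh_test.fit_transform(test[['color']]).toarray()
# the 'green' column still exists in test_encoded; it is simply all zeros here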


