How to One-Hot-Encode from a Pandas Column Containing a List

How to one-hot-encode from a pandas column containing a list?

We can also use sklearn.preprocessing.MultiLabelBinarizer:

Real-world data often calls for a sparse DataFrame, which can save a lot of RAM.

Sparse solution (for Pandas v0.25.0+)

import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer(sparse_output=True)

df = df.join(
    pd.DataFrame.sparse.from_spmatrix(
        mlb.fit_transform(df.pop('Col3')),
        index=df.index,
        columns=mlb.classes_))

Result:

In [38]: df
Out[38]:
  Col1  Col2  Apple  Banana  Grape  Orange
0    C  33.0      1       1      0       1
1    A   2.5      1       0      1       0
2    B  42.0      0       1      0       0

In [39]: df.dtypes
Out[39]:
Col1                object
Col2               float64
Apple     Sparse[int32, 0]
Banana    Sparse[int32, 0]
Grape     Sparse[int32, 0]
Orange    Sparse[int32, 0]
dtype: object

In [40]: df.memory_usage()
Out[40]:
Index     128
Col1       24
Col2       24
Apple      16    #  <--- NOTE!
Banana     16    #  <--- NOTE!
Grape       8    #  <--- NOTE!
Orange      8    #  <--- NOTE!
dtype: int64
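
To get a feel for the savings on a larger frame, here is a rough sketch on synthetic data. The label names and the sizes (2000 rows, 100 labels, 2 labels per row) are made up for illustration; the actual numbers will depend on your data and platform.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Synthetic data (assumption): 2000 rows, each holding 2 of 100 possible labels
labels = [f"label_{i:03d}" for i in range(100)]
s = pd.Series([[labels[i % 100], labels[(i + 1) % 100]] for i in range(2000)])

# Dense encoding: every 0 is stored explicitly
mlb = MultiLabelBinarizer()
dense = pd.DataFrame(mlb.fit_transform(s), index=s.index, columns=mlb.classes_)

# Sparse encoding: only the non-zero entries are stored
mlb_sparse = MultiLabelBinarizer(sparse_output=True)
sparse = pd.DataFrame.sparse.from_spmatrix(
    mlb_sparse.fit_transform(s), index=s.index, columns=mlb_sparse.classes_)

dense_bytes = dense.memory_usage(index=False).sum()
sparse_bytes = sparse.memory_usage(index=False).sum()
print(f"dense: {dense_bytes} bytes, sparse: {sparse_bytes} bytes")
```

The fewer labels each row carries relative to the total number of labels, the bigger the gap should be.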

Dense solution

mlb = MultiLabelBinarizer()
df = df.join(pd.DataFrame(mlb.fit_transform(df.pop('Col3')),
                          columns=mlb.classes_,
                          index=df.index))

Result:

In [77]: df
Out[77]:
  Col1  Col2  Apple  Banana  Grape  Orange
0    C  33.0      1       1      0       1
1    A   2.5      1       0      1       0
2    B  42.0      0       1      0       0
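
For a copy-paste reproduction of both variants, here is a small sketch. The input frame is reconstructed from the output shown above (so it is an assumption about the original data), and one_hot_list_column is a hypothetical helper name, not part of pandas or sklearn.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Input reconstructed from the result tables (assumption)
df = pd.DataFrame({
    "Col1": ["C", "A", "B"],
    "Col2": [33.0, 2.5, 42.0],
    "Col3": [["Apple", "Orange", "Banana"], ["Apple", "Grape"], ["Banana"]],
})


def one_hot_list_column(frame, column, sparse=False):
    """Return a copy of `frame` with the list column replaced by 0/1 columns."""
    mlb = MultiLabelBinarizer(sparse_output=sparse)
    encoded = mlb.fit_transform(frame[column])
    if sparse:
        dummies = pd.DataFrame.sparse.from_spmatrix(
            encoded, index=frame.index, columns=mlb.classes_)
    else:
        dummies = pd.DataFrame(encoded, index=frame.index, columns=mlb.classes_)
    # drop() leaves the caller's frame untouched, unlike pop()
    return frame.drop(columns=column).join(dummies)


dense_df = one_hot_list_column(df, "Col3")
sparse_df = one_hot_list_column(df, "Col3", sparse=True)
print(dense_df)
```

Unlike the pop-based one-liners, this version does not modify df, which is handy when you want to try both variants on the same frame.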

How do I perform One Hot Encoding on lists in a pandas column?

Another way is to use apply with the Series constructor:

In [11]: pd.get_dummies(df.messageLabels.apply(lambda x: pd.Series(1, x)) == 1)
Out[11]:
    Good  Other   Bad  Terrible
0   True   True  True     False
1  False  False  True      True

where

In [12]: df.messageLabels.apply(lambda x: pd.Series(1, x))
Out[12]:
   Good  Other  Bad  Terrible
0   1.0    1.0  1.0       NaN
1   NaN    NaN  1.0       1.0

To get your desired output:

In [21]: res = pd.get_dummies(df.messageLabels.apply(lambda x: pd.Series(1, x)) == 1)

In [22]: df[res.columns] = res

In [23]: df
Out[23]:
        messageLabels   Good  Other   Bad  Terrible
0  [Good, Other, Bad]   True   True  True     False
1     [Bad, Terrible]  False  False  True      True
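
If you'd rather have 0/1 integers than True/False, a variant of the same idea could look like this. The input frame is taken from the output above; the fillna/astype step is my substitution for the get_dummies(... == 1) call, not the original answer's code.

```python
import pandas as pd

df = pd.DataFrame({"messageLabels": [["Good", "Other", "Bad"], ["Bad", "Terrible"]]})

# One Series of ones per row, indexed by that row's labels;
# labels missing from a row show up as NaN once the rows are aligned.
indicators = df["messageLabels"].apply(lambda labels: pd.Series(1, index=labels))

# NaN -> 0, then cast to int to get 0/1 instead of True/False
res = indicators.fillna(0).astype(int)

out = df.join(res)
print(out)
```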

pandas one-hot-encoding column containing a list of features, where each feature can be negative

In pandas this can be done as follows:

import numpy as np

df1 = df.explode('features')
df1['f1'] = abs(df1.features)       # feature id
df1['f2'] = np.sign(df1.features)   # +1 or -1
df1.pivot(index=['user', 'item'], columns='f1', values='f2').fillna(0).reset_index()

f2 user  item  1  2  137
0     a     1  0  1    0
1     a     2 -1 -1    0
2     b     1 -1  1   -1
3     b     3  1  1   -1
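
A self-contained sketch of the same steps, in case you want to run it end to end. The input frame is inferred from the output above (an assumption), and the column names feature_id and sign are just illustrative.

```python
import numpy as np
import pandas as pd

# Input inferred from the result table (assumption)
df = pd.DataFrame({
    "user": ["a", "a", "b", "b"],
    "item": [1, 2, 1, 3],
    "features": [[2], [-1, -2], [-1, 2, -137], [1, 2, -137]],
})

long = df.explode("features")
long["features"] = long["features"].astype(int)  # explode leaves object dtype
long["feature_id"] = long["features"].abs()      # which feature
long["sign"] = np.sign(long["features"])         # +1 or -1

wide = (long.pivot(index=["user", "item"], columns="feature_id", values="sign")
            .fillna(0)       # feature absent for this (user, item)
            .astype(int)
            .reset_index())
print(wide)
```

Note that pivot raises if the same feature id appears twice for one (user, item) pair, so this assumes each feature occurs at most once per list.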

How to elegantly one hot encode a series of lists in pandas

MultiLabelBinarizer from the sklearn library is more efficient for these problems and should be preferred over apply with pd.Series. Here's a demo:

import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

test = pd.Series([['a', 'b', 'e'], ['c', 'a'], ['d'], ['d'], ['e']])

mlb = MultiLabelBinarizer()

res = pd.DataFrame(mlb.fit_transform(test),
                   columns=mlb.classes_,
                   index=test.index)

Result

   a  b  c  d  e
0  1  1  0  0  1
1  1  0  1  0  0
2  0  0  0  1  0
3  0  0  0  1  0
4  0  0  0  0  1

How to one-hot-encode from a pandas column composed of a list of space containing strings?

Here's one way -

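import ast
import numpy as np
import pandas as pd

dfC3 = [ast.literal_eval(i) for i in df.Col3]   # parse "['a', 'b']" strings into lists
ids, U = pd.factorize(np.concatenate(dfC3))     # integer codes and unique labels
df_out = pd.DataFrame([np.isin(U, i) for i in dfC3], columns=U).astype(int)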

Sample output -

In [50]: df_out
Out[50]: 
   Chocolate cake  Peanuts  Salmon  White wine
0               1        1       0           1
1               0        0       0           0
2               1        0       1           0

If you need it to be concatenated with input df, use pd.concat([df,df_out],axis=1).

More performant with array-assignment

We can use array assignment to potentially get better performance on large datasets (reusing ids, U from the earlier method) -

lens = list(map(len,dfC3))
mask = np.zeros((len(lens),len(U)), dtype=int)
mask[np.repeat(range(len(lens)),lens), ids] = 1
df_out = pd.DataFrame(mask, columns=U)
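
Here is a runnable sketch of the array-assignment version from start to finish. The sample strings are made up for illustration (the original input isn't shown), and it assumes every row parses to a non-empty list of strings.

```python
import ast
import numpy as np
import pandas as pd

# Hypothetical input: each cell is the string representation of a list
df = pd.DataFrame({"Col3": [
    "['Chocolate cake', 'White wine', 'Peanuts']",
    "['Salmon']",
    "['Salmon', 'Chocolate cake']",
]})

parsed = [ast.literal_eval(s) for s in df["Col3"]]

# ids[k] is the column index of the k-th label in the flattened data
ids, uniques = pd.factorize(np.concatenate(parsed))

# Row index for each flattened label: row r is repeated len(parsed[r]) times
lens = [len(p) for p in parsed]
rows = np.repeat(np.arange(len(parsed)), lens)

mask = np.zeros((len(parsed), len(uniques)), dtype=int)
mask[rows, ids] = 1   # one vectorized assignment instead of a Python loop per row

df_out = pd.DataFrame(mask, columns=uniques, index=df.index)
print(df_out)
```

Columns come out in order of first appearance, since pd.factorize does not sort by default.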

One-hot encoding in Python for array values in a DataFrame

IIUC, and if the trace column contains lists, you could do:

(df.drop(columns='trace')
   .join(df['trace']
         .apply('|'.join)
         .str.get_dummies()
        )
 )

Or, to modify df in place:

df = (df.join(df.pop('trace')
              .apply('|'.join)
              .str.get_dummies())
      )

Or using explode and pivot_table:

(df.explode('trace')
   .assign(x=1)
   .pivot_table(index=['ID', 'length'], columns='trace', values='x', aggfunc='first')
   .fillna(0).astype(int)   # the downcast argument of fillna is deprecated in recent pandas
   .reset_index()
 )

Output:

   ID  length  A  B  C  D  E
0   3       4  1  1  1  0  0
1   4       5  1  1  1  1  0
2   5       6  1  1  1  1  1
3  24       4  1  1  1  0  0
4  25       5  1  1  1  1  0
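
To compare the two approaches side by side, here is a small sketch. The input frame is a guess based on the first three rows of the output above, so treat it as an assumption.

```python
import pandas as pd

# Input guessed from the output table (assumption)
df = pd.DataFrame({
    "ID": [3, 4, 5],
    "length": [4, 5, 6],
    "trace": [["A", "B", "C"], ["A", "B", "C", "D"], ["A", "B", "C", "D", "E"]],
})

# Approach 1: join each list into "A|B|C" and let str.get_dummies split on "|"
via_join = df.drop(columns="trace").join(
    df["trace"].apply("|".join).str.get_dummies())

# Approach 2: one row per list element, then pivot back to wide form
via_pivot = (df.explode("trace")
               .assign(x=1)
               .pivot_table(index=["ID", "length"], columns="trace",
                            values="x", aggfunc="first")
               .fillna(0).astype(int)
               .reset_index())

print(via_join)
print(via_pivot)
```

One thing to watch with the first approach: it breaks if a label itself contains "|", since that character is used as the separator. The explode version doesn't have that limitation.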

One-hot encoding for list variable with customized delimiter and new column names

You can do:

s = [df[col].str.get_dummies().add_prefix(f'{col.lower()}_') 
        for col in ['Platforms', 'Technology']]

pd.concat([df[['Rank']]] + s, axis=1)
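
The snippet above relies on the default separator of str.get_dummies, which is "|". If your columns use a different delimiter, you can pass it via sep. A minimal sketch, with a made-up frame and ";" as a hypothetical delimiter:

```python
import pandas as pd

# Hypothetical data: list-like values stored as ";"-delimited strings
df = pd.DataFrame({
    "Rank": [1, 2],
    "Platforms": ["Web;iOS", "Android"],
    "Technology": ["Python;SQL", "Python"],
})

# One dummy frame per column, each prefixed with the lower-cased column name
parts = [df[col].str.get_dummies(sep=";").add_prefix(f"{col.lower()}_")
         for col in ["Platforms", "Technology"]]

out = pd.concat([df[["Rank"]]] + parts, axis=1)
print(out)
```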
