How to One-Hot-Encode from a Pandas Column Containing a List

How to one-hot-encode from a pandas column containing a list?

We can also use sklearn.preprocessing.MultiLabelBinarizer:

Real-world data is often large and mostly zeros after encoding, so a sparse DataFrame can save a lot of RAM.

Sparse solution (for Pandas v0.25.0+)

import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer(sparse_output=True)

# pop() removes 'Col3' from df; the encoded columns are joined back on the index
df = df.join(
    pd.DataFrame.sparse.from_spmatrix(
        mlb.fit_transform(df.pop('Col3')),
        index=df.index,
        columns=mlb.classes_))

Result:

In [38]: df
Out[38]:
  Col1  Col2  Apple  Banana  Grape  Orange
0    C  33.0      1       1      0       1
1    A   2.5      1       0      1       0
2    B  42.0      0       1      0       0

In [39]: df.dtypes
Out[39]:
Col1                object
Col2               float64
Apple     Sparse[int32, 0]
Banana    Sparse[int32, 0]
Grape     Sparse[int32, 0]
Orange    Sparse[int32, 0]
dtype: object

In [40]: df.memory_usage()
Out[40]:
Index     128
Col1       24
Col2       24
Apple      16    # <--- NOTE!
Banana     16    # <--- NOTE!
Grape       8    # <--- NOTE!
Orange      8    # <--- NOTE!
dtype: int64
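
To see the memory saving on something larger than three rows, here is a minimal sketch. The synthetic `labels` Series (1000 rows, 50 distinct tags, two tags per row) is an assumption made up for illustration; it is not data from the answer above.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical data: 1000 rows, each holding two of 50 possible tags.
labels = pd.Series(
    [[f"tag{i % 50}", f"tag{(i * 7 + 3) % 50}"] for i in range(1000)]
)

# Dense encoding: every 0 and 1 is stored explicitly.
mlb_dense = MultiLabelBinarizer()
dense = pd.DataFrame(
    mlb_dense.fit_transform(labels),
    index=labels.index,
    columns=mlb_dense.classes_,
)

# Sparse encoding: only the non-zero entries are stored.
mlb_sparse = MultiLabelBinarizer(sparse_output=True)
sparse = pd.DataFrame.sparse.from_spmatrix(
    mlb_sparse.fit_transform(labels),
    index=labels.index,
    columns=mlb_sparse.classes_,
)

dense_bytes = int(dense.memory_usage(index=False).sum())
sparse_bytes = int(sparse.memory_usage(index=False).sum())
print(f"dense: {dense_bytes} bytes, sparse: {sparse_bytes} bytes")
```

The exact byte counts depend on the platform's integer width, so only the relative difference is meaningful.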


Dense solution

mlb = MultiLabelBinarizer()
df = df.join(pd.DataFrame(mlb.fit_transform(df.pop('Col3')),
                          columns=mlb.classes_,
                          index=df.index))

Result:

In [77]: df
Out[77]:
  Col1  Col2  Apple  Banana  Grape  Orange
0    C  33.0      1       1      0       1
1    A   2.5      1       0      1       0
2    B  42.0      0       1      0       0
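
The input frame is not shown above, so the following self-contained sketch rebuilds one that is consistent with the printed result. The contents of `Col3` are an assumption inferred from that output.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Assumed input, reconstructed from the result table above.
df = pd.DataFrame({
    "Col1": ["C", "A", "B"],
    "Col2": [33.0, 2.5, 42.0],
    "Col3": [["Apple", "Orange", "Banana"], ["Apple", "Grape"], ["Banana"]],
})

mlb = MultiLabelBinarizer()
# pop() drops the list column; classes_ holds the sorted label names.
encoded = df.join(
    pd.DataFrame(
        mlb.fit_transform(df.pop("Col3")),
        columns=mlb.classes_,
        index=df.index,
    )
)
print(encoded)
```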


How do I perform One Hot Encoding on lists in a pandas column?

Another way is to use the apply and the Series constructor:

In [11]: pd.get_dummies(df.messageLabels.apply(lambda x: pd.Series(1, x)) == 1)
Out[11]:
    Good  Other   Bad  Terrible
0   True   True  True     False
1  False  False  True      True

where

In [12]: df.messageLabels.apply(lambda x: pd.Series(1, x))
Out[12]:
   Good  Other  Bad  Terrible
0   1.0    1.0  1.0       NaN
1   NaN    NaN  1.0       1.0

To get your desired output:

In [21]: res = pd.get_dummies(df.messageLabels.apply(lambda x: pd.Series(1, x)) == 1)

In [22]: df[res.columns] = res

In [23]: df
Out[23]:
        messageLabels   Good  Other   Bad  Terrible
0  [Good, Other, Bad]   True   True  True     False
1     [Bad, Terrible]  False  False  True      True
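
The same "mark present labels, leave the rest missing" idea can be sketched without a per-row apply, by building one dict per row and testing for non-missing cells. This is a variation, not the method above; the sample frame is an assumption reconstructed from the printed output.

```python
import pandas as pd

# Assumed input, reconstructed from the output above.
df = pd.DataFrame(
    {"messageLabels": [["Good", "Other", "Bad"], ["Bad", "Terrible"]]}
)

# One dict per row: present labels map to 1, absent labels become NaN.
indicators = pd.DataFrame(
    [dict.fromkeys(labels, 1) for labels in df["messageLabels"]],
    index=df.index,
)

# notna() turns "present / missing" into True / False.
flags = indicators.notna()
encoded = df.join(flags)
print(encoded)
```

Calling `flags.astype(int)` instead would give 0/1 columns if integers are preferred over booleans.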

pandas one-hot encoding of a column containing a list of features, where each feature can be negative

In pandas this can be done as follows:

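import numpy as np

df1 = df.explode('features')
df1['f1'] = abs(df1.features)      # feature id becomes the column label
df1['f2'] = np.sign(df1.features)  # sign (-1 or 1) becomes the cell value
# pandas 2.0+ requires keyword arguments for pivot
df1.pivot(index=['user', 'item'], columns='f1', values='f2').fillna(0).reset_index()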

f2 user  item  1  2  137
0     a     1  0  1    0
1     a     2 -1 -1    0
2     b     1 -1  1   -1
3     b     3  1  1   -1
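
A self-contained sketch of the signed encoding follows. The input frame is an assumption inferred from the output above (the original input is not shown), and the explicit `astype(int)` calls are additions to keep the dtypes numeric.

```python
import numpy as np
import pandas as pd

# Assumed input: each feature id may appear with a positive or negative sign.
df = pd.DataFrame({
    "user": ["a", "a", "b", "b"],
    "item": [1, 2, 1, 3],
    "features": [[2], [-1, -2], [-1, 2, -137], [1, 2, -137]],
})

long = df.explode("features")
long["features"] = long["features"].astype(int)  # explode leaves object dtype
long["f1"] = long["features"].abs()              # feature id -> column label
long["f2"] = np.sign(long["features"])           # sign -> cell value

signed = (
    long.pivot(index=["user", "item"], columns="f1", values="f2")
        .fillna(0)       # features absent from a row become 0
        .astype(int)
        .reset_index()
)
print(signed)
```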

How to elegantly one hot encode a series of lists in pandas

MultiLabelBinarizer from the sklearn library is more efficient for these problems. Prefer it over apply with pd.Series. Here's a demo:

import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

test = pd.Series([['a', 'b', 'e'], ['c', 'a'], ['d'], ['d'], ['e']])

mlb = MultiLabelBinarizer()

res = pd.DataFrame(mlb.fit_transform(test),
                   columns=mlb.classes_,
                   index=test.index)

Result:

   a  b  c  d  e
0  1  1  0  0  1
1  1  0  1  0  0
2  0  0  0  1  0
3  0  0  0  1  0
4  0  0  0  0  1
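
If scikit-learn is not available, a pandas-only sketch can reproduce the same table by exploding the lists, one-hot encoding the long Series, and collapsing back to one row per original index. This is an alternative technique, not part of the answer above, and it makes no claim about relative speed.

```python
import pandas as pd

test = pd.Series([['a', 'b', 'e'], ['c', 'a'], ['d'], ['d'], ['e']])

res = (
    pd.get_dummies(test.explode())  # one row per (original index, label)
      .groupby(level=0)             # regroup by the original row index
      .max()                        # a label is set if any exploded row has it
      .astype(int)
)
print(res)
```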

How to one-hot-encode from a pandas column composed of lists of strings that contain spaces?

Here's one way -

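import ast
import numpy as np
import pandas as pd

dfC3 = [ast.literal_eval(i) for i in df.Col3]  # parse each string into a list
ids, U = pd.factorize(np.concatenate(dfC3))    # ids: label codes, U: unique labels
df_out = pd.DataFrame([np.isin(U, i) for i in dfC3], columns=U).astype(int)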

Sample output:

In [50]: df_out
Out[50]:
   Chocolate cake  Peanuts  Salmon  White wine
0               1        1       0           1
1               0        0       0           0
2               1        0       1           0

If you need it to be concatenated with input df, use pd.concat([df,df_out],axis=1).


More performant with array-assignment

We can use array assignment, which may be faster on large datasets (reusing ids and U from the earlier method):

lens = list(map(len,dfC3))
mask = np.zeros((len(lens),len(U)), dtype=int)
mask[np.repeat(range(len(lens)),lens), ids] = 1
df_out = pd.DataFrame(mask, columns=U)
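
As a sanity check, the two variants can be run end to end on a small frame and compared. The sample `Col3` strings below are hypothetical (the original input is not shown), and every row is non-empty to keep the concatenation simple.

```python
import ast

import numpy as np
import pandas as pd

# Hypothetical input: each cell is the string representation of a list.
df = pd.DataFrame({
    "Col3": [
        "['Chocolate cake', 'White wine', 'Peanuts']",
        "['Salmon']",
        "['Salmon', 'Chocolate cake']",
    ]
})

rows = [ast.literal_eval(cell) for cell in df["Col3"]]
ids, uniques = pd.factorize(np.concatenate(rows))

# Variant 1: membership test per row.
by_isin = pd.DataFrame(
    [np.isin(uniques, row) for row in rows], columns=uniques
).astype(int)

# Variant 2: write 1s directly at (row number, label code) positions.
lens = [len(row) for row in rows]
mask = np.zeros((len(rows), len(uniques)), dtype=int)
mask[np.repeat(np.arange(len(rows)), lens), ids] = 1
by_mask = pd.DataFrame(mask, columns=uniques)

print(by_mask)
```

Columns appear in order of first occurrence, because `pd.factorize` does not sort its unique values by default.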

One-hot encoding in Python for array values in a DataFrame

If I understand correctly, and if the target column (trace) contains lists, you could do:

(df.drop(columns='trace')  # the positional axis argument was removed in pandas 2.0
   .join(df['trace']
         .apply('|'.join)
         .str.get_dummies()
        )
)

or for in place modification of df:

df = (df.join(df.pop('trace')
                .apply('|'.join)
                .str.get_dummies())
     )

Or using explode and pivot_table:

(df.explode('trace')
   .assign(x=1)
   .pivot_table(index=['ID', 'length'], columns='trace', values='x', aggfunc='first')
   .fillna(0)
   .astype(int)  # replaces fillna(downcast='infer'), which is deprecated in recent pandas
   .reset_index()
)

Output:

   ID  length  A  B  C  D  E
0   3       4  1  1  1  0  0
1   4       5  1  1  1  1  0
2   5       6  1  1  1  1  1
3  24       4  1  1  1  0  0
4  25       5  1  1  1  1  0
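
Both approaches can be compared on a small frame. The input below is an assumption reconstructed from the first three rows of the output above. Note that the join-based variant relies on '|' as a separator, so it would split any label that itself contains '|'.

```python
import pandas as pd

# Assumed input, reconstructed from the first three output rows.
df = pd.DataFrame({
    "ID": [3, 4, 5],
    "length": [4, 5, 6],
    "trace": [list("ABC"), list("ABCD"), list("ABCDE")],
})

# Variant 1: join each list into 'A|B|C' and let str.get_dummies split it.
joined = df.drop(columns="trace").join(
    df["trace"].apply("|".join).str.get_dummies()
)

# Variant 2: one row per label, then pivot the labels back into columns.
pivoted = (
    df.explode("trace")
      .assign(x=1)
      .pivot_table(index=["ID", "length"], columns="trace",
                   values="x", aggfunc="first")
      .fillna(0)
      .astype(int)
      .reset_index()
)

print(joined)
```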

One-hot encoding for list variable with customized delimiter and new column names

You can do:

s = [df[col].str.get_dummies().add_prefix(f'{col.lower()}_')
     for col in ['Platforms', 'Technology']]

pd.concat([df[['Rank']]] + s, axis=1)
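
The snippet above uses the default '|' separator. If the values are delimited by something else, `Series.str.get_dummies` accepts a `sep` argument. Here is a minimal sketch that assumes comma-separated values; the sample data is hypothetical, since the original frame is not shown.

```python
import pandas as pd

# Hypothetical input with comma-delimited values.
df = pd.DataFrame({
    "Rank": [1, 2, 3],
    "Platforms": ["PC,Mac", "PC", "Mac,Linux"],
    "Technology": ["VR,AR", "AR", "VR"],
})

# One block of dummy columns per source column, prefixed with its lowercase name.
dummies = [
    df[col].str.get_dummies(sep=",").add_prefix(f"{col.lower()}_")
    for col in ["Platforms", "Technology"]
]

encoded = pd.concat([df[["Rank"]]] + dummies, axis=1)
print(encoded)
```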

