How to one-hot-encode from a pandas column containing a list?
We can also use sklearn.preprocessing.MultiLabelBinarizer:
For real-world data we often want a sparse DataFrame, which can save a lot of RAM.
Sparse solution (for Pandas v0.25.0+)
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer(sparse_output=True)

# pop() removes 'Col3' from df; the encoded labels are joined back as sparse columns
df = df.join(
    pd.DataFrame.sparse.from_spmatrix(
        mlb.fit_transform(df.pop('Col3')),
        index=df.index,
        columns=mlb.classes_))
Result:
In [38]: df
Out[38]:
  Col1  Col2  Apple  Banana  Grape  Orange
0    C  33.0      1       1      0       1
1    A   2.5      1       0      1       0
2    B  42.0      0       1      0       0

In [39]: df.dtypes
Out[39]:
Col1                object
Col2               float64
Apple     Sparse[int32, 0]
Banana    Sparse[int32, 0]
Grape     Sparse[int32, 0]
Orange    Sparse[int32, 0]
dtype: object

In [40]: df.memory_usage()
Out[40]:
Index     128
Col1       24
Col2       24
Apple      16    # <--- NOTE!
Banana     16    # <--- NOTE!
Grape       8    # <--- NOTE!
Orange      8    # <--- NOTE!
dtype: int64
Dense solution
mlb = MultiLabelBinarizer()
df = df.join(pd.DataFrame(mlb.fit_transform(df.pop('Col3')),
                          columns=mlb.classes_,
                          index=df.index))
Result:
In [77]: df
Out[77]:
  Col1  Col2  Apple  Banana  Grape  Orange
0    C  33.0      1       1      0       1
1    A   2.5      1       0      1       0
2    B  42.0      0       1      0       0
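The two variants above assume a df that is never defined in the answer. The sketch below reconstructs a small frame from the printed result (the order of labels inside each Col3 list is an assumption) and builds both the sparse and the dense encoding side by side. It reads Col3 without pop() so the same frame can be reused for both variants.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Sample frame reconstructed from the printed result; list order is assumed.
df = pd.DataFrame({
    "Col1": ["C", "A", "B"],
    "Col2": [33.0, 2.5, 42.0],
    "Col3": [["Apple", "Orange", "Banana"], ["Apple", "Grape"], ["Banana"]],
})

# Sparse variant: fit_transform returns a SciPy sparse matrix.
mlb = MultiLabelBinarizer(sparse_output=True)
sparse_part = pd.DataFrame.sparse.from_spmatrix(
    mlb.fit_transform(df["Col3"]),
    index=df.index,
    columns=mlb.classes_,
)
sparse_df = df.drop(columns="Col3").join(sparse_part)

# Dense variant: fit_transform returns a plain NumPy array.
dense_mlb = MultiLabelBinarizer()
dense_part = pd.DataFrame(
    dense_mlb.fit_transform(df["Col3"]),
    index=df.index,
    columns=dense_mlb.classes_,
)
dense_df = df.drop(columns="Col3").join(dense_part)

print(sparse_df.dtypes)
print(dense_df)
```

On a frame this small the memory difference is negligible; the sparse variant is likely to pay off only when there are many distinct labels and most cells are zero.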
How do I perform One Hot Encoding on lists in a pandas column?
Another way is to use apply with the Series constructor:
In [11]: pd.get_dummies(df.messageLabels.apply(lambda x: pd.Series(1, x)) == 1)
Out[11]:
    Good  Other   Bad  Terrible
0   True   True  True     False
1  False  False  True      True

where

In [12]: df.messageLabels.apply(lambda x: pd.Series(1, x))
Out[12]:
   Good  Other  Bad  Terrible
0   1.0    1.0  1.0       NaN
1   NaN    NaN  1.0       1.0

To get your desired output:
In [21]: res = pd.get_dummies(df.messageLabels.apply(lambda x: pd.Series(1, x)) == 1)
In [22]: df[res.columns] = res
In [23]: df
Out[23]:
        messageLabels   Good  Other   Bad  Terrible
0  [Good, Other, Bad]   True   True  True     False
1     [Bad, Terrible]  False  False  True      True
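For reference, here is a self-contained sketch of the same idea. The sample frame is rebuilt from the output above. As a deliberate simplification (not the answer's exact code), notna() stands in for the `== 1` comparison and the get_dummies call, and the result is cast to 0/1 integers instead of booleans.

```python
import pandas as pd

# Sample frame rebuilt from the output shown above.
df = pd.DataFrame({
    "messageLabels": [["Good", "Other", "Bad"], ["Bad", "Terrible"]],
})

# Each list becomes a Series of 1s indexed by its labels; pandas aligns the
# rows into one frame and fills labels missing from a row with NaN.
indicator = df["messageLabels"].apply(lambda labels: pd.Series(1, index=labels))

# A non-missing cell means the label was present in that row.
one_hot = indicator.notna().astype(int)

result = df.join(one_hot)
print(result)
```

This builds one Series per row, so it can be slow on large frames.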
pandas one-hot-encoding column containing a list of feature and each feature can be negative
In pandas this can be done as follows:
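import numpy as np

df1 = df.explode('features')
df1['f1'] = abs(df1.features)        # feature id becomes the column label
df1['f2'] = np.sign(df1.features)    # sign (+1 / -1) becomes the cell value
# pivot() takes keyword arguments only in pandas 2.0+
df1.pivot(index=['user', 'item'], columns='f1', values='f2').fillna(0).reset_index()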
f2 user  item  1  2  137
0     a     1  0  1    0
1     a     2 -1 -1    0
2     b     1 -1  1   -1
3     b     3  1  1   -1
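The input frame is not shown in the answer. The sketch below uses a hypothetical one, chosen so that it reproduces the table above, and walks through the same explode / sign / pivot steps. The explicit casts to int and the ignore_index=True argument are additions for robustness, not part of the original snippet.

```python
import numpy as np
import pandas as pd

# Hypothetical input, chosen to reproduce the table shown above.
df = pd.DataFrame({
    "user": ["a", "a", "b", "b"],
    "item": [1, 2, 1, 3],
    "features": [[2], [-1, -2], [-1, 2, -137], [1, 2, -137]],
})

# One row per (user, item, feature); explode leaves an object column,
# so cast it back to integers before doing arithmetic.
df1 = df.explode("features", ignore_index=True)
df1["features"] = df1["features"].astype(int)

df1["f1"] = df1["features"].abs()       # feature id, used as column label
df1["f2"] = np.sign(df1["features"])    # +1 or -1, used as cell value

out = (
    df1.pivot(index=["user", "item"], columns="f1", values="f2")
       .fillna(0)       # feature absent for this (user, item)
       .astype(int)
       .reset_index()
)
print(out)
```

pivot raises an error if the same feature id appears twice for one (user, item) pair; pivot_table with an aggregation function would be needed in that case.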
How to elegantly one hot encode a series of lists in pandas
MultiLabelBinarizer from the sklearn library is more efficient for these problems. It should be preferred over apply with pd.Series. Here's a demo:
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer
test = pd.Series([['a', 'b', 'e'], ['c', 'a'], ['d'], ['d'], ['e']])
mlb = MultiLabelBinarizer()
res = pd.DataFrame(mlb.fit_transform(test),
                   columns=mlb.classes_,
                   index=test.index)
Result:
   a  b  c  d  e
0  1  1  0  0  1
1  1  0  1  0  0
2  0  0  0  1  0
3  0  0  0  1  0
4  0  0  0  0  1
How to one-hot-encode from a pandas column composed of a list of space containing strings?
Here's one way -
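import ast

import numpy as np
import pandas as pd

# Col3 holds string representations of lists; parse each into a real list
dfC3 = [ast.literal_eval(i) for i in df.Col3]
# ids: integer code per item; U: unique items in order of first appearance
ids, U = pd.factorize(np.concatenate(dfC3))
df_out = pd.DataFrame([np.isin(U, i) for i in dfC3], columns=U).astype(int)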
Sample output:
In [50]: df_out
Out[50]:
   Chocolate cake  Peanuts  Salmon  White wine
0               1        1       0           1
1               0        0       0           0
2               1        0       1           0
If you need it concatenated with the input df, use pd.concat([df, df_out], axis=1).
More performant with array-assignment
We can use array assignment to hopefully get more performance, if needed for large datasets (re-using ids, U from the earlier method) -
lens = list(map(len,dfC3))
mask = np.zeros((len(lens),len(U)), dtype=int)
mask[np.repeat(range(len(lens)),lens), ids] = 1
df_out = pd.DataFrame(mask, columns=U)
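To make the indexing step easier to follow, here is a self-contained sketch of the array-assignment variant. The input frame is hypothetical (the answer does not show one) and has no empty lists. The row index array is named explicitly so the fancy-indexed assignment is easier to read.

```python
import ast

import numpy as np
import pandas as pd

# Hypothetical input: each cell is the string form of a list.
df = pd.DataFrame({
    "Col3": [
        "['Chocolate cake', 'Peanuts', 'White wine']",
        "['Peanuts']",
        "['Chocolate cake', 'Salmon']",
    ]
})

lists = [ast.literal_eval(s) for s in df["Col3"]]

# ids[k] is the column position of the k-th item in the flattened data.
ids, uniques = pd.factorize(np.concatenate(lists))

# rows[k] is the row that the k-th flattened item came from.
lens = [len(items) for items in lists]
rows = np.repeat(np.arange(len(lists)), lens)

# Set every (row, column) pair in a single vectorized assignment.
mask = np.zeros((len(lists), len(uniques)), dtype=int)
mask[rows, ids] = 1

df_out = pd.DataFrame(mask, columns=uniques, index=df.index)
print(df_out)
```

Columns come out in order of first appearance, since pd.factorize does not sort by default.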
One-hot encoding in Python for array values in a DataFrame
IIUC, and if the trace column contains lists, you could do:
(df.drop(columns='trace')
   .join(df['trace']
         .apply('|'.join)
         .str.get_dummies()
         )
 )
or, to modify df in place:
df = (df.join(df.pop('trace')
                .apply('|'.join)
                .str.get_dummies())
      )
Or using explode and pivot_table:
(df.explode('trace')
   .assign(x=1)
   .pivot_table(index=['ID', 'length'], columns='trace', values='x', aggfunc='first')
   .fillna(0)
   .astype(int)
   .reset_index()
 )
Output:
   ID  length  A  B  C  D  E
0   3       4  1  1  1  0  0
1   4       5  1  1  1  1  0
2   5       6  1  1  1  1  1
3  24       4  1  1  1  0  0
4  25       5  1  1  1  1  0
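A runnable sketch of the join-then-get_dummies variant, using a frame rebuilt from the first two rows of the output above. The frame contents are an assumption, since the original input is not shown.

```python
import pandas as pd

# Sample frame rebuilt from the first two rows of the output above.
df = pd.DataFrame({
    "ID": [3, 4],
    "length": [4, 5],
    "trace": [["A", "B", "C"], ["A", "B", "C", "D"]],
})

# Join each list into one '|'-separated string, e.g. "A|B|C";
# '|' is the default separator that str.get_dummies splits on.
joined = df["trace"].apply("|".join)
dummies = joined.str.get_dummies()

out = df.drop(columns="trace").join(dummies)
print(out)
```

This only works when every list element is a string that does not itself contain the separator character.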
One-hot encoding for list variable with customized delimiter and new column names
You can do:
s = [df[col].str.get_dummies().add_prefix(f'{col.lower()}_')
     for col in ['Platforms', 'Technology']]
pd.concat([df[['Rank']]] + s, axis=1)
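The snippet above relies on the default '|' separator of str.get_dummies. The sketch below uses a hypothetical frame whose cells are delimited by ', ' instead, to show where a custom delimiter would be passed. The column values are made up for illustration.

```python
import pandas as pd

# Hypothetical frame; cell values use ', ' as the delimiter.
df = pd.DataFrame({
    "Rank": [1, 2],
    "Platforms": ["Web, Mobile", "Web"],
    "Technology": ["AI, Cloud", "Cloud"],
})

# One block of indicator columns per source column, prefixed with the
# lower-cased source column name.
parts = [
    df[col].str.get_dummies(sep=", ").add_prefix(f"{col.lower()}_")
    for col in ["Platforms", "Technology"]
]

out = pd.concat([df[["Rank"]]] + parts, axis=1)
print(out)
```

The separator must match the data exactly; with sep=',' here, the leading space would end up inside labels such as ' Mobile'.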