Reverse a Get_Dummies Encoding in Pandas

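Use set_index followed by stack. Masking with df[df==1] turns every cell that is not 1 into NaN, and stack drops NaN by default (if your pandas version keeps them, add .dropna() after stack):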

df.set_index('ID',inplace=True)

df[df==1].stack().reset_index().drop(0, axis=1)
Out[363]:
     ID level_1
0  1002       2
1  1002       4
2  1004       1
3  1004       2
4  1005       5
5  1006       6
6  1007       1
7  1007       3
8  1009       3
9  1009       7
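
A self-contained version of this approach might look like the sketch below. The sample frame and the column labels A, B and C are made up for illustration, and the explicit dropna() is an assumption added so the result does not depend on whether the installed pandas version's stack drops NaN on its own.

```python
import pandas as pd

# Hypothetical one-hot data: each ID may have several flags set.
df = pd.DataFrame(
    {
        "ID": [1002, 1004, 1005],
        "A": [0, 1, 0],
        "B": [1, 1, 0],
        "C": [0, 0, 1],
    }
)

wide = df.set_index("ID")

# Mask everything that is not 1, reshape to long form, and drop the masked cells.
long = (
    wide[wide == 1]
    .stack()
    .dropna()
    .rename_axis(["ID", "label"])   # name both index levels explicitly
    .reset_index(name="flag")       # the value column only holds 1.0
    .drop(columns="flag")
)
print(long)
```

Each row of the result pairs an ID with one label that was set to 1, so an ID with several active dummies appears several times.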

Reverse get_dummies()

You can first move the non-dummy columns into the index with DataFrame.set_index, then apply the undummify function from the linked answer:

#https://stackoverflow.com/a/62085741/2901002
df = undummify(df.set_index(['score1','score2'])).reset_index()
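
The undummify helper comes from the linked answer and is not defined here. As a rough stand-in, the hypothetical undummify_sketch below (my own naming and logic, not the code from that answer) collapses columns named prefix_value back into one column per prefix and leaves rows with no active dummy as missing. It assumes every column passed in contains the separator; the sample data is invented.

```python
import pandas as pd


def undummify_sketch(dummies, prefix_sep="_"):
    """Collapse columns named '<prefix><sep><value>' into one column per prefix."""
    prefixes = pd.Index([c.split(prefix_sep, 1)[0] for c in dummies.columns])
    decoded = {}
    for prefix in prefixes.unique():
        block = dummies.loc[:, prefixes == prefix].copy()
        # Keep only the value part of each column name.
        block.columns = [c.split(prefix_sep, 1)[1] for c in block.columns]
        # idxmax returns the first column holding the maximum, so rows that are
        # all zeros are masked afterwards instead of receiving a bogus label.
        decoded[prefix] = block.idxmax(axis=1).where(block.sum(axis=1) > 0)
    return pd.DataFrame(decoded)


# Hypothetical input: two score columns plus dummies for category and country.
df = pd.DataFrame(
    {
        "score1": [0.89, 0.55],
        "score2": [0.45, 0.54],
        "category_leader": [0, 1],
        "country_AU": [1, 0],
        "country_CN": [0, 1],
    }
)

restored = undummify_sketch(df.set_index(["score1", "score2"])).reset_index()
print(restored)
```

Moving score1 and score2 into the index first keeps them out of the decoding step, and reset_index brings them back as ordinary columns.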

Alternatively, reshape with DataFrame.melt, filter the rows with boolean indexing, split the column names with Series.str.split, and finally pivot with DataFrame.pivot:

df1 = df.melt(['score1','score2'])
# keep only the active dummies; copy to avoid SettingWithCopyWarning below
df1 = df1[df1['value'].eq(1)].copy()
# 'category_leader' -> a='category', b='leader'
df1[['a','b']] = df1.pop('variable').str.split('_', expand=True)
df1 = df1.pivot(index=['score1','score2'], columns='a', values='b').reset_index()
print(df1)
a  score1  score2 category country
0    0.55    0.54   leader      CN
1    0.89    0.45      NaN      AU

How to reverse dummy variables from a pandas dataframe

We can use wide_to_long, then select the rows that are not equal to zero, i.e.:

ndf = pd.wide_to_long(df, stubnames='T_', i='id',j='T')

        T_
id  T
id1 30   0
id2 30   1
id1 40   1
id2 40   0

not_dummy = ndf[ndf['T_'].ne(0)].reset_index().drop(columns='T_')

    id   T
0  id2  30
1  id1  40
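
For reference, here is a runnable sketch of the same steps. The input frame is an assumption reconstructed from the output shown above (two ids and the dummy columns T_30 and T_40).

```python
import pandas as pd

# Assumed input: one row per id, one dummy column per T value.
df = pd.DataFrame({"id": ["id1", "id2"], "T_30": [0, 1], "T_40": [1, 0]})

# Reshape T_30 / T_40 into rows keyed by (id, T); the value column is named 'T_'.
ndf = pd.wide_to_long(df, stubnames="T_", i="id", j="T")

# Keep the rows whose dummy is set, then discard the dummy column itself.
not_dummy = ndf[ndf["T_"].ne(0)].reset_index().drop(columns="T_")
print(not_dummy)
```

The keyword form drop(columns=...) is used because passing the axis positionally, as in drop('T_', 1), is rejected by recent pandas versions.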

Update based on your edit:

ndf = pd.wide_to_long(df.reset_index(), stubnames='T_',i='index',j='T')

not_dummy = ndf[ndf['T_'].ne(0)].reset_index(level='T').drop(columns='T_')

        T
index
1      30
0      40

Pandas, reverse one hot encoding

IIUC, you can use DataFrame.idxmax along axis=1. If necessary, strip the dummy prefix with str.replace:

X_test[filter_col].idxmax(axis=1).str.replace('mycol_', '')
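
A minimal runnable sketch of this idea follows. The frame, the mycol_ column names and the way filter_col is built are assumptions for illustration, since the question's data is not shown here.

```python
import pandas as pd

# Hypothetical test set: three dummy columns plus an unrelated feature.
X_test = pd.DataFrame(
    {
        "mycol_red": [1, 0, 0],
        "mycol_green": [0, 1, 0],
        "mycol_blue": [0, 0, 1],
        "other": [5, 6, 7],
    }
)

# Select only the dummy columns that belong to the encoded variable.
filter_col = [c for c in X_test.columns if c.startswith("mycol_")]

# idxmax returns the name of the column holding the 1 in each row;
# regex=False makes the prefix removal a plain string replacement.
decoded = X_test[filter_col].idxmax(axis=1).str.replace("mycol_", "", regex=False)
print(decoded.tolist())  # ['red', 'green', 'blue']
```

Note that idxmax returns the first column for a row that is all zeros, so this only decodes correctly when every row has exactly one active dummy.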

Pandas Get Dummy Reversal For Prediction

You could use reindex so that the resulting dataframe has the same columns as the second one:

Dataframe4 = pd.get_dummies(Dataframe3, columns=['feature_x', 'feature_y']
).reindex(columns=Dataframe2.columns).fillna(0).astype('int')
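
To make the effect concrete, the sketch below uses two small made-up frames: train stands in for the data the model was fitted on and new for the rows to predict. Both names and their contents are assumptions, not the question's Dataframe2 and Dataframe3.

```python
import pandas as pd

# Hypothetical training data and a single new row to predict on.
train = pd.DataFrame({"feature_x": ["a", "b", "c"], "feature_y": ["u", "v", "u"]})
new = pd.DataFrame({"feature_x": ["a"], "feature_y": ["v"]})

cols = ["feature_x", "feature_y"]
train_dummies = pd.get_dummies(train, columns=cols)

# Encoding the new row alone yields only the categories it contains, so
# reindex adds the missing dummy columns and fills them with 0.
new_dummies = (
    pd.get_dummies(new, columns=cols)
    .reindex(columns=train_dummies.columns, fill_value=0)
    .astype(int)
)
print(new_dummies)
```

Passing fill_value=0 to reindex has the same effect here as the fillna(0) call above, without creating NaN in between.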

Reversing 'one-hot' encoding in Pandas

I would use apply to decode the columns:

In [2]: animals = pd.DataFrame({"monkey":[0,1,0,0,0],"rabbit":[1,0,0,0,0],"fox":[0,0,1,0,0]})

In [3]: def get_animal(row):
   ...:     for c in animals.columns:
   ...:         if row[c]==1:
   ...:             return c

In [4]: animals.apply(get_animal, axis=1)
Out[4]:
0 rabbit
1 monkey
2 fox
3 None
4 None
dtype: object
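
If the row-wise apply becomes slow on a large frame, a vectorized variant is possible. The sketch below is one option under the assumption that each row has at most one active column; it combines idxmax with a mask for the all-zero rows, which come back as missing values rather than None.

```python
import pandas as pd

animals = pd.DataFrame(
    {
        "monkey": [0, 1, 0, 0, 0],
        "rabbit": [1, 0, 0, 0, 0],
        "fox": [0, 0, 1, 0, 0],
    }
)

# idxmax alone would label all-zero rows with the first column ('monkey'),
# so keep its result only where at least one column is set.
decoded = animals.idxmax(axis=1).where(animals.any(axis=1))
print(decoded)
```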

Python: how to get back the actual values after using one-hot encoding / pd.get_dummies

You can make use of the inverse_transform method of sklearn.preprocessing.OneHotEncoder to do it. I have illustrated it with an example below:

from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder(handle_unknown='ignore')
X = [['Male'], ['Female'], ['Female']]
enc.fit(X)
enc.categories_

[array(['Female', 'Male'], dtype=object)]

enc.transform([['Female'], ['Male']]).toarray()

array([[1., 0.],
       [0., 1.]])

enc.inverse_transform([[0, 1], [1,0], [0, 1]])

array([['Male'],
       ['Female'],
       ['Male']], dtype=object)

To get the category-to-key dictionary you could do this:

A = {}
for i in enc.categories_[0]:
    # map each category to its one-hot row, e.g. 'Female' -> [[1., 0.]]
    A[i] = enc.transform([[i]]).toarray()

There may be a better way to do this, though.

Reconstruct a categorical variable from dummies in pandas

In [46]: s = Series(list('aaabbbccddefgh')).astype('category')

In [47]: s
Out[47]:
0 a
1 a
2 a
3 b
4 b
5 b
6 c
7 c
8 d
9 d
10 e
11 f
12 g
13 h
dtype: category
Categories (8, object): [a < b < c < d < e < f < g < h]

In [48]: df = pd.get_dummies(s)

In [49]: df
Out[49]:
    a  b  c  d  e  f  g  h
0   1  0  0  0  0  0  0  0
1   1  0  0  0  0  0  0  0
2   1  0  0  0  0  0  0  0
3   0  1  0  0  0  0  0  0
4   0  1  0  0  0  0  0  0
5   0  1  0  0  0  0  0  0
6   0  0  1  0  0  0  0  0
7   0  0  1  0  0  0  0  0
8   0  0  0  1  0  0  0  0
9   0  0  0  1  0  0  0  0
10  0  0  0  0  1  0  0  0
11  0  0  0  0  0  1  0  0
12  0  0  0  0  0  0  1  0
13  0  0  0  0  0  0  0  1

In [50]: x = df.stack()

# I don't think you actually need to specify ALL of the categories here, as by definition
# they are in the dummy matrix to start (and hence the column index)
In [51]: Series(pd.Categorical(x[x!=0].index.get_level_values(1)))
Out[51]:
0 a
1 a
2 a
3 b
4 b
5 b
6 c
7 c
8 d
9 d
10 e
11 f
12 g
13 h
Name: level_1, dtype: category
Categories (8, object): [a < b < c < d < e < f < g < h]

So I think we need a function to do this, as it seems to be a natural operation. Maybe get_categories().
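
Recent pandas releases (1.5 and later, to my knowledge) do provide such a function under the name pd.from_dummies. The sketch below assumes one of those versions is installed and uses a short made-up series; with no prefix separator the decoded values come back in a single column whose name is the empty string, which is why the code selects it by position.

```python
import pandas as pd

# Short made-up example; requires a pandas version that has from_dummies.
s = pd.Series(list("aabbc"))
dummies = pd.get_dummies(s)

# from_dummies returns a DataFrame; take its only column by position.
restored = pd.from_dummies(dummies).iloc[:, 0]
print(restored.tolist())  # ['a', 'a', 'b', 'b', 'c']
```

Unlike the stack-based approach, from_dummies raises an error when a row has no active dummy unless a default_category is given, so it is best suited to complete one-hot encodings.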

How to convert (Not-One) Hot Encodings to a Column with Multiple Values on the Same Row

You can do DataFrame.dot which is much faster than iterating over all the rows in the dataframe:

df.dot(df.columns + ', ').str.rstrip(', ')


0         three, four
1 one, three, four
2 three
3 one, three
4
dtype: object
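
A runnable sketch of the dot trick follows, using an invented frame with the columns one, three and four. It relies on the fact that multiplying a string by 0 gives an empty string and by 1 gives the string itself, so the dot product concatenates the names of the active columns.

```python
import pandas as pd

# Hypothetical multi-hot data: several columns may be 1 in the same row.
df = pd.DataFrame(
    {
        "one": [0, 1, 0],
        "three": [1, 1, 0],
        "four": [1, 0, 0],
    }
)

# Each cell (0 or 1) is multiplied by '<column>, ' and the pieces are summed
# per row; rstrip then removes the trailing separator.
labels = df.dot(df.columns + ", ").str.rstrip(", ")
print(labels.tolist())  # ['three, four', 'one, three', '']
```

A row with no active column yields an empty string, which matches the blank entry in the output above.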

