Reverse a get_dummies encoding in pandas
Use set_index + stack; stack drops NaN values by default:
df.set_index('ID',inplace=True)
df[df==1].stack().reset_index().drop(0, axis=1)
Out[363]:
ID level_1
0 1002 2
1 1002 4
2 1004 1
3 1004 2
4 1005 5
5 1006 6
6 1007 1
7 1007 3
8 1009 3
9 1009 7
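A self-contained sketch of the same idea, for readers who want to run it. The frame below is hypothetical (the original input is not shown), and the column name label is an assumption. The dropna() call is explicit because whether stack discards NaN on its own depends on the pandas version.

```python
import pandas as pd

# Hypothetical indicator frame: one row per ID, one 0/1 column per label.
df = pd.DataFrame(
    {"ID": [1002, 1004, 1005],
     1: [0, 1, 0], 2: [1, 1, 0], 4: [1, 0, 0], 5: [0, 0, 1]}
)

wide = df.set_index("ID")

# where() turns every cell that is not 1 into NaN; after stacking and
# dropping NaN, only the (ID, label) pairs that were set remain.
stacked = wide.where(wide.eq(1)).stack().dropna()

# The pairs live in the MultiIndex, so turn it into ordinary columns.
pairs = stacked.index.to_frame(index=False)
pairs.columns = ["ID", "label"]
print(pairs)
```

An ID with no column set to 1 (none in this sample) would simply not appear in the result.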
Reverse get_dummies()
You can first move the non-dummy columns into the index with DataFrame.set_index, so that only the dummy columns are passed to the function:
#https://stackoverflow.com/a/62085741/2901002
df = undummify(df.set_index(['score1','score2'])).reset_index()
Or use an alternative solution: reshape with DataFrame.melt, filter rows with boolean indexing, split the column names with Series.str.split, and finally pivot with DataFrame.pivot:
df1 = df.melt(['score1','score2'])
df1 = df1[df1['value'].eq(1)]
df1[['a','b']] = df1.pop('variable').str.split('_', expand=True)
df1 = df1.pivot(index=['score1','score2'], columns='a', values='b').reset_index()
print (df1)
a score1 score2 category country
0 0.55 0.54 leader CN
1 0.89 0.45 AU
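The melt/pivot steps can be tried on a small frame. The input below is a guess reconstructed from the printed result, not the asker's actual data, and the names long_df, parts and out are illustrative only.

```python
import pandas as pd

# Assumed input: two score columns plus dummy columns named prefix_value.
df = pd.DataFrame({
    "score1": [0.89, 0.55],
    "score2": [0.45, 0.54],
    "category_leader": [0, 1],
    "country_AU": [1, 0],
    "country_CN": [0, 1],
})

# Long format: one row per (score1, score2, dummy column).
long_df = df.melt(["score1", "score2"])

# Keep only the dummies that are switched on.
long_df = long_df[long_df["value"].eq(1)]

# Split "country_AU" into prefix "country" and value "AU".
parts = long_df["variable"].str.split("_", n=1, expand=True)
long_df = long_df.assign(a=parts[0], b=parts[1])

# One column per prefix again; passing a list as index needs pandas >= 1.1.
out = long_df.pivot(index=["score1", "score2"], columns="a", values="b").reset_index()
print(out)
```

Rows whose dummies are all zero for a prefix end up as missing values in that column, which is why the second row has no category.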
How to reverse dummy variables from a pandas dataframe
We can use wide_to_long, then select the rows that are not equal to zero, i.e.:
ndf = pd.wide_to_long(df, stubnames='T_', i='id',j='T')
T_
id T
id1 30 0
id2 30 1
id1 40 1
id2 40 0
not_dummy = ndf[ndf['T_'].ne(0)].reset_index().drop(columns='T_')
id T
0 id2 30
1 id1 40
Update based on your edit :
ndf = pd.wide_to_long(df.reset_index(), stubnames='T_',i='index',j='T')
not_dummy = ndf[ndf['T_'].ne(0)].reset_index(level='T').drop(columns='T_')
T
index
1 30
0 40
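For a runnable version of the wide_to_long approach, here is a sketch with an assumed input frame (columns id, T_30 and T_40, inferred from the output above rather than taken from the question). It uses drop(columns=...) because recent pandas versions no longer accept the axis as a positional argument.

```python
import pandas as pd

# Assumed input: dummy columns share the stub "T_" followed by a number.
df = pd.DataFrame({"id": ["id1", "id2"], "T_30": [0, 1], "T_40": [1, 0]})

# The numeric suffix of each T_ column becomes the index level "T";
# the 0/1 values end up in a single column named after the stub, "T_".
ndf = pd.wide_to_long(df, stubnames="T_", i="id", j="T")

# Keep the switched-on rows and discard the indicator column itself.
not_dummy = ndf[ndf["T_"].ne(0)].reset_index().drop(columns="T_")
print(not_dummy)
```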
Pandas, reverse one hot encoding
IIUC, you can use DataFrame.idxmax along axis=1. If necessary, you can remove the dummy prefix with str.replace:
X_test[filter_col].idxmax(axis=1).str.replace('mycol_', '')
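A minimal sketch of this, assuming filter_col is the list of columns that share the mycol_ prefix (the question's own definition is not shown here) and using made-up data:

```python
import pandas as pd

# Hypothetical test frame: three dummy columns plus an unrelated column.
X_test = pd.DataFrame({
    "mycol_a": [1, 0, 0],
    "mycol_b": [0, 0, 1],
    "mycol_c": [0, 1, 0],
    "other": [5, 6, 7],
})

# Assumed definition of filter_col: every column with the dummy prefix.
filter_col = [c for c in X_test.columns if c.startswith("mycol_")]

# idxmax(axis=1) returns, per row, the name of the column holding the maximum.
decoded = (
    X_test[filter_col]
    .idxmax(axis=1)
    .str.replace("mycol_", "", regex=False)
)
print(decoded.tolist())
```

Note that idxmax picks the first column when a row is all zeros, so rows without any active dummy need separate handling.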
Pandas Get Dummy Reversal For Prediction
You could use reindex so that the resulting dataframe has the same columns as the second one:
Dataframe4 = pd.get_dummies(Dataframe3, columns=['feature_x', 'feature_y']
).reindex(columns=Dataframe2.columns).fillna(0).astype('int')
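To see what the reindex does, here is a small sketch with hypothetical frames (train and new stand in for the question's dataframes). It passes fill_value=0 to reindex, which has the same effect as the fillna(0).astype('int') chain for columns that were missing.

```python
import pandas as pd

# Hypothetical training data and new data with a different set of categories.
train = pd.DataFrame({"feature_x": ["a", "b", "c"]})
new = pd.DataFrame({"feature_x": ["b", "d"]})

train_d = pd.get_dummies(train, columns=["feature_x"], dtype=int)

# Align the new dummies to the training columns: categories unseen in `new`
# are added as all-zero columns, categories unseen in training are dropped.
new_d = (
    pd.get_dummies(new, columns=["feature_x"], dtype=int)
    .reindex(columns=train_d.columns, fill_value=0)
)
print(new_d)
```

Here the category "d" never appeared in training, so its row comes out as all zeros.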
Reversing 'one-hot' encoding in Pandas
I would use apply to decode the columns:
In [2]: animals = pd.DataFrame({"monkey":[0,1,0,0,0],"rabbit":[1,0,0,0,0],"fox":[0,0,1,0,0]})
In [3]: def get_animal(row):
   ...:     for c in animals.columns:
   ...:         if row[c]==1:
   ...:             return c
In [4]: animals.apply(get_animal, axis=1)
Out[4]:
0 rabbit
1 monkey
2 fox
3 None
4 None
dtype: object
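If the row-wise apply becomes slow on a large frame, one possible vectorized variant (a sketch, not part of the original answer) combines idxmax with a mask for the all-zero rows. It yields a missing value where the apply version returns None.

```python
import pandas as pd

animals = pd.DataFrame({
    "monkey": [0, 1, 0, 0, 0],
    "rabbit": [1, 0, 0, 0, 0],
    "fox":    [0, 0, 1, 0, 0],
})

# True for rows that have at least one column set to 1.
has_label = animals.eq(1).any(axis=1)

# idxmax names the first column holding the row maximum; where() blanks out
# the rows that had no 1 at all, since idxmax would otherwise report "monkey".
decoded = animals.idxmax(axis=1).where(has_label)
print(decoded)
```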
Python how to inverse back the actual values after using one-hot-encode/pd.get_dummies
You can use the inverse_transform method of sklearn.preprocessing.OneHotEncoder to do this. I have illustrated it with an example below:
from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder(handle_unknown='ignore')
X = [['Male'], ['Female'], ['Female']]
enc.fit(X)
enc.categories_
[array(['Female', 'Male'], dtype=object)]
enc.transform([['Female'], ['Male']]).toarray()
array([[1., 0.],
[0., 1.]])
enc.inverse_transform([[0, 1], [1,0], [0, 1]])
array([['Male'],
['Female'],
['Male']], dtype=object)
To get the category-to-key dictionary you could do this:
A = {}
for i in enc.categories_[0]:
A[i] = enc.transform([[i]]).toarray()
But there could be a better way for doing this.
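One possible alternative, sketched under the assumption that a single encoded column is involved as in the example above: transform all categories in one call and zip them with the resulting rows. The names cats, codes and mapping are illustrative.

```python
from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder(handle_unknown="ignore")
enc.fit([["Male"], ["Female"], ["Female"]])

# Categories learned for the first (and only) input column.
cats = enc.categories_[0]

# Encode every category at once; row i is the one-hot vector of cats[i].
codes = enc.transform(cats.reshape(-1, 1)).toarray()

mapping = {cat: row for cat, row in zip(cats, codes)}
print(mapping)
```

This avoids calling transform once per category, which matters only when there are many categories.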
Reconstruct a categorical variable from dummies in pandas
In [46]: s = Series(list('aaabbbccddefgh')).astype('category')
In [47]: s
Out[47]:
0 a
1 a
2 a
3 b
4 b
5 b
6 c
7 c
8 d
9 d
10 e
11 f
12 g
13 h
dtype: category
Categories (8, object): [a < b < c < d < e < f < g < h]
In [48]: df = pd.get_dummies(s)
In [49]: df
Out[49]:
a b c d e f g h
0 1 0 0 0 0 0 0 0
1 1 0 0 0 0 0 0 0
2 1 0 0 0 0 0 0 0
3 0 1 0 0 0 0 0 0
4 0 1 0 0 0 0 0 0
5 0 1 0 0 0 0 0 0
6 0 0 1 0 0 0 0 0
7 0 0 1 0 0 0 0 0
8 0 0 0 1 0 0 0 0
9 0 0 0 1 0 0 0 0
10 0 0 0 0 1 0 0 0
11 0 0 0 0 0 1 0 0
12 0 0 0 0 0 0 1 0
13 0 0 0 0 0 0 0 1
In [50]: x = df.stack()
# I don't think you actually need to specify ALL of the categories here, as by definition
# they are in the dummy matrix to start (and hence the column index)
In [51]: Series(pd.Categorical(x[x!=0].index.get_level_values(1)))
Out[51]:
0 a
1 a
2 a
3 b
4 b
5 b
6 c
7 c
8 d
9 d
10 e
11 f
12 g
13 h
Name: level_1, dtype: category
Categories (8, object): [a < b < c < d < e < f < g < h]
So I think we need a function to do this, as it seems to be a natural operation. Maybe get_categories().
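Newer pandas releases (1.5 and later) ship a function for this, pandas.from_dummies. The sketch below assumes such a version is installed; the sample data and the placeholder value "none" are made up for illustration.

```python
import pandas as pd

s = pd.Series(list("aabbc"))
dummies = pd.get_dummies(s, dtype=int)

# Without a separator, the result is a one-column DataFrame whose column
# name is the empty string, so take the first column to get a Series back.
restored = pd.from_dummies(dummies).iloc[:, 0]

# With prefixed columns, sep splits "col1_a" into column "col1", value "a".
# default_category is used for rows in which no dummy is set.
prefixed = pd.DataFrame({"col1_a": [1, 0, 0], "col1_b": [0, 1, 0]})
decoded = pd.from_dummies(prefixed, sep="_", default_category="none")

print(restored.tolist())
print(decoded)
```

Without default_category, from_dummies raises an error for a row in which no dummy is set, so the argument is needed whenever such rows can occur.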
How to convert (Not-One) Hot Encodings to a Column with Multiple Values on the Same Row
You can use DataFrame.dot, which is much faster than iterating over all the rows in the dataframe:
df.dot(df.columns + ', ').str.rstrip(', ')
0 three, four
1 one, three, four
2 three
3 one, three
4
dtype: object
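A runnable sketch of the dot trick follows. The input frame is an assumption chosen to reproduce the output shown above (the unused column two is a guess). It works because multiplying a string by 0 or 1 yields either an empty string or the string itself, and the dot product then concatenates the pieces.

```python
import pandas as pd

# Assumed multi-hot input; several columns may be 1 in the same row.
df = pd.DataFrame({
    "one":   [0, 1, 0, 1, 0],
    "two":   [0, 0, 0, 0, 0],
    "three": [1, 1, 1, 1, 0],
    "four":  [1, 1, 0, 0, 0],
})

# Each column name gets a trailing ", "; the dot product repeats every name
# 0 or 1 times per row and joins the pieces, then the last ", " is stripped.
labels = df.dot(df.columns + ", ").str.rstrip(", ")
print(labels)
```

Because rstrip removes any trailing commas and spaces, column names that themselves end in one of those characters would be trimmed as well.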