Reconstruct a categorical variable from dummies in pandas
In [46]: s = Series(list('aaabbbccddefgh')).astype('category')
In [47]: s
Out[47]:
0 a
1 a
2 a
3 b
4 b
5 b
6 c
7 c
8 d
9 d
10 e
11 f
12 g
13 h
dtype: category
Categories (8, object): [a < b < c < d < e < f < g < h]
In [48]: df = pd.get_dummies(s)
In [49]: df
Out[49]:
a b c d e f g h
0 1 0 0 0 0 0 0 0
1 1 0 0 0 0 0 0 0
2 1 0 0 0 0 0 0 0
3 0 1 0 0 0 0 0 0
4 0 1 0 0 0 0 0 0
5 0 1 0 0 0 0 0 0
6 0 0 1 0 0 0 0 0
7 0 0 1 0 0 0 0 0
8 0 0 0 1 0 0 0 0
9 0 0 0 1 0 0 0 0
10 0 0 0 0 1 0 0 0
11 0 0 0 0 0 1 0 0
12 0 0 0 0 0 0 1 0
13 0 0 0 0 0 0 0 1
In [50]: x = df.stack()
# I don't think you actually need to specify ALL of the categories here, as by definition
# they are in the dummy matrix to start (and hence the column index)
In [51]: Series(pd.Categorical(x[x!=0].index.get_level_values(1)))
Out[51]:
0 a
1 a
2 a
3 b
4 b
5 b
6 c
7 c
8 d
9 d
10 e
11 f
12 g
13 h
Name: level_1, dtype: category
Categories (8, object): [a < b < c < d < e < f < g < h]
This seems to be a natural operation, so I think we need a function to do it. Maybe get_categories().
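The reconstruction above could be wrapped in a small helper. This is only a sketch, not a pandas API: the name dummies_to_categorical is hypothetical, and it assumes every row holds exactly one non-zero dummy.

```python
import pandas as pd


def dummies_to_categorical(dummies: pd.DataFrame) -> pd.Series:
    """Hypothetical helper: invert a one-hot frame into a categorical Series."""
    # Assumption: exactly one non-zero entry per row, otherwise the result is ambiguous
    if not (dummies.astype(bool).sum(axis=1) == 1).all():
        raise ValueError("each row must contain exactly one non-zero dummy")
    # idxmax returns the column label of the single non-zero entry in each row
    labels = dummies.astype(int).idxmax(axis=1)
    # Reuse the dummy columns as the categories, as suggested in the answer above
    return pd.Series(
        pd.Categorical(labels, categories=list(dummies.columns)),
        index=dummies.index,
    )


s = pd.Series(list("aaabbbccddefgh")).astype("category")
restored = dummies_to_categorical(pd.get_dummies(s))
```

The categories are taken from the column index, so no separate list of categories has to be passed in.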
Most efficient way to un-dummy variables in Pandas DF
Setup
data = pd.DataFrame([
[1, 0, 0],
[0, 1, 0],
[0, 0, 1],
[1, 0, 0],
[0, 1, 0]
], columns=['ID01', 'ID18', 'ID31']).assign(A=1, B=2)
data
ID01 ID18 ID31 A B
0 1 0 0 1 2
1 0 1 0 1 2
2 0 0 1 1 2
3 1 0 0 1 2
4 0 1 0 1 2
dot product with strings and objects.
This works if these are truly dummy values, 0 or 1.
def undummy(d):
    return d.dot(d.columns)
data.assign(Site=data.filter(regex='^ID').pipe(undummy))
ID01 ID18 ID31 A B Site
0 1 0 0 1 2 ID01
1 0 1 0 1 2 ID18
2 0 0 1 1 2 ID31
3 1 0 0 1 2 ID01
4 0 1 0 1 2 ID18
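If a row can contain more than one 1, the same dot trick concatenates every matching name, so a separator helps keep them apart. A sketch on made-up data (the frame multi and the comma separator are assumptions for illustration, not part of the question):

```python
import pandas as pd

# Made-up frame: second row has no 1 at all, third row has two
multi = pd.DataFrame({"ID01": [1, 0, 1], "ID18": [0, 0, 1]})

# Append a separator to every column name, take the dot product,
# then strip the trailing separator from each resulting string
labels = multi.dot(multi.columns + ",").str.rstrip(",")
```

An all-zero row comes back as an empty string, which is one way to spot rows that are not valid dummies.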
argmax slicing
This works, but it can produce unexpected results if the data is not as represented in the question (for example, an all-zero row is mapped to the first column).
def undummy(d):
    return d.columns[d.values.argmax(1)]
data.assign(Site=data.filter(regex='^ID').pipe(undummy))
ID01 ID18 ID31 A B Site
0 1 0 0 1 2 ID01
1 0 1 0 1 2 ID18
2 0 0 1 1 2 ID31
3 1 0 0 1 2 ID01
4 0 1 0 1 2 ID18
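To illustrate the caveat, here is a guarded variant that marks all-zero rows as missing instead of silently assigning them the first column. The name undummy_checked and the sample frame are hypothetical.

```python
import numpy as np
import pandas as pd


def undummy_checked(d):
    """Hypothetical variant of undummy that flags rows without any 1."""
    values = d.to_numpy()
    # argmax picks the first column for an all-zero row, so copy the labels
    # into a writable object array and overwrite those rows afterwards
    labels = np.array(d.columns[values.argmax(axis=1)], dtype=object)
    labels[values.sum(axis=1) == 0] = None
    return pd.Series(labels, index=d.index)


sites = pd.DataFrame({"ID01": [1, 0, 0], "ID18": [0, 1, 0]})
result = undummy_checked(sites)
```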
Getting Dummy Back to Categorical
First we can regroup each original column from your resultant df into the original column names as the first level of a column multi-index:
>>> df.columns = pd.MultiIndex.from_tuples(df.columns.str.split('_', n=1).map(tuple))
>>> df = df.rename(columns={'V1': 'Class', 'V2': 'Family'}, level=0)
>>> df
Class Family
Mid Low High 12 6 5 2
0 1 0 0 1 0 0 0
Now we see that the second level of columns holds the values. Thus, within each top level we want to get the column name that has a 1, knowing all the other entries are 0. This can be done with idxmax():
>>> orig_df = pd.concat({col: df[col].idxmax(axis='columns') for col in df.columns.levels[0]}, axis='columns')
>>> orig_df
Class Family
0 Mid 12
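The two steps can be tried end to end on a small frame. The data below is made up for illustration, and the column names already use the Class and Family prefixes, so the rename step is skipped.

```python
import pandas as pd

# Made-up stand-in for the original frame
original = pd.DataFrame({"Class": ["Mid", "Low"], "Family": ["12", "6"]})
dummies = pd.get_dummies(original)  # columns such as Class_Low, Family_12

# Split each name once on "_" into (original column, value)
dummies.columns = pd.MultiIndex.from_tuples(
    [tuple(col.split("_", 1)) for col in dummies.columns]
)

# Within each top-level column, idxmax picks the value whose dummy is set
restored = pd.concat(
    {
        top: dummies[top].astype(int).idxmax(axis="columns")
        for top in dummies.columns.levels[0]
    },
    axis="columns",
)
```

Note that the restored values are strings, because they come from column labels.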
The most elegant way to get back from pandas.df_dummies
idxmax will do it pretty easily.
from itertools import groupby
import pandas as pd

def back_from_dummies(df):
    result_series = {}
    # Find dummy columns and build pairs (category, column name)
    dummy_tuples = [(col.split("_", 1)[0], col) for col in df.columns if "_" in col]
    # Find non-dummy columns, i.e. those that do not have a _
    non_dummy_cols = [col for col in df.columns if "_" not in col]
    # For each category column group use idxmax to find the value.
    # groupby only merges adjacent items, so a category's columns must be contiguous.
    for dummy, cols in groupby(dummy_tuples, lambda item: item[0]):
        # Select columns for each category
        dummy_df = df[[col[1] for col in cols]]
        # Find the column holding the max value in each row
        max_columns = dummy_df.idxmax(axis=1)
        # Remove the category_ prefix
        result_series[dummy] = max_columns.apply(lambda item: item.split("_", 1)[1])
    # Copy non-dummy columns over.
    for col in non_dummy_cols:
        result_series[col] = df[col]
    # Return a dataframe of the resulting series
    return pd.DataFrame(result_series)

(back_from_dummies(df_dummies) == df).all()
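As a point of comparison, recent pandas versions ship pd.from_dummies, which does the same prefix splitting. The sketch below assumes pandas 1.5 or later and a frame made up for illustration in which every column is a dummy column.

```python
import pandas as pd

# Made-up frame; both columns are strings so the round trip is exact
original = pd.DataFrame(
    {"Class": ["Mid", "Low", "High"], "Family": ["a", "b", "a"]}
)
encoded = pd.get_dummies(original)

# sep="_" tells from_dummies where the prefix ends and the value begins
decoded = pd.from_dummies(encoded, sep="_")
```

Unlike back_from_dummies, this does not pass non-dummy columns through, so they would have to be split off first and joined back afterwards.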
How to reverse a dummy variables from a pandas dataframe
We can use wide_to_long, then select the rows that are not equal to zero, i.e.
ndf = pd.wide_to_long(df, stubnames='T_', i='id',j='T')
T_
id T
id1 30 0
id2 30 1
id1 40 1
id2 40 0
not_dummy = ndf[ndf['T_'].ne(0)].reset_index().drop(columns='T_')
id T
0 id2 30
1 id1 40
Update based on your edit:
ndf = pd.wide_to_long(df.reset_index(), stubnames='T_',i='index',j='T')
not_dummy = ndf[ndf['T_'].ne(0)].reset_index(level='T').drop(columns='T_')
T
index
1 30
0 40
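For reference, a self-contained version of the first variant. The input frame is an assumption reconstructed from the output shown above, since the question's data is not included here.

```python
import pandas as pd

# Assumed input: one row per id, one dummy column per T value
df = pd.DataFrame({"id": ["id1", "id2"], "T_30": [0, 1], "T_40": [1, 0]})

# Reshape T_30 / T_40 into a long frame indexed by (id, T)
ndf = pd.wide_to_long(df, stubnames="T_", i="id", j="T")

# Keep only the rows whose dummy is set, then drop the dummy column itself
not_dummy = ndf[ndf["T_"].ne(0)].reset_index().drop(columns="T_")
```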
From Dummy to a List pandas
dummies = df.apply(lambda x: [col for col in df.columns if x[col] == 1], axis=1)
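A short run of that one-liner on made-up data, to show the shape of the result (the frame below is an assumption for illustration):

```python
import pandas as pd

# Made-up dummy frame; the last row has two columns set
df = pd.DataFrame({"A": [1, 0, 1], "B": [0, 1, 1]})

# For each row, collect the names of the columns whose value is 1
dummies = df.apply(
    lambda row: [col for col in df.columns if row[col] == 1], axis=1
)
```

The result is a Series holding one Python list per row, so rows with several 1s are preserved rather than collapsed.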
Reverse a get_dummies encoding in pandas
set_index + stack; stack will dropna by default
df.set_index('ID',inplace=True)
df[df==1].stack().reset_index().drop(0, axis=1)
Out[363]:
ID level_1
0 1002 2
1 1002 4
2 1004 1
3 1004 2
4 1005 5
5 1006 6
6 1007 1
7 1007 3
8 1009 3
9 1009 7
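A self-contained sketch of the same idea on made-up data. The explicit dropna() is an added safeguard, on the assumption that some pandas versions may keep missing values when stacking.

```python
import pandas as pd

# Made-up frame: an ID column plus integer-named dummy columns
df = pd.DataFrame({"ID": [1002, 1004], 1: [0, 1], 2: [1, 1]})
indexed = df.set_index("ID")

# Mask everything that is not 1, stack to long form, drop the masked
# entries explicitly, and discard the value column (named 0)
pairs = (
    indexed[indexed == 1]
    .stack()
    .dropna()
    .reset_index()
    .drop(columns=0)
)
```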
Pandas DataFrame: How to convert numeric columns into pairwise categorical data?
Use DataFrame.stack with filtering and Index.to_frame:
s = df.stack()
df = s[s!=0].index.to_frame(index=False).rename(columns={1:'result'})
print (df)
id result
0 0 A
1 0 D
2 1 A
3 1 B
4 2 A
5 2 B
6 2 C
7 3 D
8 5 B
Or, if performance is important, use numpy.where to get the indices of the matching values and pass them to the DataFrame constructor:
i, c = np.where(df != 0)
df = pd.DataFrame({'id':df.index.values[i],
                   'result':df.columns.values[c]})
print (df)
id result
0 0 A
1 0 D
2 1 A
3 1 B
4 2 A
5 2 B
6 2 C
7 3 D
8 5 B
EDIT:
For first:
s = df.stack()
df = s[s!=0].reset_index()
df.columns= ['id','result','vals']
print (df)
id result vals
0 0 A 3
1 0 D 1
2 1 A 4
3 1 B 1
4 2 A 1
5 2 B 7
6 2 C 20
7 3 D 4
8 5 B 1
For second:
i, c = np.where(df != 0)
df = pd.DataFrame({'id':df.index.values[i],
                   'result':df.columns.values[c],
                   'vals':df.values[i,c]})
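Put together as a runnable sketch, with a small made-up frame standing in for the question's data (the name long_df is introduced here only to avoid overwriting the input):

```python
import numpy as np
import pandas as pd

# Made-up input: index named "id", one numeric column per category
df = pd.DataFrame(
    {"A": [3, 4, 0], "B": [0, 1, 0], "D": [1, 0, 0]},
    index=pd.Index([0, 1, 5], name="id"),
)

values = df.to_numpy()
# Row and column positions of every non-zero cell, in row-major order
i, c = np.where(values != 0)

long_df = pd.DataFrame(
    {
        "id": df.index.to_numpy()[i],
        "result": df.columns.to_numpy()[c],
        "vals": values[i, c],
    }
)
```

Rows without any non-zero value, such as id 5 here, simply do not appear in the output.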