Reconstruct a Categorical Variable from Dummies in Pandas

Reconstruct a categorical variable from dummies in pandas

In [46]: s = Series(list('aaabbbccddefgh')).astype('category')

In [47]: s
Out[47]:
0 a
1 a
2 a
3 b
4 b
5 b
6 c
7 c
8 d
9 d
10 e
11 f
12 g
13 h
dtype: category
Categories (8, object): [a < b < c < d < e < f < g < h]

In [48]: df = pd.get_dummies(s)

In [49]: df
Out[49]:
a b c d e f g h
0 1 0 0 0 0 0 0 0
1 1 0 0 0 0 0 0 0
2 1 0 0 0 0 0 0 0
3 0 1 0 0 0 0 0 0
4 0 1 0 0 0 0 0 0
5 0 1 0 0 0 0 0 0
6 0 0 1 0 0 0 0 0
7 0 0 1 0 0 0 0 0
8 0 0 0 1 0 0 0 0
9 0 0 0 1 0 0 0 0
10 0 0 0 0 1 0 0 0
11 0 0 0 0 0 1 0 0
12 0 0 0 0 0 0 1 0
13 0 0 0 0 0 0 0 1

In [50]: x = df.stack()

# I don't think you actually need to specify ALL of the categories here, as by definition
# they are in the dummy matrix to start (and hence the column index)
In [51]: Series(pd.Categorical(x[x!=0].index.get_level_values(1)))
Out[51]:
0 a
1 a
2 a
3 b
4 b
5 b
6 c
7 c
8 d
9 d
10 e
11 f
12 g
13 h
Name: level_1, dtype: category
Categories (8, object): [a < b < c < d < e < f < g < h]

So I think we need a function to do this, as it seems to be a natural operation. Maybe get_categories().
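A helper along these lines could be sketched as follows. The name dummies_to_categorical is hypothetical (it is not a pandas function), and the sketch assumes every row holds exactly one 1. Newer pandas releases (1.5 and later) also provide pd.from_dummies for this purpose.

```python
import pandas as pd


def dummies_to_categorical(dummies):
    """Hypothetical helper: rebuild a categorical Series from a one-hot frame."""
    values = dummies.to_numpy().astype(int)
    # Assumption: a valid dummy matrix has exactly one 1 per row.
    if not (values.sum(axis=1) == 1).all():
        raise ValueError("each row must contain exactly one 1")
    # The position of the 1 in each row is the category code.
    codes = values.argmax(axis=1)
    cat = pd.Categorical.from_codes(codes, categories=list(dummies.columns))
    return pd.Series(cat, index=dummies.index)


s = pd.Series(list("aaabbbccddefgh")).astype("category")
restored = dummies_to_categorical(pd.get_dummies(s))
```

Using from_codes keeps every column as a category even when one of them never occurs in the data.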

Most efficient way to un-dummy variables in Pandas DF

Setup

data = pd.DataFrame([
[1, 0, 0],
[0, 1, 0],
[0, 0, 1],
[1, 0, 0],
[0, 1, 0]
], columns=['ID01', 'ID18', 'ID31']).assign(A=1, B=2)

data

ID01 ID18 ID31 A B
0 1 0 0 1 2
1 0 1 0 1 2
2 0 0 1 1 2
3 1 0 0 1 2
4 0 1 0 1 2

dot product with strings and objects.

This works if these are truly dummy values 0 or 1

def undummy(d):
    return d.dot(d.columns)

data.assign(Site=data.filter(regex='^ID').pipe(undummy))

ID01 ID18 ID31 A B Site
0 1 0 0 1 2 ID01
1 0 1 0 1 2 ID18
2 0 0 1 1 2 ID31
3 1 0 0 1 2 ID01
4 0 1 0 1 2 ID18
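The dot product works because an integer times a string repeats the string: 1 * 'ID01' is 'ID01' and 0 * 'ID01' is the empty string, and the row sum concatenates the pieces. The sketch below shows what that implies for rows that are not strictly one-hot. The separator argument and the name undummy_joined are illustrative assumptions, not part of the answer above.

```python
import pandas as pd

data = pd.DataFrame(
    [[1, 0, 0],   # ordinary one-hot row
     [0, 1, 1],   # two columns set
     [0, 0, 0]],  # no column set
    columns=["ID01", "ID18", "ID31"],
)


def undummy_joined(d, sep=","):
    # Append the separator to every label, let the dot product
    # concatenate the selected labels, then drop the trailing separator.
    return d.dot(d.columns + sep).str.rstrip(sep)


sites = undummy_joined(data)
```

A row with several 1s yields the joined labels, and a row with no 1 yields an empty string.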

argmax slicing

This works, but it can produce unexpected results if the data is not as represented in the question.

def undummy(d):
    return d.columns[d.values.argmax(1)]

data.assign(Site=data.filter(regex='^ID').pipe(undummy))

ID01 ID18 ID31 A B Site
0 1 0 0 1 2 ID01
1 0 1 0 1 2 ID18
2 0 0 1 1 2 ID31
3 1 0 0 1 2 ID01
4 0 1 0 1 2 ID18
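One example of such an unexpected result: argmax returns position 0 when a row is all zeros, and the first maximum when several columns are set, so both cases silently map to a label. A minimal sketch of a guarded variant follows; the name undummy_checked and the choice to mark invalid rows as missing are assumptions for illustration.

```python
import pandas as pd

data = pd.DataFrame(
    [[1, 0, 0],   # valid one-hot row
     [0, 0, 0],   # no column set
     [1, 1, 0]],  # two columns set
    columns=["ID01", "ID18", "ID31"],
)

# The plain argmax version labels every row, valid or not.
naive = data.columns[data.values.argmax(1)]


def undummy_checked(d):
    values = d.to_numpy()
    labels = d.columns.to_numpy()[values.argmax(axis=1)]
    # Keep a label only where the row contains exactly one 1.
    valid = values.sum(axis=1) == 1
    return pd.Series(labels, index=d.index).where(valid)


checked = undummy_checked(data)
```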

Getting Dummy Back to Categorical

First we can regroup each original column from your resultant df into the original column names as the first level of a column multi-index:

>>> df.columns = pd.MultiIndex.from_tuples(df.columns.str.split('_', n=1).map(tuple))
>>> df = df.rename(columns={'V1': 'Class', 'V2': 'Family'}, level=0)
>>> df
  Class          Family
    Mid Low High     12  6  5  2
0     1   0    0      1  0  0  0

Now we see the second-level of columns are the values. Thus, within each top-level we want to get the column name that has a 1, knowing all the other entries are 0. This can be done with idxmax():

>>> orig_df = pd.concat({col: df[col].idxmax(axis='columns') for col in df.columns.levels[0]}, axis='columns')
>>> orig_df
Class Family
0 Mid 12
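The two steps above can be tried end to end on a small frame. The data here is made up for illustration (the asker's frame is not shown), and the dummy columns are assumed to follow the prefix_value naming that get_dummies produces.

```python
import pandas as pd

# Hypothetical original data
original = pd.DataFrame(
    {"Class": ["Mid", "Low", "High"], "Family": ["12", "6", "5"]}
)
dummies = pd.get_dummies(original, columns=["Class", "Family"], dtype=int)

# Split "Class_Mid" into ("Class", "Mid") and use the pairs as a column MultiIndex.
dummies.columns = pd.MultiIndex.from_tuples(
    [tuple(name.split("_", 1)) for name in dummies.columns]
)

# Within each top-level group, idxmax returns the label of the column holding the 1.
restored = pd.concat(
    {col: dummies[col].idxmax(axis="columns") for col in dummies.columns.levels[0]},
    axis="columns",
)
```

Note that the restored values are strings taken from the column labels, so numeric columns would need to be converted back afterwards.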

The most elegant way to get back from pandas.df_dummies

idxmax will do it pretty easily.

from itertools import groupby

import pandas as pd

def back_from_dummies(df):
    result_series = {}

    # Find dummy columns and build (category, column_name) pairs
    dummy_tuples = [(col.split("_")[0], col) for col in df.columns if "_" in col]

    # Find non-dummy columns (those without a _)
    non_dummy_cols = [col for col in df.columns if "_" not in col]

    # For each group of category columns, use idxmax to find the value.
    # groupby only merges adjacent items, so the dummy columns of one
    # category must sit next to each other (as get_dummies produces them).
    for dummy, cols in groupby(dummy_tuples, lambda item: item[0]):

        # Select the columns for this category
        dummy_df = df[[col[1] for col in cols]]

        # Find the column holding the max value (the 1) in each row
        max_columns = dummy_df.idxmax(axis=1)

        # Remove the category_ prefix
        result_series[dummy] = max_columns.apply(lambda item: item.split("_", 1)[1])

    # Copy non-dummy columns over.
    for col in non_dummy_cols:
        result_series[col] = df[col]

    # Return a DataFrame of the resulting series
    return pd.DataFrame(result_series)

(back_from_dummies(df_dummies) == df).all()

How to reverse a dummy variables from a pandas dataframe

We can use wide_to_long, then select the rows that are not equal to zero, i.e.:

ndf = pd.wide_to_long(df, stubnames='T_', i='id',j='T')

        T_
id  T
id1 30   0
id2 30   1
id1 40   1
id2 40   0

not_dummy = ndf[ndf['T_'].ne(0)].reset_index().drop(columns='T_')

id T
0 id2 30
1 id1 40
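For reference, here is a self-contained version of the steps above. The input frame is reconstructed from the output shown and is an assumption about what the question's data looked like.

```python
import pandas as pd

# Assumed input: one id column plus dummy columns named T_<value>
df = pd.DataFrame({"id": ["id1", "id2"], "T_30": [0, 1], "T_40": [1, 0]})

# Reshape so each (id, T) pair becomes a row; the suffix after "T_" becomes column T.
ndf = pd.wide_to_long(df, stubnames="T_", i="id", j="T")

# Keep the pairs whose dummy value is nonzero and drop the dummy column itself.
not_dummy = ndf[ndf["T_"].ne(0)].reset_index().drop(columns="T_")
```

wide_to_long converts the numeric suffix to an integer, so T holds 30 and 40 rather than the strings '30' and '40'.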

Update based on your edit:

ndf = pd.wide_to_long(df.reset_index(), stubnames='T_',i='index',j='T')

not_dummy = ndf[ndf['T_'].ne(0)].reset_index(level='T').drop(columns='T_')

        T
index
1      30
0      40

From Dummy to a List pandas

dummies = df.apply(lambda x: [col for col in df.columns if x[col] == 1], axis=1)
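The same per-row lists can be built without a row-wise apply by masking the column labels with each row of the underlying array. This is a sketch on made-up data; the frame and the name lists are illustrative assumptions.

```python
import pandas as pd

# Hypothetical multi-hot frame
df = pd.DataFrame({"A": [1, 0, 1], "B": [0, 1, 1], "C": [0, 0, 1]})

cols = df.columns.to_numpy()
# Each boolean row selects the labels of the columns equal to 1.
lists = pd.Series(
    [cols[mask].tolist() for mask in df.to_numpy() == 1],
    index=df.index,
)
```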

Reverse a get_dummies encoding in pandas

Use set_index + stack. stack drops NaN by default (in its original implementation), so only the entries equal to 1 survive the mask.

df.set_index('ID',inplace=True)

df[df==1].stack().reset_index().drop(0, axis=1)
Out[363]:
ID level_1
0 1002 2
1 1002 4
2 1004 1
3 1004 2
4 1005 5
5 1006 6
6 1007 1
7 1007 3
8 1009 3
9 1009 7
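An alternative that does not depend on how stack treats NaN is to melt the frame and filter on the value column. This is a sketch on a small made-up frame shaped like the one above; the data and the name level_1 for the label column are assumptions.

```python
import pandas as pd

# Hypothetical multi-hot frame with an ID column and integer-labelled dummies
df = pd.DataFrame(
    {"ID": [1002, 1004, 1005], 1: [0, 1, 0], 2: [1, 1, 0], 5: [0, 0, 1]}
)

# One row per (ID, column label) pair, with the dummy in the "value" column.
long = df.melt(id_vars="ID", var_name="level_1")

# Keep only the pairs that were set, ordered by ID and label.
result = (
    long.loc[long["value"] == 1, ["ID", "level_1"]]
    .sort_values(["ID", "level_1"])
    .reset_index(drop=True)
)
```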

Pandas DataFrame: How to convert numeric columns into pairwise categorical data?

Use DataFrame.stack with filtering and Index.to_frame:

s = df.stack()

df = s[s!=0].index.to_frame(index=False).rename(columns={1:'result'})
print (df)
id result
0 0 A
1 0 D
2 1 A
3 1 B
4 2 A
5 2 B
6 2 C
7 3 D
8 5 B

Or, if performance is important, use numpy.where to get the row and column positions of the nonzero values and pass them to the DataFrame constructor:

i, c = np.where(df != 0)

df = pd.DataFrame({'id': df.index.values[i],
                   'result': df.columns.values[c]})
print (df)
id result
0 0 A
1 0 D
2 1 A
3 1 B
4 2 A
5 2 B
6 2 C
7 3 D
8 5 B

EDIT:

For the first approach:

s = df.stack()

df = s[s!=0].reset_index()
df.columns= ['id','result','vals']
print (df)
id result vals
0 0 A 3
1 0 D 1
2 1 A 4
3 1 B 1
4 2 A 1
5 2 B 7
6 2 C 20
7 3 D 4
8 5 B 1

For the second approach:

i, c = np.where(df != 0)

df = pd.DataFrame({'id': df.index.values[i],
                   'result': df.columns.values[c],
                   'vals': df.values[i, c]})

