Reconstruct a Categorical Variable from Dummies in Pandas

In [46]: s = pd.Series(list('aaabbbccddefgh')).astype('category')

In [47]: s
Out[47]: 
0     a
1     a
2     a
3     b
4     b
5     b
6     c
7     c
8     d
9     d
10    e
11    f
12    g
13    h
dtype: category
Categories (8, object): [a < b < c < d < e < f < g < h]

In [48]: df = pd.get_dummies(s)

In [49]: df
Out[49]: 
    a  b  c  d  e  f  g  h
0   1  0  0  0  0  0  0  0
1   1  0  0  0  0  0  0  0
2   1  0  0  0  0  0  0  0
3   0  1  0  0  0  0  0  0
4   0  1  0  0  0  0  0  0
5   0  1  0  0  0  0  0  0
6   0  0  1  0  0  0  0  0
7   0  0  1  0  0  0  0  0
8   0  0  0  1  0  0  0  0
9   0  0  0  1  0  0  0  0
10  0  0  0  0  1  0  0  0
11  0  0  0  0  0  1  0  0
12  0  0  0  0  0  0  1  0
13  0  0  0  0  0  0  0  1

In [50]: x = df.stack()

# You don't actually need to specify all of the categories here: by definition
# they are already in the dummy matrix (as the column index)
In [51]: pd.Series(pd.Categorical(x[x!=0].index.get_level_values(1)))
Out[51]: 
0     a
1     a
2     a
3     b
4     b
5     b
6     c
7     c
8     d
9     d
10    e
11    f
12    g
13    h
Name: level_1, dtype: category
Categories (8, object): [a < b < c < d < e < f < g < h]

So I think we need a function to do this, as it seems to be a natural operation. Maybe get_categories().
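For what it's worth, newer pandas releases (1.5 and later, as far as I know) ship pd.from_dummies, which covers this case. A minimal sketch, assuming such a version is installed and that each row has exactly one 1; the names dummies and restored are only illustrative:

```python
import pandas as pd

# Build a plain one-hot frame from an object Series (not a Categorical)
s = pd.Series(list("aaabbbccddefgh"))
dummies = pd.get_dummies(s)

# Without a separator, from_dummies returns a one-column DataFrame,
# so take the first column to get a Series back
restored = pd.from_dummies(dummies).iloc[:, 0]
```

If the result should be categorical again, restored.astype('category') converts it.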

Most efficient way to un-dummy variables in Pandas DF

Setup

data = pd.DataFrame([
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
    [1, 0, 0],
    [0, 1, 0]
], columns=['ID01', 'ID18', 'ID31']).assign(A=1, B=2)

data

   ID01  ID18  ID31  A  B
0     1     0     0  1  2
1     0     1     0  1  2
2     0     0     1  1  2
3     1     0     0  1  2
4     0     1     0  1  2

`dot` product with strings and objects.

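This works only if these are truly dummy values (0 or 1) with at most one 1 per row: multiplying a column name by 1 keeps it, multiplying by 0 gives an empty string, and the row sum concatenates the pieces.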

def undummy(d):
    return d.dot(d.columns)

data.assign(Site=data.filter(regex='^ID').pipe(undummy))

   ID01  ID18  ID31  A  B  Site
0     1     0     0  1  2  ID01
1     0     1     0  1  2  ID18
2     0     0     1  1  2  ID31
3     1     0     0  1  2  ID01
4     0     1     0  1  2  ID18
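If a row can hold more than one 1, the plain dot product glues the column names together with nothing in between. A minimal sketch of one way around that, assuming a comma is an acceptable separator; undummy_multi and the multi frame are made-up names for illustration:

```python
import pandas as pd

def undummy_multi(d, sep=","):
    # Append the separator to every column name so joined names stay readable,
    # then strip the trailing separator left at the end of each row's string.
    return d.dot(d.columns + sep).str.rstrip(sep)

# Hypothetical data: one multi-hot row, one one-hot row, one all-zero row
multi = pd.DataFrame(
    [[1, 1, 0], [0, 0, 1], [0, 0, 0]],
    columns=["ID01", "ID18", "ID31"],
)
sites = undummy_multi(multi)
```

An all-zero row comes back as an empty string, which may or may not be what you want.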

`argmax` slicing

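This works, but can produce unexpected results if the data is not strictly one-hot as in the question: argmax returns the position of the first maximum, so an all-zero row maps to the first column and a row with several 1s keeps only the first of them.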

def undummy(d):
    return d.columns[d.values.argmax(1)]

data.assign(Site=data.filter(regex='^ID').pipe(undummy))

   ID01  ID18  ID31  A  B  Site
0     1     0     0  1  2  ID01
1     0     1     0  1  2  ID18
2     0     0     1  1  2  ID31
3     1     0     0  1  2  ID01
4     0     1     0  1  2  ID18
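To guard against rows that are not one-hot, the argmax result can be masked. This is only a sketch under the assumption that invalid rows should come back as missing values; undummy_checked and messy are hypothetical names:

```python
import pandas as pd

def undummy_checked(d):
    values = d.to_numpy()
    # argmax picks the first maximum in each row, so it always returns a label
    labels = d.columns.to_numpy()[values.argmax(axis=1)]
    # Keep a label only where the row is all 0/1 and contains exactly one 1
    is_one_hot = ((values == 0) | (values == 1)).all(axis=1) & (values.sum(axis=1) == 1)
    return pd.Series(labels, index=d.index).where(is_one_hot)

# Hypothetical data: a valid row, an all-zero row, and a row with two 1s
messy = pd.DataFrame(
    [[1, 0, 0], [0, 0, 0], [1, 1, 0]],
    columns=["ID01", "ID18", "ID31"],
)
sites = undummy_checked(messy)
```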

Getting Dummy Back to Categorical

First we can regroup each original column from your resultant df into the original column names as the first level of a column multi-index:

>>> df.columns = pd.MultiIndex.from_tuples(df.columns.str.split('_', n=1).map(tuple))
>>> df = df.rename(columns={'V1': 'Class', 'V2': 'Family'}, level=0)
>>> df
  Class          Family         
    Mid Low High     12  6  5  2
0     1   0    0      1  0  0  0

Now the second level of the columns holds the values. Within each top-level group we want the column name that has a 1, knowing all the other entries are 0. This can be done with idxmax():

>>> orig_df = pd.concat({col: df[col].idxmax(axis='columns') for col in df.columns.levels[0]}, axis='columns')
>>> orig_df
  Class Family
0   Mid     12
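For reference, here is a self-contained round trip of the same idea. The input frame is made up (the question's data isn't shown here), and I'm assuming the original column names contain no underscore. Note that the restored values are the column labels, so they come back as strings:

```python
import pandas as pd

# Hypothetical original data; V2 is kept as strings on purpose
orig = pd.DataFrame({"V1": ["Mid", "Low", "High"], "V2": ["12", "6", "5"]})
dummies = pd.get_dummies(orig, dtype=int)

# Split "V1_Mid" into ("V1", "Mid"); expand=True yields a MultiIndex directly
dummies.columns = dummies.columns.str.split("_", n=1, expand=True)

# Within each top-level group, idxmax returns the label of the column holding the 1
restored = pd.concat(
    {col: dummies[col].idxmax(axis="columns") for col in dummies.columns.levels[0]},
    axis="columns",
)
```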

The most elegant way to get back from pandas.df_dummies

idxmax will do it pretty easily.

from itertools import groupby

import pandas as pd

def back_from_dummies(df):
    result_series = {}

    # Find dummy columns and build pairs (category, full column name)
    dummy_tuples = [(col.split("_", 1)[0], col) for col in df.columns if "_" in col]

    # Find non-dummy columns that do not have a _
    non_dummy_cols = [col for col in df.columns if "_" not in col]

    # For each category column group use idxmax to find the value.
    # groupby only merges adjacent items, which holds for get_dummies output.
    for dummy, cols in groupby(dummy_tuples, lambda item: item[0]):

        # Select columns for each category
        dummy_df = df[[col[1] for col in cols]]

        # Find the column holding the max value (the 1) in each row
        max_columns = dummy_df.idxmax(axis=1)

        # Remove category_ prefix
        result_series[dummy] = max_columns.apply(lambda item: item.split("_", 1)[1])

    # Copy non-dummy columns over.
    for col in non_dummy_cols:
        result_series[col] = df[col]

    # Return dataframe of the resulting series
    return pd.DataFrame(result_series)

(back_from_dummies(df_dummies) == df).all()

How to reverse a dummy variables from a pandas dataframe

We can use wide_to_long, then select the rows that are not equal to zero:

ndf = pd.wide_to_long(df, stubnames='T_', i='id',j='T')

      T_
id  T     
id1 30   0
id2 30   1
id1 40   1
id2 40   0

not_dummy = ndf[ndf['T_'].ne(0)].reset_index().drop(columns='T_')

   id   T
0  id2  30
1  id1  40
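A runnable version of the steps above, as a sketch. The input frame is reconstructed from the printed output, so treat its exact shape as an assumption:

```python
import pandas as pd

# Assumed input: one id column plus dummy columns named T_<number>
df = pd.DataFrame({"id": ["id1", "id2"], "T_30": [0, 1], "T_40": [1, 0]})

# Reshape so each (id, T) pair becomes a row holding the dummy value in column 'T_'
ndf = pd.wide_to_long(df, stubnames="T_", i="id", j="T")

# Keep only the pairs that were switched on, then drop the dummy value itself
not_dummy = ndf[ndf["T_"].ne(0)].reset_index().drop(columns="T_")
```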

Update based on your edit:

ndf = pd.wide_to_long(df.reset_index(), stubnames='T_',i='index',j='T')

not_dummy = ndf[ndf['T_'].ne(0)].reset_index(level='T').drop(columns='T_')

        T
index    
1      30
0      40

From Dummy to a List pandas

dummies = df.apply(lambda x: [col for col in df.columns if x[col] == 1], axis=1)
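The apply above loops over rows and columns in Python. One possible alternative builds the boolean mask once with NumPy and indexes the column labels with it. This is a sketch on a made-up frame; it may be quicker on wide frames, though I haven't measured it:

```python
import pandas as pd

# Hypothetical dummy frame
df = pd.DataFrame({"A": [1, 0, 1], "B": [0, 0, 1], "C": [1, 0, 0]})

cols = df.columns.to_numpy()
mask = df.to_numpy() == 1  # True where a dummy is switched on

# For each row, pick the column labels where the mask is True
labels = pd.Series([cols[row].tolist() for row in mask], index=df.index)
```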

Reverse a get_dummies encoding in pandas

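Use set_index + stack: masking with df[df==1] turns every other cell into NaN, and stack drops NaN by default. (On pandas versions where stack no longer drops NaN, add .dropna() right after stack().)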

df.set_index('ID',inplace=True)

df[df==1].stack().reset_index().drop(0, axis=1)
Out[363]: 
     ID level_1
0  1002       2
1  1002       4
2  1004       1
3  1004       2
4  1005       5
5  1006       6
6  1007       1
7  1007       3
8  1009       3
9  1009       7

Pandas DataFrame: How to convert numeric columns into pairwise categorical data?

Use DataFrame.stack with filtering and Index.to_frame:

s = df.stack()

df = s[s!=0].index.to_frame(index=False).rename(columns={1:'result'})
print (df)
   id result
0   0      A
1   0      D
2   1      A
3   1      B
4   2      A
5   2      B
6   2      C
7   3      D
8   5      B

Or, if performance is important, use numpy.where to get the row and column positions of the matching values and pass them to the DataFrame constructor:

i, c = np.where(df != 0)

df = pd.DataFrame({'id':df.index.values[i],
                   'result':df.columns.values[c]})
print (df)
   id result
0   0      A
1   0      D
2   1      A
3   1      B
4   2      A
5   2      B
6   2      C
7   3      D
8   5      B

EDIT: to keep the matched values as a third column.

For the first approach:

s = df.stack()

df = s[s!=0].reset_index()
df.columns= ['id','result','vals']
print (df)
   id result  vals
0   0      A     3
1   0      D     1
2   1      A     4
3   1      B     1
4   2      A     1
5   2      B     7
6   2      C    20
7   3      D     4
8   5      B     1

For the second approach:

df = pd.DataFrame({'id':df.index.values[i],
                   'result':df.columns.values[c],
                   'vals':df.values[i,c]})
