Dummy Variables When Not All Categories Are Present

Using transpose and reindex

import pandas as pd

cats = ['a', 'b', 'c']
df = pd.DataFrame({'cat': ['a', 'b', 'a']})

# dtype=float keeps the columns numeric through the transpose and reindex
dummies = pd.get_dummies(df, prefix='', prefix_sep='', dtype=float)
dummies = dummies.T.reindex(cats).T.fillna(0)

print(dummies)

    a    b    c
0  1.0  0.0  0.0
1  0.0  1.0  0.0
2  1.0  0.0  0.0
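
The double transpose is not strictly needed, since reindex also takes a columns argument. A minimal sketch of that variant, assuming the full category list is known up front (dtype=int is my choice here to get 0/1 integers rather than booleans or floats):

```python
import pandas as pd

cats = ['a', 'b', 'c']
df = pd.DataFrame({'cat': ['a', 'b', 'a']})

# One-hot encode the single column; dtype=int gives 0/1 integers
dummies = pd.get_dummies(df['cat'], dtype=int)

# Reindex the columns directly; categories absent from the data are filled with 0
dummies = dummies.reindex(columns=cats, fill_value=0)

print(dummies)
```

Passing fill_value=0 avoids the NaN-then-fillna step, so the integer dtype is preserved.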

Dummy variables when not all categories are present across multiple features & data sets

I think you need to reindex both by the union of all columns, assuming the categorical columns have the same names in both DataFrames:

print (df1)  
  df1
1   a
2   b
3   c

print (df2)
  df1
1   b
2   c
3   d

df1 = pd.get_dummies(df1)
df2 = pd.get_dummies(df2)

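# Index.union replaces the old `|` set operator, which newer pandas no longer treats as a union
union = df1.columns.union(df2.columns)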
df1 = df1.reindex(columns=union, fill_value=0)
df2 = df2.reindex(columns=union, fill_value=0)
print (df1)
   df1_a  df1_b  df1_c  df1_d
1      1      0      0      0
2      0      1      0      0
3      0      0      1      0
print (df2)
   df1_a  df1_b  df1_c  df1_d
1      0      1      0      0
2      0      0      1      0
3      0      0      0      1
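
If you prefer a single call, DataFrame.align can do the same column alignment. This is only a sketch of an alternative, not necessarily better than the reindex approach; the sample frames are rebuilt here so the snippet runs on its own, and d1/d2 are names I made up:

```python
import pandas as pd

df1 = pd.DataFrame({'df1': ['a', 'b', 'c']}, index=[1, 2, 3])
df2 = pd.DataFrame({'df1': ['b', 'c', 'd']}, index=[1, 2, 3])

d1 = pd.get_dummies(df1, dtype=int)
d2 = pd.get_dummies(df2, dtype=int)

# Outer join on the column axis: both frames end up with the union of columns,
# and columns missing from either side are filled with 0
d1, d2 = d1.align(d2, join='outer', axis=1, fill_value=0)

print(d1)
print(d2)
```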

Get dummies when some categories are not present in a pandas column

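You can use the category data type:

# astype('category', categories=...) was removed from pandas; pass a CategoricalDtype instead
df['Type'] = df['Type'].astype(pd.CategoricalDtype(categories=['type1','type2','type3','type4']))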
df
Out[200]: 
    Type
0  type1
1  type2
2  type3
pd.get_dummies(df["Type"], prefix="type")
Out[201]: 
   type_type1  type_type2  type_type3  type_type4
0           1           0           0           0
1           0           1           0           0
2           0           0           1           0

Dummy variable levels not present in unseen data

For the levels of categorical variables missing in the unseen data, create new features in the data by adding those missing levels and keeping the value as 0 for all the records.

I was able to solve it using this One Hot Encoding Tutorial.
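
In pandas, the idea of adding the missing levels as all-zero features could be sketched roughly like this. The frames train and unseen and the color column are made-up examples, not taken from the tutorial:

```python
import pandas as pd

# Hypothetical data: the unseen set only contains one of the three training levels
train = pd.DataFrame({'color': ['red', 'green', 'blue']})
unseen = pd.DataFrame({'color': ['green', 'green']})

train_dummies = pd.get_dummies(train, dtype=int)
unseen_dummies = pd.get_dummies(unseen, dtype=int)

# Add every training column the unseen data lacks, filled with 0.
# Reindexing to the training columns also drops any level never seen in training.
unseen_dummies = unseen_dummies.reindex(columns=train_dummies.columns, fill_value=0)

print(unseen_dummies)
```

Whether silently dropping levels that only occur in the unseen data is acceptable depends on your model, so treat that part as an assumption.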

Creating dummy variables as counts using tidyverse/dplyr

Using reshape2, but you could use pretty much any package that lets you reshape from long to wide:

    library(reshape2)
    df = dcast(fruitData,ID~FRUIT,length)
   
    > df
    ID apple banana grape
  1  1     2      1     0
  2  2     1      0     1
  3  3     1      0     0

Make dummy variables for a categorical variable

The number of dummy variables should be one less than the number of kinds in type.

Here I used "A", "B" and "AB" as dummy variables, but whatever I choose from type doesn't matter.

Even if I don't know the values in type and the number of kinds, I somehow want to make it as dummy variables.

This is treatment contrasts coding. First, you need a factor variable.

## option 1: if you care about the order of the dummy variables
## the 1st level is the baseline and gets no dummy variable
## I do this to match your example output with "A", "B" and "AB"
f <- factor(df$type, levels = c("O", "A", "B", "AB"))

## option 2: if you don't care, then let R automatically order levels
f <- factor(df$type)

Now, apply treatment contrasts coding.

## option 1 (recommended): using contr.treatment()
m <- contr.treatment(nlevels(f))[f, ]

## option 2 (less efficient): using model.matrix()
m <- model.matrix(~ f)[, -1]

Finally you want to have nice row/column names for readability.

dimnames(m) <- list(1:length(f), levels(f)[-1])

The resulting m looks like:

#   A  B  AB
#1  1  0   0
#2  0  1   0
#3  0  0   1
#4  0  0   0
#5  0  0   0
#6  0  1   0
#7  1  0   0

This is a matrix. If you want a data frame, do data.frame(m).

How to encode dummy variables in Python for sequential data such that the same order is maintained always?

So if you pass the categories in the exact order you want, get_dummies will maintain it regardless. The code shows how it's done.

In[1]: import pandas as pd
       from pandas.api.types import CategoricalDtype

       splice1 = pd.Series(list('bdcccb'))
       splice1 = splice1.astype(CategoricalDtype(categories=['a','c','b','d']))

       splice2 = pd.Series(list('accd'))
       splice2 = splice2.astype(CategoricalDtype(categories=['a','c','b','d']))

In[2]: splice1_dummy = pd.get_dummies(splice1)
Out[2]:     a   c   b   d
        0   0   0   1   0
        1   0   0   0   1
        2   0   1   0   0
        3   0   1   0   0
        4   0   1   0   0
        5   0   0   1   0

In[3]:  splice2_dummy = pd.get_dummies(splice2)
Out[3]:     a   c   b   d
        0   1   0   0   0
        1   0   1   0   0
        2   0   1   0   0
        3   0   0   0   1

Although, I still haven't solved the issue of which variable to drop.
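
One common convention for the dropped variable (not necessarily the right one for every model) is to drop the first category in the declared order and treat it as the reference level. A sketch using drop_first=True, with dtype=int and the names order, d1 and d2 being my own additions:

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Same fixed category order for every splice
order = CategoricalDtype(categories=['a', 'c', 'b', 'd'])

splice1 = pd.Series(list('bdcccb')).astype(order)
splice2 = pd.Series(list('accd')).astype(order)

# drop_first=True removes the first category in the declared order ('a'),
# so both frames drop the same column and keep 'c', 'b', 'd' in that order
d1 = pd.get_dummies(splice1, drop_first=True, dtype=int)
d2 = pd.get_dummies(splice2, drop_first=True, dtype=int)

print(d1)
print(d2)
```

Rows whose value is 'a' then show up as all zeros, which is what makes 'a' the baseline.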