Dummy variables when not all categories are present
Using transpose and reindex
import pandas as pd
cats = ['a', 'b', 'c']
df = pd.DataFrame({'cat': ['a', 'b', 'a']})
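# dtype=float keeps the 1.0/0.0 output shown below on pandas versions where get_dummies returns booleans by default
dummies = pd.get_dummies(df, prefix='', prefix_sep='', dtype=float)
# reindex the transposed frame so the missing category appears as a row, then transpose back
dummies = dummies.T.reindex(cats).T.fillna(0)
print(dummies)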
     a    b    c
0  1.0  0.0  0.0
1  0.0  1.0  0.0
2  1.0  0.0  0.0
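A shorter variant of the same idea, as a sketch: reindex also accepts columns= and fill_value=, so the double transpose and the fillna are not strictly needed. The dtype=int argument is my addition, used only so the result is 0/1 integers regardless of the pandas version.

```python
import pandas as pd

cats = ['a', 'b', 'c']
df = pd.DataFrame({'cat': ['a', 'b', 'a']})

# Encode the single column; dtype=int gives 0/1 integers instead of booleans.
dummies = pd.get_dummies(df['cat'], dtype=int)

# Reindex the columns directly: categories that never occur become all-zero columns.
dummies = dummies.reindex(columns=cats, fill_value=0)
print(dummies)
```

Because fill_value=0 is an integer, the added column keeps the integer dtype instead of turning into floats.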
Dummy variables when not all categories are present across multiple features & data sets
I think you need reindex by the union of all columns, provided the categorical columns have the same names in both DataFrames:
print (df1)
df1
1 a
2 b
3 c
print (df2)
df1
1 b
2 c
3 d
df1 = pd.get_dummies(df1)
df2 = pd.get_dummies(df2)
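# Index.union is the explicit set union; the | operator on an Index no longer means set union in recent pandas
union = df1.columns.union(df2.columns)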
df1 = df1.reindex(columns=union, fill_value=0)
df2 = df2.reindex(columns=union, fill_value=0)
print (df1)
df1_a df1_b df1_c df1_d
1 1 0 0 0
2 0 1 0 0
3 0 0 1 0
print (df2)
df1_a df1_b df1_c df1_d
1 0 1 0 0
2 0 0 1 0
3 0 0 0 1
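For several features and more than two data sets, the same union-and-reindex step can be wrapped in a small helper. This is only a sketch: align_dummies, train, test and the color/size columns are hypothetical names I made up for illustration, and dtype=int is there just to get 0/1 integers.

```python
import pandas as pd
from functools import reduce

def align_dummies(*frames):
    """One-hot encode each frame, then give all of them the same columns."""
    encoded = [pd.get_dummies(f, dtype=int) for f in frames]
    # Union of the dummy columns over every data set.
    all_columns = reduce(lambda left, right: left.union(right),
                         (e.columns for e in encoded))
    # Columns a frame never produced are added and filled with 0.
    return [e.reindex(columns=all_columns, fill_value=0) for e in encoded]

# Hypothetical data: two features, with different levels in each data set.
train = pd.DataFrame({'color': ['red', 'blue'], 'size': ['S', 'M']})
test = pd.DataFrame({'color': ['green', 'red'], 'size': ['M', 'L']})

train_d, test_d = align_dummies(train, test)
print(train_d.columns.tolist())
print(test_d.columns.tolist())
```

Both results end up with identical columns in identical order, so they can be fed to the same model.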
Get dummies when some categories are not present in a pandas column
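You can use the category data type. Note that astype('category', categories=...) no longer works in current pandas; pass a CategoricalDtype instead:
df.Type = df.Type.astype(pd.CategoricalDtype(categories=['type1', 'type2', 'type3', 'type4']))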
df
Out[200]:
Type
0 type1
1 type2
2 type3
pd.get_dummies(df["Type"], prefix="type")
Out[201]:
type_type1 type_type2 type_type3 type_type4
0 1 0 0 0
1 0 1 0 0
2 0 0 1 0
Dummy variable levels not present in unseen data
For the levels of categorical variables missing in the unseen data, create new features in the data by adding those missing levels and keeping the value as 0 for all the records.
I was able to solve it by following this One Hot Encoding Tutorial.
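In pandas, one way to do this is to reindex the unseen data's dummy columns against the training columns. A minimal sketch, assuming both sets are encoded with get_dummies; the train and unseen frames and the level column are hypothetical:

```python
import pandas as pd

# Hypothetical data: 'high' never occurs in the unseen set.
train = pd.DataFrame({'level': ['low', 'mid', 'high']})
unseen = pd.DataFrame({'level': ['mid', 'mid', 'low']})

train_X = pd.get_dummies(train, dtype=int)
unseen_X = pd.get_dummies(unseen, dtype=int)

# Add the missing levels as all-zero columns and match the training column order.
unseen_X = unseen_X.reindex(columns=train_X.columns, fill_value=0)
print(unseen_X)
```

Be aware that reindexing against the training columns also drops any level that appears only in the unseen data, which is usually what a fitted model expects.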
Creating dummy variables as counts using tidyverse/dplyr
Using reshape2, but you could pretty much use any package that lets you reformat from long to wide:
library(reshape2)
df = dcast(fruitData,ID~FRUIT,length)
> df
ID apple banana grape
1 1 2 1 0
2 2 1 0 1
3 3 1 0 0
Make dummy variables for a categorical variable
The number of dummy variables should be one less than the number of kinds in type. Here I used "A", "B" and "AB" as dummy variables, but which ones I choose from type doesn't matter. Even if I don't know the values in type or the number of kinds, I somehow want to turn it into dummy variables.
This is treatment contrasts coding. First, you need a factor variable.
## option 1: if you care about the order of dummy variables
## the 1st level is not in dummy variables
## I do this to match your example output with "A", "B" and "AB"
f <- factor(df$type, levels = c("O", "A", "B", "AB"))
## option 2: if you don't care, then let R automatically order levels
f <- factor(df$type)
Now, apply treatment contrasts coding.
## option 1 (recommended): using contr.treatment()
m <- contr.treatment(nlevels(f))[f, ]
## option 2 (less efficient): using model.matrix()
m <- model.matrix(~ f)[, -1]
Finally you want to have nice row/column names for readability.
dimnames(m) <- list(1:length(f), levels(f)[-1])
The resulting m looks like:
# A B AB
#1 1 0 0
#2 0 1 0
#3 0 0 1
#4 0 0 0
#5 0 0 0
#6 0 1 0
#7 1 0 0
This is a matrix. If you want a data frame, do data.frame(m).
How to encode dummy variables in Python for sequential data such that the same order is maintained always?
So if you pass the categories in the exact order that you want, get_dummies will keep that order regardless of which values actually occur in the data. The code shows how it's done.
In[1]: import pandas as pd
from pandas.api.types import CategoricalDtype
splice1 = pd.Series(list('bdcccb'))
splice1 = splice1.astype(CategoricalDtype(categories=['a','c','b','d']))
splice2 = pd.Series(list('accd'))
splice2 = splice2.astype(CategoricalDtype(categories=['a','c','b','d']))
In[2]: splice1_dummy = pd.get_dummies(splice1)
Out[2]: a c b d
0 0 0 1 0
1 0 0 0 1
2 0 1 0 0
3 0 1 0 0
4 0 1 0 0
5 0 0 1 0
In[3]: splice2_dummy = pd.get_dummies(splice2)
Out[3]: a c b d
0 1 0 0 0
1 0 1 0 0
2 0 1 0 0
3 0 0 0 1
Although, I still haven't solved the issue of which variable to drop.
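One possible answer to the "which variable to drop" question, offered as a sketch rather than a definitive fix: get_dummies has a drop_first parameter, and with a fixed category order it drops the same (first declared) category for every splice. The names order, d1 and d2 are mine, and dtype=int is only there to get 0/1 integers.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Same fixed category order as above, shared by both splices.
order = CategoricalDtype(categories=['a', 'c', 'b', 'd'])
splice1 = pd.Series(list('bdcccb')).astype(order)
splice2 = pd.Series(list('accd')).astype(order)

# drop_first=True removes the first declared category ('a') in both cases,
# even though 'a' never occurs in splice1.
d1 = pd.get_dummies(splice1, drop_first=True, dtype=int)
d2 = pd.get_dummies(splice2, drop_first=True, dtype=int)
print(d1.columns.tolist())
print(d2.columns.tolist())
```

To drop a different level, put it first in the categories list; rows belonging to the dropped level are then encoded as all zeros.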