Keep Same Dummy Variable in Training and Testing Data

Running get_dummies on train and test data returns different numbers of columns - is it OK to concat the two sets and split them again after feature engineering?

It is acceptable to join the train and test sets as you say, but I would not recommend it.

In particular, when you deploy a model and start scoring "real data", you don't get the chance to join it back to the train set to produce the dummy variables.

There are alternative solutions using the OneHotEncoder class from Scikit-learn, Feature-engine, or Category Encoders. All of these are open-source Python packages with classes that implement the fit / transform functionality.

With fit, the class learns the dummy variables that will be created from the train set, and with transform it creates the dummy variables. In the example that you provide, the test set will also have 4 dummies, and the dummy "Excellent" will contain all 0.

Find examples of the OneHotEncoder from Scikit-learn, Feature-engine and Category encoders in the provided links

How can I align pandas get_dummies across training / validation / testing?

Dummies should be created before dividing the dataset into train, test, or validation sets.

Suppose I have train and test dataframes as follows:

import pandas as pd
train = pd.DataFrame([1, 2, 3], columns=['A'])
test = pd.DataFrame([7, 8], columns=['A'])

#creating dummy for train
pd.get_dummies(train, columns= ['A'])

o/p
   A_1  A_2  A_3
0    1    0    0
1    0    1    0
2    0    0    1

# creating dummies for test data
pd.get_dummies(test, columns = ['A'])
   A_7  A_8
0    1    0
1    0    1

So the dummies for categories 7 and 8 will only be present in the test set, and the two sets end up with different features.

final_df = pd.concat([train, test]) 

dummy_created = pd.get_dummies(final_df)

# now you can split it into train and test
from sklearn.model_selection import train_test_split
train_x, test_x = train_test_split(dummy_created, test_size=0.33)

Now train and test will have the same set of features.

Dummy Variables on training and testing set resulting in different size dataframe output

When you have the dataframe and would like to transform object columns into dummy variables, do not split it before using get_dummies:

df = pd.get_dummies(df)
train = df[cond]
test = df.drop(train.index)

To fix your code:

df = pd.get_dummies(pd.concat([train , test]))
train = df[df.index.isin(train.index)]
test = df.drop(train.index)

Dummy variable levels not present in unseen data

For the levels of categorical variables missing in the unseen data, create new features by adding those missing levels as columns and setting the value to 0 for all records.

I was able to solve using this One Hot Encoding Tutorial
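One way to sketch this with plain pandas is `DataFrame.reindex`, which in a single step adds the missing dummy columns as zeros and drops any level the training data never saw (column name "quality" and the category values here are made up for illustration):

```python
import pandas as pd

train = pd.DataFrame({"quality": ["Good", "Fair", "Excellent"]})
new_data = pd.DataFrame({"quality": ["Good", "Fair"]})  # "Excellent" never appears

train_dummies = pd.get_dummies(train, columns=["quality"])

# align the unseen data to the training columns: missing levels become
# all-zero columns, and the column order matches the training data
new_dummies = (
    pd.get_dummies(new_data, columns=["quality"])
      .reindex(columns=train_dummies.columns, fill_value=0)
)
```

After the reindex, `new_dummies` has exactly the same columns as `train_dummies`, with `quality_Excellent` filled with zeros.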

how to apply pandas get_dummies function to valid data set?

All you need to do is

  1. Create columns in the validation dataset which are present in the training data but missing in the validation data:
missing_cols = [col for col in train.columns if col not in valid.columns]
for col in missing_cols:
    valid[col] = 0

  2. These new columns are appended at the end, so the column order has changed. Rearrange the validation columns to match the training data:
valid = valid[train.columns]
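The two steps above can be sketched end to end on made-up data (the column "A" and its values are only for illustration):

```python
import pandas as pd

train = pd.get_dummies(pd.DataFrame({"A": ["x", "y", "z"]}))
valid = pd.get_dummies(pd.DataFrame({"A": ["x", "y"]}))

# Step 1: add the training columns missing from validation, filled with zeros
missing_cols = [col for col in train.columns if col not in valid.columns]
for col in missing_cols:
    valid[col] = 0

# Step 2: put the validation columns in the same order as the training columns
valid = valid[train.columns]
```

After both steps, `valid` has the same columns in the same order as `train`, and the level "z" that never appeared in validation is an all-zero column.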

What is the best way to implement Pipeline to make sure train and test dummy variables are the same?

Here are a few things I noted in your code which may help:

  • The error is complaining that some of the columns you are trying to drop don't exist in the dataframe. To fix this, you can replace the column-dropping code with:
import numpy as np
import pandas as pd

data = np.random.rand(50, 4)
df = pd.DataFrame(data, columns=["a", "b", "c", "d"])
drop_columns = ['b', 'c', 'e', 'f']

## code to drop columns: only drop the ones that actually exist
drop_columns = set(df.columns) & set(drop_columns)
df.drop(columns=drop_columns, inplace=True)
  • The fit function is only used to infer transformation parameters from the train data, and it is called only with the train data. In your case, you are only inferring which columns remain in the training data after applying the functions and dropping the specified columns. You don't actually need to apply the functions for that: since you know which columns each function adds and which columns you drop, you can compute the result with set operations on the column names.

  • You can also simplify the transform function: you already know which columns to include, so first add the missing columns and then select only the columns you want, instead of dropping columns.
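To make the fit / transform split concrete, here is a minimal sketch of a scikit-learn-compatible transformer that records the training dummy columns in fit and aligns any later frame to them in transform. `DummyAligner` is a made-up name, not part of any library:

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class DummyAligner(BaseEstimator, TransformerMixin):
    """Remember the dummy columns seen at fit time and force every
    later frame into exactly that column set (hypothetical helper)."""

    def fit(self, X, y=None):
        # fit only inspects the training data: record its dummy columns
        self.columns_ = pd.get_dummies(X).columns
        return self

    def transform(self, X):
        # add missing columns as zeros, drop extras, and fix the order
        return pd.get_dummies(X).reindex(columns=self.columns_, fill_value=0)

aligner = DummyAligner()
train = pd.DataFrame({"A": ["x", "y", "z"]})
test = pd.DataFrame({"A": ["x", "w"]})  # "w" is unseen; "y" and "z" are missing
out = aligner.fit(train).transform(test)
```

Because it follows the fit / transform protocol, this step can be dropped into a scikit-learn Pipeline, and the test set always comes out with exactly the training columns.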

how to extend the data frame with dummy variable with dummyVars package?

You can do this easily without the caret package. For example:

library(dplyr)
library(mice)

imp <- mice(mice::nhanes, m=5)
df <- complete(imp, action="long")

df <- df %>%
  mutate(hyp1 = 2 - hyp,
         hyp2 = hyp - 1) %>%
  select(-hyp)

or using Base R:

df$hyp.1 <- 2 - df$hyp
df$hyp.2 <- df$hyp - 1
df <- df[, !colnames(df) %in% "hyp"]

