R - How to One-Hot Encode a Single Column While Keeping the Other Columns Unchanged

R - How to one-hot encode a single column while keeping the other columns unchanged?

library(tidyr)
library(dplyr)

# Add an indicator column of 1s, then spread 'subject' into one 0/1 column per level
df %>% mutate(value = 1) %>% spread(subject, value, fill = 0)

  group student exam_pass Japanese Math Science
1     A      01         N        0    0       1
2     A      01         Y        1    1       0
3     A      02         N        0    1       0
4     A      02         Y        0    0       1
5     B      01         Y        1    0       0
6     C      02         N        0    1       0

How to keep track of columns after encoding categorical variables?

Can you confirm whether future data sets will continue to have the same column names? If I understand your question correctly, all you need to do is save the column list (df_columns) from the original data frame and use it to reindex your new data frame.

new_df_reindexed = new_df[df_columns]

To answer your other questions, you can one-hot encode your data using get_dummies() from pandas. Set drop_first=True to drop the first level of each encoded column and avoid the dummy variable trap. Also, save the column list of the one-hot-encoded data frame.
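As a minimal sketch of this step, assuming a small made-up training frame (train_df and its subject and score columns are hypothetical, chosen only for illustration), the encoding and the saved column list might look like this:

```python
import pandas as pd

# Hypothetical training data; the column names are illustrative only.
train_df = pd.DataFrame({
    "subject": ["Math", "Science", "Japanese", "Math"],
    "score": [70, 85, 90, 60],
})

# One-hot encode 'subject'. drop_first=True drops the first level
# (here "Japanese", the first in sorted order) to avoid the dummy variable trap.
train_encoded = pd.get_dummies(
    train_df, columns=["subject"], drop_first=True, dtype=int
)

# Save the column list so new data can be aligned to it later.
train_one_hot_encode_col_list = train_encoded.columns.tolist()
print(train_one_hot_encode_col_list)
```

The untouched score column stays in place, and only the dummy columns for the remaining subject levels are appended after it.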

To ensure that your new / testing / holdout data set has the same column definition as the one used in model training:

  • First use get_dummies() to one-hot-encode the new data set.
  • Use pandas reindex to bring the new data frame into the same structure as the one used in model training: df.reindex(columns=train_one_hot_encode_col_list, fill_value=0). Pass either columns= or axis="columns", not both; pandas raises a TypeError if both are given.
  • This creates dummy variable columns for categorical values that appear in the training data set but not in the new data set. fill_value=0 fills them with 0 instead of the default NaN.
  • The same reindex call also removes any columns in the new data set that are not present in the training data set. Plain selection, test_df_reindexed = test_df_onehotencode[train_one_hot_encode_col_list], gives the same result only when every training column already exists in the new data set; otherwise it raises a KeyError.

If you follow these steps, you can completely rely on the list of original column names, and will not need to track column positions or categorical value definitions.
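The steps above can be sketched as follows. The train_df and test_df frames and their color and size columns are made up for illustration, and drop_first is left at its default here to keep the output easy to read:

```python
import pandas as pd

# Hypothetical data: the test set lacks "green" and "blue" and adds "purple".
train_df = pd.DataFrame({"color": ["red", "green", "blue"], "size": [1, 2, 3]})
test_df = pd.DataFrame({"color": ["red", "purple"], "size": [4, 5]})

# Encode the training data and save its column list.
train_encoded = pd.get_dummies(train_df, columns=["color"], dtype=int)
train_one_hot_encode_col_list = train_encoded.columns.tolist()

# Encode the new data, then align it to the training columns.
# Missing dummy columns are added and filled with 0; unseen ones are dropped.
test_encoded = pd.get_dummies(test_df, columns=["color"], dtype=int)
test_df_reindexed = test_encoded.reindex(
    columns=train_one_hot_encode_col_list, fill_value=0
)
print(test_df_reindexed)
```

After the reindex, test_df_reindexed has exactly the training columns in the training order: color_blue and color_green are added as all-zero columns, and color_purple is gone.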

I would also advise you to read the following for further reference:
One-hot encoding in pandas - https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html
Column re-indexing - https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.reindex.html

Combining one-hot encoded dataframe rows

Something like this?

aggregate(. ~ C2 + C3, data = df, FUN = sum)

R DataFrame - One Hot Encoding of column containing multiple terms

One option is mtabulate() from qdapTools, applied after splitting the 'Info' column on ", ":

library(qdapTools)
cbind(mydf, mtabulate(strsplit(mydf$Info, ", ")))
#  Age                      Info Target bad fun go good happy joy nice NULL okay sad wild
#1  99            good, bad, sad    Boy   1   0  0    1     0   0    0    0    0   1    0
#2  10          nice, happy, joy   Girl   0   0  0    0     1   1    1    0    0   0    0
#3  40                      NULL    Boy   0   0  0    0     0   0    0    1    0   0    0
#4  15 okay, nice, fun, wild, go    Boy   0   1  1    0     0   0    1    0    1   0    1

