R - How to One-Hot Encode a Single Column While Keeping the Other Columns Unchanged

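library(tidyr)
library(dplyr)

# Add an indicator column, then spread 'subject' into one 0/1 column per level;
# all other columns are kept as they are.
# (spread() is superseded; the newer equivalent is
#  pivot_wider(names_from = subject, values_from = value, values_fill = 0).)
df %>% mutate(value = 1) %>% spread(subject, value, fill = 0)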

  group student exam_pass Japanese Math Science
1     A      01         N        0    0       1
2     A      01         Y        1    1       0
3     A      02         N        0    1       0
4     A      02         Y        0    0       1
5     B      01         Y        1    0       0
6     C      02         N        0    1       0
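
For readers working in pandas rather than R, the same reshaping can be sketched roughly as below. The input frame is reconstructed from the output shown above, so treat it as an assumption about the original data; the names keys and wide are illustrative only.

```python
import pandas as pd

# Long-format input reconstructed from the output above (an assumption).
df = pd.DataFrame({
    "group":     ["A", "A", "A", "A", "A", "B", "C"],
    "student":   ["01", "01", "01", "02", "02", "01", "02"],
    "exam_pass": ["N", "Y", "Y", "N", "Y", "Y", "N"],
    "subject":   ["Science", "Japanese", "Math", "Math", "Science", "Japanese", "Math"],
})

# Columns that should stay untouched and identify one output row.
keys = ["group", "student", "exam_pass"]

wide = (
    # Encode only 'subject'; empty prefix keeps the bare level names.
    pd.get_dummies(df, columns=["subject"], prefix="", prefix_sep="", dtype=int)
    # Collapse rows that share the same keys into a single 0/1 row.
    .groupby(keys, as_index=False)
    .max()
)
print(wide)
```

The groupby step plays the role of spread() collapsing rows that share the same key columns; without it, each input row would stay on its own line.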

How to keep track of columns after encoding categorical variables?

Can you confirm whether future data sets will keep the same column names? If I understood your question correctly, all you need to do is save the column list of the original data frame as df_columns and use it to reindex your new data frame.

new_df_reindexed = new_df[df_columns]

To answer your other questions, you can one-hot encode your data using get_dummies() from pandas. Use the drop_first parameter to drop one of the generated column values and avoid the dummy variable trap. Also, save the column list of the one-hot-encoded data frame.
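
A minimal sketch of that encoding step, assuming a small made-up training frame (train_df, its price and color columns, and train_one_hot are hypothetical names, not from the question):

```python
import pandas as pd

# Hypothetical training data; 'color' is the categorical column.
train_df = pd.DataFrame({
    "price": [10, 12, 9, 11],
    "color": ["red", "blue", "green", "blue"],
})

# Encode only 'color'. drop_first=True drops the first level in sorted
# order ("blue" here), so the remaining dummies are not collinear.
train_one_hot = pd.get_dummies(
    train_df, columns=["color"], drop_first=True, dtype=int
)

# Save the encoded column list for re-indexing new data later.
train_one_hot_encode_col_list = train_one_hot.columns.tolist()
print(train_one_hot_encode_col_list)
```

A row with zeros in every remaining color column then stands for the dropped level.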

To ensure that your new / testing / holdout data set has the same column definition as the one used in model training:

First use get_dummies() to one-hot-encode the new data set.
Use pandas reindex to bring the new dataframe into the same structure as the one used in model training - test_df_reindexed = test_df_onehotencode.reindex(columns=train_one_hot_encode_col_list, fill_value=0). Pass either columns= or a list of labels together with axis="columns", not both; combining them raises a TypeError.
The above will create dummy variable columns, filled with 0, for categorical values in the training data set that are not present in the new data set. Without fill_value=0 those columns would contain NaN.
The same call also removes any columns in the new data set that are not present in the training data set. Plain indexing, test_df_onehotencode[train_one_hot_encode_col_list], gives the same result only when every training column already exists in the new data; otherwise it raises a KeyError.
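
The steps above can be sketched as follows. The data is made up for illustration (train_df, test_df and their values are assumptions); the call passes only columns= plus fill_value=0 to reindex.

```python
import pandas as pd

# Hypothetical training and new data; "purple" never occurs in training,
# while "blue" and "green" never occur in the new data.
train_df = pd.DataFrame({"price": [10, 12, 9], "color": ["red", "blue", "green"]})
test_df = pd.DataFrame({"price": [8, 13], "color": ["red", "purple"]})

# Encode the training data and remember its column layout.
train_one_hot = pd.get_dummies(train_df, columns=["color"], dtype=int)
train_one_hot_encode_col_list = train_one_hot.columns.tolist()

# Encode the new data on its own; its columns will differ.
test_df_onehotencode = pd.get_dummies(test_df, columns=["color"], dtype=int)

# Align to the training layout: missing columns are added as 0,
# columns unknown to training (color_purple) are dropped.
test_df_reindexed = test_df_onehotencode.reindex(
    columns=train_one_hot_encode_col_list, fill_value=0
)
print(test_df_reindexed)
```

drop_first is left out of this sketch on purpose: the level it drops depends on which categories are present in each data set, so applying it separately to training and new data can drop different levels.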

If you follow these steps, you can completely rely on the list of original column names, and will not need to track column positions or categorical value definitions.

I would also advise you to read the following for further reference:
One-hot encoding in pandas - https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html
Column re-indexing - https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.reindex.html

Combining one-hot encoded dataframe rows

Something like this?

# Sum every remaining (one-hot) column within each C2/C3 group
aggregate(. ~ C2 + C3, df, sum)
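
In pandas, the same idea might look like the sketch below. Only the key names C2 and C3 come from the call above; the one-hot columns red and blue and all values are made up for illustration.

```python
import pandas as pd

# Hypothetical data: C2 and C3 identify a record, the rest are one-hot columns.
df = pd.DataFrame({
    "C2":   ["a", "a", "b"],
    "C3":   ["x", "x", "y"],
    "red":  [1, 0, 0],
    "blue": [0, 1, 1],
})

# Sum the one-hot columns within each (C2, C3) group.
combined = df.groupby(["C2", "C3"], as_index=False).sum()
print(combined)
```

If the same category can occur more than once per group and the result must stay 0/1, .max() in place of .sum() keeps the values binary.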

R DataFrame - One Hot Encoding of column containing multiple terms

One option is mtabulate() from qdapTools, applied after splitting the 'Info' column on ", ":

library(qdapTools)
cbind(mydf, mtabulate(strsplit(mydf$Info, ", ")))
#  Age                      Info Target bad fun go good happy joy nice NULL okay sad wild
#1  99            good, bad, sad    Boy   1   0  0    1     0   0    0    0    0   1    0
#2  10          nice, happy, joy   Girl   0   0  0    0     1   1    1    0    0   0    0
#3  40                      NULL    Boy   0   0  0    0     0   0    0    1    0   0    0
#4  15 okay, nice, fun, wild, go    Boy   0   1  1    0     0   0    1    0    1   0    1
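
A rough pandas counterpart, using the rows shown in the output above, is Series.str.get_dummies with the same ", " separator. The names dummies and result are illustrative, and the column order may differ from the R output.

```python
import pandas as pd

# Same rows as in the output above.
mydf = pd.DataFrame({
    "Age": [99, 10, 40, 15],
    "Info": ["good, bad, sad", "nice, happy, joy", "NULL",
             "okay, nice, fun, wild, go"],
    "Target": ["Boy", "Girl", "Boy", "Boy"],
})

# Split each 'Info' string on ", " and build one 0/1 column per term.
dummies = mydf["Info"].str.get_dummies(sep=", ")

# Keep the original columns and append the indicator columns.
result = pd.concat([mydf, dummies], axis=1)
print(result)
```

As in the R output, the literal string "NULL" is treated as just another term and gets its own column.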
