R - How to one-hot encode a single column while keeping the other columns?
library(tidyr)
library(dplyr)
# Add an indicator column of 1s, then spread the subject values into 0/1 columns
df %>% mutate(value = 1) %>% spread(subject, value, fill = 0)
  group student exam_pass Japanese Math Science
1     A      01         N        0    0       1
2     A      01         Y        1    1       0
3     A      02         N        0    1       0
4     A      02         Y        0    0       1
5     B      01         Y        1    0       0
6     C      02         N        0    1       0
How to keep track of columns after encoding categorical variables?
Can you confirm whether future data sets will continue to have the same column names? If I understood your question correctly, all you need to do is save df_columns from the original data frame and use it to reindex your new data frame.
new_df_reindexed = new_df[df_columns]
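As a minimal sketch of that idea, assuming a small made-up training frame (the column names and values below are illustrative, not from the question), saving the column list and reusing it on new data could look like this:

```python
import pandas as pd

# Hypothetical training data; names and values are for illustration only.
train_df = pd.DataFrame({"group": ["A", "B"], "student": ["01", "02"], "score": [1, 0]})

# Save the column list from the original data frame.
df_columns = list(train_df.columns)

# New data with the same column names, but in a different order.
new_df = pd.DataFrame({"score": [1], "student": ["03"], "group": ["C"]})

# Selecting by the saved list restores the original column order.
new_df_reindexed = new_df[df_columns]
```

Note that plain selection with new_df[df_columns] raises a KeyError if any saved column is missing from the new data, so this form only fits the case where the column names really are the same.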
To answer your other questions, you can one-hot encode your data using get_dummies() from pandas. Use the drop_first parameter to drop one of the generated columns for each categorical variable and avoid the dummy variable trap. Also, save the column list of the one-hot-encoded data frame.
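A short sketch of the encoding step, assuming a toy frame with a single categorical column named subject (the data and the variable name train_onehot are assumptions made for illustration):

```python
import pandas as pd

# Hypothetical training data with one categorical column.
train_df = pd.DataFrame({
    "student": ["01", "01", "02"],
    "subject": ["Science", "Math", "Japanese"],
})

# Encode only "subject"; drop_first=True removes the first category
# (here "Japanese", the first in sorted order).
train_onehot = pd.get_dummies(train_df, columns=["subject"], drop_first=True)

# Save the encoded column list for later use on new data.
train_one_hot_encode_col_list = list(train_onehot.columns)
```

With drop_first=True, a row where every remaining dummy is 0 stands for the dropped category.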
To ensure that your new / testing / holdout data set has the same column definition as that used in model training:
- First use get_dummies() to one-hot-encode the new data set.
- Use pandas reindex to bring the new dataframe into the same structure as the one used in model training: test_df_onehotencode.reindex(columns=train_one_hot_encode_col_list, fill_value=0). Pass the labels through columns= only; combining columns= with axis="columns" raises a TypeError.
- The above will create dummy variable columns for categorical column values in the training data set that are not present in the categorical columns of the new data set. Without fill_value=0 these new columns are filled with NaN.
- Finally, the same call removes any columns in the new data set that are not present in the old data set, so the result can be assigned directly:
test_df_reindexed = test_df_onehotencode.reindex(columns=train_one_hot_encode_col_list, fill_value=0)
If you follow these steps, you can completely rely on the list of original column names, and will not need to track column positions or categorical value definitions.
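The steps above can be sketched end to end as follows. The data is made up for illustration, and the names train_onehot and test_aligned are assumptions; fill_value=0 is used so that dummy columns missing from the new data come out as 0 rather than NaN.

```python
import pandas as pd

# Hypothetical training and test data. "History" appears only in the test set,
# while "Science" and "Japanese" appear only in the training set.
train_df = pd.DataFrame({"exam_pass": ["N", "Y", "N"],
                         "subject": ["Science", "Math", "Japanese"]})
test_df = pd.DataFrame({"exam_pass": ["Y", "N"],
                        "subject": ["Math", "History"]})

# One-hot encode the training data and save its column list.
train_onehot = pd.get_dummies(train_df, columns=["subject"], dtype=int)
train_one_hot_encode_col_list = list(train_onehot.columns)

# One-hot encode the new data on its own.
test_onehot = pd.get_dummies(test_df, columns=["subject"], dtype=int)

# Align to the training layout: missing dummies are added as 0,
# and columns unseen in training (subject_History) are dropped.
test_aligned = test_onehot.reindex(columns=train_one_hot_encode_col_list, fill_value=0)
```

After this, test_aligned has exactly the training columns in the training order, regardless of which categories happened to occur in the new data.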
I would also advise you to read the following for further reference:
One-hot encoding in pandas - https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html
Column re-indexing - https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.reindex.html
Combining one-hot encoded dataframe rows
Something like this?
aggregate(. ~ C2:C3, df, sum)
R DataFrame - One Hot Encoding of column containing multiple terms
One option is mtabulate from qdapTools, after splitting the 'Info' column on ", ".
library(qdapTools)
cbind(mydf, mtabulate(strsplit(mydf$Info, ", ")))
#  Age                      Info Target bad fun go good happy joy nice NULL okay sad wild
#1  99            good, bad, sad    Boy   1   0  0    1     0   0    0    0    0   1    0
#2  10          nice, happy, joy   Girl   0   0  0    0     1   1    1    0    0   0    0
#3  40                      NULL    Boy   0   0  0    0     0   0    0    1    0   0    0
#4  15 okay, nice, fun, wild, go    Boy   0   1  1    0     0   0    1    0    1   0    1