How to turn one-hot encoded variables to a single factor in R
Here's a solution ...
First, one-hot encode carb:
mtcars$carb <- factor(mtcars$carb)
df <- as.data.frame(model.matrix(~ carb - 1, mtcars))
head(df)
#>                   carb1 carb2 carb3 carb4 carb6 carb8
#> Mazda RX4             0     0     0     1     0     0
#> Mazda RX4 Wag         0     0     0     1     0     0
#> Datsun 710            1     0     0     0     0     0
#> Hornet 4 Drive        1     0     0     0     0     0
#> Hornet Sportabout     0     1     0     0     0     0
#> Valiant               1     0     0     0     0     0
We can then recover the original level for each row from the one-hot encoded columns:
library(dplyr)
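# Names of the dummy columns; stripping the "carb" prefix gives the level labels
carb_cols <- grep("^carb", names(df), value = TRUE)
df %>%
  rowwise() %>%
  # which.max() only gives the position of the 1, so index the names with it
  # (otherwise carb6 and carb8 would come back as 5 and 6)
  mutate(remade = sub("^carb", "", carb_cols[which.max(c_across(all_of(carb_cols)))])) %>%
  ungroup() %>%
  mutate(remade = factor(remade))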
#> # A tibble: 32 x 7
#>    carb1 carb2 carb3 carb4 carb6 carb8 remade
#>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fct>
#>  1     0     0     0     1     0     0 4
#>  2     0     0     0     1     0     0 4
#>  3     1     0     0     0     0     0 1
#>  4     1     0     0     0     0     0 1
#>  5     0     1     0     0     0     0 2
#>  6     1     0     0     0     0     0 1
#>  7     0     0     0     1     0     0 4
#>  8     0     1     0     0     0     0 2
#>  9     0     1     0     0     0     0 2
#> 10     0     0     0     1     0     0 4
#> # … with 22 more rows
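The row-wise step can also be avoided. This is a minimal base R sketch, not part of the original answer, and it assumes every row has exactly one 1 among the dummy columns. The names `mt`, `dummies`, `lvls` and `recovered` are only illustrative.

```r
# Rebuild the one-hot matrix from a fresh copy of mtcars
mt <- mtcars
mt$carb <- factor(mt$carb)
dummies <- model.matrix(~ carb - 1, mt)

# max.col() returns, for each row, the index of the column holding the maximum
idx <- max.col(dummies, ties.method = "first")

# Strip the "carb" prefix from the column names to get the level labels
lvls <- sub("^carb", "", colnames(dummies))
recovered <- factor(lvls[idx], levels = lvls)

# Round-trip check against the original factor
identical(as.character(recovered), as.character(mt$carb))
```

Indexing the labels with `idx`, rather than using `idx` directly, matters here because the levels of carb (1, 2, 3, 4, 6, 8) are not consecutive.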
Here it is as a function, with the option to keep or delete the one-hot encoded columns, à la @KM_83:
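cold_encode <- function(df, encoded_prefix, keep_dummies = FALSE) {
  var <- sym(encoded_prefix)
  # Dummy columns and the level labels they stand for (column name minus prefix)
  dummy_cols <- names(df)[startsWith(names(df), encoded_prefix)]
  lvls <- substring(dummy_cols, nchar(encoded_prefix) + 1)
  df <-
    df %>%
    rowwise() %>%
    # Map the position of the 1 back to its label, so carb6 becomes "6" rather than 5
    mutate(!!var := lvls[which.max(c_across(all_of(dummy_cols)))]) %>%
    ungroup() %>%
    mutate(!!var := factor(!!var, levels = lvls))
  if (!keep_dummies) {
    # Drop exactly the dummy columns found above
    df <- df %>% select(-all_of(dummy_cols))
  }
  return(df)
}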
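# Judging by the output below, df here also holds the other mtcars columns
# alongside the carb dummies
cold_encode(df, "carb")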
#> # A tibble: 32 x 11
#>      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear carb
#>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fct>
#>  1  21       6  160    110  3.9   2.62  16.5     0     1     4 4
#>  2  21       6  160    110  3.9   2.88  17.0     0     1     4 4
#>  3  22.8     4  108     93  3.85  2.32  18.6     1     1     4 1
#>  4  21.4     6  258    110  3.08  3.22  19.4     1     0     3 1
#>  5  18.7     8  360    175  3.15  3.44  17.0     0     0     3 2
#>  6  18.1     6  225    105  2.76  3.46  20.2     1     0     3 1
#>  7  14.3     8  360    245  3.21  3.57  15.8     0     0     3 4
#>  8  24.4     4  147.    62  3.69  3.19  20       1     0     4 2
#>  9  22.8     4  141.    95  3.92  3.15  22.9     1     0     4 2
#> 10  19.2     6  168.   123  3.92  3.44  18.3     1     0     4 4
#> # … with 22 more rows
One-Hot Encoding in [R] | Categorical to Dummy Variables
dd <- read.table(text="
RACE AGE.BELOW.21 CLASS
HISPANIC 0 A
ASIAN 1 A
HISPANIC 1 D
CAUCASIAN 1 B",
header=TRUE)
with(dd,
data.frame(model.matrix(~RACE-1,dd),
AGE.BELOW.21,CLASS))
##   RACEASIAN RACECAUCASIAN RACEHISPANIC AGE.BELOW.21 CLASS
## 1         0             0            1            0     A
## 2         1             0            0            1     A
## 3         0             0            1            1     D
## 4         0             1            0            1     B
The formula ~RACE-1 specifies that R should create dummy variables from the RACE variable, but suppress the intercept (so that each column represents whether an observation comes from a specified category); the default, without -1, is to make the first column an intercept term (all ones), omitting the dummy variable for the baseline level (first level of the factor) from the model matrix.
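To see the effect of the -1, compare the column names of the two model matrices. This is a small sketch that reuses the dd data from above; `no_intercept` and `with_intercept` are illustrative names.

```r
dd <- read.table(text = "
RACE AGE.BELOW.21 CLASS
HISPANIC 0 A
ASIAN 1 A
HISPANIC 1 D
CAUCASIAN 1 B",
header = TRUE)

# Intercept suppressed: one indicator column per level of RACE
no_intercept <- model.matrix(~ RACE - 1, dd)
colnames(no_intercept)
## [1] "RACEASIAN"     "RACECAUCASIAN" "RACEHISPANIC"

# Default: an intercept column, and no column for the baseline level (ASIAN)
with_intercept <- model.matrix(~ RACE, dd)
colnames(with_intercept)
## [1] "(Intercept)"   "RACECAUCASIAN" "RACEHISPANIC"
```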
More generally, you might want something like
dd0 <- subset(dd,select=-CLASS)
data.frame(model.matrix(~.-1,dd0),CLASS=dd$CLASS)
Note that when you have multiple categorical variables you will have to do something a little bit tricky if you want full sets of dummy variables for each one. I would think of cbind()ing together separate model matrices, but I think there's also some trick for doing this all at once that I forget ...
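Both routes can be sketched as follows. The cbind() version is the one described above. The all-at-once version uses the contrasts.arg argument of model.matrix() with identity contrasts, which is one known way to do it and not necessarily the trick the answer had in mind. The names `by_cbind` and `all_at_once` are illustrative.

```r
dd <- read.table(text = "
RACE AGE.BELOW.21 CLASS
HISPANIC 0 A
ASIAN 1 A
HISPANIC 1 D
CAUCASIAN 1 B",
header = TRUE)

# contrasts() needs factors, so convert the character columns first
dd$RACE <- factor(dd$RACE)
dd$CLASS <- factor(dd$CLASS)

# Route 1: bind together one full model matrix per categorical variable
by_cbind <- cbind(model.matrix(~ RACE - 1, dd),
                  model.matrix(~ CLASS - 1, dd))

# Route 2: a single call, asking for identity (non-reduced) contrasts
# so that no level of either factor is dropped
all_at_once <- model.matrix(
  ~ RACE + CLASS - 1, dd,
  contrasts.arg = list(
    RACE  = contrasts(dd$RACE,  contrasts = FALSE),
    CLASS = contrasts(dd$CLASS, contrasts = FALSE)
  )
)

colnames(all_at_once)
```

Either matrix is rank-deficient if used as a regression design, because each full set of dummies sums to one, so this is for encoding only and not for fitting an unpenalized linear model.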