One-Hot Encoding in [R] | Categorical to Dummy Variables
dd <- read.table(text="
RACE AGE.BELOW.21 CLASS
HISPANIC 0 A
ASIAN 1 A
HISPANIC 1 D
CAUCASIAN 1 B",
header=TRUE)
with(dd,
data.frame(model.matrix(~RACE-1,dd),
AGE.BELOW.21,CLASS))
## RACEASIAN RACECAUCASIAN RACEHISPANIC AGE.BELOW.21 CLASS
## 1 0 0 1 0 A
## 2 1 0 0 1 A
## 3 0 0 1 1 D
## 4 0 1 0 1 B
The formula ~RACE-1 specifies that R should create dummy variables from the RACE variable, but suppress the intercept (so that each column represents whether an observation comes from a specified category); the default, without -1, is to make the first column an intercept term (all ones), omitting the dummy variable for the baseline level (first level of the factor) from the model matrix.
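To make the difference concrete, here is a small sketch comparing the two formulas on a RACE column with the same values as above. The data frame name `races` is made up for this illustration.

```r
# Hypothetical one-column data frame mirroring the RACE values above
races <- data.frame(
  RACE = factor(c("HISPANIC", "ASIAN", "HISPANIC", "CAUCASIAN"))
)

# Default: an intercept column plus dummies for every level except
# the baseline (ASIAN, the first level alphabetically)
with_intercept <- model.matrix(~ RACE, races)
colnames(with_intercept)

# Intercept suppressed: one indicator column per level
no_intercept <- model.matrix(~ RACE - 1, races)
colnames(no_intercept)
```

In the second matrix every row sums to 1, which is what makes it a one-hot encoding.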
More generally, you might want something like
dd0 <- subset(dd,select=-CLASS)
data.frame(model.matrix(~.-1,dd0),CLASS=dd$CLASS)
Note that when you have multiple categorical variables you will have to do something a little bit tricky if you want full sets of dummy variables for each one. I would think of cbind()ing together separate model matrices, but I think there's also some trick for doing this all at once that I forget ...
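One commonly used way to do this in a single call, offered here as a sketch and not necessarily the trick the answer has in mind, is to pass identity ("no contrast") coding for every factor through the contrasts.arg argument of model.matrix(). The names `dd_f`, `is_fac` and `full_contrasts` are made up for this example, and the categorical columns are assumed to be stored as factors.

```r
# Same toy data as above, with the categorical columns stored as factors
dd_f <- data.frame(
  RACE = factor(c("HISPANIC", "ASIAN", "HISPANIC", "CAUCASIAN")),
  AGE.BELOW.21 = c(0, 1, 1, 1),
  CLASS = factor(c("A", "A", "D", "B"))
)

# Build an identity contrast matrix for each factor column, so that
# no level is dropped as a baseline
is_fac <- vapply(dd_f, is.factor, logical(1))
full_contrasts <- lapply(dd_f[is_fac], contrasts, contrasts = FALSE)

# One call yields a full set of dummies for RACE and for CLASS
mm <- model.matrix(~ . - 1, dd_f, contrasts.arg = full_contrasts)
colnames(mm)
```

The resulting matrix is deliberately rank-deficient (the RACE columns and the CLASS columns each sum to 1), so it suits methods that want one-hot input more than an unpenalized linear model.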
tidymodels recipes: can I use step_dummy() to one-hot encode the categorical variables *except* booleans, which only need 1 dummy?
There is no automatic way to do this within recipes itself, but I think you can create a function that will handle this for you, something like this:
library(recipes)
#> Loading required package: dplyr
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
#>
#> Attaching package: 'recipes'
#> The following object is masked from 'package:stats':
#>
#> step
data(crickets, package = "modeldata")
levels_more_than <- function(vec, num = 2) {
n_distinct(levels(vec)) > num
}
recipe(~ ., data = crickets) %>%
step_dummy(species, one_hot = !! levels_more_than(crickets$species)) %>%
prep() %>%
bake(new_data = NULL)
#> # A tibble: 31 × 3
#> temp rate species_O..niveus
#> <dbl> <dbl> <dbl>
#> 1 20.8 67.9 0
#> 2 20.8 65.1 0
#> 3 24 77.3 0
#> 4 24 78.7 0
#> 5 24 79.4 0
#> 6 24 80.4 0
#> 7 26.2 85.8 0
#> 8 26.2 86.6 0
#> 9 26.2 87.5 0
#> 10 26.2 89.1 0
#> # … with 21 more rows
recipe(~ ., data = iris) %>%
step_dummy(Species, one_hot = !! levels_more_than(iris$Species)) %>%
prep() %>%
bake(new_data = NULL)
#> # A tibble: 150 × 7
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species_setosa
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 5.1 3.5 1.4 0.2 1
#> 2 4.9 3 1.4 0.2 1
#> 3 4.7 3.2 1.3 0.2 1
#> 4 4.6 3.1 1.5 0.2 1
#> 5 5 3.6 1.4 0.2 1
#> 6 5.4 3.9 1.7 0.4 1
#> 7 4.6 3.4 1.4 0.3 1
#> 8 5 3.4 1.5 0.2 1
#> 9 4.4 2.9 1.4 0.2 1
#> 10 4.9 3.1 1.5 0.1 1
#> # … with 140 more rows, and 2 more variables: Species_versicolor <dbl>,
#> # Species_virginica <dbl>
Created on 2022-02-23 by the reprex package (v2.0.1)
Here are some tips for using not-quite-standard selectors in recipes.
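For comparison, the same rule (full one-hot encoding only when a variable has more than two levels, a single dummy otherwise) can be sketched in base R without recipes. The helper `dummy_or_one_hot()` is hypothetical, not part of any package, and it assumes the input has at least two levels and no missing values.

```r
# Hypothetical helper: full one-hot for factors with more than `num` levels,
# otherwise drop the first (baseline) level's column.
# Assumes at least 2 levels and no NAs.
dummy_or_one_hot <- function(x, name, num = 2) {
  x <- factor(x)
  mm <- model.matrix(~ x - 1)                      # one column per level
  colnames(mm) <- paste(name, levels(x), sep = "_")
  if (nlevels(x) <= num) {
    mm <- mm[, -1, drop = FALSE]                   # keep a single dummy
  }
  mm
}

# Three levels: all three indicator columns are kept
species_cols <- colnames(dummy_or_one_hot(iris$Species, "Species"))

# Two levels: only one column is kept
flag_cols <- colnames(dummy_or_one_hot(c("yes", "no", "yes"), "flag"))
```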
How to turn one-hot encoded variables to a single factor in R
Here's a solution ...
First, one-hot encode carb:
mtcars$carb <- factor(mtcars$carb)
df <- as.data.frame(model.matrix(~ carb - 1, mtcars))
head(df)
#> carb1 carb2 carb3 carb4 carb6 carb8
#> Mazda RX4 0 0 0 1 0 0
#> Mazda RX4 Wag 0 0 0 1 0 0
#> Datsun 710 1 0 0 0 0 0
#> Hornet 4 Drive 1 0 0 0 0 0
#> Hornet Sportabout 0 1 0 0 0 0
#> Valiant 1 0 0 0 0 0
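Then recover a single variable from the one-hot encoded columns (we could of course drop those columns afterwards). Note that which.max() returns the position of the hot column rather than its level label, so rows with carb6 or carb8 come back as 5 and 6: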
library(dplyr)
df %>%
rowwise() %>%
mutate(remade = which.max(c_across(starts_with("carb")))) %>%
ungroup %>%
mutate(remade = factor(remade))
#> # A tibble: 32 x 7
#> carb1 carb2 carb3 carb4 carb6 carb8 remade
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fct>
#> 1 0 0 0 1 0 0 4
#> 2 0 0 0 1 0 0 4
#> 3 1 0 0 0 0 0 1
#> 4 1 0 0 0 0 0 1
#> 5 0 1 0 0 0 0 2
#> 6 1 0 0 0 0 0 1
#> 7 0 0 0 1 0 0 4
#> 8 0 1 0 0 0 0 2
#> 9 0 1 0 0 0 0 2
#> 10 0 0 0 1 0 0 4
#> # … with 22 more rows
Here it is as a function with the option to keep or delete the one hot encoded columns a la @KM_83
cold_encode <- function(df, encoded_prefix, keep_dummies = FALSE) {
var <- sym(encoded_prefix)
df <-
df %>%
rowwise() %>%
mutate({{ var }} := which.max(c_across(starts_with(encoded_prefix)))) %>%
ungroup %>%
mutate({{ var }} := factor({{ var }}))
if (!keep_dummies) {
df <-
df %>% select(-matches(paste0(encoded_prefix,1:9)))
}
return(df)
}
cold_encode(df, "carb")
#> # A tibble: 32 x 11
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fct>
#> 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
#> 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
#> 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
#> 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
#> 5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
#> 6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
#> 7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
#> 8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
#> 9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
#> 10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
#> # … with 22 more rows
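If the original level labels matter (here 6 and 8 instead of positions 5 and 6), one option is to map the position of the hot column back to its column name. The following base R sketch uses max.col() for that; the names `cars2`, `dummies`, `hot`, `labels` and `recovered` are made up for this illustration.

```r
# Work on a copy so the built-in mtcars data set is left untouched
cars2 <- mtcars
cars2$carb <- factor(cars2$carb)
dummies <- model.matrix(~ carb - 1, cars2)

# Position of the hot column in each row
hot <- max.col(dummies, ties.method = "first")

# Strip the "carb" prefix from the column names to get the level labels
labels <- sub("^carb", "", colnames(dummies))

# Index the labels by position to rebuild the factor with its original levels
recovered <- factor(labels[hot], levels = labels)
identical(as.character(recovered), as.character(cars2$carb))
```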
One hot encoding creating n-1 dummy variables
Here is a solution performing full-rank dummification (i.e. creating n-1 columns per variable to avoid collinearity):
library(caret)
library(data.table)
# DT is the question's data.table with columns ID, color and size
data.table(ID = DT$ID, predict(dummyVars(ID ~ ., DT, fullRank = TRUE), DT))
This does exactly the job:
ID colorgreen colorred sizemedium sizesmall
1: 1 0 0 0 0
2: 2 1 0 1 0
3: 3 0 1 0 1
See this for a friendly walkthrough of this function, and ?dummyVars for all the available options.
Also: in a comment, the OP mentioned that this operation would need to be done for millions of rows and thousands of columns, thus justifying the need for data.table. If this simple pre-processing step is too much for the "computing muscle", then I am afraid that the modeling step (aka the real deal) is doomed to fail.
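The same n-1 encoding can also be sketched in base R, since model.matrix() with its default treatment contrasts already drops one level per factor. The data frame `toy` below is a hypothetical stand-in for the question's table; the baseline levels "blue" and "large" are assumptions chosen for the illustration.

```r
# Hypothetical stand-in for the question's table (baseline levels
# "blue" and "large" are assumptions, not taken from the question)
toy <- data.frame(
  ID = 1:3,
  color = factor(c("blue", "green", "red")),
  size = factor(c("large", "medium", "small"))
)

# Default treatment contrasts give n - 1 dummies per factor;
# the first column is the intercept, which is dropped here
dummies <- model.matrix(~ color + size, toy)[, -1, drop = FALSE]

# Put the ID column back next to the dummy columns
encoded <- cbind(toy["ID"], dummies)
encoded
```

The row for the baseline combination is all zeros in the dummy columns, as in the first row of the output above.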
Create dummy variables from all categorical variables in a dataframe
Also a one-liner with the fastDummies package:
fastDummies::dummy_cols(customers)
id gender mood outcome gender_male gender_female mood_happy mood_sad
1 10 male happy 1 1 0 1 0
2 20 female sad 1 0 1 0 1
3 30 female happy 0 0 1 1 0
4 40 male sad 0 1 0 0 1
5 50 female happy 0 0 1 1 0