One-Hot Encoding in [R] | Categorical to Dummy Variables

dd <- read.table(text="
   RACE        AGE.BELOW.21     CLASS
   HISPANIC          0          A
   ASIAN             1          A
   HISPANIC          1          D
   CAUCASIAN         1          B",
  header=TRUE)


  with(dd,
       data.frame(model.matrix(~RACE-1,dd),
                  AGE.BELOW.21,CLASS))
 ##   RACEASIAN RACECAUCASIAN RACEHISPANIC AGE.BELOW.21 CLASS
 ## 1         0             0            1            0     A
 ## 2         1             0            0            1     A
 ## 3         0             0            1            1     D
 ## 4         0             1            0            1     B

The formula ~RACE-1 specifies that R should create dummy variables from the RACE variable, but suppress the intercept (so that each column represents whether an observation comes from a specified category); the default, without -1, is to make the first column an intercept term (all ones), omitting the dummy variable for the baseline level (first level of the factor) from the model matrix.
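To see the difference between the two formulas side by side, here is a minimal sketch in base R. It rebuilds only the RACE column from the example above as a factor; the object names with_intercept and no_intercept are just illustrative.

```r
# Rebuild the RACE column from the example as a factor
dd <- data.frame(
  RACE = factor(c("HISPANIC", "ASIAN", "HISPANIC", "CAUCASIAN"))
)

# Default: intercept column plus dummies for every level except the first
with_intercept <- model.matrix(~ RACE, dd)
colnames(with_intercept)

# With -1: no intercept, one indicator column per level
no_intercept <- model.matrix(~ RACE - 1, dd)
colnames(no_intercept)
```

In the first matrix the baseline level (ASIAN, first alphabetically) has no column of its own; in the second every level gets a column and each row contains exactly one 1.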

More generally, you might want something like

 dd0 <- subset(dd,select=-CLASS)
 data.frame(model.matrix(~.-1,dd0),CLASS=dd$CLASS)

Note that when you have multiple categorical variables you will have to do something a little bit tricky if you want full sets of dummy variables for each one. I would think of cbind()ing together separate model matrices, but I think there's also some trick for doing this all at once that I forget ...
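One way to do this in a single call (a sketch, not necessarily the trick the answer had in mind) is to hand model.matrix() a contrasts.arg list that requests full indicator contrasts for every factor. The data frame below reuses the RACE and CLASS columns from the example; full_contrasts and mm_full are illustrative names.

```r
dd <- data.frame(
  RACE  = factor(c("HISPANIC", "ASIAN", "HISPANIC", "CAUCASIAN")),
  CLASS = factor(c("A", "A", "D", "B"))
)

# For each factor column, build an identity-style contrast matrix
# (one column per level) instead of the default treatment contrasts
is_fac <- vapply(dd, is.factor, logical(1))
full_contrasts <- lapply(dd[is_fac], contrasts, contrasts = FALSE)

# Drop the intercept and apply the full contrasts to every factor
mm_full <- model.matrix(~ . - 1, data = dd, contrasts.arg = full_contrasts)
colnames(mm_full)
```

The result has one column for every level of every factor, so it is deliberately rank-deficient; that is fine for feeding tree-based or regularised models but not for an ordinary lm() fit.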

tidymodels recipes: can I use step_dummy() to one-hot encode the categorical variables, except booleans, which only need 1 dummy?

There is no automatic way to do this within recipes itself, but I think you can create a function that will handle this for you, something like this:

library(recipes)
#> Loading required package: dplyr
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
#> 
#> Attaching package: 'recipes'
#> The following object is masked from 'package:stats':
#> 
#>     step

data(crickets, package = "modeldata")

levels_more_than <- function(vec, num = 2) {
  n_distinct(levels(vec)) > num
}

recipe(~ ., data = crickets) %>%
  step_dummy(species, one_hot = !! levels_more_than(crickets$species)) %>%
  prep() %>%
  bake(new_data = NULL)
#> # A tibble: 31 × 3
#>     temp  rate species_O..niveus
#>    <dbl> <dbl>             <dbl>
#>  1  20.8  67.9                 0
#>  2  20.8  65.1                 0
#>  3  24    77.3                 0
#>  4  24    78.7                 0
#>  5  24    79.4                 0
#>  6  24    80.4                 0
#>  7  26.2  85.8                 0
#>  8  26.2  86.6                 0
#>  9  26.2  87.5                 0
#> 10  26.2  89.1                 0
#> # … with 21 more rows

recipe(~ ., data = iris) %>%
  step_dummy(Species, one_hot = !! levels_more_than(iris$Species)) %>%
  prep() %>%
  bake(new_data = NULL)
#> # A tibble: 150 × 7
#>    Sepal.Length Sepal.Width Petal.Length Petal.Width Species_setosa
#>           <dbl>       <dbl>        <dbl>       <dbl>          <dbl>
#>  1          5.1         3.5          1.4         0.2              1
#>  2          4.9         3            1.4         0.2              1
#>  3          4.7         3.2          1.3         0.2              1
#>  4          4.6         3.1          1.5         0.2              1
#>  5          5           3.6          1.4         0.2              1
#>  6          5.4         3.9          1.7         0.4              1
#>  7          4.6         3.4          1.4         0.3              1
#>  8          5           3.4          1.5         0.2              1
#>  9          4.4         2.9          1.4         0.2              1
#> 10          4.9         3.1          1.5         0.1              1
#> # … with 140 more rows, and 2 more variables: Species_versicolor <dbl>,
#> #   Species_virginica <dbl>

Created on 2022-02-23 by the reprex package (v2.0.1)

Here are some tips for using not-quite-standard selectors in recipes.
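For comparison, the same rule (full one-hot for factors with more than two levels, a single dummy for two-level factors) can be sketched in base R without recipes. The helper dummies_for() and the toy data frame flags are hypothetical, made up for this illustration; the column naming just mimics the recipes output above.

```r
# Hypothetical helper: encode one column of a data frame
dummies_for <- function(df, col) {
  f <- factor(df[[col]])
  # One indicator column per level
  mm <- model.matrix(~ f - 1)
  colnames(mm) <- paste(col, levels(f), sep = "_")
  # Two-level factors only need one dummy: drop the first level's column
  if (nlevels(f) <= 2) {
    mm <- mm[, -1, drop = FALSE]
  }
  mm
}

# Three levels: full one-hot encoding
iris_dummies <- dummies_for(iris, "Species")
colnames(iris_dummies)

# Two levels: a single dummy column (toy data, assumed for the example)
flags <- data.frame(flag = factor(c("no", "yes", "no")))
flag_dummies <- dummies_for(flags, "flag")
colnames(flag_dummies)
```

Unlike a prepped recipe, this sketch does not remember the training levels, so it is only suitable for quick one-off encoding.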

How to turn one-hot encoded variables to a single factor in R

Here's a solution ...

First one hot encode carb

mtcars$carb <- factor(mtcars$carb)
df <- as.data.frame(model.matrix(~ carb - 1, mtcars))
head(df)

#>                   carb1 carb2 carb3 carb4 carb6 carb8
#> Mazda RX4             0     0     0     1     0     0
#> Mazda RX4 Wag         0     0     0     1     0     0
#> Datsun 710            1     0     0     0     0     0
#> Hornet 4 Drive        1     0     0     0     0     0
#> Hornet Sportabout     0     1     0     0     0     0
#> Valiant               1     0     0     0     0     0

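We can then collapse the dummy columns back into a single variable by finding, for each row, which column holds the 1. Note that which.max() returns the column position, not the level label, so rows flagged in carb6 or carb8 come back as 5 and 6: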

library(dplyr)

df %>% 
   rowwise() %>% 
   mutate(remade = which.max(c_across(starts_with("carb")))) %>%
   ungroup %>%
   mutate(remade = factor(remade))

#> # A tibble: 32 x 7
#>    carb1 carb2 carb3 carb4 carb6 carb8 remade
#>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fct> 
#>  1     0     0     0     1     0     0 4     
#>  2     0     0     0     1     0     0 4     
#>  3     1     0     0     0     0     0 1     
#>  4     1     0     0     0     0     0 1     
#>  5     0     1     0     0     0     0 2     
#>  6     1     0     0     0     0     0 1     
#>  7     0     0     0     1     0     0 4     
#>  8     0     1     0     0     0     0 2     
#>  9     0     1     0     0     0     0 2     
#> 10     0     0     0     1     0     0 4     
#> # … with 22 more rows
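If the original level labels matter (here 6 and 8 rather than positions 5 and 6), one option is to map the winning column back to its name. This is a base R sketch using max.col(); the names carb_f, dummies, idx and recovered are illustrative, and it assumes the dummy columns share a common prefix that can be stripped.

```r
# Start from the carb variable as a factor (levels 1, 2, 3, 4, 6, 8)
carb_f <- factor(mtcars$carb)

# One-hot encode it, naming the columns carb1, carb2, ...
dummies <- as.data.frame(model.matrix(~ carb_f - 1))
names(dummies) <- paste0("carb", levels(carb_f))

# For each row, find the column holding the 1 ...
idx <- max.col(as.matrix(dummies), ties.method = "first")

# ... and turn the column name back into the level label
recovered <- factor(sub("^carb", "", names(dummies)[idx]),
                    levels = levels(carb_f))

# Check that the round trip preserved every value
all(as.character(recovered) == as.character(carb_f))
```

Because the label comes from the column name, gaps in the level sequence survive the round trip.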

Here it is as a function with the option to keep or delete the one hot encoded columns a la @KM_83

cold_encode <- function(df, encoded_prefix, keep_dummies = FALSE) {
   var <- sym(encoded_prefix)
   df <- 
      df %>%
      rowwise() %>%
      mutate({{ var }} := which.max(c_across(starts_with(encoded_prefix)))) %>%
      ungroup %>%
      mutate({{ var }} := factor({{ var }})) 
   if (!keep_dummies) {
      df <- 
      df %>% select(-matches(paste0(encoded_prefix,1:9)))
   }
   return(df)
}

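# df holds only the dummy columns, so bind them to the other mtcars columns first
df_all <- cbind(subset(mtcars, select = -carb), df)
cold_encode(df_all, "carb")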
#> # A tibble: 32 x 11
#>      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear carb 
#>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fct>
#>  1  21       6  160    110  3.9   2.62  16.5     0     1     4 4    
#>  2  21       6  160    110  3.9   2.88  17.0     0     1     4 4    
#>  3  22.8     4  108     93  3.85  2.32  18.6     1     1     4 1    
#>  4  21.4     6  258    110  3.08  3.22  19.4     1     0     3 1    
#>  5  18.7     8  360    175  3.15  3.44  17.0     0     0     3 2    
#>  6  18.1     6  225    105  2.76  3.46  20.2     1     0     3 1    
#>  7  14.3     8  360    245  3.21  3.57  15.8     0     0     3 4    
#>  8  24.4     4  147.    62  3.69  3.19  20       1     0     4 2    
#>  9  22.8     4  141.    95  3.92  3.15  22.9     1     0     4 2    
#> 10  19.2     6  168.   123  3.92  3.44  18.3     1     0     4 4    
#> # … with 22 more rows

One hot encoding creating n-1 dummy variables

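Here is a solution performing the full-rank dummification (i.e. creating n-1 columns to avoid collinearity):

library(caret)       # dummyVars()
library(data.table)  # data.table()
# DT is the data.table from the question, with columns ID, color and size
data.table(ID = DT$ID, predict(dummyVars(ID ~ ., DT, fullRank = TRUE), DT))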

This does exactly the job:

   ID colorgreen colorred sizemedium sizesmall
1:  1          0        0          0         0
2:  2          1        0          1         0
3:  3          0        1          0         1

See this for a friendly walkthrough of this function, and ?dummyVars for all the available options.
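The same n-1 encoding can be sketched in base R, since model.matrix() with an intercept already drops the first level of each factor. The input below is a hypothetical stand-in for the question's DT: the baseline values "blue" and "large" are assumptions chosen only so that the result lines up with the output above.

```r
# Hypothetical stand-in for DT (baseline levels "blue" and "large" are assumed)
DT_demo <- data.frame(
  ID    = 1:3,
  color = factor(c("blue", "green", "red")),
  size  = factor(c("large", "medium", "small"))
)

# With an intercept, each factor contributes (number of levels - 1) columns
mm <- model.matrix(~ color + size, data = DT_demo)

# Drop the intercept column and put the ID back in front
full_rank <- data.frame(ID = DT_demo$ID, mm[, -1, drop = FALSE])
full_rank
```

Row 1 is all zeros because it sits at the baseline level of both factors, which is exactly what full-rank encoding implies.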

Also: in a comment, the OP mentioned that this operation would need to be done for millions of rows and thousands of columns, thus justifying the need for data.table. If this simple pre-processing step is too much for the "computing muscle", then I am afraid that the modeling step (aka the real deal) is doomed to fail.

Create dummy variables from all categorical variables in a dataframe

There is also a one-liner with the fastDummies package:

fastDummies::dummy_cols(customers)

  id gender  mood outcome gender_male gender_female mood_happy mood_sad
1 10   male happy       1           1             0          1        0
2 20 female   sad       1           0             1          0        1
3 30 female happy       0           0             1          1        0
4 40   male   sad       0           1             0          0        1
5 50 female happy       0           0             1          1        0
