Automatically Expanding an R Factor into a Collection of 1/0 Indicator Variables For Every Factor Level


Use the model.matrix function:

model.matrix( ~ Species - 1, data=iris )
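To see what the "- 1" does on a smaller input, here is a minimal sketch using base R only. The data frame d and its column colour are hypothetical names made up for illustration; they are not part of the original answer.

```r
# Toy data; `d` and `colour` are hypothetical names used only for illustration.
d <- data.frame(id = 1:4,
                colour = factor(c("red", "green", "red", "blue")))

# "- 1" drops the intercept, so every level gets its own 0/1 column.
ind <- model.matrix(~ colour - 1, data = d)

# Column names are the variable name pasted onto each level, in level order.
colnames(ind)

# Attach the indicator columns to the original rows.
out <- cbind(d, ind)
out
```

Each row has exactly one 1 across the indicator columns, because every observation belongs to exactly one level.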

R: Expanding an R factor into dummy columns for every factor level

This worked for me perfectly:

library(reshape2)
m <- acast(data = d, User ~ Code)

The only catch is that it produces NAs instead of 0s for combinations that never occur, but that is easy to fix:

m[is.na(m)] <- 0
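If reshape2 is not available, a similar User-by-Code matrix can be sketched with base R's table(), which fills combinations that never occur with 0 rather than NA. This is a swapped-in technique, not the acast call above, and the toy data frame below is hypothetical.

```r
# Hypothetical toy data with the same column names as in the answer above.
d <- data.frame(User = c("a", "a", "b", "c", "a"),
                Code = c("x", "y", "x", "z", "x"))

# table() counts each User/Code pair; absent pairs are 0, never NA.
m <- unclass(table(d$User, d$Code))

# Cap counts at 1 to get pure 0/1 indicators.
m01 <- (m > 0) * 1L
m01
```

Note that m holds counts (user "a" has code "x" twice), while m01 only records presence or absence.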

Convert a factor to indicator variables?

One way is to use model.matrix():

model.matrix(~Species, iris)

    (Intercept) Speciesversicolor Speciesvirginica
1             1                 0                0
2             1                 0                0
3             1                 0                0

....

148           1                 0                1
149           1                 0                1
150           1                 0                1
attr(,"assign")
[1] 0 1 1
attr(,"contrasts")
attr(,"contrasts")$Species
[1] "contr.treatment"
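The output above has only two Species columns because, with an intercept and the default treatment contrasts, the first level (setosa) becomes the reference level and is encoded as zeros in both columns. A small sketch that checks this, and shows how the reference level can be changed with relevel(); the variable names are chosen for illustration only.

```r
mm <- model.matrix(~ Species, iris)

# Rows with 0 in both indicator columns are exactly the setosa rows.
is_reference <- mm[, "Speciesversicolor"] == 0 & mm[, "Speciesvirginica"] == 0
all(is_reference == (iris$Species == "setosa"))

# Picking another reference level changes which column is left out.
ref <- relevel(iris$Species, ref = "virginica")
colnames(model.matrix(~ ref))
```

Use the "- 1" form instead when you want one column per level and no reference level.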

Split variable into multiple factor variables

A fast and easy way is to use fastDummies::dummy_cols:

fastDummies::dummy_cols(df, "x")

An alternative with tidyverse functions:

library(tidyverse)

df %>% 
  left_join(., df %>% mutate(value = 1) %>% 
              pivot_wider(names_from = x, values_from = value, values_fill = 0) %>% 
              relocate(n, sort(colnames(.)[-1])))

output

> dummy <- fastDummies::dummy_cols(df, "x")
> colnames(dummy)[-c(1,2)] <- LETTERS
> dummy

    n x A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
1   1 Z 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
2   2 Q 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0
3   3 E 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4   4 H 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
5   5 T 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0
6   6 X 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
7   7 R 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0
8   8 F 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
9   9 Z 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
10 10 S 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0

Benchmark
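Because there are several solutions and the question involves a large dataset, a benchmark helps to compare them. In this benchmark, the nnet solution is the fastest. The wrapper functions being timed (fModel.matrix() and so on) are not defined in this excerpt.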

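set.seed(1)
# One million rows, each with a random letter in x
df <- data.frame(n = seq_len(1000000), x = sample(LETTERS, 1000000, replace = TRUE))

library(microbenchmark)
library(ggplot2)  # provides the autoplot() generic used below

# Each f*() function wraps one of the solutions; run every one 10 times
bm <- microbenchmark(
  fModel.matrix(),
  fContrasts(),
  fnnet(),
  fdata.table(),
  fFastDummies(),
  fDplyr(),
  times = 10L,
  setup = gc(FALSE)
)
autoplot(bm)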
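The benchmarked wrappers are not shown, so the following is only a sketch of what two of them might look like, on a much smaller input. The names via_model_matrix and via_class_ind are hypothetical; nnet::class.ind() is presumably what the "nnet solution" refers to, but that is an assumption.

```r
set.seed(1)
# Smaller stand-in for the benchmark data (1,000 rows instead of 1,000,000).
small <- data.frame(n = seq_len(1000),
                    x = sample(LETTERS, 1000, replace = TRUE))

# Hypothetical wrapper: base R model.matrix() without an intercept.
via_model_matrix <- function(d) model.matrix(~ x - 1, data = d)

# Hypothetical wrapper: class.ind() from the nnet package.
via_class_ind <- function(d) nnet::class.ind(d$x)

a <- via_model_matrix(small)
b <- via_class_ind(small)

# Both give one 0/1 column per letter, in the same column order.
identical(dim(a), dim(b))
all(unname(a) == unname(b))
```

The two results differ only in their column names and attributes, so timing differences come from how each function builds the matrix.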


converting a DocumentTermMatrix to factor

If you have a DocumentTermMatrix as defined in the tm package, you can set the count of every word that occurs to 1 by replacing all values in "v" with 1, like so:

dtm[["v"]] <- rep(1, length(dtm[["v"]]))

Full reprex:

library(tm)
#> Loading required package: NLP
data("crude")
dtm <- DocumentTermMatrix(crude)

head(inspect(dtm))
#> <<DocumentTermMatrix (documents: 20, terms: 1266)>>
#> Non-/sparse entries: 2255/23065
#> Sparsity           : 91%
#> Maximal term length: 17
#> Weighting          : term frequency (tf)
#> Sample             :
#>      Terms
#> Docs  and for its mln oil opec prices said that the
#>   144   9   5   6   4  11   10      3    9   10  17
#>   236   7   4   8   4   7    6      2    6    4  15
#>   237  11   3   3   1   3    1      0    0    1  30
#>   242   3   1   0   0   3    2      1    3    0   6
#>   246   9   6   3   0   4    1      0    4    2  18
#>   248   6   2   2   3   9    6      7    5    2  27
#>   273   5   4   0   9   5    5      4    5    0  21
#>   489   5   4   2   2   4    0      2    2    1   8
#>   502   6   5   2   2   4    0      2    2    1  13
#>   704   5   3   1   0   3    0      2    3    3  21
#>      Terms
#> Docs  and for its mln oil opec prices said that the
#>   144   9   5   6   4  11   10      3    9   10  17
#>   236   7   4   8   4   7    6      2    6    4  15
#>   237  11   3   3   1   3    1      0    0    1  30
#>   242   3   1   0   0   3    2      1    3    0   6
#>   246   9   6   3   0   4    1      0    4    2  18
#>   248   6   2   2   3   9    6      7    5    2  27

dtm[["v"]] <- rep(1, length(dtm[["v"]]))
head(inspect(dtm))
#> <<DocumentTermMatrix (documents: 20, terms: 1266)>>
#> Non-/sparse entries: 2255/23065
#> Sparsity           : 91%
#> Maximal term length: 17
#> Weighting          : term frequency (tf)
#> Sample             :
#>      Terms
#> Docs  and for its last oil prices reuter said the was
#>   144   1   1   1    1   1      1      1    1   1   1
#>   236   1   1   1    1   1      1      1    1   1   1
#>   237   1   1   1    1   1      0      1    0   1   1
#>   242   1   1   0    0   1      1      1    1   1   1
#>   246   1   1   1    1   1      0      1    1   1   1
#>   248   1   1   1    1   1      1      1    1   1   1
#>   273   1   1   0    1   1      1      1    1   1   1
#>   489   1   1   1    0   1      1      1    1   1   0
#>   502   1   1   1    0   1      1      1    1   1   0
#>   704   1   1   1    0   1      1      1    1   1   0
#>      Terms
#> Docs  and for its last oil prices reuter said the was
#>   144   1   1   1    1   1      1      1    1   1   1
#>   236   1   1   1    1   1      1      1    1   1   1
#>   237   1   1   1    1   1      0      1    0   1   1
#>   242   1   1   0    0   1      1      1    1   1   1
#>   246   1   1   1    1   1      0      1    1   1   1
#>   248   1   1   1    1   1      1      1    1   1   1
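The replacement works because a DocumentTermMatrix is stored in sparse triplet form: i and j hold the row and column positions of the non-zero cells, and v holds their values. The sketch below imitates that layout with a plain list so it runs without tm; trip and dense() are hypothetical stand-ins, not tm objects or functions.

```r
# Hand-built triplet layout (hypothetical toy data, not a real tm object).
trip <- list(i = c(1L, 1L, 2L),   # row index of each non-zero cell
             j = c(1L, 3L, 2L),   # column index of each non-zero cell
             v = c(9, 5, 4),      # the counts stored in those cells
             nrow = 2L, ncol = 3L)

# Hypothetical helper: expand the triplets into an ordinary dense matrix.
dense <- function(t) {
  m <- matrix(0, t$nrow, t$ncol)
  m[cbind(t$i, t$j)] <- t$v
  m
}

before <- dense(trip)

# Same trick as above: overwrite every stored value with 1.
trip$v <- rep(1, length(trip$v))
after <- dense(trip)
after
```

Only cells that were already non-zero are stored, so zeros stay zero and every positive count becomes 1. As the output above shows, the reported weighting label is left unchanged by this manual replacement; tm also provides the weightBin weighting function, which may be a more conventional way to get binary weights.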

Created on 2022-06-26 by the reprex package (v2.0.1)

fit an `lm` model for every level of a factor

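You can nest the data frame and use purrr's map to fit lm for each level of factor_gear.

library(dplyr)
library(purrr)  # map() comes from purrr, not dplyr

# Assumes mtcars already has a factor_gear column,
# e.g. mtcars$factor_gear <- factor(mtcars$gear)
mtcars %>%
  group_by(factor_gear) %>%
  tidyr::nest() %>%
  mutate(model = map(data, ~lm(mpg ~ cyl, data = .x)))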

#  factor_gear data               model 
#  <fct>       <list>             <list>
#1 4           <tibble [12 × 11]> <lm>  
#2 3           <tibble [15 × 11]> <lm>  
#3 5           <tibble [5 × 11]>  <lm>

Since dplyr 1.0.0 you can use cur_data() to refer to the current group's data, which avoids the need for nest and map. (As of dplyr 1.1.0, cur_data() is deprecated in favor of pick().)

mtcars %>%
  group_by(factor_gear) %>%
  summarise(model = list(lm(mpg ~ cyl, data = cur_data())))
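For comparison, the same per-level fit can be sketched in base R with split() and lapply(). This is an alternative to the dplyr code above, not a rewrite of it, and it groups on the original gear column so that it runs on an unmodified mtcars.

```r
# Split mtcars into one data frame per gear value, then fit lm on each piece.
models <- lapply(split(mtcars, mtcars$gear),
                 function(d) lm(mpg ~ cyl, data = d))

# One row of coefficients per gear level.
coefs <- t(sapply(models, coef))
coefs
```

The result is a named list of lm objects, one per level, which can be passed to summary() or predict() individually.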
