Reconstruct a Categorical Variable from Dummies in R

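You can do this with data.table by melting the dummy columns into long format:

library(data.table)

id_cols <- c("x1", "x2")  # columns that identify each row
data.table::melt.data.table(data = dt, id.vars = id_cols, 
                            na.rm = TRUE, 
                            measure = patterns("dummy"))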

Example:

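# Named dt rather than t to avoid masking base::t()
dt <- data.table(dummy_a = c(1, 0, 0), dummy_b = c(0, 1, 0), dummy_c = c(0, 0, 1), id = c(1, 2, 3))
# Melt to long format, then keep only the rows where the dummy is set
data.table::melt.data.table(data = dt, 
                            id.vars = "id", 
                            measure = patterns("dummy_"), 
                            na.rm = TRUE)[value == 1, .(id, variable)]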

Output

   id variable
1:  1  dummy_a
2:  2  dummy_b
3:  3  dummy_c

It's even easier if you replace 0 with NA: na.rm = TRUE in melt then drops every row with NA, so the value == 1 filter is no longer needed.

How to reconstruct a categorical variable with multiple choices

library(dplyr)
library(tidyr)

df_old <- read.table(text = "a1 a2 a3 a4 a5 a6 a7
0  0  1  1  0  1  0
1  1  1  0  0  0  0
0  1  0  0  1  0  1", header = TRUE)

df_old %>% mutate(rowid = row_number()) %>%
  pivot_longer(!rowid) %>%
  filter(value != 0) %>%
  group_by(rowid) %>%
  # Number the choices within each row, so rows may have different counts
  mutate(choice = paste0('choice', row_number())) %>%
  pivot_wider(id_cols = rowid, names_from = choice, values_from = name) %>%
  # rowid is the grouping variable, so dplyr keeps it in the result
  select(-rowid)

# A tibble: 3 x 4
# Groups:   rowid [3]
  rowid choice1 choice2 choice3
  <int> <chr>   <chr>   <chr>  
1     1 a3      a4      a6     
2     2 a1      a2      a3     
3     3 a2      a5      a7

Convert various dummy/logical variables into a single categorical variable/factor from their name in R

Try:

library(dplyr)
library(tidyr)

df %>% gather(type, value, -id) %>% na.omit() %>% select(-value) %>% arrange(id)

Which gives:

#  id       type
#1  1 conditionA
#2  2 conditionB
#3  3 conditionC
#4  4 conditionD
#5  5 conditionA

Update

To handle the case you detailed in the comments, you could do the operation on the desired portion of the data frame and then left_join() the other columns:

df %>% 
  select(starts_with("condition"), id) %>% 
  gather(type, value, -id) %>% 
  na.omit() %>% 
  select(-value) %>% 
  left_join(., df %>% select(-starts_with("condition")), by = "id") %>%
  arrange(id)

Using dplyr to gather dummy variables

This can be done with the tidyverse, specifically tidyr and dplyr. The following code produces the output you are after.

library(tidyverse)
type %>% gather(TypeOfCar, Count) %>% filter(Count >= 1) %>% select(TypeOfCar)

Output:

   TypeOfCar
    <chr>
1 convertible
2 convertible
3 convertible
4 convertible
5       coupe
6       sedan

Hopefully this solves your problem, let me know if any changes are needed! Thanks.

Reconstruct a categorical variable from dummies in pandas

In [46]: s = pd.Series(list('aaabbbccddefgh')).astype('category')

In [47]: s
Out[47]: 
0     a
1     a
2     a
3     b
4     b
5     b
6     c
7     c
8     d
9     d
10    e
11    f
12    g
13    h
dtype: category
Categories (8, object): [a < b < c < d < e < f < g < h]

In [48]: df = pd.get_dummies(s)

In [49]: df
Out[49]: 
    a  b  c  d  e  f  g  h
0   1  0  0  0  0  0  0  0
1   1  0  0  0  0  0  0  0
2   1  0  0  0  0  0  0  0
3   0  1  0  0  0  0  0  0
4   0  1  0  0  0  0  0  0
5   0  1  0  0  0  0  0  0
6   0  0  1  0  0  0  0  0
7   0  0  1  0  0  0  0  0
8   0  0  0  1  0  0  0  0
9   0  0  0  1  0  0  0  0
10  0  0  0  0  1  0  0  0
11  0  0  0  0  0  1  0  0
12  0  0  0  0  0  0  1  0
13  0  0  0  0  0  0  0  1

In [50]: x = df.stack()

# I don't think you actually need to specify ALL of the categories here, as by definition
# they are in the dummy matrix to start (and hence the column index)
In [51]: pd.Series(pd.Categorical(x[x!=0].index.get_level_values(1)))
Out[51]: 
0     a
1     a
2     a
3     b
4     b
5     b
6     c
7     c
8     d
9     d
10    e
11    f
12    g
13    h
Name: level_1, dtype: category
Categories (8, object): [a < b < c < d < e < f < g < h]

So I think we need a function to do this, as it seems to be a natural operation. Maybe get_categories().
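
In the meantime, the steps above can be wrapped in a small helper. This is only a sketch: categorical_from_dummies is a made-up name, not a pandas function, and it assumes every row has exactly one non-zero dummy (it raises otherwise). It looks up the position of the set column per row instead of stacking, which should give the same result for one-hot data.

```python
import pandas as pd


def categorical_from_dummies(dummies: pd.DataFrame) -> pd.Series:
    """Hypothetical helper: rebuild a categorical Series from one-hot columns."""
    # True where a dummy is set; NaN counts as "not set"
    mask = dummies.ne(0) & dummies.notna()
    # The inverse is only well defined if each row has exactly one set dummy
    if not mask.sum(axis=1).eq(1).all():
        raise ValueError("each row must contain exactly one non-zero dummy")
    # Position of the single True in each row, mapped back to the column labels
    positions = mask.to_numpy().argmax(axis=1)
    labels = dummies.columns[positions]
    # Use the column order as the category order
    values = pd.Categorical(labels, categories=list(dummies.columns))
    return pd.Series(values, index=dummies.index)


s = pd.Series(list("aaabbbccddefgh"), dtype="category")
restored = categorical_from_dummies(pd.get_dummies(s))
```

Here restored should hold the same values as s, with the dummy column names as its categories. Rows with several dummies set (the multiple-choice case) are rejected on purpose, since they have no single category.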

Creating categorical variables from mutually exclusive dummy variables

Update (2019): Please use dplyr::coalesce(), it works pretty much the same.

My R package has a convenience function that lets you choose the first non-NA value for each element in a list of vectors:

#library(devtools)
#install_github('kimisc', 'muelleki')
library(kimisc)

df$factor1 <- with(df, coalesce.na(conditionA, conditionB))

(I'm not sure if this works if conditionA and conditionB are factors. If necessary, convert them to numerics first with as.numeric(as.character(...)).)

Otherwise, you could give interaction a try, combined with recoding of the levels of the resulting factor -- but to me it looks like you're more interested in the first solution:

df$conditionAB <- with(df, interaction(coalesce.na(conditionA, 0), 
                                       coalesce.na(conditionB, 0)))
levels(df$conditionAB) <- c('A', 'B')

Convert categorical variable into binary columns in R

Try this:

library(dplyr)
library(tidyr)
df %>%
  separate_rows(answer_openq, sep = ',') %>%
  pivot_wider(names_from = answer_openq, values_from = answer_openq, 
              values_fn = function(x) 1, values_fill = 0)
# A tibble: 4 × 5
  respondent     a     c     b     d
       <int> <dbl> <dbl> <dbl> <dbl>
1          1     1     0     0     0
2          2     1     1     0     0
3          3     0     0     1     0
4          4     1     0     0     1
