Reconstruct a Categorical Variable from Dummies in R

You can do this with data.table:

library(data.table)

id_cols <- c("x1", "x2")  # columns that identify each row
melt(data = dt, id.vars = id_cols,
     measure.vars = patterns("dummy"),  # every column whose name matches "dummy"
     na.rm = TRUE)

Example:

dt <- data.table(dummy_a = c(1, 0, 0), dummy_b = c(0, 1, 0), dummy_c = c(0, 0, 1), id = c(1, 2, 3))
melt(data = dt,
     id.vars = "id",
     measure.vars = patterns("dummy_"),
     na.rm = TRUE)[value == 1, .(id, variable)]  # keep only the dummy that is switched on

Output:

   id variable
1:  1  dummy_a
2:  2  dummy_b
3:  3  dummy_c

It's even easier if you replace 0 with NA: na.rm = TRUE in melt then drops every row that holds an NA, so no filter on value is needed.
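As a rough illustration of the NA idea, here is a minimal sketch on a made-up table. The column names, the `dummy_cols` helper and the `category` column are assumptions for the example, not part of the original answer.

```r
library(data.table)

# Hypothetical example table: one dummy column per category plus an id.
dt <- data.table(id = 1:3,
                 dummy_a = c(1, 0, 0),
                 dummy_b = c(0, 1, 0),
                 dummy_c = c(0, 0, 1))

dummy_cols <- grep("^dummy_", names(dt), value = TRUE)

# Recode 0 as NA in every dummy column (set() modifies dt by reference).
for (col in dummy_cols) {
  set(dt, i = which(dt[[col]] == 0), j = col, value = NA_real_)
}

# With the zeros gone, na.rm = TRUE leaves one row per id.
long <- melt(dt, id.vars = "id", measure.vars = dummy_cols, na.rm = TRUE)

# Strip the prefix to recover the category label.
long[, category := sub("^dummy_", "", variable)]
result <- long[order(id), .(id, category)]
print(result)
```

Note that `set()` changes `dt` in place, so work on `copy(dt)` if the original 0/1 coding is still needed.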

How to reconstruct a categorical variable with multiple choices

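library(dplyr)
library(tidyr)

df_old <- read.table(text = "a1 a2 a3 a4 a5 a6 a7
0 0 1 1 0 1 0
1 1 1 0 0 0 0
0 1 0 0 1 0 1", header = TRUE)

df_old %>% mutate(rowid = row_number()) %>%
  pivot_longer(!rowid) %>%
  filter(value != 0) %>%  # keep only the selected options
  group_by(rowid) %>%
  mutate(choice = paste0('choice', seq_len(max(rowSums(df_old))))) %>%  # assumes every row has the same number of 1s
  pivot_wider(id_cols = rowid, names_from = choice, values_from = name) %>%
  select(-rowid)  # rowid is the grouping variable, so dplyr keeps it; ungroup() first to drop it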

# A tibble: 3 x 4
# Groups:   rowid [3]
  rowid choice1 choice2 choice3
  <int> <chr>   <chr>   <chr>
1     1 a3      a4      a6
2     2 a1      a2      a3
3     3 a2      a5      a7
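The pipeline above labels the choices with seq_len(max(rowSums(df_old))), which fits only when every row has the same number of selected options. For rows with differing counts, a base R sketch that pads shorter rows with NA might look like the following. The `df_multi` data and the `picked`, `padded` and `choices` names are invented for the illustration.

```r
# Hypothetical data: rows select 2, 1 and 3 options respectively.
df_multi <- read.table(text = "a1 a2 a3 a4
0 1 1 0
1 0 0 0
1 1 0 1", header = TRUE)

# For each row, collect the names of the columns that are non-zero.
picked <- lapply(seq_len(nrow(df_multi)), function(i) {
  names(df_multi)[unlist(df_multi[i, ]) != 0]
})

# Pad shorter rows with NA so every row has the same number of entries.
width <- max(lengths(picked))
padded <- lapply(picked, function(x) c(x, rep(NA_character_, width - length(x))))

# Bind the padded rows into one data frame with choice1, choice2, ... columns.
choices <- as.data.frame(do.call(rbind, padded), stringsAsFactors = FALSE)
names(choices) <- paste0("choice", seq_len(width))
print(choices)
```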

Convert various dummy/logical variables into a single categorical variable/factor from their name in R

Try:

library(dplyr)
library(tidyr)

df %>% gather(type, value, -id) %>% na.omit() %>% select(-value) %>% arrange(id)

Which gives:

#  id       type
#1 1 conditionA
#2 2 conditionB
#3 3 conditionC
#4 4 conditionD
#5 5 conditionA

Update

To handle the case you detailed in the comments, you could do the operation on the desired portion of the data frame and then left_join() the other columns:

df %>%
  select(starts_with("condition"), id) %>%
  gather(type, value, -id) %>%
  na.omit() %>%
  select(-value) %>%
  left_join(df %>% select(-starts_with("condition")), by = "id") %>%  # bring back the non-condition columns
  arrange(id)
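To make the reshape-then-join step concrete, here is a small sketch on invented data. The `age` column and the values are assumptions. It uses pivot_longer(), which the tidyr documentation recommends over the superseded gather(), rather than the exact calls from the answer.

```r
library(dplyr)
library(tidyr)

# Hypothetical data: one extra column (age) besides id and the dummies.
df <- data.frame(id = 1:3,
                 age = c(31, 45, 27),
                 conditionA = c(1, NA, NA),
                 conditionB = c(NA, 1, NA),
                 conditionC = c(NA, NA, 1))

# Reshape only the condition columns; values_drop_na removes the NA cells.
long <- df %>%
  select(id, starts_with("condition")) %>%
  pivot_longer(-id, names_to = "type", values_to = "value",
               values_drop_na = TRUE) %>%
  select(-value)

# Join the remaining columns back on id.
result <- long %>%
  left_join(select(df, -starts_with("condition")), by = "id") %>%
  arrange(id)
print(result)
```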

Using dplyr to gather dummy variables

This can be done using the 'tidyverse' library, specifically 'tidyr' and 'dplyr'. The following code produces the output you are after.

library(tidyverse)
type %>% gather(TypeOfCar, Count) %>% filter(Count >= 1) %>% select(TypeOfCar)

Output:

  TypeOfCar
  <chr>
1 convertible
2 convertible
3 convertible
4 convertible
5 coupe
6 sedan
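Since gather() is superseded in current tidyr, the same idea can be sketched with pivot_longer(). The `type` data frame below is a made-up stand-in for the one in the question, and `cars` is a name chosen for the example. Unlike gather(), pivot_longer() walks the data row by row, so the result comes out in row order rather than grouped by column.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for the `type` data frame of 0/1 indicator columns.
type <- data.frame(convertible = c(1, 0, 1),
                   coupe       = c(0, 1, 0),
                   sedan       = c(0, 0, 0))

cars <- type %>%
  pivot_longer(everything(), names_to = "TypeOfCar", values_to = "Count") %>%
  filter(Count >= 1) %>%   # keep only the indicator that is switched on
  select(TypeOfCar)
print(cars)
```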

Hopefully this solves your problem, let me know if any changes are needed! Thanks.

Reconstruct a categorical variable from dummies in pandas

In [46]: s = Series(list('aaabbbccddefgh')).astype('category')

In [47]: s
Out[47]:
0 a
1 a
2 a
3 b
4 b
5 b
6 c
7 c
8 d
9 d
10 e
11 f
12 g
13 h
dtype: category
Categories (8, object): [a < b < c < d < e < f < g < h]

In [48]: df = pd.get_dummies(s)

In [49]: df
Out[49]:
    a  b  c  d  e  f  g  h
0   1  0  0  0  0  0  0  0
1   1  0  0  0  0  0  0  0
2   1  0  0  0  0  0  0  0
3   0  1  0  0  0  0  0  0
4   0  1  0  0  0  0  0  0
5   0  1  0  0  0  0  0  0
6   0  0  1  0  0  0  0  0
7   0  0  1  0  0  0  0  0
8   0  0  0  1  0  0  0  0
9   0  0  0  1  0  0  0  0
10  0  0  0  0  1  0  0  0
11  0  0  0  0  0  1  0  0
12  0  0  0  0  0  0  1  0
13  0  0  0  0  0  0  0  1

In [50]: x = df.stack()

# I don't think you actually need to specify ALL of the categories here, as by definition
# they are in the dummy matrix to start (and hence the column index)
In [51]: Series(pd.Categorical(x[x!=0].index.get_level_values(1)))
Out[51]:
0 a
1 a
2 a
3 b
4 b
5 b
6 c
7 c
8 d
9 d
10 e
11 f
12 g
13 h
Name: level_1, dtype: category
Categories (8, object): [a < b < c < d < e < f < g < h]

So I think we need a function to do this, as it seems to be a natural operation. Maybe get_categories().

Creating categorical variables from mutually exclusive dummy variables

Update (2019): Please use dplyr::coalesce(), it works pretty much the same.

My R package has a convenience function that lets you choose the first non-NA value for each element in a list of vectors:

#library(devtools)
#install_github('kimisc', 'muelleki')
library(kimisc)

df$factor1 <- with(df, coalesce.na(conditionA, conditionB))

(I'm not sure whether this works if conditionA and conditionB are factors. If necessary, convert them to numeric first with as.numeric(as.character(...)).)

Otherwise, you could give interaction a try, combined with recoding of the levels of the resulting factor -- but to me it looks like you're more interested in the first solution:

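# Assumes every row has exactly one of the two conditions set.
df$conditionAB <- with(df, interaction(coalesce.na(conditionA, 0),
                                       coalesce.na(conditionB, 0),
                                       drop = TRUE))  # drop unobserved combinations so only two levels remain
levels(df$conditionAB) <- c('A', 'B')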
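Following the 2019 update, a minimal sketch with dplyr::coalesce() could look like this. The data and the coding (conditionA marked with 1, conditionB with 2, NA otherwise) are assumptions made for the example, since the original data frame is not shown.

```r
library(dplyr)

# Hypothetical data: each row has exactly one non-NA condition.
df <- data.frame(conditionA = c(1, NA, 1, NA),
                 conditionB = c(NA, 2, NA, 2))

# coalesce() returns the first non-NA value at each position.
df$factor1 <- coalesce(df$conditionA, df$conditionB)

# Optionally turn the codes into a labelled factor.
df$condition <- factor(df$factor1, levels = c(1, 2), labels = c("A", "B"))
print(df)
```

If both columns used the same marker (for example 1 in each), the coalesced value alone would not say which condition it came from, so the columns would need to be recoded to distinct values first.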

Convert categorical variable into binary columns in R

Try this:

library(dplyr)
library(tidyr)
df %>%
  separate_rows(answer_openq, sep = ',') %>%
  pivot_wider(names_from = answer_openq, values_from = answer_openq,
              values_fn = function(x) 1, values_fill = 0)
# A tibble: 4 × 5
  respondent     a     c     b     d
       <int> <dbl> <dbl> <dbl> <dbl>
1          1     1     0     0     0
2          2     1     1     0     0
3          3     0     0     1     0
4          4     1     0     0     1

