Create Several Dummy Variables from One String Variable

Create dummy variables from string with multiple values

Overview

Creating dummy variables for each unique value in good_at requires the following steps:

  • Separate good_at into multiple rows
  • Generate dummy variables - using dummy::dummy() - for each value in good_at for each name-sex pair
  • Reshape data into 4 columns: name, sex, key and value
    • key contains all the dummy variable column names
    • value contains the values in each dummy variable
  • Keep only records where value is not zero
  • Reshape data into one record per name-sex pair and as many columns as there are in key
  • Cast the dummy columns to logical vectors

Code

# load necessary packages ----
library(dummy)
library(tidyverse)

# load necessary data ----
df <-
read.table(text = "name sex good_at
1 Tom M Drawing;Hiking
2 Mary F Cooking;Joking
3 Sam M Running
4 Charlie M Swimming"
, header = TRUE
, stringsAsFactors = FALSE)

# create a longer version of df -----
# where one record represents
# one unique name, sex, good_at value
df_clean <-
df %>%
separate_rows(good_at, sep = ";")

# create dummy variables for all unique values in "good_at" column ----
df_dummies <-
df_clean %>%
select(good_at) %>%
dummy() %>%
bind_cols(df_clean) %>%
# drop "good_at" column
select(-good_at) %>%
# make the tibble long by reshaping it into 4 columns:
# name, sex, key and value
# where key holds all the dummy variable column names
# and value are the values in each dummy variable
gather(key, value, -name, -sex) %>%
# keep records where
# value is not equal to zero
# note: this is due to "Tom" having both a
# "good_at_Drawing" value of 0 and 1.
filter(value != 0) %>%
# make the tibble wide
# with one record per name-sex pair
# and as many columns as there are in key
# with their values from value
# and filling NA values to 0
spread(key, value, fill = 0) %>%
# for each name-sex pair
# cast the dummy variables into logical vectors
group_by(name, sex) %>%
mutate_all(~ as.logical(as.integer(.))) %>%
ungroup() %>%
# just for safety let's join
# the original "good_at" column
left_join(y = df, by = c("name", "sex")) %>%
# bring the original "good_at" column to the left-hand side
# of the tibble
select(name, sex, good_at, matches("good_at_"))

# view result ----
df_dummies
# A tibble: 4 x 9
# name sex good_at good_at_Cooking good_at_Drawing good_at_Hiking
# <chr> <chr> <chr> <lgl> <lgl> <lgl>
# 1 Char… M Swimmi… FALSE FALSE FALSE
# 2 Mary F Cookin… TRUE FALSE FALSE
# 3 Sam M Running FALSE FALSE FALSE
# 4 Tom M Drawin… FALSE TRUE TRUE
# ... with 3 more variables: good_at_Joking <lgl>, good_at_Running <lgl>,
# good_at_Swimming <lgl>

# end of script #
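
The same steps can also be sketched in base R, without the dummy package or the reshaping round trip. This is only a minimal sketch under the assumption that the values in good_at are separated by ";" and contain no surrounding whitespace; the names skills, levels_found, dummies and result are illustrative and not part of the original answer.

```r
# same example data as above, built directly
df <- data.frame(
  name = c("Tom", "Mary", "Sam", "Charlie"),
  sex = c("M", "F", "M", "M"),
  good_at = c("Drawing;Hiking", "Cooking;Joking", "Running", "Swimming"),
  stringsAsFactors = FALSE
)

# split each cell into its individual values
skills <- strsplit(df$good_at, ";", fixed = TRUE)

# sorted vector of every distinct value
levels_found <- sort(unique(unlist(skills)))

# for each distinct value, test whether it occurs in each row's values
dummies <- vapply(
  levels_found,
  function(v) vapply(skills, function(s) v %in% s, logical(1)),
  logical(nrow(df))
)
colnames(dummies) <- paste0("good_at_", levels_found)

# one row per person, original columns first, logical dummies after
result <- cbind(df, as.data.frame(dummies))
result
```

Unlike the pipeline above, this keeps the rows in their original order rather than sorting them by name.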

Create several dummy variables from one string variable

Because your data are concatenated character strings (not concatenated numeric values), you need to add type = "character" for the function to work as you expect.

The function's default setting is for numeric values, hence the error about NaN and so on.

The naming has been made more consistent with the short forms of the other functions in the same family. Thus, it is now cSplit_e (though the old function name would still work).

library(splitstackshape)
cSplit_e(profs, "teaches", ",", type = "character", fill = 0)
# teaches teaches_1st teaches_2nd teaches_3rd
# 1 1st 1 0 0
# 2 1st, 2nd 1 1 0
# 3 2nd, 3rd 0 1 1
# 4 1st, 2nd, 3rd 1 1 1

The help page for ?concat.split.expanded is the same as that of cSplit_e. If you have any tips on making it clearer to understand, please raise an issue at the package's GitHub page.
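
To see why the numeric default cannot work on this kind of data, here is a small base R sketch. It assumes a data frame profs reconstructed from the printed output above (the original data were not shown), and it is not how cSplit_e is implemented internally; it only illustrates the difference between treating the pieces as numbers and treating them as labels.

```r
# assumed reconstruction of the example data
profs <- data.frame(
  teaches = c("1st", "1st, 2nd", "2nd, 3rd", "1st, 2nd, 3rd"),
  stringsAsFactors = FALSE
)

# split on commas, dropping the space that follows each comma
parts <- strsplit(profs$teaches, ",\\s*")

# treating the pieces as numbers yields only NA (with a coercion warning)
as_numbers <- suppressWarnings(as.numeric(unlist(parts)))

# treating them as labels works: one 0/1 column per distinct label
vals <- sort(unique(unlist(parts)))
dummies <- t(vapply(parts, function(p) as.integer(vals %in% p),
                    integer(length(vals))))
colnames(dummies) <- paste0("teaches_", vals)
cbind(profs, dummies)
```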

Dummy variables from a string variable

UPDATE

The function mentioned here has now been moved to a package available on CRAN called "splitstackshape". The version on CRAN is considerably faster than this original version. The speeds should be similar to what you would get with the direct for loop solution at the end of this answer. See @Ricardo's answer for detailed benchmarks.

Install it, and use concat.split.expanded to get the desired result:

library(splitstackshape)
concat.split.expanded(DF, "B", fill = 0, drop = TRUE)
# A B_01 B_02 B_03 B_04 B_05 B_06 B_07 B_08 B_09 B_10
# 1 1 1 1 1 0 0 0 0 0 0 0
# 2 2 1 1 1 0 0 1 0 0 0 0
# 3 3 1 1 1 0 1 0 1 0 0 0
# 4 4 0 1 1 1 1 1 1 0 0 0
# 5 5 0 0 1 1 1 1 1 0 0 1

Original post

A while ago, I wrote a function that does not just this sort of splitting, but others as well. The function, named concat.split(), can be found here.

The usage, for your example data, would be:

## Keeping the original column
concat.split(DF, "B", structure="expanded")
# A B B_1 B_2 B_3 B_4 B_5 B_6 B_7 B_8 B_9 B_10
# 1 1 1,3,2 1 1 1 NA NA NA NA NA NA NA
# 2 2 2,1,3,6 1 1 1 NA NA 1 NA NA NA NA
# 3 3 3,2,5,1,7 1 1 1 NA 1 NA 1 NA NA NA
# 4 4 3,7,4,2,6,5 NA 1 1 1 1 1 1 NA NA NA
# 5 5 4,10,7,3,5,6 NA NA 1 1 1 1 1 NA NA 1

## Dropping the original column
concat.split(DF, "B", structure="expanded", drop.col=TRUE)
# A B_1 B_2 B_3 B_4 B_5 B_6 B_7 B_8 B_9 B_10
# 1 1 1 1 1 NA NA NA NA NA NA NA
# 2 2 1 1 1 NA NA 1 NA NA NA NA
# 3 3 1 1 1 NA 1 NA 1 NA NA NA
# 4 4 NA 1 1 1 1 1 1 NA NA NA
# 5 5 NA NA 1 1 1 1 1 NA NA 1

Recoding NA to 0 has to be done manually--perhaps I'll update the function to add an option to do so, and at the same time, implement one of these faster solutions :)

temp <- concat.split(DF, "B", structure="expanded", drop.col=TRUE)
temp[is.na(temp)] <- 0
temp
# A B_1 B_2 B_3 B_4 B_5 B_6 B_7 B_8 B_9 B_10
# 1 1 1 1 1 0 0 0 0 0 0 0
# 2 2 1 1 1 0 0 1 0 0 0 0
# 3 3 1 1 1 0 1 0 1 0 0 0
# 4 4 0 1 1 1 1 1 1 0 0 0
# 5 5 0 0 1 1 1 1 1 0 0 1

Update

Most of the overhead in the concat.split function probably comes in things like converting from a matrix to a data.frame, renaming the columns, and so on. The actual code used to do the splitting is a GASP for loop, but test it out, and you'll find that it performs pretty well:

b = strsplit(DF$B, ",")
ncol = max(as.numeric(unlist(b)))
temp = lapply(b, as.numeric)
## Set up an empty matrix
m = matrix(0, nrow = nrow(DF), ncol = ncol)
## Fill it in
for (i in 1:nrow(DF)) {
m[i, temp[[i]]] = 1
}
## View your result
m
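
If you would rather avoid the explicit loop, the same matrix can be filled in one step with matrix indexing. This is a minimal sketch, assuming DF holds the example data shown in the printed output earlier in this answer; idx is an illustrative name.

```r
# example data as shown in the output above
DF <- data.frame(
  A = 1:5,
  B = c("1,3,2", "2,1,3,6", "3,2,5,1,7", "3,7,4,2,6,5", "4,10,7,3,5,6"),
  stringsAsFactors = FALSE
)

# split and convert each row's values to numbers
b <- lapply(strsplit(DF$B, ",", fixed = TRUE), as.numeric)

# empty matrix with one column per possible value
m <- matrix(0, nrow = nrow(DF), ncol = max(unlist(b)))

# two-column index: row number repeated once per value, and the value itself
idx <- cbind(rep(seq_along(b), lengths(b)), unlist(b))
m[idx] <- 1
m
```

Each row of idx names one (row, column) cell, so a single assignment sets every cell that the loop would have set.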

Create a dummy indicating the presence of a string fragment in any of multiple variables

A base R solution:

cols = endsWith(names(df),"line")
df['Apartment_dummy'] = as.integer(grepl('apartment',do.call(paste,df[cols])))

Now we can write a function that also takes the data to be used as an argument:

premises_dummy = function(df, varname, strings){
  # columns whose names end in "line"
  cols = endsWith(names(df),"line")
  # paste those columns together row by row and search the combined string
  df[varname] = as.integer(grepl(strings,do.call(paste,df[cols])))
  df
}
premises_dummy(df, varname = 'Apartment_dummy', strings = 'apartment')
address.1.line address.2.line address.3.line Apartment_dummy
1 apartment 5 london 1
2 25 spring street new york 0
3 nice house apartment 2 paris 1
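
For a version you can run end to end, here is a self-contained sketch. The sample data are an assumption pieced together from the printed output (in particular, which address cell is empty in the first two rows is a guess), and flag_fragment is a hypothetical helper name, not the function from the answer.

```r
# assumed sample data; empty strings stand in for missing address lines
addresses <- data.frame(
  address.1.line = c("apartment 5", "25 spring street", "nice house"),
  address.2.line = c("london", "new york", "apartment 2"),
  address.3.line = c("", "", "paris"),
  stringsAsFactors = FALSE
)

# hypothetical helper: flag rows where any "*line" column contains the fragment
flag_fragment <- function(data, varname, fragment) {
  cols <- endsWith(names(data), "line")
  # paste the selected columns together row by row, then search once
  combined <- do.call(paste, data[cols])
  data[[varname]] <- as.integer(grepl(fragment, combined, fixed = TRUE))
  data
}

out <- flag_fragment(addresses, "Apartment_dummy", "apartment")
out
```

Using fixed = TRUE treats the fragment as a literal string; drop it if you want regular-expression matching instead.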

create dummy variable from string variable

Use str_detect() from the stringr package:

library(stringr)

as.integer(str_detect(infringements$Title,"Delegated"))
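
If you prefer not to load stringr, grepl() gives a similar result in base R. This is a sketch with made-up data: the infringements data frame and its Title values below are purely illustrative. One difference to keep in mind is that str_detect() returns NA for a missing title, whereas grepl() returns FALSE, so missing titles become 0 here.

```r
# illustrative data; the real infringements data were not shown
infringements <- data.frame(
  Title = c("Delegated Regulation 1", "Directive 2", "Delegated act 3", NA),
  stringsAsFactors = FALSE
)

# fixed = TRUE matches the literal text "Delegated" rather than a regex
infringements$delegated <- as.integer(
  grepl("Delegated", infringements$Title, fixed = TRUE)
)
infringements
```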

Generating dummy variable based on two string variables

Your data example can be simplified to

sysuse auto 
egen var1 = concat(mpg foreign), punct(" ")
egen var2 = concat(trunk foreign), punct(" ")

as mapping to string in this instance is not needed for mpg trunk any more than it was needed for foreign. concat() maps to string on the fly, and the only issues with numeric variables (neither applying here) are if fractional parts are present or you want to see value labels.

Now that it is confirmed that multiple words can be present, we can work with a slightly more interesting example.

Here are two methods. One is to loop over the words in one variable and also the words in the other variable to check if there are any matches.

Stata's definition of a word here is that words are delimited by spaces. That being so, we can check for the occurrence of " word " within " variable ", where the leading and trailing spaces are needed because in say "frog toad newt" neither "frog" nor "newt" occurs with both leading and trailing spaces. In the OP's example the check may not be needed, but it often is, just as a search for "1" or "2" or "3" finds any of those within "11 12 13", which is wrong if you seek any as a word and not as a single character.

More is said on search for words within strings in a paper in press at the Stata Journal and likely to appear in 22(4) 2022.

* Example generated by -dataex-. For more info, type help dataex
clear
input str8 var1 str5 var2
"FR DE" "FR"
"FR DE GB" "GB"
"GB" "FR"
"IT FR" "GB DE"
end

gen wc = wordcount(var1)
su wc, meanonly
local max1 = r(max)
replace wc = wordcount(var2)
su wc, meanonly
local max2 = r(max)
drop wc

gen match = 0

quietly forval i = 1/`max1' {
forval j = 1/`max2' {
replace match = 1 if word(var1, `i') == word(var2, `j') & word(var1, `i') != ""
}
}

gen MATCH = 0

forval i = 1/`max1' {
replace MATCH = 1 if strpos(" " + var2 + " ", " " + word(var1, `i') + " ")
}

list

+----------------------------------+
| var1 var2 match MATCH |
|----------------------------------|
1. | FR DE FR 1 1 |
2. | FR DE GB GB 1 1 |
3. | GB FR 0 0 |
4. | IT FR GB DE 0 0 |
+----------------------------------+

EDIT

replace MATCH = 1 if strpos(" " + var2 + " ", " " + word(var1, `i') + " ")  & !missing(var1, var2)

is better code to avoid the uninteresting match of " " with " ".
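
For readers working in R rather than Stata, the same whole-word comparison can be sketched by splitting both variables on spaces and checking whether the two sets of words intersect. This is only an illustrative translation using the example data above; dat, words1 and words2 are names chosen for the sketch.

```r
# same example data as in the Stata listing
dat <- data.frame(
  var1 = c("FR DE", "FR DE GB", "GB", "IT FR"),
  var2 = c("FR", "GB", "FR", "GB DE"),
  stringsAsFactors = FALSE
)

# split each string into words (one or more spaces as delimiter)
words1 <- strsplit(dat$var1, " +")
words2 <- strsplit(dat$var2, " +")

# 1 if the two word sets share at least one word, 0 otherwise
dat$match <- as.integer(
  mapply(function(a, b) length(intersect(a, b)) > 0, words1, words2)
)
dat
```

Because whole words are compared, "1" would not match "11", and an empty string splits into no words at all, so two empty values never count as a match.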

R: create dummy variables based on a categorical variable *of lists*

We can use mtabulate

library(qdapTools)
cbind(df[1], mtabulate(df$y))
# x A B C D E
#1 1 1 0 0 0 0
#2 2 1 1 0 0 0
#3 3 0 0 1 0 0
#4 4 0 1 1 1 0
#5 5 0 0 0 0 1
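
If installing qdapTools is not an option, a similar table can be built in base R. This is a minimal sketch, assuming df has a list column y whose contents match the output above (the original data were not shown); lev and counts are illustrative names.

```r
# assumed data: a list column with one or more letters per row
df <- data.frame(x = 1:5)
df$y <- list("A", c("A", "B"), "C", c("B", "C", "D"), "E")

# every distinct value, sorted, defines the columns
lev <- sort(unique(unlist(df$y)))

# count each value per row; factor() with fixed levels keeps zero counts
counts <- t(vapply(
  df$y,
  function(v) as.integer(table(factor(v, levels = lev))),
  integer(length(lev))
))
colnames(counts) <- lev

out <- cbind(df["x"], counts)
out
```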

Creating dummy variables in R based on multiple chr values within each cell

Here's one solution:

# make some toy data to test
set.seed(1)
df <- data.frame(Score = rnorm(10),
Tags = replicate(10, paste(sample(LETTERS, 5), collapse = ", ")),
stringsAsFactors = FALSE)

# load stringr, which we'll use to trim whitespace from the split-up tags
library(stringr)

# use strsplit to break your jumbles of tags into separate elements, with a
# list for each position in the original vector. i've split on commas here,
# but you'll probably want to split on slashes or slashes and quotation marks.
t <- strsplit(df$Tags, split = ",")

# get a vector of the unique elements of those lists. you may need to use str_trim
# or something like it to cut leading and trailing whitespace. you might also
# need to use stringr's `str_subset` and a regular expression to cut the result
# down to, say, only alphanumeric strings. without a reproducible example, though,
# i can't do that for your specific case here.
tags <- unique(str_trim(unlist(t)))

# now, use `sapply` and `grepl` to look for each element of `tags` in each list;
# use `any` to summarize the results;
# use `+` to convert those summaries to binary;
# use `lapply` to iterate that process over all elements of `tags`;
# use `Reduce(cbind, ...)` to collapse the results into a table; and
# use `as.data.frame` to turn that table into a df.
df2 <- as.data.frame(Reduce(cbind, lapply(tags, function(i) sapply(t, function(j) +(any(grepl(i, j), na.rm = TRUE))))))

# assign the tags as column names
names(df2) <- tags

Voila:

> df2
Y F P C Z K A J U H M O L E S R T Q V B I X G
1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0 0 0 0
4 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0
5 0 0 0 0 0 0 0 0 0 0 1 0 1 0 1 0 0 1 1 0 0 0 0
6 0 0 1 0 0 0 1 0 1 0 0 0 1 0 0 0 0 1 0 0 0 0 0
7 0 1 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 1 1 0 0 0
8 0 0 1 1 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 1 0 0
9 1 0 0 0 0 0 0 0 0 1 0 1 1 0 0 0 0 0 0 0 0 1 0
10 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 1 0 0 1 0 0 1
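
One caveat: grepl() does partial, regular-expression matching, so with longer tags a short tag such as "R" would also be found inside "Rcpp". If your tags can overlap like that, an exact comparison with %in% is safer. A minimal sketch with illustrative tags (tags_raw is made up for this example):

```r
# illustrative tags where one tag is a substring of another
tags_raw <- c("Rcpp", "R, Python", "Python")

# split on commas and trim the whitespace around each tag
split_tags <- lapply(strsplit(tags_raw, ",", fixed = TRUE), trimws)
tags <- unique(unlist(split_tags))

# exact membership test per tag and per row, converted to 0/1 with `+`
df2 <- as.data.frame(vapply(
  tags,
  function(tg) +vapply(split_tags, function(s) tg %in% s, logical(1)),
  integer(length(split_tags))
))
df2
```

Here the first row is flagged for "Rcpp" only, whereas a grepl("R", ...) test would also flag it for "R".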

