Create dummy variables from string with multiple values
Overview
Creating dummy variables for each unique value in good_at requires the following steps:
- Separate good_at into multiple rows
- Generate dummy variables, using dummy::dummy(), for each value in good_at for each name-sex pair
- Reshape data into 4 columns: name, sex, key and value
  - key contains all the dummy variable column names
  - value contains the values in each dummy variable
- Keep only records where value is not zero
- Reshape data into one record per name-sex pair and as many columns as there are in key
- Cast the dummy columns as logical vectors
Code
# load necessary packages ----
library(dummy)
library(tidyverse)
# load necessary data ----
df <-
read.table(text = "name sex good_at
1 Tom M Drawing;Hiking
2 Mary F Cooking;Joking
3 Sam M Running
4 Charlie M Swimming"
, header = TRUE
, stringsAsFactors = FALSE)
# create a longer version of df -----
# where one record represents
# one unique name, sex, good_at value
df_clean <-
df %>%
separate_rows(good_at, sep = ";")
# create dummy variables for all unique values in "good_at" column ----
df_dummies <-
df_clean %>%
select(good_at) %>%
dummy() %>%
bind_cols(df_clean) %>%
# drop "good_at" column
select(-good_at) %>%
# make the tibble long by reshaping it into 4 columns:
# name, sex, key and value
# where key are the all dummy variable column names
# and value are the values in each dummy variable
gather(key, value, -name, -sex) %>%
# keep records where
# value is not equal to zero
# note: this is due to "Tom" having both a
# "good_at_Drawing" value of 0 and 1.
filter(value != 0) %>%
# make the tibble wide
# with one record per name-sex pair
# and as many columns as there are in key
# with their values from value
# and filling NA values to 0
spread(key, value, fill = 0) %>%
# for each name-sex pair
# cast the dummy variables into logical vectors
# (a formula is used here because funs() is deprecated in current dplyr)
group_by(name, sex) %>%
mutate_all(~ as.logical(as.integer(.))) %>%
ungroup() %>%
# just for safety let's join
# the original "good_at" column
left_join(y = df, by = c("name", "sex")) %>%
# bring the original "good_at" column to the left-hand side
# of the tibble
select(name, sex, good_at, matches("good_at_"))
# view result ----
df_dummies
# A tibble: 4 x 9
# name sex good_at good_at_Cooking good_at_Drawing good_at_Hiking
# <chr> <chr> <chr> <lgl> <lgl> <lgl>
# 1 Char… M Swimmi… FALSE FALSE FALSE
# 2 Mary F Cookin… TRUE FALSE FALSE
# 3 Sam M Running FALSE FALSE FALSE
# 4 Tom M Drawin… FALSE TRUE TRUE
# ... with 3 more variables: good_at_Joking <lgl>, good_at_Running <lgl>,
# good_at_Swimming <lgl>
# end of script #
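As a cross-check, the same separate-then-flag idea can be sketched in base R alone, without the dummy or tidyverse packages. This is only an illustration under the assumption that exact matches on the ";"-separated values are wanted; the names parts, values, dummies and df_base are introduced here and are not part of the script above. Rows stay in their original order rather than being sorted by name.

```r
# same example data as in the script above
df <- data.frame(
  name = c("Tom", "Mary", "Sam", "Charlie"),
  sex = c("M", "F", "M", "M"),
  good_at = c("Drawing;Hiking", "Cooking;Joking", "Running", "Swimming"),
  stringsAsFactors = FALSE
)

# split each good_at string into its individual values
parts <- strsplit(df$good_at, ";", fixed = TRUE)

# every unique value becomes one dummy column
values <- sort(unique(unlist(parts)))

# one row per record, TRUE where the record contains the value
dummies <- do.call(rbind, lapply(parts, function(p) values %in% p))
colnames(dummies) <- paste0("good_at_", values)

# attach the logical dummy columns to the original data
df_base <- cbind(df, dummies)
df_base
```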
Create several dummy variables from one string variable
Since your data are concatenated character strings (not concatenated numeric values), you'll need to add type = "character" to get the function to work as you expect.
The function's default setting is for numeric values, hence the error about NaN and so on.
The naming has been made more consistent with the short forms of the other functions in the same family. Thus, it is now cSplit_e (though the old function name would still work).
library(splitstackshape)
cSplit_e(profs, "teaches", ",", type = "character", fill = 0)
#         teaches teaches_1st teaches_2nd teaches_3rd
# 1           1st           1           0           0
# 2      1st, 2nd           1           1           0
# 3      2nd, 3rd           0           1           1
# 4 1st, 2nd, 3rd           1           1           1
The help page for ?concat.split.expanded is the same as that of cSplit_e. If you have any tips on making it clearer to understand, please raise an issue at the package's GitHub page.
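To see why the values have to be treated as character, here is a rough base R sketch of the same expansion. The profs data frame is not shown in the answer, so the one below is reconstructed from the printed output and is an assumption; lvls, as_num, indicators and out are likewise names made up for this illustration. It does not reproduce the internals of cSplit_e, only the shape of the result.

```r
# assumed input, reconstructed from the printed output above
profs <- data.frame(
  teaches = c("1st", "1st, 2nd", "2nd, 3rd", "1st, 2nd, 3rd"),
  stringsAsFactors = FALSE
)

# split on commas, dropping the whitespace that follows them
parts <- strsplit(profs$teaches, ",\\s*")
lvls <- sort(unique(unlist(parts)))

# values such as "1st" cannot be coerced to numbers, which is
# consistent with a numeric default failing on this kind of data
as_num <- suppressWarnings(as.numeric(lvls))

# one 0/1 indicator column per distinct character value
indicators <- do.call(rbind, lapply(parts, function(p) as.integer(lvls %in% p)))
colnames(indicators) <- paste0("teaches_", lvls)

out <- cbind(profs, indicators)
out
```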
Dummy variables from a string variable
UPDATE
The function mentioned here has now been moved to a package available on CRAN called "splitstackshape". The version on CRAN is considerably faster than this original version. The speeds should be similar to what you would get with the direct for loop solution at the end of this answer. See @Ricardo's answer for detailed benchmarks.
Install it, and use concat.split.expanded to get the desired result:
library(splitstackshape)
concat.split.expanded(DF, "B", fill = 0, drop = TRUE)
# A B_01 B_02 B_03 B_04 B_05 B_06 B_07 B_08 B_09 B_10
# 1 1 1 1 1 0 0 0 0 0 0 0
# 2 2 1 1 1 0 0 1 0 0 0 0
# 3 3 1 1 1 0 1 0 1 0 0 0
# 4 4 0 1 1 1 1 1 1 0 0 0
# 5 5 0 0 1 1 1 1 1 0 0 1
Original post
A while ago, I wrote a function that does not just this sort of splitting, but others as well. The function, named concat.split(), can be found here.
The usage, for your example data, would be:
## Keeping the original column
concat.split(DF, "B", structure="expanded")
# A B B_1 B_2 B_3 B_4 B_5 B_6 B_7 B_8 B_9 B_10
# 1 1 1,3,2 1 1 1 NA NA NA NA NA NA NA
# 2 2 2,1,3,6 1 1 1 NA NA 1 NA NA NA NA
# 3 3 3,2,5,1,7 1 1 1 NA 1 NA 1 NA NA NA
# 4 4 3,7,4,2,6,5 NA 1 1 1 1 1 1 NA NA NA
# 5 5 4,10,7,3,5,6 NA NA 1 1 1 1 1 NA NA 1
## Dropping the original column
concat.split(DF, "B", structure="expanded", drop.col=TRUE)
# A B_1 B_2 B_3 B_4 B_5 B_6 B_7 B_8 B_9 B_10
# 1 1 1 1 1 NA NA NA NA NA NA NA
# 2 2 1 1 1 NA NA 1 NA NA NA NA
# 3 3 1 1 1 NA 1 NA 1 NA NA NA
# 4 4 NA 1 1 1 1 1 1 NA NA NA
# 5 5 NA NA 1 1 1 1 1 NA NA 1
Recoding NA to 0 has to be done manually--perhaps I'll update the function to add an option to do so, and at the same time, implement one of these faster solutions :)
temp <- concat.split(DF, "B", structure="expanded", drop.col=TRUE)
temp[is.na(temp)] <- 0
temp
# A B_1 B_2 B_3 B_4 B_5 B_6 B_7 B_8 B_9 B_10
# 1 1 1 1 1 0 0 0 0 0 0 0
# 2 2 1 1 1 0 0 1 0 0 0 0
# 3 3 1 1 1 0 1 0 1 0 0 0
# 4 4 0 1 1 1 1 1 1 0 0 0
# 5 5 0 0 1 1 1 1 1 0 0 1
Update
Most of the overhead in the concat.split function probably comes from things like converting from a matrix to a data.frame, renaming the columns, and so on. The actual code used to do the splitting is a GASP for loop, but test it out, and you'll find that it performs pretty well:
b = strsplit(DF$B, ",")
ncol = max(as.numeric(unlist(b)))
temp = lapply(b, as.numeric)
## Set up an empty matrix
m = matrix(0, nrow = nrow(DF), ncol = ncol)
## Fill it in
for (i in 1:nrow(DF)) {
  m[i, temp[[i]]] = 1
}
## View your result
m
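If the loop is a concern, the fill step can also be written with matrix indexing. The sketch below is an alternative, not the code used inside concat.split; DF is not defined in this answer, so the version here is rebuilt from the printed B column and is an assumption, as is the name idx.

```r
# assumed input, reconstructed from the output shown earlier
DF <- data.frame(
  A = 1:5,
  B = c("1,3,2", "2,1,3,6", "3,2,5,1,7", "3,7,4,2,6,5", "4,10,7,3,5,6"),
  stringsAsFactors = FALSE
)

# split each string and convert the pieces to numbers
b <- lapply(strsplit(DF$B, ",", fixed = TRUE), as.numeric)

# empty matrix with one column per possible value
m <- matrix(0, nrow = nrow(DF), ncol = max(unlist(b)))

# two-column index: row number repeated once per value, then the value itself
idx <- cbind(rep(seq_along(b), lengths(b)), unlist(b))
m[idx] <- 1
m
```

Each row of idx names one (row, column) cell, so a single assignment sets every cell that the loop would have visited.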
Create a dummy indicating presence of a string fragment in any of multiple variables
A base R solution:
cols = endsWith(names(df),"line")
df['Apartment_dummy'] = as.integer(grepl('apartment',do.call(paste,df[cols])))
Now we can write a function that also takes the data to be used as an argument:
premises_dummy = function(data, varname, strings){
  # columns whose names end in "line" hold the address parts
  cols = endsWith(names(data), "line")
  data[varname] = as.integer(grepl(strings, do.call(paste, data[cols])))
  data
}
premises_dummy(df, varname = 'Apartment_dummy', strings = 'apartment')
address.1.line address.2.line address.3.line Apartment_dummy
1 apartment 5 london 1
2 25 spring street new york 0
3 nice house apartment 2 paris 1
Create dummy variable from string variable
Use str_detect() from the package stringr:
library(stringr)
as.integer(str_detect(infringements$Title,"Delegated"))
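If stringr is not available, base R's grepl() gives a similar result. The infringements data frame below is a made-up stand-in (the original data are not shown), and the column name delegated is also an assumption. One difference worth knowing: grepl() returns FALSE for a missing title, whereas str_detect() returns NA.

```r
# hypothetical stand-in for the original data
infringements <- data.frame(
  Title = c("Commission Delegated Regulation on X", "Directive on Y", NA),
  stringsAsFactors = FALSE
)

# fixed = TRUE treats "Delegated" as a literal string, not a regular expression
infringements$delegated <- as.integer(
  grepl("Delegated", infringements$Title, fixed = TRUE)
)
infringements
```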
Generating dummy variable based on two string variables
Your data example can be simplified to
sysuse auto
egen var1 = concat(mpg foreign), punct(" ")
egen var2 = concat(trunk foreign), punct(" ")
as mapping to string in this instance is not needed for mpg trunk any more than it was needed for foreign. concat() maps to string on the fly, and the only issues with numeric variables (neither applying here) are if fractional parts are present or you want to see value labels.
Now that it is confirmed that multiple words can be present, we can work with a slightly more interesting example.
Here are two methods. One is to loop over the words in one variable and also the words in the other variable to check if there are any matches.
Stata's definition of a word here is that words are delimited by spaces. That being so, we can check for the occurrence of " word " within " variable ", where the leading and trailing spaces are needed because in say "frog toad newt" neither "frog" nor "newt" occurs with both leading and trailing spaces. In the OP's example the check may not be needed, but it often is, just as a search for "1" or "2" or "3" finds any of those within "11 12 13", which is wrong if you seek any as a word and not as a single character.
More is said on search for words within strings in a paper in press at the Stata Journal and likely to appear in 22(4) 2022.
* Example generated by -dataex-. For more info, type help dataex
clear
input str8 var1 str5 var2
"FR DE" "FR"
"FR DE GB" "GB"
"GB" "FR"
"IT FR" "GB DE"
end
gen wc = wordcount(var1)
su wc, meanonly
local max1 = r(max)
replace wc = wordcount(var2)
su wc, meanonly
local max2 = r(max)
drop wc
gen match = 0
quietly forval i = 1/`max1' {
    forval j = 1/`max2' {
        replace match = 1 if word(var1, `i') == word(var2, `j') & word(var1, `i') != ""
    }
}
gen MATCH = 0
forval i = 1/`max1' {
    replace MATCH = 1 if strpos(" " + var2 + " ", " " + word(var1, `i') + " ")
}
list
     +----------------------------------+
     |     var1    var2   match   MATCH |
     |----------------------------------|
  1. |    FR DE      FR       1       1 |
  2. | FR DE GB      GB       1       1 |
  3. |       GB      FR       0       0 |
  4. |    IT FR   GB DE       0       0 |
     +----------------------------------+
EDIT
replace MATCH = 1 if strpos(" " + var2 + " ", " " + word(var1, `i') + " ") & !missing(var1, var2)
is better code to avoid the uninteresting match of " " with " ".
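For readers following the rest of this page in R, the padded-word check can be sketched there too. This is a rough translation of the idea rather than of the Stata code itself; has_common_word and match_flag are names invented for the illustration, and empty strings are simply treated as having no words.

```r
# same example data as the Stata input above
var1 <- c("FR DE", "FR DE GB", "GB", "IT FR")
var2 <- c("FR", "GB", "FR", "GB DE")

# TRUE if any space-delimited word of a occurs as a whole word in b
has_common_word <- function(a, b) {
  words <- strsplit(a, " ", fixed = TRUE)[[1]]
  words <- words[nzchar(words)]          # ignore empty pieces
  padded <- paste0(" ", b, " ")          # pad so every word has spaces on both sides
  any(vapply(words,
             function(w) grepl(paste0(" ", w, " "), padded, fixed = TRUE),
             logical(1)))
}

match_flag <- as.integer(mapply(has_common_word, var1, var2))
match_flag

# why the padding matters: "1" is found inside "11 12 13" unless padded
naive_hit  <- grepl("1", "11 12 13", fixed = TRUE)
padded_hit <- grepl(" 1 ", " 11 12 13 ", fixed = TRUE)
```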
R: create dummy variables based on a categorical variable *of lists*
We can use mtabulate() from the qdapTools package:
library(qdapTools)
cbind(df[1], mtabulate(df$y))
# x A B C D E
#1 1 1 0 0 0 0
#2 2 1 1 0 0 0
#3 3 0 0 1 0 0
#4 4 0 1 1 1 0
#5 5 0 0 0 0 1
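A base R sketch of the same tabulation is given below for comparison. The input df is not shown in the answer, so the list column here is reconstructed from the printed output and is an assumption; lvls, counts and res are names introduced for the example.

```r
# assumed input: x plus a list column y, rebuilt from the output above
df <- data.frame(x = 1:5)
df$y <- list("A", c("A", "B"), "C", c("B", "C", "D"), "E")

# all distinct values across the list column
lvls <- sort(unique(unlist(df$y)))

# count how often each value occurs in each list element
counts <- do.call(rbind, lapply(df$y, function(v) {
  tabulate(match(v, lvls), nbins = length(lvls))
}))
colnames(counts) <- lvls

res <- cbind(df["x"], counts)
res
```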
Creating dummy variables in R based on multiple chr values within each cell
Here's one solution:
# make some toy data to test
set.seed(1)
df <- data.frame(Score = rnorm(10),
Tags = replicate(10, paste(sample(LETTERS, 5), collapse = ", ")),
stringsAsFactors = FALSE)
# load stringr, which we'll use to trim whitespace from the split-up tags
library(stringr)
# use strsplit to break your jumbles of tags into separate elements, with a
# list for each position in the original vector. i've split on commas here,
# but you'll probably want to split on slashes or slashes and quotation marks.
t <- strsplit(df$Tags, split = ",")
# get a vector of the unique elements of those lists. you may need to use str_trim
# or something like it to cut leading and trailing whitespace. you might also
# need to use stringr's `str_subset` and a regular expression to cut the result
# down to, say, only alphanumeric strings. without a reproducible example, though,
# i can't do that for your specific case here.
tags <- unique(str_trim(unlist(t)))
# now, use `sapply` and `grepl` to look for each element of `tags` in each list;
# use `any` to summarize the results;
# use `+` to convert those summaries to binary;
# use `lapply` to iterate that process over all elements of `tags`;
# use `Reduce(cbind, ...)` to collapse the results into a table; and
# use `as.data.frame` to turn that table into a df.
df2 <- as.data.frame(Reduce(cbind, lapply(tags, function(i) sapply(t, function(j) +(any(grepl(i, j), na.rm = TRUE))))))
# assign the tags as column names
names(df2) <- tags
Voila:
> df2
Y F P C Z K A J U H M O L E S R T Q V B I X G
1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0 0 0 0
4 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0
5 0 0 0 0 0 0 0 0 0 0 1 0 1 0 1 0 0 1 1 0 0 0 0
6 0 0 1 0 0 0 1 0 1 0 0 0 1 0 0 0 0 1 0 0 0 0 0
7 0 1 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 1 1 0 0 0
8 0 0 1 1 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 1 0 0
9 1 0 0 0 0 0 0 0 0 1 0 1 1 0 0 0 0 0 0 0 0 1 0
10 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 1 0 0 1 0 0 1
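One caveat with grepl() here is that it matches patterns inside strings, so a tag such as "R" would also be found inside a longer tag like "Rcpp". With single-letter tags that never happens, but for real tags an exact-match variant may be safer. The sketch below assumes comma-separated tags as in the toy data; tag_list is a name introduced for the example.

```r
# same toy data as above
set.seed(1)
df <- data.frame(Score = rnorm(10),
                 Tags = replicate(10, paste(sample(LETTERS, 5), collapse = ", ")),
                 stringsAsFactors = FALSE)

# split on commas and trim the whitespace left around each tag
tag_list <- lapply(strsplit(df$Tags, ",", fixed = TRUE), trimws)
tags <- unique(unlist(tag_list))

# %in% compares whole tags, so partial matches cannot occur
df2 <- as.data.frame(
  do.call(rbind, lapply(tag_list, function(x) as.integer(tags %in% x)))
)
names(df2) <- tags
df2
```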