Dummy variables from a string variable
UPDATE
The function mentioned here has now been moved to a package available on CRAN called "splitstackshape". The version on CRAN is considerably faster than this original version. The speeds should be similar to what you would get with the direct for loop solution at the end of this answer. See @Ricardo's answer for detailed benchmarks.
Install it, and use concat.split.expanded to get the desired result:
library(splitstackshape)
concat.split.expanded(DF, "B", fill = 0, drop = TRUE)
# A B_01 B_02 B_03 B_04 B_05 B_06 B_07 B_08 B_09 B_10
# 1 1 1 1 1 0 0 0 0 0 0 0
# 2 2 1 1 1 0 0 1 0 0 0 0
# 3 3 1 1 1 0 1 0 1 0 0 0
# 4 4 0 1 1 1 1 1 1 0 0 0
# 5 5 0 0 1 1 1 1 1 0 0 1
Original post
A while ago, I wrote a function to do not just this sort of splitting, but others as well. The function, named concat.split(), can be found here.
The usage, for your example data, would be:
## Keeping the original column
concat.split(DF, "B", structure="expanded")
# A B B_1 B_2 B_3 B_4 B_5 B_6 B_7 B_8 B_9 B_10
# 1 1 1,3,2 1 1 1 NA NA NA NA NA NA NA
# 2 2 2,1,3,6 1 1 1 NA NA 1 NA NA NA NA
# 3 3 3,2,5,1,7 1 1 1 NA 1 NA 1 NA NA NA
# 4 4 3,7,4,2,6,5 NA 1 1 1 1 1 1 NA NA NA
# 5 5 4,10,7,3,5,6 NA NA 1 1 1 1 1 NA NA 1
## Dropping the original column
concat.split(DF, "B", structure="expanded", drop.col=TRUE)
# A B_1 B_2 B_3 B_4 B_5 B_6 B_7 B_8 B_9 B_10
# 1 1 1 1 1 NA NA NA NA NA NA NA
# 2 2 1 1 1 NA NA 1 NA NA NA NA
# 3 3 1 1 1 NA 1 NA 1 NA NA NA
# 4 4 NA 1 1 1 1 1 1 NA NA NA
# 5 5 NA NA 1 1 1 1 1 NA NA 1
Recoding NA to 0 has to be done manually. Perhaps I'll update the function to add an option to do so and, at the same time, implement one of these faster solutions :)
temp <- concat.split(DF, "B", structure="expanded", drop.col=TRUE)
temp[is.na(temp)] <- 0
temp
# A B_1 B_2 B_3 B_4 B_5 B_6 B_7 B_8 B_9 B_10
# 1 1 1 1 1 0 0 0 0 0 0 0
# 2 2 1 1 1 0 0 1 0 0 0 0
# 3 3 1 1 1 0 1 0 1 0 0 0
# 4 4 0 1 1 1 1 1 1 0 0 0
# 5 5 0 0 1 1 1 1 1 0 0 1
Update
Most of the overhead in the concat.split function probably comes from things like converting from a matrix to a data.frame, renaming the columns, and so on. The actual code used to do the splitting is a GASP for loop, but test it out, and you'll find that it performs pretty well:
b = strsplit(DF$B, ",")
ncol = max(as.numeric(unlist(b)))
temp = lapply(b, as.numeric)
## Set up an empty matrix
m = matrix(0, nrow = nrow(DF), ncol = ncol)
## Fill it in
for (i in seq_len(nrow(DF))) {
  ## Each row gets a 1 in every column position listed in its string
  m[i, temp[[i]]] = 1
}
## View your result
m
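For readers following along in pandas/NumPy rather than R, the same "start from zeros, then fill in the listed positions" idea can be sketched as below. This is only an illustration: the list B is typed in by hand to mirror column B from the example, and the names codes, n_cols and m are chosen here for the sketch.

```python
import numpy as np

# Values of column B from the example, typed in by hand for this sketch
B = ["1,3,2", "2,1,3,6", "3,2,5,1,7", "3,7,4,2,6,5", "4,10,7,3,5,6"]

# Split each string into a list of integer codes
codes = [[int(x) for x in s.split(",")] for s in B]
n_cols = max(max(row) for row in codes)

# Set up an empty (all-zero) matrix, one row per input string
m = np.zeros((len(B), n_cols), dtype=int)

# Fill it in: codes are 1-based, NumPy columns are 0-based
for i, row in enumerate(codes):
    m[i, [c - 1 for c in row]] = 1

print(m)
```

The loop touches each row once, so the cost is dominated by the string splitting rather than by the fill itself.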
Creating Dummy Variables from String Column
You can use pd.Series.str.get_dummies and a dictionary mapping:
d = {1: 'yes', 0: 'no'}
# Note: in pandas >= 2.1, DataFrame.map replaces the deprecated applymap
res = df.join(df.pop('Intervention').str.get_dummies(', ').applymap(d.get))
In my opinion, it's best to convert to strings for display purposes only. Boolean values are more efficiently held and manipulated in Boolean series.
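One way to act on that advice is sketched below: keep the indicators as Booleans for any filtering or counting, and build the 'yes'/'no' version only when printing. The small DataFrame and the names flags, res and shown are made up for this illustration.

```python
import pandas as pd

# Small illustrative frame (not the full Setup data)
df = pd.DataFrame({
    "ID": [1, 2, 3],
    "Intervention": ["Blood Draw, Flushed", "Cap Changed", "Flushed"],
})

# One 0/1 indicator column per distinct label
dummies = df["Intervention"].str.get_dummies(sep=", ")

# Keep Booleans for computation
flags = dummies.astype(bool)
res = df[["ID"]].join(flags)

# Boolean columns can be used directly as masks or summed as counts
n_flushed = int(res["Flushed"].sum())

# Convert to strings only for display
shown = res.copy()
for col in flags.columns:
    shown[col] = shown[col].map({True: "yes", False: "no"})

print(shown)
```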
Result
print(res)
ID Blood Draw Blood return Verified Cap Changed Flushed Heparin-Locked \
0 1 yes no no yes no
1 1 yes no no no yes
2 1 yes no no yes no
3 2 no yes no yes no
4 2 no no yes no no
5 3 no no no no no
Locked Port De-Accessed Tubing Changed
0 yes no no
1 no yes yes
2 no no no
3 no no no
4 no no no
5 no yes no
Setup
import pandas as pd

df = pd.DataFrame({'ID': [1, 1, 1, 2, 2, 3],
                   'Intervention': ['Blood Draw, Flushed, Locked',
                                    'Blood Draw, Port De-Accessed, Heparin-Locked, Tubing Changed',
                                    'Blood Draw, Flushed', 'Blood return Verified, Flushed',
                                    'Cap Changed', 'Port De-Accessed']})
create dummy variable from string variable
Use str_detect() from the package stringr:
library(stringr)
as.integer(str_detect(infringements$Title,"Delegated"))
Create dummy variables from string with multiple values
Overview
Creating dummy variables for each unique value in good_at requires the following steps:
- Separate good_at into multiple rows
- Generate dummy variables - using dummy::dummy() - for each value in good_at for each name-sex pair
- Reshape data into 4 columns: name, sex, key and value
  - key contains all the dummy variable column names
  - value contains the values in each dummy variable
- Keep only records where value is not zero
- Reshape data into one record per name-sex pair and as many columns as there are in key
- Cast the dummy columns as logical vectors.
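For comparison, the end state those steps aim for (one row per person, one logical column per skill) can be sketched in pandas. This is not the approach used in this answer, just a compact illustration; the data is retyped from the example and the names dummies and out are chosen for the sketch.

```python
import pandas as pd

# Example data retyped from the question
df = pd.DataFrame({
    "name": ["Tom", "Mary", "Sam", "Charlie"],
    "sex": ["M", "F", "M", "M"],
    "good_at": ["Drawing;Hiking", "Cooking;Joking", "Running", "Swimming"],
})

# Split on ';', build one indicator per unique value,
# cast to logical, and prefix the new column names
dummies = (
    df["good_at"]
    .str.get_dummies(sep=";")
    .astype(bool)
    .add_prefix("good_at_")
)

# Keep the original columns alongside the dummies
out = df.join(dummies)
print(out)
```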
Code
# load necessary packages ----
library(dummy)
library(tidyverse)
# load necessary data ----
df <-
read.table(text = "name sex good_at
1 Tom M Drawing;Hiking
2 Mary F Cooking;Joking
3 Sam M Running
4 Charlie M Swimming"
, header = TRUE
, stringsAsFactors = FALSE)
# create a longer version of df -----
# where one record represents
# one unique name, sex, good_at value
df_clean <-
df %>%
separate_rows(good_at, sep = ";")
# create dummy variables for all unique values in "good_at" column ----
df_dummies <-
df_clean %>%
select(good_at) %>%
dummy() %>%
bind_cols(df_clean) %>%
# drop "good_at" column
select(-good_at) %>%
# make the tibble long by reshaping it into 4 columns:
# name, sex, key and value
# where key are the all dummy variable column names
# and value are the values in each dummy variable
gather(key, value, -name, -sex) %>%
# keep records where
# value is not equal to zero
# note: this is due to "Tom" having both a
# "good_at_Drawing" value of 0 and 1.
filter(value != 0) %>%
# make the tibble wide
# with one record per name-sex pair
# and as many columns as there are in key
# with their values from value
# and filling NA values to 0
spread(key, value, fill = 0) %>%
# for each name-sex pair
# cast the dummy variables into logical vectors
group_by(name, sex) %>%
# note: funs() is deprecated in recent dplyr; a formula lambda does the same job
mutate_all(~ as.logical(as.integer(.))) %>%
ungroup() %>%
# just for safety let's join
# the original "good_at" column
left_join(y = df, by = c("name", "sex")) %>%
# bring the original "good_at" column to the left-hand side
# of the tibble
select(name, sex, good_at, matches("good_at_"))
# view result ----
df_dummies
# A tibble: 4 x 9
# name sex good_at good_at_Cooking good_at_Drawing good_at_Hiking
# <chr> <chr> <chr> <lgl> <lgl> <lgl>
# 1 Char… M Swimmi… FALSE FALSE FALSE
# 2 Mary F Cookin… TRUE FALSE FALSE
# 3 Sam M Running FALSE FALSE FALSE
# 4 Tom M Drawin… FALSE TRUE TRUE
# ... with 3 more variables: good_at_Joking <lgl>, good_at_Running <lgl>,
# good_at_Swimming <lgl>
# end of script #
Create a dummy indicating the presence of a string fragment in any of multiple variables
A base R solution:
cols = endsWith(names(df),"line")
df['Apartment_dummy'] = as.integer(grepl('apartment',do.call(paste,df[cols])))
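We can also wrap this in a function that takes the data as an argument, rather than relying on a global df:
premises_dummy = function(data, varname, strings) {
  # select the columns whose names end in "line"
  cols = endsWith(names(data), "line")
  # paste those columns together row by row and search the combined text
  data[varname] = as.integer(grepl(strings, do.call(paste, data[cols])))
  data
}
premises_dummy(df, varname = 'Apartment_dummy', strings = 'apartment')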
address.1.line address.2.line address.3.line Apartment_dummy
1 apartment 5 london 1
2 25 spring street new york 0
3 nice house apartment 2 paris 1
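The same "paste the columns together, then search once" idea can be sketched in pandas. The DataFrame below is a guess at the question's data based on the printed output, and the suffix parameter is an assumption added for this sketch.

```python
import pandas as pd

# Hypothetical data approximating the printed output above
df = pd.DataFrame({
    "address.1.line": ["apartment 5", "25 spring street", "nice house"],
    "address.2.line": ["london", "new york", "apartment 2"],
    "address.3.line": [None, None, "paris"],
})

def premises_dummy(data, varname, pattern, suffix="line"):
    """Return a copy of data with a 0/1 column flagging rows where
    pattern appears in any column whose name ends with suffix."""
    cols = [c for c in data.columns if c.endswith(suffix)]
    # Join the selected columns row by row, treating missing values as empty
    combined = data[cols].fillna("").astype(str).agg(" ".join, axis=1)
    out = data.copy()
    out[varname] = combined.str.contains(pattern).astype(int)
    return out

res = premises_dummy(df, "Apartment_dummy", "apartment")
print(res)
```

Returning a copy keeps the input frame unchanged, which mirrors how the R function returns a modified data frame instead of editing df in place.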
dummy variable columns based on strings from other columns
One tidyverse possibility could be:
example %>%
  mutate(treatment2 = strsplit(treatment, "+", fixed = TRUE)) %>%
  # name the list column explicitly; a bare unnest() is deprecated in tidyr >= 1.0
  unnest(treatment2) %>%
  spread(treatment2, treatment2) %>%
  mutate_at(vars(-id_number, -treatment), ~ (!is.na(.)) * 1)
id_number treatment A B C D
1 0 A 1 0 0 0
2 1 A+B+C+D 1 1 1 1
3 2 C+B 0 1 1 0
4 3 B+A 1 1 0 0
5 4 C 0 0 1 0
Or:
example %>%
  mutate(treatment2 = strsplit(treatment, "+", fixed = TRUE)) %>%
  # name the list column explicitly; a bare unnest() is deprecated in tidyr >= 1.0
  unnest(treatment2) %>%
  mutate(val = 1) %>%
  spread(treatment2, val, fill = 0)
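If the data lives in pandas instead, a similar result can be sketched with Series.str.get_dummies, using "+" as the separator. The example frame is retyped from the output shown above.

```python
import pandas as pd

# Data retyped from the printed output above
example = pd.DataFrame({
    "id_number": [0, 1, 2, 3, 4],
    "treatment": ["A", "A+B+C+D", "C+B", "B+A", "C"],
})

# Split each string on '+' and build one 0/1 column per distinct treatment
dummies = example["treatment"].str.get_dummies(sep="+")

# Attach the indicator columns to the original frame
result = example.join(dummies)
print(result)
```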
Convert an entire string to a dummy variable
@Alex97: The method pd.get_dummies should answer your issue.
Solution
import pandas as pd

# df is your DataFrame as shown in your picture
# generate dummies
dum_winners = pd.get_dummies(df.Winner, prefix="win", prefix_sep=" ")
dum_losers = pd.get_dummies(df.Loser, prefix="los", prefix_sep=" ")
# update df with dummies
df = pd.concat([df, dum_winners, dum_losers], axis=1)\
    .drop(columns=["Winner", "Loser"])
The solution above takes as input:
Other Features Winner Loser
0 2 John D. Jason S.
1 4 Jason S. Eric N.
and returns:
Other Features win Jason S. win John D. los Eric N. los Jason S.
0 2 0 1 0 1
1 4 1 0 1 0
Details
Here are documentation links:
- concat to merge DataFrames: link
- get_dummies to generate a one-hot encoded matrix: link
Create dummy and categorical variables from specific word(s) in text column in Python dataframe
Here's my take on this.
Since you're dealing with text, pandas.Series.str.contains should be plenty (no need to use re.search).
np.where and np.select are useful when it comes to assigning new variables based on conditions.
import pandas as pd
import numpy as np
Cars_listing = pd.DataFrame({
'Cars_notes':
['"This Audi has ABS braking, leather interior and bucket seats..."',
'"The Ford F150 is one tough pickup truck, with 4x4, new suspension and club cab..."',
'"Our Nissan Sentra comes with ABS brakes, Bluetooth-enabled radio..."',
'"This Toyota Corolla is a gem, with new tires, low miles, a few scratches..."',
'"The Renault Le Car has been sitting in the garage, a little rust..."',
'"The Kia Sorento for sale has a CD player, new tires..."',
'"Red Dodge Viper convertible for sale, ceramic brakes, low miles..."']
})
# 1. car_type
Cars_listing['car_type'] = np.select(
condlist=[ # note you could use the case-insensitive search with `case=False`
Cars_listing['Cars_notes'].str.contains('ford', case=False),
Cars_listing['Cars_notes'].str.contains('audi|renault', case=False),
Cars_listing['Cars_notes'].str.contains('Toyota|Kia')
],
choicelist=[1, 2, 3], # dummy variables
default=0 # you could set it to `np.nan` etc
)
# 2. ABS_brakes
Cars_listing['ABS_brakes'] = np.where(# where(condition, [x, y])
Cars_listing['Cars_notes'].str.contains('ABS brak'), 1, 0)
# 3. imperfection
Cars_listing['imperfection'] = np.where(
Cars_listing['Cars_notes'].str.contains('rust|scratches'), 1, 0)
# 4. sporty
Cars_listing['sporty'] = np.where(
Cars_listing['Cars_notes'].str.contains('convertible'), 1, 0)
Cars_notes car_type ABS_brakes imperfection sporty
0 """This Audi has ..." 2 1 0 0
1 """The Ford F150 ..." 1 0 0 0
2 """Our Nissan Sen..." 0 1 0 0
3 """This Toyota Co..." 3 0 1 0
4 """The Renault Le..." 2 0 1 0
5 """The Kia Sorent..." 3 0 0 0
6 """Red Dodge Vipe..." 0 0 0 1
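One detail worth keeping in mind with np.select is that conditions are checked in order, so a row matching several conditions gets the choice for the first one. The short sketch below illustrates this with made-up notes; the names notes, is_ford and is_audi are invented for the example.

```python
import numpy as np
import pandas as pd

# Made-up notes; the second one mentions both Audi and Ford
notes = pd.Series([
    "Ford pickup, a little rust",
    "Audi with Ford parts",
    "Nissan, low miles",
])

is_ford = notes.str.contains("ford", case=False)
is_audi = notes.str.contains("audi", case=False)

# Conditions are evaluated in order: the first matching one wins
car_type = np.select([is_ford, is_audi], [1, 2], default=0)

# np.where handles the simple two-outcome case
has_rust = np.where(notes.str.contains("rust"), 1, 0)

print(car_type, has_rust)
```

If a different precedence is wanted, reordering the condition list (and the matching choices) is enough.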