Dummy Variables from a String Variable

UPDATE

The function mentioned here has now been moved to a package available on CRAN called "splitstackshape". The version on CRAN is considerably faster than this original version. The speeds should be similar to what you would get with the direct for loop solution at the end of this answer. See @Ricardo's answer for detailed benchmarks.

Install it, and use concat.split.expanded to get the desired result:

library(splitstackshape)
concat.split.expanded(DF, "B", fill = 0, drop = TRUE)
#   A B_01 B_02 B_03 B_04 B_05 B_06 B_07 B_08 B_09 B_10
# 1 1    1    1    1    0    0    0    0    0    0    0
# 2 2    1    1    1    0    0    1    0    0    0    0
# 3 3    1    1    1    0    1    0    1    0    0    0
# 4 4    0    1    1    1    1    1    1    0    0    0
# 5 5    0    0    1    1    1    1    1    0    0    1
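
For readers working in pandas rather than R, a rough counterpart of the expanded output above might look like the sketch below. It rests on assumptions: DF is rebuilt here from the example values shown in this answer, and the zero-padded B_01 ... B_10 names are imitated by hand; nothing here is part of splitstackshape.

```python
import pandas as pd

# Example data rebuilt from the values shown in this answer (an assumption).
DF = pd.DataFrame({
    "A": [1, 2, 3, 4, 5],
    "B": ["1,3,2", "2,1,3,6", "3,2,5,1,7", "3,7,4,2,6,5", "4,10,7,3,5,6"],
})

# One 0/1 column per value that actually occurs in B.
dummies = DF["B"].str.get_dummies(sep=",")

# Column labels come back as strings; convert them to integers.
dummies.columns = [int(c) for c in dummies.columns]

# Add all-zero columns for values that never occur (8 and 9 here).
n_max = max(dummies.columns)
dummies = dummies.reindex(columns=range(1, n_max + 1), fill_value=0)

# Imitate the zero-padded column names and drop the original B column.
dummies.columns = [f"B_{k:02d}" for k in range(1, n_max + 1)]
result = pd.concat([DF[["A"]], dummies], axis=1)
print(result)
```

The reindex step is what keeps the never-observed values as explicit all-zero columns, which str.get_dummies alone would omit.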

Original post

A while ago, I wrote a function that handles not just this sort of splitting, but other kinds as well. The function, named concat.split(), can be found here.

The usage, for your example data, would be:

## Keeping the original column
concat.split(DF, "B", structure="expanded")
#   A            B B_1 B_2 B_3 B_4 B_5 B_6 B_7 B_8 B_9 B_10
# 1 1        1,3,2   1   1   1  NA  NA  NA  NA  NA  NA   NA
# 2 2      2,1,3,6   1   1   1  NA  NA   1  NA  NA  NA   NA
# 3 3    3,2,5,1,7   1   1   1  NA   1  NA   1  NA  NA   NA
# 4 4  3,7,4,2,6,5  NA   1   1   1   1   1   1  NA  NA   NA
# 5 5 4,10,7,3,5,6  NA  NA   1   1   1   1   1  NA  NA    1

## Dropping the original column
concat.split(DF, "B", structure="expanded", drop.col=TRUE)
#   A B_1 B_2 B_3 B_4 B_5 B_6 B_7 B_8 B_9 B_10
# 1 1   1   1   1  NA  NA  NA  NA  NA  NA   NA
# 2 2   1   1   1  NA  NA   1  NA  NA  NA   NA
# 3 3   1   1   1  NA   1  NA   1  NA  NA   NA
# 4 4  NA   1   1   1   1   1   1  NA  NA   NA
# 5 5  NA  NA   1   1   1   1   1  NA  NA    1

Recoding NA to 0 has to be done manually--perhaps I'll update the function to add an option to do so, and at the same time, implement one of these faster solutions :)

temp <- concat.split(DF, "B", structure="expanded", drop.col=TRUE)
temp[is.na(temp)] <- 0
temp
#   A B_1 B_2 B_3 B_4 B_5 B_6 B_7 B_8 B_9 B_10
# 1 1   1   1   1   0   0   0   0   0   0    0
# 2 2   1   1   1   0   0   1   0   0   0    0
# 3 3   1   1   1   0   1   0   1   0   0    0
# 4 4   0   1   1   1   1   1   1   0   0    0
# 5 5   0   0   1   1   1   1   1   0   0    1

Update

Most of the overhead in the concat.split function probably comes from things like converting from a matrix to a data.frame, renaming the columns, and so on. The actual code used to do the splitting is a (gasp!) for loop, but test it out, and you'll find that it performs pretty well:

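## Split each string into its comma-separated pieces
b = strsplit(as.character(DF$B), ",")
## Convert the pieces to numeric column indices
temp = lapply(b, as.numeric)
## The largest index determines the number of columns
n_cols = max(unlist(temp))
## Set up an empty matrix
m = matrix(0, nrow = nrow(DF), ncol = n_cols)
## Fill it in
for (i in seq_len(nrow(DF))) {
  m[i, temp[[i]]] = 1
}
## View your result
m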
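
The same fill-a-zero-matrix idea can be sketched in Python with NumPy. This is only an illustration: the list B repeats the example values from this answer, and the variable names are made up for the sketch.

```python
import numpy as np

# Example strings taken from column B of the sample data.
B = ["1,3,2", "2,1,3,6", "3,2,5,1,7", "3,7,4,2,6,5", "4,10,7,3,5,6"]

# Split each string and convert the pieces to integers.
parts = [[int(x) for x in row.split(",")] for row in B]

# The largest value determines the number of columns.
n_cols = max(max(p) for p in parts)

# Set up an all-zero matrix, then switch on the listed positions row by row.
m = np.zeros((len(B), n_cols), dtype=int)
for i, idx in enumerate(parts):
    m[i, np.array(idx) - 1] = 1  # values are 1-based, NumPy columns are 0-based

print(m)
```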

Creating Dummy Variables from String Column

You can use pd.Series.str.get_dummies and a dictionary mapping:

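d = {1: 'yes', 0: 'no'}
# Split on ', ', build 0/1 indicator columns, then map them to 'yes'/'no'
# (on pandas >= 2.1, DataFrame.map replaces the deprecated applymap)
res = df.join(df.pop('Intervention').str.get_dummies(', ').applymap(d.get))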

In my opinion, it's best to convert to strings for display purposes only. Boolean values are more efficiently held and manipulated in Boolean series.

Result

print(res)

   ID Blood Draw Blood return Verified Cap Changed Flushed Heparin-Locked  \
0   1        yes                    no          no     yes             no   
1   1        yes                    no          no      no            yes   
2   1        yes                    no          no     yes             no   
3   2         no                   yes          no     yes             no   
4   2         no                    no         yes      no             no   
5   3         no                    no          no      no             no   

  Locked Port De-Accessed Tubing Changed  
0    yes               no             no  
1     no              yes            yes  
2     no               no             no  
3     no               no             no  
4     no               no             no  
5     no              yes             no

Setup

import pandas as pd

df = pd.DataFrame({'ID': [1, 1, 1, 2, 2, 3],
                   'Intervention': ['Blood Draw, Flushed, Locked',
                                    'Blood Draw, Port De-Accessed, Heparin-Locked, Tubing Changed',
                                    'Blood Draw, Flushed', 'Blood return Verified, Flushed',
                                    'Cap Changed', 'Port De-Accessed']})
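
As a sketch of the "strings for display only" advice above, the indicators could be kept as Booleans and converted to 'yes'/'no' only when printing. The names flags, res_bool and res_display are made up for this illustration.

```python
import pandas as pd

df = pd.DataFrame({'ID': [1, 1, 1, 2, 2, 3],
                   'Intervention': ['Blood Draw, Flushed, Locked',
                                    'Blood Draw, Port De-Accessed, Heparin-Locked, Tubing Changed',
                                    'Blood Draw, Flushed', 'Blood return Verified, Flushed',
                                    'Cap Changed', 'Port De-Accessed']})

# Keep the indicators as Booleans for storage and computation.
flags = df['Intervention'].str.get_dummies(sep=', ').astype(bool)
res_bool = df[['ID']].join(flags)

# Convert to strings only when a human-readable view is needed.
res_display = df[['ID']].join(
    flags.apply(lambda col: col.map({True: 'yes', False: 'no'}))
)

# Boolean columns can be aggregated directly, e.g. interventions per row.
print(flags.sum(axis=1))
print(res_display)
```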

Create dummy variable from string variable

Use str_detect() from the stringr package:

library(stringr)

as.integer(str_detect(infringements$Title,"Delegated"))
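
For comparison, a pandas version of the same detect-and-convert idea might look like the sketch below. The infringements data here is invented purely for illustration; only the column name Title and the search string come from the line above.

```python
import pandas as pd

# Hypothetical stand-in for the infringements data (not from the question).
infringements = pd.DataFrame({
    "Title": ["Commission Delegated Regulation", "Council Directive", None]
})

# True where the title contains the literal text, False otherwise;
# na=False treats missing titles as "no match".
is_delegated = infringements["Title"].str.contains("Delegated", regex=False, na=False)

# Convert the Boolean result to a 0/1 dummy.
infringements["delegated"] = is_delegated.astype(int)
print(infringements)
```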

Create dummy variables from string with multiple values

Overview

Creating dummy variables for each unique value in good_at requires the following steps:

1. Separate good_at into multiple rows.
2. Generate dummy variables, using dummy::dummy(), for each value in good_at for each name-sex pair.
3. Reshape the data into 4 columns: name, sex, key and value.
   - key contains all the dummy variable column names
   - value contains the values in each dummy variable
4. Keep only records where value is not zero.
5. Reshape the data into one record per name-sex pair and as many columns as there are in key.
6. Cast the dummy columns as logical vectors.

Code

# load necessary packages ----
library(dummy)
library(tidyverse)

# load necessary data ----
df <-
  read.table(text = "name    sex     good_at
             1 Tom      M   Drawing;Hiking
             2 Mary     F   Cooking;Joking
             3 Sam      M      Running
             4 Charlie  M      Swimming"
             , header = TRUE
             , stringsAsFactors = FALSE)

# create a longer version of df -----
# where one record represents
# one unique name, sex, good_at value
df_clean <-
  df %>%
  separate_rows(good_at, sep = ";")

# create dummy variables for all unique values in "good_at" column ----
df_dummies <-
  df_clean %>%
  select(good_at) %>%
  dummy() %>%
  bind_cols(df_clean) %>%
  # drop "good_at" column 
  select(-good_at) %>%
  # make the tibble long by reshaping it into 4 columns:
  # name, sex, key and value
  # where key are the all dummy variable column names
  # and value are the values in each dummy variable
  gather(key, value, -name, -sex) %>%
  # keep records where
  # value is not equal to zero
  # note: this is due to "Tom" having both a 
  # "good_at_Drawing" value of 0 and 1. 
  filter(value != 0) %>%
  # make the tibble wide
  # with one record per name-sex pair
  # and as many columns as there are in key
  # with their values from value
  # and filling NA values to 0
  spread(key, value, fill = 0) %>%
  # for each name-sex pair
  # cast the dummy variables into logical vectors
  group_by(name, sex) %>%
  # (funs() is deprecated in dplyr; a formula lambda does the same job)
  mutate_all(~ as.logical(as.integer(.))) %>%
  ungroup() %>%
  # just for safety let's join
  # the original "good_at" column
  left_join(y = df, by = c("name", "sex")) %>%
  # bring the original "good_at" column to the left-hand side 
  # of the tibble
  select(name, sex, good_at, matches("good_at_"))

# view result ----
df_dummies
# A tibble: 4 x 9
#   name  sex   good_at good_at_Cooking good_at_Drawing good_at_Hiking
#   <chr> <chr> <chr>   <lgl>           <lgl>           <lgl>         
# 1 Char… M     Swimmi… FALSE           FALSE           FALSE         
# 2 Mary  F     Cookin… TRUE            FALSE           FALSE         
# 3 Sam   M     Running FALSE           FALSE           FALSE         
# 4 Tom   M     Drawin… FALSE           TRUE            TRUE          
# ... with 3 more variables: good_at_Joking <lgl>, good_at_Running <lgl>,
#   good_at_Swimming <lgl>

# end of script #
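
For comparison, the same split-then-encode result can be sketched in pandas. This is an illustrative alternative, not a translation of the pipeline above: it relies on Series.str.get_dummies to split and encode in one step, and the data frame is rebuilt from the example values.

```python
import pandas as pd

# Example data rebuilt from the table in the R script above.
df = pd.DataFrame({
    "name": ["Tom", "Mary", "Sam", "Charlie"],
    "sex": ["M", "F", "M", "M"],
    "good_at": ["Drawing;Hiking", "Cooking;Joking", "Running", "Swimming"],
})

# Split on ';', create one indicator per unique value, cast to logical,
# and prefix the new columns so they resemble good_at_Drawing, etc.
dummies = (
    df["good_at"]
    .str.get_dummies(sep=";")
    .astype(bool)
    .add_prefix("good_at_")
)

# Keep the original columns on the left, dummies on the right.
df_dummies = df.join(dummies)
print(df_dummies)
```

Unlike the R output, rows stay in their original order here rather than being sorted by name.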

Create a dummy indicating presence of string fragment in any of multiple variables

A base R solution:

cols = endsWith(names(df), "line")
df['Apartment_dummy'] = as.integer(grepl('apartment', do.call(paste, df[cols])))

Now we can write a function that takes the data as an argument as well:

premises_dummy = function(df, varname, strings) {
  # Select the columns whose names end in "line"
  cols = endsWith(names(df), "line")
  # Paste those columns together row-wise and flag rows matching `strings`
  df[varname] = as.integer(grepl(strings, do.call(paste, df[cols])))
  df
}
premises_dummy(df, varname = 'Apartment_dummy', strings = 'apartment')
    address.1.line address.2.line address.3.line Apartment_dummy
1      apartment 5         london                              1
2 25 spring street       new york                              0
3       nice house    apartment 2          paris               1
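
A pandas sketch of the same approach, for readers not using R, could look like this. The function and argument names mirror the R version but are otherwise arbitrary, and the data frame is rebuilt from the output shown above, with empty strings assumed for the blank cells.

```python
import pandas as pd

# Example data reconstructed from the printed result (blank cells assumed empty).
df = pd.DataFrame({
    "address.1.line": ["apartment 5", "25 spring street", "nice house"],
    "address.2.line": ["london", "new york", "apartment 2"],
    "address.3.line": ["", "", "paris"],
})

def premises_dummy(data, varname, pattern):
    """Return a copy of data with a 0/1 column flagging rows that match pattern."""
    # Select the columns whose names end in "line".
    cols = [c for c in data.columns if c.endswith("line")]
    # Join those columns row-wise into one string per row.
    combined = data[cols].fillna("").astype(str).agg(" ".join, axis=1)
    out = data.copy()
    # pattern is treated as a regular expression, as in grepl().
    out[varname] = combined.str.contains(pattern).astype(int)
    return out

result = premises_dummy(df, varname="Apartment_dummy", pattern="apartment")
print(result)
```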

Dummy variable columns based on strings from other columns

One tidyverse possibility could be:

example %>%
 mutate(treatment2 = strsplit(treatment, "+", fixed = TRUE)) %>%
 unnest(treatment2) %>%
 spread(treatment2, treatment2) %>%
 mutate_at(vars(-id_number, -treatment), ~ (!is.na(.)) * 1)

  id_number treatment A B C D
1         0         A 1 0 0 0
2         1   A+B+C+D 1 1 1 1
3         2       C+B 0 1 1 0
4         3       B+A 1 1 0 0
5         4         C 0 0 1 0

Or:

example %>%
 mutate(treatment2 = strsplit(treatment, "+", fixed = TRUE)) %>%
 unnest(treatment2) %>%
 mutate(val = 1) %>%
 spread(treatment2, val, fill = 0)
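
For comparison, a pandas sketch of the same result is shown below. It assumes the example data matches the printed table above, and uses Series.str.get_dummies with "+" as the separator instead of the unnest-and-spread steps.

```python
import pandas as pd

# Example data rebuilt from the printed result above.
example = pd.DataFrame({
    "id_number": [0, 1, 2, 3, 4],
    "treatment": ["A", "A+B+C+D", "C+B", "B+A", "C"],
})

# Split each treatment string on '+' and create one 0/1 column per letter.
dummies = example["treatment"].str.get_dummies(sep="+")

# Attach the indicator columns to the original data.
result = example.join(dummies)
print(result)
```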

Convert an entire string to a dummy variable

@Alex97:

The function pd.get_dummies should solve your problem.

Solution

df  # your DataFrame, as shown in your picture

# generate dummies
dum_winners = pd.get_dummies(df.Winner, prefix="win", prefix_sep=" ")
dum_losers = pd.get_dummies(df.Loser, prefix="los", prefix_sep=" ")

# update df with dummies
df = pd.concat([df, dum_winners, dum_losers], axis=1)\
    .drop(columns=["Winner", "Loser"])

The solution above takes as input:

   Other Features    Winner     Loser
0               2   John D.  Jason S.
1               4  Jason S.   Eric N.

and returns:

   Other Features  win Jason S.  win John D.  los Eric N.  los Jason S.
0               2             0            1            0             1
1               4             1            0            1             0

Details

Relevant documentation:

pandas.concat, to merge DataFrames
pandas.get_dummies, to generate a one-hot encoded matrix
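
A self-contained version of the solution, with the two-row example input built explicitly, might look like the sketch below. One assumption is added: dtype=int is passed so the dummies print as 0/1, since recent pandas versions return Boolean columns by default.

```python
import pandas as pd

# The two-row example input shown above.
df = pd.DataFrame({
    "Other Features": [2, 4],
    "Winner": ["John D.", "Jason S."],
    "Loser": ["Jason S.", "Eric N."],
})

# One indicator column per distinct winner and per distinct loser.
dum_winners = pd.get_dummies(df["Winner"], prefix="win", prefix_sep=" ", dtype=int)
dum_losers = pd.get_dummies(df["Loser"], prefix="los", prefix_sep=" ", dtype=int)

# Replace the two string columns with their dummies.
encoded = pd.concat(
    [df.drop(columns=["Winner", "Loser"]), dum_winners, dum_losers], axis=1
)
print(encoded)
```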

Create dummy and categorical variables from specific word(s) in text column in Python dataframe

Here's my take on this.

Since you're dealing with text, pandas.Series.str.contains should be plenty (no need to use re.search).

np.where and np.select are useful when it comes to assigning new variables based on conditions.

import pandas as pd
import numpy as np

Cars_listing = pd.DataFrame({
    'Cars_notes': 
    ['"This Audi has ABS braking, leather interior and bucket seats..."',
    '"The Ford F150 is one tough pickup truck, with 4x4, new suspension and club cab..."',
    '"Our Nissan Sentra comes with ABS brakes, Bluetooth-enabled radio..."',
    '"This Toyota Corolla is a gem, with new tires, low miles, a few scratches..."',
    '"The Renault Le Car has been sitting in the garage, a little rust..."',
    '"The Kia Sorento for sale has a CD player, new tires..."',
    '"Red Dodge Viper convertible for sale, ceramic brakes, low miles..."']
})


# 1. car_type
Cars_listing['car_type'] = np.select(
    condlist=[ # note you could use the case-insensitive search with `case=False`
        Cars_listing['Cars_notes'].str.contains('ford', case=False),
        Cars_listing['Cars_notes'].str.contains('audi|renault', case=False),
        Cars_listing['Cars_notes'].str.contains('Toyota|Kia')
    ],
    choicelist=[1, 2, 3], # dummy variables
    default=0 # you could set it to `np.nan` etc
)

# 2. ABS_brakes
Cars_listing['ABS_brakes'] = np.where(# where(condition, [x, y])
    Cars_listing['Cars_notes'].str.contains('ABS brak'), 1, 0)

# 3. imperfection
Cars_listing['imperfection'] = np.where(
    Cars_listing['Cars_notes'].str.contains('rust|scratches'), 1, 0)

# 4. sporty
Cars_listing['sporty'] = np.where(
    Cars_listing['Cars_notes'].str.contains('convertible'), 1, 0)

    Cars_notes              car_type    ABS_brakes  imperfection    sporty
0   """This Audi has ..."   2           1           0               0
1   """The Ford F150 ..."   1           0           0               0
2   """Our Nissan Sen..."   0           1           0               0
3   """This Toyota Co..."   3           0           1               0
4   """The Renault Le..."   2           0           1               0
5   """The Kia Sorent..."   3           0           0               0
6   """Red Dodge Vipe..."   0           0           0               1
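
When there are many simple 0/1 flags, one possible variation is to keep the patterns in a dictionary and loop over it, converting the Boolean result with astype(int) instead of np.where. The sketch below is only an illustration: it uses three of the notes from above, and the dictionary name patterns is made up.

```python
import pandas as pd

# A shortened copy of the example data (three of the seven notes).
cars = pd.DataFrame({
    "Cars_notes": [
        "This Audi has ABS braking, leather interior and bucket seats...",
        "This Toyota Corolla is a gem, with new tires, low miles, a few scratches...",
        "Red Dodge Viper convertible for sale, ceramic brakes, low miles...",
    ]
})

# Hypothetical mapping: new column name -> regular expression to search for.
patterns = {
    "ABS_brakes": "ABS brak",
    "imperfection": "rust|scratches",
    "sporty": "convertible",
}

# Create one 0/1 column per pattern.
for column, pattern in patterns.items():
    cars[column] = cars["Cars_notes"].str.contains(pattern).astype(int)

print(cars[list(patterns)])
```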
