How to Create a Single Dummy Variable with Conditions in Multiple Columns

You can use rowSums (a vectorized solution) like this:

set.seed(123)
dat <- matrix(sample(c(35, 1:100), size = 15 * 20, replace = TRUE), ncol = 15, byrow = TRUE)
# Column 16 is 1 when any of columns 9 to 15 equals 35, otherwise 0
cbind(dat, rowSums(dat[, 9:15] == 35) > 0)
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14] [,15] [,16]
[1,] 29 79 41 89 94 4 53 90 55 46 96 45 68 57 10 0
[2,] 90 24 4 33 96 89 69 64 100 66 71 54 60 29 14 0
[3,] 97 91 69 80 2 48 76 21 32 23 14 41 41 37 15 0
[4,] 14 23 47 26 86 4 44 80 12 56 20 12 76 90 37 0
[5,] 67 9 38 27 82 45 81 82 80 44 76 63 71 35 48 1
[6,] 22 38 61 35 11 24 67 42 79 10 43 99 90 89 17 0
[7,] 13 65 34 66 32 18 79 9 47 51 60 33 49 96 48 0
[8,] 89 92 61 41 14 94 30 6 95 72 14 55 96 59 40 0
[9,] 65 32 31 22 37 99 15 9 14 69 62 90 67 74 52 0
[10,] 66 83 79 98 44 31 41 1 18 85 23 24 7 24 73 0
[11,] 85 50 39 24 11 39 57 21 44 22 50 35 65 37 35 1
[12,] 53 74 22 41 26 63 18 87 75 67 62 37 53 88 58 0
[13,] 84 31 71 26 60 48 26 57 92 91 27 32 99 62 94 0
[14,] 47 41 66 15 57 24 97 60 52 40 88 36 29 17 17 0
[15,] 48 25 21 68 4 70 35 41 82 92 28 97 73 69 5 0
[16,] 39 48 56 70 92 62 43 54 5 26 40 19 84 15 81 0
[17,] 55 66 17 63 31 73 40 97 97 73 25 22 59 27 53 0
[18,] 79 16 40 47 87 93 89 68 95 52 58 33 35 2 50 1
[19,] 87 35 7 16 77 74 98 47 7 65 76 13 40 22 5 0
[20,] 39 6 22 5 67 30 10 7 88 76 82 99 10 10 80 0

EDIT

I replaced cbind with transform. Because the new column would otherwise be logical, I coerce it to numeric to get 0/1.

transform(dat, x = as.numeric(rowSums(dat[, 9:15] == 35) > 0))

The result is a data.frame (transform coerces the matrix).

EDIT 2 (as suggested by @flodel)

data$indicator <- as.integer(rowSums(data[paste0("col", 9:15)] == 35) > 0)

where data is the OP's data.frame.
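
One caveat, shown as a small sketch on made-up data (the column names col9 to col11 and the values are hypothetical, not the OP's): if the columns can contain NA, the comparison yields NA and rowSums propagates it, so the indicator becomes NA for those rows. Passing na.rm = TRUE treats missing cells as non-matches, which may or may not be what you want.

```r
# Hypothetical toy data; col9..col11 stand in for the OP's col9..col15
data <- data.frame(col9  = c(35, 1, NA, 2),
                   col10 = c(3, 4, 35, NA),
                   col11 = c(5, 6, 7, 8))
cols <- paste0("col", 9:11)

# Without na.rm, a single NA in a row makes that row's indicator NA
data$indicator_raw <- as.integer(rowSums(data[cols] == 35) > 0)

# With na.rm = TRUE, missing cells are simply ignored
data$indicator <- as.integer(rowSums(data[cols] == 35, na.rm = TRUE) > 0)

data
```

Here row 3 contains both an NA and a 35, so it is NA in indicator_raw but 1 in indicator.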

Dummy variable with multiple conditions

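We can combine | (or) with & (and) to build the logical expression. Note the space in x < -90: written as x <- 90, R parses it as an assignment.

# as written, the y condition (y > 50 & y < 45) can never be TRUE; check its bounds
i1 <- with(df, (x > -100 & x < -90) | (x > -80 & x < -50) | (y > 50 & y < 45))
df$dummy_var <- 0  # start from 0 so unmatched rows are not left undefined
df$dummy_var[i1] <- 1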
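
To see the pattern on concrete numbers, here is a minimal sketch with a made-up data frame. The values are assumptions for illustration, and only two x ranges are used.

```r
# Hypothetical data, not from the original question
df <- data.frame(x = c(-95, -60, 0, -85))

# & joins the two bounds of one range; | joins the alternative ranges
in_range <- with(df, (x > -100 & x < -90) | (x > -80 & x < -50))

# Coerce TRUE/FALSE to 1/0
df$dummy_var <- as.integer(in_range)
df
```

The value -85 falls in the gap between the two ranges, so it gets 0.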

How to create dummy variable based on the value of two columns in R?

With tidyverse you could try the following.

Use group_by with Country to consider all the Time values within each Country.

For DummyTime123, all of the values 1, 2, and 3 must appear among the Time values within a Country. The unary + then converts the resulting TRUE/FALSE to 1/0.

For DummyTime23, it sounds like you want both 2 and 3 in Time but do not want any values of Time to be 1. Using & you can make sure both criteria are satisfied.

Let me know if this provides the results expected.

library(tidyverse)

df %>%
  group_by(Country) %>%
  mutate(DummyTime123 = +all(1:3 %in% Time),
         DummyTime23 = +(all(2:3 %in% Time) & !any(Time == 1)))

Output

  Country  Time DummyTime123 DummyTime23
  <chr>   <dbl>        <int>       <int>
1 US          1            1           0
2 US          1            1           0
3 US          2            1           0
4 US          3            1           0
5 IT          1            0           0
6 IT          2            0           0
7 IT          1            0           0
8 FR          2            0           1
9 FR          3            0           1
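
If you would rather avoid the tidyverse, roughly the same thing can be sketched in base R with ave(), which applies a function within each group and recycles a single result over the group's rows. The data frame below is rebuilt from the output shown above, so treat it as an assumption about the OP's data.

```r
# Data reconstructed from the printed output (an assumption)
df <- data.frame(
  Country = c("US", "US", "US", "US", "IT", "IT", "IT", "FR", "FR"),
  Time    = c(1, 1, 2, 3, 1, 2, 1, 2, 3)
)

# 1 when the country's Time values include all of 1, 2 and 3
df$DummyTime123 <- as.integer(
  ave(df$Time, df$Country, FUN = function(t) all(1:3 %in% t))
)

# 1 when the country has both 2 and 3 but never 1
df$DummyTime23 <- as.integer(
  ave(df$Time, df$Country, FUN = function(t) all(2:3 %in% t) && !any(t == 1))
)

df
```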

Add a new column having a dummy variable for complete group based on a condition

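With pandas (Python), you can also do it like this. Within each id, a rolling window of 3 rows sums col_1, and the whole group gets 1 if any window sums to 3 (that is, three consecutive 1s, assuming col_1 is 0/1).

df['col_2'] = (df.groupby('id')['col_1']
                 .transform(lambda x: x.rolling(3).sum().eq(3).any())
                 .astype(int))
df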

Output:

   id  date  col_1  col_2
0   A  2015      1      1
1   A  2016      1      1
2   A  2017      1      1
3   A  2018      0      1
4   B  2015      1      0
5   B  2016      0      0
6   B  2017      1      0
7   B  2018      1      0
8   C  2015      0      1
9   C  2016      1      1
10  C  2017      1      1
11  C  2018      1      1
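
For readers following along in R rather than pandas, the same idea (flag every row of a group that contains three consecutive 1s) can be sketched with rle() and ave(). This assumes col_1 holds only 0 and 1, as in the output above. has_run is a helper name made up for this example.

```r
# Data reconstructed from the printed output (an assumption)
df <- data.frame(
  id    = rep(c("A", "B", "C"), each = 4),
  date  = rep(2015:2018, times = 3),
  col_1 = c(1, 1, 1, 0,  1, 0, 1, 1,  0, 1, 1, 1)
)

# Hypothetical helper: TRUE if x contains a run of at least n consecutive 1s
has_run <- function(x, n = 3) {
  r <- rle(x)
  any(r$lengths[r$values == 1] >= n)
}

# ave() recycles the single TRUE/FALSE over all rows of each id
df$col_2 <- as.integer(ave(df$col_1, df$id, FUN = has_run))
df
```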

Creating a dummy variable based on whether words appear in multiple columns

base R

found <- sapply(dat[c("protesterdemand1", "protesterdemand2", "protesterdemand3", "protesterdemand4")],
                grepl, pattern = "political behavior|police brutality|removal of politician",
                ignore.case = TRUE) # ignore.case is just in case, over to you
found
#      protesterdemand1 protesterdemand2 protesterdemand3 protesterdemand4
# [1,]             TRUE            FALSE            FALSE            FALSE
# [2,]             TRUE            FALSE            FALSE            FALSE
# [3,]             TRUE            FALSE            FALSE            FALSE
# [4,]            FALSE            FALSE            FALSE            FALSE
# [5,]             TRUE            FALSE            FALSE            FALSE
# [6,]             TRUE            FALSE            FALSE            FALSE

dat$sensitive_issue <- rowSums(found) > 0

dat
# Country COWcode Year Region Protest protesterviolence protesterdemand1 protesterdemand2 protesterdemand3
# 1 Canada 20 1990 North America 1 0 political behavior, process labor wage dispute
# 2 Canada 20 1990 North America 1 0 political behavior, process
# 3 Canada 20 1990 North America 1 0 political behavior, process
# 4 Canada 20 1990 North America 1 1 land farm issue
# 5 Canada 20 1990 North America 1 1 political behavior, process
# 6 Canada 20 1990 North America 1 0 police brutality
# protesterdemand4 stateresponse1 stateresponse2 stateresponse3 stateresponse4 stateresponse5 stateresponse6 stateresponse7
# 1 ignore
# 2 ignore
# 3 ignore
# 4 accomodation
# 5 crowd dispersal arrests accomodation
# 6 crowd dispersal shootings
# participants participants_category sensitive_issue
# 1 1000s TRUE
# 2 1000 TRUE
# 3 500 TRUE
# 4 100s FALSE
# 5 950 TRUE
# 6 200 TRUE
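
Here is the same technique as a self-contained sketch you can paste into a console. The column names demand1 and demand2 and the three rows are made up for illustration, and the result is coerced to 0/1 in case you want a numeric dummy rather than TRUE/FALSE.

```r
# Hypothetical miniature of the protest data
dat <- data.frame(
  demand1 = c("political behavior, process", "land farm issue", "Police Brutality"),
  demand2 = c("labor wage dispute", "", "removal of politician"),
  stringsAsFactors = FALSE
)

pattern <- "political behavior|police brutality|removal of politician"

# One logical column per searched column, one row per observation
found <- sapply(dat[c("demand1", "demand2")], grepl,
                pattern = pattern, ignore.case = TRUE)

# 1 if the pattern matched in any of the searched columns
dat$sensitive_issue <- as.integer(rowSums(found) > 0)
dat
```

Note that ignore.case = TRUE is what lets "Police Brutality" match in the third row.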

Create dummy variables for every unique value in a column based on a condition from a second column in R

Here is a crude way to do this:

df <- data.frame(
  country = c("Australia", "Australia", "Australia", "Angola", "Angola", "Angola", "US", "US", "US"),
  year = c("1945", "1946", "1947"), # recycled across the three countries
  leader = c("David", "NA", "NA", "NA", "Henry", "NA", "Tom", "NA", "Chris"),
  natural.death = c(0, NA, NA, NA, 1, NA, 1, NA, 0),
  gdp.growth.rate = c(1, 4, 3, 5, 6, 1, 5, 7, 9)
)

tmp <- which(df$natural.death == 1) # row indices of natural deaths
lng <- length(tmp) # number of natural deaths

# create a matrix of zeros with lng columns and append it to df
df <- cbind(df, data.frame(matrix(0, nrow = nrow(df), ncol = lng)))
# rename the newly added columns
colnames(df)[(ncol(df) - lng + 1):ncol(df)] <- paste0("id", seq_len(lng))

for (i in seq_len(lng)) { # loop over the new columns
  df[tmp[i], paste0("id", i)] <- 1 # set column id<i> to 1 in the row of the i-th death
}

country year leader natural.death gdp.growth.rate id1 id2
1 Australia 1945 David 0 1 0 0
2 Australia 1946 NA NA 4 0 0
3 Australia 1947 NA NA 3 0 0
4 Angola 1945 NA NA 5 0 0
5 Angola 1946 Henry 1 6 1 0
6 Angola 1947 NA NA 1 0 0
7 US 1945 Tom 1 5 0 1
8 US 1946 NA NA 7 0 0
9 US 1947 Chris 0 9 0 0
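
The loop can also be avoided. As a sketch on a smaller made-up data frame (the values are assumptions), you can build the whole block of dummy columns at once and fill it using matrix indexing, where each (row, column) pair marks one cell to set:

```r
# Hypothetical miniature of the leaders data
df <- data.frame(country = c("Angola", "Angola", "US", "US"),
                 natural.death = c(0, 1, 1, NA))

idx <- which(df$natural.death == 1)  # rows with a natural death (NA is dropped)

# One zero-filled column per death, named id1, id2, ...
dummies <- matrix(0L, nrow = nrow(df), ncol = length(idx),
                  dimnames = list(NULL, paste0("id", seq_along(idx))))

# Two-column index matrix: the k-th death row gets a 1 in the k-th column
dummies[cbind(idx, seq_along(idx))] <- 1L

df <- cbind(df, dummies)
df
```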

Building dummy variable with many conditions (R)

Welcome to the world of code! R's syntax can be tricky (even for experienced coders) and dplyr adds its own quirks. First off, it's useful when you ask questions to provide code that other people can run in order to be able to reproduce your data. You can learn more about that here.

Are you trying to create code that works for all possible values of DOB and ATTx? In other words, do you have a whole bunch of variables that start with ATT and you want to look at all of them? That format is called wide data, and R works much better with long data. Fortunately, the reshape2 package can convert wide data to long with melt(). The code below creates a dummy variable with a value of 1 for people who were in school when they were either 19 or 20 years old.

# Load libraries 
library(dplyr)
library(reshape2)

# Create a sample dataset
ATT94 <- runif(500, min = 0, max = 1) %>% round(digits = 0)
ATT96 <- runif(500, min = 0, max = 1) %>% round(digits = 0)
ATT98 <- runif(500, min = 0, max = 1) %>% round(digits = 0)
DOB <- rnorm(500, mean = 1977, sd = 5) %>% round(digits = 0)
df <- cbind(DOB, ATT94, ATT96, ATT98) %>% data.frame()

# Recode ATTx variables with the actual year
df$ATT94[df$ATT94==1] <- 1994
df$ATT96[df$ATT96==1] <- 1996
df$ATT98[df$ATT98==1] <- 1998

# Melt the data into a long format and perform the requested analysis
df %>%
  melt(id.vars = "DOB") %>%
  as_tibble() %>%
  # parentheses are required: %in% binds tighter than -
  mutate(dummy = ifelse((value - DOB) %in% c(19, 20), 1, 0))
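
One detail worth checking in an expression like value - DOB %in% c(19, 20) is operator precedence: in R, %in% binds more tightly than binary minus, so without parentheses the membership test is applied to DOB alone. A tiny sketch with made-up numbers shows the difference:

```r
# Hypothetical values: attendance year (0 = did not attend) and year of birth
value <- c(1994, 1996, 0)
DOB   <- c(1975, 1970, 1980)

# Parsed as value - (DOB %in% c(19, 20)): a number, not a membership test
wrong <- value - DOB %in% c(19, 20)

# Age at attendance is computed first, then tested
right <- (value - DOB) %in% c(19, 20)

wrong
right
```

Only the first person was 19 at the time, so right is TRUE only in the first position, while wrong is just value with 0 subtracted.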

