How to Create a New Column Based on Multiple Conditions from Multiple Columns

How to create a new column based on multiple conditions in another column

You can use pandas' Series.shift to create A_(i-1) (the previous row's value) and numpy.select to check multiple conditions, as shown below:

import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [5, 12, 14, 22, 20, 33, 11, 8, 15, 11]})
df['A_prv'] = df['A'].shift(1)  # previous row's A; NaN in the first row

# For each row, np.select returns the choice of the first condition that is True
conditions = [
    (df.index == 0),
    ((df['A_prv'] - df['A'] >= 0) & (df['A'].le(10))),
    ((df['A_prv'] - df['A'] >= 2) & (df['A'].between(10, 20, inclusive='right'))),
    # ^^^ 10 < df['A'] <= 20 ^^^
    ((df['A_prv'] - df['A'] >= 5) & (df['A'].ge(20)))
]
choices = [2, 1, 1, 1]
df['B'] = np.select(conditions, choices, default=0)
print(df)

Output:

    A  A_prv  B
0   5    NaN  2
1  12    5.0  0
2  14   12.0  0
3  22   14.0  0
4  20   22.0  1
5  33   20.0  0
6  11   33.0  1
7   8   11.0  1
8  15    8.0  0
9  11   15.0  1
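
Two details of the answer above are easy to miss: np.select returns the choice of the first condition that is True, and any comparison against the NaN produced by shift is False (which is why the first row gets its own condition). The following is only a minimal sketch on a made-up three-value series, not the data from the question:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, chosen only to illustrate the two points above.
a = pd.Series([5, 12, 25])
prev = a.shift(1)  # NaN, 5.0, 12.0

# 25 satisfies both conditions; np.select keeps the first one that matches.
first_match = np.select([a > 10, a > 20], [1, 2], default=0)

# Putting the stricter condition first changes the result for 25.
strict_first = np.select([a > 20, a > 10], [2, 1], default=0)

# NaN - 5 is NaN, and NaN >= 0 evaluates to False rather than raising.
nan_compare = (prev - a) >= 0

print(first_match.tolist())   # [0, 1, 1]
print(strict_first.tolist())  # [0, 1, 2]
print(nan_compare.tolist())   # [False, False, False]
```

If the conditions can overlap, it is probably safest to list the most specific one first.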

How do I create a new column based on multiple conditions from multiple columns?

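We can use %in% to compare a column against multiple values, and & to check that all the conditions are TRUE. Adding 1 to the logical result turns FALSE/TRUE into the index 1/2, which picks "" or "Yes" from the vector.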

library(dplyr)
df %>%
  mutate(get.flyer = c("", "Yes")[(commute %in% c("walk", "bike", "subway", "ferry") &
                                     as.character(kids) == "Yes" &
                                     as.numeric(as.character(distance)) < 10) + 1])

It is better to create the data.frame with stringsAsFactors=FALSE, because the default was TRUE before R 4.0.0. If we check str(df), we can see that all the columns are of class factor. Also, if there are missing values, NA can be used instead of "" to avoid converting a numeric column to another class.

If we rewrite the creation of 'df'

distance <- c(1, 12, 5, 25, 7, 2, NA, 8, 19, 7, NA, 4, 16, 12, 7)
df1 <- data.frame(commute, kids, distance, stringsAsFactors=FALSE)

the above code can be simplified to:

df1 %>%
  mutate(get.flyer = c("", "Yes")[(commute %in% c("walk", "bike", "subway", "ferry") &
                                     kids == "Yes" &
                                     distance < 10) + 1])

For readability, some people prefer ifelse:

df1 %>%
  mutate(get.flyer = ifelse(commute %in% c("walk", "bike", "subway", "ferry") &
                              kids == "Yes" &
                              distance < 10,
                            "Yes", ""))

This can also be done easily with base R methods:

df1$get.flyer <- with(df1, ifelse(commute %in% c("walk", "bike", "subway", "ferry") &
                                    kids == "Yes" &
                                    distance < 10,
                                  "Yes", ""))
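
For readers working in pandas rather than R, the same pattern can be sketched with isin (the counterpart of %in%), & and np.where. The rows below are hypothetical, since the question's actual commute and kids values are not shown here, and the column name get_flyer is an assumption. One difference worth noting: in pandas a missing distance compares as False and yields "", whereas the R versions above yield NA for that row.

```python
import numpy as np
import pandas as pd

# Hypothetical sample rows; not the data from the question.
df1 = pd.DataFrame({
    "commute": ["walk", "car", "bike", "subway", "ferry"],
    "kids": ["Yes", "Yes", "No", "Yes", "Yes"],
    "distance": [1.0, 5.0, 3.0, 12.0, np.nan],
})

eligible = (
    df1["commute"].isin(["walk", "bike", "subway", "ferry"])  # like %in% in R
    & (df1["kids"] == "Yes")
    & (df1["distance"] < 10)  # NaN < 10 is False in pandas
)
df1["get_flyer"] = np.where(eligible, "Yes", "")

print(df1["get_flyer"].tolist())  # ['Yes', '', '', '', '']
```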

How do I assign values based on multiple conditions for existing columns?

You can do this using np.where. The conditions are combined with the bitwise operators & (and) and | (or), and each comparison must be wrapped in parentheses because & and | bind more tightly than ==. Where the combined condition is true, 5 is returned; otherwise 0:

In [29]:
df['points'] = np.where(
    ((df['gender'] == 'male') & (df['pet1'] == df['pet2']))
    | ((df['gender'] == 'female') & (df['pet1'].isin(['cat', 'dog']))),
    5, 0)
df

Out[29]:
     gender      pet1      pet2  points
0      male       dog       dog       5
1      male       cat       cat       5
2      male       dog       cat       0
3    female       cat  squirrel       5
4    female       dog       dog       5
5    female  squirrel       cat       0
6  squirrel       dog       cat       0
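
To make the snippet above runnable on its own, here is a sketch that rebuilds the frame from the rows shown in the output and names each sub-condition before combining them. The variable names male_match and female_cat_dog are illustrative, not part of the original answer.

```python
import numpy as np
import pandas as pd

# Rows reconstructed from the output table above (without the points column).
df = pd.DataFrame({
    "gender": ["male", "male", "male", "female", "female", "female", "squirrel"],
    "pet1": ["dog", "cat", "dog", "cat", "dog", "squirrel", "dog"],
    "pet2": ["dog", "cat", "cat", "squirrel", "dog", "cat", "cat"],
})

# Naming each sub-condition keeps the parentheses manageable.
male_match = (df["gender"] == "male") & (df["pet1"] == df["pet2"])
female_cat_dog = (df["gender"] == "female") & df["pet1"].isin(["cat", "dog"])

df["points"] = np.where(male_match | female_cat_dog, 5, 0)

print(df["points"].tolist())  # [5, 5, 0, 5, 5, 0, 0]
```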

Create new column based on multiple conditions in multiple columns

With your large (> 40 million rows) data set, the data.table package might be a good choice:

library(data.table)

cond1 <- c("D95", "A01")
setDT(df)[, condit := ifelse(any(icpc %in% cond1 | icpc2 %in% cond1), "yes","no"), by=id]
df

     id icpc icpc2  reg.date condit
 1: 123  D95   F15 19JUN2015    yes
 2: 123  F85       15AUG2016    yes
 3: 332  A01       16MAR2010    yes
 4: 332  A04       20JAN2018    yes
 5: 332  K20       20FEB2017    yes
 6: 100  B10       01JUN2017     no
 7: 100  A04       11JAN2008     no
 8: 113  T08       18MAR2018    yes
 9: 113  P28       19JAN2017    yes
10: 113  D95   A01 16JAN2013    yes
11: 113  A04       01MAY2009    yes
12: 551  B12   A01 03APR2011    yes
13: 551  D95       09MAY2015    yes

Data:

df <- structure(list(id = c(123L, 123L, 332L, 332L, 332L, 100L, 100L, 
113L, 113L, 113L, 113L, 551L, 551L), icpc = c("D95", "F85", "A01",
"A04", "K20", "B10", "A04", "T08", "P28", "D95", "A04", "B12",
"D95"), icpc2 = c("F15", "", "", "", "", "", "", "", "", "A01",
"", "A01", ""), reg.date = c("19JUN2015", "15AUG2016", "16MAR2010",
"20JAN2018", "20FEB2017", "01JUN2017", "11JAN2008", "18MAR2018",
"19JAN2017", "16JAN2013", "01MAY2009", "03APR2011", "09MAY2015"
)), class = "data.frame", row.names = c(NA, -13L))
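
For comparison, the same "flag every row of an id if any of its rows matches" logic can be sketched in pandas with groupby and transform("any"). This is an illustrative translation rather than part of the original answer, and it says nothing about how pandas would perform on 40 million rows. Only the three columns needed for the condition are copied from the data above.

```python
import numpy as np
import pandas as pd

# id, icpc and icpc2 copied from the data above; reg.date is not needed here.
df = pd.DataFrame({
    "id": [123, 123, 332, 332, 332, 100, 100, 113, 113, 113, 113, 551, 551],
    "icpc": ["D95", "F85", "A01", "A04", "K20", "B10", "A04",
             "T08", "P28", "D95", "A04", "B12", "D95"],
    "icpc2": ["F15", "", "", "", "", "", "", "", "", "A01", "", "A01", ""],
})

cond1 = ["D95", "A01"]

# Row-level test: does either code column contain one of the target codes?
row_hit = df["icpc"].isin(cond1) | df["icpc2"].isin(cond1)

# Group-level test: broadcast "any row in this id matched" back to every row.
group_hit = row_hit.groupby(df["id"]).transform("any")

df["condit"] = np.where(group_hit, "yes", "no")
print(df["condit"].tolist())
```

With this data, only the two rows of id 100 should end up as "no", matching the data.table output.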

Edit: for multiple conditions:

cond1 <- c("D95", "A01") # A
cond2 <- c("A04", "T08") # B
cond3 <- "B10" # C

setDT(df)[, condit := if(any(icpc %in% cond1 | icpc2 %in% cond1)) "A" else
if(any(icpc %in% cond2 | icpc2 %in% cond2)) "B" else
if(any(icpc %in% cond3 | icpc2 %in% cond3)) "C" else "", by=id]

     id icpc icpc2  reg.date condit
 1: 123  D95   F15 19JUN2015      A
 2: 123  F85       15AUG2016      A
 3: 332  A01       16MAR2010      A
 4: 332  A04       20JAN2018      A
 5: 332  K20       20FEB2017      A
 6: 100  B10       01JUN2017      B
 7: 100  A04       11JAN2008      B
 8: 113  T08       18MAR2018      A
 9: 113  P28       19JAN2017      A
10: 113  D95   A01 16JAN2013      A
11: 113  A04       01MAY2009      A
12: 551  B12   B10 03APR2011      C
13: 551  D96       09MAY2015      C

Data (slightly modified from the original, since no "C" condition was found):

df <- structure(list(id = c(123L, 123L, 332L, 332L, 332L, 100L, 100L, 
113L, 113L, 113L, 113L, 551L, 551L), icpc = c("D95", "F85", "A01",
"A04", "K20", "B10", "A04", "T08", "P28", "D95", "A04", "B12",
"D96"), icpc2 = c("F15", "", "", "", "", "", "", "", "", "A01",
"", "B10", ""), reg.date = c("19JUN2015", "15AUG2016", "16MAR2010",
"20JAN2018", "20FEB2017", "01JUN2017", "11JAN2008", "18MAR2018",
"19JAN2017", "16JAN2013", "01MAY2009", "03APR2011", "09MAY2015"
)), class = "data.frame", row.names = c(NA, -13L))
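
The prioritised A / B / C labelling can also be sketched in pandas by computing one group-level flag per code set and letting np.select pick the first one that is True. The helper group_has is a hypothetical name introduced for this illustration; the data is the modified data above, again without reg.date.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "id": [123, 123, 332, 332, 332, 100, 100, 113, 113, 113, 113, 551, 551],
    "icpc": ["D95", "F85", "A01", "A04", "K20", "B10", "A04",
             "T08", "P28", "D95", "A04", "B12", "D96"],
    "icpc2": ["F15", "", "", "", "", "", "", "", "", "A01", "", "B10", ""],
})


def group_has(codes):
    """True for every row whose id group has one of `codes` in either column."""
    row_hit = df["icpc"].isin(codes) | df["icpc2"].isin(codes)
    return row_hit.groupby(df["id"]).transform("any")


# Order matters: an id matching both the A and B code sets is labelled "A".
conditions = [
    group_has(["D95", "A01"]),  # A
    group_has(["A04", "T08"]),  # B
    group_has(["B10"]),         # C
]
df["condit"] = np.select(conditions, ["A", "B", "C"], default="")
print(df["condit"].tolist())
```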

Tested on a data frame with 40M rows:
system.time(...)

#    user  system elapsed
#  111.11    1.17  111.97

Using dplyr:

# Error: cannot allocate vector of size 274.7 Mb
# Timing stopped at: 4.19 1.11 5.39

New column based on conditions of multiple columns

Consider the approach below:

select *,
  case row_number() over(partition by person_id order by date nulls last) * if(date is null, 0, 1)
    when 0 then 'incomplete'
    when 1 then 'start'
    when 2 then 'in progress'
    when 3 then 'completed'
    else 'game over'
  end status
from data

If applied to the sample data in your question, the output is:

[Image: query output]

It is not 100% clear from your question, but I think you may want to count occurrences not just by person_id but also by activity. If so, just add activity to the partition clause, as in partition by person_id, activity
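
As a rough pandas counterpart to the query above, the row number within each person_id can be emulated by sorting and calling groupby(...).cumcount(), then forcing it to 0 where the date is missing. The rows below are hypothetical, because the question's table is not reproduced here; the column names person_id and date are taken from the query.

```python
import pandas as pd

# Hypothetical rows, not the data from the question.
df = pd.DataFrame({
    "person_id": [1, 1, 1, 1, 1, 2, 2],
    "date": pd.to_datetime([
        "2021-01-03", "2021-01-01", None, "2021-01-02", "2021-01-04",
        None, "2021-02-01",
    ]),
})

# Emulate row_number() over (partition by person_id order by date nulls last).
ordered = df.sort_values(["person_id", "date"], na_position="last")
row_number = ordered.groupby("person_id").cumcount() + 1

# Same effect as multiplying by if(date is null, 0, 1).
row_number = row_number.where(ordered["date"].notna(), 0)

labels = {0: "incomplete", 1: "start", 2: "in progress", 3: "completed"}
# Assignment aligns on the index, so the original row order is preserved.
df["status"] = row_number.map(labels).fillna("game over")

print(df["status"].tolist())
```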

Create a new column based on multiple conditions in other columns in R

Using dplyr's case_when, the solution would look like this:

dat %>%
  mutate(col3 = case_when(
    col1 != 0 & col2 != 0 ~ 1,
    TRUE ~ 2
  ))

Mutate a new column based on multiple conditions in R

Use case_when:

library(dplyr)

df %>%
  mutate(direction = case_when(city %in% c('bj', 'tj') ~ 'north',
                               city %in% c('sz', 'nj', 'sh') ~ 'east',
                               city %in% c('xa', 'lz') ~ 'west',
                               city %in% c('wh') ~ 'center',
                               # 'sz' is already matched by the 'east' rule above,
                               # so only 'gz' reaches this one
                               city %in% c('gz', 'sz') ~ 'south'))

#  id city direction
#1  1   bj     north
#2  2   tj     north
#3  3   gz     south
#4  4   sz      east
#5  5   nj      east
#6  6   xa      west
#7  7   lz      west
#8  8   wh    center
#9  9   sh      east
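
When every rule is a plain membership test on one column, as here, a lookup table is another option. The sketch below is a pandas illustration, not part of the original answer. It builds a dictionary from the same rules in the same order and keeps the first rule that mentions a city, which mirrors case_when's first-match behaviour for 'sz'. The names rules and lookup are made up for this example.

```python
import pandas as pd

# id and city copied from the output above.
df = pd.DataFrame({
    "id": range(1, 10),
    "city": ["bj", "tj", "gz", "sz", "nj", "xa", "lz", "wh", "sh"],
})

# Rules in the same order as the case_when call above.
rules = [
    (["bj", "tj"], "north"),
    (["sz", "nj", "sh"], "east"),
    (["xa", "lz"], "west"),
    (["wh"], "center"),
    (["gz", "sz"], "south"),
]

lookup = {}
for cities, direction in rules:
    for city in cities:
        # setdefault keeps the first rule that mentions a city.
        lookup.setdefault(city, direction)

# Cities without a rule would become NaN, much like NA in case_when.
df["direction"] = df["city"].map(lookup)

print(df["direction"].tolist())
```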

