How to Create a New Column Based on Multiple Conditions from Multiple Columns

How to create a new column based on multiple conditions in another column

You can use pandas Series.shift to create A_(i-1), the previous row's value, and numpy.select to evaluate the multiple conditions, as below:

import pandas as pd
import numpy as np

df = pd.DataFrame({'A':[5,12,14,22,20,33,11,8,15,11]})
df['A_prv'] = df['A'].shift(1)

conditions = [
    (df.index==0),
    ((df['A_prv'] - df['A'] >= 0) & (df['A'].le(10))),
    ((df['A_prv'] - df['A'] >= 2) & (df['A'].between(10, 20, inclusive='right'))),
                                     # ^^^  10 < df['A'] <= 20 ^^^
    ((df['A_prv'] - df['A'] >= 5) & (df['A'].ge(20)))
]
choices = [2, 1, 1, 1]
df['B'] = np.select(conditions, choices, default=0)
print(df)

Output:

    A  A_prv  B
0   5    NaN  2
1  12    5.0  0
2  14   12.0  0
3  22   14.0  0
4  20   22.0  1
5  33   20.0  0
6  11   33.0  1
7   8   11.0  1
8  15    8.0  0
9  11   15.0  1
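
Note that np.select checks the conditions in order and, for each row, takes the choice belonging to the first condition that is true. That matters above because the last two conditions overlap at A == 20. A minimal sketch of this behaviour, using made-up values and distinct labels (the frame demo and the labels 'mid', 'high', 'none' are illustrative assumptions, not part of the question):

```python
import numpy as np
import pandas as pd

# Illustrative data: the second row sits exactly on the shared boundary A == 20.
demo = pd.DataFrame({'A': [25, 20, 30, 18, 40, 30]})
demo['A_prv'] = demo['A'].shift(1)   # first row has no predecessor, so NaN
drop = demo['A_prv'] - demo['A']     # NaN compares as False in every condition

conditions = [
    (drop >= 2) & demo['A'].between(10, 20, inclusive='right'),  # 10 < A <= 20
    (drop >= 5) & demo['A'].ge(20),                              # A >= 20
]
# Distinct labels make it visible which condition was picked.
demo['rule'] = np.select(conditions, ['mid', 'high'], default='none')
print(demo)
```

The row with A == 20 satisfies both conditions but is labelled 'mid', because that condition comes first in the list. If the order were reversed, the same row would be labelled 'high'.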

How do I create a new column based on multiple conditions from multiple columns?

We can use %in% to test a column against several values, and & to require that all conditions are TRUE.

library(dplyr)
df %>%
     mutate(get.flyer = c("", "Yes")[(commute %in% c("walk", "bike", "subway", "ferry") & 
           as.character(kids) == "Yes" & 
           as.numeric(as.character(distance)) < 10)+1] )

It is better to create the data.frame with stringsAsFactors=FALSE, because before R 4.0.0 the default was TRUE. If we check str(df), we find that all the columns are of class factor. Also, if there are missing values, NA can be used instead of "" to avoid converting a numeric column to another class.

If we rewrite the creation of 'df' as

distance <- c(1, 12, 5, 25, 7, 2, NA, 8, 19, 7, NA, 4, 16, 12, 7)
df1 <- data.frame(commute, kids, distance, stringsAsFactors=FALSE)

the above code can be simplified to

df1 %>%
    mutate(get.flyer = c("", "Yes")[(commute %in% c("walk", "bike", "subway", "ferry") &
        kids == "Yes" &
        distance < 10)+1] )

For readability, some people prefer ifelse:

df1 %>% 
   mutate(get.flyer = ifelse(commute %in% c("walk", "bike", "subway", "ferry") & 
                kids == "Yes" &
                distance < 10, 
                          "Yes", ""))

This can also be done easily with base R methods:

df1$get.flyer <- with(df1, ifelse(commute %in% c("walk", "bike", "subway", "ferry") & 
              kids == "Yes" & 
              distance < 10, 
                       "Yes", ""))

How do I assign values based on multiple conditions for existing columns?

You can do this using np.where. The conditions use the bitwise operators & and | for "and" and "or", and each comparison is wrapped in parentheses because & and | bind more tightly than ==. Where the combined condition is true, 5 is returned; otherwise 0:

In [29]:
df['points'] = np.where(
    ((df['gender'] == 'male') & (df['pet1'] == df['pet2']))
    | ((df['gender'] == 'female') & (df['pet1'].isin(['cat', 'dog']))),
    5, 0)
df

Out[29]:
     gender      pet1      pet2  points
0      male       dog       dog       5
1      male       cat       cat       5
2      male       dog       cat       0
3    female       cat  squirrel       5
4    female       dog       dog       5
5    female  squirrel       cat       0
6  squirrel       dog       cat       0
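
For anyone who wants to run this end to end, here is a self-contained sketch. The frame is rebuilt from the output table above, and the intermediate mask names (male_same_pets, female_cat_dog) are my own additions for readability, not part of the original answer:

```python
import numpy as np
import pandas as pd

# Frame rebuilt from the output table above.
df = pd.DataFrame({
    'gender': ['male', 'male', 'male', 'female', 'female', 'female', 'squirrel'],
    'pet1':   ['dog', 'cat', 'dog', 'cat', 'dog', 'squirrel', 'dog'],
    'pet2':   ['dog', 'cat', 'cat', 'squirrel', 'dog', 'cat', 'cat'],
})

# Name each boolean mask so the final expression reads like the rule itself.
# Every comparison keeps its own parentheses before being combined with &.
male_same_pets = (df['gender'] == 'male') & (df['pet1'] == df['pet2'])
female_cat_dog = (df['gender'] == 'female') & df['pet1'].isin(['cat', 'dog'])

# 5 where either mask is True, 0 everywhere else.
df['points'] = np.where(male_same_pets | female_cat_dog, 5, 0)
print(df)
```

Splitting the condition into named masks gives the same result as the one-liner and tends to be easier to debug, since each mask can be printed on its own.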

Create new column based on multiple conditions in multiple columns

With your large (> 40 million rows) data set, the data.table package might be a good choice:

library(data.table)

cond1 <- c("D95", "A01")
setDT(df)[, condit := ifelse(any(icpc %in% cond1 | icpc2 %in% cond1), "yes","no"), by=id]
df

     id icpc icpc2  reg.date condit
 1: 123  D95   F15 19JUN2015    yes
 2: 123  F85       15AUG2016    yes
 3: 332  A01       16MAR2010    yes
 4: 332  A04       20JAN2018    yes
 5: 332  K20       20FEB2017    yes
 6: 100  B10       01JUN2017     no
 7: 100  A04       11JAN2008     no
 8: 113  T08       18MAR2018    yes
 9: 113  P28       19JAN2017    yes
10: 113  D95   A01 16JAN2013    yes
11: 113  A04       01MAY2009    yes
12: 551  B12   A01 03APR2011    yes
13: 551  D95       09MAY2015    yes

Data:

df <- structure(list(id = c(123L, 123L, 332L, 332L, 332L, 100L, 100L, 
113L, 113L, 113L, 113L, 551L, 551L), icpc = c("D95", "F85", "A01", 
"A04", "K20", "B10", "A04", "T08", "P28", "D95", "A04", "B12", 
"D95"), icpc2 = c("F15", "", "", "", "", "", "", "", "", "A01", 
"", "A01", ""), reg.date = c("19JUN2015", "15AUG2016", "16MAR2010", 
"20JAN2018", "20FEB2017", "01JUN2017", "11JAN2008", "18MAR2018", 
"19JAN2017", "16JAN2013", "01MAY2009", "03APR2011", "09MAY2015"
)), class = "data.frame", row.names = c(NA, -13L))
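
For comparison, the same "flag every row of an id if any of its rows matches" idea can be sketched in pandas. This is a swapped-in technique, not part of the data.table answer, and it has not been timed on 40M rows. The values are copied from the structure(...) call above, with reg.date left out for brevity:

```python
import numpy as np
import pandas as pd

# Same id / icpc / icpc2 values as the data above (reg.date omitted).
df = pd.DataFrame({
    'id':    [123, 123, 332, 332, 332, 100, 100, 113, 113, 113, 113, 551, 551],
    'icpc':  ['D95', 'F85', 'A01', 'A04', 'K20', 'B10', 'A04',
              'T08', 'P28', 'D95', 'A04', 'B12', 'D95'],
    'icpc2': ['F15', '', '', '', '', '', '', '', '', 'A01', '', 'A01', ''],
})
cond1 = ['D95', 'A01']

# Row-level test: does either code column hold one of the target codes?
row_hit = df['icpc'].isin(cond1) | df['icpc2'].isin(cond1)
# Group-level test: broadcast "any row of this id matched" back to every row.
group_hit = row_hit.groupby(df['id']).transform('any')
df['condit'] = np.where(group_hit, 'yes', 'no')
print(df)
```

Here transform('any') plays the role of any(...) with by=id: it computes one value per group and repeats it on each row of that group.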

Edit: for multiple conditions:

cond1 <- c("D95", "A01") # A
cond2 <- c("A04", "T08") # B
cond3 <- "B10"           # C

setDT(df)[, condit := if(any(icpc %in% cond1 | icpc2 %in% cond1)) "A" else 
                         if(any(icpc %in% cond2 | icpc2 %in% cond2)) "B" else
                            if(any(icpc %in% cond3 | icpc2 %in% cond3)) "C" else "", by=id]

     id icpc icpc2  reg.date condit
 1: 123  D95   F15 19JUN2015      A
 2: 123  F85       15AUG2016      A
 3: 332  A01       16MAR2010      A
 4: 332  A04       20JAN2018      A
 5: 332  K20       20FEB2017      A
 6: 100  B10       01JUN2017      B
 7: 100  A04       11JAN2008      B
 8: 113  T08       18MAR2018      A
 9: 113  P28       19JAN2017      A
10: 113  D95   A01 16JAN2013      A
11: 113  A04       01MAY2009      A
12: 551  B12   B10 03APR2011      C
13: 551  D96       09MAY2015      C

Data (slightly modified from the original, since no "C" condition was found):

df <- structure(list(id = c(123L, 123L, 332L, 332L, 332L, 100L, 100L, 
113L, 113L, 113L, 113L, 551L, 551L), icpc = c("D95", "F85", "A01", 
"A04", "K20", "B10", "A04", "T08", "P28", "D95", "A04", "B12", 
"D96"), icpc2 = c("F15", "", "", "", "", "", "", "", "", "A01", 
"", "B10", ""), reg.date = c("19JUN2015", "15AUG2016", "16MAR2010", 
"20JAN2018", "20FEB2017", "01JUN2017", "11JAN2008", "18MAR2018", 
"19JAN2017", "16JAN2013", "01MAY2009", "03APR2011", "09MAY2015"
)), class = "data.frame", row.names = c(NA, -13L))
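
The prioritised if / else chain can also be sketched in pandas by combining group-level flags with np.select, which takes the first condition that is true. This is an alternative in a different library, not the answer's own method. The helper group_has and the dict code_sets are hypothetical names introduced for illustration, and the values are copied from the modified data above (reg.date omitted):

```python
import numpy as np
import pandas as pd

# Same id / icpc / icpc2 values as the modified data above (reg.date omitted).
df = pd.DataFrame({
    'id':    [123, 123, 332, 332, 332, 100, 100, 113, 113, 113, 113, 551, 551],
    'icpc':  ['D95', 'F85', 'A01', 'A04', 'K20', 'B10', 'A04',
              'T08', 'P28', 'D95', 'A04', 'B12', 'D96'],
    'icpc2': ['F15', '', '', '', '', '', '', '', '', 'A01', '', 'B10', ''],
})
# Insertion order of the dict is the priority order: A, then B, then C.
code_sets = {'A': ['D95', 'A01'], 'B': ['A04', 'T08'], 'C': ['B10']}

def group_has(codes):
    """True on every row whose id has at least one row matching codes."""
    row_hit = df['icpc'].isin(codes) | df['icpc2'].isin(codes)
    return row_hit.groupby(df['id']).transform('any')

conditions = [group_has(codes) for codes in code_sets.values()]
# First true condition wins, mirroring the nested if / else above.
df['condit'] = np.select(conditions, list(code_sets), default='')
print(df)
```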

Tested on a data frame with 40M rows:
system.time(...)

#    user  system elapsed 
#  111.11    1.17  111.97

Using dplyr:

# Error: cannot allocate vector of size 274.7 Mb
# Timing stopped at: 4.19 1.11 5.39

New column based on conditions of multiple columns

Consider the approach below:

select *, 
  case row_number() over(partition by person_id order by date nulls last) * if(date is null, 0, 1)
    when 0 then 'incomplete'
    when 1 then 'start'
    when 2 then 'in progress'
    when 3 then 'completed'
    else 'game over'
  end status
from data

Applied to the sample data in your question, the output is:

[Image: query output for the sample data]

It is not 100% clear from your question, but you may want to count occurrences not just by person_id but also by activity. In that case, just add activity to the partition clause: partition by person_id, activity
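
To make the numbering trick in the query easier to follow, here is a rough pandas sketch of the same logic: number each person's rows by date with nulls last, zero out the number when the date is missing, then map the number to a label. The sample rows are hypothetical, since the question's table is not reproduced here:

```python
import pandas as pd

# Hypothetical sample rows; only the column names come from the query above.
data = pd.DataFrame({
    'person_id': [1, 1, 1, 1, 2, 2],
    'date': pd.to_datetime(['2021-01-03', '2021-01-01', '2021-01-02',
                            '2021-01-04', '2021-02-01', None]),
})

# row_number() over (partition by person_id order by date nulls last)
ordered = data.sort_values(['person_id', 'date'], na_position='last')
row_no = ordered.groupby('person_id').cumcount() + 1
# Multiply by 0 where date is null, mirroring if(date is null, 0, 1).
key = row_no * ordered['date'].notna().astype(int)

labels = {0: 'incomplete', 1: 'start', 2: 'in progress', 3: 'completed'}
# Numbers without a label (4 and up) fall through to the else branch.
# The assignment aligns on the index, so the original row order is kept.
data['status'] = key.map(labels).fillna('game over')
print(data)
```

To count by activity as well, the assumption would be to sort and group by ['person_id', 'activity'] instead of person_id alone.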

Create a new column based on multiple conditions in other columns in R

Using case_when from dplyr, the solution looks like this:

library(dplyr)

dat %>%
  mutate(col3 = case_when(
    col1 != 0 & col2 != 0 ~ 1,
    TRUE ~ 2
  ))

Mutate a new column based on multiple conditions in R

Use case_when:

library(dplyr)

df %>%
  mutate(direction = case_when(city %in% c('bj', 'tj') ~ 'north', 
                               city %in% c('sz', 'nj', 'sh') ~ 'east', 
                               city %in% c('xa', 'lz') ~ 'west', 
                               city %in% c('wh') ~ 'center', 
                               # 'sz' is already matched by 'east' above; the first match wins
                               city %in% c('gz', 'sz') ~ 'south'))

#  id city direction
#1  1   bj     north
#2  2   tj     north
#3  3   gz     south
#4  4   sz      east
#5  5   nj      east
#6  6   xa      west
#7  7   lz      west
#8  8   wh    center
#9  9   sh      east
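
A pandas counterpart of this case_when call can be sketched with np.select, which also takes the first matching rule. The frame is rebuilt from the output above. The rules list is an illustrative structure of my own, and the 'unknown' default is an assumption (case_when would return NA for unmatched rows):

```python
import numpy as np
import pandas as pd

# Frame rebuilt from the id and city columns of the output above.
df = pd.DataFrame({
    'id': range(1, 10),
    'city': ['bj', 'tj', 'gz', 'sz', 'nj', 'xa', 'lz', 'wh', 'sh'],
})

# (cities, label) pairs in priority order, mirroring the case_when branches.
rules = [
    (['bj', 'tj'], 'north'),
    (['sz', 'nj', 'sh'], 'east'),
    (['xa', 'lz'], 'west'),
    (['wh'], 'center'),
    (['gz', 'sz'], 'south'),   # 'sz' never lands here: 'east' matches first
]
conditions = [df['city'].isin(cities) for cities, _ in rules]
choices = [label for _, label in rules]

# 'unknown' is an assumed placeholder for cities that match no rule.
df['direction'] = np.select(conditions, choices, default='unknown')
print(df)
```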