How to create a new column based on multiple conditions in another column
You can use pandas' Series.shift to create A_(i-1) and numpy.select to check multiple conditions, as below:
import pandas as pd
import numpy as np
df = pd.DataFrame({'A':[5,12,14,22,20,33,11,8,15,11]})
df['A_prv'] = df['A'].shift(1)
conditions = [
    (df.index == 0),
    ((df['A_prv'] - df['A'] >= 0) & (df['A'].le(10))),
    ((df['A_prv'] - df['A'] >= 2) & (df['A'].between(10, 20, inclusive='right'))),
    #                                ^^^ 10 < df['A'] <= 20 ^^^
    ((df['A_prv'] - df['A'] >= 5) & (df['A'].ge(20)))
]
choices = [2, 1, 1, 1]
df['B'] = np.select(conditions, choices, default=0)
print(df)
Output:
    A  A_prv  B
0   5    NaN  2
1  12    5.0  0
2  14   12.0  0
3  22   14.0  0
4  20   22.0  1
5  33   20.0  0
6  11   33.0  1
7   8   11.0  1
8  15    8.0  0
9  11   15.0  1
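Two details of the code above are easy to miss, so here is a minimal sketch with made-up values (the series s and the choice 9 are only for illustration). It shows that comparisons against the NaN introduced by shift(1) evaluate to False, which is why row 0 needs its own rule, and that np.select returns the choice of the first condition that matches when several are true.

```python
import numpy as np
import pandas as pd

s = pd.Series([5, 3, 25, 20])   # made-up values, not the data above
prev = s.shift(1)               # first element becomes NaN

# prev - s is [NaN, 2, -22, 5]; any comparison with NaN is False,
# so the first row can never satisfy a difference-based condition.
diff_ok = (prev - s) >= 0

conditions = [s.index == 0, diff_ok, s >= 20]
choices = [2, 1, 9]

# The last row satisfies both diff_ok and s >= 20;
# np.select keeps the choice of the first matching condition.
result = np.select(conditions, choices, default=0)
print(result)
```

Because of the first-match rule, the order of the conditions matters whenever their ranges overlap.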
How do I create a new column based on multiple conditions from multiple columns?
We can use %in% for comparing multiple elements in a column, and & to check that all of the conditions are TRUE.
library(dplyr)
df %>%
  mutate(get.flyer = c("", "Yes")[(commute %in% c("walk", "bike", "subway", "ferry") &
                                   as.character(kids) == "Yes" &
                                   as.numeric(as.character(distance)) < 10) + 1])
It is better to create the data.frame with stringsAsFactors=FALSE, because before R 4.0.0 the default was TRUE (it has been FALSE since R 4.0.0). If we check str(df), we can find that all the columns are of class factor. Also, if there are missing values, NA can be used instead of "" to avoid converting the class of a numeric column to something else.
If we rewrite the creation of 'df' as
distance <- c(1, 12, 5, 25, 7, 2, NA, 8, 19, 7, NA, 4, 16, 12, 7)
df1 <- data.frame(commute, kids, distance, stringsAsFactors=FALSE)
the above code can be simplified to
df1 %>%
  mutate(get.flyer = c("", "Yes")[(commute %in% c("walk", "bike", "subway", "ferry") &
                                   kids == "Yes" &
                                   distance < 10) + 1])
Some people find ifelse easier to read:
df1 %>%
  mutate(get.flyer = ifelse(commute %in% c("walk", "bike", "subway", "ferry") &
                              kids == "Yes" &
                              distance < 10,
                            "Yes", ""))
This can also be done easily with base R methods:
df1$get.flyer <- with(df1, ifelse(commute %in% c("walk", "bike", "subway", "ferry") &
                                    kids == "Yes" &
                                    distance < 10,
                                  "Yes", ""))
How do I assign values based on multiple conditions for existing columns?
You can do this using np.where. The conditions use the bitwise operators & and | for "and" and "or", with parentheses around each condition because of operator precedence. Where the combined condition is true, 5 is returned, and 0 otherwise:
In [29]:
df['points'] = np.where(
    ((df['gender'] == 'male') & (df['pet1'] == df['pet2']))
    | ((df['gender'] == 'female') & (df['pet1'].isin(['cat', 'dog']))),
    5, 0)
df
Out[29]:
     gender      pet1      pet2  points
0      male       dog       dog       5
1      male       cat       cat       5
2      male       dog       cat       0
3    female       cat  squirrel       5
4    female       dog       dog       5
5    female  squirrel       cat       0
6  squirrel       dog       cat       0
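To make the example reproducible, here is a self-contained sketch that rebuilds the frame from the printed output above and names the two sub-conditions (male_match and female_cat_dog are names chosen here for illustration). The second half uses a made-up numeric series to show what the parentheses guard against: & binds more tightly than the comparison operators, so an unparenthesized expression is parsed as a chained comparison and pandas raises a ValueError.

```python
import numpy as np
import pandas as pd

# Frame rebuilt from the printed output above.
df = pd.DataFrame({
    "gender": ["male", "male", "male", "female", "female", "female", "squirrel"],
    "pet1": ["dog", "cat", "dog", "cat", "dog", "squirrel", "dog"],
    "pet2": ["dog", "cat", "cat", "squirrel", "dog", "cat", "cat"],
})

# Each comparison gets its own parentheses before being combined with & or |.
male_match = (df["gender"] == "male") & (df["pet1"] == df["pet2"])
female_cat_dog = (df["gender"] == "female") & df["pet1"].isin(["cat", "dog"])
df["points"] = np.where(male_match | female_cat_dog, 5, 0)
print(df)

# Made-up series to show the precedence pitfall.
s = pd.Series([1, 3, 7])
try:
    s > 1 & s < 5   # parsed as s > (1 & s) < 5, a chained comparison
    raised = False
except ValueError:
    # The chained comparison needs the truth value of a whole Series,
    # which pandas refuses to guess.
    raised = True
print(raised)
```

Naming the intermediate masks is optional, but it keeps the np.where call short when there are many conditions.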
Create new column based on multiple conditions in multiple columns
With your large (> 40 million rows) data set, the data.table package might be a good choice:
library(data.table)
cond1 <- c("D95", "A01")
setDT(df)[, condit := ifelse(any(icpc %in% cond1 | icpc2 %in% cond1), "yes","no"), by=id]
df
     id icpc icpc2  reg.date condit
 1: 123  D95   F15 19JUN2015    yes
 2: 123  F85       15AUG2016    yes
 3: 332  A01       16MAR2010    yes
 4: 332  A04       20JAN2018    yes
 5: 332  K20       20FEB2017    yes
 6: 100  B10       01JUN2017     no
 7: 100  A04       11JAN2008     no
 8: 113  T08       18MAR2018    yes
 9: 113  P28       19JAN2017    yes
10: 113  D95   A01 16JAN2013    yes
11: 113  A04       01MAY2009    yes
12: 551  B12   A01 03APR2011    yes
13: 551  D95       09MAY2015    yes
Data:
df <- structure(list(id = c(123L, 123L, 332L, 332L, 332L, 100L, 100L,
113L, 113L, 113L, 113L, 551L, 551L), icpc = c("D95", "F85", "A01",
"A04", "K20", "B10", "A04", "T08", "P28", "D95", "A04", "B12",
"D95"), icpc2 = c("F15", "", "", "", "", "", "", "", "", "A01",
"", "A01", ""), reg.date = c("19JUN2015", "15AUG2016", "16MAR2010",
"20JAN2018", "20FEB2017", "01JUN2017", "11JAN2008", "18MAR2018",
"19JAN2017", "16JAN2013", "01MAY2009", "03APR2011", "09MAY2015"
)), class = "data.frame", row.names = c(NA, -13L))
Edit: for multiple conditions:
cond1 <- c("D95", "A01") # A
cond2 <- c("A04", "T08") # B
cond3 <- "B10" # C
setDT(df)[, condit := if (any(icpc %in% cond1 | icpc2 %in% cond1)) "A" else
                      if (any(icpc %in% cond2 | icpc2 %in% cond2)) "B" else
                      if (any(icpc %in% cond3 | icpc2 %in% cond3)) "C" else "", by=id]
     id icpc icpc2  reg.date condit
 1: 123  D95   F15 19JUN2015      A
 2: 123  F85       15AUG2016      A
 3: 332  A01       16MAR2010      A
 4: 332  A04       20JAN2018      A
 5: 332  K20       20FEB2017      A
 6: 100  B10       01JUN2017      B
 7: 100  A04       11JAN2008      B
 8: 113  T08       18MAR2018      A
 9: 113  P28       19JAN2017      A
10: 113  D95   A01 16JAN2013      A
11: 113  A04       01MAY2009      A
12: 551  B12   B10 03APR2011      C
13: 551  D96       09MAY2015      C
Data (slightly modified from the original, since no "C" condition was found):
df <- structure(list(id = c(123L, 123L, 332L, 332L, 332L, 100L, 100L,
113L, 113L, 113L, 113L, 551L, 551L), icpc = c("D95", "F85", "A01",
"A04", "K20", "B10", "A04", "T08", "P28", "D95", "A04", "B12",
"D96"), icpc2 = c("F15", "", "", "", "", "", "", "", "", "A01",
"", "B10", ""), reg.date = c("19JUN2015", "15AUG2016", "16MAR2010",
"20JAN2018", "20FEB2017", "01JUN2017", "11JAN2008", "18MAR2018",
"19JAN2017", "16JAN2013", "01MAY2009", "03APR2011", "09MAY2015"
)), class = "data.frame", row.names = c(NA, -13L))
Tested on a data frame with 40M rows:
system.time(...)
# user system elapsed
# 111.11 1.17 111.97
Using dplyr:
# Error: cannot allocate vector of size 274.7 Mb
# Timing stopped at: 4.19 1.11 5.39
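For readers working in pandas rather than R, the same group-wise logic can be sketched as below. This is an illustration, not part of the answer's benchmark: the helper group_has is a hypothetical name, only a few rows of the modified data are used, and nothing here says how it would perform on 40M rows. Each condition is checked per row, spread to the whole id group with transform("any"), and then np.select picks the first matching label in the order A, B, C.

```python
import numpy as np
import pandas as pd

# A few rows borrowed from the modified data above.
df = pd.DataFrame({
    "id":    [123, 123, 100, 100, 551, 551],
    "icpc":  ["D95", "F85", "B10", "A04", "B12", "D96"],
    "icpc2": ["F15", "", "", "", "B10", ""],
})

def group_has(codes):
    """Hypothetical helper: True for every row whose id group
    contains one of `codes` in either icpc or icpc2."""
    row_hit = df["icpc"].isin(codes) | df["icpc2"].isin(codes)
    return row_hit.groupby(df["id"]).transform("any")

# Conditions are tested in order, like the chained if/else above.
df["condit"] = np.select(
    [group_has(["D95", "A01"]), group_has(["A04", "T08"]), group_has(["B10"])],
    ["A", "B", "C"],
    default="",
)
print(df)
```

As in the data.table version, id 100 gets "B" even though it also contains B10, because the "B" condition is checked before "C".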
New column based on conditions of multiple columns
Consider the approach below:
select *,
  case row_number() over(partition by person_id order by date nulls last) * if(date is null, 0, 1)
    when 0 then 'incomplete'
    when 1 then 'start'
    when 2 then 'in progress'
    when 3 then 'completed'
    else 'game over'
  end status
from data
if applied to sample data in your question - output is
It is not 100% clear from your question, but I think you may want to count occurrences not just by person_id but also by activity. If so, just add activity to the partition by clause, as in partition by person_id, activity
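The idea behind the query (number the rows per person by date, force the number to 0 when the date is null, then map the number to a label) can be sketched in pandas as follows. The rows are hypothetical, since the question's sample data is not shown here, and sort_values plus cumcount is one way to imitate row_number(), not the only one.

```python
import pandas as pd

# Hypothetical rows; the question's sample data is not shown above.
data = pd.DataFrame({
    "person_id": [1, 1, 1, 2, 2],
    "date": pd.to_datetime(
        ["2021-01-03", "2021-01-01", None, "2021-02-01", None]
    ),
})

# row_number() over (partition by person_id order by date nulls last)
ordered = data.sort_values(["person_id", "date"], na_position="last")
row_num = ordered.groupby("person_id").cumcount() + 1

# Same effect as multiplying by if(date is null, 0, 1).
key = row_num.where(data["date"].notna(), 0)

labels = {0: "incomplete", 1: "start", 2: "in progress", 3: "completed"}
# Assignment aligns on the index, so the original row order is kept.
data["status"] = key.map(labels).fillna("game over")
print(data)
```

To partition by activity as well, the hypothetical activity column would be added to both the sort_values and groupby calls.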
Create a new column based on multiple conditions in other columns in R
Using case_when from dplyr, a solution would look like:
dat %>%
  mutate(col3 = case_when(
    col1 != 0 & col2 != 0 ~ 1,
    TRUE ~ 2
  ))
Mutate a new column based on multiple conditions in R
Use case_when:
library(dplyr)
df %>%
  mutate(direction = case_when(city %in% c('bj', 'tj') ~ 'north',
                               city %in% c('sz', 'nj', 'sh') ~ 'east',
                               city %in% c('xa', 'lz') ~ 'west',
                               city %in% c('wh') ~ 'center',
                               # 'sz' is already matched by 'east' above; the first match wins
                               city %in% c('gz', 'sz') ~ 'south'))
#  id city direction
#1  1   bj     north
#2  2   tj     north
#3  3   gz     south
#4  4   sz      east
#5  5   nj      east
#6  6   xa      west
#7  7   lz      west
#8  8   wh    center
#9  9   sh      east
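For comparison, a rough pandas counterpart of this case_when call can be sketched with np.select, which also returns the first matching rule. The rules list and the "unmatched" default are choices made here for illustration (case_when itself returns NA when nothing matches); the ids and cities are taken from the printed output above.

```python
import numpy as np
import pandas as pd

# Rebuilt from the printed output above.
df = pd.DataFrame({
    "id": range(1, 10),
    "city": ["bj", "tj", "gz", "sz", "nj", "xa", "lz", "wh", "sh"],
})

# (cities, label) pairs in the same order as the case_when branches.
rules = [
    (["bj", "tj"], "north"),
    (["sz", "nj", "sh"], "east"),
    (["xa", "lz"], "west"),
    (["wh"], "center"),
    (["gz", "sz"], "south"),   # 'sz' already matched 'east' above
]
conditions = [df["city"].isin(cities) for cities, _ in rules]
choices = [label for _, label in rules]

# "unmatched" is an assumed placeholder for cities covered by no rule.
df["direction"] = np.select(conditions, choices, default="unmatched")
print(df)
```

The row for 'sz' ends up as 'east', matching the R output, because both tools stop at the first rule that applies.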