How to Concisely Replace Column Values Given Multiple Conditions

How do I concisely replace column values given multiple conditions?

The arguments you are passing as the condlist and choicelist parameters are ordinary Python lists. Lists can be built concisely in the language using list comprehensions, i.e. the syntax: [expression_using_item for item in sequence]

In other words, your code can be written as:

df["new_col"] = np.select(
    condlist=[df["col"].str.contains(f"cat{i}", na=False, case=False) for i in range(1, 26)],
    choicelist=[f"NEW_cat{i}" for i in range(1, 26)],
    default="DEFAULT_cat",
)

(If the category names are not a numeric sequence and you gave these names only as an example, build a list of the explicit category names and iterate over it in place of the range() call in the snippet above.)
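For instance, a minimal runnable sketch with explicit (hypothetical) category names in place of the numeric sequence:

```python
import numpy as np
import pandas as pd

# Hypothetical data and category names, for illustration only
df = pd.DataFrame({"col": ["red apple", "green PEAR", "banana split", "kiwi"]})
names = ["apple", "pear", "banana"]

df["new_col"] = np.select(
    condlist=[df["col"].str.contains(name, na=False, case=False) for name in names],
    choicelist=[f"NEW_{name}" for name in names],
    default="DEFAULT_cat",
)
```

Rows that match none of the patterns fall through to the default.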

replace column values in pyspark dataframe based multiple conditions

Please consider first converting your pandas df to a Spark one, since you are using PySpark syntax. Then I would advise rewriting your code in a more concise and clearer way, using isin:

from pyspark.sql import functions as F
df = spark.createDataFrame(df)

df_ = df.withColumn(
    "output",
    F.when(
        F.col("zone").isin("North", "West") & ~F.col("dcode").isin("736s", "737s", "702s"),
        "NW",
    ).otherwise(""),
)


>>> df_.show(truncate=False)

+-----+-----+------+
|dcode|zone |output|
+-----+-----+------+
|480s |North|NW    |
|480s |West |NW    |
|499s |East |      |
|499s |North|NW    |
|650s |East |      |
|650s |North|NW    |
|702s |North|      |
|702s |West |      |
|736s |North|      |
|736s |South|      |
|736s |West |      |
|737s |North|      |
|737s |West |      |
+-----+-----+------+
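For comparison, the same isin logic in plain pandas (a sketch with a few hypothetical rows), using numpy.where:

```python
import numpy as np
import pandas as pd

# A few hypothetical rows mirroring the Spark example
df = pd.DataFrame({
    "dcode": ["480s", "702s", "650s"],
    "zone":  ["North", "West", "East"],
})
df["output"] = np.where(
    df["zone"].isin(["North", "West"]) & ~df["dcode"].isin(["736s", "737s", "702s"]),
    "NW",
    "",
)
```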

multiple if else conditions in pandas dataframe and derive multiple columns

You need a chained comparison using upper and lower bounds, applied row-wise:

import numpy as np
import pandas as pd

def flag_df(df):
    if (df['trigger1'] <= df['score'] < df['trigger2']) and (df['height'] < 8):
        return 'Red'
    elif (df['trigger2'] <= df['score'] < df['trigger3']) and (df['height'] < 8):
        return 'Yellow'
    elif (df['trigger3'] <= df['score']) and (df['height'] < 8):
        return 'Orange'
    elif (df['height'] > 8):
        return np.nan

df2['Flag'] = df2.apply(flag_df, axis=1)

  student  score  height  trigger1  trigger2  trigger3    Flag
0       A    100       7        84        99       114  Yellow
1       B     96       4        95       110       125     Red
2       C     80       9        15        30        45     NaN
3       D    105       5        78        93       108  Yellow
4       E    156       3        16        31        46  Orange

Note: You can do this with deeply nested np.where calls, but I prefer to apply a function for multiple if-else branches.
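If you do want the vectorized route, the same branches can also be expressed with numpy.select instead of nesting np.where; a sketch assuming the columns shown above (default=None marks rows that match no condition, e.g. height >= 8):

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame({
    "student":  ["A", "B", "C"],
    "score":    [100, 96, 80],
    "height":   [7, 4, 9],
    "trigger1": [84, 95, 15],
    "trigger2": [99, 110, 30],
    "trigger3": [114, 125, 45],
})
under = df2["height"] < 8
df2["Flag"] = np.select(
    [
        (df2["trigger1"] <= df2["score"]) & (df2["score"] < df2["trigger2"]) & under,
        (df2["trigger2"] <= df2["score"]) & (df2["score"] < df2["trigger3"]) & under,
        (df2["trigger3"] <= df2["score"]) & under,
    ],
    ["Red", "Yellow", "Orange"],
    default=None,  # applies when no condition matches
)
```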

Edit: answering @Cecilia's questions

  1. What if the returned object is not strings but some calculation, for example, for the first condition, we want to return df['height']*2?

Not sure what you tried, but you can return a derived value instead of a string:

def flag_df(df):
    if (df['trigger1'] <= df['score'] < df['trigger2']) and (df['height'] < 8):
        return df['height']*2
    elif (df['trigger2'] <= df['score'] < df['trigger3']) and (df['height'] < 8):
        return df['height']*3
    elif (df['trigger3'] <= df['score']) and (df['height'] < 8):
        return df['height']*4
    elif (df['height'] > 8):
        return np.nan

  2. What if there are NaN values in some columns and I want to use df['xxx'] is None as a condition? The code does not seem to work.

Again, not sure what code you tried, but using pandas isnull would do the trick:

def flag_df(df):
    if pd.isnull(df['height']):
        return df['height']
    elif (df['trigger1'] <= df['score'] < df['trigger2']) and (df['height'] < 8):
        return df['height']*2
    elif (df['trigger2'] <= df['score'] < df['trigger3']) and (df['height'] < 8):
        return df['height']*3
    elif (df['trigger3'] <= df['score']) and (df['height'] < 8):
        return df['height']*4
    elif (df['height'] > 8):
        return np.nan

Assigning a column value based on multiple column conditions in python

using query + map

df['Col3'] = df.ID.map(df.query('Col1 == 34').set_index('ID').Col2)

print(df)

    ID  Col1      Col2      Col3
0    1    50  12:23:01  12:25:11
1    1    34  12:25:11  12:25:11
2    1    65  12:32:25  12:25:11
3    1    98  12:45:08  12:25:11
4    2    23  11:09:10  11:14:26
5    2    12  11:12:43  11:14:26
6    2    56  11:13:12  11:14:26
7    2    34  11:14:26  11:14:26
8    2    77  11:16:02  11:14:26
9    3    64  14:01:11  14:01:13
10   3    34  14:01:13  14:01:13
11   3    48  14:02:32  14:01:13

dealing with duplicates

# keep first instance
df.ID.map(df.query('Col1 == 34') \
.drop_duplicates(subset=['ID']).set_index('ID').Col2)

Or

# keep last instance
df.ID.map(df.query('Col1 == 34') \
.drop_duplicates(subset=['ID'], keep='last').set_index('ID').Col2)
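A self-contained sketch of the map-on-ID technique, with hypothetical data shaped like the example:

```python
import pandas as pd

# Hypothetical data: exactly one Col1 == 34 row per ID
df = pd.DataFrame({
    "ID":   [1, 1, 2, 2],
    "Col1": [50, 34, 23, 34],
    "Col2": ["12:23:01", "12:25:11", "11:09:10", "11:14:26"],
})
# Build an ID -> Col2 lookup from the Col1 == 34 rows, then map it onto every row
df["Col3"] = df.ID.map(df.query("Col1 == 34").set_index("ID").Col2)
```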

How to replace all column values except the first column values based on some condition

You can use numpy.select:

>>> import numpy as np
>>> subset = df[['column1', 'column2', 'column3']]
>>> df.loc[:, subset.columns] = np.select(
...     [subset < 0.4,
...      (0.4 <= subset) & (subset <= 0.6),
...      subset > 0.6],
...     ['Low', 'Medium', 'High'])

   my_index column1 column2 column3
0        54  Medium  Medium    High
1       101     Low  Medium    High
2        75    High    High     Low
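A runnable sketch with made-up values (plain column assignment is used here, which sidesteps the dtype-preservation issues recent pandas raises when .loc writes strings into float columns):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "my_index": [54, 101, 75],
    "column1": [0.5, 0.1, 0.9],
    "column2": [0.45, 0.55, 0.95],
    "column3": [0.7, 0.8, 0.2],
})
subset = df[["column1", "column2", "column3"]]
# Map each numeric band to its label across all three columns at once
df[subset.columns] = np.select(
    [subset < 0.4, (0.4 <= subset) & (subset <= 0.6), subset > 0.6],
    ["Low", "Medium", "High"],
)
```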

Applying a specific function to replace value of column based on criteria from another column in dataframe

The reason it overwrites is that the indexing on the left-hand side defaults to the entire dataframe; if you apply the mask on the left-hand side as well, using loc, then it only affects the rows where the condition is met:

In [272]:

df.loc[df['apply_f'] == True, 'value'] = df[df['apply_f'] == True]['name'].apply(lambda row: f(row))
df
Out[272]:
apply_f name value
0 False SEBASTIEN 9
1 False JOHN 4
2 True JENNY 5

The use of loc above matters: if I used the same boolean-mask indexing on the left-hand side instead (chained indexing), the assignment may or may not take effect, and it raises a SettingWithCopyWarning in recent pandas versions:

In[274]:
df[df['apply_f'] == True]['value'] = df[df['apply_f'] == True]['name'].apply(lambda row: f(row))
df
-c:8: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_index,col_indexer] = value instead
Out[274]:
apply_f name value
0 False SEBASTIEN 9.000000
1 False JOHN 4.000000
2 True JENNY inf

For what you are doing, it would be more concise and readable to use numpy.where:

In [279]:

df['value'] = np.where(df['apply_f']==True, len(df['name']), df['value'])
df
Out[279]:
apply_f name value
0 False SEBASTIEN 9
1 False JOHN 4
2 True JENNY 3

I understand that your example is meant to demonstrate the issue, but where-style conditional replacement is also useful in its own right for certain situations.
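For example, Series.mask performs the same conditional replacement; a sketch using a hypothetical per-row replacement (the length of each row's name, rather than the series length used above):

```python
import pandas as pd

df = pd.DataFrame({
    "apply_f": [False, False, True],
    "name":    ["SEBASTIEN", "JOHN", "JENNY"],
    "value":   [9, 4, 0],
})
# Where apply_f is True, replace value with the length of that row's name
df["value"] = df["value"].mask(df["apply_f"], df["name"].str.len())
```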

Replacing NA's by column specific condition

dplyr

library(dplyr)
df %>%
  mutate(across(-CAR, ~ if_else(CAR == 1 & is.na(.), 0, .)))
# CAR BIKE PLANE BOAT SCOOTER
# 1 1 2 8 1 2
# 2 1 0 0 2 3
# 3 3 4 6 NA 6
# 4 9 NA 7 4 9
# 5 1 9 9 0 0

base R

df[,-1] <- lapply(df[,-1], function(x) ifelse(df$CAR == 1 & is.na(x), 0, x))
df
# CAR BIKE PLANE BOAT SCOOTER
# 1 1 2 8 1 2
# 2 1 0 0 2 3
# 3 3 4 6 NA 6
# 4 9 NA 7 4 9
# 5 1 9 9 0 0
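The same conditional NA replacement translated to pandas, as a sketch with a hypothetical frame (fill NA with 0 only in rows where CAR == 1):

```python
import pandas as pd

df = pd.DataFrame({
    "CAR":   [1, 1, 3],
    "BIKE":  [2.0, None, None],
    "PLANE": [8.0, None, 6.0],
})
cols = df.columns.drop("CAR")
mask = df["CAR"] == 1
# fillna only on the masked rows; other rows keep their NA values
df.loc[mask, cols] = df.loc[mask, cols].fillna(0)
```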

Replace whole string if it contains substring in pandas

You can use str.contains to mask the rows that contain 'ball' and then overwrite with the new value:

In [71]:
df.loc[df['sport'].str.contains('ball'), 'sport'] = 'ball sport'
df

Out[71]:
name sport
0 Bob tennis
1 Jane ball sport
2 Alice ball sport

To make it case-insensitive, pass case=False:

df.loc[df['sport'].str.contains('ball', case=False), 'sport'] = 'ball sport'
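Put together as a runnable sketch reproducing the example data:

```python
import pandas as pd

df = pd.DataFrame({
    "name":  ["Bob", "Jane", "Alice"],
    "sport": ["tennis", "basketball", "netball"],
})
# Overwrite the whole value wherever the substring matches
df.loc[df["sport"].str.contains("ball"), "sport"] = "ball sport"
```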

