How do I concisely replace column values given multiple conditions?
The contents you are passing as the condlist and choicelist parameters are ordinary Python lists. List contents can be produced concisely using list comprehensions, that is, the syntax: [expression_using_item for item in sequence]
In other words, your code can be written as:
df["new_col"] = np.select(
    condlist=[
        df["col"].str.contains(f"cat{i}", na=False, case=False)
        for i in range(1, 26)
    ],
    choicelist=[f"NEW_cat{i}" for i in range(1, 26)],
    default="DEFAULT_cat",
)
(If the category names do not form a numeric sequence and you used them here just as an example, create a list with all the explicit category names and plug it in place of the range() call in the snippet above.)
replace column values in pyspark dataframe based on multiple conditions
Please consider first converting your pandas df to a Spark one, since you are using pyspark syntax. Then I would advise rewriting your code in a more concise and clearer way, using isin:
from pyspark.sql import functions as F
df = spark.createDataFrame(df)
df_ = df.withColumn("output", F.when(
    (F.col("Zone").isin("North", "West")) & (~F.col("dcode").isin("736s", "737s", "702s")),
    "NW").otherwise(""))
>>> df_.show(truncate=False)
+-----+-----+------+
|dcode|zone |output|
+-----+-----+------+
|480s |North|NW |
|480s |West |NW |
|499s |East | |
|499s |North|NW |
|650s |East | |
|650s |North|NW |
|702s |North| |
|702s |West | |
|736s |North| |
|736s |South| |
|736s |West | |
|737s |North| |
|737s |West | |
+-----+-----+------+
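If you prefer to stay in pandas rather than converting to Spark, the same isin-plus-negation logic can be sketched with numpy.where; the column names and sample rows below are assumed from the output above:

```python
import numpy as np
import pandas as pd

# Hypothetical subset of the rows shown above
df = pd.DataFrame({
    "dcode": ["480s", "480s", "499s", "736s"],
    "zone":  ["North", "West", "East", "North"],
})

# isin for membership, ~ for negation, np.where for the two-way choice
mask = df["zone"].isin(["North", "West"]) & ~df["dcode"].isin(["736s", "737s", "702s"])
df["output"] = np.where(mask, "NW", "")
print(df["output"].tolist())  # ['NW', 'NW', '', '']
```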
multiple if else conditions in pandas dataframe and derive multiple columns
You need a chained comparison using the upper and lower bounds:
import numpy as np
import pandas as pd

def flag_df(df):
    if (df['trigger1'] <= df['score'] < df['trigger2']) and (df['height'] < 8):
        return 'Red'
    elif (df['trigger2'] <= df['score'] < df['trigger3']) and (df['height'] < 8):
        return 'Yellow'
    elif (df['trigger3'] <= df['score']) and (df['height'] < 8):
        return 'Orange'
    elif df['height'] > 8:
        return np.nan

df2['Flag'] = df2.apply(flag_df, axis=1)
student score height trigger1 trigger2 trigger3 Flag
0 A 100 7 84 99 114 Yellow
1 B 96 4 95 110 125 Red
2 C 80 9 15 30 45 NaN
3 D 105 5 78 93 108 Yellow
4 E 156 3 16 31 46 Orange
Note: you can do this with a deeply nested np.where, but I prefer to apply a function for multiple if-else conditions.
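If you do want the vectorized route, the same branches translate to numpy.select; the chained comparisons must be split into two explicit conditions joined with &. This sketch rebuilds the sample frame from the table above; rows matching no condition (height >= 8) fall through to the default:

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame({
    "score":    [100, 96, 80, 105, 156],
    "height":   [7, 4, 9, 5, 3],
    "trigger1": [84, 95, 15, 78, 16],
    "trigger2": [99, 110, 30, 93, 31],
    "trigger3": [114, 125, 45, 108, 46],
})

under = df2["height"] < 8
conditions = [
    (df2["trigger1"] <= df2["score"]) & (df2["score"] < df2["trigger2"]) & under,
    (df2["trigger2"] <= df2["score"]) & (df2["score"] < df2["trigger3"]) & under,
    (df2["trigger3"] <= df2["score"]) & under,
]
# default=None marks the unmatched rows as missing
df2["Flag"] = np.select(conditions, ["Red", "Yellow", "Orange"], default=None)
```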
Edit: answering @Cecilia's questions
- what if the returned object is not a string but a calculation; for example, for the first condition, we want to return df['height']*2
Not sure what you tried, but you can return a derived value instead of a string:
def flag_df(df):
    if (df['trigger1'] <= df['score'] < df['trigger2']) and (df['height'] < 8):
        return df['height']*2
    elif (df['trigger2'] <= df['score'] < df['trigger3']) and (df['height'] < 8):
        return df['height']*3
    elif (df['trigger3'] <= df['score']) and (df['height'] < 8):
        return df['height']*4
    elif df['height'] > 8:
        return np.nan
- what if there are NaN values in some columns and I want to use df['xxx'] is None as a condition? The code does not seem to work
Again, not sure what code you tried, but using pandas isnull would do the trick:
def flag_df(df):
    if pd.isnull(df['height']):
        return df['height']
    elif (df['trigger1'] <= df['score'] < df['trigger2']) and (df['height'] < 8):
        return df['height']*2
    elif (df['trigger2'] <= df['score'] < df['trigger3']) and (df['height'] < 8):
        return df['height']*3
    elif (df['trigger3'] <= df['score']) and (df['height'] < 8):
        return df['height']*4
    elif df['height'] > 8:
        return np.nan
Assigning a column value based on multiple column conditions in python
Using query + map:
df['Col3'] = df.ID.map(df.query('Col1 == 34').set_index('ID').Col2)
print(df)
ID Col1 Col2 Col3
0 1 50 12:23:01 12:25:11
1 1 34 12:25:11 12:25:11
2 1 65 12:32:25 12:25:11
3 1 98 12:45:08 12:25:11
4 2 23 11:09:10 11:14:26
5 2 12 11:12:43 11:14:26
6 2 56 11:13:12 11:14:26
7 2 34 11:14:26 11:14:26
8 2 77 11:16:02 11:14:26
9 3 64 14:01:11 14:01:13
10 3 34 14:01:13 14:01:13
11 3 48 14:02:32 14:01:13
dealing with duplicates
# keep first instance
df.ID.map(df.query('Col1 == 34') \
.drop_duplicates(subset=['ID']).set_index('ID').Col2)
Or
# keep last instance
df.ID.map(df.query('Col1 == 34') \
.drop_duplicates(subset=['ID'], keep='last').set_index('ID').Col2)
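End to end, the lookup built by query + set_index and consumed by map looks like this (sample data trimmed from the table above; one Col1 == 34 row per ID is assumed):

```python
import pandas as pd

df = pd.DataFrame({
    "ID":   [1, 1, 2, 2],
    "Col1": [50, 34, 23, 34],
    "Col2": ["12:23:01", "12:25:11", "11:09:10", "11:14:26"],
})

# ID -> Col2 lookup built from the rows where Col1 == 34
lookup = df.query("Col1 == 34").set_index("ID")["Col2"]
df["Col3"] = df["ID"].map(lookup)
print(df["Col3"].tolist())
# ['12:25:11', '12:25:11', '11:14:26', '11:14:26']
```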
How to replace all column values except the first column values based on some condition
You can use numpy.select:
>>> import numpy as np
>>> subset = df[['column1', 'column2', 'column3']]
>>> df.loc[:, subset.columns] = np.select(
...     [subset < 0.4, (0.4 <= subset) & (subset <= 0.6), subset > 0.6],
...     ['Low', 'Medium', 'High'])
my_index column1 column2 column3
0 54 Medium Medium High
1 101 Low Medium High
2 75 High High Low
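A self-contained sketch of the same bucketing, with made-up sample values; plain column assignment is used on the left-hand side, which sidesteps the dtype upcast that assigning strings into float columns through .loc can warn about in recent pandas:

```python
import numpy as np
import pandas as pd

# Hypothetical data: first column kept, the rest bucketed by threshold
df = pd.DataFrame({
    "my_index": [54, 101],
    "column1": [0.50, 0.20],
    "column2": [0.55, 0.45],
    "column3": [0.70, 0.65],
})

subset = df[["column1", "column2", "column3"]]
df[subset.columns] = np.select(
    [subset < 0.4, (0.4 <= subset) & (subset <= 0.6), subset > 0.6],
    ["Low", "Medium", "High"],
)
print(df["column1"].tolist())  # ['Medium', 'Low']
```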
Applying a specific function to replace value of column based on criteria from another column in dataframe
The reason it overwrites is that the indexing on the left-hand side defaults to the entire dataframe. If you apply the mask to the left-hand side as well, using loc, then only the rows where the condition is met are affected:
In [272]:
df.loc[df['apply_f'] == True, 'value'] = df[df['apply_f'] == True]['name'].apply(lambda row: f(row))
df
Out[272]:
apply_f name value
0 False SEBASTIEN 9
1 False JOHN 4
2 True JENNY 5
The use of loc above matters: if I had used the same boolean-mask semantics on the left-hand side, the assignment may or may not take effect, and it raises a warning (an error in the latest pandas versions):
In[274]:
df[df['apply_f'] == True]['value'] = df[df['apply_f'] == True]['name'].apply(lambda row: f(row))
df
-c:8: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_index,col_indexer] = value instead
Out[274]:
apply_f name value
0 False SEBASTIEN 9.000000
1 False JOHN 4.000000
2 True JENNY inf
For what you are doing, it would be more concise and readable to use numpy where:
In [279]:
df['value'] = np.where(df['apply_f']==True, len(df['name']), df['value'])
df
Out[279]:
apply_f name value
0 False SEBASTIEN 9
1 False JOHN 4
2 True JENNY 3
I understand that your example is meant to demonstrate an issue, but you can also use where in certain situations.
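For reference, here is the masked .loc assignment as a runnable sketch; the data and the stand-in function f are hypothetical (f here is simply len, replacing the value with the length of the name):

```python
import pandas as pd

df = pd.DataFrame({
    "apply_f": [False, False, True],
    "name": ["SEBASTIEN", "JOHN", "JENNY"],
    "value": [9, 4, 0],
})

f = len  # stand-in for the user's per-row function

# Mask on BOTH sides: only the True rows are read and written
df.loc[df["apply_f"], "value"] = df.loc[df["apply_f"], "name"].apply(f)
print(df["value"].tolist())  # [9, 4, 5]
```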
Replacing NA's by column specific condition
With dplyr:
library(dplyr)
df %>%
mutate(across(-CAR, ~ if_else(CAR == 1 & is.na(.), 0, .)))
# CAR BIKE PLANE BOAT SCOOTER
# 1 1 2 8 1 2
# 2 1 0 0 2 3
# 3 3 4 6 NA 6
# 4 9 NA 7 4 9
# 5 1 9 9 0 0
With base R:
df[,-1] <- lapply(df[,-1], function(x) ifelse(df$CAR == 1 & is.na(x), 0, x))
df
# CAR BIKE PLANE BOAT SCOOTER
# 1 1 2 8 1 2
# 2 1 0 0 2 3
# 3 3 4 6 NA 6
# 4 9 NA 7 4 9
# 5 1 9 9 0 0
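For readers following along in Python, the same row-conditional NA fill can be sketched in pandas; the frame below is a hypothetical, trimmed version of the R data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "CAR":  [1, 1, 3, 9, 1],
    "BIKE": [2.0, 0.0, 4.0, np.nan, 9.0],
    "BOAT": [1.0, 2.0, np.nan, 4.0, np.nan],
})

# Fill NaN with 0 only in the rows where CAR == 1; other NaNs are kept
cols = df.columns.drop("CAR")
mask = df["CAR"] == 1
df.loc[mask, cols] = df.loc[mask, cols].fillna(0)
```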
Replace whole string if it contains substring in pandas
You can use str.contains to mask the rows that contain 'ball' and then overwrite them with the new value:
In [71]:
df.loc[df['sport'].str.contains('ball'), 'sport'] = 'ball sport'
df
Out[71]:
name sport
0 Bob tennis
1 Jane ball sport
2 Alice ball sport
To make it case-insensitive, pass case=False:
df.loc[df['sport'].str.contains('ball', case=False), 'sport'] = 'ball sport'
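A complete runnable sketch, with regex=False added so the pattern is treated literally; the sample data is assumed from the output above, plus a mixed-case row to show the effect of case=False:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Bob", "Jane", "Alice"],
    "sport": ["tennis", "Football", "basketball"],
})

# case=False also matches 'Football'; regex=False avoids regex parsing
mask = df["sport"].str.contains("ball", case=False, regex=False)
df.loc[mask, "sport"] = "ball sport"
print(df["sport"].tolist())  # ['tennis', 'ball sport', 'ball sport']
```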