Pandas conditional creation of a series/dataframe column
If you only have two choices to select from:
df['color'] = np.where(df['Set']=='Z', 'green', 'red')
For example,
import pandas as pd
import numpy as np
df = pd.DataFrame({'Type':list('ABBC'), 'Set':list('ZZXY')})
df['color'] = np.where(df['Set']=='Z', 'green', 'red')
print(df)
yields
  Type Set  color
0    A   Z  green
1    B   Z  green
2    B   X    red
3    C   Y    red
If you have more than two conditions, use np.select. For example, if you want color to be
- yellow when (df['Set'] == 'Z') & (df['Type'] == 'A'),
- otherwise blue when (df['Set'] == 'Z') & (df['Type'] == 'B'),
- otherwise purple when (df['Type'] == 'B'),
- otherwise black,
then use
df = pd.DataFrame({'Type':list('ABBC'), 'Set':list('ZZXY')})
conditions = [
(df['Set'] == 'Z') & (df['Type'] == 'A'),
(df['Set'] == 'Z') & (df['Type'] == 'B'),
(df['Type'] == 'B')]
choices = ['yellow', 'blue', 'purple']
df['color'] = np.select(conditions, choices, default='black')
print(df)
which yields
  Type Set   color
0    A   Z  yellow
1    B   Z    blue
2    B   X  purple
3    C   Y   black
Pandas conditional creation of a dataframe column: based on multiple conditions
Change your code to
conditions = [
(df['col1'] > df['col2']) & (df['col2'] > df['col3']),
(df['col1'] > df['col2']),
(df['col1'] < df['col2']) & (df['col2'] < df['col3']),
(df['col1'] < df['col2'])]
choices = [2,1,-2,-1]
df['new'] = np.select(conditions, choices, default=0)
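To see how the ordering of the conditions plays out, here is a minimal sketch on made-up data. Only the column names come from the snippet above; the values are an assumption chosen so that each branch is hit once.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; only the column names come from the answer.
df = pd.DataFrame({
    "col1": [3, 3, 1, 1, 2],
    "col2": [2, 2, 2, 2, 2],
    "col3": [1, 5, 3, 0, 2],
})

# np.select checks the conditions in order and takes the first one that is True.
conditions = [
    (df["col1"] > df["col2"]) & (df["col2"] > df["col3"]),
    (df["col1"] > df["col2"]),
    (df["col1"] < df["col2"]) & (df["col2"] < df["col3"]),
    (df["col1"] < df["col2"]),
]
choices = [2, 1, -2, -1]

# Rows where no condition holds (col1 == col2 here) get the default.
df["new"] = np.select(conditions, choices, default=0)
print(df)
```

Because the first matching condition wins, the stricter test has to come before the looser one it implies; swapping the first two conditions would make the value 2 unreachable.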
Pandas conditional creation of a series/dataframe column for entries containing lists
You can also convert to string and use str.contains (this is a substring match, so find=3 would also match a list containing 13 or 30):
find=3
df['color'] = np.where(df["Set"].astype(str).str.contains(str(find)),'green','red')
Or build a dataframe from the lists and use this as the condition:
pd.DataFrame(df["Set"].to_list(), index=df.index).eq(3).any(axis=1)
print(df)
  Type        Set  color
1    A  [1, 2, 3]  green
2    B  [1, 2, 3]  green
3    B  [3, 2, 1]  green
4    C  [2, 4, 1]    red
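Putting the pieces together, the following is a sketch that rebuilds the frame from the printed output and compares the approaches. The apply-based membership test and the `tricky` series are additions for illustration, not part of the original answer.

```python
import numpy as np
import pandas as pd

# Sample data rebuilt from the printed output (index 1..4).
df = pd.DataFrame(
    {"Type": list("ABBC"),
     "Set": [[1, 2, 3], [1, 2, 3], [3, 2, 1], [2, 4, 1]]},
    index=[1, 2, 3, 4],
)
find = 3

# Expand each list into a row of a helper frame, then test for equality.
has_value = pd.DataFrame(df["Set"].to_list(), index=df.index).eq(find).any(axis=1)
df["color"] = np.where(has_value, "green", "red")

# Alternative (not from the answer): plain Python membership test per list.
alt = df["Set"].apply(lambda items: find in items)

# The string approach matches substrings, so 13 counts as containing "3".
tricky = pd.Series([[13, 2]])
false_positive = tricky.astype(str).str.contains(str(find)).iloc[0]

print(df)
print(false_positive)
```

The string version is the shortest to write, but the helper-frame or membership test is the safer choice when the lists can hold multi-digit numbers.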
Pandas conditional creation of a new dataframe column
You can use loc with isin, and finally fillna:
df.loc[df.Col2.isin(['Z','X']), 'Col3'] = 'J'
df.loc[df.Col2 == 'Y', 'Col3'] = 'K'
df['Col3'] = df.Col3.fillna(df.Col1)
print(df)
  Col1 Col2 Col3
1    A    Z    J
2    B    Z    J
3    B    X    J
4    C    Y    K
5    C    W    C
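For a version you can run end to end, here is a sketch with the input frame reconstructed from the printed output. The exact input is an assumption, since the question's data is not shown here.

```python
import pandas as pd

# Input reconstructed from the printed output (index 1..5); assumed, not given.
df = pd.DataFrame(
    {"Col1": list("ABBCC"), "Col2": list("ZZXYW")},
    index=[1, 2, 3, 4, 5],
)

# Assigning through .loc creates Col3 and leaves unmatched rows as NaN.
df.loc[df["Col2"].isin(["Z", "X"]), "Col3"] = "J"
df.loc[df["Col2"] == "Y", "Col3"] = "K"

# Rows that matched neither rule fall back to the value in Col1.
df["Col3"] = df["Col3"].fillna(df["Col1"])
print(df)
```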
Creating new pandas column based on Series conditional
Try this:
df['z'] = np.where((df.x > 1.0) | (df.y < -1.0), 'outlier', 'normal')
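A small sketch of that one-liner on made-up numbers (only the column names x, y and z come from the answer):

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({"x": [0.5, 1.5, 0.2, 2.0], "y": [0.0, 0.3, -1.5, -2.0]})

# Each comparison needs its own parentheses because | binds tighter than > and <.
df["z"] = np.where((df["x"] > 1.0) | (df["y"] < -1.0), "outlier", "normal")
print(df)
```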
Creating a new column based on if-elif-else condition
To formalize some of the approaches laid out above:
Create a function that operates on the rows of your dataframe like so:
def f(row):
    if row['A'] == row['B']:
        val = 0
    elif row['A'] > row['B']:
        val = 1
    else:
        val = -1
    return val
Then apply it to your dataframe, passing in the axis=1 option:
In [1]: df['C'] = df.apply(f, axis=1)
In [2]: df
Out[2]:
   A  B  C
a  2  2  0
b  3  1  1
c  1  3 -1
Of course, this is not vectorized, so performance may suffer when scaled to a large number of records. Still, I think it is much more readable, especially coming from a SAS background.
Edit
Here is the vectorized version
df['C'] = np.where(
df['A'] == df['B'], 0, np.where(
df['A'] > df['B'], 1, -1))
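As a sanity check, this sketch rebuilds the small frame from the output above and confirms that the row-wise function and the nested np.where agree. The np.select variant is an addition of mine, not part of the original answer; it may read more easily once there are more than two or three branches.

```python
import numpy as np
import pandas as pd

# Frame rebuilt from the printed output above.
df = pd.DataFrame({"A": [2, 3, 1], "B": [2, 1, 3]}, index=list("abc"))

def f(row):
    """Row-wise if/elif/else, as in the answer."""
    if row["A"] == row["B"]:
        return 0
    elif row["A"] > row["B"]:
        return 1
    return -1

row_wise = df.apply(f, axis=1)

# Nested np.where: the inner call plays the role of elif/else.
nested = np.where(df["A"] == df["B"], 0, np.where(df["A"] > df["B"], 1, -1))

# Flat alternative: one condition per branch, default for the final else.
selected = np.select([df["A"] == df["B"], df["A"] > df["B"]], [0, 1], default=-1)

df["C"] = nested
print(df)
```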
Create new column in df based on conditional with strings
Not sure I understand your errors, but I see that the error also shows Teams() instead of Team().
In any case, in your example row is actually a pandas Series; when you index it, you get the actual strings, which do not have an isin() method. Changing your function definition should work:
def Team(row):
    if row['Name'] in team1:
        return 'Team 1'
    elif row['Name'] in team2:
        return 'Team 2'
    else:
        return 'No Team'

df['Team'] = df.apply(Team, axis=1)
df
Let me also suggest applying the function directly to the pandas Series instead of the whole dataframe. That should be faster as well. The .apply() method for Series is similar to the one for dataframes, but you won't need to pass the axis=1 argument.
def Team(name):
    if name in team1:
        return 'Team 1'
    elif name in team2:
        return 'Team 2'
    else:
        return 'No Team'

df['Team'] = df.Name.apply(Team)
df
Docs:
- Pandas DataFrame apply
- Pandas Series apply
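If you would rather avoid apply altogether, a vectorized variant is possible with Series.isin and np.select. This is a sketch and not part of the original answer; the contents of team1 and team2 and the names in the frame are hypothetical, since the question's data is not shown here.

```python
import numpy as np
import pandas as pd

# Hypothetical team lists and names; the real ones are in the question.
team1 = ["Ann", "Bob"]
team2 = ["Cid"]
df = pd.DataFrame({"Name": ["Ann", "Cid", "Dee", "Bob"]})

# isin() exists on the whole Series, which is where the original code went wrong:
# inside apply(..., axis=1), row['Name'] is a plain string.
df["Team"] = np.select(
    [df["Name"].isin(team1), df["Name"].isin(team2)],
    ["Team 1", "Team 2"],
    default="No Team",
)
print(df)
```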
Python: Add a complex conditional column without for loop
Use df.assign() for a complex vectorized expression
Use vectorized pandas operators and methods where possible; avoid iterating. You can do a complex vectorized expression/assignment like this with:
- .loc[]
- df.assign()
- or alternatively df.query (if you like SQL syntax)
If you insist on doing it by iteration (you shouldn't), you never need an explicit for-loop with .loc[] as you did; you can use:
- df.apply(your_function_or_lambda, axis=1)
- or df.iterrows() as a fallback
df.assign() (or df.query) will cause less grief when you have long column names (as you do) that get used repeatedly in a complex expression.
Solution with df.assign()
Rewrite your formula for clarity
When we remove all the unneeded .loc[] calls, your formula boils down to:
HxFPos > 6 or HxTotalBtn > 30:
    Mask = 0
HxFPos < 2 and HxRun < 4 and HxTotalBtn < 10:
    Mask = 1
HxFPos < 4 and HxRun < 9 and HxTotalBtn < 10:
    Mask = 1
HxFPos < 5 and HxRun < 20 and HxTotalBtn < 20:
    Mask = 1
else:
    Mask = 0
pandas had no native case-statement/method until Series.case_when was added in version 2.2.
Renaming your variables HxFPos->f, HxRun->r, HxTotalBtn->btn for clarity:
(f > 6) or (btn > 30):
    Mask = 0
(f < 2) and (r < 4) and (btn < 10):
    Mask = 1
(f < 4) and (r < 9) and (btn < 10):
    Mask = 1
(f < 5) and (r < 20) and (btn < 20):
    Mask = 1
else:
    Mask = 0
So really the whole boolean expression for Mask is gated by (f <= 6) and (btn <= 30), the negation of the first clause. (Actually your clauses imply you can only have Mask=1 for (f < 5) and (r < 20) and (btn < 20), if you want to optimize further.)
Mask = ((f <= 6) & (btn <= 30)) & ... you_do_the_rest
Vectorize your expressions
So, here's a vectorized rewrite of your first line. Note that the comparisons > and < are vectorized, that the vectorized boolean operators are | and & (instead of 'or' and 'and'), and that you need to parenthesize your comparisons to get the operator precedence right:
>>> (df['HxFPos']>6) | (df['HxTotalBtn']>30)
0 False
1 False
2 False
3 False
4 True
5 True
6 True
7 False
8 True
dtype: bool
Now that output is a logical expression (a vector of 9 bools); you can use it directly in df.loc[logical_expression_for_row, 'Mask'].
Similarly:
((df['HxFPos']<2) & (df['HxRun']<4)) & (df['HxTotalBtn']<10)
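One way the remaining clauses could be combined with df.assign() is sketched below. This is not the original author's finished code: the sample values are made up, the helper name `mask` is hypothetical, and it assumes the fourth clause tests HxRun < 20, as the renamed version of the formula does.

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
df = pd.DataFrame({
    "HxFPos":     [1, 3, 4, 7, 1, 5],
    "HxRun":      [2, 8, 15, 1, 2, 3],
    "HxTotalBtn": [5, 9, 19, 5, 35, 5],
})

def mask(d):
    """Vectorized version of the if/elif chain; returns 0/1 per row."""
    # Short aliases keep the long column names out of the expression.
    f, r, btn = d["HxFPos"], d["HxRun"], d["HxTotalBtn"]
    blocked = (f > 6) | (btn > 30)            # first clause: Mask = 0
    allowed = (
        ((f < 2) & (r < 4) & (btn < 10))
        | ((f < 4) & (r < 9) & (btn < 10))
        | ((f < 5) & (r < 20) & (btn < 20))   # assumed HxRun, see lead-in
    )
    return (~blocked & allowed).astype(int)

# assign() accepts a callable, which receives the dataframe itself.
df = df.assign(Mask=mask)
print(df)
```

Passing a callable to assign() keeps the whole expression in one place and avoids repeated .loc[] writes.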