Pandas Conditional Creation of a Series/Dataframe Column

Pandas conditional creation of a series/dataframe column

If you only have two choices to select from:

df['color'] = np.where(df['Set']=='Z', 'green', 'red')

For example,

import pandas as pd
import numpy as np

df = pd.DataFrame({'Type':list('ABBC'), 'Set':list('ZZXY')})
df['color'] = np.where(df['Set']=='Z', 'green', 'red')
print(df)

yields

  Set Type  color
0   Z    A  green
1   Z    B  green
2   X    B    red
3   Y    C    red

If you have more than two conditions then use np.select. For example, if you want color to be

  • yellow when (df['Set'] == 'Z') & (df['Type'] == 'A')
  • otherwise blue when (df['Set'] == 'Z') & (df['Type'] == 'B')
  • otherwise purple when (df['Type'] == 'B')
  • otherwise black,

then use

df = pd.DataFrame({'Type':list('ABBC'), 'Set':list('ZZXY')})
conditions = [
    (df['Set'] == 'Z') & (df['Type'] == 'A'),
    (df['Set'] == 'Z') & (df['Type'] == 'B'),
    (df['Type'] == 'B')]
choices = ['yellow', 'blue', 'purple']
df['color'] = np.select(conditions, choices, default='black')
print(df)

which yields

  Set Type   color
0   Z    A  yellow
1   Z    B    blue
2   X    B  purple
3   Y    C   black

Pandas conditional creation of a dataframe column: based on multiple conditions

Change your code to

conditions = [
    (df['col1'] > df['col2']) & (df['col2'] > df['col3']),
    (df['col1'] > df['col2']),
    (df['col1'] < df['col2']) & (df['col2'] < df['col3']),
    (df['col1'] < df['col2'])]
choices = [2, 1, -2, -1]
df['new'] = np.select(conditions, choices, default=0)
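Because np.select takes the first matching condition for each row, the order of the list matters: the stricter tests have to come before the looser ones. A minimal sketch of the snippet above on made-up data (the values are hypothetical; only the column names come from the answer):

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; only the column names come from the snippet above.
df = pd.DataFrame({
    "col1": [3, 3, 1, 1, 2],
    "col2": [2, 2, 2, 2, 2],
    "col3": [1, 5, 3, 0, 2],
})

# Conditions are checked in order; the first one that is True wins.
conditions = [
    (df["col1"] > df["col2"]) & (df["col2"] > df["col3"]),
    (df["col1"] > df["col2"]),
    (df["col1"] < df["col2"]) & (df["col2"] < df["col3"]),
    (df["col1"] < df["col2"]),
]
choices = [2, 1, -2, -1]

# Rows that match none of the conditions get the default.
df["new"] = np.select(conditions, choices, default=0)
print(df)
```

The first row satisfies both of the first two conditions and still receives 2, since only the earliest match is used.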

Pandas conditional creation of a series/dataframe column for entries containing lists

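You can convert each list to a string and use str.contains. Note that this is a substring test, so searching for 3 also matches 13 or 30:

find = 3
df['color'] = np.where(df['Set'].astype(str).str.contains(str(find)), 'green', 'red')

Alternatively, expand the lists into a temporary dataframe and compare the elements directly, so the condition becomes

pd.DataFrame(df['Set'].to_list(), index=df.index).eq(3).any(axis=1)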

print(df)

  Type        Set  color
1    A  [1, 2, 3]  green
2    B  [1, 2, 3]  green
3    B  [3, 2, 1]  green
4    C  [2, 4, 1]    red
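Another option, not taken from the answer above, is to test list membership element by element with Series.apply. This is only a sketch: the frame is rebuilt from the printed output, and the extra row containing 13 is invented to show where the string-based test differs.

```python
import numpy as np
import pandas as pd

# Data rebuilt from the printed frame above, plus one hypothetical row
# containing 13 to contrast element matching with substring matching.
df = pd.DataFrame(
    {
        "Type": list("ABBCD"),
        "Set": [[1, 2, 3], [1, 2, 3], [3, 2, 1], [2, 4, 1], [13, 5, 6]],
    },
    index=[1, 2, 3, 4, 5],
)

find = 3

# True where the list in "Set" contains `find` as an element.
has_value = df["Set"].apply(lambda values: find in values)
df["color"] = np.where(has_value, "green", "red")

# For comparison: the string-based test also matches the "3" inside 13.
substring_match = df["Set"].astype(str).str.contains(str(find))

print(df)
print(substring_match.tolist())
```

The apply call runs a Python function per row, so it is not vectorized, but it avoids false positives on multi-digit values.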

Pandas conditional creation of a new dataframe column

You can use loc with isin, followed by fillna for the remaining rows:

df.loc[df.Col2.isin(['Z','X']), 'Col3'] = 'J'
df.loc[df.Col2 == 'Y', 'Col3'] = 'K'
df['Col3'] = df.Col3.fillna(df.Col1)
print(df)

  Col1 Col2 Col3
1    A    Z    J
2    B    Z    J
3    B    X    J
4    C    Y    K
5    C    W    C
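The answer does not show the input frame, so the following sketch rebuilds one from the printed result (an assumption) to make the steps reproducible. It also shows a dictionary lookup with Series.map as a possible alternative; that variant is not part of the original answer.

```python
import pandas as pd

# Input frame reconstructed from the printed result above (an assumption;
# the original input is not shown).
df = pd.DataFrame(
    {"Col1": list("ABBCC"), "Col2": list("ZZXYW")},
    index=range(1, 6),
)

# Same steps as above: fill the matching rows, then fall back to Col1.
df.loc[df["Col2"].isin(["Z", "X"]), "Col3"] = "J"
df.loc[df["Col2"] == "Y", "Col3"] = "K"
df["Col3"] = df["Col3"].fillna(df["Col1"])

# Alternative (not from the answer above): look the values up in a dict.
# Keys missing from the dict map to NaN, which fillna replaces with Col1.
mapping = {"Z": "J", "X": "J", "Y": "K"}
via_map = df["Col2"].map(mapping).fillna(df["Col1"])

print(df)
print(via_map.tolist())
```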

Creating new pandas column based on Series conditional

Try this:

df['z'] = np.where((df.x > 1.0) | (df.y < -1.0), 'outlier', 'normal')
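A minimal runnable sketch of the line above, on hypothetical values for x and y. The parentheses around each comparison are required, because | binds more tightly than > and <.

```python
import numpy as np
import pandas as pd

# Hypothetical sample values; only the column names come from the answer.
df = pd.DataFrame({
    "x": [0.5, 1.5, 0.0, 2.0],
    "y": [0.0, 0.0, -2.0, -3.0],
})

# A row is an outlier if either test is True; each comparison is
# parenthesized so that | combines the two boolean Series.
df["z"] = np.where((df["x"] > 1.0) | (df["y"] < -1.0), "outlier", "normal")
print(df)
```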

Creating a new column based on if-elif-else condition

To formalize some of the approaches laid out above:

Create a function that operates on the rows of your dataframe like so:

def f(row):
    if row['A'] == row['B']:
        val = 0
    elif row['A'] > row['B']:
        val = 1
    else:
        val = -1
    return val

Then apply it to your dataframe passing in the axis=1 option:

In [1]: df['C'] = df.apply(f, axis=1)

In [2]: df
Out[2]:
   A  B  C
a  2  2  0
b  3  1  1
c  1  3 -1

Of course, this is not vectorized, so performance may suffer when scaled to a large number of records. Still, I think it is much more readable, especially coming from a SAS background.

Edit

Here is the vectorized version

df['C'] = np.where(
    df['A'] == df['B'], 0, np.where(
    df['A'] > df['B'], 1, -1))
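Nested np.where calls get hard to read once there are more than two branches. One way to keep the if-elif-else shape flat is np.select, sketched here on the frame from the output above; this is an alternative formulation, not the answer's own code.

```python
import numpy as np
import pandas as pd

# Frame taken from the printed output above.
df = pd.DataFrame({"A": [2, 3, 1], "B": [2, 1, 3]}, index=list("abc"))

# One condition per if/elif branch; the default plays the role of else.
conditions = [df["A"] == df["B"], df["A"] > df["B"]]
choices = [0, 1]
df["C"] = np.select(conditions, choices, default=-1)

# The nested np.where form gives the same values.
nested = np.where(df["A"] == df["B"], 0, np.where(df["A"] > df["B"], 1, -1))

print(df)
print((nested == df["C"]).all())
```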

Create new column in df based on conditional with strings

I'm not sure I understand your errors, but I notice that the error also shows Teams() instead of Team().

In any case, in your example row is actually a pandas Series; when you index it with row['Name'], you get the actual string, which has no isin() method. Changing your function definition as follows should work:

def Team(row):
    if row['Name'] in team1:
        return 'Team 1'
    elif row['Name'] in team2:
        return 'Team 2'
    else:
        return 'No Team'

df['Team'] = df.apply(Team, axis=1)
df

Let me also suggest applying the function directly to the pandas Series instead of the whole dataframe, which should be faster as well. Series.apply() works much like DataFrame.apply(), but you don't need to pass the axis=1 argument.

def Team(name):
    if name in team1:
        return 'Team 1'
    elif name in team2:
        return 'Team 2'
    else:
        return 'No Team'

df['Team'] = df.Name.apply(Team)
df
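Since isin() does exist on a whole Series, the same assignment can also be written without apply. The sketch below assumes team1 and team2 are plain lists of names; the rosters and the names in the frame are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical rosters and names; the question's real data is not shown.
team1 = ["Alice", "Bob"]
team2 = ["Carol"]
df = pd.DataFrame({"Name": ["Alice", "Carol", "Dave", "Bob"]})

# Series.isin tests every name at once; np.select picks the first match
# per row and falls back to the default when neither condition holds.
conditions = [df["Name"].isin(team1), df["Name"].isin(team2)]
df["Team"] = np.select(conditions, ["Team 1", "Team 2"], default="No Team")
print(df)
```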

Docs:

  • Pandas DataFrame apply
  • Pandas Series apply

Python: Add a complex conditional column without for loop

Use df.assign() for a complex vectorized expression

Use vectorized pandas operators and methods, where possible; avoid iterating. You can do a complex vectorized expression/assignment like this with:

  • .loc[]
  • df.assign()
  • or alternatively df.query (if you like SQL syntax)

Or, if you insist on doing it by iteration (you shouldn't), you never need an explicit for-loop with .loc[] as you did; you can use:

  • df.apply(your_function_or_lambda, axis=1)
  • or df.iterrows() as a fallback

df.assign() (or df.query) will cause less grief when you have long column names (as you do) that get used repeatedly in a complex expression.

Solution with df.assign()

Rewrite your formula for clarity

When we remove all the unneeded .loc[] calls your formula boils down to:

if HxFPos > 6 or HxTotalBtn > 30:
    Mask = 0
elif HxFPos < 2 and HxRun < 4 and HxTotalBtn < 10:
    Mask = 1
elif HxFPos < 4 and HxRun < 9 and HxTotalBtn < 10:
    Mask = 1
elif HxFPos < 5 and HxRun < 20 and HxTotalBtn < 20:
    Mask = 1
else:
    Mask = 0

Older pandas versions have no native case statement or method (Series.case_when was added in pandas 2.2).
Renaming your variables HxFPos->f, HxRun->r, HxTotalBtn->btn for clarity:

if (f > 6) or (btn > 30):
    Mask = 0
elif (f < 2) and (r < 4) and (btn < 10):
    Mask = 1
elif (f < 4) and (r < 9) and (btn < 10):
    Mask = 1
elif (f < 5) and (r < 20) and (btn < 20):
    Mask = 1
else:
    Mask = 0

So really the whole boolean expression for Mask is gated by (f <= 6) and (btn <= 30). (In fact, your clauses imply that Mask=1 is only possible when (f < 5) and (r < 20) and (btn < 20), if you want to optimize further.)

Mask = ((f <= 6) & (btn <= 30)) & ... you_do_the_rest

Vectorize your expressions

So, here's a vectorized rewrite of your first line. Note that comparisons > and < are vectorized, that the vectorized boolean operators are | and & (instead of 'and', 'or'), and you need to parenthesize your comparisons to get the operator precedence right:

>>> (df['HxFPos']>6) | (df['HxTotalBtn']>30)

0    False
1    False
2    False
3    False
4     True
5     True
6     True
7    False
8     True
dtype: bool

Now that output is a boolean Series (a vector of 9 bools); you can use it directly in df.loc[logical_expression_for_row, 'Mask'].
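As a rough illustration of using such a boolean Series as a row selector, the sketch below invents nine rows whose values reproduce the True/False pattern shown above. The data is hypothetical, and starting Mask at 1 is done only so that the effect of the assignment is visible.

```python
import pandas as pd

# Hypothetical values chosen to reproduce the boolean pattern above;
# only the column names come from the question.
df = pd.DataFrame({
    "HxFPos":     [1, 2, 3, 4, 7, 8, 2, 5, 9],
    "HxTotalBtn": [5, 10, 20, 30, 10, 40, 31, 12, 3],
})

mask = (df["HxFPos"] > 6) | (df["HxTotalBtn"] > 30)

# Start every row at 1 (illustration only), then zero out the rows
# selected by the boolean Series.
df["Mask"] = 1
df.loc[mask, "Mask"] = 0
print(df)
```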

Similarly:

((df['HxFPos']<2) & (df['HxRun']<4)) & (df['HxTotalBtn']<10)
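Putting the pieces together, one possible way to finish the expression with df.assign() is sketched below. The data is hypothetical, the third clause is assumed to test HxRun < 20, and the short names f, r and btn are just local aliases for the three columns.

```python
import pandas as pd

# Hypothetical data; only the column names come from the question.
df = pd.DataFrame({
    "HxFPos":     [1, 3, 4, 7, 1, 4],
    "HxRun":      [2, 8, 15, 1, 2, 25],
    "HxTotalBtn": [5, 9, 19, 5, 35, 10],
})

# Short aliases for the long column names.
f, r, btn = df["HxFPos"], df["HxRun"], df["HxTotalBtn"]

# First branch of the if/elif chain: these rows are always 0.
excluded = (f > 6) | (btn > 30)

# Any of the three Mask = 1 branches (third clause assumed to use HxRun).
allowed = (
    ((f < 2) & (r < 4) & (btn < 10))
    | ((f < 4) & (r < 9) & (btn < 10))
    | ((f < 5) & (r < 20) & (btn < 20))
)

# assign returns a new frame with the extra column.
df = df.assign(Mask=(~excluded & allowed).astype(int))
print(df)
```

Keeping the exclusion and the allowed clauses as separate named Series mirrors the if/elif structure and makes each part easy to inspect on its own.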