Why Does Pandas Apply Calculate Twice

This behavior has been fixed in pandas 1.1. Please upgrade!

Now, apply and applymap on a DataFrame evaluate the first row/column only once.

Previously, both GroupBy.apply and Series/df.apply evaluated the first group twice. The first group is evaluated twice because apply wants to know whether it can "optimize" the calculation (sometimes this is possible if apply receives a numpy or cythonized function). This behavior was fixed for GroupBy.apply in pandas 0.25 and for df.apply in pandas 1.1.


Old Behavior [pandas <= 1.0.X]

pd.__version__ 
# '1.0.4'

df.apply(mul2)
hello
hello

a
0 2.00
1 4.00
2 1.34
3 2.68

New Behavior [pandas >= 1.1]

pd.__version__
# '1.1.0.dev0+2004.g8d10bfb6f'

df.apply(mul2)
hello

a
0 2.00
1 4.00
2 1.34
3 2.68
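To check which behavior an installed pandas version has, the calls can be counted instead of printed. This is a minimal sketch: the frame and the body of mul2 are not shown above, so both are assumptions reconstructed from the printed output.

```python
import pandas as pd

# Hypothetical frame and function, reconstructed to match the output shown above.
df = pd.DataFrame({"a": [1.0, 2.0, 0.67, 1.34]})

calls = []  # one entry per invocation of mul2


def mul2(col):
    # Record the column label each time apply hands us a column.
    calls.append(col.name)
    return col * 2


result = df.apply(mul2)

print(pd.__version__)
print("mul2 was called", len(calls), "time(s) for", df.shape[1], "column(s)")
```

If the count is higher than the number of columns, the first column was evaluated twice, which is the old behavior.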

Pandas function: DataFrame.apply() runs top row twice

This is by design, as described here and here

The apply function needs to know the shape of the returned data to intelligently figure out how it will be combined. Apply is a shortcut that intelligently applies aggregate, transform or filter. To avoid the duplicate calls, you can try breaking your function apart and calling aggregate, transform or filter directly.
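As an illustration of calling aggregate, transform and filter directly rather than going through apply, here is a small sketch. The frame, the column names key and val, and the threshold are made up for the example and are not from the original question.

```python
import pandas as pd

# Hypothetical data: two groups, "x" and "y".
df = pd.DataFrame({"key": ["x", "x", "y"], "val": [1, 2, 10]})
grouped = df.groupby("key")

# Aggregate: one value per group.
totals = grouped["val"].agg("sum")

# Transform: result has the same length and index as the input.
centered = grouped["val"].transform(lambda s: s - s.mean())

# Filter: keep only the rows of groups that satisfy a condition.
big_groups = grouped.filter(lambda g: g["val"].sum() > 5)

print(totals)
print(centered)
print(big_groups)
```

Each call states the shape of the result up front, so pandas does not have to infer it from a trial evaluation.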

The apply function is executed twice for the first 2 groups in a grouped pandas DataFrame

This is not a bug. This is by design.

The apply function needs to know the shape of the groups. Because the first two groups have different shapes, each of them is evaluated twice: the first time to get the shape and the second time to run the code on it.

In pandas version 1.1.0 this has been fixed, as mentioned in the "What's New" page of the documentation:

apply and applymap on DataFrame evaluates first row/column only once

Previous behavior:

df.apply(func, axis=1)
a 1
b 3
Name: 0, dtype: int64
a 1
b 3
Name: 0, dtype: int64
a 2
b 6
Name: 1, dtype: int64
Out[4]:
a b
0 1 3
1 2 6

New behavior:

df.apply(func, axis=1)
a 1
b 3
Name: 0, Length: 2, dtype: int64
a 2
b 6
Name: 1, Length: 2, dtype: int64
Out[79]:
a b
0 1 3
1 2 6

[2 rows x 2 columns]

Also mentioned here on GitHub.

Pandas's apply: first row elaborated twice

This was a known issue with both GroupBy.apply (pandas < 0.25) and df.apply (pandas < 1.1). The first group is evaluated twice because apply wants to know whether it can "optimize" the calculation (sometimes this is possible if apply receives a numpy or cythonized function).

With pandas 0.25, this behavior was fixed for GroupBy.apply. See here. With pandas 1.1, the same behavior is fixed for df.apply.

After upgrading to pandas 1.1 or later, the first row is evaluated only once:

pd.__version__
# '1.1.0.dev0+2004.g8d10bfb6f'

df[["sum12", "sum23", "sum13"]] = df.apply(lambda row: calc(row), axis=1)
print(df)
Row: [1, 4, 7]
Row: [2, 5, 8]
Row: [3, 6, 9]
col1 col2 col3 sum12 sum23 sum13
0 1 4 7 5 11 8
1 2 5 8 7 13 10
2 3 6 9 9 15 12
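When the per-row function only does arithmetic, a vectorized version avoids the question of how often apply calls it. This is a sketch under the assumption that calc returned the three pairwise sums seen in the output above; the body of calc is not shown in the original.

```python
import pandas as pd

# Same input values as in the printed rows above.
df = pd.DataFrame({"col1": [1, 2, 3], "col2": [4, 5, 6], "col3": [7, 8, 9]})

# Column-wise sums replace the row-wise call to the (assumed) calc function.
df["sum12"] = df["col1"] + df["col2"]
df["sum23"] = df["col2"] + df["col3"]
df["sum13"] = df["col1"] + df["col3"]

print(df)
```

This gives the same table on any pandas version, because no user function is called per row.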

Pandas 1.1.0 apply function is altering the row in place

  • As per the pandas 1.1.0 What's New entry "apply and applymap on DataFrame evaluates first row/column only once", .apply no longer evaluates the first row twice.
  • The issue is that the dataframe is replaced when row is returned.
    • This seems to be a result of BUG: DataFrame.apply with func altering row in-place #35633
      • Also see Backport PR #35633 on branch 1.1.x (BUG: DataFrame.apply with func altering row in-place) #35666
    • Remove row['finalValue'] = finalValue and return finalValue instead of row.
  • Call the function with df['finalValue'] = df.apply(setFinalValue, axis=1).

import pandas as pd

data = {'name': ['FW12611', 'FW12612', 'FW12613'],
        'attrName': ['HW type', 'HW type', 'HW type'],
        'string_value': ['None', 'None', 'None'],
        'dict_value': ['ALU1', 'ALU1', 'ALU1']}

df = pd.DataFrame(data)

def setFinalValue(row):
    print(row)
    rtrName = row['name']
    attrName = row['attrName'].replace(" ", "")
    dict_value = row['dict_value']
    string_value = row['string_value']
    finalValue = 'N/A'

    if attrName in ['Val1', 'Val2', 'Val3']:
        finalValue = dict_value
    elif attrName in ['Val4', 'Val5']:
        finalValue = string_value
    else:
        finalValue = "N/A"

    print('\n')
    # return the computed value only; do not modify row in place
    return finalValue

# apply the function
df['finalValue'] = df.apply(setFinalValue, axis=1)

[out]:
name FW12611
attrName HW type
string_value None
dict_value ALU1
Name: 0, dtype: object

name FW12612
attrName HW type
string_value None
dict_value ALU1
Name: 1, dtype: object

name FW12613
attrName HW type
string_value None
dict_value ALU1
Name: 2, dtype: object

# display(df)
name attrName string_value dict_value finalValue
0 FW12611 HW type None ALU1 N/A
1 FW12612 HW type None ALU1 N/A
2 FW12613 HW type None ALU1 N/A
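If the function really has to hand back a whole row with the extra field, one alternative is to copy the row before changing it, so the object passed in by apply is never altered. This is only a sketch and not the fix recommended above; the shortened frame and the name add_final_value are made up for the example.

```python
import pandas as pd

# Shortened, hypothetical version of the frame used above.
df = pd.DataFrame({
    "name": ["FW12611", "FW12612", "FW12613"],
    "attrName": ["HW type", "HW type", "HW type"],
})


def add_final_value(row):
    # Work on a copy so the Series supplied by apply stays untouched.
    row = row.copy()
    row["finalValue"] = "N/A"
    return row


result = df.apply(add_final_value, axis=1)

print(result)
```

Here result is a new DataFrame with the extra finalValue column, and df itself keeps its original two columns.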

