Why does pandas apply calculate twice
This behavior was fixed in pandas 1.1, so upgrading resolves it.
Now, apply and applymap on a DataFrame evaluate the first row/column only once.
Initially, both GroupBy.apply and Series/df.apply evaluated the first group twice. The first group was evaluated twice because apply wanted to know whether it could "optimize" the calculation (sometimes this is possible if apply receives a numpy or cythonized function). With pandas 0.25, this behavior was fixed for GroupBy.apply. With pandas 1.1, it was also fixed for df.apply.
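The version boundaries above can be turned into a quick check. This is only a sketch: the helper name `apply_evaluates_first_twice` is hypothetical, and it simply encodes the two cut-offs stated in this answer (0.25 for GroupBy.apply, 1.1 for df.apply) rather than inspecting pandas internals.

```python
import pandas as pd


def apply_evaluates_first_twice(version: str = pd.__version__) -> dict:
    """Report where double evaluation is expected, using the cut-offs quoted above."""
    # Only major.minor matter here; dev/rc suffixes live in later components.
    major, minor = (int(part) for part in version.split(".")[:2])
    return {
        "GroupBy.apply": (major, minor) < (0, 25),
        "DataFrame.apply": (major, minor) < (1, 1),
    }


print(apply_evaluates_first_twice("1.0.4"))
print(apply_evaluates_first_twice("1.1.0.dev0+2004.g8d10bfb6f"))
print(apply_evaluates_first_twice())  # the installed pandas
```

If the function passed to apply has side effects (printing, appending to a list, writing to a file), a check like this indicates whether the first row, column, or group may trigger them twice.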
Old Behavior [pandas <= 1.0.X]
pd.__version__
# '1.0.4'
df.apply(mul2)
hello
hello
a
0 2.00
1 4.00
2 1.34
3 2.68
New Behavior [pandas >= 1.1]
pd.__version__
# '1.1.0.dev0+2004.g8d10bfb6f'
df.apply(mul2)
hello
a
0 2.00
1 4.00
2 1.34
3 2.68
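The listings above do not show how df and mul2 were defined. A minimal reconstruction, assuming a single column a and a mul2 that prints before doubling (both are guesses chosen to match the printed output), could look like this; the calls list is an addition for counting invocations.

```python
import pandas as pd

# Assumed input: one column whose doubled values match the output shown above.
df = pd.DataFrame({"a": [1, 2, 0.67, 1.34]})

calls = []  # records every invocation so the count can be inspected afterwards


def mul2(col):
    print("hello")          # visible side effect, one line per call
    calls.append(col.name)
    return col * 2


result = df.apply(mul2)     # axis=0: mul2 receives each column as a Series
print(result)
print("calls:", len(calls))  # per the listings above: 2 on pandas <= 1.0.x, 1 on pandas >= 1.1
```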
Pandas function: DataFrame.apply() runs top row twice
This is by design, as described here and here
The apply function needs to know the shape of the returned data to intelligently figure out how it will be combined. Apply is a shortcut that intelligently applies aggregate, transform or filter. To avoid the duplicate calls, you can try breaking your function apart into explicit aggregate, transform or filter steps.
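The snippet the answer originally pointed to is not preserved here. As a rough illustration of what splitting apply into explicit steps might look like, the sketch below uses a made-up frame with columns key and val; the column names and data are assumptions, not the original poster's.

```python
import pandas as pd

df = pd.DataFrame({"key": ["x", "x", "y"], "val": [1, 2, 10]})
grouped = df.groupby("key")["val"]

# Aggregate: one value per group.
totals = grouped.agg("sum")

# Transform: one value per original row, aligned with df's index.
df["share"] = df["val"] / grouped.transform("sum")

# Filter: keep only the rows of groups that satisfy a predicate.
big = df.groupby("key").filter(lambda g: g["val"].sum() > 5)

print(totals)
print(df)
print(big)
```

Because each step states its output shape up front, pandas does not have to infer it from a trial call on the first group.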
The apply function is executed twice for the first 2 groups in a grouped pandas DataFrame
This is not a bug. This is by design.
The apply function needs to know the shape of the groups. Since the first two groups have different shapes, it will print them twice: the first time to get the shape and the second time to run the code on them.
In pandas version 1.1.0 this has been fixed, as mentioned in the "What's New" page of the [documentation]:
apply and applymap on DataFrame evaluates first row/column only once
Previous behavior:
df.apply(func, axis=1)
a 1
b 3
Name: 0, dtype: int64
a 1
b 3
Name: 0, dtype: int64
a 2
b 6
Name: 1, dtype: int64
Out[4]:
a b
0 1 3
1 2 6
New behavior:
df.apply(func, axis=1)
a 1
b 3
Name: 0, Length: 2, dtype: int64
a 2
b 6
Name: 1, Length: 2, dtype: int64
Out[79]:
a b
0 1 3
1 2 6
[2 rows x 2 columns]
Also mentioned here on GitHub.
Pandas's apply: first row elaborated twice
This is a known issue with both GroupBy.apply (pandas < 0.25) and df.apply (pandas < 1.1). The first group is evaluated twice because apply wants to know whether it can "optimize" the calculation (sometimes this is possible if apply receives a numpy or cythonized function).
With pandas 0.25, this behavior was fixed for GroupBy.apply. See here. With pandas 1.1, the same behavior is fixed for df.apply.
After upgrading to pandas 1.1 or later, you will see the first group evaluated only once:
pd.__version__
# '1.1.0.dev0+2004.g8d10bfb6f'
df[["sum12", "sum23", "sum13"]] = df.apply(lambda row: calc(row), axis=1)
print(df)
Row: [1, 4, 7]
Row: [2, 5, 8]
Row: [3, 6, 9]
col1 col2 col3 sum12 sum23 sum13
0 1 4 7 5 11 8
1 2 5 8 7 13 10
2 3 6 9 9 15 12
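The definitions of df and calc are not included above. One possible reconstruction, inferred from the printed rows and the resulting columns (so the function body is an assumption), is:

```python
import pandas as pd

df = pd.DataFrame({"col1": [1, 2, 3], "col2": [4, 5, 6], "col3": [7, 8, 9]})


def calc(row):
    # The print makes every call visible, as in the output above.
    print("Row:", row.tolist())
    # Returning a labelled Series makes apply build one column per label.
    return pd.Series({
        "sum12": row["col1"] + row["col2"],
        "sum23": row["col2"] + row["col3"],
        "sum13": row["col1"] + row["col3"],
    })


sums = df.apply(calc, axis=1)
df[["sum12", "sum23", "sum13"]] = sums
print(df)
```

On pandas older than 1.1, the first "Row:" line would be expected to appear twice, as described above.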
Pandas 1.1.0 apply function is altering the row in place
- As per the pandas 1.1.0 What's New doc (apply and applymap on DataFrame evaluates first row/column only once), .apply does not evaluate the first row twice.
- The issue is that the dataframe is replaced when row is returned.
  - This seems to be a result of BUG: DataFrame.apply with func altering row in-place #35633
  - Also see Backport PR #35633 on branch 1.1.x (BUG: DataFrame.apply with func altering row in-place) #35666
- Remove row['finalValue'] = finalValue and return finalValue instead of row.
- Call the function with df['finalValue'] = df.apply(setFinalValue, axis=1).
import pandas as pd

data = {'name': ['FW12611', 'FW12612', 'FW12613'],
        'attrName': ['HW type', 'HW type', 'HW type'],
        'string_value': ['None', 'None', 'None'],
        'dict_value': ['ALU1', 'ALU1', 'ALU1']}
df = pd.DataFrame(data)

def setFinalValue(row):
    print(row)
    rtrName = row['name']
    attrName = row['attrName'].replace(" ", "")
    dict_value = row['dict_value']
    string_value = row['string_value']
    finalValue = 'N/A'
    if attrName in ['Val1', 'Val2', 'Val3']:
        finalValue = dict_value
    elif attrName in ['Val4', 'Val5']:
        finalValue = string_value
    else:
        finalValue = "N/A"
    print('\n')
    # return the scalar instead of modifying and returning the row
    return finalValue

# apply the function
df['finalValue'] = df.apply(setFinalValue, axis=1)
[out]:
name FW12611
attrName HW type
string_value None
dict_value ALU1
Name: 0, dtype: object
name FW12612
attrName HW type
string_value None
dict_value ALU1
Name: 1, dtype: object
name FW12613
attrName HW type
string_value None
dict_value ALU1
Name: 2, dtype: object
# display(df)
name attrName string_value dict_value finalValue
0 FW12611 HW type None ALU1 N/A
1 FW12612 HW type None ALU1 N/A
2 FW12613 HW type None ALU1 N/A
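If the function really does need to return a whole modified row, one defensive option is to copy the row before changing it, so the object pandas passed in is never mutated. This is a sketch of that idea only: it is not taken from the linked issue, it has not been checked against pandas 1.1.0 specifically, and the function name add_final_value is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"name": ["FW12611", "FW12612"],
                   "attrName": ["HW type", "HW type"]})


def add_final_value(row):
    row = row.copy()            # work on a private copy of the row
    row["finalValue"] = "N/A"   # the mutation touches only the copy
    return row


out = df.apply(add_final_value, axis=1)
print(out)
print(df)  # the original frame keeps its two columns
```

Returning a scalar, as recommended above, remains the simpler choice when only one new column is needed.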