Does pandas iterrows have performance issues?
Generally, iterrows should only be used in very, very specific cases. This is the general order of precedence for performance of various operations:
1) vectorization
2) using a custom Cython routine
3) apply
   a) reductions that can be performed in Cython
   b) iteration in Python space
4) itertuples
5) iterrows
6) updating an empty frame (e.g. using loc one row at a time)
Using a custom Cython routine is usually too complicated, so let's skip that for now.
1) Vectorization is ALWAYS, ALWAYS the first and best choice. However, there is a small set of cases (usually involving a recurrence) which cannot be vectorized in obvious ways. Furthermore, on a smallish DataFrame, it may be faster to use other methods.
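To make the distinction concrete, here is a minimal sketch contrasting a vectorized column operation with a recurrence. The frame, the column names, and the decayed_sum helper are made up for illustration; they are not from the original question.

```python
import numpy as np
import pandas as pd

# Hypothetical example data.
df = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0], "b": [10.0, 20.0, 30.0, 40.0]})

# Vectorized: one expression over whole columns, no Python-level row loop.
df["total"] = df["a"] + df["b"]

# A recurrence: each output depends on the previous output, so there is no
# obvious single-expression form. Looping over a plain NumPy array at least
# avoids the per-row overhead of pandas indexing.
def decayed_sum(values, decay=0.5):
    out = np.empty(len(values), dtype=float)
    acc = 0.0
    for i, v in enumerate(values):
        acc = decay * acc + v
        out[i] = acc
    return out

df["decayed"] = decayed_sum(df["a"].to_numpy())
```

The loop here runs over an array extracted once with to_numpy(), which is usually a reasonable fallback when a true vectorized form is not available.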
3) apply can usually be handled by an iterator in Cython space. This is handled internally by pandas, though it depends on what is going on inside the apply expression. For example, df.apply(lambda x: np.sum(x), axis=1) will be executed pretty swiftly, though of course, df.sum(1) is even better. However, something like df.apply(lambda x: x['b'] + 1, axis=1) will be executed in Python space, and consequently is much slower.
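As a rough illustration, the two apply calls can be compared with their built-in equivalents. The small frame below is invented for the example, and the sketch only shows that the results agree; it makes no claim about which internal path a given pandas version takes.

```python
import numpy as np
import pandas as pd

# Hypothetical example data.
df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# Row-wise reduction through apply, and the built-in equivalent.
row_sums_apply = df.apply(np.sum, axis=1)
row_sums_builtin = df.sum(axis=1)

# Row-wise arbitrary Python through apply, and the vectorized equivalent.
b_plus_one_apply = df.apply(lambda row: row["b"] + 1, axis=1)
b_plus_one_vectorized = df["b"] + 1
```

In both pairs the second form gives the same values without calling a Python function once per row.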
4) itertuples does not box the data into a Series. It just returns the data in the form of tuples.
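A small sketch of what itertuples hands back, using an invented two-column frame. The tuple name "Row" is an arbitrary choice for the example.

```python
import pandas as pd

# Hypothetical example data.
df = pd.DataFrame({"a": [1, 2], "b": [0.5, 1.5]})

# Each row comes back as a namedtuple; the index is the first field.
rows = list(df.itertuples(index=True, name="Row"))
first = rows[0]

# Fields are accessed by attribute, not by label lookup on a Series.
a_value = first.a
b_value = first.b
```

Because each row is a plain namedtuple, the values keep their per-column types and no Series object is built per row.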
5) iterrows DOES box the data into a Series. Unless you really need this, use another method.
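The boxing is easy to observe on a made-up frame with mixed integer and float columns: each row arrives as a Series, and the values are upcast to a common dtype.

```python
import pandas as pd

# Hypothetical example data: one integer column, one float column.
df = pd.DataFrame({"a": [1, 2], "b": [0.5, 1.5]})

# iterrows yields (index, Series) pairs.
index, row = next(df.iterrows())

# The row Series has a single dtype, so the integer in "a" becomes a float.
row_dtype = row.dtype
a_value = row["a"]
```

Building a Series per row is part of the cost, and the upcast means the row no longer reflects the original column dtypes.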
6) Updating an empty frame a single row at a time. I have seen this method used WAY too much. It is by far the slowest. It is probably commonplace (and reasonably fast for some Python structures), but a DataFrame does a fair number of checks on indexing, so updating a row at a time will always be very slow. It is much better to create new structures and concat.
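One way to follow that advice, sketched under the assumption that each row can be computed independently; compute_row is a hypothetical stand-in for whatever produces a row. Rows are collected in an ordinary Python list and the frame is built once, or existing pieces are joined with a single concat.

```python
import pandas as pd

def compute_row(i):
    # Hypothetical per-row computation.
    return {"n": i, "square": i * i}

# Collect plain Python records first, then build the frame in one step.
records = [compute_row(i) for i in range(5)]
built_once = pd.DataFrame.from_records(records)

# If the pieces already exist as frames, join them with one concat call
# instead of appending to a growing frame.
chunks = [pd.DataFrame({"n": [i], "square": [i * i]}) for i in range(5)]
combined = pd.concat(chunks, ignore_index=True)
```

Appending to a list is cheap, so the indexing checks are paid once at construction time rather than once per row.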
Performance issues with pandas iterrows
How does this fit your needs?
from io import StringIO
import pandas as pd
s = '''\
DOCNO RSLTN1
MP00059189 72386.0
MP0059189A 65492.0
MP00066187 96497.0
MP00061663 43677.0
MP00063387 42465.0'''
# Recreate dataframe (pd.compat.StringIO no longer exists; use io.StringIO)
df = pd.read_csv(StringIO(s), sep=r'\s+')
# Create mask
# We sort to make sure we keep only highest value
# Remove all non-digit according to: https://stackoverflow.com/questions/44117326/
m = (df.sort_values(by='RSLTN1', ascending=False)['DOCNO']
       .str.extract(r'(\d+)', expand=False)
       .astype(int).duplicated())
# Apply inverted `~` mask
df = df.loc[~m]
Resulting df:
        DOCNO   RSLTN1
0  MP00059189  72386.0
2  MP00066187  96497.0
3  MP00061663  43677.0
4  MP00063387  42465.0
In this example the following row was removed:
MP0059189A 65492.0
Iterrows performance
Assuming your empty cells are NaN values, this gives you the first non-NA value of each row for the group of columns you are interested in:
df[df>0][columns1].bfill(axis=1).iloc[:,0]
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 20.0
6 NaN
7 20.0
8 NaN
Thus, this will give you the abs(a-b) you're searching for:
res = (df[df>0][columns1].bfill(axis=1).iloc[:,0]
       - df[df>0][columns2].bfill(axis=1).iloc[:,0]).abs()
res
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 22977.5
6 NaN
7 NaN
8 NaN
You can either combine it with your initialized discount column:
res.combine_first(df.discount)
or fill the blanks:
res.fillna(0)
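Since the original data is not shown, here is a self-contained sketch of the same technique on an invented frame. The column names, the two column groups, and the first_positive helper are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical example data: two groups of columns with gaps.
df = pd.DataFrame({
    "a1": [np.nan, 5.0, -1.0],
    "a2": [3.0, 7.0, np.nan],
    "b1": [1.0, np.nan, np.nan],
    "b2": [np.nan, 2.0, 4.0],
})
columns1 = ["a1", "a2"]
columns2 = ["b1", "b2"]

def first_positive(frame, cols):
    # Mask non-positive values to NaN, back-fill along each row, and take
    # the first column: that is the first positive, non-NA value per row.
    return frame[frame > 0][cols].bfill(axis=1).iloc[:, 0]

res = (first_positive(df, columns1) - first_positive(df, columns2)).abs()
filled = res.fillna(0)
```

In the third row the first group has no positive value, so the difference is NaN until fillna(0) replaces it.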
Iterating over rows in a dataframe in Pandas: is there a difference between using df.index and df.iterrows() as iterators?
With a for loop over df.index, each lookup needs an additional loc call to get the data:
for index in df.index:
    value = df.loc[index, 'col']
With df.iterrows, the row is handed to you directly:
for index, row in df.iterrows():
    value = row['col']
Since you are already using pandas, neither of them is recommended unless you need a particular function that cannot be vectorized.
However, IMO, I prefer df.index.
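For comparison, a minimal sketch showing that both loops read the same values, alongside the direct column access that makes either loop unnecessary in this simple case. The frame and its labels are invented for the example.

```python
import pandas as pd

# Hypothetical example data with a non-default index.
df = pd.DataFrame({"col": [10, 20, 30]}, index=["x", "y", "z"])

# Loop over the index and look each value up with loc.
via_index = [df.loc[index, "col"] for index in df.index]

# Loop over iterrows and read the value from the row Series.
via_iterrows = [row["col"] for index, row in df.iterrows()]

# No loop at all: take the column directly.
direct = df["col"].tolist()
```

All three lists hold the same values; the loops only make sense when the per-row work cannot be expressed on the column as a whole.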
Speeding up loop over dataframes
I faced a similar problem; using itertuples instead of iterrows reduced the run time significantly.
The reason is the same as why iterrows has issues: it boxes each row into a Series.
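A rough way to check this on your own data is a small timeit comparison. The frame size, column names, and the two helper functions below are assumptions for the sketch, and the timings will vary by machine and pandas version.

```python
import timeit
import pandas as pd

# Hypothetical example data.
df = pd.DataFrame({"a": range(1000), "b": range(1000)})

def total_iterrows(frame):
    # Each row is boxed into a Series before the values are read.
    return sum(row["a"] * row["b"] for _, row in frame.iterrows())

def total_itertuples(frame):
    # Each row is a namedtuple; values are read by attribute.
    return sum(row.a * row.b for row in frame.itertuples(index=False))

t_rows = timeit.timeit(lambda: total_iterrows(df), number=3)
t_tuples = timeit.timeit(lambda: total_itertuples(df), number=3)
print(f"iterrows: {t_rows:.4f}s, itertuples: {t_tuples:.4f}s")
```

Both functions compute the same total, so any difference in the printed times comes from how the rows are produced.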
Hope this helps.