Why Apply Sometimes Isn't Faster Than a For-Loop in a Pandas DataFrame

Why is apply sometimes not faster than a for-loop over a Pandas DataFrame?

It is my understanding that .apply is generally not faster than iterating over the axis. I believe that under the hood it is merely a loop over the axis, except that you incur the overhead of a function call each time.

If we look at the source code, we can see that essentially we are iterating over the indicated axis and applying the function, building the individual results as Series into a dictionary, and then finally calling the DataFrame constructor on the dictionary to return a new DataFrame:

if axis == 0:
    series_gen = (self._ixs(i, axis=1)
                  for i in range(len(self.columns)))
    res_index = self.columns
    res_columns = self.index
elif axis == 1:
    res_index = self.index
    res_columns = self.columns
    values = self.values
    series_gen = (Series.from_array(arr, index=res_columns, name=name,
                                    dtype=dtype)
                  for i, (arr, name) in enumerate(zip(values, res_index)))
else:  # pragma : no cover
    raise AssertionError('Axis must be 0 or 1, got %s' % str(axis))

i = None
keys = []
results = {}
if ignore_failures:
    successes = []
    for i, v in enumerate(series_gen):
        try:
            results[i] = func(v)
            keys.append(v.name)
            successes.append(i)
        except Exception:
            pass
    # so will work with MultiIndex
    if len(successes) < len(res_index):
        res_index = res_index.take(successes)
else:
    try:
        for i, v in enumerate(series_gen):
            results[i] = func(v)
            keys.append(v.name)
    except Exception as e:
        if hasattr(e, 'args'):
            # make sure i is defined
            if i is not None:
                k = res_index[i]
                e.args = e.args + ('occurred at index %s' %
                                   pprint_thing(k), )
        raise

if len(results) > 0 and is_sequence(results[0]):
    if not isinstance(results[0], Series):
        index = res_columns
    else:
        index = None

    result = self._constructor(data=results, index=index)
    result.columns = res_index

    if axis == 1:
        result = result.T
    result = result._convert(datetime=True, timedelta=True, copy=False)

else:

    result = Series(results)
    result.index = res_index

return result

Specifically:

for i, v in enumerate(series_gen):
    results[i] = func(v)
    keys.append(v.name)

Where series_gen was constructed based on the requested axis.

To get more performance out of a function, you can follow the advice below.

Essentially, your options are:

  1. Write a C extension
  2. Use numba (a JIT compiler)
  3. Use pandas.eval to squeeze performance out of large DataFrames (a minimal sketch follows below)
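For illustration, here is a minimal pd.eval sketch (the DataFrame and the column names a, b, c are made up for this example, not taken from the answer above). On large frames, numexpr-backed evaluation avoids allocating the intermediate arrays that ordinary vectorized arithmetic creates:

import numpy as np
import pandas as pd

# Illustrative data; the column names are assumptions for this sketch.
df = pd.DataFrame(np.random.randn(1_000_000, 3), columns=['a', 'b', 'c'])

# Plain vectorized arithmetic: each sub-expression allocates a temporary array.
out1 = df['a'] + df['b'] * df['c']

# pd.eval evaluates the whole expression in one pass (via numexpr, if installed).
out2 = pd.eval('df.a + df.b * df.c')

assert np.allclose(out1, out2)

The gains typically only show up on large frames; on small data, the expression-parsing overhead of eval can outweigh any savings.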

Why is pandas apply lambda slower than loop here?

apply uses loops under the hood, so if you need better performance, the best and fastest methods are vectorized alternatives.

A vectorized solution with no loops, just chaining two conditions:

m1 = all_actions['Lower'] <= all_actions['Mid']
m2 = all_actions['Mid'] <= all_actions['Upper']
qualified_actions = m1 & m2

Thanks to Jon Clements for another solution:

all_actions.Mid.between(all_actions.Lower, all_actions.Upper)

Timings:

np.random.seed(2017)
N = 45000
all_actions = pd.DataFrame(np.random.randint(50, size=(N, 3)), columns=['Lower', 'Mid', 'Upper'])

#print (all_actions)


In [85]: %%timeit
    ...: qualified_actions = []
    ...: for row in all_actions.index:
    ...:     if all_actions.ix[row, 'Lower'] <= all_actions.ix[row, 'Mid'] <= all_actions.ix[row, 'Upper']:
    ...:         qualified_actions.append(True)
    ...:     else:
    ...:         qualified_actions.append(False)
    ...:
    ...:
__main__:259: DeprecationWarning:
.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#ix-indexer-is-deprecated
1 loop, best of 3: 579 ms per loop

In [86]: %%timeit
...: (all_actions.apply(lambda row: row['Lower'] <= row['Mid'] <= row['Upper'], axis=1))
...:
1 loop, best of 3: 1.17 s per loop

In [87]: %%timeit
...: ((all_actions['Lower'] <= all_actions['Mid']) & (all_actions['Mid'] <= all_actions['Upper']))
...:
1000 loops, best of 3: 509 µs per loop

In [90]: %%timeit
...: (all_actions.Mid.between(all_actions.Lower, all_actions.Upper))
...:
1000 loops, best of 3: 520 µs per loop

Are for-loops in pandas really bad? When should I care?

TL;DR: No, for loops are not blanket "bad", at least not always. It is probably more accurate to say that some vectorized operations are slower than iterating, rather than saying that iteration is faster than some vectorized operations. Knowing when and why is key to getting the most performance out of your code. In a nutshell, these are the situations where it is worth considering an alternative to vectorized pandas functions:

  1. When your data is small (...depending on what you're doing),
  2. When dealing with object/mixed dtypes
  3. When using the str/regex accessor functions

Let's examine these situations individually.



Iteration vs. Vectorization on Small Data

Pandas follows a "Convention Over Configuration" approach in its API design. This means that the same API has been fitted to cater to a broad range of data and use cases.

When a pandas function is called, the following things (among others) must be handled internally by the function to ensure it works correctly:

  1. Index/axis alignment
  2. Handling mixed datatypes
  3. Handling missing data

Almost every function will have to deal with these to varying extents, and this presents an overhead. The overhead is less for numeric functions (for example, Series.add), while it is more pronounced for string functions (for example, Series.str.replace).
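For example (a small sketch with made-up data), both calls below pay that overhead, but it is far more noticeable for the string method:

import pandas as pd

s_num = pd.Series(range(10))
s_str = pd.Series(['foo', 'bar'] * 5)

s_num.add(1)                     # numeric: a thin wrapper around a NumPy operation
s_str.str.replace('foo', 'baz')  # string: falls back to a Python-level loop over objects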

for loops, on the other hand, are faster than you think. What's even better is that list comprehensions (which create lists through for loops) are faster still, as they are optimized iterative mechanisms for list creation.

List comprehensions follow the pattern

[f(x) for x in seq]

Where seq is a pandas Series or DataFrame column. Or, when operating over multiple columns,

[f(x, y) for x, y in zip(seq1, seq2)]

Where seq1 and seq2 are columns.
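For instance, a minimal concrete illustration of both patterns (the DataFrame and column names are made up for this sketch):

import pandas as pd

# Toy data; the column names A and B are assumptions for this example.
df = pd.DataFrame({'A': [1, 2, 3], 'B': [3, 2, 1]})

# Pattern 1: f(x) over a single column.
squares = [x ** 2 for x in df['A']]

# Pattern 2: f(x, y) over two columns, zipped element-wise.
mask = [x != y for x, y in zip(df['A'], df['B'])]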

Numeric Comparison

Consider a simple boolean indexing operation. The list comprehension method has been timed against Series.ne (!=) and query. Here are the functions:

# Boolean indexing with Numeric value comparison.
df[df.A != df.B] # vectorized !=
df.query('A != B') # query (numexpr)
df[[x != y for x, y in zip(df.A, df.B)]] # list comp

For simplicity, I have used the perfplot package to run all the timeit tests in this post. The timings for the operations above are below:

[Timing plot: vectorized !=, query (numexpr), list comp]

The list comprehension outperforms query for moderately sized N, and even outperforms the vectorized not equals comparison for tiny N. Unfortunately, the list comprehension scales linearly, so it does not offer much performance gain for larger N.

Note

It is worth mentioning that much of the benefit of list comprehensions comes from not having to worry about index alignment, but this means that if your code depends on index alignment, this approach will break. In some cases, vectorised operations over the underlying NumPy arrays can be considered as bringing in the "best of both worlds", allowing for vectorisation without all the unneeded overhead of the pandas functions. This means that you can rewrite the operation above as

df[df.A.values != df.B.values]

Which outperforms both the pandas and list comprehension equivalents.



NumPy vectorization is out of the scope of this post, but it is definitely worth considering, if performance matters.

Value Counts

Taking another example - this time, with another vanilla python construct that is faster than a for loop - collections.Counter. A common requirement is to compute the value counts and return the result as a dictionary. This is done with value_counts, np.unique, and Counter:

# Value Counts comparison.
ser.value_counts(sort=False).to_dict() # value_counts
dict(zip(*np.unique(ser, return_counts=True))) # np.unique
Counter(ser) # Counter

[Timing plot: value_counts, np.unique, Counter]

The results are more pronounced here: Counter wins out over both vectorized methods for a larger range of small N (up to roughly 3500).

Note

More trivia (courtesy of @user2357112): Counter is implemented with a C accelerator, so while it still has to work with Python objects instead of the underlying C datatypes, it is still faster than a for loop. Python power!

Of course, the takeaway here is that performance depends on your data and use case. The point of these examples is to convince you not to rule out these solutions as legitimate options. If these still don't give you the performance you need, there are always Cython and numba. Let's add this test into the mix.

from numba import njit, prange

@njit(parallel=True)
def get_mask(x, y):
    result = [False] * len(x)
    for i in prange(len(x)):
        result[i] = x[i] != y[i]

    return np.array(result)

df[get_mask(df.A.values, df.B.values)]  # numba

[Timing plot: vectorized !=, query (numexpr), list comp, numba]

Numba offers JIT compilation of loopy python code to very powerful vectorized code. Understanding how to make numba work involves a learning curve.



Operations with Mixed/object dtypes

String-based Comparison

Revisiting the filtering example from the first section, what if the columns being compared are strings? Consider the same 3 functions above, but with the input DataFrame cast to string.

# Boolean indexing with string value comparison.
df[df.A != df.B] # vectorized !=
df.query('A != B') # query (numexpr)
df[[x != y for x, y in zip(df.A, df.B)]] # list comp

[Timing plot: vectorized !=, query (numexpr), list comp (string dtype)]

So, what changed? The thing to note here is that string operations are inherently difficult to vectorize. Pandas treats strings as objects, and all operations on objects fall back to a slow, loopy implementation.

Now, because this loopy implementation is surrounded by all the overhead mentioned above, there is a constant magnitude difference between these solutions, even though they scale the same.

When it comes to operations on mutable/complex objects, there is no comparison. List comprehension outperforms all operations involving dicts and lists.

Accessing Dictionary Value(s) by Key

Here are timings for two operations that extract a value from a column of dictionaries: map and the list comprehension. The setup is in the Appendix, under the heading "Code Snippets".

# Dictionary value extraction.
ser.map(operator.itemgetter('value')) # map
pd.Series([x.get('value') for x in ser]) # list comprehension

[Timing plot: map, list comprehension]

Positional List Indexing

Timings for four operations that extract the 0th element from a column of lists (handling exceptions): map, the str accessor, and two list comprehensions:

# List positional indexing.
def get_0th(lst):
    try:
        return lst[0]
    # Handle empty lists and NaNs gracefully.
    except (IndexError, TypeError):
        return np.nan

ser.map(get_0th)                                          # map
ser.str[0]                                                # str accessor
pd.Series([x[0] if len(x) > 0 else np.nan for x in ser])  # list comp
pd.Series([get_0th(x) for x in ser])                      # list comp safe

Note

If the index matters, you will want to reconstruct the Series with pd.Series([...], index=ser.index).

[Timing plot: map, str accessor, list comprehension, list comp safe]

List Flattening

A final example is flattening lists. This is another common problem, and demonstrates just how powerful pure python is here.

# Nested list flattening.
pd.DataFrame(ser.tolist()).stack().reset_index(drop=True) # stack
pd.Series(list(chain.from_iterable(ser.tolist()))) # itertools.chain
pd.Series([y for x in ser for y in x]) # nested list comp

[Timing plot: stack, itertools.chain, nested list comp]

Both itertools.chain.from_iterable and the nested list comprehension are pure python constructs, and scale much better than the stack solution.

These timings are a strong indication of the fact that pandas is not equipped to work with mixed dtypes, and that you should probably refrain from using it to do so. Wherever possible, data should be present as scalar values (ints/floats/strings) in separate columns.

Lastly, the applicability of these solutions depends widely on your data, so the best thing to do is to test these operations on your data before deciding which to go with. Notice how I have not timed apply on these solutions, because it would skew the graph (yes, it's that slow).



Regex Operations, and .str Accessor Methods

Pandas can apply regex operations such as str.contains, str.extract, and str.extractall, as well as other "vectorized" string operations (such as str.split, str.find, str.translate, and so on) on string columns. These functions are slower than list comprehensions, and are meant to be more convenience functions than anything else.

It is usually much faster to pre-compile a regex pattern and iterate over your data with re.compile (also see Is it worth using Python's re.compile?). The list comp equivalent to str.contains looks something like this:

p = re.compile(...)
ser2 = pd.Series([x for x in ser if p.search(x)])

Or,

ser2 = ser[[bool(p.search(x)) for x in ser]]

If you need to handle NaNs, you can do something like

ser[[bool(p.search(x)) if pd.notnull(x) else False for x in ser]]

The list comp equivalent to str.extract (without groups) will look something like:

df['col2'] = [p.search(x).group(0) for x in df['col']]

If you need to handle no-matches and NaNs, you can use a custom function (still faster!):

def matcher(x):
    m = p.search(str(x))
    if m:
        return m.group(0)
    return np.nan

df['col2'] = [matcher(x) for x in df['col']]

The matcher function is very extensible. It can be fitted to return a list for each capture group, as needed. Just query the group or groups attribute of the match object.

For str.extractall, change p.search to p.findall.
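For example, here is a rough findall-based sketch of an extractall-style equivalent (the pattern, the column names, and the find_all helper are illustrative additions, not code from the answer above):

import re
import numpy as np
import pandas as pd

p = re.compile(r'\d{4}')  # illustrative pattern; substitute your own

def find_all(x):
    # findall-based equivalent of str.extractall: every match per row, as a list.
    if pd.isnull(x):
        return []
    return p.findall(str(x))

df = pd.DataFrame({'col': ['A1234 B5678', 'no digits here', np.nan]})
df['col2'] = [find_all(x) for x in df['col']]
# df['col2'] -> [['1234', '5678'], [], []]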

String Extraction

Consider a simple extraction operation. The idea is to extract 4 digits if they are preceded by an uppercase letter.

# Extracting strings.
p = re.compile(r'(?<=[A-Z])(\d{4})')

def matcher(x):
    m = p.search(x)
    if m:
        return m.group(0)
    return np.nan

ser.str.extract(r'(?<=[A-Z])(\d{4})', expand=False)  # str.extract
pd.Series([matcher(x) for x in ser])                 # list comprehension

[Timing plot: str.extract, list comprehension]

More Examples

Full disclosure - I am the author (in part or whole) of these posts listed below.

  • Fast punctuation removal with pandas

  • String concatenation of two pandas columns

  • Remove unwanted parts from strings in a column

  • Replace all but the last occurrence of a character in a dataframe



Conclusion

As shown in the examples above, iteration shines when working with DataFrames with a small number of rows, mixed datatypes, and regular expressions.

The speedup you get depends on your data and your problem, so your mileage may vary. The best thing to do is to carefully run tests and see if the payout is worth the effort.

The "vectorized" functions shine in their simplicity and readability, so if performance is not critical, you should definitely prefer those.

As another side note, certain string operations deal with constraints that favour the use of NumPy. Here are two examples where careful NumPy vectorization outperforms Python:

  • Create new column with incremental values in a faster and efficient way - Answer by Divakar

  • Fast punctuation removal with pandas - Answer by Paul Panzer

Additionally, sometimes just operating on the underlying arrays via .values as opposed to on the Series or DataFrames can offer a healthy enough speedup for most usual scenarios (see the Note in the Numeric Comparison section above). So, for example df[df.A.values != df.B.values] would show instant performance boosts over df[df.A != df.B]. Using .values may not be appropriate in every situation, but it is a useful hack to know.

As mentioned above, it's up to you to decide whether these solutions are worth the trouble of implementing.



Appendix: Code Snippets

import perfplot  
import operator
import pandas as pd
import numpy as np
import re

from collections import Counter
from itertools import chain


# Boolean indexing with Numeric value comparison.
perfplot.show(
    setup=lambda n: pd.DataFrame(np.random.choice(1000, (n, 2)), columns=['A', 'B']),
    kernels=[
        lambda df: df[df.A != df.B],
        lambda df: df.query('A != B'),
        lambda df: df[[x != y for x, y in zip(df.A, df.B)]],
        lambda df: df[get_mask(df.A.values, df.B.values)]
    ],
    labels=['vectorized !=', 'query (numexpr)', 'list comp', 'numba'],
    n_range=[2**k for k in range(0, 15)],
    xlabel='N'
)


# Value Counts comparison.
perfplot.show(
    setup=lambda n: pd.Series(np.random.choice(1000, n)),
    kernels=[
        lambda ser: ser.value_counts(sort=False).to_dict(),
        lambda ser: dict(zip(*np.unique(ser, return_counts=True))),
        lambda ser: Counter(ser),
    ],
    labels=['value_counts', 'np.unique', 'Counter'],
    n_range=[2**k for k in range(0, 15)],
    xlabel='N',
    equality_check=lambda x, y: dict(x) == dict(y)
)


# Boolean indexing with string value comparison.
perfplot.show(
    setup=lambda n: pd.DataFrame(np.random.choice(1000, (n, 2)), columns=['A', 'B'], dtype=str),
    kernels=[
        lambda df: df[df.A != df.B],
        lambda df: df.query('A != B'),
        lambda df: df[[x != y for x, y in zip(df.A, df.B)]],
    ],
    labels=['vectorized !=', 'query (numexpr)', 'list comp'],
    n_range=[2**k for k in range(0, 15)],
    xlabel='N',
    equality_check=None
)


# Dictionary value extraction.
ser1 = pd.Series([{'key': 'abc', 'value': 123}, {'key': 'xyz', 'value': 456}])
perfplot.show(
    setup=lambda n: pd.concat([ser1] * n, ignore_index=True),
    kernels=[
        lambda ser: ser.map(operator.itemgetter('value')),
        lambda ser: pd.Series([x.get('value') for x in ser]),
    ],
    labels=['map', 'list comprehension'],
    n_range=[2**k for k in range(0, 15)],
    xlabel='N',
    equality_check=None
)


# List positional indexing.
ser2 = pd.Series([['a', 'b', 'c'], [1, 2], []])
perfplot.show(
    setup=lambda n: pd.concat([ser2] * n, ignore_index=True),
    kernels=[
        lambda ser: ser.map(get_0th),
        lambda ser: ser.str[0],
        lambda ser: pd.Series([x[0] if len(x) > 0 else np.nan for x in ser]),
        lambda ser: pd.Series([get_0th(x) for x in ser]),
    ],
    labels=['map', 'str accessor', 'list comprehension', 'list comp safe'],
    n_range=[2**k for k in range(0, 15)],
    xlabel='N',
    equality_check=None
)


# Nested list flattening.
ser3 = pd.Series([['a', 'b', 'c'], ['d', 'e'], ['f', 'g']])
perfplot.show(
    setup=lambda n: pd.concat([ser3] * n, ignore_index=True),
    kernels=[
        lambda ser: pd.DataFrame(ser.tolist()).stack().reset_index(drop=True),
        lambda ser: pd.Series(list(chain.from_iterable(ser.tolist()))),
        lambda ser: pd.Series([y for x in ser for y in x]),
    ],
    labels=['stack', 'itertools.chain', 'nested list comp'],
    n_range=[2**k for k in range(0, 15)],
    xlabel='N',
    equality_check=None
)


# Extracting strings.
ser4 = pd.Series(['foo xyz', 'test A1234', 'D3345 xtz'])
perfplot.show(
    setup=lambda n: pd.concat([ser4] * n, ignore_index=True),
    kernels=[
        lambda ser: ser.str.extract(r'(?<=[A-Z])(\d{4})', expand=False),
        lambda ser: pd.Series([matcher(x) for x in ser])
    ],
    labels=['str.extract', 'list comprehension'],
    n_range=[2**k for k in range(0, 15)],
    xlabel='N',
    equality_check=None
)

Why Pandas apply can be faster than vectorized built-ins

In short, your question is whether

s.astype(np.str).str[0].astype(np.int)

fuses the operations together and then iterates over the Series once, or creates a temporary Series for each operation, and how to verify this.

My hypothesis (and I guess yours too) is that it is the latter. You have the right explanation there, but how do you test it?

My suggestion is:

s1 = s.astype(np.str)
s2 = s1.str[0]
s3 = s2.astype(np.int)

See how long each operation takes and how long the three operations take together. Most likely each operation will take about the same amount of time (the complexity of each operation is about the same), which would strongly indicate that our hypothesis is right. If the first two operations take almost no time and the last one takes pretty much all of it, our hypothesis is probably wrong.
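As a rough sketch of that test (the data, sizes, and repetition counts here are assumptions; in an IPython session you would normally reach for %timeit instead):

import numpy as np
import pandas as pd
from timeit import timeit

s = pd.Series(np.random.randint(100, 1000, size=100_000))

s1 = s.astype(str)
s2 = s1.str[0]

print(timeit(lambda: s.astype(str), number=10))                     # step 1: build temporary s1
print(timeit(lambda: s1.str[0], number=10))                         # step 2: build temporary s2
print(timeit(lambda: s2.astype(int), number=10))                    # step 3: build the final result
print(timeit(lambda: s.astype(str).str[0].astype(int), number=10))  # the full chain

If the three individual steps roughly add up to the time of the full chain, each operation is indeed materializing its own temporary Series.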

Implementing for loops as batches

apply can be used to invoke a function on each row and is much faster than a for loop (vectorized functions are even faster). I've done a few things to make life easier and more performant:

  • converted your text file into a dict. This will be more performant and easier to work with than raw text.
  • put all the corruption logic in a function. This will be easier to maintain and allows us to use apply.
  • cleaned up/modified the logic a bit. What I show below is not exactly what you asked for, but should be easy to adapt.

ok, here is the code:

import io
import random
import pandas as pd

# this generates a dict {'word1': ['list', 'of', 'misspellings'], ...}
# where s is a string copied from the file above
df2 = pd.DataFrame(io.StringIO(s), columns=["subs"])
sub_dict = df2.subs.str.strip().str.split("->", expand=True).set_index(1)[0].str.split(",").to_dict()
sub_dict["fake"] = ["fak", "fkae", "fke"]
sub_dict["tell"] = ["tel"]
sub_dict["this"] = ["tis", "htsi"]
sub_dict["data"] = ["dat", "dta"]

def corrupt(sentence, sub_dict, p=0.5):
    # logic is similar but not identical to your code
    for k, v in sub_dict.items():
        if k in sentence and random.random() <= p:
            corrupted_word = random.choice(v)
            sentence = sentence.replace(k, corrupted_word)
    return sentence

Now the apply bit:

df["corrupted"] = df.Sentence.apply(lambda sentence: corrupt(sentence, sub_dict))

# works as expected, see second sentence
Forname Surname Sentence corrupted
0 Bob Smith Hi, this is a perfectly constructred sentence! Hi, this is a perfectly constructred sentence!
1 Alice Smith Can you tell this is fake data? Can you tel htsi is fake dta?
2 John Smith This poster needs help! This poster needs help!
3 Michael Smith Apparently, this poster is sturggling a bit LOL Apparently, this poster is sturggling a bit LOL
4 Daniel Smith More fake data here; ok. More fke dat here; ok.
5 Sarah Smith Will need to think up of better ideas. Will need to think up of better ideas.
6 Matthew Smith Love a good bit of Python, me. Love a good bit of Python, me.
7 Jane Smith Is this a sentence?! (I think so). Is this a sentence?! (I think so).
8 Peter Smith Remarkable - isn't it? Remarkable - isn't it?
9 Chloe Smith Foo Bar... that's all that is left to say. Foo Bar... that's all that is left to say.

Now let's compare performance with a for-loop:

df_test1 = df.sample(n=10000, replace=True)
df_test2 = df.sample(n=10000, replace=True)

def loop(df):
    for idx, string in enumerate(df['Sentence']):
        corrupted_sentence = corrupt(string, sub_dict)
        df['Sentence'][idx] = corrupted_sentence

%timeit df_test1.Sentence.apply(lambda sentence: corrupt(sentence, sub_dict))
# 36.5 ms ± 1.47 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit loop(df_test2)
# 5.19 s ± 98.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

woohoo! It's way faster.

apply vs nested for loops

In pandas, loops are not recommended because they are slow when a vectorized solution exists.

Note: the apply function uses loops under the hood too.

So use pandas.get_dummies with DataFrame.add_prefix, and join to add the result to the original df:

df = df.join(pd.get_dummies(df['Hour'].astype(str)).add_prefix('HE'))
print (df)
Date Hour HE1 HE2 HE3 HE4
0 2005-01-01 1 1 0 0 0
1 2005-01-01 2 0 1 0 0
2 2005-01-01 3 0 0 1 0
3 2005-01-01 4 0 0 0 1

Similar functions can have very different performance:

df = pd.concat([df] * 1000, ignore_index=True)

In [62]: %timeit df.join(pd.get_dummies(df['Hour'].astype(str)).add_prefix('HE'))
3.54 ms ± 277 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

#U9-Forward solution
In [63]: %timeit df.join(df['Hour'].astype(str).str.get_dummies().add_prefix('HE'))
61.6 ms ± 297 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

When should I (not) want to use pandas apply() in my code?

apply, the Convenience Function you Never Needed

We start by addressing the questions in the OP, one by one.

"If apply is so bad, then why is it in the API?"

DataFrame.apply and Series.apply are convenience functions defined on the DataFrame and Series objects respectively. apply accepts any user-defined function that applies a transformation/aggregation on a DataFrame. apply is effectively a silver bullet that does whatever any existing pandas function cannot do.

Some of the things apply can do:

  • Run any user-defined function on a DataFrame or Series
  • Apply a function either row-wise (axis=1) or column-wise (axis=0) on a DataFrame
  • Perform index alignment while applying the function
  • Perform aggregation with user-defined functions (however, we usually prefer agg or transform in these cases)
  • Perform element-wise transformations
  • Broadcast aggregated results to original rows (see the result_type argument).
  • Accept positional/keyword arguments to pass to the user-defined functions.

...Among others. For more information, see Row or Column-wise Function Application in the documentation.

So, with all these features, why is apply bad? It is because apply is slow. Pandas makes no assumptions about the nature of your function, and so iteratively applies your function to each row/column as necessary.
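As a minimal illustration of that per-row cost (a toy example added here, not part of the original answer), you can count how many times apply invokes your function:

import pandas as pd

df = pd.DataFrame({'A': range(5), 'B': range(5)})
calls = {'n': 0}

def add(row):
    calls['n'] += 1              # one Python-level call per row
    return row['A'] + row['B']

df.apply(add, axis=1)
print(calls['n'])                # 5 (older pandas versions may call the function an extra time on the first row)
# The vectorized equivalent, df['A'] + df['B'], does the same work in a single C-level pass.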


