Performance of Pandas apply vs np.vectorize to create new column from existing columns

I will start by saying that the power of Pandas and NumPy arrays is derived from high-performance vectorised calculations on numeric arrays.[1] The entire point of vectorised calculations is to avoid Python-level loops by moving calculations to highly optimised C code and utilising contiguous memory blocks.[2]

Python-level loops

Now we can look at some timings. Below are all Python-level loops which produce either pd.Series, np.ndarray or list objects containing the same values. For the purposes of assignment to a series within a dataframe, the results are comparable.

# Python 3.6.5, NumPy 1.14.3, Pandas 0.23.0

np.random.seed(0)
N = 10**5
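
# Assumed setup (divide and df are not shown in the original snippet;
# the exact definitions below are illustrative, not the author's code).
import pandas as pd  # numpy is already used as np above

def divide(a, b):
    # scalar-only division that guards against division by zero
    return float(a) / b if b != 0 else 0.0

df = pd.DataFrame({'A': np.random.randint(10, size=N),
                   'B': np.random.randint(10, size=N)})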

%timeit list(map(divide, df['A'], df['B'])) # 43.9 ms
%timeit np.vectorize(divide)(df['A'], df['B']) # 48.1 ms
%timeit [divide(a, b) for a, b in zip(df['A'], df['B'])] # 49.4 ms
%timeit [divide(a, b) for a, b in df[['A', 'B']].itertuples(index=False)] # 112 ms
%timeit df.apply(lambda row: divide(*row), axis=1, raw=True) # 760 ms
%timeit df.apply(lambda row: divide(row['A'], row['B']), axis=1) # 4.83 s
%timeit [divide(row['A'], row['B']) for _, row in df[['A', 'B']].iterrows()] # 11.6 s

Some takeaways:

  1. The tuple-based methods (the first 4) are several times more efficient than the pd.Series-based methods (the last 3).
  2. np.vectorize, the list comprehension + zip, and map, i.e. the top 3, all have roughly the same performance. This is because they use tuples and bypass some of the Pandas overhead incurred by pd.DataFrame.itertuples.
  3. There is a significant speed improvement from using raw=True with pd.DataFrame.apply versus without. This option feeds NumPy arrays to the custom function instead of pd.Series objects.

pd.DataFrame.apply: just another loop

To see exactly the objects Pandas passes around, you can amend your function trivially:

def foo(row):
    print(type(row))
    assert False  # because you only need to see this once

df.apply(lambda row: foo(row), axis=1)

Output: <class 'pandas.core.series.Series'>. Creating, passing and querying a Pandas Series object carries significant overhead relative to NumPy arrays. This shouldn't be a surprise: a Pandas Series includes a decent amount of scaffolding to hold an index, values, attributes, etc.

Do the same exercise again with raw=True and you'll see <class 'numpy.ndarray'>. All this is described in the docs, but seeing it is more convincing.
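For concreteness, the same check with raw=True (foo as defined above):

df.apply(lambda row: foo(row), axis=1, raw=True)
# prints: <class 'numpy.ndarray'>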

np.vectorize: fake vectorisation

The docs for np.vectorize have the following note:

The vectorized function evaluates pyfunc over successive tuples of
the input arrays like the python map function, except it uses the
broadcasting rules of numpy.

The "broadcasting rules" are irrelevant here, since the input arrays have the same dimensions. The parallel to map is instructive, since the map version above has almost identical performance. The source code shows what's happening: np.vectorize converts your input function into a Universal function ("ufunc") via np.frompyfunc. There is some optimisation, e.g. caching, which can lead to some performance improvement.

In short, np.vectorize does what a Python-level loop should do, but pd.DataFrame.apply adds a chunky overhead. There's no JIT-compilation which you see with numba (see below). It's just a convenience.

True vectorisation: what you should use

Why aren't the above differences mentioned anywhere? Because the performance of truly vectorised calculations makes them irrelevant:

%timeit np.where(df['B'] == 0, 0, df['A'] / df['B'])       # 1.17 ms
%timeit (df['A'] / df['B']).replace([np.inf, -np.inf], 0) # 1.96 ms

Yes, that's ~40x faster than the fastest of the loopy solutions above. Either of these is acceptable. In my opinion, the first is succinct, readable and efficient. Only look at other methods, e.g. numba below, if performance is critical and this is part of your bottleneck.

numba.njit: greater efficiency

When loops are considered viable they are usually optimised via numba with underlying NumPy arrays to move as much as possible to C.

Indeed, numba improves performance to microseconds. Without some cumbersome work, it will be difficult to get much more efficient than this.

import numpy as np
from numba import njit

@njit
def divide(a, b):
    res = np.empty(a.shape)
    for i in range(len(a)):
        if b[i] != 0:
            res[i] = a[i] / b[i]
        else:
            res[i] = 0
    return res

%timeit divide(df['A'].values, df['B'].values) # 717 µs

Using @njit(parallel=True) may provide a further boost for larger arrays.
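A sketch of what that parallel variant could look like, reusing the same logic (prange replaces range so numba can split the loop across threads; illustrative only, no timings quoted here):

import numpy as np
from numba import njit, prange

@njit(parallel=True)
def divide_parallel(a, b):
    res = np.empty(a.shape)
    for i in prange(len(a)):  # prange marks the loop as parallelisable
        res[i] = a[i] / b[i] if b[i] != 0 else 0
    return res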


[1] Numeric types include: int, float, datetime, bool, category. They exclude object dtype and can be held in contiguous memory blocks.

[2] There are at least two reasons why NumPy operations are efficient versus Python:

  • Everything in Python is an object. This includes, unlike C, numbers. Python types therefore have an overhead which does not exist with native C types.
  • NumPy methods are usually C-based. In addition, optimised algorithms
    are used where possible.

Correct way to use np.vectorize to apply functions to all columns in Pandas dataframe

The correct way to use np.vectorize is to not use it - unless you are dealing with a function that only accepts scalar values, and you don't care about speed. Whenever I've tested it, explicit Python iteration has been faster.

At least that's the case when working with numpy arrays. With DataFrames, things become more complicated, since extracting Series and recreating frames can skew the timings substantially.
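A quick way to check that claim on plain NumPy arrays (a sketch with illustrative names; timings are machine-dependent, so none are quoted):

import numpy as np

def myfunc(a, b):
    return a + b

x = np.arange(10_000)
f = np.vectorize(myfunc)

# compare in IPython:
#   %timeit f(x, 3)
#   %timeit np.array([myfunc(v, 3) for v in x])
# the claim above is that the explicit comprehension is usually at least as fast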


But let's look at your example in some detail.

Your sample frame:

In [177]: test_df = pd.DataFrame({'a': np.arange(5), 'b': np.arange(5), 'c': np.arange(5)})
     ...:
In [178]: test_df
Out[178]: 
   a  b  c
0  0  0  0
1  1  1  1
2  2  2  2
3  3  3  3
4  4  4  4
In [179]: def myfunc(a, b):
     ...:     return a+b
     ...:

Your apply:

In [180]: test_df.apply(lambda x: x+3, raw=True)
Out[180]: 
   a  b  c
0  3  3  3
1  4  4  4
2  5  5  5
3  6  6  6
4  7  7  7
In [181]: timeit test_df.apply(lambda x: x+3, raw=True)
186 µs ± 524 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# 1.23 ms ± 13.9 µs per loop without raw=True

I get the same thing by simply using the frame's own addition operator - and it is faster. OK, for a more general function that won't work. Your use of apply with the default axis and raw=True implies you have a function that only works with one column at a time.

In [182]: test_df+3
Out[182]: 
   a  b  c
0  3  3  3
1  4  4  4
2  5  5  5
3  6  6  6
4  7  7  7
In [183]: timeit test_df+3
114 µs ± 3.02 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

With raw=True you are passing NumPy arrays to the lambda. The array for the whole frame is:

In [184]: test_df.to_numpy()
Out[184]: 
array([[0, 0, 0],
       [1, 1, 1],
       [2, 2, 2],
       [3, 3, 3],
       [4, 4, 4]])
In [185]: test_df.to_numpy()+3
Out[185]: 
array([[3, 3, 3],
       [4, 4, 4],
       [5, 5, 5],
       [6, 6, 6],
       [7, 7, 7]])
In [186]: timeit test_df.to_numpy()+3
13.1 µs ± 119 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

That's much faster. But to return a frame takes time.

In [188]: timeit pd.DataFrame(test_df.to_numpy()+3, columns=test_df.columns)
91.1 µs ± 769 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Testing vectorize.

In [218]: f=np.vectorize(myfunc)

f applies myfunc to each element of the input array iteratively. It has a clear performance disclaimer.

Even for this small array it is slow compared to direct application of the function to the array:

In [219]: timeit f(test_df.to_numpy(),3)
42.3 µs ± 123 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Passing the frame itself

In [221]: timeit f(test_df,3)
69.8 µs ± 1.98 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [223]: timeit pd.DataFrame(f(test_df,3), columns=test_df.columns)
154 µs ± 2.11 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

And iteratively applying to columns - much slower:

In [226]: [f(test_df.iloc[:,i], 3) for i in range(test_df.shape[1])]
Out[226]: [array([3, 4, 5, 6, 7]), array([3, 4, 5, 6, 7]), array([3, 4, 5, 6, 7])]
In [227]: timeit [f(test_df.iloc[:,i], 3) for i in range(test_df.shape[1])]
477 µs ± 17.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

But a lot of that extra time comes from "extracting" columns:

In [228]: timeit [f(test_df.to_numpy()[:,i], 3) for i in range(test_df.shape[1])]
127 µs ± 357 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Quoting the np.vectorize docs:

Notes
-----
The `vectorize` function is provided primarily for convenience, not for
performance. The implementation is essentially a for loop.

Why Pandas apply can be faster than vectorized built-ins

In short, your question is whether

s.astype(np.str).str[0].astype(np.int)

fuses your operations together and then iterates over the series once, or creates a temporary series for each operation - and how to verify this.

My hypothesis (and I guess yours) is that it is the latter. You have the right explanation there, but how to test it?

My suggestion is:

s1=s.astype(np.str)
s2=s1.str[0]
s3=s2.astype(np.int)

See how long each operation takes and how long the three operations take together. Most likely each operation will take about the same amount of time (the complexity of each operation is about the same), which would strongly indicate that our hypothesis is right. If the first two operations take almost no time and the last takes pretty much all of it, our hypothesis is probably wrong.
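One way to run that comparison in IPython (a sketch; s is whatever Series you started with, and np.str/np.int follow the original question even though newer NumPy prefers plain str/int):

s1 = s.astype(np.str)
s2 = s1.str[0]

%timeit s.astype(np.str)                         # step 1 alone
%timeit s1.str[0]                                # step 2 alone
%timeit s2.astype(np.int)                        # step 3 alone
%timeit s.astype(np.str).str[0].astype(np.int)   # all three chained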

Using np.vectorize to create a column in a data frame

Here is a version that doesn't use np.vectorize:

def scoreFunHigh(val, mean, diff, multip):

    upper = mean * (1 + diff)
    lower = mean * (1 - diff)

    if val > upper:
        return multip * 1
    elif val < lower:
        return multip * (-1)
    else:
        return 0

letterMeanX = df.groupby('letters')['x'].apply(lambda x: np.nanmean(x))
df['letter x score'] = df.apply(
    lambda row: scoreFunHigh(row['x'], letterMeanX[row['letters']], 0.2, 10), axis=1)

Output

     x   y letters  letter x score
0   52  76       A               0
1   90  99       B              10
2   87  43       C              10
3   44  73       D               0
4   49   3       A               0
..  ..  ..     ...             ...
95  16  51       D             -10
96  38   3       A               0
97  43  47       B               0
98  58  39       C               0
99  41  26       D               0
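For comparison, the same scoring rule can also be written in a fully vectorised form with np.select. This is a sketch that is not part of the original answer; it assumes the same df with 'x' and 'letters' columns and the same diff=0.2, multip=10 arguments:

letter_mean_x = df.groupby('letters')['x'].transform('mean')
conditions = [df['x'] > letter_mean_x * 1.2,   # above mean * (1 + diff)
              df['x'] < letter_mean_x * 0.8]   # below mean * (1 - diff)
df['letter x score'] = np.select(conditions, [10, -10], default=0)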

How to speed up Pandas apply function to create a new column in the dataframe?

This looks like an I/O-bound issue, not a CPU-bound one, so multiprocessing would not help. The major bottleneck is your call to Nominatim(). You make an HTTP request to their API for every non-NaN value. This means that if 'India' appears in 5 places, you will make 5 calls for India, which wastefully returns the same geolocation for 5 rows.

Optimising this requires a mixture of caching the most frequent locations locally and caching new calls as they are made:

  1. Create a DataFrame with the N most frequent locations.
  2. Call Nominatim() on the most frequent locations and save the result as a lookup dict/JSON, e.g. location_geo = df.set_index('location').to_dict()['geolocation'] (a rough sketch of these steps follows the list).
  3. Save it, e.g. with json.dump(...).
  4. In your function, check whether the location is in your cached location_geo dictionary and return the value; if not, make a call to the Nominatim API.
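A rough sketch of steps 1-3, assuming columns named 'location' and 'geolocation' and N = 100:

import json

# 1. rows for the N most frequent locations
top = df['location'].value_counts().head(100).index
freq_df = df[df['location'].isin(top)].drop_duplicates('location')

# 2. lookup dict, following the answer's own example
location_geo = freq_df.set_index('location').to_dict()['geolocation']

# 3. save it for later runs (assumes the geolocation values are JSON-serialisable)
with open('our_save_freq_location.json', 'w') as f:
    json.dump(location_geo, f)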

In the end you would have something like this:

import json
from functools import lru_cache

import numpy as np
import pycountry
from geopy.geocoders import Nominatim

geolocator = Nominatim()

# load the most frequent locations cached earlier
with open('our_save_freq_location.json', 'r') as f:
    location_geolocation = json.load(f)

@lru_cache
def do_fuzzy_search(location):
    if type(location) == float and np.isnan(location):
        return np.nan
    else:
        try:
            result = pycountry.countries.search_fuzzy(location)
        except Exception:
            try:
                # look first in our dictionary, if not found call Nominatim
                loc = location_geolocation.get(location, geolocator.geocode(location))
                return loc.raw['display_name'].split(', ')[-1]
            except Exception:
                return np.nan
        else:
            return result[0].name

How to iterate in an efficient way over a Pandas dataframe with numpy.vectorize?

When you define a function to be vectorized, then:

  • each column should be a separate parameter,
  • you should call it passing corresponding columns,
  • "other" parameters (not taken from the source array), should be marked
    as "excluded" parameters.

Another detail is that a vectorized function should not print anything,
but it should return some value - the result of processing parameters from
the current source row.

So you could, e.g., proceed as follows:

  1. Define your function as:

    def myFunct(col1, col2, hg):
        return f'{hg} / {col1} / {col2}'

    Don't use the word vectorize in the name of the function. For now it is an
    "ordinary" function. It will be vectorized in a moment.

  2. Create the vectorized function:

    vfunct = np.vectorize(myFunct, excluded=['hg'])
  3. And finally call it:

    vfunct(df.tweets_id, df.tokenized_text, '#Python')

The result, for my sample data, is:

array(['#Python / 101 / aaa bbb ccc ddd',
       '#Python / 102 / eee fff ggg hhh iii jjj',
       '#Python / 103 / kkk lll mmm nnn ooo ppp'], dtype='<U39')

It is up to you what you do with this result. You may e.g. set it as a new column of your source DataFrame.
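For example (the column name result is just a placeholder):

df['result'] = vfunct(df.tweets_id, df.tokenized_text, '#Python')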

Expanding zscore np.vectorize rather than apply

Rework your transformation to be vectorial (per group):

(df.set_index("date")
.groupby("ids")[["x","y"]]
.transform(lambda d: (d-d.expanding(5).mean())/d.expanding(5).std())
)

Or using a function:

def expanding_zscore(d, window=5):
    return (d-d.expanding(window).mean())/d.expanding(window).std()

(df.set_index("date")
   .groupby("ids")[["x","y"]]
   .transform(expanding_zscore, window=5)
)

output:

                   x         y
date
2000-01-01       NaN       NaN
2000-01-02       NaN       NaN
2000-01-03       NaN       NaN
2000-01-04       NaN       NaN
2000-01-05  0.797018  0.845773
...              ...       ...
2027-05-14 -1.216591 -0.121771
2027-05-15 -1.550736  1.191920
2027-05-16 -1.659481 -0.304257
2027-05-17  0.295209 -0.521772
2027-05-18  1.702968 -0.462038

Why does np.vectorize work here when np.where throws a TypeError?

The full traceback from your where expression is:

Traceback (most recent call last):
  File "/usr/lib/python3.8/enum.py", line 641, in __new__
    return cls._value2member_map_[value]
TypeError: unhashable type: 'Series'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<ipython-input-27-16f5edc71240>", line 3, in <module>
    MyEnum(df['myenum']).name,
  File "/usr/lib/python3.8/enum.py", line 339, in __call__
    return cls.__new__(cls, value)
  File "/usr/lib/python3.8/enum.py", line 648, in __new__
    if member._value_ == value:
  File "/usr/local/lib/python3.8/dist-packages/pandas/core/generic.py", line 1537, in __nonzero__
    raise ValueError(
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

It's produced by giving the whole series to MyEnum:

In [30]: MyEnum(df['myenum'])
Traceback (most recent call last):
  File "/usr/lib/python3.8/enum.py", line 641, in __new__
    return cls._value2member_map_[value]
TypeError: unhashable type: 'Series'
...

The problem isn't with the where at all.

The where works fine if we provide it with a valid list of strings:

In [33]: np.where(
    ...:     df['myenum'] > 0,
    ...:     [vectorize_enum_value(x) for x in df['myenum']],
    ...:     ''
    ...: )
Out[33]: 
array(['', 'First', 'Second', '', '', '', 'Second', 'First', ''],
      dtype='<U6')

That 2nd argument, the list comprehension, is basically doing the same thing as the vectorize call.

where is a function; Python evaluates function arguments before passing them in, so each argument has to work on its own. where is not an iterator like apply, or even vectorize.
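A minimal illustration of that eager evaluation, separate from the enum example:

import numpy as np

a = np.array([1.0, 0.0, 2.0])

# 1 / a is computed for every element *before* np.where selects values,
# so the zero still triggers a RuntimeWarning even though its result is discarded
np.where(a != 0, 1 / a, 0.0)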

Pandas create two new columns based on 2 existing columns

What you want to do is to pivot your table, and then add a column with aggregated data from the original table.

df = pd.DataFrame(dummy_dict_existing)
pivot_df = df.pivot(index='Email', columns='Ticket_Category', values='Quantity_Purchased')
pivot_df['total'] = df.groupby('Email')['Total_Price_Paid'].sum()
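The original dummy_dict_existing isn't shown; a minimal hypothetical version that the snippet above would accept:

dummy_dict_existing = {
    'Email': ['a@x.com', 'a@x.com', 'b@x.com'],
    'Ticket_Category': ['VIP', 'Standard', 'Standard'],
    'Quantity_Purchased': [2, 1, 3],
    'Total_Price_Paid': [200.0, 50.0, 150.0],
}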

