Performance of Pandas apply vs np.vectorize to create new column from existing columns
I will start by saying that the power of Pandas and NumPy arrays is derived from high-performance vectorised calculations on numeric arrays.1 The entire point of vectorised calculations is to avoid Python-level loops by moving calculations to highly optimised C code and utilising contiguous memory blocks.2
Python-level loops
Now we can look at some timings. Below are all Python-level loops which produce either pd.Series, np.ndarray or list objects containing the same values. For the purposes of assignment to a series within a dataframe, the results are comparable.
# Python 3.6.5, NumPy 1.14.3, Pandas 0.23.0
np.random.seed(0)
N = 10**5
%timeit list(map(divide, df['A'], df['B'])) # 43.9 ms
%timeit np.vectorize(divide)(df['A'], df['B']) # 48.1 ms
%timeit [divide(a, b) for a, b in zip(df['A'], df['B'])] # 49.4 ms
%timeit [divide(a, b) for a, b in df[['A', 'B']].itertuples(index=False)] # 112 ms
%timeit df.apply(lambda row: divide(*row), axis=1, raw=True) # 760 ms
%timeit df.apply(lambda row: divide(row['A'], row['B']), axis=1) # 4.83 s
%timeit [divide(row['A'], row['B']) for _, row in df[['A', 'B']].iterrows()] # 11.6 s
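The timings above use a divide function and a frame df whose definitions are not shown. A minimal setup that would make them runnable might look like this; the body of divide and the contents of columns A and B are assumptions, not the original author's code:

```python
import numpy as np
import pandas as pd

def divide(a, b):
    # Assumed scalar helper: return 0 when the denominator is zero.
    return a / b if b != 0 else 0.0

np.random.seed(0)
N = 10**5
# Assumed data: two integer columns, with zeros present in B.
df = pd.DataFrame({'A': np.random.randint(0, 10, N),
                   'B': np.random.randint(0, 10, N)})

# Three of the loop variants from the timings; all produce the same values.
via_map = list(map(divide, df['A'], df['B']))
via_vectorize = np.vectorize(divide)(df['A'], df['B'])
via_zip = [divide(a, b) for a, b in zip(df['A'], df['B'])]
```

Only the container type differs between the variants (list versus np.ndarray), which is why they are comparable for column assignment.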
Some takeaways:
- The tuple-based methods (the first 4) are a factor more efficient than pd.Series-based methods (the last 3).
- np.vectorize, list comprehension + zip and map methods, i.e. the top 3, all have roughly the same performance. This is because they use tuple and bypass some Pandas overhead from pd.DataFrame.itertuples.
- There is a significant speed improvement from using raw=True with pd.DataFrame.apply versus without. This option feeds NumPy arrays to the custom function instead of pd.Series objects.
pd.DataFrame.apply: just another loop
To see exactly the objects Pandas passes around, you can amend your function trivially:
def foo(row):
    print(type(row))
    assert False  # because you only need to see this once

df.apply(lambda row: foo(row), axis=1)
Output: <class 'pandas.core.series.Series'>. Creating, passing and querying a Pandas series object carries significant overheads relative to NumPy arrays. This shouldn't be a surprise: Pandas series include a decent amount of scaffolding to hold an index, values, attributes, etc.
Do the same exercise again with raw=True and you'll see <class 'numpy.ndarray'>. All this is described in the docs, but seeing it is more convincing.
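If you would rather not abort with an assertion, the same check can be sketched by recording the type names that apply hands to the function. The toy frame and the record_* helpers here are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})  # assumed toy frame

seen_default, seen_raw = set(), set()

def record_default(row):
    # Remember what kind of object Pandas passed in for this row.
    seen_default.add(type(row).__name__)
    return 0

def record_raw(row):
    seen_raw.add(type(row).__name__)
    return 0

df.apply(record_default, axis=1)        # rows arrive as pd.Series
df.apply(record_raw, axis=1, raw=True)  # rows arrive as np.ndarray
print(seen_default, seen_raw)
```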
np.vectorize: fake vectorisation
The docs for np.vectorize have the following note:
The vectorized function evaluates pyfunc over successive tuples of the input arrays like the python map function, except it uses the broadcasting rules of numpy.
The "broadcasting rules" are irrelevant here, since the input arrays have the same dimensions. The parallel to map is instructive, since the map version above has almost identical performance. The source code shows what's happening: np.vectorize converts your input function into a Universal function ("ufunc") via np.frompyfunc. There is some optimisation, e.g. caching, which can lead to some performance improvement.
In short, np.vectorize does what a Python-level loop should do, but pd.DataFrame.apply adds a chunky overhead. There's no JIT-compilation which you see with numba (see below). It's just a convenience.
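One way to convince yourself that np.vectorize is still a Python-level loop is to count how often the wrapped function runs. The sketch below uses a made-up scalar divide and a counter; otypes is given explicitly so that NumPy does not make an extra trial call to infer the output type:

```python
import numpy as np

counter = {'calls': 0}

def divide(a, b):
    # Assumed scalar helper; counts every Python-level invocation.
    counter['calls'] += 1
    return a / b if b != 0 else 0.0

a = np.arange(1000)
b = np.arange(1000) % 7  # contains zeros, so the guard is exercised

vdivide = np.vectorize(divide, otypes=[float])
out = vdivide(a, b)

print(counter['calls'])  # one Python call per element
```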
True vectorisation: what you should use
Why aren't the above differences mentioned anywhere? Because the performance of truly vectorised calculations makes them irrelevant:
%timeit np.where(df['B'] == 0, 0, df['A'] / df['B']) # 1.17 ms
%timeit (df['A'] / df['B']).replace([np.inf, -np.inf], 0) # 1.96 ms
Yes, that's ~40x faster than the fastest of the above loopy solutions. Either of these is acceptable. In my opinion, the first is succinct, readable and efficient. Only look at other methods, e.g. numba below, if performance is critical and this is part of your bottleneck.
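As a small sanity check on a made-up frame, the two vectorised expressions can be compared with the loop result. One thing this sketch suggests watching for: the replace variant only handles infinities, so a 0/0 row stays NaN there, whereas np.where maps it to 0.

```python
import numpy as np
import pandas as pd

# Assumed toy data: row 1 is x/0, row 3 is 0/0.
df = pd.DataFrame({'A': [1, 2, 3, 0], 'B': [2, 0, 4, 0]})

looped = [a / b if b != 0 else 0.0 for a, b in zip(df['A'], df['B'])]

via_where = np.where(df['B'] == 0, 0, df['A'] / df['B'])
via_replace = (df['A'] / df['B']).replace([np.inf, -np.inf], 0)

print(via_where)             # matches the loop on every row
print(via_replace.tolist())  # matches except for the 0/0 row
```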
numba.njit: greater efficiency
When loops are considered viable they are usually optimised via numba with underlying NumPy arrays to move as much as possible to C.
Indeed, numba improves performance to microseconds. Without some cumbersome work, it will be difficult to get much more efficient than this.
import numpy as np
from numba import njit

@njit
def divide(a, b):
    res = np.empty(a.shape)
    for i in range(len(a)):
        if b[i] != 0:
            res[i] = a[i] / b[i]
        else:
            res[i] = 0
    return res

%timeit divide(df['A'].values, df['B'].values)  # 717 µs
Using @njit(parallel=True) may provide a further boost for larger arrays.
1 Numeric types include: int, float, datetime, bool, category. They exclude object dtype and can be held in contiguous memory blocks.
2 There are at least 2 reasons why NumPy operations are efficient versus Python:
- Everything in Python is an object. This includes, unlike C, numbers. Python types therefore have an overhead which does not exist with native C types.
- NumPy methods are usually C-based. In addition, optimised algorithms are used where possible.
Correct way to use np.vectorize to apply functions to all columns in Pandas dataframe
The correct way to use np.vectorize is to not use it - unless you are dealing with a function that only accepts scalar values, and you don't care about speed. Whenever I've tested it, explicit Python iteration has been faster.
At least that's the case when working with numpy arrays. With DataFrames, things become more complicated, since extracting Series and recreating frames can skew the timings substantially.
But let's look at your example in some detail.
Your sample frame:
In [177]: test_df = pd.DataFrame({'a': np.arange(5), 'b': np.arange(5), 'c': np.arange(5)})
...:
In [178]: test_df
Out[178]:
a b c
0 0 0 0
1 1 1 1
2 2 2 2
3 3 3 3
4 4 4 4
In [179]: def myfunc(a, b):
     ...:     return a+b
     ...:
Your apply:
In [180]: test_df.apply(lambda x: x+3, raw=True)
Out[180]:
a b c
0 3 3 3
1 4 4 4
2 5 5 5
3 6 6 6
4 7 7 7
In [181]: timeit test_df.apply(lambda x: x+3, raw=True)
186 µs ± 524 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# 1.23 ms ± 13.9 µs per loop without **raw**
I get the same thing by simply using the frame's own addition operator - and it is faster. Ok, for a more general function that won't work. Your use of apply with default axis and raw implies you have a function that only works with one column at a time.
In [182]: test_df+3
Out[182]:
a b c
0 3 3 3
1 4 4 4
2 5 5 5
3 6 6 6
4 7 7 7
In [183]: timeit test_df+3
114 µs ± 3.02 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
With raw you are passing numpy arrays to the lambda. The array for the whole frame is:
In [184]: test_df.to_numpy()
Out[184]:
array([[0, 0, 0],
[1, 1, 1],
[2, 2, 2],
[3, 3, 3],
[4, 4, 4]])
In [185]: test_df.to_numpy()+3
Out[185]:
array([[3, 3, 3],
[4, 4, 4],
[5, 5, 5],
[6, 6, 6],
[7, 7, 7]])
In [186]: timeit test_df.to_numpy()+3
13.1 µs ± 119 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
That's much faster. But to return a frame takes time.
In [188]: timeit pd.DataFrame(test_df.to_numpy()+3, columns=test_df.columns)
91.1 µs ± 769 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Testing vectorize.
In [218]: f=np.vectorize(myfunc)
f applies myfunc to each element of the input array iteratively. It has a clear performance disclaimer.
Even for this small array it is slow compared to direct application of the function to the array:
In [219]: timeit f(test_df.to_numpy(),3)
42.3 µs ± 123 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Passing the frame itself:
In [221]: timeit f(test_df,3)
69.8 µs ± 1.98 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [223]: timeit pd.DataFrame(f(test_df,3), columns=test_df.columns)
154 µs ± 2.11 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
And iteratively applying to columns - much slower:
In [226]: [f(test_df.iloc[:,i], 3) for i in range(test_df.shape[1])]
Out[226]: [array([3, 4, 5, 6, 7]), array([3, 4, 5, 6, 7]), array([3, 4, 5, 6, 7])]
In [227]: timeit [f(test_df.iloc[:,i], 3) for i in range(test_df.shape[1])]
477 µs ± 17.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
but a lot of that extra time comes from "extracting" columns:
In [228]: timeit [f(test_df.to_numpy()[:,i], 3) for i in range(test_df.shape[1])]
127 µs ± 357 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Quoting the np.vectorize docs:
Notes
-----
The `vectorize` function is provided primarily for convenience, not for
performance. The implementation is essentially a for loop.
Why Pandas apply can be faster than vectorized built-ins
In short, your question is whether
s.astype(str).str[0].astype(int)
fuses your operations together and then iterates over the series once, or creates a temporary series for each operation, and how to verify this.
My hypothesis (and I guess yours) is that it is the latter. You have the right explanation there but how to test?
My suggestion is:
s1 = s.astype(str)
s2 = s1.str[0]
s3 = s2.astype(int)
See how long each operation takes and how long the 3 operations take together. Most likely each operation will take about the same amount of time (the complexity of each operation is about the same), which would strongly indicate that our hypothesis is right. If the first two operations take almost no time but the last takes pretty much all of the time, our hypothesis is probably wrong.
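The measurement suggested above could be sketched as follows. The sample series and the timed helper are assumptions made for illustration, and the printed timings will vary between machines, so treat them as indicative only:

```python
import time

import numpy as np
import pandas as pd

# Assumed sample data: positive integers with at least two digits.
rng = np.random.default_rng(0)
s = pd.Series(rng.integers(10, 10**6, size=100_000))

def timed(fn):
    # Hypothetical helper: run fn once, return its result and elapsed seconds.
    start = time.perf_counter()
    out = fn()
    return out, time.perf_counter() - start

s1, t1 = timed(lambda: s.astype(str))
s2, t2 = timed(lambda: s1.str[0])
s3, t3 = timed(lambda: s2.astype(int))
chained, t_all = timed(lambda: s.astype(str).str[0].astype(int))

print(f"astype(str): {t1:.4f}s, .str[0]: {t2:.4f}s, astype(int): {t3:.4f}s")
print(f"sum of steps: {t1 + t2 + t3:.4f}s, chained: {t_all:.4f}s")
```

If the chained time is close to the sum of the three steps, that is consistent with a temporary series being built for each operation.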
Using np.vectorize to create a column in a data frame
Here is a version that doesn't use np.vectorize:
def scoreFunHigh(val, mean, diff, multip):
    upper = mean * (1 + diff)
    lower = mean * (1 - diff)
    if val > upper:
        return multip * 1
    elif val < lower:
        return multip * (-1)
    else:
        return 0

letterMeanX = df.groupby('letters')['x'].apply(lambda x: np.nanmean(x))
df['letter x score'] = df.apply(
    lambda row: scoreFunHigh(row['x'], letterMeanX[row['letters']], 0.2, 10), axis=1)
Output
x y letters letter x score
0 52 76 A 0
1 90 99 B 10
2 87 43 C 10
3 44 73 D 0
4 49 3 A 0
.. .. .. ... ...
95 16 51 D -10
96 38 3 A 0
97 43 47 B 0
98 58 39 C 0
99 41 26 D 0
How to speed up Pandas apply function to create a new column in the dataframe?
This looks like an IO-bound rather than a CPU-bound issue, so multiprocessing would not help. The major bottleneck is your call to Nominatim(). You make an HTTP request to their API for every non-NaN value. This means if 'India' appears in 5 rows, you will make 5 calls for India, which wastefully return the same geolocation each time.
Optimising this would require a mixture of caching the most frequent locations locally and caching the few new lookups made during the run.
- Create a DataFrame with the N most frequent locations.
- Call Nominatim() on the most frequent locations and save this as a lookup dict/json, e.g. location_geo = df.set_index('location').to_dict()['geolocation']
- Save it, e.g. json.dump...
- In your function, check if the location is in your cached location_geo dictionary, and return the value. If not, then make a call to the Nominatim API.
In the end you would have something like this:
import json
from functools import lru_cache

import numpy as np
import pycountry
from geopy.geocoders import Nominatim

geolocator = Nominatim()  # note: recent geopy versions require a user_agent argument

# load the most frequent locations
with open('our_save_freq_location.json', 'r') as f:
    location_geolocation = json.load(f)

@lru_cache
def do_fuzzy_search(location):
    if type(location) == float and np.isnan(location):
        return np.nan
    else:
        try:
            result = pycountry.countries.search_fuzzy(location)
        except Exception:
            try:
                # look first in our dictionary; only call Nominatim on a miss
                loc = location_geolocation.get(location)
                if loc is None:
                    loc = geolocator.geocode(location)
                return loc.raw['display_name'].split(', ')[-1]
            except Exception:
                return np.nan
        else:
            return result[0].name
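The caching idea can be tried without any network access. In this sketch slow_lookup is a hypothetical stand-in for the geocoder call and preloaded plays the role of the saved dictionary of frequent locations; neither name comes from the original code. Note that the slow call is only made on a miss, since a default passed to dict.get would be evaluated even when the key is present.

```python
from functools import lru_cache

calls = []  # records every "remote" lookup that actually happens

def slow_lookup(location):
    # Hypothetical stand-in for an expensive HTTP geocoding request.
    calls.append(location)
    return f"geo:{location}"

# Hypothetical preloaded cache of the most frequent locations.
preloaded = {"India": "geo:India"}

@lru_cache(maxsize=None)
def cached_lookup(location):
    hit = preloaded.get(location)
    if hit is not None:
        return hit
    # Only reached on a miss; lru_cache remembers the answer afterwards.
    return slow_lookup(location)

results = [cached_lookup(x) for x in ["India", "France", "India", "France", "Peru"]]
print(calls)  # only the distinct misses trigger a slow lookup
```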
How iterate in a efficient way over Pandas dataframe with Numpy.vectorize?
When you define a function to be vectorized, then:
- each column should be a separate parameter,
- you should call it passing corresponding columns,
- "other" parameters (not taken from the source array) should be marked as "excluded" parameters.
Another detail is that a vectorized function should not print anything, but it should return some value - the result of processing parameters from the current source row.
So you could e.g. proceed as follows:
- Define your function as:
  def myFunct(col1, col2, hg):
      return f'{hg} / {col1} / {col2}'
  Don't use the word vectorize in the name of the function. For now it is an "ordinary" function. It will be vectorized in a moment.
- Create the vectorized function:
  vfunct = np.vectorize(myFunct, excluded=['hg'])
- And finally call it, passing hg by keyword so that the name-based exclusion applies:
  vfunct(df.tweets_id, df.tokenized_text, hg='#Python')
The result, for my sample data, is:
array(['#Python / 101 / aaa bbb ccc ddd',
'#Python / 102 / eee fff ggg hhh iii jjj',
'#Python / 103 / kkk lll mmm nnn ooo ppp'], dtype='<U39')
It is up to you what you do with this result. You may e.g. set it as a new column of your source DataFrame.
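Put together, the steps might look like the sketch below. The two-row frame is made-up sample data and the column name tagged is hypothetical; hg is passed by keyword because a name listed in excluded is matched against keyword arguments.

```python
import numpy as np
import pandas as pd

def my_funct(col1, col2, hg):
    # Ordinary scalar function: one value from each column plus a constant.
    return f'{hg} / {col1} / {col2}'

# Assumed sample data with the column names used above.
df = pd.DataFrame({'tweets_id': [101, 102],
                   'tokenized_text': ['aaa bbb', 'ccc ddd']})

vfunct = np.vectorize(my_funct, excluded=['hg'])
result = vfunct(df['tweets_id'].to_numpy(),
                df['tokenized_text'].to_numpy(),
                hg='#Python')

df['tagged'] = result  # hypothetical column name
print(df)
```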
Expanding zscore np.vectorize rather than apply
Rework your transformation to be vectorized (per group):
(df.set_index("date")
.groupby("ids")[["x","y"]]
.transform(lambda d: (d-d.expanding(5).mean())/d.expanding(5).std())
)
Or using a function:
def expanding_zscore(d, window=5):
    return (d-d.expanding(window).mean())/d.expanding(window).std()

(df.set_index("date")
   .groupby("ids")[["x","y"]]
   .transform(expanding_zscore, window=5)
)
output:
x y
date
2000-01-01 NaN NaN
2000-01-02 NaN NaN
2000-01-03 NaN NaN
2000-01-04 NaN NaN
2000-01-05 0.797018 0.845773
... ... ...
2027-05-14 -1.216591 -0.121771
2027-05-15 -1.550736 1.191920
2027-05-16 -1.659481 -0.304257
2027-05-17 0.295209 -0.521772
2027-05-18 1.702968 -0.462038
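To check what the transform computes, here is a small sketch on made-up data (two ids with eight rows each; the original frame is not shown, so the names and sizes are assumptions). The first argument of expanding is min_periods, which is why the first four rows of each group come out as NaN.

```python
import numpy as np
import pandas as pd

def expanding_zscore(d, min_periods=5):
    # Z-score of each value against all values seen so far in its group.
    exp = d.expanding(min_periods)
    return (d - exp.mean()) / exp.std()

rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "date": list(pd.date_range("2000-01-01", periods=8)) * 2,
    "ids": ["a"] * 8 + ["b"] * 8,
    "x": rng.normal(size=16),
    "y": rng.normal(size=16),
})

z = toy.groupby("ids")[["x", "y"]].transform(expanding_zscore)

# Manual check for the fifth row of group "a".
first5 = toy.loc[toy["ids"] == "a", "x"].iloc[:5]
expected = (first5.iloc[4] - first5.mean()) / first5.std()
print(z.loc[4, "x"], expected)
```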
why does np.vectorize work here when np.where throws a TypeError?
The full traceback from your where expression is:
Traceback (most recent call last):
File "/usr/lib/python3.8/enum.py", line 641, in __new__
return cls._value2member_map_[value]
TypeError: unhashable type: 'Series'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "<ipython-input-27-16f5edc71240>", line 3, in <module>
MyEnum(df['myenum']).name,
File "/usr/lib/python3.8/enum.py", line 339, in __call__
return cls.__new__(cls, value)
File "/usr/lib/python3.8/enum.py", line 648, in __new__
if member._value_ == value:
File "/usr/local/lib/python3.8/dist-packages/pandas/core/generic.py", line 1537, in __nonzero__
raise ValueError(
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
It's produced by giving the whole series to MyEnum:
In [30]: MyEnum(df['myenum'])
Traceback (most recent call last):
File "/usr/lib/python3.8/enum.py", line 641, in __new__
return cls._value2member_map_[value]
TypeError: unhashable type: 'Series'
...
The problem isn't with the where at all.
The where works fine if we provide it with a valid list of strings:
In [33]: np.where(
...: df['myenum'] > 0,
...: [vectorize_enum_value(x) for x in df['myenum']],
...: ''
...: )
Out[33]:
array(['', 'First', 'Second', '', '', '', 'Second', 'First', ''],
dtype='<U6')
That 2nd argument, the list comprehension, is basically the same as the vectorize.
where is a function; Python evaluates function arguments before passing them in. So each argument has to work. Unlike apply or even vectorize, where does not iterate over its inputs calling a function per element.
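A self-contained sketch of the same point follows. MyEnum, the sample values and the enum_name helper are assumptions, since the original definitions are not shown. The failing call is isolated first to show that it breaks before np.where is ever involved:

```python
import enum

import numpy as np
import pandas as pd

class MyEnum(enum.IntEnum):  # assumed definition
    First = 1
    Second = 2

df = pd.DataFrame({'myenum': [0, 1, 2, 0, 2]})  # assumed sample data

# Passing the whole Series fails on its own, independent of np.where.
try:
    MyEnum(df['myenum'])
    whole_series_failed = False
except (TypeError, ValueError):
    whole_series_failed = True

def enum_name(value):
    # Hypothetical helper: map one scalar to its member name, '' if unknown.
    try:
        return MyEnum(value).name
    except ValueError:
        return ''

# Build the full list of names first, then let np.where select from it.
names = [enum_name(v) for v in df['myenum'].tolist()]
result = np.where(df['myenum'] > 0, names, '')
print(result)
```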
Pandas create two new columns based on 2 existing columns
What you want to do is to pivot your table, and then add a column with aggregated data from the original table.
df = pd.DataFrame(dummy_dict_existing)
pivot_df = df.pivot(index='Email', columns='Ticket_Category', values='Quantity_Purchased')
pivot_df['total'] = df.groupby('Email')['Total_Price_Paid'].sum()
Email | Tier1 | Tier2 | total |
---|---|---|---|
joblogs@gmail.com | 5 | 2 | 11641.33 |
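Since dummy_dict_existing is not shown, here is the same pivot-then-aggregate pattern on made-up data. The emails and numbers are purely illustrative and do not reproduce the table above:

```python
import pandas as pd

# Hypothetical input: one row per (Email, Ticket_Category) pair.
dummy = {
    'Email': ['a@example.com', 'a@example.com', 'b@example.com'],
    'Ticket_Category': ['Tier1', 'Tier2', 'Tier1'],
    'Quantity_Purchased': [5, 2, 1],
    'Total_Price_Paid': [100.0, 50.0, 20.0],
}
df = pd.DataFrame(dummy)

# One column per ticket category, indexed by email.
pivot_df = df.pivot(index='Email', columns='Ticket_Category',
                    values='Quantity_Purchased')

# The groupby result is also indexed by Email, so assignment aligns by label.
pivot_df['total'] = df.groupby('Email')['Total_Price_Paid'].sum()
print(pivot_df)
```

Categories that an email never purchased show up as NaN after the pivot.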