Performance of Pandas apply vs np.vectorize to create new column from existing columns
I will start by saying that the power of Pandas and NumPy arrays is derived from high-performance vectorised calculations on numeric arrays.1 The entire point of vectorised calculations is to avoid Python-level loops by moving calculations to highly optimised C code and utilising contiguous memory blocks.2
Python-level loops
Now we can look at some timings. Below are all Python-level loops which produce either pd.Series, np.ndarray or list objects containing the same values. For the purposes of assignment to a series within a dataframe, the results are comparable.
# Python 3.6.5, NumPy 1.14.3, Pandas 0.23.0
np.random.seed(0)
N = 10**5
%timeit list(map(divide, df['A'], df['B'])) # 43.9 ms
%timeit np.vectorize(divide)(df['A'], df['B']) # 48.1 ms
%timeit [divide(a, b) for a, b in zip(df['A'], df['B'])] # 49.4 ms
%timeit [divide(a, b) for a, b in df[['A', 'B']].itertuples(index=False)] # 112 ms
%timeit df.apply(lambda row: divide(*row), axis=1, raw=True) # 760 ms
%timeit df.apply(lambda row: divide(row['A'], row['B']), axis=1) # 4.83 s
%timeit [divide(row['A'], row['B']) for _, row in df[['A', 'B']].iterrows()] # 11.6 s
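The timings above rely on a divide function and a dataframe df whose definitions are not shown. The following is a minimal setup sketch, not the original code. It assumes a scalar divide that returns 0 when the denominator is 0 (consistent with the np.where expression later in this answer) and two random integer columns; the value range 0-9 is an assumption.

```python
import numpy as np
import pandas as pd

np.random.seed(0)
N = 10**5

# Assumed setup: two integer columns; zeros in B exercise the guard in divide.
df = pd.DataFrame({
    "A": np.random.randint(0, 10, N),
    "B": np.random.randint(0, 10, N),
})

def divide(a, b):
    """Scalar division that returns 0.0 instead of dividing by zero."""
    if b == 0:
        return 0.0
    return float(a) / b

# Three of the loop-based variants: same values, different containers.
via_map = list(map(divide, df["A"], df["B"]))
via_zip = [divide(a, b) for a, b in zip(df["A"], df["B"])]
via_vectorize = np.vectorize(divide)(df["A"], df["B"])
```

With this setup each %timeit line can be run as written in IPython or Jupyter; absolute timings will differ by machine and library version.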
Some takeaways:
- The tuple-based methods (the first 4) are far more efficient than the pd.Series-based methods (the last 3): 112 ms or less versus 760 ms or more in the timings above.
- np.vectorize, list comprehension + zip, and map, i.e. the top 3, all have roughly the same performance. This is because they use tuple and bypass some Pandas overhead from pd.DataFrame.itertuples.
- There is a significant speed improvement from using raw=True with pd.DataFrame.apply versus without. This option feeds NumPy arrays to the custom function instead of pd.Series objects.
pd.DataFrame.apply: just another loop
To see exactly the objects Pandas passes around, you can amend your function trivially:
def foo(row):
    print(type(row))
    assert False  # because you only need to see this once
df.apply(lambda row: foo(row), axis=1)
Output: <class 'pandas.core.series.Series'>. Creating, passing and querying a Pandas series object carries significant overheads relative to NumPy arrays. This shouldn't be a surprise: Pandas series include a decent amount of scaffolding to hold an index, values, attributes, etc.
Do the same exercise again with raw=True and you'll see <class 'numpy.ndarray'>. All this is described in the docs, but seeing it is more convincing.
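The same check can be done without aborting on an assertion. This is a small sketch that records the type handed to the function on every call; the helper name record and the two-row frame are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1.0, 2.0], "B": [3.0, 4.0]})

seen = []

def record(row):
    """Hypothetical helper: remember the type of each object apply passes in."""
    seen.append(type(row))
    return 0

df.apply(record, axis=1)            # default: rows arrive as pd.Series
default_types = set(seen)

seen.clear()
df.apply(record, axis=1, raw=True)  # raw=True: rows arrive as np.ndarray
raw_types = set(seen)

print(default_types, raw_types)
```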
np.vectorize: fake vectorisation
The docs for np.vectorize have the following note:
The vectorized function evaluates pyfunc over successive tuples of the input arrays like the python map function, except it uses the broadcasting rules of numpy.
The "broadcasting rules" are irrelevant here, since the input arrays have the same dimensions. The parallel to map is instructive, since the map version above has almost identical performance. The source code shows what's happening: np.vectorize converts your input function into a Universal function ("ufunc") via np.frompyfunc. There is some optimisation, e.g. caching, which can lead to some performance improvement.
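To make the parallel with map concrete, here is a small sketch showing that np.vectorize, a plain map, and np.frompyfunc all call the same scalar function once per element and give the same values. The scalar divide below is an assumed stand-in for the function being timed.

```python
import numpy as np

def divide(a, b):
    """Assumed scalar function: returns 0.0 when the denominator is zero."""
    if b == 0:
        return 0.0
    return float(a) / b

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 0.0, 4.0])

via_vectorize = np.vectorize(divide)(a, b)

# Explicit Python-level loop over element pairs.
via_map = np.array(list(map(divide, a, b)))

# frompyfunc(func, nin, nout) returns an object-dtype array, so cast it back.
via_frompyfunc = np.frompyfunc(divide, 2, 1)(a, b).astype(float)
```

All three are Python-level loops under the hood; none of them pushes the division itself into compiled code.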
In short, np.vectorize does what a Python-level loop should do, but pd.DataFrame.apply adds a chunky overhead. There's no JIT-compilation which you see with numba (see below). It's just a convenience.
True vectorisation: what you should use
Why aren't the above differences mentioned anywhere? Because the performance of truly vectorised calculations makes them irrelevant:
%timeit np.where(df['B'] == 0, 0, df['A'] / df['B']) # 1.17 ms
%timeit (df['A'] / df['B']).replace([np.inf, -np.inf], 0) # 1.96 ms
Yes, that's ~40x faster than the fastest of the above loopy solutions. Either of these is acceptable. In my opinion, the first is succinct, readable and efficient. Only look at other methods, e.g. numba below, if performance is critical and this is part of your bottleneck.
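Note that np.where evaluates the full A / B expression for every row before selecting, including rows where B is 0. As a possible alternative (a sketch, not part of the timings above), np.divide accepts out and where arguments so that the division is only carried out where the denominator is non-zero. The three-row frame is made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1.0, 2.0, 3.0], "B": [2.0, 0.0, 4.0]})

a = df["A"].to_numpy(dtype=float)
b = df["B"].to_numpy(dtype=float)

# Positions where b == 0 are skipped and keep the value already in `out` (0.0).
safe = np.divide(a, b, out=np.zeros_like(a), where=b != 0)

df["C"] = safe
```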
numba.njit: greater efficiency
When loops are considered viable they are usually optimised via numba with underlying NumPy arrays to move as much as possible to C.
Indeed, numba improves performance to microseconds. Without some cumbersome work, it will be difficult to get much more efficient than this.
from numba import njit
@njit
def divide(a, b):
    res = np.empty(a.shape)
    for i in range(len(a)):
        if b[i] != 0:
            res[i] = a[i] / b[i]
        else:
            res[i] = 0
    return res
%timeit divide(df['A'].values, df['B'].values) # 717 µs
Using @njit(parallel=True) may provide a further boost for larger arrays.
1 Numeric types include: int, float, datetime, bool, category. They exclude object dtype and can be held in contiguous memory blocks.
2 There are at least 2 reasons why NumPy operations are efficient versus Python:
- Everything in Python is an object. This includes, unlike C, numbers. Python types therefore have an overhead which does not exist with native C types.
- NumPy methods are usually C-based. In addition, optimised algorithms are used where possible.
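The object overhead mentioned in the second footnote can be observed directly. This sketch compares the size of a single Python float object with the per-element size of a float64 NumPy array; the exact size of a Python float depends on the interpreter build, so no specific number is assumed for it.

```python
import sys
import numpy as np

values = [float(i) for i in range(1000)]
arr = np.array(values)

# Size in bytes of one boxed Python float object (interpreter dependent).
per_python_float = sys.getsizeof(values[0])

# Bytes per element inside the array's data buffer (8 for float64).
per_numpy_item = arr.itemsize

# Whether the array's data sits in one contiguous block of memory.
is_contiguous = arr.flags["C_CONTIGUOUS"]

print(per_python_float, per_numpy_item, is_contiguous)
```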
Vectorize() vs apply()
Vectorize is just a wrapper for mapply. It just builds you an mapply loop for whatever function you feed it. Thus there are often easier things to do than Vectorize() it, and the explicit *apply solutions end up being computationally equivalent or perhaps superior.
Also, for your specific example, you've heard of mget, right?
Using Apply or Vectorize to apply custom function to a dataframe
For the particular task requested it could be:
celebrities$newcol <- with(celebrities, age + income)
The + function is inherently vectorized. Using apply with sum is inefficient. Using apply could have been greatly simplified by omitting the first column, because that would avoid the coercion to a character matrix caused by the first column.
celebrities$newcol <- apply(celebrities[-1], 1, function(x) sum(x) )
That way you would avoid coercing the vectors to "character" and then needing to coerce the formerly-numeric columns back to numeric. Using sum inside apply does get around the fact that sum is not vectorized, but it's an example of inefficient R coding.
You get automatic vectorization if the "inner" algorithm can be constructed completely from vectorized functions: the Math and Ops groups being the usual components. See ?Ops. Otherwise, you may need to use mapply or Vectorize.
Why Pandas apply can be faster than vectorized built-ins
In short, your question is whether
s.astype(np.str).str[0].astype(np.int)
fuses your operations together and then iterates over the series once, or creates a temporary series for each operation, and how to verify this.
My hypothesis (and I guess yours) is that it is the latter. You have the right explanation there, but how do you test it?
My suggestion is:
s1=s.astype(np.str)
s2=s1.str[0]
s3=s2.astype(np.int)
See how long each operation takes and how long the 3 operations take together. Most likely each operation will take about the same amount of time (the complexity of each operation is about the same), which would strongly indicate that our hypothesis is right. If the first two operations take almost no time and the last takes pretty much all of the time, our hypothesis is probably wrong.
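The suggested experiment can be sketched as follows. The series contents and the helper name timed are assumptions made for illustration. The builtins str and int are used in place of np.str and np.int, since those aliases were removed in NumPy 1.24.

```python
import time
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Assumed test data: positive integers, so the first character is a digit 1-9.
s = pd.Series(rng.integers(1, 10**6, size=10**5))

def timed(func):
    """Hypothetical helper: run func once, return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = func()
    return result, time.perf_counter() - start

s1, t1 = timed(lambda: s.astype(str))
s2, t2 = timed(lambda: s1.str[0])
s3, t3 = timed(lambda: s2.astype(int))
chained, t_all = timed(lambda: s.astype(str).str[0].astype(int))

print(f"astype(str): {t1:.4f}s  str[0]: {t2:.4f}s  astype(int): {t3:.4f}s")
print(f"sum of steps: {t1 + t2 + t3:.4f}s  chained: {t_all:.4f}s")
```

A single run is noisy; repeating the measurement (for example with %timeit in IPython) gives a more reliable comparison between the sum of the steps and the chained expression.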
Is the *apply family really not vectorized?
First of all, in your example you make tests on a "data.frame" which is not fair for colMeans, apply and "[.data.frame" since they have an overhead:
system.time(as.matrix(m)) #called by `colMeans` and `apply`
# user system elapsed
# 1.03 0.00 1.05
system.time(for(i in 1:ncol(m)) m[, i]) #in the `for` loop
# user system elapsed
# 12.93 0.01 13.07
On a matrix, the picture is a bit different:
mm = as.matrix(m)
system.time(colMeans(mm))
# user system elapsed
# 0.01 0.00 0.01
system.time(apply(mm, 2, mean))
# user system elapsed
# 1.48 0.03 1.53
system.time(for(i in 1:ncol(mm)) mean(mm[, i]))
# user system elapsed
# 1.22 0.00 1.21
Regarding the main part of the question, the main difference between lapply/mapply/etc and straightforward R-loops is where the looping is done. As Roland notes, both C and R loops need to evaluate an R function in each iteration, which is the most costly part. The really fast C functions are those that do everything in C, so, I guess, this should be what "vectorised" is about?
An example where we find the mean in each of a "list"s elements:
(EDIT May 11 '16 : I believe the example with finding the "mean" is not a good setup for the differences between evaluating an R function iteratively and compiled code, (1) because of the particularity of R's mean algorithm on "numeric"s over a simple sum(x) / length(x) and (2) it should make more sense to test on "list"s with length(x) >> lengths(x). So, the "mean" example is moved to the end and replaced with another.)
As a simple example we could consider the finding of the opposite of each length == 1 element of a "list":
In a tmp.c file:
#include <R.h>
#define USE_RINTERNALS
#include <Rinternals.h>
#include <Rdefines.h>
/* call a C function inside another */
double oppC(double x) { return(ISNAN(x) ? NA_REAL : -x); }
SEXP sapply_oppC(SEXP x)
{
    SEXP ans = PROTECT(allocVector(REALSXP, LENGTH(x)));
    for(int i = 0; i < LENGTH(x); i++)
        REAL(ans)[i] = oppC(REAL(VECTOR_ELT(x, i))[0]);
    UNPROTECT(1);
    return(ans);
}
/* call an R function inside a C function;
 * will be used with 'f' as a closure and as a builtin */
SEXP sapply_oppR(SEXP x, SEXP f)
{
    SEXP call = PROTECT(allocVector(LANGSXP, 2));
    SETCAR(call, install(CHAR(STRING_ELT(f, 0))));
    SEXP ans = PROTECT(allocVector(REALSXP, LENGTH(x)));
    for(int i = 0; i < LENGTH(x); i++) {
        SETCADR(call, VECTOR_ELT(x, i));
        REAL(ans)[i] = REAL(eval(call, R_GlobalEnv))[0];
    }
    UNPROTECT(2);
    return(ans);
}
And on the R side:
system("R CMD SHLIB /home/~/tmp.c")
dyn.load("/home/~/tmp.so")
with data:
set.seed(007)
myls = rep_len(as.list(c(NA, runif(3))), 1e7)
#a closure wrapper of `-`
oppR = function(x) -x
for_oppR = compiler::cmpfun(function(x, f)
{
    f = match.fun(f)
    ans = numeric(length(x))
    for(i in seq_along(x)) ans[[i]] = f(x[[i]])
    return(ans)
})
Benchmarking:
#call a C function iteratively
system.time({ sapplyC = .Call("sapply_oppC", myls) })
# user system elapsed
# 0.048 0.000 0.047
#evaluate an R closure iteratively
system.time({ sapplyRC = .Call("sapply_oppR", myls, "oppR") })
# user system elapsed
# 3.348 0.000 3.358
#evaluate an R builtin iteratively
system.time({ sapplyRCprim = .Call("sapply_oppR", myls, "-") })
# user system elapsed
# 0.652 0.000 0.653
#loop with a R closure
system.time({ forR = for_oppR(myls, "oppR") })
# user system elapsed
# 4.396 0.000 4.409
#loop with an R builtin
system.time({ forRprim = for_oppR(myls, "-") })
# user system elapsed
# 1.908 0.000 1.913
#for reference and testing
system.time({ sapplyR = unlist(lapply(myls, oppR)) })
# user system elapsed
# 7.080 0.068 7.170
system.time({ sapplyRprim = unlist(lapply(myls, `-`)) })
# user system elapsed
# 3.524 0.064 3.598
all.equal(sapplyR, sapplyRprim)
#[1] TRUE
all.equal(sapplyR, sapplyC)
#[1] TRUE
all.equal(sapplyR, sapplyRC)
#[1] TRUE
all.equal(sapplyR, sapplyRCprim)
#[1] TRUE
all.equal(sapplyR, forR)
#[1] TRUE
all.equal(sapplyR, forRprim)
#[1] TRUE
(Follows the original example of mean finding):
#all computations in C
all_C = inline::cfunction(sig = c(R_ls = "list"), body = '
    SEXP tmp, ans;
    PROTECT(ans = allocVector(REALSXP, LENGTH(R_ls)));
    double *ptmp, *pans = REAL(ans);
    for(int i = 0; i < LENGTH(R_ls); i++) {
        pans[i] = 0.0;
        PROTECT(tmp = coerceVector(VECTOR_ELT(R_ls, i), REALSXP));
        ptmp = REAL(tmp);
        for(int j = 0; j < LENGTH(tmp); j++) pans[i] += ptmp[j];
        pans[i] /= LENGTH(tmp);
        UNPROTECT(1);
    }
    UNPROTECT(1);
    return(ans);
')
#a very simple `lapply(x, mean)`
C_and_R = inline::cfunction(sig = c(R_ls = "list"), body = '
    SEXP call, ans, ret;
    PROTECT(call = allocList(2));
    SET_TYPEOF(call, LANGSXP);
    SETCAR(call, install("mean"));
    PROTECT(ans = allocVector(VECSXP, LENGTH(R_ls)));
    PROTECT(ret = allocVector(REALSXP, LENGTH(ans)));
    for(int i = 0; i < LENGTH(R_ls); i++) {
        SETCADR(call, VECTOR_ELT(R_ls, i));
        SET_VECTOR_ELT(ans, i, eval(call, R_GlobalEnv));
    }
    double *pret = REAL(ret);
    for(int i = 0; i < LENGTH(ans); i++) pret[i] = REAL(VECTOR_ELT(ans, i))[0];
    UNPROTECT(3);
    return(ret);
')
R_lapply = function(x) unlist(lapply(x, mean))
R_loop = function(x)
{
    ans = numeric(length(x))
    for(i in seq_along(x)) ans[i] = mean(x[[i]])
    return(ans)
}
R_loopcmp = compiler::cmpfun(R_loop)
set.seed(007); myls = replicate(1e4, runif(1e3), simplify = FALSE)
all.equal(all_C(myls), C_and_R(myls))
#[1] TRUE
all.equal(all_C(myls), R_lapply(myls))
#[1] TRUE
all.equal(all_C(myls), R_loop(myls))
#[1] TRUE
all.equal(all_C(myls), R_loopcmp(myls))
#[1] TRUE
microbenchmark::microbenchmark(all_C(myls),
C_and_R(myls),
R_lapply(myls),
R_loop(myls),
R_loopcmp(myls),
times = 15)
#Unit: milliseconds
# expr min lq median uq max neval
# all_C(myls) 37.29183 38.19107 38.69359 39.58083 41.3861 15
# C_and_R(myls) 117.21457 123.22044 124.58148 130.85513 169.6822 15
# R_lapply(myls) 98.48009 103.80717 106.55519 109.54890 116.3150 15
# R_loop(myls) 122.40367 130.85061 132.61378 138.53664 178.5128 15
# R_loopcmp(myls) 105.63228 111.38340 112.16781 115.68909 128.1976 15
Correct way to use np.vectorize to apply functions to all columns in Pandas dataframe
The correct way to use np.vectorize is to not use it, unless you are dealing with a function that only accepts scalar values and you don't care about speed. Whenever I've tested it, explicit Python iteration has been faster.
At least that's the case when working with numpy arrays. With DataFrames, things become more complicated, since extracting Series and recreating frames can skew the timings substantially.
But let's look at your example in some detail.
Your sample frame:
In [177]: test_df = pd.DataFrame({'a': np.arange(5), 'b': np.arange(5), 'c': np.arange(5)})
...:
In [178]: test_df
Out[178]:
a b c
0 0 0 0
1 1 1 1
2 2 2 2
3 3 3 3
4 4 4 4
In [179]: def myfunc(a, b):
     ...:     return a+b
     ...:
Your apply:
In [180]: test_df.apply(lambda x: x+3, raw=True)
Out[180]:
a b c
0 3 3 3
1 4 4 4
2 5 5 5
3 6 6 6
4 7 7 7
In [181]: timeit test_df.apply(lambda x: x+3, raw=True)
186 µs ± 524 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# 1.23 ms ± 13.9 µs per loop without raw=True
I get the same thing by simply using the frame's own addition operator, and it is faster. Ok, for a more general function that won't work. Your use of apply with default axis and raw implies you have a function that only works with one column at a time.
In [182]: test_df+3
Out[182]:
a b c
0 3 3 3
1 4 4 4
2 5 5 5
3 6 6 6
4 7 7 7
In [183]: timeit test_df+3
114 µs ± 3.02 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
With raw you are passing numpy arrays to the lambda. The array for the whole frame is:
In [184]: test_df.to_numpy()
Out[184]:
array([[0, 0, 0],
[1, 1, 1],
[2, 2, 2],
[3, 3, 3],
[4, 4, 4]])
In [185]: test_df.to_numpy()+3
Out[185]:
array([[3, 3, 3],
[4, 4, 4],
[5, 5, 5],
[6, 6, 6],
[7, 7, 7]])
In [186]: timeit test_df.to_numpy()+3
13.1 µs ± 119 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
That's much faster. But to return a frame takes time.
In [188]: timeit pd.DataFrame(test_df.to_numpy()+3, columns=test_df.columns)
91.1 µs ± 769 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
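The array round trip timed above can be wrapped in a small helper. This is a sketch; the name apply_on_array is made up for illustration, and it assumes the function returns an array of the same shape as its input. Passing the original index as well as the columns keeps row labels intact when the frame does not use a default index.

```python
import numpy as np
import pandas as pd

test_df = pd.DataFrame({"a": np.arange(5), "b": np.arange(5), "c": np.arange(5)})

def apply_on_array(frame, func):
    """Hypothetical helper: run func on the frame's 2-D array, rebuild the frame."""
    values = func(frame.to_numpy())
    # Reuse the original labels so the result lines up with the input frame.
    return pd.DataFrame(values, index=frame.index, columns=frame.columns)

result = apply_on_array(test_df, lambda arr: arr + 3)
print(result)
```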
Testing vectorize.
In [218]: f=np.vectorize(myfunc)
f applies myfunc to each element of the input array iteratively. It has a clear performance disclaimer.
Even for this small array it is slow compared to direct application of the function to the array:
In [219]: timeit f(test_df.to_numpy(),3)
42.3 µs ± 123 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Passing the frame itself:
In [221]: timeit f(test_df,3)
69.8 µs ± 1.98 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [223]: timeit pd.DataFrame(f(test_df,3), columns=test_df.columns)
154 µs ± 2.11 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
And iteratively applying to columns - much slower:
In [226]: [f(test_df.iloc[:,i], 3) for i in range(test_df.shape[1])]
Out[226]: [array([3, 4, 5, 6, 7]), array([3, 4, 5, 6, 7]), array([3, 4, 5, 6, 7])]
In [227]: timeit [f(test_df.iloc[:,i], 3) for i in range(test_df.shape[1])]
477 µs ± 17.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
but a lot of that extra time comes from "extracting" columns:
In [228]: timeit [f(test_df.to_numpy()[:,i], 3) for i in range(test_df.shape[1])]
127 µs ± 357 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Quoting the np.vectorize docs:
Notes
-----
The `vectorize` function is provided primarily for convenience, not for
performance. The implementation is essentially a for loop.
How to Vectorize this pandas apply function that uses other column values as new column names
Assuming unique obs, you can pivot and merge:
df2 = df.merge(df.pivot(index='obs', columns='purchase', values='amount'), on='obs')
Output:
obs purchase amount Coffee Juice
0 1 Coffee 1 1.0 NaN
1 2 Juice 1 NaN 1.0
2 3 Coffee 2 2.0 NaN
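For a self-contained run, here is a sketch that rebuilds the input from the three rows shown in the output above (an assumption about the original data). It passes keyword arguments to pivot and adds reset_index so that obs is an ordinary column on both sides of the merge.

```python
import pandas as pd

# Input reconstructed from the first three columns of the output shown above.
df = pd.DataFrame({
    "obs": [1, 2, 3],
    "purchase": ["Coffee", "Juice", "Coffee"],
    "amount": [1, 1, 2],
})

# One column per distinct purchase value; cells without a match become NaN.
wide = df.pivot(index="obs", columns="purchase", values="amount").reset_index()
wide.columns.name = None  # drop the leftover "purchase" label on the column axis

df2 = df.merge(wide, on="obs")
print(df2)
```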
Using np.vectorize to create a column in a data frame
Here is a version that doesn't use np.vectorize:
def scoreFunHigh(val, mean, diff, multip):
    upper = mean * (1 + diff)
    lower = mean * (1 - diff)
    if val > upper:
        return multip * 1
    elif val < lower:
        return multip * (-1)
    else:
        return 0
letterMeanX = df.groupby('letters')['x'].apply(lambda x: np.nanmean(x))
df['letter x score'] = df.apply(
    lambda row: scoreFunHigh(row['x'], letterMeanX[row['letters']], 0.2, 10), axis=1)
Output
x y letters letter x score
0 52 76 A 0
1 90 99 B 10
2 87 43 C 10
3 44 73 D 0
4 49 3 A 0
.. .. .. ... ...
95 16 51 D -10
96 38 3 A 0
97 43 47 B 0
98 58 39 C 0
99 41 26 D 0
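The row-wise apply above can also be expressed with column-level operations. This is a sketch of one possible alternative, not the answer's method: groupby(...).transform("mean") broadcasts each group's mean (skipping NaN) back onto the rows, and np.select picks the score from the two conditions. The six-row frame is made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up sample data with the same column names as the example above.
df = pd.DataFrame({
    "x": [10, 20, 30, 100, 100, 40],
    "letters": ["A", "A", "A", "B", "B", "B"],
})
diff, multip = 0.2, 10

# Per-row copy of the mean of x within each letter group.
group_mean = df.groupby("letters")["x"].transform("mean")
upper = group_mean * (1 + diff)
lower = group_mean * (1 - diff)

# The first matching condition wins; rows matching neither get the default.
df["letter x score"] = np.select(
    [df["x"] > upper, df["x"] < lower],
    [multip, -multip],
    default=0,
)
print(df)
```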
Speed up or vectorize pandas apply function - require a conditional application of a function
I attempted to create a reproducible example that should generalize to your problem. You can run the code with different row sizes to compare the results between the methods. It also shouldn't be difficult to extend one of these methods to use cython or multiprocessing for even faster speeds. You mentioned that your data is quite large; I haven't tested the memory usage of each method, so it's worth trying them out on your own machine.
import numpy as np
import pandas as pd
import time as t
# Example Functions
def foo(x):
    return x + x
def bar(x):
    return x * x
# Example Functions for multiple columns
def foo2(x, y):
    return x + y
def bar2(x, y):
    return x * y
# Create function dictionary
funcs = {'foo': foo, 'bar': bar}
funcs2 = {'foo': foo2, 'bar': bar2}
n_rows = 1000000
# Generate Sample Data
names = np.random.choice(list(funcs.keys()), size=n_rows)
values = np.random.normal(100, 20, size=n_rows)
df = pd.DataFrame()
df['name'] = names
df['value'] = values
# Create copy for comparison using different methods
df_copy = df.copy()
# Modified original master function
def masterFunc(row, functs):
    # Look the function up in the dictionary that was passed in
    correctFunction = functs[row['name']]
    return correctFunction(row['value']) + 3*row['value']
t1 = t.time()
df['output'] = df.apply(lambda x: masterFunc(x, funcs), axis=1)
t2 = t.time()
print("Time for all rows/functions (row-wise apply): ", t2 - t1)
# For functions that can be vectorized using numpy
t3 = t.time()
output_dataframe_list = []
for func_name, func in funcs.items():
    # .copy() avoids assigning into a slice of df_copy (SettingWithCopyWarning)
    df_subset = df_copy.loc[df_copy['name'] == func_name, :].copy()
    df_subset['output'] = func(df_subset['value'].values) + 3 * df_subset['value'].values
    output_dataframe_list.append(df_subset)
output_df = pd.concat(output_dataframe_list)
t4 = t.time()
print("Time for all rows/functions (vectorized per subset): ", t4 - t3)
# Using a for loop over a numpy array of values is still faster than dataframe apply
t5 = t.time()
output_dataframe_list2 = []
for func_name, func in funcs2.items():
    # .copy() avoids assigning into a slice of df_copy (SettingWithCopyWarning)
    df_subset = df_copy.loc[df_copy['name'] == func_name, :].copy()
    col1_values = df_subset['value'].values
    outputs = np.zeros(len(col1_values))
    for i, v in enumerate(col1_values):
        outputs[i] = func(v, v) + 3 * v
    df_subset['output'] = outputs
    output_dataframe_list2.append(df_subset)
output_df2 = pd.concat(output_dataframe_list2)
t6 = t.time()
print("Time for all rows/functions (loop over numpy values): ", t6 - t5)
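The subset-and-concat approaches above return rows grouped by function name rather than in the original order. As a possible variation (a sketch under the same assumptions about foo and bar, with a smaller made-up sample), boolean masks can write each function's results straight into one output array, which keeps the original row order and avoids building intermediate frames.

```python
import numpy as np
import pandas as pd

def foo(x):
    return x + x

def bar(x):
    return x * x

funcs = {"foo": foo, "bar": bar}

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "name": rng.choice(list(funcs), size=1000),
    "value": rng.normal(100, 20, size=1000),
})

values = df["value"].to_numpy()
names = df["name"].to_numpy()
output = np.empty(len(df))
for func_name, func in funcs.items():
    mask = names == func_name  # rows that should use this function
    output[mask] = func(values[mask]) + 3 * values[mask]
df["output"] = output

# Row-wise reference, only used to check the masked result.
reference = df.apply(
    lambda row: funcs[row["name"]](row["value"]) + 3 * row["value"], axis=1
)
```

This only works when every function accepts a NumPy array; functions that need scalar input would still require the explicit inner loop shown above.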