Why is mean() so slow?
It is due to the S3 lookup for the method, followed by the necessary parsing of arguments in mean.default (and the rest of the code in mean).
sum and length are both primitive functions, so they will be fast (but how are you handling NA values?).
library(microbenchmark)

t1 <- rnorm(10)
microbenchmark(
mean(t1),
sum(t1)/length(t1),
mean.default(t1),
.Internal(mean(t1)),
times = 10000)
Unit: nanoseconds
expr min lq median uq max neval
mean(t1) 10266 10951 11293 11635 1470714 10000
sum(t1)/length(t1) 684 1027 1369 1711 104367 10000
mean.default(t1) 2053 2396 2738 2739 1167195 10000
.Internal(mean(t1)) 342 343 685 685 86574 10000
The internal bit of mean is faster even than sum/length.

See http://rwiki.sciviews.org/doku.php?id=packages:cran:data.table#method_dispatch_takes_time for more details (and a data.table solution that avoids .Internal).
Note that if we increase the length of the vector, the primitive approach is still the fastest:
t1 <- rnorm(1e7)
microbenchmark(
mean(t1),
sum(t1)/length(t1),
mean.default(t1),
.Internal(mean(t1)),
times = 100)
Unit: milliseconds
expr min lq median uq max neval
mean(t1) 25.79873 26.39242 26.56608 26.85523 33.36137 100
sum(t1)/length(t1) 15.02399 15.22948 15.31383 15.43239 19.20824 100
mean.default(t1) 25.69402 26.21466 26.44683 26.84257 33.62896 100
.Internal(mean(t1)) 25.70497 26.16247 26.39396 26.63982 35.21054 100
Now method dispatch accounts for only a small fraction of the overall time required.
Why is statistics.mean() so slow?
Python's statistics module is not built for speed, but for precision.
In the specs for this module, the following rationale appears:
The built-in sum can lose accuracy when dealing with floats of wildly
differing magnitude. Consequently, the above naive mean fails this
"torture test"
assert mean([1e30, 1, 3, -1e30]) == 1
returning 0 instead of 1, a purely computational error of 100%.
Using math.fsum inside mean will make it more accurate with float
data, but it also has the side-effect of converting any arguments to
float even when unnecessary. E.g. we should expect the mean of a list
of Fractions to be a Fraction, not a float.
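To see the precision issue concretely, here is a quick check of that "torture test" with the built-in sum versus math.fsum (a minimal sketch; nothing here comes from the statistics module itself):

```python
import math

data = [1e30, 1, 3, -1e30]

# Naive sum: 1e30 + 1 loses the 1 entirely (a float64 carries only
# ~15-17 significant digits), so the partial sums collapse to 0.
naive_mean = sum(data) / len(data)

# math.fsum tracks exact partial sums, so nothing is lost.
fsum_mean = math.fsum(data) / len(data)

print(naive_mean)  # 0.0
print(fsum_mean)   # 1.0
```

This also shows the side effect mentioned above: math.fsum always returns a float, even when the inputs are Fractions.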
Indeed, if we take a look at the implementation of _sum() in this module, the first lines of its docstring seem to confirm that:
def _sum(data, start=0):
"""_sum(data [, start]) -> (type, sum, count)
Return a high-precision sum of the given numeric data as a fraction,
together with the type to be converted to and the count of items.
[...] """
So the statistics implementation of sum, instead of being a simple one-line call to Python's built-in sum() function, takes about 20 lines by itself, with a nested for loop in its body.

This happens because statistics._sum chooses to guarantee maximum precision for every type of number it could encounter (even if they differ widely from one another), instead of simply emphasizing speed.
Hence, it is not surprising that the built-in sum proves a hundred times faster; the cost is much lower precision if you happen to call it with exotic numbers.
Other options
If you need to prioritize speed in your algorithms, you should have a look at NumPy instead, whose algorithms are implemented in C. NumPy's mean is not nearly as precise as statistics, but it has implemented (since 2013) a routine based on pairwise summation, which is better than a naive sum/len (more info in the link).
However...
import numpy as np
import statistics
np_mean = np.mean([1e30, 1, 3, -1e30])
statistics_mean = statistics.mean([1e30, 1, 3, -1e30])
print('NumPy mean: {}'.format(np_mean))
print('Statistics mean: {}'.format(statistics_mean))
> NumPy mean: 0.0
> Statistics mean: 1.0
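For reference, the pairwise idea can be sketched in a few lines of plain Python (a simplified illustration, not NumPy's actual C implementation, which also switches to an unrolled loop below a block-size cutoff for speed):

```python
def pairwise_sum(data):
    """Sum by recursively splitting the data in half.

    Rounding error grows like O(log n) instead of the O(n) of a
    naive left-to-right sum, at essentially the same cost.
    """
    n = len(data)
    if n <= 2:
        return sum(data)
    mid = n // 2
    return pairwise_sum(data[:mid]) + pairwise_sum(data[mid:])

print(pairwise_sum(list(range(1, 101))))  # 5050
```

Note that pairwise summation still fails the [1e30, 1, 3, -1e30] torture test, as the NumPy output above shows: it reduces accumulated rounding error but cannot recover digits already lost to catastrophic cancellation.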
pandas: DataFrame.mean() very slow. How can I calculate means of columns faster?
Here's a similar sized frame, but without an object column:
In [10]: nrows = 10000000
In [11]: df = pd.concat([DataFrame(randn(int(nrows),34),columns=[ 'f%s' % i for i in range(34) ]),DataFrame(randint(0,10,size=int(nrows*19)).reshape(int(nrows),19),columns=[ 'i%s' % i for i in range(19) ])],axis=1)
In [12]: df.iloc[1000:10000,0:20] = np.nan
In [13]: df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10000000 entries, 0 to 9999999
Data columns (total 53 columns):
f0 9991000 non-null values
f1 9991000 non-null values
f2 9991000 non-null values
f3 9991000 non-null values
f4 9991000 non-null values
f5 9991000 non-null values
f6 9991000 non-null values
f7 9991000 non-null values
f8 9991000 non-null values
f9 9991000 non-null values
f10 9991000 non-null values
f11 9991000 non-null values
f12 9991000 non-null values
f13 9991000 non-null values
f14 9991000 non-null values
f15 9991000 non-null values
f16 9991000 non-null values
f17 9991000 non-null values
f18 9991000 non-null values
f19 9991000 non-null values
f20 10000000 non-null values
f21 10000000 non-null values
f22 10000000 non-null values
f23 10000000 non-null values
f24 10000000 non-null values
f25 10000000 non-null values
f26 10000000 non-null values
f27 10000000 non-null values
f28 10000000 non-null values
f29 10000000 non-null values
f30 10000000 non-null values
f31 10000000 non-null values
f32 10000000 non-null values
f33 10000000 non-null values
i0 10000000 non-null values
i1 10000000 non-null values
i2 10000000 non-null values
i3 10000000 non-null values
i4 10000000 non-null values
i5 10000000 non-null values
i6 10000000 non-null values
i7 10000000 non-null values
i8 10000000 non-null values
i9 10000000 non-null values
i10 10000000 non-null values
i11 10000000 non-null values
i12 10000000 non-null values
i13 10000000 non-null values
i14 10000000 non-null values
i15 10000000 non-null values
i16 10000000 non-null values
i17 10000000 non-null values
i18 10000000 non-null values
dtypes: float64(34), int64(19)
Timings (on a machine with similar specs to yours):
In [14]: %timeit df.mean()
1 loops, best of 3: 21.5 s per loop
You can get a 2x speedup by pre-converting to floats (mean does this internally, but in a more general way, so it is slower):
In [15]: %timeit df.astype('float64').mean()
1 loops, best of 3: 9.45 s per loop
Your problem is the object column. mean will try to calculate over all of the columns, but because of the object column everything is upcast to object dtype, which is not efficient for calculating.
Best bet is to do
df._get_numeric_data().mean()
There is an option to do this, numeric_only, at the lower level, but for some reason we don't directly support it via the top-level functions (e.g. mean). I think an issue will be created to add this parameter; however, it will probably be False by default (so as not to exclude columns).
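For what it's worth, later pandas versions did expose this keyword on the top-level reductions, so the private helper can be avoided. A sketch against a toy frame (numeric_only=True must be passed explicitly; recent releases do not silently drop object columns):

```python
import pandas as pd

df = pd.DataFrame({
    'f0': [1.0, 2.0, 3.0],
    'i0': [10, 20, 30],
    'obj': ['a', 'b', 'c'],  # the kind of column that poisons df.mean()
})

# Average only the numeric columns; the object column is skipped
# instead of upcasting everything to object dtype.
means = df.mean(numeric_only=True)
print(means['f0'], means['i0'])  # 2.0 20.0
```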
Why is numpy ma.average 24 times slower than arr.mean?
A good way to find out why something is slow is to profile it. I'll use the third-party library line_profiler and the IPython command %lprun (see for example this blog) here:
%load_ext line_profiler
import numpy as np
arr = np.full((3, 3), -9999, dtype=float)
%lprun -f np.ma.average np.ma.average(arr, axis=0)
Line # Hits Time Per Hit % Time Line Contents
==============================================================
519 def average(a, axis=None, weights=None, returned=False):
...
570 1 1810 1810.0 30.5 a = asarray(a)
571 1 15 15.0 0.3 m = getmask(a)
572
573 # inspired by 'average' in numpy/lib/function_base.py
574
575 1 5 5.0 0.1 if weights is None:
576 1 3500 3500.0 59.0 avg = a.mean(axis)
577 1 591 591.0 10.0 scl = avg.dtype.type(a.count(axis))
578 else:
...
608
609 1 7 7.0 0.1 if returned:
610 if scl.shape != avg.shape:
611 scl = np.broadcast_to(scl, avg.shape).copy()
612 return avg, scl
613 else:
614 1 5 5.0 0.1 return avg
I removed some irrelevant lines.
So actually 30% of the time is spent in np.ma.asarray (something that arr.mean doesn't have to do!).
However the relative times change drastically if you use a bigger array:
arr = np.full((1000, 1000), -9999, dtype=float)
%lprun -f np.ma.average np.ma.average(arr, axis=0)
Line # Hits Time Per Hit % Time Line Contents
==============================================================
519 def average(a, axis=None, weights=None, returned=False):
...
570 1 609 609.0 7.6 a = asarray(a)
571 1 14 14.0 0.2 m = getmask(a)
572
573 # inspired by 'average' in numpy/lib/function_base.py
574
575 1 7 7.0 0.1 if weights is None:
576 1 6924 6924.0 86.9 avg = a.mean(axis)
577 1 404 404.0 5.1 scl = avg.dtype.type(a.count(axis))
578 else:
...
609 1 6 6.0 0.1 if returned:
610 if scl.shape != avg.shape:
611 scl = np.broadcast_to(scl, avg.shape).copy()
612 return avg, scl
613 else:
614 1 6 6.0 0.1 return avg
This time the np.ma.MaskedArray.mean function takes up almost 90% of the time.
Note: you could also dig deeper and look into np.ma.asarray, np.ma.MaskedArray.count, or np.ma.MaskedArray.mean and check their line profilings. But I just wanted to show that there are lots of called functions that add to the overhead.
So the next question is: did the relative times between np.ndarray.mean and np.ma.average also change? At least on my computer the difference is much smaller now:
%timeit np.ma.average(arr, axis=0)
# 2.96 ms ± 91 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit arr.mean(axis=0)
# 1.84 ms ± 23.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
This time it's not even 2 times slower. I assume for even bigger arrays the difference will get even smaller.
This is also something that is actually quite common with NumPy: the constant factors are quite high even for plain NumPy functions (see, for example, my answer to the question "Performance in different vectorization method in numpy"). For np.ma these constant factors are even bigger, especially if you don't use a np.ma.MaskedArray as input. But even though the constant factors might be high, these functions excel with big arrays.
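To see what the extra masked-array machinery actually buys you, here is a minimal sketch (the -9999 sentinel mirrors the arrays profiled above; the data values are illustrative):

```python
import numpy as np

# Treat -9999 as a missing-value sentinel, as in the profiled example.
raw = np.array([[1.0, -9999.0],
                [3.0, 4.0]])
arr = np.ma.masked_values(raw, -9999.0)

# np.ma.average skips the masked entries; a plain ndarray mean
# would happily fold the sentinel into the result.
col_avg = np.ma.average(arr, axis=0)
print(col_avg)  # column averages ignoring the mask: [2.0, 4.0]
```

That bookkeeping (mask construction, per-axis counts) is exactly where the constant-factor overhead comes from.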
Custom mean implementation is slower than pandas default mean. How to optimize?
Found the solution myself. The logic is to first normalize all the values by dividing them by the length of the Series (the number of records), then use the default df.mean(), and finally multiply the normalized mean by the number of records. This is an improvement from 1 min 37 s to 3.13 s, but I still don't understand why the pandas implementation does not use such an optimization.
def mean_without_overflow_fast(col):
    col /= len(col)              # note: modifies the column in place
    return col.mean() * len(col)
Use this function as follows:
print (df.apply(mean_without_overflow_fast))
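The trick works because mean(x) = sum(x / n) algebraically (dividing first, then averaging and rescaling, cancels out), while dividing first keeps the intermediate partial sums from overflowing to inf. A plain-Python sketch of the same idea:

```python
import math

def scaled_mean(xs):
    """Compute the mean by scaling each element down before summing."""
    n = len(xs)
    return sum(x / n for x in xs)  # mean(xs) == sum of xs/n

big = [1e308, 1e308]  # near the float64 maximum (~1.8e308)

# The naive partial sum 1e308 + 1e308 overflows to inf...
assert math.isinf(sum(big) / len(big))

# ...but scaling each element down first keeps everything finite.
print(scaled_mean(big))  # 1e+308
```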
Average over a period of time is very slow
"What I'm trying to do is compute the average over the last X days."

This would suggest:
SELECT ip.item_id AS id, avg(x.value) AS result
FROM item_prices ip
WHERE ip.tdate <= current_date AND
ip.tdate > current_date - X * interval '1 day'
GROUP BY ip.item_id;
I don't see what your actual query has to do with the question you are asking, though.