Why Is Mean() So Slow

Why is mean() so slow?

It is due to the S3 method dispatch (looking up the right method for the object), followed by the argument matching and other code in mean.default.

sum and length are both primitive functions, so sum(t1)/length(t1) will be fast (but note that it does not handle NA values the way mean(x, na.rm = TRUE) does).

library(microbenchmark)

t1 <- rnorm(10)
microbenchmark(
  mean(t1),
  sum(t1) / length(t1),
  mean.default(t1),
  .Internal(mean(t1)),
  times = 10000
)

Unit: nanoseconds
                expr   min    lq median    uq     max neval
            mean(t1) 10266 10951  11293 11635 1470714 10000
  sum(t1)/length(t1)   684  1027   1369  1711  104367 10000
    mean.default(t1)  2053  2396   2738  2739 1167195 10000
 .Internal(mean(t1))   342   343    685   685   86574 10000

The internal bit of mean is faster even than sum/length.

See http://rwiki.sciviews.org/doku.php?id=packages:cran:data.table#method_dispatch_takes_time (mirror) for more details (and a data.table solution that avoids .Internal).

Note that if we increase the length of the vector, the primitive approach is still fastest:

t1 <- rnorm(1e7)
microbenchmark(
  mean(t1),
  sum(t1) / length(t1),
  mean.default(t1),
  .Internal(mean(t1)),
  times = 100
)

Unit: milliseconds
                expr      min       lq   median       uq      max neval
            mean(t1) 25.79873 26.39242 26.56608 26.85523 33.36137   100
  sum(t1)/length(t1) 15.02399 15.22948 15.31383 15.43239 19.20824   100
    mean.default(t1) 25.69402 26.21466 26.44683 26.84257 33.62896   100
 .Internal(mean(t1)) 25.70497 26.16247 26.39396 26.63982 35.21054   100

Now method dispatch accounts for only a small fraction of the overall time required.

Why is statistics.mean() so slow?

Python's statistics module is built for precision, not for speed.

The specification for this module notes that:

The built-in sum can lose accuracy when dealing with floats of wildly
differing magnitude. Consequently, the above naive mean fails this
"torture test"

assert mean([1e30, 1, 3, -1e30]) == 1

returning 0 instead of 1, a purely computational error of 100%.

Using math.fsum inside mean will make it more accurate with float
data, but it also has the side-effect of converting any arguments to
float even when unnecessary. E.g. we should expect the mean of a list
of Fractions to be a Fraction, not a float.

Indeed, if we take a look at the implementation of _sum() in this module, the first lines of its docstring confirm this:

def _sum(data, start=0):
    """_sum(data [, start]) -> (type, sum, count)

    Return a high-precision sum of the given numeric data as a fraction,
    together with the type to be converted to and the count of items.

    [...] """

So the statistics module's implementation of sum, instead of being a one-line call to Python's built-in sum() function, takes about twenty lines of its own, with a nested for loop in its body.

This happens because statistics._sum chooses to guarantee maximum precision for every numeric type it might encounter (even when their magnitudes differ wildly), rather than simply emphasizing speed.

Hence it is no surprise that the built-in sum proves about a hundred times faster; the cost is much lower precision if you happen to call it with numbers of wildly different magnitudes or with exotic numeric types.
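To make the trade-off concrete, here is a small check you can run yourself (a minimal sketch; the exact printed values assume IEEE 754 float64 arithmetic, which is what CPython uses):

import math
import statistics
from fractions import Fraction

data = [1e30, 1, 3, -1e30]

print(sum(data) / len(data))        # 0.0  -- the naive sum absorbs the 1 and 3
print(math.fsum(data) / len(data))  # 1.0  -- fsum tracks the lost low-order bits
print(statistics.mean(data))        # 1.0  -- exact

# statistics.mean also preserves the input type where it can:
print(statistics.mean([Fraction(1, 3), Fraction(2, 3)]))  # 1/2 (a Fraction, not a float)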

Other options

If you need to prioritize speed, have a look at NumPy instead, whose algorithms are implemented in C.

NumPy's mean is not nearly as precise as statistics, but it implements (since 2013) a routine based on pairwise summation, which is more accurate than a naive sum/len (a rough sketch of the idea follows below).
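For intuition, here is a minimal, unoptimized sketch of pairwise summation; it is not NumPy's actual implementation, which uses a larger, unrolled base case for speed:

def pairwise_sum(x, lo=0, hi=None):
    """Recursively sum x[lo:hi] by splitting the range in half.

    Rounding error grows roughly like O(log n), versus O(n) for a
    left-to-right running sum.
    """
    if hi is None:
        hi = len(x)
    if hi - lo <= 8:                 # small base case: plain running sum
        total = 0.0
        for i in range(lo, hi):
            total += x[i]
        return total
    mid = (lo + hi) // 2
    return pairwise_sum(x, lo, mid) + pairwise_sum(x, mid, hi)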

However...

import numpy as np
import statistics

np_mean = np.mean([1e30, 1, 3, -1e30])
statistics_mean = statistics.mean([1e30, 1, 3, -1e30])

print('NumPy mean: {}'.format(np_mean))
print('Statistics mean: {}'.format(statistics_mean))

> NumPy mean: 0.0
> Statistics mean: 1.0

pandas: DataFrame.mean() very slow. How can I calculate means of columns faster?

Here's a similarly sized frame, but without an object column:

In [10]: nrows = 10000000

In [11]: df = pd.concat([DataFrame(randn(int(nrows),34),columns=[ 'f%s' % i for i in range(34) ]),DataFrame(randint(0,10,size=int(nrows*19)).reshape(int(nrows),19),columns=[ 'i%s' % i for i in range(19) ])],axis=1)

In [12]: df.iloc[1000:10000,0:20] = np.nan

In [13]: df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10000000 entries, 0 to 9999999
Data columns (total 53 columns):
f0 9991000 non-null values
f1 9991000 non-null values
f2 9991000 non-null values
f3 9991000 non-null values
f4 9991000 non-null values
f5 9991000 non-null values
f6 9991000 non-null values
f7 9991000 non-null values
f8 9991000 non-null values
f9 9991000 non-null values
f10 9991000 non-null values
f11 9991000 non-null values
f12 9991000 non-null values
f13 9991000 non-null values
f14 9991000 non-null values
f15 9991000 non-null values
f16 9991000 non-null values
f17 9991000 non-null values
f18 9991000 non-null values
f19 9991000 non-null values
f20 10000000 non-null values
f21 10000000 non-null values
f22 10000000 non-null values
f23 10000000 non-null values
f24 10000000 non-null values
f25 10000000 non-null values
f26 10000000 non-null values
f27 10000000 non-null values
f28 10000000 non-null values
f29 10000000 non-null values
f30 10000000 non-null values
f31 10000000 non-null values
f32 10000000 non-null values
f33 10000000 non-null values
i0 10000000 non-null values
i1 10000000 non-null values
i2 10000000 non-null values
i3 10000000 non-null values
i4 10000000 non-null values
i5 10000000 non-null values
i6 10000000 non-null values
i7 10000000 non-null values
i8 10000000 non-null values
i9 10000000 non-null values
i10 10000000 non-null values
i11 10000000 non-null values
i12 10000000 non-null values
i13 10000000 non-null values
i14 10000000 non-null values
i15 10000000 non-null values
i16 10000000 non-null values
i17 10000000 non-null values
i18 10000000 non-null values
dtypes: float64(34), int64(19)

Timings (on a machine with specs similar to yours):

In [14]: %timeit df.mean()
1 loops, best of 3: 21.5 s per loop

You can get a 2x speedup by pre-converting to floats (mean does this internally, but in a more general way, so it is slower):

In [15]: %timeit df.astype('float64').mean()
1 loops, best of 3: 9.45 s per loop

Your problem is the object column. mean will try to calculate over all of the columns, but because of the object column everything is upcast to object dtype, which is not efficient for calculating.

Your best bet is to do:

 df._get_numeric_data().mean()

There is a numeric_only option to do this at the lower level, but for some reason it isn't directly supported via the top-level functions (e.g. mean). I think I will create an issue to add this parameter; however it will probably default to False (so that columns are not excluded by default).
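As an aside: depending on your pandas version, the top-level reductions may already expose this switch, so you may not need the private _get_numeric_data helper at all. A small sketch (the column names here are made up for illustration):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'f0': np.random.randn(5),
    'i0': np.random.randint(0, 10, 5),
    'obj': list('abcde'),   # the kind of object column that forces the upcast
})

# Restrict to numeric dtypes explicitly...
print(df.select_dtypes(include='number').mean())

# ...or, if your pandas version supports it, pass numeric_only directly:
print(df.mean(numeric_only=True))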

Why is numpy ma.average 24 times slower than arr.mean?

A good way to find out why something is slow is to profile it. I'll use the third-party library line_profiler and the IPython magic %lprun here:

%load_ext line_profiler

import numpy as np
arr = np.full((3, 3), -9999, dtype=float)

%lprun -f np.ma.average np.ma.average(arr, axis=0)

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   519                                           def average(a, axis=None, weights=None, returned=False):
   ...
   570         1         1810   1810.0     30.5      a = asarray(a)
   571         1           15     15.0      0.3      m = getmask(a)
   572
   573                                               # inspired by 'average' in numpy/lib/function_base.py
   574
   575         1            5      5.0      0.1      if weights is None:
   576         1         3500   3500.0     59.0          avg = a.mean(axis)
   577         1          591    591.0     10.0          scl = avg.dtype.type(a.count(axis))
   578                                               else:
   ...
   608
   609         1            7      7.0      0.1      if returned:
   610                                                   if scl.shape != avg.shape:
   611                                                       scl = np.broadcast_to(scl, avg.shape).copy()
   612                                                   return avg, scl
   613                                               else:
   614         1            5      5.0      0.1          return avg

I removed some irrelevant lines.

So actually 30% of the time is spent in np.ma.asarray (something that arr.mean doesn't have to do!).

However the relative times change drastically if you use a bigger array:

arr = np.full((1000, 1000), -9999, dtype=float)

%lprun -f np.ma.average np.ma.average(arr, axis=0)

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   519                                           def average(a, axis=None, weights=None, returned=False):
   ...
   570         1          609    609.0      7.6      a = asarray(a)
   571         1           14     14.0      0.2      m = getmask(a)
   572
   573                                               # inspired by 'average' in numpy/lib/function_base.py
   574
   575         1            7      7.0      0.1      if weights is None:
   576         1         6924   6924.0     86.9          avg = a.mean(axis)
   577         1          404    404.0      5.1          scl = avg.dtype.type(a.count(axis))
   578                                               else:
   ...
   609         1            6      6.0      0.1      if returned:
   610                                                   if scl.shape != avg.shape:
   611                                                       scl = np.broadcast_to(scl, avg.shape).copy()
   612                                                   return avg, scl
   613                                               else:
   614         1            6      6.0      0.1          return avg

This time the np.ma.MaskedArray.mean call takes up almost 90% of the time.

Note: You could also dig deeper and look into np.ma.asarray, np.ma.MaskedArray.count, or np.ma.MaskedArray.mean and check their line profilings. But I just wanted to show that there are lots of called functions that add to the overhead.

So the next question is: did the relative times between np.ndarray.mean and np.ma.average also change? At least on my computer the difference is much smaller now:

%timeit np.ma.average(arr, axis=0)
# 2.96 ms ± 91 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit arr.mean(axis=0)
# 1.84 ms ± 23.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

This time it's not even 2 times slower. I assume for even bigger arrays the difference will get even smaller.


This is also something that is actually quite common with NumPy:

The constant factors are quite high even for plain numpy functions (see for example my answer to the question "Performance in different vectorization method in numpy"). For np.ma these constant factors are even bigger, especially if you don't use a np.ma.MaskedArray as input. But even though the constant factors might be high, these functions excel with big arrays.
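To see the constant-factor effect directly, here is a rough, self-contained timing sketch (absolute numbers will differ on your machine; the point is only that the ratio shrinks as the array grows):

import numpy as np
from timeit import timeit

# The ratio between np.ma.average and plain ndarray.mean shrinks as the
# array grows, because the fixed Python-level overhead is amortized.
for n in (3, 100, 1000):
    arr = np.full((n, n), -9999.0)
    t_ma = timeit(lambda: np.ma.average(arr, axis=0), number=100)
    t_nd = timeit(lambda: arr.mean(axis=0), number=100)
    print(f"{n}x{n}: ma.average / ndarray.mean = {t_ma / t_nd:.1f}x")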

Custom mean implementation is slower than pandas default mean. How to optimize?

I found the solution myself. The logic is to first normalize all the values by dividing them by the length of the Series (the number of records), then take the default mean(), and finally multiply that normalized mean by the number of records. This improved the run time from 1 min 37 s to 3.13 s, though I still don't understand why the pandas implementation doesn't use such an optimization.

def mean_without_overflow_fast(col):
    # Dividing each value by the length first keeps the running sum in range;
    # note that the in-place division may modify the column that apply passes in.
    col /= len(col)
    return col.mean() * len(col)

Use this function as follows:

print(df.apply(mean_without_overflow_fast))
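The reason the trick works is that the intermediate running sum overflows float64, while the per-element values divided by the length stay in range. A tiny illustration with hypothetical values chosen near the float64 limit, just to trigger the overflow:

import pandas as pd

s = pd.Series([1e308, 1e308])

print(s.sum() / len(s))    # inf: the running sum overflows float64
print((s / len(s)).sum())  # 1e+308: dividing first keeps every term in range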

Average over a period of time is very slow

What I'm trying to do is compute the average over the last X days.

This would suggest:

SELECT ip.item_id AS id, avg(ip.value) AS result
FROM item_prices ip
WHERE ip.tdate <= current_date AND
      ip.tdate > current_date - X * interval '1 day'
GROUP BY ip.item_id;

I don't see what your actual query has to do with the question you are asking, though.


