Why is mean() so slow?
It is due to the S3 lookup for the method, followed by the necessary parsing of arguments in mean.default (and the rest of the code in mean).
sum and length are both primitive functions, so they will be fast (but how are you handling NA values?).
library(microbenchmark)

t1 <- rnorm(10)
microbenchmark(
mean(t1),
sum(t1)/length(t1),
mean.default(t1),
.Internal(mean(t1)),
times = 10000)
Unit: nanoseconds
expr min lq median uq max neval
mean(t1) 10266 10951 11293 11635 1470714 10000
sum(t1)/length(t1) 684 1027 1369 1711 104367 10000
mean.default(t1) 2053 2396 2738 2739 1167195 10000
.Internal(mean(t1)) 342 343 685 685 86574 10000
The internal bit of mean is faster even than sum/length.

See http://rwiki.sciviews.org/doku.php?id=packages:cran:data.table#method_dispatch_takes_time for more details (and a data.table solution that avoids .Internal).
Note that if we increase the length of the vector, the primitive approach is still the fastest:
t1 <- rnorm(1e7)
microbenchmark(
mean(t1),
sum(t1)/length(t1),
mean.default(t1),
.Internal(mean(t1)),
times = 100)
Unit: milliseconds
expr min lq median uq max neval
mean(t1) 25.79873 26.39242 26.56608 26.85523 33.36137 100
sum(t1)/length(t1) 15.02399 15.22948 15.31383 15.43239 19.20824 100
mean.default(t1) 25.69402 26.21466 26.44683 26.84257 33.62896 100
.Internal(mean(t1)) 25.70497 26.16247 26.39396 26.63982 35.21054 100
Now method dispatch accounts for only a small fraction of the overall time required.
Why is statistics.mean() so slow?
Python's statistics module is not built for speed, but for precision.
In the specs for this module, the following rationale appears:
The built-in sum can lose accuracy when dealing with floats of wildly
differing magnitude. Consequently, the above naive mean fails this
"torture test"
assert mean([1e30, 1, 3, -1e30]) == 1
returning 0 instead of 1, a purely computational error of 100%.
Using math.fsum inside mean will make it more accurate with float
data, but it also has the side-effect of converting any arguments to
float even when unnecessary. E.g. we should expect the mean of a list
of Fractions to be a Fraction, not a float.
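To see the precision issue concretely, here is a quick check of that "torture test" with the built-in sum versus math.fsum (a minimal sketch; nothing here comes from the statistics module itself):

```python
import math

data = [1e30, 1, 3, -1e30]

# Naive sum: 1e30 + 1 loses the 1 entirely (a float64 carries only
# ~15-17 significant digits), so the partial sums collapse to 0.
naive_mean = sum(data) / len(data)

# math.fsum tracks exact partial sums, so nothing is lost.
fsum_mean = math.fsum(data) / len(data)

print(naive_mean)  # 0.0
print(fsum_mean)   # 1.0
```

This also shows the side effect mentioned above: math.fsum always returns a float, even when the inputs are Fractions.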
Indeed, if we take a look at the implementation of _sum() in this module, the first lines of its docstring seem to confirm that:
def _sum(data, start=0):
"""_sum(data [, start]) -> (type, sum, count)
Return a high-precision sum of the given numeric data as a fraction,
together with the type to be converted to and the count of items.
[...] """
So the statistics implementation of sum, instead of being a simple one-line call to Python's built-in sum() function, takes about 20 lines by itself, with a nested for loop in its body.

This happens because statistics._sum chooses to guarantee maximum precision for every type of number it could encounter (even if they differ widely from one another), instead of simply emphasizing speed.
Hence, it is not surprising that the built-in sum proves a hundred times faster; the cost is much lower precision if you happen to call it with exotic numbers.
Other options
If you need to prioritize speed in your algorithms, you should have a look at NumPy instead, whose algorithms are implemented in C. NumPy's mean is not nearly as precise as statistics, but it has implemented (since 2013) a routine based on pairwise summation, which is better than a naive sum/len (more info in the link).
However...
import numpy as np
import statistics
np_mean = np.mean([1e30, 1, 3, -1e30])
statistics_mean = statistics.mean([1e30, 1, 3, -1e30])
print('NumPy mean: {}'.format(np_mean))
print('Statistics mean: {}'.format(statistics_mean))
> NumPy mean: 0.0
> Statistics mean: 1.0
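For reference, the pairwise idea can be sketched in a few lines of plain Python (a simplified illustration, not NumPy's actual C implementation, which also switches to an unrolled loop below a block-size cutoff for speed):

```python
def pairwise_sum(data):
    """Sum by recursively splitting the data in half.

    Rounding error grows like O(log n) instead of the O(n) of a
    naive left-to-right sum, at essentially the same cost.
    """
    n = len(data)
    if n <= 2:
        return sum(data)
    mid = n // 2
    return pairwise_sum(data[:mid]) + pairwise_sum(data[mid:])

print(pairwise_sum(list(range(1, 101))))  # 5050
```

Note that pairwise summation still fails the [1e30, 1, 3, -1e30] torture test, as the NumPy output above shows: it reduces accumulated rounding error but cannot recover digits already lost to catastrophic cancellation.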
pandas: DataFrame.mean() very slow. How can I calculate means of columns faster?
Here's a similar sized frame, but without an object column:
In [10]: nrows = 10000000
In [11]: df = pd.concat([DataFrame(randn(int(nrows),34),columns=[ 'f%s' % i for i in range(34) ]),DataFrame(randint(0,10,size=int(nrows*19)).reshape(int(nrows),19),columns=[ 'i%s' % i for i in range(19) ])],axis=1)
In [12]: df.iloc[1000:10000,0:20] = np.nan
In [13]: df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10000000 entries, 0 to 9999999
Data columns (total 53 columns):
f0 9991000 non-null values
f1 9991000 non-null values
f2 9991000 non-null values
f3 9991000 non-null values
f4 9991000 non-null values
f5 9991000 non-null values
f6 9991000 non-null values
f7 9991000 non-null values
f8 9991000 non-null values
f9 9991000 non-null values
f10 9991000 non-null values
f11 9991000 non-null values
f12 9991000 non-null values
f13 9991000 non-null values
f14 9991000 non-null values
f15 9991000 non-null values
f16 9991000 non-null values
f17 9991000 non-null values
f18 9991000 non-null values
f19 9991000 non-null values
f20 10000000 non-null values
f21 10000000 non-null values
f22 10000000 non-null values
f23 10000000 non-null values
f24 10000000 non-null values
f25 10000000 non-null values
f26 10000000 non-null values
f27 10000000 non-null values
f28 10000000 non-null values
f29 10000000 non-null values
f30 10000000 non-null values
f31 10000000 non-null values
f32 10000000 non-null values
f33 10000000 non-null values
i0 10000000 non-null values
i1 10000000 non-null values
i2 10000000 non-null values
i3 10000000 non-null values
i4 10000000 non-null values
i5 10000000 non-null values
i6 10000000 non-null values
i7 10000000 non-null values
i8 10000000 non-null values
i9 10000000 non-null values
i10 10000000 non-null values
i11 10000000 non-null values
i12 10000000 non-null values
i13 10000000 non-null values
i14 10000000 non-null values
i15 10000000 non-null values
i16 10000000 non-null values
i17 10000000 non-null values
i18 10000000 non-null values
dtypes: float64(34), int64(19)
Timings (on a machine with similar specs to yours):
In [14]: %timeit df.mean()
1 loops, best of 3: 21.5 s per loop
You can get a 2x speedup by pre-converting to floats (mean does this internally, but in a more general way, so it is slower):
In [15]: %timeit df.astype('float64').mean()
1 loops, best of 3: 9.45 s per loop
Your problem is the object column. mean will try to calculate over all of the columns, but because of the object column everything is upcast to object dtype, which is not efficient for calculating.
Best bet is to do
df._get_numeric_data().mean()
There is an option to do this, numeric_only, at the lower level, but for some reason we don't directly support it via the top-level functions (e.g. mean). I think an issue will be created to add this parameter; however, it will probably be False by default (so as not to exclude columns).
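For what it's worth, later pandas versions did expose this keyword on the top-level reductions, so the private helper can be avoided. A sketch against a toy frame (numeric_only=True must be passed explicitly; recent releases do not silently drop object columns):

```python
import pandas as pd

df = pd.DataFrame({
    'f0': [1.0, 2.0, 3.0],
    'i0': [10, 20, 30],
    'obj': ['a', 'b', 'c'],  # the kind of column that poisons df.mean()
})

# Average only the numeric columns; the object column is skipped
# instead of upcasting everything to object dtype.
means = df.mean(numeric_only=True)
print(means['f0'], means['i0'])  # 2.0 20.0
```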
Why is numpy ma.average 24 times slower than arr.mean?
A good way to find out why something is slow is to profile it. I'll use the third-party library line_profiler and the IPython command %lprun (see for example this blog) here:
%load_ext line_profiler
import numpy as np
arr = np.full((3, 3), -9999, dtype=float)
%lprun -f np.ma.average np.ma.average(arr, axis=0)
Line # Hits Time Per Hit % Time Line Contents
==============================================================
519 def average(a, axis=None, weights=None, returned=False):
...
570 1 1810 1810.0 30.5 a = asarray(a)
571 1 15 15.0 0.3 m = getmask(a)
572
573 # inspired by 'average' in numpy/lib/function_base.py
574
575 1 5 5.0 0.1 if weights is None:
576 1 3500 3500.0 59.0 avg = a.mean(axis)
577 1 591 591.0 10.0 scl = avg.dtype.type(a.count(axis))
578 else:
...
608
609 1 7 7.0 0.1 if returned:
610 if scl.shape != avg.shape:
611 scl = np.broadcast_to(scl, avg.shape).copy()
612 return avg, scl
613 else:
614 1 5 5.0 0.1 return avg
I removed some irrelevant lines.
So actually 30% of the time is spent in np.ma.asarray (something that arr.mean doesn't have to do!).
However the relative times change drastically if you use a bigger array:
arr = np.full((1000, 1000), -9999, dtype=float)
%lprun -f np.ma.average np.ma.average(arr, axis=0)
Line # Hits Time Per Hit % Time Line Contents
==============================================================
519 def average(a, axis=None, weights=None, returned=False):
...
570 1 609 609.0 7.6 a = asarray(a)
571 1 14 14.0 0.2 m = getmask(a)
572
573 # inspired by 'average' in numpy/lib/function_base.py
574
575 1 7 7.0 0.1 if weights is None:
576 1 6924 6924.0 86.9 avg = a.mean(axis)
577 1 404 404.0 5.1 scl = avg.dtype.type(a.count(axis))
578 else:
...
609 1 6 6.0 0.1 if returned:
610 if scl.shape != avg.shape:
611 scl = np.broadcast_to(scl, avg.shape).copy()
612 return avg, scl
613 else:
614 1 6 6.0 0.1 return avg
This time the np.ma.MaskedArray.mean function takes up almost 90% of the time.
Note: you could also dig deeper and look into np.ma.asarray, np.ma.MaskedArray.count, or np.ma.MaskedArray.mean and check their line profilings. But I just wanted to show that there are lots of called functions that add to the overhead.
So the next question is: did the relative times between np.ndarray.mean and np.ma.average also change? At least on my computer the difference is much smaller now:
%timeit np.ma.average(arr, axis=0)
# 2.96 ms ± 91 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit arr.mean(axis=0)
# 1.84 ms ± 23.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
This time it's not even 2 times slower. I assume for even bigger arrays the difference will get even smaller.
This is also something that is actually quite common with NumPy: the constant factors are quite high even for plain NumPy functions (see, for example, my answer to the question "Performance in different vectorization method in numpy"). For np.ma these constant factors are even bigger, especially if you don't use a np.ma.MaskedArray as input. But even though the constant factors might be high, these functions excel with big arrays.
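To see what the extra masked-array machinery actually buys you, here is a minimal sketch (the -9999 sentinel mirrors the arrays profiled above; the data values are illustrative):

```python
import numpy as np

# Treat -9999 as a missing-value sentinel, as in the profiled example.
raw = np.array([[1.0, -9999.0],
                [3.0, 4.0]])
arr = np.ma.masked_values(raw, -9999.0)

# np.ma.average skips the masked entries; a plain ndarray mean
# would happily fold the sentinel into the result.
col_avg = np.ma.average(arr, axis=0)
print(col_avg)  # column averages ignoring the mask: [2.0, 4.0]
```

That bookkeeping (mask construction, per-axis counts) is exactly where the constant-factor overhead comes from.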
Custom mean implementation is slower than pandas default mean. How to optimize?
Found the solution myself. The logic is to first normalize all the values by dividing them by the length of the Series (the number of records), then use the default df.mean(), and finally multiply the normalized mean by the number of records. This is an improvement from 1 min 37 s to 3.13 s, but I still don't understand why the pandas implementation does not use such an optimization.
def mean_without_overflow_fast(col):
    col /= len(col)              # note: modifies the column in place
    return col.mean() * len(col)
Use this function as follows:
print (df.apply(mean_without_overflow_fast))
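The trick works because mean(x) = sum(x / n) algebraically (dividing first, then averaging and rescaling, cancels out), while dividing first keeps the intermediate partial sums from overflowing to inf. A plain-Python sketch of the same idea:

```python
import math

def scaled_mean(xs):
    """Compute the mean by scaling each element down before summing."""
    n = len(xs)
    return sum(x / n for x in xs)  # mean(xs) == sum of xs/n

big = [1e308, 1e308]  # near the float64 maximum (~1.8e308)

# The naive partial sum 1e308 + 1e308 overflows to inf...
assert math.isinf(sum(big) / len(big))

# ...but scaling each element down first keeps everything finite.
print(scaled_mean(big))  # 1e+308
```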
Average over a period of time is very slow
"What I'm trying to do is compute the average over the last X days."

This would suggest:
SELECT ip.item_id AS id, avg(x.value) AS result
FROM item_prices ip
WHERE ip.tdate <= current_date AND
ip.tdate > current_date - X * interval '1 day'
GROUP BY ip.item_id;
I don't see what your actual query has to do with the question you are asking, though.