Why Is Sum So Much Faster Than Inject(:+)

Why is sum so much faster than inject(:+)?

Short answer

For an integer range:

  • Enumerable#sum computes (range.max - range.min + 1) * (range.max + range.min) / 2 in closed form (benchmarked below).
  • Enumerable#inject(:+) iterates over every element.
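
A quick benchmark makes the difference concrete (a rough sketch; absolute timings depend on your machine and Ruby version):

require 'benchmark'

n = 100_000_000
puts Benchmark.realtime { (1..n).sum }        # closed form: microseconds
puts Benchmark.realtime { (1..n).inject(:+) } # n - 1 additions: several seconds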

Theory

The sum of integers between 1 and n is called a triangular number, and is equal to n*(n+1)/2.

The sum of integers between n and m is the triangular number of m minus the triangular number of n-1, which is equal to m*(m+1)/2-n*(n-1)/2, and can be written (m-n+1)*(m+n)/2.
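
A quick sanity check in irb, taking n = 3 and m = 7:

(3..7).inject(:+)          #=> 25
7 * 8 / 2 - 3 * 2 / 2      #=> 25, the difference of the two triangular numbers
(7 - 3 + 1) * (7 + 3) / 2  #=> 25, the rearranged formula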

Enumerable#sum in Ruby 2.4

This property is used in Enumerable#sum for integer ranges:

if (RTEST(rb_range_values(obj, &beg, &end, &excl))) {
    if (!memo.block_given && !memo.float_value &&
            (FIXNUM_P(beg) || RB_TYPE_P(beg, T_BIGNUM)) &&
            (FIXNUM_P(end) || RB_TYPE_P(end, T_BIGNUM))) {
        return int_range_sum(beg, end, excl, memo.v);
    }
}

int_range_sum looks like this:

VALUE a;
a = rb_int_plus(rb_int_minus(end, beg), LONG2FIX(1)); /* a = end - beg + 1   */
a = rb_int_mul(a, rb_int_plus(end, beg));             /* a = a * (end + beg) */
a = rb_int_idiv(a, LONG2FIX(2));                      /* a = a / 2           */
return rb_int_plus(init, a);                          /* return init + a     */

which is equivalent to:

(range.max-range.min+1)*(range.max+range.min)/2

the aforementioned equality!
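
For completeness, here is a minimal Ruby sketch of the same computation, including the exclusive-range case (my own illustration; the handling of excl is an assumption about the surrounding C code, which decrements end for exclusive integer ranges):

def int_range_sum(beg, fin, excl, init)
  fin -= 1 if excl            # an exclusive range drops its last element
  return init if fin < beg    # empty range: nothing to add
  init + (fin - beg + 1) * (fin + beg) / 2
end

int_range_sum(1, 5, false, 0)  #=> 15, like (1..5).sum
int_range_sum(1, 5, true, 0)   #=> 10, like (1...5).sum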

Complexity

Thanks a lot to @k_g and @Hynek-Pichi-Vychodil for this part!

sum

(1...1000000000000000000000000000000).sum
requires three additions, a multiplication, a subtraction, and a division.

It's a constant number of operations, but multiplication is O((log n)²), so Enumerable#sum is O((log n)²) for an integer range.

inject

(1...1000000000000000000000000000000).inject(:+)

requires 999999999999999999999999999998 additions!

Addition is O(log n), so Enumerable#inject is O(n log n).

With 1E30 as input, inject will never return; the Sun will explode long before it finishes!

Test

It's easy to check whether Ruby Integers are being added:

module AdditionInspector
  def +(b)
    puts "Calculating #{self}+#{b}"
    super
  end
end

class Integer
  prepend AdditionInspector
end

puts (1..5).sum
#=> 15

puts (1..5).inject(:+)
# Calculating 1+2
# Calculating 3+3
# Calculating 6+4
# Calculating 10+5
#=> 15

Indeed, from the comments in enum.c:

Enumerable#sum method may not respect method redefinition of "+"
methods such as Integer#+.

Does inject starting from 0 mean the same as sum

In terms of functionality, don't both lines of code do the same thing?

Yes, these two snippets produce identical results.

Does inject starting from 0 mean the same as sum

No, not at all. In fact, 0 is irrelevant here. You can omit it and still get the same result.

scope_value.values.map { |i| 2 ** i.to_i }.inject(:|)

The operations in these two snippets are very different. They only produce the same result because of the special shape of your data, namely: each number has exactly one bit set, and no two numbers have the same bit set. Violate this rule and the results diverge, as the example below shows.
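
A minimal example of that divergence (hypothetical data; the second value re-uses a bit already set by the first):

flags = [0b001, 0b011]
flags.inject(:|)  #=> 3, OR merges the shared bit
flags.sum         #=> 4, addition carries into a new bit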


BTW, before we had .sum, we used to emulate it with .inject(:+). The two do the same thing when used on integer arrays.

Groovy: is sum() more efficient than calculating sum using inject?

Have a look at the following benchmark:

@Grab(group='org.gperfutils', module='gbench', version='0.4.3-groovy-2.4')

def b = benchmark {
    'simple-sum' {
        [1, 3, 5].sum()
    }
    'inject-sum' {
        [1, 3, 5].inject(0, { x, y -> x + y })
    }
}
b.prettyPrint()

And the output:

Environment
===========
* Groovy: 2.4.0
* JVM: Java HotSpot(TM) 64-Bit Server VM (25.5-b02, Oracle Corporation)
* JRE: 1.8.0_05
* Total Memory: 283.5 MB
* Maximum Memory: 3641 MB
* OS: Mac OS X (10.10.1, x86_64)

Options
=======
* Warm Up: Auto (- 60 sec)
* CPU Time Measurement: On

              user  system  cpu  real

simple-sum     218       2  220   226
inject-sum     270       2  272   276

The output indicates that the two are almost the same: sum is a bit faster in almost every attempt, but the difference is not significant. Also have a look at how sum is implemented. In this simple case it may be faster, but in more advanced scenarios the results can be reversed.

Why is .sum() faster than .any() or .max()?

max has to store a value and continuously check for potential updates (and the CPU needs to execute branch instructions to do that), whereas sum just churns through the values.

So sum will be quicker.
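
A sketch of the per-element work involved (my own illustration in Ruby, not the actual implementations):

def my_sum(values)
  values.reduce(0) { |acc, x| acc + x }  # one addition per element
end

def my_max(values)
  values.reduce { |m, x| x > m ? x : m } # one comparison plus a branch per element
end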

How to sum array of numbers in Ruby?

Try this:

array.inject(0){ |sum, x| sum + x }

See Ruby's Enumerable Documentation

(note: the 0 base case is needed so that 0 will be returned on an empty array instead of nil)
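
Since Ruby 2.4 (discussed above), you can also simply write:

array.sum  #=> same result, and 0 for an empty array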

Why is numpy sum 10 times slower than the + operator?

The main difference is the larger overhead when a.sum(axis=1) is calculated. Calculating a reduction (in this case sum) is not a trivial matter:

  • one has to take round-off errors into account, and thus pairwise summation is used to reduce them (the idea is sketched after this list)
  • tiling is important for bigger arrays, as it makes the most of the available cache
  • in order to use the SIMD instructions and out-of-order execution abilities of modern CPUs, one should calculate multiple rows in parallel
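
To make the first point concrete, here is a minimal sketch of pairwise summation, written in Ruby to match the rest of this page (NumPy's actual implementation is in C and considerably more elaborate):

def pairwise_sum(a, lo = 0, hi = a.length)
  n = hi - lo
  return a[lo...hi].reduce(0.0, :+) if n <= 8  # small blocks are summed naively
  mid = lo + n / 2                             # split in half recursively, so the
  pairwise_sum(a, lo, mid) +                   # round-off error grows as O(log n)
    pairwise_sum(a, mid, hi)                   # rather than O(n)
end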

I have discussed the topics above in more detail, for example here and here.

However, none of this is needed, and it is no better than naive summation, if there are only two elements to add: you get the same result, but with much less overhead and faster.

For only 1000 elements, the overhead of calling NumPy functionality is probably higher than actually doing these 1000 additions (or multiplications for that matter, because on modern CPUs pipelined additions and multiplications have the same cost). As you can see, for 10^4 elements the running time is only about 2 times higher: a sure sign that overhead plays a bigger role for 10^3! In this answer the impact of overhead and cache misses is investigated in more detail.

Let's take a look at the profiler results to see whether the theory above holds (I use perf):

For a.sum(axis=1):

  17,39%  python  umath.cpython-36m-x86_64-linux-gnu.so       [.] reduce_loop
  11,41%  python  umath.cpython-36m-x86_64-linux-gnu.so       [.] pairwise_sum_DOUBLE
   9,78%  python  multiarray.cpython-36m-x86_64-linux-gnu.so  [.] npyiter_buffered_reduce_iternext_ite
   9,24%  python  umath.cpython-36m-x86_64-linux-gnu.so       [.] DOUBLE_add
   4,35%  python  python3.6                                   [.] _PyEval_EvalFrameDefault
   2,17%  python  multiarray.cpython-36m-x86_64-linux-gnu.so  [.] _aligned_strided_to_contig_size8_src
   2,17%  python  python3.6                                   [.] lookdict_unicode_nodummy
  ...

The overhead of reduce_loop and pairwise_sum_DOUBLE dominates.

For a[:,0]+a[:,1]:

   7,24%  python  python3.6                              [.] _PyEval_EvalF
   5,26%  python  python3.6                              [.] PyObject_Mall
   3,95%  python  python3.6                              [.] visit_decref
   3,95%  python  umath.cpython-36m-x86_64-linux-gnu.so  [.] DOUBLE_add
   2,63%  python  python3.6                              [.] PyDict_SetDef
   2,63%  python  python3.6                              [.] _PyTuple_Mayb
   2,63%  python  python3.6                              [.] collect
   2,63%  python  python3.6                              [.] fast_function
   2,63%  python  python3.6                              [.] visit_reachab
   1,97%  python  python3.6                              [.] _PyObject_Gen

As expected: Python overhead plays a big role, and a simple DOUBLE_add is used.


There is less overhead when calling a.sum():

  • for one, reduce_loop isn't called for every row but only once, which means considerably less overhead
  • no new result arrays are created, so there is no longer a need to write 1000 doubles to memory

So it can be expected that a.sum() is faster (despite the fact that 2000 and not 1000 additions must be made; but as we have seen, it is mostly about overhead, and the actual work, the additions, isn't responsible for the big share of the running time).


Data obtained by running:

perf record python run.py
perf report

and

# run.py
import numpy as np

a = np.random.rand(1000, 2)

for _ in range(10000):
    a.sum(axis=1)
    # a[:,0] + a[:,1]

How do I create an average from a Ruby array?

Try this:

arr = [5, 6, 7, 8]
arr.inject{ |sum, el| sum + el }.to_f / arr.size
=> 6.5

Note the .to_f, which you'll want for avoiding any problems from integer division. You can also do:

arr = [5, 6, 7, 8]
arr.inject(0.0) { |sum, el| sum + el } / arr.size
=> 6.5

You can define it as part of Array as another commenter has suggested, but you need to avoid integer division or your results will be wrong. Also, this isn't generally applicable to every possible element type (obviously, an average only makes sense for things that can be averaged). But if you want to go that route, use this:

class Array
  def sum
    inject(0.0) { |result, el| result + el }
  end

  def mean
    sum / size
  end
end

If you haven't seen inject before, it's not as magical as it might appear. It iterates over each element, combining it with an accumulator value; the result becomes the accumulator handed to the next element. In this case, our accumulator is simply a number that reflects the sum of all the previous elements, as the trace below shows.
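
For instance, a trace of the accumulator for the array above:

[5, 6, 7, 8].inject { |sum, el| sum + el }
# step 1: sum = 5,  el = 6  ->  11
# step 2: sum = 11, el = 7  ->  18
# step 3: sum = 18, el = 8  ->  26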

Edit: Commenter Dave Ray proposed a nice improvement.

Edit: Commenter Glenn Jackman's proposal, using arr.inject(:+).to_f, is nice too but perhaps a bit too clever if you don't know what's going on. The :+ is a symbol; when passed to inject, it applies the method named by the symbol (in this case, the addition operation) to each element against the accumulator value.

Why is numpy's einsum faster than numpy's built in functions?

Now that numpy 1.8 is released, where according to the docs all ufuncs should use SSE2, I wanted to double check that Seberg's comment about SSE2 was valid.

To perform the test, a new Python 2.7 install was created; numpy 1.7 and 1.8 were compiled with icc using standard options on an AMD Opteron core running Ubuntu.

This is the test run both before and after the 1.8 upgrade:

import numpy as np
import timeit

arr_1D = np.arange(5000, dtype=np.double)
arr_2D = np.arange(500**2, dtype=np.double).reshape(500, 500)
arr_3D = np.arange(500**3, dtype=np.double).reshape(500, 500, 500)

print 'Summation test:'
print timeit.timeit('np.sum(arr_3D)',
                    'import numpy as np; from __main__ import arr_1D, arr_2D, arr_3D',
                    number=5)/5
print timeit.timeit('np.einsum("ijk->", arr_3D)',
                    'import numpy as np; from __main__ import arr_1D, arr_2D, arr_3D',
                    number=5)/5
print '----------------------\n'

print 'Power test:'
print timeit.timeit('arr_3D*arr_3D*arr_3D',
                    'import numpy as np; from __main__ import arr_1D, arr_2D, arr_3D',
                    number=5)/5
print timeit.timeit('np.einsum("ijk,ijk,ijk->ijk", arr_3D, arr_3D, arr_3D)',
                    'import numpy as np; from __main__ import arr_1D, arr_2D, arr_3D',
                    number=5)/5
print '----------------------\n'

print 'Outer test:'
print timeit.timeit('np.outer(arr_1D, arr_1D)',
                    'import numpy as np; from __main__ import arr_1D, arr_2D, arr_3D',
                    number=5)/5
print timeit.timeit('np.einsum("i,k->ik", arr_1D, arr_1D)',
                    'import numpy as np; from __main__ import arr_1D, arr_2D, arr_3D',
                    number=5)/5
print '----------------------\n'

print 'Einsum test:'
print timeit.timeit('np.sum(arr_2D*arr_3D)',
                    'import numpy as np; from __main__ import arr_1D, arr_2D, arr_3D',
                    number=5)/5
print timeit.timeit('np.einsum("ij,oij->", arr_2D, arr_3D)',
                    'import numpy as np; from __main__ import arr_1D, arr_2D, arr_3D',
                    number=5)/5
print '----------------------\n'

Numpy 1.7.1:

Summation test:
0.172988510132
0.0934836149216
----------------------

Power test:
1.93524689674
0.839519000053
----------------------

Outer test:
0.130380821228
0.121401786804
----------------------

Einsum test:
0.979052495956
0.126066613197

Numpy 1.8:

Summation test:
0.116551589966
0.0920487880707
----------------------

Power test:
1.23683619499
0.815982818604
----------------------

Outer test:
0.131808176041
0.127472200394
----------------------

Einsum test:
0.781750011444
0.129271841049

I think this is fairly conclusive evidence that SSE plays a large role in the timing differences; it should be noted that, when repeating these tests, the timings vary by only ~0.003 s. The remaining difference should be covered in the other answers to this question.

Is MATLAB's bsxfun the best? Python's numpy.einsum?

I believe both of your summations can be removed, but I only removed the easier one for the time being. The summation over the second dimension is trivial, since it only affects the A_k array:

B_k = sum(A_k,2);
for k = 2:L
    i = 2:k;
    x(:,1,k+1) = x(:,1,k+1) + sum(bsxfun(@times,B_k(:,1,2:k),x(:,1,k+1-i)),3);
end

With this single change the runtime is reduced from ~8 seconds to ~2.5 seconds on my laptop.

The second summation could also be removed, by transforming times+sum into a matrix-vector product. It needs some singleton fiddling to get the dimensions right, but if you define an auxiliary array that is B_k with the second dimension reversed, you can generate the remaining sum as ~x*C_k with this auxiliary array C_k, give or take a few calls to reshape.


So after a closer look I realized that my original assessment was overly optimistic: you have multiplications in both dimensions in your remaining term, so it's not a simple matrix product. Anyway, we can rewrite that term to be the diagonal of a matrix product. This implies that we're computing a bunch of unnecessary matrix elements, but this still seems to be slightly faster than the bsxfun approach, and we can get rid of your pesky singleton dimension too:

L = 10000;
x = rand(4,L+1);
A_k = rand(4,4,L);
B_k = squeeze(sum(A_k,2)).';

tic
for k = 2:L
    ii = 1:k-1;
    x(:,k+1) = x(:,k+1) + diag(x(:,ii)*B_k(k+1-ii,:));
end
toc

This runs in ~2.2 seconds on my laptop, somewhat faster than the ~2.5 seconds obtained previously.


