Why is sum so much faster than inject(:+)?
Short answer
For an integer range:

Enumerable#sum returns (range.max-range.min+1)*(range.max+range.min)/2.

Enumerable#inject(:+) iterates over every element.
Theory
The sum of integers between 1 and n is called a triangular number, and is equal to n*(n+1)/2.

The sum of integers between n and m is the triangular number of m minus the triangular number of n-1, which is equal to m*(m+1)/2-n*(n-1)/2, and can be written (m-n+1)*(m+n)/2.
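This identity is easy to sanity-check in Ruby; here is a minimal sketch (the helper name `range_sum_formula` and the bounds 3 and 100 are arbitrary choices for illustration):

```ruby
# Closed-form sum of the integers from n to m, per the identity above.
def range_sum_formula(n, m)
  (m - n + 1) * (m + n) / 2
end

brute = (3..100).inject(:+)        # element-by-element addition
fast  = range_sum_formula(3, 100)  # constant number of operations
puts brute == fast                 # => true
```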
Enumerable#sum in Ruby 2.4
This property is used in Enumerable#sum for integer ranges:
if (RTEST(rb_range_values(obj, &beg, &end, &excl))) {
    if (!memo.block_given && !memo.float_value &&
            (FIXNUM_P(beg) || RB_TYPE_P(beg, T_BIGNUM)) &&
            (FIXNUM_P(end) || RB_TYPE_P(end, T_BIGNUM))) {
        return int_range_sum(beg, end, excl, memo.v);
    }
}
int_range_sum looks like this:
VALUE a;
a = rb_int_plus(rb_int_minus(end, beg), LONG2FIX(1));
a = rb_int_mul(a, rb_int_plus(end, beg));
a = rb_int_idiv(a, LONG2FIX(2));
return rb_int_plus(init, a);
which is equivalent to:
(range.max-range.min+1)*(range.max+range.min)/2
the aforementioned equality!
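For readers who don't speak C, here is a line-for-line Ruby transcription of that body (a sketch: the excl handling and the init argument, which plays the role of memo.v, are reconstructed from context):

```ruby
# Ruby sketch of int_range_sum: sum of beg_v..end_v (or beg_v...end_v
# when excl is true), plus an initial value.
def int_range_sum(beg_v, end_v, excl, init)
  end_v -= 1 if excl                            # exclusive ranges drop the upper bound
  a = (end_v - beg_v + 1) * (end_v + beg_v) / 2
  init + a
end

puts int_range_sum(1, 5, false, 0)  # => 15, same as (1..5).sum
puts int_range_sum(1, 5, true, 0)   # => 10, same as (1...5).sum
```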
Complexity
Thanks a lot to @k_g and @Hynek-Pichi-Vychodil for this part!
sum
(1...1000000000000000000000000000000).sum
requires three additions, a multiplication, a subtraction and a division.

It's a constant number of operations, but multiplication is O((log n)²), so Enumerable#sum is O((log n)²) for an integer range.
inject
(1...1000000000000000000000000000000).inject(:+)
requires 999999999999999999999999999998 additions!
Addition is O(log n), so Enumerable#inject is O(n log n).

With 1E30 as input, inject will never return. The sun will explode long before!
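The gap is easy to feel with a rough timing sketch (a hedged comparison: absolute numbers depend on your machine, and n is kept far below 1E30 so that inject actually finishes):

```ruby
require 'benchmark'

n = 10_000_000
t_sum = Benchmark.realtime { (1..n).sum }         # closed-form formula
t_inj = Benchmark.realtime { (1..n).inject(:+) }  # n - 1 additions
puts format('sum:    %.6f s', t_sum)
puts format('inject: %.6f s', t_inj)
```

On a typical machine, sum finishes in microseconds while inject takes a visible fraction of a second.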
Test
It's easy to check if Ruby Integers are being added:
module AdditionInspector
  def +(b)
    puts "Calculating #{self}+#{b}"
    super
  end
end

class Integer
  prepend AdditionInspector
end
puts (1..5).sum
#=> 15
puts (1..5).inject(:+)
# Calculating 1+2
# Calculating 3+3
# Calculating 6+4
# Calculating 10+5
#=> 15
Indeed, from the enum.c comments:

Enumerable#sum method may not respect method redefinition of "+" methods such as Integer#+.
Does inject starting from 0 mean the same as sum
In terms of functionality, don't both lines of code do the same thing?
Yes, these two snippets produce identical results.
Does inject starting from 0 mean the same as sum
No, not at all. In fact, the 0 is irrelevant here. You can omit it and still get the same result:
scope_value.values.map { |i| 2** i.to_i }.inject(:|)
The operations in these two snippets are very different. They only produce the same result because of the special shape of your data: each number has exactly one bit set, and no two numbers have the same bit set. Violate this rule and the results diverge.
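A small example (with made-up values) makes the divergence concrete:

```ruby
flags = [1, 2, 4]        # each number has exactly one bit set, all bits distinct
puts flags.inject(:|)    # => 7
puts flags.sum           # => 7, agrees only because the bits are disjoint

overlap = [1, 3]         # 3 = 0b11 shares a bit with 1 = 0b01
puts overlap.inject(:|)  # => 3
puts overlap.sum         # => 4, the results diverge
```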
BTW, before we had .sum, we used to emulate it with .inject(:+). It does the same thing (when used on integer arrays).
Groovy: is sum() more efficient that calculating sum using inject?
Have a look at the following benchmark:
@Grab(group='org.gperfutils', module='gbench', version='0.4.3-groovy-2.4')

def b = benchmark {
    'simple-sum' {
        [1, 3, 5].sum()
    }
    'inject-sum' {
        [1, 3, 5].inject(0, { x, y -> x + y })
    }
}
b.prettyPrint()
And the output:
Environment
===========
* Groovy: 2.4.0
* JVM: Java HotSpot(TM) 64-Bit Server VM (25.5-b02, Oracle Corporation)
* JRE: 1.8.0_05
* Total Memory: 283.5 MB
* Maximum Memory: 3641 MB
* OS: Mac OS X (10.10.1, x86_64)
Options
=======
* Warm Up: Auto (- 60 sec)
* CPU Time Measurement: On
            user  system  cpu  real

simple-sum   218       2  220   226
inject-sum   270       2  272   276
The output indicates that the two are almost the same: sum is a bit faster in almost every attempt, but the difference is not significant. Also have a look at how sum is implemented. In this simple case it might be faster, but in more advanced scenarios the results can be reversed.
Why is .sum() faster than .any() or .max()?
max has to store a value, continuously checking for potential updates (and the CPU needs to do branch operations to effect these). sum just churns through the values.

So sum will be quicker.
How to sum array of numbers in Ruby?
Try this:
array.inject(0){ |sum, x| sum + x }
See Ruby's Enumerable Documentation
(note: the 0 base case is needed so that 0 will be returned on an empty array instead of nil)
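A quick check of that edge case:

```ruby
p [].inject { |sum, x| sum + x }           # => nil, no elements and no base case
p [].inject(0) { |sum, x| sum + x }        # => 0
p [1, 2, 3].inject(0) { |sum, x| sum + x } # => 6
```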
Why is numpy sum 10 times slower than the + operator?
The main difference is the larger overhead when a.sum(axis=1) is calculated. Calculating a reduction (in this case a sum) is not a trivial matter:
- one has to take round-off errors into account, and thus uses pairwise summation to reduce them
- tiling is important for bigger arrays, as it makes the most of the available cache
- in order to use the SIMD instructions and out-of-order execution abilities of modern CPUs, one should calculate multiple rows in parallel
I have discussed these topics in more detail, for example here and here.

However, none of this is needed, and it is no better than a naive summation, when there are only two elements to add - you get the same result with much less overhead, and faster.

For only 1000 elements, the overhead of calling numpy functionality is probably higher than actually doing those 1000 additions (or multiplications, for that matter, because on modern CPUs pipelined additions/multiplications have the same cost). As you can see, for 10^4 elements the running time is only about 2 times higher - a sure sign that overhead plays a bigger role for 10^3! In this answer the impact of overhead and cache misses is investigated in more detail.
Let's take a look at the profiler results to see whether the theory above holds (I use perf):
For a.sum(axis=1)
:
17,39% python umath.cpython-36m-x86_64-linux-gnu.so [.] reduce_loop
11,41% python umath.cpython-36m-x86_64-linux-gnu.so [.] pairwise_sum_DOUBLE
9,78% python multiarray.cpython-36m-x86_64-linux-gnu.so [.] npyiter_buffered_reduce_iternext_ite
9,24% python umath.cpython-36m-x86_64-linux-gnu.so [.] DOUBLE_add
4,35% python python3.6 [.] _PyEval_EvalFrameDefault
2,17% python multiarray.cpython-36m-x86_64-linux-gnu.so [.] _aligned_strided_to_contig_size8_src
2,17% python python3.6 [.] lookdict_unicode_nodummy
...
The overhead of reduce_loop and pairwise_sum_DOUBLE dominates.
For a[:,0]+a[:,1]:
7,24% python python3.6 [.] _PyEval_EvalF
5,26% python python3.6 [.] PyObject_Mall
3,95% python python3.6 [.] visit_decref
3,95% python umath.cpython-36m-x86_64-linux-gnu.so [.] DOUBLE_add
2,63% python python3.6 [.] PyDict_SetDef
2,63% python python3.6 [.] _PyTuple_Mayb
2,63% python python3.6 [.] collect
2,63% python python3.6 [.] fast_function
2,63% python python3.6 [.] visit_reachab
1,97% python python3.6 [.] _PyObject_Gen
As expected: the Python overhead plays a big role, and a simple DOUBLE_add is used.
There is less overhead when calling a.sum():

- for one, reduce_loop isn't called for every row but only once, which means considerably less overhead
- no new resulting arrays are created; there is no longer a need to write 1000 doubles to memory

So it can be expected that a.sum() is faster (despite the fact that 2000 and not 1000 additions must be made - but as we have seen, it is mostly about overhead and not the actual work; the additions aren't responsible for the big share of the running time).
Data obtained by running:
perf record python run.py
perf report
and
# run.py
import numpy as np

a = np.random.rand(1000, 2)
for _ in range(10000):
    a.sum(axis=1)
    # a[:,0]+a[:,1]
How do I create an average from a Ruby array?
Try this:
arr = [5, 6, 7, 8]
arr.inject{ |sum, el| sum + el }.to_f / arr.size
=> 6.5
Note the .to_f
, which you'll want for avoiding any problems from integer division. You can also do:
arr = [5, 6, 7, 8]
arr.inject(0.0) { |sum, el| sum + el } / arr.size
=> 6.5
You can define it as part of Array
as another commenter has suggested, but you need to avoid integer division or your results will be wrong. Also, this isn't generally applicable to every possible element type (obviously, an average only makes sense for things that can be averaged). But if you want to go that route, use this:
class Array
  def sum
    inject(0.0) { |result, el| result + el }
  end

  def mean
    sum / size
  end
end
If you haven't seen inject before, it's not as magical as it might appear. It iterates over each element, applying the block to an accumulator value and that element; the result becomes the accumulator for the next element. In this case, the accumulator is simply a number that reflects the sum of all the previous elements.
Edit: Commenter Dave Ray proposed a nice improvement.
Edit: Commenter Glenn Jackman's proposal, using arr.inject(:+).to_f
, is nice too but perhaps a bit too clever if you don't know what's going on. The :+
is a symbol; when passed to inject, it applies the method named by the symbol (in this case, the addition operation) to each element against the accumulator value.
Why is numpy's einsum faster than numpy's built in functions?
Now that numpy 1.8 is released, where according to the docs all ufuncs should use SSE2, I wanted to double check that Seberg's comment about SSE2 was valid.
To perform the test, a new Python 2.7 install was created; numpy 1.7 and 1.8 were compiled with icc using standard options on an AMD Opteron core running Ubuntu.
This is the test run both before and after the 1.8 upgrade:
import numpy as np
import timeit
arr_1D=np.arange(5000,dtype=np.double)
arr_2D=np.arange(500**2,dtype=np.double).reshape(500,500)
arr_3D=np.arange(500**3,dtype=np.double).reshape(500,500,500)
print 'Summation test:'
print timeit.timeit('np.sum(arr_3D)',
'import numpy as np; from __main__ import arr_1D, arr_2D, arr_3D',
number=5)/5
print timeit.timeit('np.einsum("ijk->", arr_3D)',
'import numpy as np; from __main__ import arr_1D, arr_2D, arr_3D',
number=5)/5
print '----------------------\n'
print 'Power test:'
print timeit.timeit('arr_3D*arr_3D*arr_3D',
'import numpy as np; from __main__ import arr_1D, arr_2D, arr_3D',
number=5)/5
print timeit.timeit('np.einsum("ijk,ijk,ijk->ijk", arr_3D, arr_3D, arr_3D)',
'import numpy as np; from __main__ import arr_1D, arr_2D, arr_3D',
number=5)/5
print '----------------------\n'
print 'Outer test:'
print timeit.timeit('np.outer(arr_1D, arr_1D)',
'import numpy as np; from __main__ import arr_1D, arr_2D, arr_3D',
number=5)/5
print timeit.timeit('np.einsum("i,k->ik", arr_1D, arr_1D)',
'import numpy as np; from __main__ import arr_1D, arr_2D, arr_3D',
number=5)/5
print '----------------------\n'
print 'Einsum test:'
print timeit.timeit('np.sum(arr_2D*arr_3D)',
'import numpy as np; from __main__ import arr_1D, arr_2D, arr_3D',
number=5)/5
print timeit.timeit('np.einsum("ij,oij->", arr_2D, arr_3D)',
'import numpy as np; from __main__ import arr_1D, arr_2D, arr_3D',
number=5)/5
print '----------------------\n'
Numpy 1.7.1:
Summation test:
0.172988510132
0.0934836149216
----------------------
Power test:
1.93524689674
0.839519000053
----------------------
Outer test:
0.130380821228
0.121401786804
----------------------
Einsum test:
0.979052495956
0.126066613197
Numpy 1.8:
Summation test:
0.116551589966
0.0920487880707
----------------------
Power test:
1.23683619499
0.815982818604
----------------------
Outer test:
0.131808176041
0.127472200394
----------------------
Einsum test:
0.781750011444
0.129271841049
I think this is fairly conclusive that SSE plays a large role in the timing differences. It should be noted that, on repeating these tests, the timings vary by only ~0.003s. The remaining difference should be covered in the other answers to this question.
Is MATLAB's bsxfun the best? Python's numpy.einsum?
I believe both of your summations can be removed, but I only removed the easier one for the time being. The summation over the second dimension is trivial, since it only affects the A_k
array:
B_k = sum(A_k,2);
for k = 2:L
    i = 2:k;
    x(:,1,k+1) = x(:,1,k+1) + sum(bsxfun(@times,B_k(:,1,2:k),x(:,1,k+1-i)),3);
end
With this single change the runtime is reduced from ~8 seconds to ~2.5 seconds on my laptop.
The second summation could also be removed, by transforming times+sum into a matrix-vector product. It needs some singleton fiddling to get the dimensions right, but if you define an auxiliary array that is B_k
with the second dimension reversed, you can generate the remaining sum as ~x*C_k
with this auxiliary array C_k
, give or take a few calls to reshape
.
So after a closer look I realized that my original assessment was overly optimistic: you have multiplications in both dimensions in your remaining term, so it's not a simple matrix product. Anyway, we can rewrite that term to be the diagonal of a matrix product. This implies that we're computing a bunch of unnecessary matrix elements, but this still seems to be slightly faster than the bsxfun
approach, and we can get rid of your pesky singleton dimension too:
L = 10000;
x = rand(4,L+1);
A_k = rand(4,4,L);
B_k = squeeze(sum(A_k,2)).';

tic
for k = 2:L
    ii = 1:k-1;
    x(:,k+1) = x(:,k+1) + diag(x(:,ii)*B_k(k+1-ii,:));
end
toc
This runs in ~2.2 seconds on my laptop, somewhat faster than the ~2.5 seconds obtained previously.