Fastest way to do this double summation?

Nothing fancy:

# sample data
m <- matrix(1:20, 4)
sigma <- 1:ncol(m)
omega <- 1:nrow(m)
mu <- 2

sum(((m - mu) / outer(omega, sigma))^2)
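
For comparison, the same computation in Python with NumPy broadcasting would look roughly like this (a sketch, assuming NumPy is available; the sample data mirrors the R snippet above):

import numpy as np

# same sample data as above; order="F" fills column-wise like R's matrix(1:20, 4)
m = np.arange(1, 21).reshape((4, 5), order="F")
sigma = np.arange(1, m.shape[1] + 1)   # 1..ncol(m)
omega = np.arange(1, m.shape[0] + 1)   # 1..nrow(m)
mu = 2

total = (((m - mu) / np.outer(omega, sigma)) ** 2).sum()
print(total)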

I need help doing this double summation in python

You were pretty close:

N = int(input("N: "))
M = int(input("M: "))

x = sum(sum(j ** 2 * (k + 1) for k in range(M)) for j in range(1, N + 1))

It also can be done with nested for loops:

x = 0
for j in range(1, N + 1):      # j in [1, N]
    for k in range(M):         # k in [0, M - 1], so k + 1 in [1, M]
        x += j ** 2 * (k + 1)
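
Since j and k don't interact, the double sum also factors into the product of two single sums: the sum of j^2 over j = 1..N times the sum of k over k = 1..M, each of which has a standard closed form. A quick sanity check (with arbitrary test values, independent of the input above):

# the double sum factors: sum_{j,k} j^2 * k = (sum_j j^2) * (sum_k k)
n, m = 7, 4   # arbitrary test values
brute = sum(j ** 2 * (k + 1) for j in range(1, n + 1) for k in range(m))
closed = (n * (n + 1) * (2 * n + 1) // 6) * (m * (m + 1) // 2)
assert brute == closed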

Solving double summation in R studio without loops

Using for loops:

sum = 0
for (i in 1:10) {
  for (j in 1:5) {
    sum = sum + i^5 / (10 + j^i)
  }
}

Result:

> sum
[1] 20845.76

Without loops:

i = rep(1:10, each=5)
j = rep(1:5, 10)

This way i and j look like:

> i
1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4 5 5 5 5 5 6 6 6 6 6 7 7 7 7 7 8 8 8 8 8 9 9 9 9 9 10 10 10 10 10

> j
1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5

So for every 5 values of j, i goes up by one while j cycles back through 1 to 5.

sum(i^5/(10+j^i))

(Same result)

Is `outer` fast enough for my double summation?

I would like to point out first that you can write your code as

sum(f(outer(x, x, "-") / c))

This reduces function call overhead, because subtraction in R is already a function (try "-"(5, 2) at the console).


outer is fast enough for your application. The only case where it is suboptimal is when your function f is symmetric around 0, i.e., f(-u) = f(u). In that case, the optimal computation only sums over the lower triangle of the combination matrix outer(x, x, "-"), multiplies that sum by 2 to cover the off-diagonal terms, and finally adds the diagonal terms.

The following function does this. We generate (i, j) indices for the lower triangular part (excluding the diagonal) of the combination matrix; the lower triangular part of outer(x, x, "-") / c is then dx <- (x[i] - x[j]) / c. Now,

  • if f is symmetric, the result is 2 * sum(f(dx)) + n * f(0), and this is faster than outer;
  • if f is asymmetric, we have to do sum(f(dx)) + sum(f(-dx)) + n * f(0), and this won't have any advantage over outer.

## `x` is the vector, `f` is your function of interest, `c` is a constant
## `symmetric` is a switch; only set `TRUE` when `f` is symmetric around 0
g <- function (x, f, c, symmetric = FALSE) {
  n <- length(x)
  j <- rep.int(1:(n - 1), (n - 1):1)
  i <- sequence((n - 1):1) + j
  dx <- (x[i] - x[j]) / c
  if (symmetric) 2 * sum(f(dx)) + n * f(0)
  else sum(f(dx)) + sum(f(-dx)) + n * f(0)
}

Consider a small example here. Let's assume c = 2 and a vector x <- 1:500. We also consider a symmetric function f1 <- cos and an asymmetric function f2 <- sin. Let's do a benchmark:

x <- 1:500
f1 <- cos  # symmetric around 0
f2 <- sin  # asymmetric (odd)
library(microbenchmark)

We first consider the symmetric case with f1. Remember to set symmetric = TRUE for g.

microbenchmark(sum(f1(outer(x,x,"-")/2)), g(x, f1, 2, TRUE))

#Unit: milliseconds
#                        expr      min       lq     mean   median       uq      max neval cld
# sum(f1(outer(x, x, "-")/2)) 32.79472 35.35316 46.91560 36.78152 37.63580 133.5494   100   b
#           g(x, f1, 2, TRUE) 20.24940 23.34324 29.97313 24.45638 25.33352 120.3278   100  a

Here we see that g is faster.

Now consider the asymmetric case with f2.

microbenchmark(sum(f2(outer(x,x,"-")/2)), g(x, f2, 2))

#Unit: milliseconds
#                        expr      min       lq     mean   median       uq      max neval cld
# sum(f2(outer(x, x, "-")/2)) 32.84412 35.55520 44.33684 36.95336 37.89508 134.2991   100   a
#                 g(x, f2, 2) 36.71572 39.11832 50.54516 40.25590 41.75060 142.5143   100   a

As expected, there is no advantage here.


Finally, we also want to check that g is doing the correct computation. It is sufficient to consider a small example with x <- 1:5.

x <- 1:5

#### symmetric case ####

sum(f1(outer(x, x, "-") / 2))
# [1] 14.71313

g(x, f1, 2, TRUE)
# [1] 14.71313

#### asymmetric case ####

sum(f2(outer(x, x, "-") / 2))
# [1] 0

g(x, f2, 2)
# [1] 0

So g is correct.

most optimal way to perform double summation

For calculating a single element Gxx(r, c), you can't optimize anything. That's logical: after all, you don't know anything about the structure of x, so you will have to read all x_(j,i) elements in the range.

However, things change if you need to calculate the entire matrix. In that case, you can reuse work from calculating the previous element.

First some basic mathematics. Since x-bar(r,c) doesn't depend on j or i, it's a constant in the process. Now we know that:

(a-b)^2 = a^2+b^2-2*a*b

Here b stands for the constant x-bar(r, c). If you apply this to the summation, you can state that:

sum_(j,i) (x_(j,i) - b)^2  =  sum_(j,i) (x_(j,i)^2 - 2*b*x_(j,i) + b^2)  =  S - 2*s*b + D*b^2

With S the sum of squares of the x elements, s the sum of the x elements, and D the number of elements in the window.
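
As a quick sanity check of that identity (a minimal, hypothetical sketch with made-up values), in Python:

import random

# made-up stand-ins for one window's elements and its constant x-bar
x = [random.random() for _ in range(12)]
b = 0.7

S = sum(v * v for v in x)   # sum of squares
s = sum(x)                  # sum of elements
D = len(x)                  # number of elements

direct   = sum((v - b) ** 2 for v in x)
expanded = S - 2 * s * b + D * b * b
assert abs(direct - expanded) < 1e-9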

Now if you look at the matrix, in the first iteration you use a certain window of the matrix:

x  x  x  x  x  x  x  x  x
  /----------\
x | x  x  x  | x  x  x  x  x
  |          |
x | x  x  x  | x  x  x  x  x
  |          |
x | x  x  x  | x  x  x  x  x
  \----------/
x  x  x  x  x  x  x  x  x

In the next iteration, you move the window only one element further, so:

x  x  x  x  x  x  x  x  x
      /----------\
x  x  | x^ x^ x  | x  x  x  x
      |          |
x  x  | x^ x^ x  | x  x  x  x
      |          |
x  x  | x^ x^ x  | x  x  x  x
      \----------/
x  x  x  x  x  x  x  x  x

The x's marked with ^ are, in other words, reused. So one can use some kind of sliding window.

For the first window, you calculate the sum and the sum of squares of the elements and store them. Then, each time you move the "cursor", you subtract the column that drops out of the window and add the column that comes into it. A basic algorithm is thus (Python-like pseudocode; product is itertools.product, and x, xbar and Gxx stand for the data, the local mean and the output):

for r in range(rmin, rmax):
    s = 0        # running sum of the window elements
    sumsq = 0    # running sum of squares
    jmin = r - (rows - 1) // 2   # window rows (rows and cols assumed odd)
    jmax = r + (rows - 1) // 2

    # calculate the sum and the sum of squares of the first window in this row
    imin = cmin - (cols - 1) // 2
    imax = cmin + (cols - 1) // 2
    for j, i in product(range(jmin, jmax + 1), range(imin, imax + 1)):
        xji = x(j, i)            # cache x_(j,i)
        s += xji
        sumsq += xji * xji
    d = (jmax - jmin + 1) * (imax - imin + 1)   # number of elements in the window
    # now we can calculate the first element of the row
    xb = xbar(r, cmin)
    Gxx[r, cmin] = sumsq - 2 * s * xb + d * xb * xb

    # now iterate over all remaining elements of the row
    for c in range(cmin + 1, cmax):
        isub = c - 1 - (cols - 1) // 2   # column that drops out (relative to the previous window)
        iadd = c + (cols - 1) // 2       # column that comes in
        for j in range(jmin, jmax + 1):
            xji = x(j, isub)
            s -= xji
            sumsq -= xji * xji
            xji = x(j, iadd)
            s += xji
            sumsq += xji * xji
        # the sum and the sum of squares are now up to date
        xb = xbar(r, c)
        Gxx[r, c] = sumsq - 2 * s * xb + d * xb * xb

I think there will be some work to adapt the algorithm, but it should be doable. Furthermore, please check first on a small instance whether it works correctly. Small rounding errors are possible.

This will not make much difference if cols and rows are small, but in case these are large, it can result in a huge boost.
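
To check the bookkeeping on a small instance, as suggested above, here is a minimal one-dimensional sketch of the same idea (a hypothetical toy with made-up data, not the full 2-D algorithm): the window's sum and sum of squares are updated incrementally and compared against the brute-force value at every position.

import random

# made-up 1-D signal, window width w, and a stand-in for the constant x-bar
x = [random.random() for _ in range(50)]
w = 7
b = 0.5

# initialise sum and sum of squares for the first window
s = sum(x[:w])
sq = sum(v * v for v in x[:w])

for c in range(len(x) - w + 1):
    if c > 0:                        # slide: drop the element that left, add the one that entered
        out, new = x[c - 1], x[c + w - 1]
        s += new - out
        sq += new * new - out * out
    fast = sq - 2 * s * b + w * b * b
    brute = sum((v - b) ** 2 for v in x[c:c + w])
    assert abs(fast - brute) < 1e-9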

Best way to speed up a double summation

This should be a bit faster; no way of telling how much without some sample data.

import time

def get_ints(s):
    return [int(i) for i in s.split()]

t = time.time()

Alen, Blen = get_ints(input())
A = get_ints(input())
B = get_ints(input())

total = sum(abs(Ai - Bj) * (i - j) for i,Ai in enumerate(A) for j,Bj in enumerate(B))

print(total)#, time.time()-t)
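
If NumPy is acceptable in your setup (an assumption on my part), the same double sum can also be vectorised with broadcasting, at the cost of materialising a len(A) x len(B) intermediate array. A sketch with hypothetical data:

import numpy as np

# hypothetical stand-ins for the A and B read above
A = [3, 1, 4, 1, 5]
B = [9, 2, 6]

A_arr = np.asarray(A)
B_arr = np.asarray(B)
i = np.arange(len(A_arr))
j = np.arange(len(B_arr))

# |A_i - B_j| * (i - j) for all pairs at once, via broadcasting
total = int((np.abs(A_arr[:, None] - B_arr[None, :]) * (i[:, None] - j[None, :])).sum())
print(total)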

Optimizing a double sum in Julia?

This is not an optimized implementation, but it showcases the use of an FFT to calculate this sum. The complexity is reduced to O(n*log(n)), which for large enough n should be better than the original O(n^2). Additionally, this showcase requires n to be a power of 2:

# unoptimized FFT shamelessly copy-pasted from
# https://github.com/dillondaudert/JuliaFFT/blob/master/julia_fft.jl
# other Julia packages support optimized FFT calculations
function FFT(n::Integer, x::Array{<:Number})
    if n == 1
        return [x[1]]
    end
    evens = [x[2i] for i = 1:n÷2]
    odds = [x[2i-1] for i = 1:n÷2]
    # Since Julia is 1-indexed, we flip the odds and evens at the recursive step
    u = FFT(n÷2, odds)
    v = FFT(n÷2, evens)
    y = zeros(Complex, n)
    for j = 1:n
        τ = exp(2π*im*(j-1)/n)
        y[j] = u[(j-1)%(n÷2)+1] + τ * v[(j-1)%(n÷2)+1]
    end
    y
end

# O(n*log(n)) implementation of OP calculation
function calc_fft(u)
    n = length(u)
    N = 2*n
    x = vcat(u, zeros(n))
    y = FFT(N, x).^2
    z = FFT(N, y)/N
    return sum([x[i]*real(z[1+((i+n)%N)]) for i = 1:n])
end

# original OP implementation
function calc(u)
    Iterations = length(u)
    Gamma_sum1 = 0
    Gamma_sum2 = 0
    Gamma_sum = 0
    for k = 1:Iterations
        Gamma_sum1 = u[Iterations + 1 - k]
        Gamma_sum2 = 0
        for j = 1:k
            Gamma_sum2 = Gamma_sum2 + u[j] * u[k + 1 - j]
        end
        Gamma_sum = Gamma_sum + Gamma_sum1 * Gamma_sum2
    end
    return Gamma_sum
end

# some benchmarking prep

using BenchmarkTools
import Random
Random.seed!(12)
u = rand(1024);

Benchmark results:

julia> @btime calc(u)
1.267 ms (1 allocation: 16 bytes)
70095.5403921176

julia> @btime calc_fft(u)
7.755 ms (164893 allocations: 6.34 MiB)
70095.54039211746

So, the FFT version is slower here. But with a good FFT implementation, it should be faster even for modest n.
Of course, for a truly fast calculation, optimizations like those in @AboAmmar's answer will also be necessary.

UPDATE:
Testing with the FFTW.jl package, as suggested by OscarSmith, gives:

julia> v = rand(1024);

julia> @btime calc_fft(v)
70.352 μs (64 allocations: 196.06 KiB)
64986.609465082154

julia> @btime calc(v)
1.267 ms (1 allocation: 16 bytes)
64986.60946508216

i.e., at n = 1024, the FFT version is already more than 10x faster.

And the calculation function is:

using FFTW

function calc_fft(u)
    n = length(u)
    N = 2*n
    x = vcat(u, zeros(n))
    y = fft(x).^2
    z = fft(y)/N
    return sum([x[i]*real(z[1+((i+n)%N)]) for i = 1:n])
end

