How to Speed Up or Vectorize a for Loop

How to speed up or vectorize a for loop?

You can use Rcpp when vectorization is difficult, for example when each iteration depends on a running total carried over from the previous one, as it does here.

library(Rcpp)
cppFunction('
IntegerVector bin(NumericVector Volume, int n) {
  IntegerVector binIdexVector(Volume.size());
  int binIdex = 1;
  double totalVolume = 0;

  for (int i = 0; i < Volume.size(); i++) {
    totalVolume = totalVolume + Volume[i];
    if (totalVolume <= n) {
      binIdexVector[i] = binIdex;
    } else {
      // start a new bin; the overflowing element opens it
      binIdex++;
      binIdexVector[i] = binIdex;
      totalVolume = Volume[i];
    }
  }
  return binIdexVector;
}')

all.equal(bin(Volume, 100), binIdexVector)
#[1] TRUE

It's also faster than the nearly vectorized findInterval(cumsum(Volume), seq(0, sum(Volume), by=100)), which in any case gives a slightly different (inexact) binning, since the running total is never reset at a bin boundary.

Why is vectorization generally faster than loops?

Vectorization (as the term is normally used) refers to SIMD (single instruction, multiple data) operation.

That means, in essence, that one instruction carries out the same operation on a number of operands in parallel. For example, to multiply a vector of size N by a scalar, let M be the number of operands of that size the CPU can operate on simultaneously. The number of instructions it then needs to execute is approximately N/M, whereas with purely scalar operations it would have to carry out all N operations one at a time.
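
As a concrete illustration, a scalar multiply over an array is exactly the kind of loop this applies to. A minimal C++ sketch (assuming a recent compiler with optimizations enabled, e.g. -O3, which will typically auto-vectorize it):

#include <cstddef>

// Multiply n floats by a scalar. When the compiler vectorizes this loop,
// each SIMD instruction handles M elements, so roughly n/M multiply
// instructions are executed instead of n.
void scale(float* data, std::size_t n, float s) {
    for (std::size_t i = 0; i < n; ++i)
        data[i] *= s;
}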

For example, Intel's current AVX2 instruction set uses 256-bit registers. Each register can hold (and operate on) a set of 4 operands of 64 bits apiece, or 8 operands of 32 bits apiece.

So, assuming you're dealing with 32-bit, single-precision real numbers, a single instruction can do 8 operations (multiplications, in your case) at once, finishing N multiplications with only about N/8 multiplication instructions. In theory, that lets the operation complete about 8 times as fast as executing one instruction at a time would allow.
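
To make that concrete, here is a sketch of the same multiply written by hand with x86 AVX intrinsics, handling 8 single-precision floats per instruction (assumes an AVX-capable CPU; compile with e.g. -mavx):

#include <immintrin.h>
#include <cstddef>

void scale_avx(float* data, std::size_t n, float s) {
    __m256 vs = _mm256_set1_ps(s);            // broadcast s to all 8 lanes
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 v = _mm256_loadu_ps(data + i); // load 8 floats
        _mm256_storeu_ps(data + i, _mm256_mul_ps(v, vs));
    }
    for (; i < n; ++i)                        // scalar remainder
        data[i] *= s;
}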

Of course, the exact benefit depends on how many operands each instruction supports. Intel's first attempts supported only 64-bit registers, so to operate on 8 items at once, those items could only be 8 bits apiece. They currently support 256-bit registers, and have announced support for 512-bit (they may even have shipped that in a few high-end processors, but not in normal consumer processors, at least yet). Making good use of this capability can also be non-trivial, to put it mildly. Scheduling instructions so you actually have N operands available and in the right places at the right times isn't necessarily an easy task (at all).

To put things in perspective, the (now ancient) Cray 1 gained a lot of its speed exactly this way. Its vector unit operated on sets of 64 registers of 64 bits apiece, so it could do 64 double-precision operations per clock cycle. On optimally vectorized code, it was much closer to the speed of a current CPU than you might expect based solely on its (much lower) clock speed. Taking full advantage of that wasn't always easy though (and still isn't).

Keep in mind, however, that vectorization is not the only way a CPU can carry out operations in parallel. There's also instruction-level parallelism, which lets a single CPU (or a single core of a CPU) execute more than one instruction at a time. Most modern CPUs include hardware to (theoretically) execute up to around 4 instructions per clock cycle [1] if the instructions are a suitable mix of loads, stores, and ALU operations. They fairly routinely average close to 2 instructions per clock, or more in well-tuned loops when memory isn't a bottleneck.
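
One way to see instruction-level parallelism at work (a sketch, not from the original answer): summing with several independent accumulators breaks the serial dependency chain, so the CPU can keep multiple additions in flight per cycle:

#include <cstddef>

// Four independent accumulators: the four adds per iteration do not depend
// on one another, so an out-of-order core can execute them in parallel.
double sum4(const double* a, std::size_t n) {
    double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i]; s1 += a[i + 1]; s2 += a[i + 2]; s3 += a[i + 3];
    }
    double s = s0 + s1 + s2 + s3;
    for (; i < n; ++i)
        s += a[i];                  // remainder
    return s;
}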

Then, of course, there's multi-threading: running multiple streams of instructions on (at least logically) separate processors/cores.

A modern CPU might have, say, 4 cores, each of which can execute 2 vector multiplies per clock, with each of those instructions operating on 8 operands. So, at least in theory, the chip can carry out 4 * 2 * 8 = 64 operations per clock.

Some instructions have better or worse throughput than others. For example, FP add throughput is lower than FMA or multiply throughput on Intel CPUs before Skylake (1 vector per clock instead of 2), while boolean logic such as AND or XOR runs at 3 vectors per clock; it doesn't take many transistors to build an AND/XOR/OR execution unit, so CPUs replicate them. With such high-throughput instructions, the common bottleneck is the total pipeline width (the front-end that decodes and issues instructions into the out-of-order part of the core) rather than any specific execution unit.


  [1] Over time, though, CPUs tend to gain more execution resources, so this number rises.

Do we need vectorization in C++ or are for loops already fast enough?

Do we need vectorization in C++

Vectorisation is not always necessary, but it can make some programs faster.

C++ compilers can auto-vectorise, but if your program needs vectorisation you might not be able to rely on that optimisation, because not every loop can be vectorised automatically.

are [loops] already fast enough?

That depends on the loop, the target CPU, the compiler and its options, and, crucially, on how fast it needs to be.


Some things that you could do to potentially achieve vectorisation in standard C++:

  • Enable compiler optimisations that perform auto vectorisation. (See the manual of your compiler)
  • Specify a target CPU that has vector operations in their instruction set. (See the manual of your compiler)
  • Use standard algorithms with the std::execution::par_unseq or std::execution::unseq execution policies (their types are std::execution::parallel_unsequenced_policy and std::execution::unsequenced_policy; C++17 and C++20 respectively). See the sketch after this list.
  • Ensure that the data being operated on is sufficiently aligned for SIMD instructions. You can use alignas. See the manual of the target CPU for what alignment you need.
  • Ensure that the optimiser knows as much as possible by using link time optimisation.
  • Partially unroll your loops. The limitation is that you hard-code the amount of parallelisation (and this sketch assumes count is a multiple of 4):
for (int i = 0; i < count; i += 4) {
    operation(i + 0);
    operation(i + 1);
    operation(i + 2);
    operation(i + 3);
}
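
A minimal sketch of the execution-policy approach mentioned above. std::execution::unseq (C++20) permits vectorisation within the calling thread; std::execution::par_unseq (C++17) additionally permits multithreading. Support varies by standard library, so treat this as illustrative:

#include <algorithm>
#include <execution>
#include <vector>

int main() {
    std::vector<float> v(1'000'000, 1.0f);
    // The unsequenced policy tells the library the loop body has no
    // ordering dependencies, so it is free to vectorise.
    std::transform(std::execution::unseq, v.begin(), v.end(), v.begin(),
                   [](float x) { return x * 2.0f; });
}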

Outside of standard, portable C++, there are implementation-specific ways:

  • Some compilers provide language extensions for writing explicitly vectorised programs. These are portable across different CPUs, but not to compilers that don't implement the extension. For example, GCC's vector extensions:
using v4si = int __attribute__ ((vector_size (16)));
v4si a, b, c;
a = b + 1; /* a = b + {1,1,1,1}; */
a = 2 * b; /* a = {2,2,2,2} * b; */
  • Some compilers provide "builtin" functions (intrinsics) that invoke specific CPU instructions, which can be used to emit SIMD vector instructions directly. Using these is not portable across incompatible CPUs.
  • Some compilers support the OpenMP API, which has #pragma omp simd (sketch below).
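
A minimal OpenMP sketch (assumes a compiler with OpenMP support, enabled with e.g. -fopenmp, or -fopenmp-simd on GCC/Clang for just the SIMD directives):

#include <cstddef>

// #pragma omp simd asks the compiler to vectorise the loop even when its
// own cost model would otherwise decline to.
void add_arrays(float* out, const float* a, const float* b, std::size_t n) {
    #pragma omp simd
    for (std::size_t i = 0; i < n; ++i)
        out[i] = a[i] + b[i];
}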

Speed up python code - can I vectorize double for loop?

import numpy as np

def dist(p1, p2):
    # Initially, p1.shape == (n, 2) and p2.shape == (m, 2).
    # After adding the new axes, p1.shape == (1, n, 2) and p2.shape == (m, 1, 2),
    # so the arithmetic below broadcasts over all m*n pairs at once.
    p1 = p1[np.newaxis, :, :]
    p2 = p2[:, np.newaxis, :]

    # get the vectorized versions of the math functions
    from numpy import sin, cos, radians, sqrt, arctan2 as atan2

    # do the same math as before, but use `p[..., 0]` instead of `p.x` etc.
    dlat = radians(p2[..., 0] - p1[..., 0])
    dlon = radians(p2[..., 1] - p1[..., 1])
    a = sin(dlat/2)**2 + cos(radians(p1[..., 0])) * cos(radians(p2[..., 0])) * sin(dlon/2)**2
    c = 2 * atan2(sqrt(a), sqrt(1 - a))
    d = 6371 * c  # mean Earth radius in km
    return d

def distanceQuery(neighbor_pts):
    return np.max(dist(neighbor_pts, neighbor_pts))

e.g.:

>>> points = np.array([[0, 0], [45, 0], [45, 45], [90, 0]], dtype=float)
>>> dist(points, points)
array([[    0.        ,  5003.77169901,  6272.52596983, 10007.54339801],
       [ 5003.77169901,     0.        ,  2579.12525679,  5003.77169901],
       [ 6272.52596983,  2579.12525679,     0.        ,  4347.69702221],
       [10007.54339801,  5003.77169901,  4347.69702221,     0.        ]])
>>> np.max(_)
10007.543398010286

Timing:

import itertools
from math import radians, sin, cos, sqrt, atan2

def dist_slow(p1, p2):
    """your function, adjusted to take an array instead of a `Point`"""
    dlat = radians(p2[0] - p1[0])
    dlon = radians(p2[1] - p1[1])
    a = sin(dlat/2)**2 + cos(radians(p1[0])) * cos(radians(p2[0])) * sin(dlon/2)**2
    c = 2 * atan2(sqrt(a), sqrt(1 - a))
    d = 6371 * c
    return d

def query_iter(p):
    # compute the distance for all possible pairs, one pair at a time
    return max(dist_slow(p1, p2) for p1, p2 in itertools.combinations(p, 2))

def query_orig(p):
    dista = []
    for i in range(len(p)):
        for j in range(i + 1, len(p)):
            z = dist_slow(p[i], p[j])
            dista.append(z)
    return max(dista)

def query_mine(p):
    return dist(p, p).max()

Then:

>>> points = np.random.rand(1000, 2)
>>> timeit query_orig(points)
1 loops, best of 3: 7.94 s per loop
>>> timeit query_iter(points)
1 loops, best of 3: 7.35 s per loop
>>> timeit query_mine(points)
10 loops, best of 3: 150 ms per loop

Use numpy vectorize or map to speed up a loop (Python/NumPy 3D matrix, Monte Carlo question)

One simple solution to speed up this code is to parallelize it with Numba. You only need the decorator @nb.njit('float64[:,:,::1](int64, int64, int64)', parallel=True) on the function sample_path_batches (where nb is the Numba module). Note that dtype=float must be replaced with dtype=np.float64 in the function so that Numba can compile the code correctly. With parallel=True, Numba should automatically parallelize the np.random.randn call as well as the basic operations that follow in the loop. On a 10-core machine this is 7 times faster (0.253 seconds with NumPy versus 0.036 seconds with the parallel Numba implementation). If you do not see any improvement, you can also try to parallelize it manually using prange.

Additionally, you can use np.float32 types for significantly faster performance (up to 2 times faster in theory). However, NumPy does not currently support that type in np.random.randn; use np.random.default_rng().random(size=underlyings*sims, dtype=np.float32).reshape(underlyings, sims) instead. Unfortunately, that is probably not yet supported by Numba, since NumPy added it quite recently.

If you have an Nvidia GPU, another solution is to use CUDA to execute the function on the GPU. This should be much faster. Note that Numba has specific optimized functions for generating random np.float32 values on the GPU with CUDA (see here).

How to speed up for loop?

You can remove all of the for loops in your code using vectorization in Numpy. I recommend reading Look Ma No For Loops as you start venturing down this path.

I took your example and generated some junk data to stand in for your arrays and variables using random.

import numpy as np
import random
import time

kvec_used = 0
k_array = []
sqmax = 10
beta = .5
kfac = []
sites = 3
q_array_initial = r_cart_initial = [.1, .2, .3]

# junk data standing in for the real arrays
for i in range(12000):
    temp = [random.random(), random.random(), random.random()]
    k_array.append(temp)
    kfac.append(random.random())

mytime = time.time()
for i in range(12000):
    cossum = 0
    sinsum = 0

    sq = k_array[i][0]**2 + k_array[i][1]**2 + k_array[i][2]**2

    if sq <= sqmax and sq > 0:
        kvec_used = kvec_used + 1
        kfac[i] = np.exp(-sq/(4*beta**2))/sq

        for j in range(sites):
            cossum = cossum + q_array_initial[j]*np.cos(np.dot(k_array[i], r_cart_initial[j]))
            sinsum = sinsum + q_array_initial[j]*np.sin(np.dot(k_array[i], r_cart_initial[j]))

        cos_part = (abs(cossum))**2
        sin_part = (abs(sinsum))**2

print("looped", time.time() - mytime)

mytime = time.time()

# vectorized replacement for the first loop
k_array = np.asarray(k_array)
sq = np.square(k_array[:, 0]) + np.square(k_array[:, 1]) + np.square(k_array[:, 2])
sq = sq[np.where(np.logical_and(sq[:] <= sqmax, sq[:] > 0))]
kvec = np.shape(sq)[0]
kfac = np.exp(-sq[:]/(4*beta**2))/sq[:]

# the second (inner) loop is deliberately left iterative; see below
for j in range(sites):
    cossum = cossum + q_array_initial[j]*np.cos(np.dot(k_array[i], r_cart_initial[j]))
    sinsum = sinsum + q_array_initial[j]*np.sin(np.dot(k_array[i], r_cart_initial[j]))

cos_part = (abs(cossum))**2
sin_part = (abs(sinsum))**2

print("vectorized:", time.time() - mytime)

By vectorizing only the first for loop, I get a speed-up of roughly 80 times as measured with time.time(). On my machine, the output of this code is:

looped 0.40552735328674316
vectorized: 0.004940986633300781

You should be able to speed up even more when you vectorize the second loop.


Edit: I just realized you asked for more explanation in your original question. The article I linked above explains things in detail and is what I used when I first started vectorizing large data sets. However, I can offer a super-simple explanation to get you going here:

When you have a large array, running a Python for loop over it is actually quite slow. NumPy uses more advanced code behind the scenes which can run operations much faster than a sequential for loop can. For a novice learning this, you can accept that NumPy is "magic" for the time being. The big takeaway is that rather than going through and operating on the ith element of your array, you want to tell NumPy to operate on the whole thing, represented as myarray[:]. You can extend this to multidimensional arrays: a full 2D array is myarray[:, :], and if (thinking in spreadsheet terms) you just want to operate on the 0th column and all rows of a 2D array, that becomes myarray[:, 0].

Here, I have used the NumPy-specific functions where and logical_and to replace your if conditional. This is another vectorization technique, one which handles conditional statements more quickly than an iterative approach.

I purposely left the second for loop untranslated as an exercise for you. No one learns without suffering! You have done well to use np.cos and np.sin here, but you use them in an iterative way. Use the vectorized approach to run these math operations on all elements of the result vectors at once.


