Itertools Product Speed Up


The NumPy equivalent of itertools.product() is numpy.indices(), but it will only get you the product of ranges of the form 0,...,k-1:

>>> numpy.rollaxis(numpy.indices((2, 3, 3)), 0, 4)
array([[[[0, 0, 0],
         [0, 0, 1],
         [0, 0, 2]],

        [[0, 1, 0],
         [0, 1, 1],
         [0, 1, 2]],

        [[0, 2, 0],
         [0, 2, 1],
         [0, 2, 2]]],

       [[[1, 0, 0],
         [1, 0, 1],
         [1, 0, 2]],

        [[1, 1, 0],
         [1, 1, 1],
         [1, 1, 2]],

        [[1, 2, 0],
         [1, 2, 1],
         [1, 2, 2]]]])

For your special case, you can use

a = numpy.indices((4,)*13)
b = 1j ** numpy.rollaxis(a, 0, 14)

(This won't run on a 32-bit system, because the array is too large. Extrapolating from the sizes I can test, it should run in less than a minute though.)
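As a quick sanity check (my addition, not part of the original answer), the same construction on a smaller problem reproduces itertools.product over the powers of 1j:

import itertools
import numpy

# Smaller instance (repeat=4 instead of 13) so it fits in memory everywhere.
a = numpy.indices((4,) * 4)
b = 1j ** numpy.rollaxis(a, 0, 5)          # shape (4, 4, 4, 4, 4)
rows = b.reshape(-1, 4)                    # one row per combination

expected = numpy.array(list(itertools.product([1, 1j, -1, -1j], repeat=4)))
print(numpy.allclose(rows, expected))      # True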

EDIT: Just to mention it: the call to numpy.rollaxis() is more or less cosmetic, to get the same output as itertools.product(). If you don't care about the order of the indices, you can just omit it (but it is cheap anyway, as long as you don't have any follow-up operations that would transform your array into a contiguous array).

EDIT2: To get the exact analogue of

numpy.array(list(itertools.product(some_list, repeat=some_length)))

you can use

numpy.array(some_list)[numpy.rollaxis(
    numpy.indices((len(some_list),) * some_length), 0, some_length + 1)
    .reshape(-1, some_length)]

This got completely unreadable -- just tell me whether I should explain it any further :)
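A small sanity check (again my addition) that the expression above matches itertools.product for an arbitrary list:

import itertools
import numpy

some_list = ['a', 'b', 'c']
some_length = 3

via_numpy = numpy.array(some_list)[numpy.rollaxis(
    numpy.indices((len(some_list),) * some_length), 0, some_length + 1)
    .reshape(-1, some_length)]
via_itertools = numpy.array(list(itertools.product(some_list, repeat=some_length)))

print((via_numpy == via_itertools).all())  # True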

Python Itertools.product speed up

You probably won't be able to optimize that any further using just pure Python code.

For any serious number crunching I would suggest using numpy.

Here is a Cartesian product implementation in numpy that you can use:
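The code itself is missing from this page; below is a hedged sketch of the commonly used recursive numpy recipe for the Cartesian product (my reconstruction, not necessarily the exact code the answer originally included):

import numpy as np

def cartesian(arrays, out=None):
    """Build the Cartesian product of several 1-D arrays as rows of a 2-D array."""
    arrays = [np.asarray(a) for a in arrays]
    dtype = arrays[0].dtype
    n = np.prod([a.size for a in arrays])
    if out is None:
        out = np.zeros([n, len(arrays)], dtype=dtype)
    m = n // arrays[0].size
    # First column: each value of the first array repeated over its block.
    out[:, 0] = np.repeat(arrays[0], m)
    if arrays[1:]:
        # Fill the remaining columns recursively for the first block,
        # then copy that block for every other value of the first array.
        cartesian(arrays[1:], out=out[0:m, 1:])
        for j in range(1, arrays[0].size):
            out[j*m:(j+1)*m, 1:] = out[0:m, 1:]
    return out

print(cartesian(([1, 2, 3], [4, 5], [6, 7])))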

Any way to speedup itertool.product

Comparing performance of the offered solutions:

import itertools
import timeit
import numpy as np

# original code from question
def f1():
    min_wt = 10
    max_wt = 50
    step = 10
    nb_assets = 5

    weight_mat = []
    for i in itertools.product(range(min_wt, (max_wt+1), step), repeat=nb_assets):
        if sum(i) == 100:
            weight = [i, ]
            if np.shape(weight_mat)[0] == 0:
                weight_mat = weight
            else:
                weight_mat = np.concatenate((weight_mat, weight), axis=0)

    return weight_mat

# code from question using list instead of numpy array
def f1b():
    min_wt = 10
    max_wt = 50
    step = 10
    nb_assets = 5

    weight_list = []
    for i in itertools.product(range(min_wt, (max_wt+1), step), repeat=nb_assets):
        if sum(i) == 100:
            weight_list.append(i)

    return weight_list

# calculating the last element of each tuple
def f2():
    min_wt = 10
    max_wt = 50
    step = 10
    nb_assets = 5

    weight_list = []
    for i in itertools.product(range(min_wt, (max_wt+1), step), repeat=nb_assets-1):
        the_sum = sum(i)
        if the_sum < 100:
            last_elem = 100 - the_sum
            if min_wt <= last_elem <= max_wt:
                weight_list.append(i + (last_elem, ))

    return weight_list

# recursive solution from user kaya3 (https://stackoverflow.com/a/58823843/9225671)
def constrained_partitions(n, k, min_w, max_w, w_step=1):
    if k < 0:
        raise ValueError('Number of parts must be at least 0')
    elif k == 0:
        if n == 0:
            yield ()
    else:
        for w in range(min_w, max_w+1, w_step):
            for p in constrained_partitions(n-w, k-1, min_w, max_w, w_step):
                yield (w,) + p

def f3():
    return list(constrained_partitions(100, 5, 10, 50, 10))

# recursive solution from user jdehesa (https://stackoverflow.com/a/58823990/9225671)
def make_weight_combs(min_wt, max_wt, step, nb_assets, req_wt):
    weights = range(min_wt, max_wt + 1, step)
    current = []
    yield from _make_weight_combs_rec(weights, nb_assets, req_wt, current)

def _make_weight_combs_rec(weights, nb_assets, req_wt, current):
    if nb_assets <= 0:
        yield tuple(current)
    else:
        # Discard weights that cannot possibly be used
        while weights and weights[0] + weights[-1] * (nb_assets - 1) < req_wt:
            weights = weights[1:]
        while weights and weights[-1] + weights[0] * (nb_assets - 1) > req_wt:
            weights = weights[:-1]
        # Add all possible weights
        for w in weights:
            current.append(w)
            yield from _make_weight_combs_rec(weights, nb_assets - 1, req_wt - w, current)
            current.pop()

def f4():
    return list(make_weight_combs(10, 50, 10, 5, 100))
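Before timing, a quick consistency check (my addition, not part of the original answer) confirms that all variants produce the same set of weight combinations; only the ordering differs:

# Sanity check: every implementation should yield the same set of weight tuples.
reference = set(f1b())
assert set(f2()) == reference
assert set(f3()) == reference
assert set(f4()) == reference
assert set(map(tuple, f1().tolist())) == reference  # f1 returns a numpy array
print(len(reference), 'valid combinations')         # 121 with the parameters above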

I tested these functions using timeit like this:

print(timeit.timeit('f()', 'from __main__ import f1 as f', number=100))

The results using the parameters from the question:

# min_wt = 10
# max_wt = 50
# step = 10
# nb_assets = 5
0.07021828400320373 # f1 - original code from question
0.041302188008558005 # f1b - code from question using list instead of numpy array
0.009902548001264222 # f2 - calculating the last element of each tuple
0.10601829699589871 # f3 - recursive solution from user kaya3
0.03329997700348031 # f4 - recursive solution from user jdehesa

If I expand the search space (reduced step and increased assets):

# min_wt = 10
# max_wt = 50
# step = 5
# nb_assets = 6
7.6620834979985375 # f1 - original code from question
7.31425816299452 # f1b - code from question using list instead of numpy array
0.809070186005556 # f2 - calculating the last element of each tuple
14.88188026699936 # f3 - recursive solution from user kaya3
0.39385621099791024 # f4 - recursive solution from user jdehesa

It seems like f2 and f4 are the fastest (for the tested problem sizes).

Python: faster alternative for itertools.product()?

As everyone commented, try using the generator directly instead of building a list; what you mean by "finding all combinations" is unclear. If you need to print them, do this:

for i in itertools.product(range(1, 10), repeat=22):
    ...  # Don't actually print, that may block your computer for a long time.

If you need to do something with those values, then tell us what you need.
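For instance, if the goal is just to count or filter, the product can be consumed lazily without ever materializing a list. A hedged sketch with a made-up condition and a much smaller repeat, so that it actually finishes:

import itertools

# Count tuples whose digits sum to a target, streaming through the product.
# repeat is kept small on purpose: range(1, 10) with repeat=22 has 9**22 elements.
target = 30
count = sum(1 for combo in itertools.product(range(1, 10), repeat=6)
            if sum(combo) == target)
print(count)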

Speeding up itertools combinations with cython

As already pointed out in the comments, you should not expect Cython to speed up your code, because most of the time is spent inside itertools and in the creation of lists.

Because I'm curious to see how itertools's generic implementation fares against old-school tricks, let's take a look at this Cython implementation of "all subsets k out of n":

%%cython

ctypedef unsigned long long ull

cdef ull next_subset(ull subset):
    cdef ull smallest, ripple, ones
    smallest = subset & -subset
    ripple = subset + smallest
    ones = subset ^ ripple
    ones = (ones >> 2) // smallest
    subset = ripple | ones
    return subset

cdef subset2list(ull subset, int offset, int cnt):
    cdef list lst = [0] * cnt  # pre-allocate
    cdef int current = 0
    cdef int index = 0
    while subset > 0:
        if (subset & 1) != 0:
            lst[index] = offset + current
            index += 1
        subset >>= 1
        current += 1
    return lst

def all_k_subsets(int start, int end, int k):
    cdef int n = end - start
    cdef ull MAX = 1L << n
    cdef ull subset = (1L << k) - 1L
    lst = []
    while MAX > subset:
        lst.append(subset2list(subset, start, k))
        subset = next_subset(subset)
    return lst

This implementation uses some well-known bit tricks and has the limitation that it only works for at most 64 elements.
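For readers unfamiliar with the bit trick, here is a pure-Python rendering of next_subset() (Gosper's hack), added only to illustrate what the Cython code computes; it is not meant as a fast replacement:

def next_subset(subset):
    # Gosper's hack: the smallest integer greater than `subset` with the same popcount.
    smallest = subset & -subset        # lowest set bit
    ripple = subset + smallest         # carry collapses the lowest block of 1s
    ones = subset ^ ripple             # bits that changed
    ones = (ones >> 2) // smallest     # right-justify the leftover 1s
    return ripple | ones

subset = 0b00111                       # smallest 3-element subset of a 5-element set
while subset < (1 << 5):
    print(format(subset, '05b'))
    subset = next_subset(subset)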

If we compare both approaches (x being the itertools-based implementation):

>>> %timeit x(1,45,6)
2.52 s ± 108 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

>>> %timeit all_k_subsets(1,45,6)
1.29 s ± 5.16 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

The speed-up by a factor of 2 is quite disappointing.

However, the bottleneck is the creation of the lists and not the calculation itself; it is easy to check that, without list creation, the calculation would take about 0.1 seconds.

My takeaway from this: if you are serious about speed, you should not create so many lists but process each subset on the fly (ideally in Cython); a speed-up of more than 10x is possible. If you must have all subsets as lists, then you cannot expect a huge speed-up.

How to speed up multiple inner products in python

This new code gets another order of magnitude speedup by taking advantage of the cyclic symmetry of the problem. This Python version enumerates necklaces with Duval's algorithm; the C version uses brute force. Both incorporate the speedups described below. On my machine, the C version solves n = 20 in 100 seconds! A back-of-the-envelope calculation suggests that, if you were to let it run for a week on a single core, it could do n = 26, and, as noted below, it's amenable to parallelism.

import itertools

def necklaces_with_multiplicity(n):
    assert isinstance(n, int)
    assert n > 0
    w = [1] * n
    i = 1
    while True:
        if n % i == 0:
            s = sum(w)
            if s > 0:
                yield (tuple(w), i * 2)
            elif s == 0:
                yield (tuple(w), i)
        i = n - 1
        while w[i] == -1:
            if i == 0:
                return
            i -= 1
        w[i] = -1
        i += 1
        for j in range(n - i):
            w[i + j] = w[j]

def leading_zero_counts(n):
    assert isinstance(n, int)
    assert n > 0
    assert n % 2 == 0
    counts = [0] * n
    necklaces = list(necklaces_with_multiplicity(n))
    for combo in itertools.combinations(range(n - 1), n // 2):
        for v, multiplicity in necklaces:
            w = list(v)
            for j in combo:
                w[j] *= -1
            for i in range(n):
                counts[i] += multiplicity * 2
                product = 0
                for j in range(n):
                    product += v[j - (i + 1)] * w[j]
                if product != 0:
                    break
    return counts

if __name__ == '__main__':
    print(leading_zero_counts(12))

C version:

#include <stdio.h>

enum {
  N = 14
};

struct Necklace {
  unsigned int v;
  int multiplicity;
};

static struct Necklace g_necklace[1 << (N - 1)];
static int g_necklace_count;

static void initialize_necklace(void) {
  g_necklace_count = 0;
  for (unsigned int v = 0; v < (1U << (N - 1)); v++) {
    int multiplicity;
    unsigned int w = v;
    for (multiplicity = 2; multiplicity < 2 * N; multiplicity += 2) {
      w = ((w & 1) << (N - 1)) | (w >> 1);
      unsigned int x = w ^ ((1U << N) - 1);
      if (w < v || x < v) goto nope;
      if (w == v || x == v) break;
    }
    g_necklace[g_necklace_count].v = v;
    g_necklace[g_necklace_count].multiplicity = multiplicity;
    g_necklace_count++;
   nope:
    ;
  }
}

int main(void) {
  initialize_necklace();
  long long leading_zero_count[N + 1];
  for (int i = 0; i < N + 1; i++) leading_zero_count[i] = 0;
  for (unsigned int v_xor_w = 0; v_xor_w < (1U << (N - 1)); v_xor_w++) {
    if (__builtin_popcount(v_xor_w) != N / 2) continue;
    for (int k = 0; k < g_necklace_count; k++) {
      unsigned int v = g_necklace[k].v;
      unsigned int w = v ^ v_xor_w;
      for (int i = 0; i < N + 1; i++) {
        leading_zero_count[i] += g_necklace[k].multiplicity;
        w = ((w & 1) << (N - 1)) | (w >> 1);
        if (__builtin_popcount(v ^ w) != N / 2) break;
      }
    }
  }
  for (int i = 0; i < N + 1; i++) {
    printf(" %lld", 2 * leading_zero_count[i]);
  }
  putchar('\n');
  return 0;
}

You can get a bit of speedup by exploiting the sign symmetry (4x) and by iterating over only those vectors that pass the first inner product test (asymptotically, O(sqrt(n))x).

import itertools

n = 10
m = n + 1

def innerproduct(A, B):
    s = 0
    for k in range(n):
        s += A[k] * B[k]
    return s

leadingzerocounts = [0] * m
for S in itertools.product([-1, 1], repeat=n - 1):
    S1 = S + (1,)
    S1S1 = S1 * 2
    for C in itertools.combinations(range(n - 1), n // 2):
        F = list(S1)
        for i in C:
            F[i] *= -1
        leadingzerocounts[0] += 4
        for i in range(1, m):
            if innerproduct(F, S1S1[i:i + n]):
                break
            leadingzerocounts[i] += 4
print(leadingzerocounts)

C version, to get a feel for how much performance we're losing to PyPy (16 for PyPy is roughly equivalent to 18 for C):

#include <stdio.h>

enum {
  HALFN = 9,
  N = 2 * HALFN
};

int main(void) {
  long long lzc[N + 1];
  for (int i = 0; i < N + 1; i++) lzc[i] = 0;
  unsigned int xor = 1 << (N - 1);
  while (xor-- > 0) {
    if (__builtin_popcount(xor) != HALFN) continue;
    unsigned int s = 1 << (N - 1);
    while (s-- > 0) {
      lzc[0]++;
      unsigned int f = xor ^ s;
      for (int i = 1; i < N + 1; i++) {
        f = ((f & 1) << (N - 1)) | (f >> 1);
        if (__builtin_popcount(f ^ s) != HALFN) break;
        lzc[i]++;
      }
    }
  }
  for (int i = 0; i < N + 1; i++) printf(" %lld", 4 * lzc[i]);
  putchar('\n');
  return 0;
}

This algorithm is embarrassingly parallel because it's just accumulating over all values of xor. With the C version, a back-of-the-envelope calculation suggests that a few thousand hours of CPU time would suffice to calculate n = 26, which works out to a couple hundred dollars at current rates on EC2. There are undoubtedly some optimizations to be made (e.g., vectorization), but for a one-off like this I'm not sure how much more programmer effort is worthwhile.
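As an illustration of that parallelism (a sketch of mine based on the pure-Python variant above, not the author's code), the outer loop over S can be farmed out to worker processes, and the per-S counters summed at the end:

import itertools
from multiprocessing import Pool

n = 10
m = n + 1

def innerproduct(A, B):
    s = 0
    for k in range(n):
        s += A[k] * B[k]
    return s

def counts_for_S(S):
    # Contribution of a single S vector; each S is independent, so the partial
    # counters can be computed in parallel and summed afterwards.
    lzc = [0] * m
    S1 = S + (1,)
    S1S1 = S1 * 2
    for C in itertools.combinations(range(n - 1), n // 2):
        F = list(S1)
        for i in C:
            F[i] *= -1
        lzc[0] += 4
        for i in range(1, m):
            if innerproduct(F, S1S1[i:i + n]):
                break
            lzc[i] += 4
    return lzc

if __name__ == '__main__':
    totals = [0] * m
    with Pool() as pool:
        for partial in pool.imap_unordered(
                counts_for_S, itertools.product([-1, 1], repeat=n - 1)):
            totals = [a + b for a, b in zip(totals, partial)]
    print(totals)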

itertools.product slower than nested for loops

Your original itertools code spent a lot of extra time in the needless lambda and in building lists of intermediate values by hand; a lot of this can be replaced with built-in functionality.

Now, the inner for loop does add quite a lot of extra overhead: just try the following, and the performance is very much on par with your original code:

for a in itertools.product(carbons, hydrogens, nitrogens, oxygens17,
                           oxygens18, sulfurs33, sulfurs34, sulfurs36):
    i, j, k, l, m, n, o, p = a
    totals.append((i[0]+j[0]+k[0]+l[0]+m[0]+n[0]+o[0]+p[0],
                   i[1]*j[1]*k[1]*l[1]*m[1]*n[1]*o[1]*p[1]))

The following code runs as much as possible on the CPython built-in side, and I tested it to be equivalent to your code. Notably, it uses zip(*iterable) to unzip each of the product results, then reduce with operator.mul for the product and sum for the summing, with two generator expressions for going through the lists. The nested for loop still wins slightly, but being hardcoded it is probably not what you can use in the long run.

import itertools
from operator import mul
from functools import partial, reduce  # reduce lives in functools on Python 3

prod = partial(reduce, mul)
elems = carbons, hydrogens, nitrogens, oxygens17, oxygens18, sulfurs33, sulfurs34, sulfurs36
p = itertools.product(*elems)

totals = [
    (sum(massdiffs), prod(chances))
    for massdiffs, chances in
    (zip(*i) for i in p)
]
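To make the comprehension runnable on its own, here is a toy, self-contained run with made-up placeholder lists standing in for the question's isotope tables; each entry is assumed to be a (mass difference, probability) pair, which is what the sum/product split above implies:

import itertools
from operator import mul
from functools import partial, reduce

# Made-up placeholder data, two isotopes per element: (mass_difference, probability).
carbons   = [(0.0, 0.99), (1.0, 0.01)]
hydrogens = [(0.0, 0.99), (1.0, 0.01)]
nitrogens = [(0.0, 0.99), (1.0, 0.01)]
oxygens17 = [(0.0, 0.99), (1.0, 0.01)]
oxygens18 = [(0.0, 0.98), (2.0, 0.02)]
sulfurs33 = [(0.0, 0.99), (1.0, 0.01)]
sulfurs34 = [(0.0, 0.96), (2.0, 0.04)]
sulfurs36 = [(0.0, 0.99), (4.0, 0.01)]

prod = partial(reduce, mul)
elems = (carbons, hydrogens, nitrogens, oxygens17,
         oxygens18, sulfurs33, sulfurs34, sulfurs36)

totals = [(sum(massdiffs), prod(chances))
          for massdiffs, chances in (zip(*i) for i in itertools.product(*elems))]

print(len(totals))   # 2**8 = 256 combinations
print(totals[0])     # (0.0, probability of the all-most-abundant combination)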

