Iteratively Compute the Cartesian Product of an Arbitrary Number of Sets

Iteratively compute the Cartesian product of an arbitrary number of sets

I've written a solution that doesn't require you to fill up a large collection in memory. Unfortunately, the code required is hundreds of lines long. You may have to wait until it appears in the Guava project (https://github.com/google/guava), which I hope will be by the end of the year. Sorry. :(

Note that you may not need such a utility if the number of sets you're cartesian-producting is a fixed number known at compile time -- you could just use that number of nested for loops.

EDIT: the code is released now.

Sets.cartesianProduct()

I think you'll be very happy with it. It only creates the individual lists as you ask for them; doesn't fill up memory with all MxNxPxQ of them.

If you want to inspect the source, it's here.

Enjoy!

Cartesian product of an arbitrary number of sets

Edit: Previous solutions for two sets removed. See edit history for details.

Here is a way to do it recursively for an arbitrary number of sets:

public static Set<Set<Object>> cartesianProduct(Set<?>... sets) {
    if (sets.length < 2)
        throw new IllegalArgumentException(
                "Can't have a product of fewer than two sets (got " +
                sets.length + ")");

    return _cartesianProduct(0, sets);
}

private static Set<Set<Object>> _cartesianProduct(int index, Set<?>... sets) {
    Set<Set<Object>> ret = new HashSet<Set<Object>>();
    if (index == sets.length) {
        ret.add(new HashSet<Object>());
    } else {
        for (Object obj : sets[index]) {
            for (Set<Object> set : _cartesianProduct(index+1, sets)) {
                set.add(obj);
                ret.add(set);
            }
        }
    }
    return ret;
}

Note that it is impossible to keep any generic type information with the returned sets. If you knew in advance how many sets you wanted to take the product of, you could define a generic tuple to hold that many elements (for instance Triple<A, B, C>), but there is no way to have an arbitrary number of generic parameters in Java.

Cartesian product of a variable number of sets each of variable size

Create an array with the maximum value per digit; in your example, that would be [2,3,4].

Set the maximum of digit j to 1, so you get [2,1,4].

Create a first combination [1,1,1] and add it to the result. Then repeatedly increment the combination and add it to the result.

To increment the combination, increment the last digit; if that digit equals the maximum, set it to 1 and increment the previous digit, and so on. If the first digit needs to be incremented and it's already set to the maximum value, you've found all combinations.


[1,1,1] -> [1,1,2] -> [1,1,3] -> [1,1,4] ->
4 is maximum for last digit, so set it to 1 and increment previous digit ->
1 is maximum for second digit, so set it to 1 and increment previous digit ->
[2,1,1] -> [2,1,2] -> [2,1,3] -> [2,1,4] ->
4 is maximum for last digit, so set it to 1 and increment previous digit ->
1 is maximum for second digit, so set it to 1 and increment previous digit ->
2 is maximum for first digit, so all combinations have been found.

How can I compute a Cartesian product iteratively?

1) Create a list of indexes into the respective lists, initialized to 0, i.e:

indexes = [0,0,0,0,0,0]

2) Yield the appropriate element from each list (in this case the first).

3) Increase the last index by one.

4) If the last index equals the length of the last list, reset it to zero and carry one. Repeat this until there is no carry.

5) Go back to step 2 until the indexes wrap back to [0,0,0,0,0,0]

It's similar to how counting works, except the base for each digit can be different.

Here's an implementation of the above algorithm in Python:

def cartesian_product(aListOfList):
    indexes = [0] * len(aListOfList)
    while True:
        yield [l[i] for l,i in zip(aListOfList, indexes)]
        j = len(indexes) - 1
        while True:
            indexes[j] += 1
            if indexes[j] < len(aListOfList[j]): break
            indexes[j] = 0
            j -= 1
            if j < 0: return

Here is another way to implement it using modulo tricks:

def cartesian_product(aListOfList):
    i = 0
    while True:
        result = []
        j = i
        for l in aListOfList:
             result.append(l[j % len(l)])
             j /= len(l)
        if j > 0: return
        yield result
        i += 1

Note that this outputs the results in a slightly different order than in your example. This can be fixed by iterating over the lists in reverse order.

Compute Cartesian Product Fast like DBMS

Basically, no - if you want a cartesian product of two sets of sizes n,m - the size of the cartesian product is n*m - and you need O(nm) algorithm to generate it.

However, for data base languages, when you do things like 'join', you sometimes do not need to generate the entire join to get the answer - if all you are going to do later on is just count number of items, or sum.

This can be shown in Pig Latin's COGROUP operator, which only creates an implicit join, in order to later just use the sizes or sums - and not the real cartesian product. This potentially could be much more efficient if used properly in the right situation.

Efficiently count sets in a cartesian product that sum above a specific number

Here is a solution. The idea is to calculate all possible sums of our values you can obtain with n loops, counting the different possible sums, and counting together all sums that are larger than the threshold.

Then, we can generate all possible sums for n+1 loops by adding our values to the previous sums. We can hope that the number of different possible sums won't grow too large, as we add many times the same values, and we regroup all sums larger than the threshold.

from collections import Counter

def all_sums(values, threshold, previous_sums = None):
    """
    values must be sorted
    previous_sums is a Counter of previously obtained possible sums

    Returns a Counter of all possible sums of values and the previous sums
    """
    if not previous_sums:
        previous_sums = Counter({0:1})

    new = Counter()
    for existing_sum, ex_sum_count in sorted(previous_sums.items()):
        for index, val in enumerate(values):
            total = existing_sum + val
            if total < threshold:
                # With the current value, we have found ex_sum_count
                # ways to obtain that total
                new.update({total: ex_sum_count})
            else:
                # We don't need the exact sum, as anything we could
                # later add to it will be over the threshold.
                # We count them under the value = threshold
                # As 'values' is sorted, all subsequent values will also give 
                # a sum over the threshold
                values_left = len(values) - index
                new.update({threshold: values_left * ex_sum_count})
                break
    return new

def count_sums(values, threshold, repeat):
    """
    values must be sorted!

    Recursively calculates the possible sums of 'repeat' values,
    counting together all values over 'threshold'
    """
    if repeat == 1:
        return all_sums(values, threshold, previous_sums=None)
    else:
        return all_sums(values, threshold, previous_sums=count_sums(values, threshold, repeat=repeat-1))

Let's try it on your example:

loops = 10
results = [4, 2.75, 2.75, 1.5, 1.5, 1.5, 0]
threshold = loops * 2

values = sorted(results)

sums = count_sums(values, threshold, repeat=loops)
print(sums)
# Counter({20: 137401794, 19.75: 16737840, 18.25: 14016240, 18.5: 13034520, 19.5: 12904920,
# 17.0: 12349260, 15.75: 8573040, 17.25: 8048160, 15.5: 6509160, 16.75: 6395760, 14.25: 5171040,
# 18.0: 5037480, 14.5: 4461480, 16: 3739980, 18.75: 3283020, 19.25: 3220800, 13.0: 3061800, 
# 14.0: 2069550, 12.75: 1927800, 15.25: 1708560, 13.25: 1574640, 17.5: 1391670, 11.5: 1326780,
# 11.75: 1224720, 14.75: 1182660, 16.5: 1109640, 10.25: 612360, 17.75: 569520, 11.25: 453600, 
# 16.25: 444060, 12.5: 400680, 10.0: 374220, 12: 295365, 13.75: 265104, 10.5: 262440, 19.0: 229950,
# 13.5: 204390, 8.75: 204120, 15.0: 192609, 9.0: 153090, 8.5: 68040, 9.75: 65520, 7.5: 61236, 
# 7.25: 45360, 11.0: 44940, 12.25: 21840, 6.0: 17010, 7.0: 7560, 5.75: 6480, 8.25: 5280, 4.5: 3240,
# 9.5: 2520, 10.75: 720, 4.25: 540, 5.5: 450, 3.0: 405, 6.75: 180, 8: 45, 1.5: 30, 2.75: 20, 4: 10, 0: 1})
number_of_sums = len(results) ** loops
# 282475249
good = sums[threshold]
# 137401794
bad = number_of_sums - good
# 145073455

I timed it, it takes about 9 ms on my rather old machine.

And with some other data: 10 different values, 20 loops:

loops = 20
results = [4, 2.75, 2.45, 1.5, 1.3, 0.73, 0.12, 1.4, 1.5, 0]
threshold = loops * 2
values = sorted(results)

sums = count_sums(values, threshold, repeat=loops)
number_of_sums = len(results) ** loops
good = sums[threshold]
bad = number_of_sums - good
print(good)
print(bad)
# 5440943363190360728
# 94559056636809639272

which I obtain in less than 12 seconds.

How to calculate the Cartesian Power of a list in R

Thanks @G.Grothendieck for solving the question!

s <- c(1, 2, 3)
n <- 2
do.call("expand.grid", rep(list(s), n))

This is the R code that gives the correct results.

Lazy Cartesian product of arrays (arbitrary nested loops)

Coincidentally working on the same thing over the weekend. I was looking to find alternative implementations to my [].every-based algo which turned out to have abyssmal performance in Firefox (but screams in Chrome -- more than twice as fast as the next).

The end result is http://jsperf.com/lazy-cartesian-product/19 . It's similar to Tomalak's approach but there is only one arguments array which is mutated as the carets move instead of being generated each time.

I'm sure it could be improved further by using the clever maths in the other algos. I don't quite understand them though, so I leave it to others to try.

EDIT: the actual code, same interface as Tomalak's. I like this interface because it could be breaked anytime. It's only slightly slower than if the loop is inlined in the function itself.

var xp = crossProduct([
  [2,3,4,5],['angry','happy'], 
  ['monkeys','anteaters','manatees']]);
while (xp.next()) xp.do(console.log, console);

function crossProduct(sets) {
  var n = sets.length, carets = [], args = [];

  function init() {
    for (var i = 0; i < n; i++) {
      carets[i] = 0;
      args[i] = sets[i][0];
    }
  }

  function next() {
    if (!args.length) {
      init();
      return true;
    }
    var i = n - 1;
    carets[i]++;
    if (carets[i] < sets[i].length) {
      args[i] = sets[i][carets[i]];
      return true;
    }
    while (carets[i] >= sets[i].length) {
      if (i == 0) {
        return false;
      }
      carets[i] = 0;
      args[i] = sets[i][0];
      carets[--i]++;
    }
    args[i] = sets[i][carets[i]];
    return true;
  }

  return {
    next: next,
    do: function (block, _context) {
      return block.apply(_context, args);
    }
  }
}

Iteratively Compute the Cartesian Product of an Arbitrary Number of Sets