Python Random Sample with a Generator/Iterable/Iterator

While Martijn Pieters's answer is correct, it does slow down when samplesize becomes large, because using list.insert in a loop has quadratic complexity.

Here's an alternative that, in my opinion, preserves the uniformity while increasing performance:

import random

def iter_sample_fast(iterable, samplesize):
    results = []
    iterator = iter(iterable)
    # Fill in the first samplesize elements:
    try:
        for _ in range(samplesize):
            results.append(next(iterator))
    except StopIteration:
        raise ValueError("Sample larger than population.")
    random.shuffle(results)  # Randomize their positions
    for i, v in enumerate(iterator, samplesize):
        r = random.randint(0, i)
        if r < samplesize:
            results[r] = v  # at a decreasing rate, replace random items
    return results
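
As a quick usage check (assuming the function above is in scope), you can sample from a generator that is never materialized as a list:

gen = (x * x for x in range(1_000_000))
print(iter_sample_fast(gen, 5))  # five squares chosen uniformly at random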

The difference slowly starts to show for samplesize values above 10000. Times for calling with (1000000, 100000):

  • iterSample: 5.05s
  • iter_sample_fast: 2.64s

Python itertools create iterator of random subset

Just produce random combinations, tracking what you've seen before:

import random

def random_combinations(matrix, size):
    seen = set()
    n = len(matrix)
    while True:
        # Sort the sampled indices so each combination has one canonical form
        new_sample = tuple(sorted(random.sample(range(n), size)))
        if new_sample not in seen:
            seen.add(new_sample)
            yield tuple(matrix[i] for i in new_sample)

Iterating through all possible combinations just to sample them is not efficient; you would still end up generating all 10^14 combinations.

The above generator picks a new random combination each time you iterate; if you need a certain number of them, use a loop or itertools.islice(). Picking 10 random combinations would be:

from itertools import islice

combinations_sample = list(islice(random_combinations(matrix, 50), 10))

You may have misunderstood what the function you found does; it does much the same as my function above but produces just the one random combination, without tracking what was produced before. You were supposed to use it on matrix, not on all combinations of matrix.
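
For reference, here is a minimal sketch of such a one-shot function, modeled on the random_combination recipe in the itertools docs:

import random

def random_combination(iterable, r):
    # One random size-r combination, with no memory of earlier picks.
    pool = tuple(iterable)
    indices = sorted(random.sample(range(len(pool)), r))
    return tuple(pool[i] for i in indices)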

Random sampling a large Cartesian product of iterables

Which version of Python are you using? Somewhere along the way, .next() methods were deprecated in favor of a new next() built-in function, which works fine with all iterators. Here, for example, under the currently released 3.10.1:

>>> import itertools
>>> itp = itertools.product(range(5), repeat=6)
>>> next(itp)
(0, 0, 0, 0, 0, 0)
>>> next(itp)
(0, 0, 0, 0, 0, 1)
>>> next(itp)
(0, 0, 0, 0, 0, 2)
>>> next(itp)
(0, 0, 0, 0, 0, 3)
>>> for ignore in range(50):
...     ignore = next(itp)
...
>>> next(itp)
(0, 0, 0, 2, 0, 4)

Beyond that, you didn't show us the most important part of your code: how you build your product.

Without seeing that, I can only guess that it would be far more efficient to make a random choice from the first sequence passed to product(), then another from the second, and so on. Build a random element of the product from one component at a time.
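
A minimal sketch of that idea, with placeholder sequences standing in for whatever was actually passed to product():

import random

seqs = [range(5), "abcde", [1.5, 2.5, 3.5]]  # hypothetical inputs

# Build one uniformly random element of the product, one component at a
# time; no need to create or iterate the product itself.
random_element = tuple(random.choice(seq) for seq in seqs)
print(random_element)  # e.g. (3, 'b', 2.5)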

Picking a random product tuple efficiently

Perhaps overkill, but this class shows an especially efficient way to do this. The .index() method maps an integer i to the i'th tuple (0-based) in the product. Then picking a random tuple from the product is simply applying .index() to a random integer in range(total number of elements in the product).

from math import prod
from random import randrange

class RanProduct:
    def __init__(self, iterables):
        self.its = list(map(list, iterables))
        self.n = prod(map(len, self.its))

    def index(self, i):
        # Mixed-radix decomposition: peel off one index per iterable.
        if i not in range(self.n):
            raise ValueError(f"index {i} not in range({self.n})")
        result = []
        for it in reversed(self.its):
            i, r = divmod(i, len(it))
            result.append(it[r])
        return tuple(reversed(result))

    def pickran(self):
        return self.index(randrange(self.n))

and then

>>> r = RanProduct(["abc", range(2)])
>>> for i in range(6):
...     print(i, '->', r.index(i))
...
0 -> ('a', 0)
1 -> ('a', 1)
2 -> ('b', 0)
3 -> ('b', 1)
4 -> ('c', 0)
5 -> ('c', 1)
>>> r = RanProduct([range(10)] * 19)
>>> r.pickran()
(3, 5, 8, 8, 3, 6, 7, 6, 8, 6, 2, 0, 5, 6, 1, 0, 0, 8, 2)
>>> r.pickran()
(4, 5, 0, 5, 7, 1, 7, 2, 7, 4, 8, 4, 2, 0, 2, 9, 3, 6, 2)
>>> r.pickran()
(8, 7, 4, 1, 3, 0, 4, 6, 4, 3, 9, 8, 5, 8, 9, 9, 7, 1, 8)
>>> r.pickran()
(8, 6, 6, 0, 6, 7, 1, 3, 9, 5, 1, 4, 5, 8, 6, 8, 4, 9, 9)
>>> r.pickran()
(4, 9, 4, 7, 1, 5, 5, 1, 6, 7, 1, 8, 9, 0, 7, 9, 1, 7, 0)
>>> r.pickran()
(3, 0, 3, 9, 8, 6, 3, 0, 3, 0, 9, 9, 3, 5, 2, 3, 7, 8, 8)

Randomly sampling from large combination generator

From what you describe, I believe you'd have a much more effective algorithm if you pick each component randomly, independently of the others, and continue until you have the requisite sample. RNGs (random number generators) are quite fast, fast enough to make up for needing to replace the occasional duplicate. Store your chosen combinations as a set of tuples (tuples are hashable), and you can test set membership in constant time, making the whole collection process linear in the sample size. Something like this:

from random import randint

# For illustration, the "lists" include letters, symbols, 3-letter words, and low primes
list1 = "pythonic"
list2 = "~!@#$%^&*()"
list3 = ["dog", "cat", "ape", "red", "cwm", "pox"]
list4 = [2, 3, 5, 7, 11, 13, 17, 19]

combo = [list1, list2, list3, list4]
my_sample = set()
needed_size = 10

while len(my_sample) < needed_size:
    # Choose one random item from each list; that forms an element
    elem = tuple(comp[randint(0, len(comp) - 1)] for comp in combo)
    # Using a set eliminates duplicates easily
    my_sample.add(elem)

print(my_sample)

Output:

{('h', '$', 'pox', 7),
('y', '(', 'cat', 11),
('n', '@', 'cat', 7),
('i', '^', 'ape', 13),
('y', '#', 'pox', 11),
('o', '%', 'dog', 7),
('p', '^', 'cwm', 13),
('c', '*', 'dog', 19),
('o', ')', 'pox', 11),
('h', '~', 'cat', 5)}

Another possibility is to generate one random number in the range of the product of the lengths (8 * 11 * 6 * 8 in this case), and then use integer division and mod to break it into your four random indices, as sketched below.
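
A minimal sketch of that divmod decomposition, reusing the four lists from the snippet above (nth_combination is my name for the hypothetical helper; it works much like the RanProduct.index() method earlier on this page):

from math import prod
from random import randrange

lists = [list1, list2, list3, list4]   # from the snippet above
total = prod(len(l) for l in lists)    # product of the list lengths

def nth_combination(i):
    # Peel off one index per list via divmod, least-significant list last.
    out = []
    for l in reversed(lists):
        i, r = divmod(i, len(l))
        out.append(l[r])
    return tuple(reversed(out))

print(nth_combination(randrange(total)))  # one uniform random combination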

One more possibility is to simply generate your first set of random indices, and then increment each of them in turn, stepping through the lists cyclically. You will want your list lengths to be pairwise coprime in this case; you can guarantee that by appending None elements as needed. Any combination containing a None is discarded. See the sketch below.
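
A minimal sketch of that coprime-stepping idea, assuming None never occurs as a legitimate list element (the padding loop and helper name are my own illustration):

from math import gcd
from random import randrange

def coprime_step_combos(lists):
    # Pad each list with None until all lengths are pairwise coprime;
    # by the Chinese remainder theorem, stepping every index by 1 then
    # visits every index combination once before the cycle repeats.
    lists = [list(l) for l in lists]
    for k in range(1, len(lists)):
        while any(gcd(len(lists[k]), len(lists[j])) != 1 for j in range(k)):
            lists[k].append(None)
    idx = [randrange(len(l)) for l in lists]  # random starting point
    while True:
        combo = tuple(l[i] for l, i in zip(lists, idx))
        if None not in combo:                 # skip padded combinations
            yield combo
        idx = [(i + 1) % len(l) for l, i in zip(lists, idx)]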

Do those ideas get you moving?

Memory-efficient random number iterator without replacement

How about this approach? First I create the x*y array and reshape it to 2-D. Then, since each cell can be uniquely identified by a single integer, I draw a sample from range(x*y).

import numpy
from random import sample

x_count = 10000
y_count = 20000

large_table = numpy.arange(y_count * x_count).reshape(y_count, x_count)
print(large_table)

def get_random_item(sample_size):
    # Draw unique flat indices, then convert each back to 2-D coordinates.
    # Note: x here is the row index (y_count rows), y the column (x_count columns).
    for i in sample(range(y_count * x_count), sample_size):
        x, y = divmod(i, x_count)
        yield (x, y)

for x, y in get_random_item(10):
    print('%12i x: %5i y: %5i' % (large_table[x][y], x, y))

Which returns:

(first to simulate your existing 2-D array you created via product)

[[        0         1         2 ...,      9997      9998      9999]
[ 10000 10001 10002 ..., 19997 19998 19999]
[ 20000 20001 20002 ..., 29997 29998 29999]
...,
[199970000 199970001 199970002 ..., 199979997 199979998 199979999]
[199980000 199980001 199980002 ..., 199989997 199989998 199989999]
[199990000 199990001 199990002 ..., 199999997 199999998 199999999]]

Then it returns the 2-D coordinates, which can be translated back into your cell contents simply via large_table[x][y]:

   154080675 x: 15408 y:   675
   186978188 x: 18697 y:  8188
   157506087 x: 15750 y:  6087
   168859259 x: 16885 y:  9259
    29775768 x:  2977 y:  5768
    94167866 x:  9416 y:  7866
    15978144 x:  1597 y:  8144
    91964007 x:  9196 y:  4007
   163462830 x: 16346 y:  2830
    62613129 x:  6261 y:  3129

The documentation for sample() states that it is 'used for random sampling without replacement', and this approach adheres to the advice found on the Python random page: 'This is especially fast and space efficient for sampling from a large population: sample(xrange(10000000), 60).'

I note that while I use get_random_item() as a generator, the underlying sample() still produces a full list, so the memory use is still y*x + sample_size; but it runs quite swiftly all the same.
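
If even the list built by sample() is a concern, a hedged alternative is the rejection-sampling approach described earlier on this page: draw flat indices at random and skip repeats, so memory stays proportional to sample_size. A sketch, reusing x_count and y_count from the snippet above:

from random import randrange

def get_random_item_lowmem(sample_size):
    # Rejection sampling: memory is O(sample_size) instead of a full list.
    seen = set()
    while len(seen) < sample_size:
        i = randrange(y_count * x_count)
        if i in seen:
            continue               # duplicate flat index; draw again
        seen.add(i)
        yield divmod(i, x_count)   # (x, y), as in get_random_item()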

Random iteration in Python

You can use random.shuffle() to, well, shuffle a list:

import random

r = list(range(1000))
random.shuffle(r)
for i in r:
    pass  # do something with i

By the way, in many cases where you'd use a for loop over a range of integers in other programming languages, you can directly describe the "thing" you want to iterate over in Python.

For example, if you want to use the values of i to access elements of a list, you're better off shuffling the list directly:

lst = [1970, 1991, 2012]
random.shuffle(lst)
for x in lst:
    print(x)

NOTE: You should bear the following warning in mind when using random.shuffle() (taken from the docs):

Note that for even rather small len(x), the total number of permutations of x is larger than the period of most random number generators; this implies that most permutations of a long sequence can never be generated.

Python generator that groups another iterable into groups of N

See the grouper recipe in the docs for the itertools package.

from itertools import zip_longest

def grouper(n, iterable, fillvalue=None):
    "grouper(3, 'ABCDEFG', 'x') --> ABC DEF Gxx"
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)
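
For example:

>>> list(grouper(3, 'ABCDEFG', 'x'))
[('A', 'B', 'C'), ('D', 'E', 'F'), ('G', 'x', 'x')]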

(However, this is a duplicate of quite a few questions.)


