Python Random Sample with a Generator/Iterable/Iterator

While Martijn Pieters's answer is correct, it does slow down when samplesize becomes large, because using list.insert in a loop has quadratic complexity.

Here's an alternative that, in my opinion, preserves the uniformity while increasing performance:

import random

def iter_sample_fast(iterable, samplesize):
    results = []
    iterator = iter(iterable)
    # Fill in the first samplesize elements:
    try:
        for _ in range(samplesize):
            results.append(next(iterator))
    except StopIteration:
        raise ValueError("Sample larger than population.")
    random.shuffle(results)  # Randomize their positions
    for i, v in enumerate(iterator, samplesize):
        r = random.randint(0, i)
        if r < samplesize:
            results[r] = v  # at a decreasing rate, replace random items
    return results
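
As a quick usage check (assuming the function above is in scope), you can sample from a generator that is never materialized as a list:

gen = (x * x for x in range(1_000_000))
print(iter_sample_fast(gen, 5))  # five squares chosen uniformly at random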

The difference slowly starts to show for samplesize values above 10000. Times for calling with (1000000, 100000):

  • iterSample: 5.05s
  • iter_sample_fast: 2.64s

Python itertools create iterator of random subset

Just produce random combinations, tracking what you've seen before:

import random

def random_combinations(matrix, size):
    seen = set()
    n = len(matrix)
    while True:
        # Sort the sampled indices so each combination has one canonical form
        new_sample = tuple(sorted(random.sample(range(n), size)))
        if new_sample not in seen:
            seen.add(new_sample)
            yield tuple(matrix[i] for i in new_sample)

Iterating through all possible combinations just to sample them is not efficient; you would still end up generating all 10^14 combinations.

The above generator picks a new random combination each time you iterate; if you need a certain number of them, use a loop or itertools.islice(). Picking 10 random combinations would be:

from itertools import islice

combinations_sample = list(islice(random_combinations(matrix, 50), 10))

You may have misunderstood what the function you found does; it does much the same as my function above but produces just the one random combination, without tracking what was produced before. You were supposed to use it on matrix, not on all combinations of matrix.
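
For reference, here is a minimal sketch of such a one-shot function, modeled on the random_combination recipe in the itertools docs:

import random

def random_combination(iterable, r):
    # One random size-r combination, with no memory of earlier picks.
    pool = tuple(iterable)
    indices = sorted(random.sample(range(len(pool)), r))
    return tuple(pool[i] for i in indices)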

Random sampling a large Cartesian product of iterables

Which version of Python are you using? Somewhere along the way, .next() methods were deprecated in favor of a new next() built-in function, which works fine with all iterators. Here, for example, under the currently released 3.10.1:

>>> import itertools
>>> itp = itertools.product(range(5), repeat=6)
>>> next(itp)
(0, 0, 0, 0, 0, 0)
>>> next(itp)
(0, 0, 0, 0, 0, 1)
>>> next(itp)
(0, 0, 0, 0, 0, 2)
>>> next(itp)
(0, 0, 0, 0, 0, 3)
>>> for ignore in range(50):
...     ignore = next(itp)
...
>>> next(itp)
(0, 0, 0, 2, 0, 4)

Beyond that, you didn't show us the most important part of your code: how you build your product.

Without seeing that, I can only guess that it would be far more efficient to make a random choice from the first sequence passed to product(), then another from the second, and so on. Build a random element of the product from one component at a time.
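
A minimal sketch of that idea, with placeholder sequences standing in for whatever was actually passed to product():

import random

seqs = [range(5), "abcde", [1.5, 2.5, 3.5]]  # hypothetical inputs

# Build one uniformly random element of the product, one component at a
# time; no need to create or iterate the product itself.
random_element = tuple(random.choice(seq) for seq in seqs)
print(random_element)  # e.g. (3, 'b', 2.5)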

Picking a random product tuple efficiently

Perhaps overkill, but this class shows an especially efficient way to do this. The .index() method maps an integer i to the i'th tuple (0-based) in the product. Then picking a random tuple from the product is simply applying .index() to a random integer in range(total number of elements in the product).

from math import prod
from random import randrange

class RanProduct:
    def __init__(self, iterables):
        self.its = list(map(list, iterables))
        self.n = prod(map(len, self.its))

    def index(self, i):
        # Mixed-radix decomposition: peel off one index per iterable.
        if i not in range(self.n):
            raise ValueError(f"index {i} not in range({self.n})")
        result = []
        for it in reversed(self.its):
            i, r = divmod(i, len(it))
            result.append(it[r])
        return tuple(reversed(result))

    def pickran(self):
        return self.index(randrange(self.n))

and then

>>> r = RanProduct(["abc", range(2)])
>>> for i in range(6):
...     print(i, '->', r.index(i))
...
0 -> ('a', 0)
1 -> ('a', 1)
2 -> ('b', 0)
3 -> ('b', 1)
4 -> ('c', 0)
5 -> ('c', 1)
>>> r = RanProduct([range(10)] * 19)
>>> r.pickran()
(3, 5, 8, 8, 3, 6, 7, 6, 8, 6, 2, 0, 5, 6, 1, 0, 0, 8, 2)
>>> r.pickran()
(4, 5, 0, 5, 7, 1, 7, 2, 7, 4, 8, 4, 2, 0, 2, 9, 3, 6, 2)
>>> r.pickran()
(8, 7, 4, 1, 3, 0, 4, 6, 4, 3, 9, 8, 5, 8, 9, 9, 7, 1, 8)
>>> r.pickran()
(8, 6, 6, 0, 6, 7, 1, 3, 9, 5, 1, 4, 5, 8, 6, 8, 4, 9, 9)
>>> r.pickran()
(4, 9, 4, 7, 1, 5, 5, 1, 6, 7, 1, 8, 9, 0, 7, 9, 1, 7, 0)
>>> r.pickran()
(3, 0, 3, 9, 8, 6, 3, 0, 3, 0, 9, 9, 3, 5, 2, 3, 7, 8, 8)

Randomly sampling from large combination generator

From what you describe, I believe you'd have a much more effective algorithm if you pick each component randomly, independently of the others, and continue until you have the requisite sample. RNGs (random number generators) are quite fast, fast enough to make up for needing to replace the occasional duplicate. Store your chosen combinations as a set of tuples (tuples are hashable), and you can test set membership in constant time, making the whole collection process linear in the sample size. Something like this:

from random import randint

# For illustration, the "lists" include letters, symbols, 3-letter words, and low primes
list1 = "pythonic"
list2 = "~!@#$%^&*()"
list3 = ["dog", "cat", "ape", "red", "cwm", "pox"]
list4 = [2, 3, 5, 7, 11, 13, 17, 19]

combo = [list1, list2, list3, list4]
my_sample = set()
needed_size = 10

while len(my_sample) < needed_size:
    # Choose one random item from each list; that forms an element
    elem = tuple(comp[randint(0, len(comp) - 1)] for comp in combo)
    # Using a set eliminates duplicates easily
    my_sample.add(elem)

print(my_sample)

Output:

{('h', '$', 'pox', 7),
('y', '(', 'cat', 11),
('n', '@', 'cat', 7),
('i', '^', 'ape', 13),
('y', '#', 'pox', 11),
('o', '%', 'dog', 7),
('p', '^', 'cwm', 13),
('c', '*', 'dog', 19),
('o', ')', 'pox', 11),
('h', '~', 'cat', 5)}

Another possibility is to generate one random number in the range of the product of the lengths (8 * 11 * 6 * 8 in this case), and then use integer division and mod to break it into your four random indices, as sketched below.
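
A minimal sketch of that divmod decomposition, reusing the four lists from the snippet above (nth_combination is my name for the hypothetical helper; it works much like the RanProduct.index() method earlier on this page):

from math import prod
from random import randrange

lists = [list1, list2, list3, list4]   # from the snippet above
total = prod(len(l) for l in lists)    # product of the list lengths

def nth_combination(i):
    # Peel off one index per list via divmod, least-significant list last.
    out = []
    for l in reversed(lists):
        i, r = divmod(i, len(l))
        out.append(l[r])
    return tuple(reversed(out))

print(nth_combination(randrange(total)))  # one uniform random combination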

One more possibility is to simply generate your first set of random indices, and then increment each of them in turn, stepping through the lists cyclically. You will want your list lengths to be pairwise coprime in this case; you can guarantee that by appending None elements as needed. Any combination containing a None is discarded. See the sketch below.
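
A minimal sketch of that coprime-stepping idea, assuming None never occurs as a legitimate list element (the padding loop and helper name are my own illustration):

from math import gcd
from random import randrange

def coprime_step_combos(lists):
    # Pad each list with None until all lengths are pairwise coprime;
    # by the Chinese remainder theorem, stepping every index by 1 then
    # visits every index combination once before the cycle repeats.
    lists = [list(l) for l in lists]
    for k in range(1, len(lists)):
        while any(gcd(len(lists[k]), len(lists[j])) != 1 for j in range(k)):
            lists[k].append(None)
    idx = [randrange(len(l)) for l in lists]  # random starting point
    while True:
        combo = tuple(l[i] for l, i in zip(lists, idx))
        if None not in combo:                 # skip padded combinations
            yield combo
        idx = [(i + 1) % len(l) for l, i in zip(lists, idx)]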

Do those ideas get you moving?

Memory-efficient random number iterator without replacement

How about this approach? First I create the x*y array and reshape it to 2-D. Then, since each cell can be uniquely identified by a single integer, I draw a sample from range(x*y).

import numpy
from random import sample

x_count = 10000
y_count = 20000

large_table = numpy.arange(y_count * x_count).reshape(y_count, x_count)
print(large_table)

def get_random_item(sample_size):
    # Draw unique flat indices, then convert each back to 2-D coordinates.
    # Note: x here is the row index (y_count rows), y the column (x_count columns).
    for i in sample(range(y_count * x_count), sample_size):
        x, y = divmod(i, x_count)
        yield (x, y)

for x, y in get_random_item(10):
    print('%12i x: %5i y: %5i' % (large_table[x][y], x, y))

Which returns:

(first to simulate your existing 2-D array you created via product)

[[        0         1         2 ...,      9997      9998      9999]
[ 10000 10001 10002 ..., 19997 19998 19999]
[ 20000 20001 20002 ..., 29997 29998 29999]
...,
[199970000 199970001 199970002 ..., 199979997 199979998 199979999]
[199980000 199980001 199980002 ..., 199989997 199989998 199989999]
[199990000 199990001 199990002 ..., 199999997 199999998 199999999]]

Then it returns the 2-D coordinates, which can be translated back into your cell contents simply via large_table[x][y]:

   154080675 x: 15408 y:   675
   186978188 x: 18697 y:  8188
   157506087 x: 15750 y:  6087
   168859259 x: 16885 y:  9259
    29775768 x:  2977 y:  5768
    94167866 x:  9416 y:  7866
    15978144 x:  1597 y:  8144
    91964007 x:  9196 y:  4007
   163462830 x: 16346 y:  2830
    62613129 x:  6261 y:  3129

The documentation for sample() states that it is 'used for random sampling without replacement', and this approach adheres to the advice found on the Python random page: 'This is especially fast and space efficient for sampling from a large population: sample(xrange(10000000), 60).'

I note that while I use get_random_item() as a generator, the underlying sample() still produces a full list, so the memory use is still y*x + sample_size; but it runs quite swiftly all the same.
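
If even the list built by sample() is a concern, a hedged alternative is the rejection-sampling approach described earlier on this page: draw flat indices at random and skip repeats, so memory stays proportional to sample_size. A sketch, reusing x_count and y_count from the snippet above:

from random import randrange

def get_random_item_lowmem(sample_size):
    # Rejection sampling: memory is O(sample_size) instead of a full list.
    seen = set()
    while len(seen) < sample_size:
        i = randrange(y_count * x_count)
        if i in seen:
            continue               # duplicate flat index; draw again
        seen.add(i)
        yield divmod(i, x_count)   # (x, y), as in get_random_item()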

Random iteration in Python

You can use random.shuffle() to, well, shuffle a list:

import random

r = list(range(1000))
random.shuffle(r)
for i in r:
    pass  # do something with i

By the way, in many cases where you'd use a for loop over a range of integers in other programming languages, you can directly describe the "thing" you want to iterate over in Python.

For example, if you want to use the values of i to access elements of a list, you're better off shuffling the list directly:

lst = [1970, 1991, 2012]
random.shuffle(lst)
for x in lst:
    print(x)

NOTE: You should bear the following warning in mind when using random.shuffle() (taken from the docs):

Note that for even rather small len(x), the total number of permutations of x is larger than the period of most random number generators; this implies that most permutations of a long sequence can never be generated.

Python generator that groups another iterable into groups of N

See the grouper recipe in the docs for the itertools package.

from itertools import zip_longest

def grouper(n, iterable, fillvalue=None):
    "grouper(3, 'ABCDEFG', 'x') --> ABC DEF Gxx"
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)
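
For example:

>>> list(grouper(3, 'ABCDEFG', 'x'))
[('A', 'B', 'C'), ('D', 'E', 'F'), ('G', 'x', 'x')]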

(However, this is a duplicate of quite a few questions.)


