Split a Generator into Chunks Without Pre-Walking It

One way would be to peek at the first element, if any, and then create and return the actual generator.

def head(iterable, max=10):
    first = next(iterable)      # raises StopIteration when depleted

    def head_inner():
        yield first             # yield the extracted first element
        for cnt, el in enumerate(iterable):
            yield el
            if cnt + 2 >= max:  # first plus the cnt + 1 loop elements reach max
                break

    return head_inner()

Just use this in your chunk generator and catch the StopIteration exception like you did with your custom exception.
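A minimal sketch of that wiring (my own illustration, not part of the original answer): keep taking heads off the same iterator until next() signals exhaustion.

def chunks(iterable, size=10):
    iterator = iter(iterable)
    while True:
        try:
            yield head(iterator, size)   # head() raises StopIteration when nothing is left
        except StopIteration:
            return

As with the islice versions below, each yielded chunk has to be consumed fully before the next one is requested, or its leftover elements will spill into the following chunk.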


Update: Here's another version, using itertools.islice and a for loop to replace most of the head function. The simple for loop does exactly the same thing as the unwieldy while-try-next-except-break construct in the original code, but the result is much more readable.

from itertools import islice

def chunks(iterable, size=10):
    iterator = iter(iterable)
    for first in iterator:        # stops when iterator is depleted
        def chunk():              # construct generator for next chunk
            yield first           # yield element from for loop
            for more in islice(iterator, size - 1):
                yield more        # yield more elements from the iterator
        yield chunk()             # in outer generator, yield next chunk

And we can get even shorter than that, using itertools.chain to replace the inner generator:

from itertools import chain, islice

def chunks(iterable, size=10):
    iterator = iter(iterable)
    for first in iterator:
        yield chain([first], islice(iterator, size - 1))
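A quick usage sketch with made-up data, using the chunks() defined just above; each chunk is itself a lazy iterator, so consume it before asking for the next one:

def numbers():              # stand-in source generator for the demo
    yield from range(7)

for chunk in chunks(numbers(), size=3):
    print(list(chunk))
# [0, 1, 2]
# [3, 4, 5]
# [6]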

Chunking a generator

Each time you call g() you restart the generator from the beginning. You need to assign the result to a variable so it will keep using the same generator.

And as mentioned in a comment, the islice object is always truthy. To tell if you reached the end, check whether the for c in chunk: loop did anything.

from itertools import islice

def g():
    for x in range(11):
        print("generating: ", x)
        yield x

size = 2
gen = g()
while True:
    chunk = islice(gen, size)

    print("at chunk")
    empty = True
    for c in chunk:
        print(c)
        empty = False

    if empty:
        break
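For reference, running this produces output along the following lines (middle chunks elided), showing that values are generated lazily per chunk and that the final empty chunk ends the loop:

at chunk
generating:  0
0
generating:  1
1
at chunk
generating:  2
2
generating:  3
3
...
at chunk
generating:  10
10
at chunk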

How to split an iterable in constant-size chunks

This is probably more efficient (faster) than the generator-based approaches above, since each chunk is produced with a single slice instead of element-by-element iteration. Note that it requires a sequence that supports len() and slicing (a list, range, etc.), not an arbitrary generator.

def batch(iterable, n=1):
    l = len(iterable)
    for ndx in range(0, l, n):
        yield iterable[ndx:min(ndx + n, l)]

for x in batch(range(0, 10), 3):
    print(x)

Example using list

data = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10] # list of data 

for x in batch(data, 3):
print(x)

# Output

[0, 1, 2]
[3, 4, 5]
[6, 7, 8]
[9, 10]

Being a generator, it also avoids building a list of all the chunks up front; each chunk is produced lazily as a single slice.
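As a small illustration of the sequence requirement mentioned above (my own addition, not part of the original answer), passing a plain generator fails because it has no length:

gen = (x * x for x in range(10))
try:
    next(batch(gen, 3))
except TypeError as err:
    print(err)   # e.g. "object of type 'generator' has no len()"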

Split list into chunks with repeats between chunks

Something like this, with a list comprehension:

[l[i*(M-m):i*(M-m)+M] for i in range(math.ceil((len(l)-m)/(M-m)))]

Example:

import math
l = list(range(15))
# [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]
m, M = 2, 5
[l[i*(M-m):i*(M-m)+M] for i in range(math.ceil((len(l)-m)/(M-m)))]
# [[0, 1, 2, 3, 4],
# [3, 4, 5, 6, 7],
# [6, 7, 8, 9, 10],
# [9, 10, 11, 12, 13],
# [12, 13, 14]]

m, M = 3, 5
[l[i*(M-m):i*(M-m)+M] for i in range(math.ceil((len(l)-m)/(M-m)))]
# [[0, 1, 2, 3, 4],
# [2, 3, 4, 5, 6],
# [4, 5, 6, 7, 8],
# [6, 7, 8, 9, 10],
# [8, 9, 10, 11, 12],
# [10, 11, 12, 13, 14]]

l = range(5)
m, M = 2, 3
[l[i*(M-m):i*(M-m)+M] for i in range(math.ceil((len(l)-m)/(M-m)))]
# [range(0, 3), range(1, 4), range(2, 5)]

Explanation:

Chunk i starts at index i*(M-m) and ends M positions later at index i*(M-m) + M.

chunk index    starts                     ends
---------------------------------------------------------------
0              0                          M
1              M - m                      (M - m) + M = 2M - m
2              (2M - m) - m = 2(M - m)    2(M - m) + M = 3M - 2m
...

Now the problem is to determine how many chunks.

At each step we increase the initial index by M-m, so to count the total number of steps we need to divide the length of the list by M-m (but after subtracting m because in the first chunk we're not skipping anything).

Finally, use the ceiling function to add the last incomplete chunk in case the division is not exact.
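As a quick sanity check against the first example above (a small addition, just re-deriving the count):

import math

l, m, M = list(range(15)), 2, 5
print(math.ceil((len(l) - m) / (M - m)))   # ceil((15 - 2) / 3) = ceil(13 / 3) = 5 chunks, matching the five sub-lists shown above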

Iterating over an iterable using a generator function

Try initializing the tuple inside the for loop:

def bunch_together(iterable, n):
    for k in range(0, len(iterable), n):
        tup = tuple()
        for i in range(k, k + n):
            tup += (iterable[i] if i < len(iterable) else None,)  # pad with None past the end
        yield tup

for x in bunch_together(range(10), 3):
    print(x)

Output

(0, 1, 2)
(3, 4, 5)
(6, 7, 8)
(9, None, None)

Split dataframe into relatively even chunks according to length

You can take the floor division of a sequence running up to the number of rows in the dataframe and use it as the groupby key, which splits the dataframe into (roughly) equally sized chunks:

import numpy as np

n = 400
for g, df in test.groupby(np.arange(len(test)) // n):
    print(df.shape)
# (400, 2)
# (400, 2)
# (311, 2)
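For a self-contained illustration (the test frame below is made up; the asker's original DataFrame isn't shown):

import numpy as np
import pandas as pd

test = pd.DataFrame({"a": range(1111), "b": range(1111)})   # hypothetical stand-in data

n = 400
for g, df in test.groupby(np.arange(len(test)) // n):
    print(g, df.shape)
# 0 (400, 2)
# 1 (400, 2)
# 2 (311, 2)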

How to randomly split a generator into two generators by given ratio?

This solution doesn't store values. It sets up two wrapper generators driven by identical random number streams. Both share the same cutoff percentage: one only yields values whose random draw falls below it, the other only yields values whose draw is at or above it:

from random import Random

def percentage_generators(generator, percentage):

    def generator_1(state):
        twister = Random()
        twister.setstate(state)

        for value in generator():
            if twister.random() < percentage:
                yield value

    def generator_2(state):
        twister = Random()
        twister.setstate(state)

        for value in generator():
            if twister.random() >= percentage:
                yield value

    state = Random().getstate()

    return [generator_1(state), generator_2(state)]

if __name__ == "__main__":

    def test_generator():
        for n in range(20):
            yield n

    generator1, generator2 = percentage_generators(test_generator, 0.7)

    for number in generator1:
        print(1, number)

    print()

    for number in generator2:
        print(2, number)

OUTPUT

% python3 test.py
1 0
1 1
1 2
1 3
1 6
1 7
1 8
1 10
1 11
1 12
1 13
1 14
1 15
1 17

2 4
2 5
2 9
2 16
2 18
2 19
%

The code can probably be reduced by generating the generator wrappers via a loop, i.e. looping over operator.lt and operator.ge, or some such.
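A rough sketch of that reduction (my own variant, keeping the assumption that generator is a restartable generator function):

import operator
from random import Random

def percentage_generators(generator, percentage):
    state = Random().getstate()

    def wrap(compare):                   # one wrapper per comparison operator
        twister = Random()
        twister.setstate(state)          # both wrappers replay the same random stream
        for value in generator():
            if compare(twister.random(), percentage):
                yield value

    return [wrap(op) for op in (operator.lt, operator.ge)]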


