How to Clone a Python Generator Object

How to clone a Python generator object?

You can use itertools.tee():

import itertools

walk, walk2 = itertools.tee(walk)

Note that this "may require significant auxiliary storage", as the documentation points out.
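A minimal, self-contained sketch of the idea (walk_tree is a stand-in here; any generator works the same way):

```python
import itertools

def walk_tree():
    # Stand-in for the original `walk` generator.
    yield from ["a", "b", "c"]

walk = walk_tree()
walk, walk2 = itertools.tee(walk)

print(next(walk))   # a
print(next(walk2))  # a -- the clone replays the same values
```

After tee(), advance only the two returned iterators, not the original.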

Cloning generators in Python without Tee

Is it saying to convert the generator to a list, copy that and convert
both the lists to iterators/generators?

Yes, it's just as you say (except you don't need to copy the list):

Let's say you have a generator and you want to make five copies of it. Each of these copies needs to yield the same values as your original generator. A simple solution would be to get all the values from your generator, for example by converting it to a list, and then using this list to produce new generators:

def orig(n):
    yield from range(n)

orig_gen = orig(100)

for i in range(90):
    next(orig_gen)

# now we have 10 values left in orig_gen
values_left = list(orig_gen)

def copy():
    yield from values_left

copy_gen1 = copy()
copy_gen2 = copy()
print(next(copy_gen1))  # 90
print(next(copy_gen2))  # 90

This can become very expensive though. The purpose of a generator is to produce new values dynamically. If you convert a generator to a list, you have to do all the calculations necessary to get those values. Also, if the generator produces many values, you will end up using huge amounts of memory.

That's why tee() offers a buffered approach. You specify how many copies of the generator you want, and tee() sets up a deque (a double-ended queue with fast appends and pops) for each copy. When you request a new value from one of your copied generators, it is taken from that copy's buffer. Only if the buffer is empty is a new value produced from the original generator. A roughly equivalent pure-Python implementation is given in the docs. Let's say you want to have 5 copies; using tee() could look like this:

  • create the original generator and use it a bit
  • create 5 copies with tee(); each copy starts with an empty buffer
  • call next() on copy 1. Since its buffer is empty, we call next() on the original generator. The value is added to all the buffers, immediately popped from buffer 1, and returned.
  • call next() on copy 2. The buffer for copy 2 already contains this value, so we pop it from the buffer and return it. No use of the original generator is necessary.
  • call next() on copy 1 again. In the previous step we didn't have to use the original generator, so buffer 1 is still empty. We call next() on the original generator; the value is added to all buffers, immediately popped from buffer 1, and returned.
  • call next() on copy 3 twice. Both times we can simply read the value from its buffer.

If you call next() on the copies at roughly the same frequency, this is very efficient: the buffers don't grow, since values are removed from them evenly, so you don't have to keep many values in memory.

But if you only use one of the copies, the buffers for the other copies grow larger and larger. In the extreme case, if you exhaust copy 1 before touching the other copies, their buffers become lists holding all the values from the generator. Now you have 4 lists with all the values, instead of the single list of the simple approach.
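This uneven-consumption case is easy to see directly (the buffering is internal to tee(), but its effect is visible):

```python
import itertools

def numbers():
    yield from range(5)

a, b = itertools.tee(numbers())

# Exhausting copy a first forces tee() to buffer every value for copy b.
a_vals = list(a)
# Copy b then reads entirely from its fully grown internal buffer.
b_vals = list(b)
print(a_vals, b_vals)  # [0, 1, 2, 3, 4] [0, 1, 2, 3, 4]
```

If you know your access pattern looks like this, list() is the simpler and cheaper tool.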

deep-copying a generator in python

There are three cases I can think of:

  • Generator has no side effects, and you just want to be able to walk back through results you've already captured. You could consider a cached generator instead of a true generator. You can share the cached generator around as well, and if any client walks to an item you haven't been to yet, it will advance. This is similar to the tee() method, but does the tee functionality in the generator/cache itself instead of requiring the client to do it.

  • Generator has side effects, but no history, and you want to be able to restart anywhere. Consider writing it as a coroutine, where you can pass in the value to start at any time.

  • Generator has side effects AND history, meaning that the state of the generator at G(x) depends on the results of G(x-1), and so you can't just pass x back into it to start anywhere. In this case, I think you'd need to be more specific about what you are trying to do, as the result depends not just on the generator, but on the state of other data. Probably, in this case, there is a better way to do it.
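The first case (a cached generator) can be sketched as follows; CachedGenerator is a hypothetical name, not a library class:

```python
class CachedGenerator:
    """Wraps a generator. Every iterator replays the cached values first;
    the first iterator to reach uncached territory advances the source."""

    def __init__(self, source):
        self._source = source
        self._cache = []

    def __iter__(self):
        i = 0
        while True:
            if i < len(self._cache):
                # Replay a value some iterator has already produced.
                yield self._cache[i]
            else:
                # We're at the frontier: advance the shared source.
                try:
                    value = next(self._source)
                except StopIteration:
                    return
                self._cache.append(value)
                yield value
            i += 1

cg = CachedGenerator(iter(range(3)))
it1 = iter(cg)
it2 = iter(cg)
print(next(it1), next(it1))  # 0 1
print(list(it2))             # [0, 1, 2] -- replays the cache, then advances
print(list(it1))             # [2]
```

Unlike tee(), the cache never discards values, so any number of clients created at any time can replay from the start.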

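For the second case, a coroutine-style restart might look like this hypothetical sketch, where sending a value jumps the generator to a new starting point:

```python
def counter():
    # Yields consecutive values; send() a number to restart counting from
    # that value (the "side effects but no history" case).
    value = 0
    while True:
        jump = yield value
        value = jump if jump is not None else value + 1

c = counter()
print(next(c))      # 0
print(next(c))      # 1
print(c.send(100))  # 100 -- restarts at the sent value
print(next(c))      # 101
```

This works because the generator's state is fully described by the value you pass in; with history-dependent state (the third case), no single sent value can reconstruct it.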
How to create a copy of python iterator?

Use the itertools.tee() function to produce copies; these use a buffer to share results between different iterators:

from itertools import tee

my_list = [5, 4, 3, 2]
first_it = iter(my_list)
first_it, second_it = tee(first_it)
print(next(first_it))   # prints 5
print(next(second_it))  # prints 5
print(next(first_it))   # prints 4

Note that you should no longer use the original iterator; use only the tees.

Note that the buffer also means that these can incur a significant memory cost if you advance one of the copies far ahead of the others! From the documentation:

This itertool may require significant auxiliary storage (depending on how much temporary data needs to be stored). In general, if one iterator uses most or all of the data before another iterator starts, it is faster to use list() instead of tee().

Resetting generator object in Python

Another option is to use the itertools.tee() function to create a second version of your generator:

import itertools
y = FunctionWithYield()
y, y_backup = itertools.tee(y)
for x in y:
    print(x)
for x in y_backup:
    print(x)

This can be beneficial from a memory-usage point of view if the original iteration might not process all the items.

How to pick just one item from a generator?

Create a generator using

g = myfunct()

Every time you would like an item, use

next(g)

(or g.next() in Python 2.5 and earlier, before the next() built-in was added).

When the generator is exhausted, next() raises StopIteration. You can either catch this exception if necessary, or use the default argument to next():

next(g, default_value)
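Both styles side by side, with a small stand-in for myfunct:

```python
def myfunct():
    yield from [10, 20]

g = myfunct()
print(next(g))          # 10
print(next(g))          # 20
print(next(g, "done"))  # done -- the default is returned instead of raising

# The same consumption written with explicit exception handling:
g2 = myfunct()
while True:
    try:
        item = next(g2)
    except StopIteration:
        break
    print(item)
```

The default-argument form is usually the cleaner choice when you only need a sentinel for "no more items".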

Convert generator object to list for debugging

Simply call list on the generator.

lst = list(gen)
lst

Be aware that this exhausts the generator, which will not yield any further items afterwards.

You also cannot directly type list(g1) at an ipdb/pdb prompt, as list conflicts with the debugger command for listing lines of code.

Tested on this file:

def gen():
    yield 1
    yield 2
    yield 3
    yield 4
    yield 5

import ipdb
ipdb.set_trace()

g1 = gen()

text = "aha" + "bebe"

mylst = range(10, 20)

which when run:

$ python code.py 
> /home/javl/sandbox/so/debug/code.py(10)<module>()
9
---> 10 g1 = gen()
11

ipdb> n
> /home/javl/sandbox/so/debug/code.py(12)<module>()
11
---> 12 text = "aha" + "bebe"
13

ipdb> lst = list(g1)
ipdb> lst
[1, 2, 3, 4, 5]
ipdb> q
Exiting Debugger.

General method for escaping function/variable/debugger name conflicts

There are debugger commands p and pp that will print and prettyprint any expression following them.

So you could use it as follows:

$ python code.py 
> /home/javl/sandbox/so/debug/code.py(10)<module>()
9
---> 10 g1 = gen()
11

ipdb> n
> /home/javl/sandbox/so/debug/code.py(12)<module>()
11
---> 12 text = "aha" + "bebe"
13

ipdb> p list(g1)
[1, 2, 3, 4, 5]
ipdb> c

There is also an exec command, invoked by prefixing your expression with !, which forces the debugger to treat your input as a Python statement.

ipdb> !list(g1)
[]

(The list is empty here because the generator had already been exhausted by an earlier list(g1) call.)

For more details see help p, help pp and help exec when in debugger.

ipdb> help exec
(!) statement
Execute the (one-line) statement in the context of
the current stack frame.
The exclamation point can be omitted unless the first word
of the statement resembles a debugger command.
To assign to a global variable you must always prefix the
command with a 'global' command, e.g.:
(Pdb) global list_options; list_options = ['-l']

Cloning Babeltrace events from generator for random-access traversal

Babeltrace co-maintainer here.

Indeed, Babeltrace 1 reuses the same event record object for each iteration step. This means you cannot keep an "old" event record alive as its data changes behind the scenes.

The Python bindings of Babeltrace 1 are rudimentary wrappers of the library objects, so the same constraints apply. Also, Babeltrace 1 doesn't offer any event record copying function, so anything like copy.copy() will only copy internal pointers, which will then exhibit the same issue.

Babeltrace (1 and 2) iterators cannot go backwards for performance reasons (more about this below).

The only solution I see is making your own event record copying function, keeping what's necessary in another instance of your own class. After all, you probably only need the name, timestamp, and some first-level fields of the event record.
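Such a copying helper can be a plain Python class that eagerly copies values out of the reused record. A minimal sketch, where EventSnapshot and all the event/field names are illustrative (with the real bindings you would read evt.name, evt.timestamp, and the payload fields you need):

```python
class EventSnapshot:
    """Keeps a stable copy of the pieces of an event record you need,
    since Babeltrace 1 reuses the same event object on each step."""

    def __init__(self, name, timestamp, fields):
        self.name = name
        self.timestamp = timestamp
        self.fields = dict(fields)  # own copy, detached from the record

snapshots = []
# Plain tuples stand in here for the reused event object.
for name, ts, fields in [("sched_switch", 100, {"cpu_id": 0}),
                         ("sched_switch", 105, {"cpu_id": 1})]:
    snapshots.append(EventSnapshot(name, ts, fields))

print([s.timestamp for s in snapshots])  # both snapshots stay valid
```

Copying only the first-level fields you actually use keeps the per-event cost small even for large traces.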

But Babeltrace 2 is what you're looking for, especially since we don't maintain Babeltrace 1 anymore (except for critical/security bug fixes).

Babeltrace 2 offers a rich and consistent C API where many objects have a reference count and therefore can live as long as you like. The Babeltrace 2 Python bindings wrap this C API so that you can benefit from the same features.

While the C API documentation is complete, the Python bindings' documentation unfortunately is not. However, we have this, which at least shows some examples to get you started.

About your comment:

since it seems the events are a kind of linked list where one could walk backward

No, you cannot. This is to accommodate limitations of some trace formats, in particular CTF (the format which LTTng uses). A CTF packet is a sequence of serialized binary event records: to decode event record N, you need to decode event record N - 1 first, and so on. A CTF packet can contain thousands of contiguous event records like this, CTF data streams can contain thousands of packets, and a CTF trace can contain many data streams. Knowing this, there would be no reasonable way to store the offsets of all the encoded CTF event records so that you can iterate backwards without heavy object copies.

What you can do however with Babeltrace 2 is keep the specific event record objects you need, without any copy.

In the future, we'd like a way to copy a message iterator, duplicating all its state and what's needed to continue behind the scenes. This would make it possible to keep "checkpoint iterators" so that you can go back to previous event records if you can't perform your analysis in one pass for some reason.

Note that you can also make a message iterator seek a specific timestamp, but "fast" seeking is not implemented as of this date in the ctf plugin (the iterator seeks the beginning of the message sequence and then advances until it reaches the requested timestamp, which is not efficient).


