Understanding Generators in Python

Note: this post assumes Python 3.x syntax.

A generator is simply a function which returns an object on which you can call next, such that for every call it returns some value, until it raises a StopIteration exception, signaling that all values have been generated. Such an object is called an iterator.

Normal functions return a single value using return, just like in Java. In Python, however, there is an alternative, called yield. Using yield anywhere in a function makes it a generator. Observe this code:

>>> def myGen(n):
...     yield n
...     yield n + 1
...
>>> g = myGen(6)
>>> next(g)
6
>>> next(g)
7
>>> next(g)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
StopIteration

As you can see, myGen(n) is a function which yields n and n + 1. Every call to next yields a single value, until all values have been yielded. for loops call next in the background, thus:

>>> for n in myGen(6):
...     print(n)
...
6
7
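
Under the hood, a for loop does roughly the following (a sketch of the mechanism, not code you would normally write):

it = iter(myGen(6))   # for a generator, iter() returns the object itself
while True:
    try:
        n = next(it)          # fetch the next value
    except StopIteration:     # the for loop ends silently on this exception
        break
    print(n)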

Likewise there are generator expressions, which provide a means to succinctly describe certain common types of generators:

>>> g = (n for n in range(3, 5))
>>> next(g)
3
>>> next(g)
4
>>> next(g)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
StopIteration

Note that generator expressions are much like list comprehensions:

>>> lc = [n for n in range(3, 5)]
>>> lc
[3, 4]
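
A generator expression can also be passed directly to any function that consumes an iterable, without building a list first (a quick sketch):

>>> sum(n for n in range(3, 5))
7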

Observe that a generator object is created once, but its code is not run all at once. Only calls to next actually execute (part of) the code. Execution of the code in a generator stops once a yield statement has been reached, upon which it returns a value. The next call to next then causes execution to continue in the state in which the generator was left after the last yield. This is a fundamental difference from regular functions: those always start execution at the "top" and discard their state upon returning a value.
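
You can watch this lazy execution directly by adding print calls (a small sketch):

>>> def noisy():
...     print("running up to the first yield")
...     yield 1
...     print("resumed; running up to the second yield")
...     yield 2
...
>>> g = noisy()   # no output: none of the body has run yet
>>> next(g)
running up to the first yield
1
>>> next(g)
resumed; running up to the second yield
2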

There are more things to be said about this subject. It is e.g. possible to send data back into a generator, via its send() method. But that is something I suggest you do not look into until you understand the basic concept of a generator.

Now you may ask: why use generators? There are a couple of good reasons:

  • Certain concepts can be described much more succinctly using generators.
  • Instead of creating a function which returns a list of values, one can write a generator which generates the values on the fly. This means that no list needs to be constructed, meaning that the resulting code is more memory efficient. In this way one can even describe data streams which would simply be too large to fit in memory.
  • Generators allow for a natural way to describe infinite streams. Consider for example the Fibonacci numbers:

    >>> def fib():
    ...     a, b = 0, 1
    ...     while True:
    ...         yield a
    ...         a, b = b, a + b
    ...
    >>> import itertools
    >>> list(itertools.islice(fib(), 10))
    [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]

    This code uses itertools.islice to take a finite number of elements from an infinite stream. You are advised to have a good look at the functions in the itertools module, as they are essential tools for writing advanced generators with great ease.
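
    For example, itertools.takewhile cuts off an infinite stream as soon as a condition fails. Continuing the session above (a small sketch):

    >>> list(itertools.takewhile(lambda n: n < 100, fib()))
    [0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89]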


About older versions of Python: in the above examples, next is a function which calls the method __next__ on the given object. In Python 2 that method is named next rather than __next__, and the next() built-in only exists from Python 2.6 onwards (where it calls .next for you), so on Python <=2.5 one uses a slightly different technique, namely o.next() instead of next(o):

>>> g = (n for n in range(3, 5))
>>> g.next()
3

What does the yield keyword do?

To understand what yield does, you must understand what generators are. And before you can understand generators, you must understand iterables.

Iterables

When you create a list, you can read its items one by one. Reading its items one by one is called iteration:

>>> mylist = [1, 2, 3]
>>> for i in mylist:
...     print(i)
...
1
2
3

mylist is an iterable. When you use a list comprehension, you create a list, and so an iterable:

>>> mylist = [x*x for x in range(3)]
>>> for i in mylist:
...     print(i)
...
0
1
4

Everything you can use "for... in..." on is an iterable: lists, strings, files...

These iterables are handy because you can read them as much as you wish, but you store all the values in memory and this is not always what you want when you have a lot of values.

Generators

Generators are iterators, a kind of iterable you can only iterate over once. Generators do not store all the values in memory, they generate the values on the fly:

>>> mygenerator = (x*x for x in range(3))
>>> for i in mygenerator:
...     print(i)
...
0
1
4

It is just the same except you used () instead of []. BUT, you cannot perform for i in mygenerator a second time, since generators can only be used once: they calculate 0, then forget about it and calculate 1, and finally calculate 4, one by one.
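
You can see the exhaustion for yourself (a quick sketch):

>>> mygenerator = (x*x for x in range(3))
>>> list(mygenerator)
[0, 1, 4]
>>> list(mygenerator)   # already exhausted: nothing left to yield
[]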

Yield

yield is a keyword that is used like return, except the function will return a generator.

>>> def create_generator():
...     mylist = range(3)
...     for i in mylist:
...         yield i*i
...
>>> mygenerator = create_generator() # create a generator
>>> print(mygenerator) # mygenerator is an object!
<generator object create_generator at 0xb7555c34>
>>> for i in mygenerator:
...     print(i)
...
0
1
4

This example is not useful in itself, but it's handy when you know your function will return a huge set of values that you will only need to read once.

To master yield, you must understand that when you call the function, the code you have written in the function body does not run. The function only returns the generator object; this is a bit tricky.

Then, your code will continue from where it left off each time for uses the generator.

Now the hard part:

The first time the for calls the generator object created from your function, it will run the code in your function from the beginning until it hits yield, then it'll return the first value of the loop. Then, each subsequent call will run another iteration of the loop you have written in the function and return the next value. This will continue until the generator is considered empty, which happens when the function runs without hitting yield. That can be because the loop has come to an end, or because you no longer satisfy an "if/else".
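
As a small sketch of that last case, here is a generator whose yield sits behind an if, so odd numbers never produce a value:

>>> def evens_only(limit):
...     for i in range(limit):
...         if i % 2 == 0:   # yield is reached only when this condition holds
...             yield i
...
>>> list(evens_only(7))
[0, 2, 4, 6]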



Your code explained

Generator:

# Here you create the method of the node object that will return the generator
def _get_child_candidates(self, distance, min_dist, max_dist):

    # Here is the code that will be called each time you use the generator object:

    # If there is still a child of the node object on its left
    # AND if the distance is ok, return the next child
    if self._leftchild and distance - max_dist < self._median:
        yield self._leftchild

    # If there is still a child of the node object on its right
    # AND if the distance is ok, return the next child
    if self._rightchild and distance + max_dist >= self._median:
        yield self._rightchild

    # If the function arrives here, the generator will be considered empty:
    # there are no more than two values, the left and the right children

Caller:

# Create an empty list and a list with the current object reference
result, candidates = list(), [self]

# Loop on candidates (they contain only one element at the beginning)
while candidates:

    # Get the last candidate and remove it from the list
    node = candidates.pop()

    # Get the distance between obj and the candidate
    distance = node._get_dist(obj)

    # If the distance is ok, then you can fill the result
    if distance <= max_dist and distance >= min_dist:
        result.extend(node._values)

    # Add the children of the candidate to the candidates list
    # so the loop will keep running until it has looked
    # at all the children of the children of the children, etc. of the candidate
    candidates.extend(node._get_child_candidates(distance, min_dist, max_dist))

return result

This code contains several smart parts:

  • The loop iterates on a list, but the list expands while the loop is being iterated. It's a concise way to go through all this nested data, even if it's a bit dangerous since you can end up with an infinite loop. In this case, candidates.extend(node._get_child_candidates(distance, min_dist, max_dist)) exhausts all the values of the generator, but while keeps creating new generator objects, which will produce different values from the previous ones since they are not applied on the same node.

  • The extend() method is a list object method that expects an iterable and adds its values to the list.

Usually we pass a list to it:

>>> a = [1, 2]
>>> b = [3, 4]
>>> a.extend(b)
>>> print(a)
[1, 2, 3, 4]

But in your code, it gets a generator, which is good because:

  1. You don't need to read the values twice.
  2. You may have a lot of children and you don't want them all stored in memory.

And it works because Python does not care if the argument of a method is a list or not. Python expects iterables so it will work with strings, lists, tuples, and generators! This is called duck typing and is one of the reasons why Python is so cool. But this is another story, for another question...
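
For instance, extend() accepts a string or a generator just as happily as a list (a quick sketch):

>>> a = [1, 2]
>>> a.extend("ab")                    # a string is an iterable of characters
>>> a.extend(x*x for x in range(2))   # a generator works too
>>> a
[1, 2, 'a', 'b', 0, 1]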

You can stop here, or read a little bit more to see an advanced use of generators:

Controlling generator exhaustion

>>> class Bank(): # Let's create a bank, building ATMs
...     crisis = False
...     def create_atm(self):
...         while not self.crisis:
...             yield "$100"
...
>>> hsbc = Bank() # When everything's ok the ATM gives you as much as you want
>>> corner_street_atm = hsbc.create_atm()
>>> print(next(corner_street_atm))
$100
>>> print(next(corner_street_atm))
$100
>>> print([next(corner_street_atm) for cash in range(5)])
['$100', '$100', '$100', '$100', '$100']
>>> hsbc.crisis = True # Crisis is coming, no more money!
>>> print(next(corner_street_atm))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
StopIteration
>>> wall_street_atm = hsbc.create_atm() # It's even true for new ATMs
>>> print(next(wall_street_atm))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
StopIteration
>>> hsbc.crisis = False # The trouble is, even post-crisis the ATM remains empty
>>> print(next(corner_street_atm))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
StopIteration
>>> brand_new_atm = hsbc.create_atm() # Build a new one to get back in business
>>> for cash in brand_new_atm:
...     print(cash)
...
$100
$100
$100
$100
...

Note: in Python 2, write corner_street_atm.next() instead of next(corner_street_atm).

It can be useful for various things like controlling access to a resource.

Itertools, your best friend

The itertools module contains special functions to manipulate iterables. Ever wish to duplicate a generator?
Chain two generators? Group values in a nested list with a one-liner? Map / Zip without creating another list?

Then just import itertools.
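
For instance, itertools.tee "duplicates" an iterator and itertools.chain concatenates iterables (a quick sketch):

>>> import itertools
>>> g = (n*n for n in range(3))
>>> a, b = itertools.tee(g)   # two independent copies of the same stream
>>> list(a), list(b)
([0, 1, 4], [0, 1, 4])
>>> list(itertools.chain([1, 2], (3, 4)))   # chain two iterables into one
[1, 2, 3, 4]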

An example? Let's see the possible orders of arrival for a four-horse race:

>>> import itertools
>>> horses = [1, 2, 3, 4]
>>> races = itertools.permutations(horses)
>>> print(races)
<itertools.permutations object at 0xb754f1dc>
>>> print(list(itertools.permutations(horses)))
[(1, 2, 3, 4),
(1, 2, 4, 3),
(1, 3, 2, 4),
(1, 3, 4, 2),
(1, 4, 2, 3),
(1, 4, 3, 2),
(2, 1, 3, 4),
(2, 1, 4, 3),
(2, 3, 1, 4),
(2, 3, 4, 1),
(2, 4, 1, 3),
(2, 4, 3, 1),
(3, 1, 2, 4),
(3, 1, 4, 2),
(3, 2, 1, 4),
(3, 2, 4, 1),
(3, 4, 1, 2),
(3, 4, 2, 1),
(4, 1, 2, 3),
(4, 1, 3, 2),
(4, 2, 1, 3),
(4, 2, 3, 1),
(4, 3, 1, 2),
(4, 3, 2, 1)]

Understanding the inner mechanisms of iteration

Iteration is a process involving iterables (objects implementing the __iter__() method) and iterators (objects implementing the __next__() method).
Iterables are any objects you can get an iterator from; iterators are objects that let you iterate over an iterable.
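
To make the two roles concrete, here is a minimal hand-written iterator (a sketch; in practice you would normally just write a generator):

>>> class Count:
...     """Iterator over 0, 1, ..., stop - 1."""
...     def __init__(self, stop):
...         self.i, self.stop = 0, stop
...     def __iter__(self):    # being iterable: hand out an iterator
...         return self
...     def __next__(self):    # being an iterator: produce the next value
...         if self.i >= self.stop:
...             raise StopIteration
...         self.i += 1
...         return self.i - 1
...
>>> list(Count(3))
[0, 1, 2]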

There is more about it in this article about how for loops work.

Trouble understanding python generators

Generators (like all iterables) can only be iterated over once. By the time print_gen(x) is done, so is x. Any further attempts to get new values from x will result in StopIteration being raised.
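
(print_gen is not shown in the original question; presumably it is a helper along these lines:)

def print_gen(gen):
    # Print every value the generator yields, exhausting it.
    for value in gen:
        print(value)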

This works:

x = do_gen()
y = (incr_gen(i) for i in do_gen())
print_gen(x)
print_gen(y)

as that creates two separate independent generators. In your version, the generator expression you assigned to y expects x to yield more values.

It is easier to see that the x generator is shared with y when you use the next() function on them in turn:

>>> def do_gen():
...     for i in range(3):
...         yield i
...
>>> def incr_gen(y):
...     return y + 1
...
>>> x = do_gen()
>>> y = (incr_gen(i) for i in x)
>>> next(x) # first value from the range
0
>>> next(y) # second value from the range, plus 1
2
>>> next(x) # third value from the range
2
>>> next(y) # no more values available, generator done
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
StopIteration

Note the StopIteration raised by next(y) as well here.

Is a generator the callable? Which is the generator?

The terminology is unfortunately confusing, as "generator" is so commonly used to refer to either the function or the returned iterator that it's hard to say one usage is more correct. The documentation of the yield statement says

The yield statement is only used when defining a generator function,
and is only used in the body of the generator function. Using a yield
statement in a function definition is sufficient to cause that
definition to create a generator function instead of a normal
function.

When a generator function is called, it returns an iterator known as a
generator iterator, or more commonly, a generator.

The original PEP introducing the concept (PEP 255) says:

Note that when the intent is clear from context, the unqualified name
"generator" may be used to refer either to a generator-function or a
generator-iterator.

If you want to make a clear distinction, use "generator function" for the function and "generator iterator" for the iterator.

In what situations should you actually use generators in python?

Doesn't this defeat the purpose of using a generator, as it then creates the whole thing as a list anyway? In that case, in what exact cases are generators useful?

This is a bit opinion based, but there are some situations where a list might not do the trick (for example because of hardware limitations).

Saving CPU cycles (time)

Imagine that you have a stream of even numbers and want to take the sum of the first five. In Python we could do that with itertools.islice, like:

from itertools import islice

sumfirst5even = sum(islice(even(100), 5))
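
Here even(n) is presumably a generator function along the following lines (a sketch; the original version is not shown):

def even(n):
    # Yield the first n even numbers, one at a time.
    for i in range(n):
        yield 2 * i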

If we first generated a list of 100 even numbers (not knowing what we would later do with that list), we would have spent a lot of CPU cycles on the construction of that list, all wasted.

By using a generator, we can restrict this to only the elements we really need: we will only yield the first five elements, and the algorithm will never calculate elements larger than 10. Yes, here it is doubtful that this will have any (significant) impact. It is even possible that the "generator protocol" requires more CPU cycles than generating a small list, so for small lists there is no advantage. But now imagine that we used even(100000): then the amount of "useless CPU cycles" spent on generating the entire list can be significant.

Saving memory

Another potential benefit is saving memory, given we do not need all elements of the generator in memory concurrently.

Take for example the following example:

for x in even(1000):
    print(x)

If even(..) constructs a list of 1000 elements, then all these numbers need to be objects in memory concurrently. Depending on the Python interpreter, objects can take a significant amount of memory. For example, an int takes 28 bytes of memory in CPython, so a list containing 500 such ints can take roughly 14 kB of memory (plus some extra memory for the list itself). Yes, most Python interpreters maintain a "flyweight" pattern to reduce the burden of small ints (these are shared, so we do not create a separate object for each int we construct in the process), but still it can easily add up. For an even(1000000), we would need 28 MB of memory.
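
(On a 64-bit CPython you can verify this figure yourself:)

>>> import sys
>>> sys.getsizeof(1)   # size in bytes of a small int object on 64-bit CPython
28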

If we use a generator, then depending on how we use it, we might save memory. Why? Because once we no longer need the number 123456 (since the for loop advances to the next item), the space that object "occupied" can be recycled and given to an int object with value 123458. So, provided the way we use the generator permits this, the memory usage remains constant, whereas for a list it scales linearly. Of course the generator itself also needs to do proper management: if the generator code builds up a collection internally, then the memory will of course increase as well.

On 32-bit systems this can even cause real problems, since Python lists there have a maximum length: a list can contain at most 536,870,912 elements. Yes, that is a huge number, but what if you, for example, want to generate all permutations of a given list? If we store the permutations in a list, then on a 32-bit system, for a list of 13 or more elements, we will never be able to construct such a list (13! = 6,227,020,800 permutations already exceed that maximum).

"online" programs

In theoretical computer science, an "online algorithm" is defined by some researchers as an algorithm that receives its input gradually, and thus does not know the entire input in advance.

A practical example can be a webcam that captures an image each second and sends it to a Python webserver. We do not know at that moment what a picture captured by the webcam 24 hours later will look like. But we might be interested in detecting a burglar who aims to steal something. In that case a list of frames can never contain all the images in advance. A generator can, however, construct an elegant "protocol" where we iteratively fetch an image, detect a burglar, and raise an alarm, like:

for frame in from_webcam():
    if contains_burglar(frame):
        send_alarm_email('Maurice Moss')

Infinite generators

We do not need webcams or other hardware to exploit the elegance of generators. Generators can yield an "infinite" sequence. An even-number generator, for example, could look like:

def even():
    i = 0
    while True:
        yield i
        i += 2

This is a generator that will eventually generate all even numbers. If we keep iterating over it, eventually we will yield the number 123'456'789'012'345'678 (although it might take a very long time).

The above can be useful if we want to implement a program that for example keeps yielding even numbers that are palindromes. This could look like:

for i in even():
    if is_palindrome(i):
        print(i)

We thus can assume that this program will keep working, and that we never need to "update" a list of even numbers. In some pure functional languages that make lazy evaluation transparent, programs are written as if you create a list, while in fact it is typically a generator that is in place.

"enriched" generators: range(..) and friends

In Python a lot of classes do not construct lists when you iterate over them. For example, a range(1000) object does not first construct a list (it does in python-2.x, but not in python-3.x): the range(..) object simply represents a range. A range(..) object is not a generator, but it is a class whose instances can produce iterator objects that work like generators.

Besides iterating, we can do all kinds of things with a range(..) object that are possible with lists as well, but that lists cannot do efficiently.

If we, for example, want to know whether 1000000000 is an element of range(400, 10000000000, 2), then we can write 1000000000 in range(400, 10000000000, 2). There is an algorithm in place that checks this without generating the range or constructing a list: it checks whether the element is an int, whether it lies within the bounds of the range(..) object (greater than or equal to 400 and less than 10000000000), and whether it is actually hit given the step; none of this requires iterating over the range. As a result the membership check is done instantly.
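
A quick demonstration (a sketch):

>>> 1000000000 in range(400, 10000000000, 2)   # even, in bounds: answered instantly
True
>>> 1000000001 in range(400, 10000000000, 2)   # odd: missed by the step of 2
False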

If we had generated a list instead, Python would have to enumerate every element until it finally finds that element (or reaches the end of the list). For numbers like 1000000000, this can easily take minutes, hours, maybe days.

We can also "slice" the range object, which yield another range(..) object, for example:

>>> range(123, 456, 7)[1::4]
range(130, 459, 28)

with an algorithm we can thus instantly slice a range(..) object into a new range(..) object, whereas slicing a list takes linear time, which can again (for huge lists) cost significant time and memory.

Confusion on how Python generator works

One way to tackle this kind of learning problems is to visualize what is happening at each step. Let us start from the beginning:

def counter(start=0):
    n = start
    while True:
        result = yield n            # A
        print(type(result), result) # B
        if result == 'Q':
            break
        n += 1

Nothing interesting happens here. It is just a function definition.

c = counter()

Normally this would execute the function, right? But since there is a yield keyword in it, calling it simply returns an object which can be used to execute the function body. Read the previous sentence again! That is the first thing to understand about generators.

print(next(c)) # C

This is the way you execute the function using the object c. You don't invoke it with (), but instead do next(c). This is the very first time your instructions are executed, and it happens until the first yield statement is found. Since this is at A and the value of n is 0 at that moment, it yields 0 (which the outer print displays) and exits from the function - it would be better to say the function pauses here. Remember it has not even reached n += 1! That answers your Q1.

print(c.send('Wow!')) # D

Now some more interesting stuff happens. The generator c, which had previously stopped at yield n, now just resumes, and the next immediate thing it has to perform is the assignment to result in the result = yield n statement at A. result is given the value which you send in! So now result = 'Wow!' has just happened.

Rest of the execution is normal. It again comes out of the function when it hits the next yield n. Now n is n + 1, because it was incremented in the while loop. I hope you can guess how the rest of the code behaves.

print(c.send('Q')) # F

Now this statement is somewhat different, because it sends in a value that actually breaks the loop, which in turn stops any further yields. Since the generator no longer reaches any yield statement, it throws a StopIteration exception and stops. If there were a yield outside of the while loop, it would have yielded that value and paused again.
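
Putting all the steps together, the whole session looks roughly like this (a sketch):

>>> c = counter()
>>> print(next(c))        # C: runs the body up to the first yield
0
>>> print(c.send('Wow!')) # D: result = 'Wow!', line B prints, then yield n resumes
<class 'str'> Wow!
1
>>> print(c.send('Q'))    # F: result = 'Q' breaks the loop, so StopIteration
<class 'str'> Q
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
StopIteration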


