When Is It Not a Good Time to Use Python Generators

When is it not a good time to use Python generators?

Use a list instead of a generator when:

1) You need to access the data multiple times (i.e. cache the results instead of recomputing them), as in the sketch after this list:

for i in outer:               # used once, okay to be a generator or return a list
    for j in inner:           # used multiple times, reusing a list is better
        ...

2) You need random access (or any access other than forward sequential order):

for i in reversed(data): ...     # generators aren't reversible

s[i], s[j] = s[j], s[i] # generators aren't indexable

3) You need to join strings (which requires two passes over the data):

s = ''.join(data)                # lists are faster than generators in this use case

4) You are using PyPy which sometimes can't optimize generator code as much as it can with normal function calls and list manipulations.
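
A quick sketch of point 1; the variable names here are made up for illustration:

outer = range(3)
inner = [x * x for x in range(3)]     # a list can be re-iterated on every pass

for i in outer:
    for j in inner:                   # works for every value of i
        print(i, j)

gen = (x * x for x in range(3))       # a generator is exhausted after one pass
for i in outer:
    for j in gen:                     # only produces output for i == 0
        print(i, j)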

Python generators seem too slow to use. Why and when should I use them?

Generators do not store all elements in memory in one go; they yield one element at a time, and that behavior makes them memory efficient. Use them when memory is a constraint.

In what situations should you actually use generators in python?

Doesn't this defeat the purpose of using a generator, since it then creates the values in a list anyway? In which exact cases are generators useful?

This is a bit opinion-based, but there are some situations where a list might not do the trick (for example, because of hardware limitations).

Saving CPU cycles (time)

Imagine that you have a sequence of even numbers and want to take the sum of the first five. In Python we could do that with itertools.islice, like:

sumfirst5even = sum(islice(even(100), 5))

If we first generated a full list of even numbers (not knowing what we would later do with that list), then we would have spent a lot of CPU cycles on the construction of that list, and they would all be wasted.

By using a generator, we can restrict the work to only the elements we really need: we only yield the first five elements, and the algorithm never calculates anything beyond them. Admittedly, here it is doubtful that this will have any significant impact; it is even possible that the "generator protocol" requires more CPU cycles than building a small list, so for small lists there is no advantage. But now imagine that we used even(100000): the amount of "useless CPU cycles" spent on generating an entire list can be significant.
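
A minimal sketch of that islice pattern, assuming even(n) lazily yields the even numbers below n (the helper is not defined at this point in the text):

from itertools import islice

def even(n):
    # Assumed helper: lazily yield the even numbers below n.
    i = 0
    while i < n:
        yield i
        i += 2

# islice stops after five items, so no value beyond 8 is ever computed.
sumfirst5even = sum(islice(even(100), 5))
print(sumfirst5even)   # 0 + 2 + 4 + 6 + 8 = 20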

Saving memory

Another potential benefit is saving memory, given we do not need all elements of the generator in memory concurrently.

Take, for example, the following loop:

for x in even(1000):
    print(x)

If even(..) constructs a list of all even numbers below 1000, then all 500 of those numbers need to be objects in memory concurrently. Depending on the Python interpreter, objects can take a significant amount of memory. For example, an int takes 28 bytes of memory in CPython. So a list containing 500 such ints can take roughly 14 kB of memory (plus some extra memory for the list itself). Yes, most Python interpreters maintain a "flyweight" pattern to reduce the burden of small ints (these are shared, so we do not create a separate object for each small int constructed in the process), but it can still easily add up. For even(1000000), we would need roughly 14 MB of memory.
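
You can verify the per-object cost yourself; the exact numbers below assume a 64-bit CPython build:

import sys

print(sys.getsizeof(123456))   # 28 bytes for a small-ish int on 64-bit CPython

# 500 distinct ints: roughly 500 * 28 bytes for the int objects alone,
# plus the list's own array of pointers.
numbers = list(range(1000, 2000, 2))
print(len(numbers) * sys.getsizeof(numbers[0]))   # ~14000 bytes
print(sys.getsizeof(numbers))                     # the list object itself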

If we use a generator, then depending on how we use it, we might save memory. Why? Because once we no longer need the number 123456 (since the for loop advances to the next item), the space that object occupied can be recycled and handed to an int object with value 123458. So, provided the way we use the generator permits this, the memory usage remains constant, whereas for a list it scales linearly. Of course the generator itself also needs to behave: if the generator code builds up a collection internally, then memory usage will of course grow as well.

On 32-bit systems this can even cause real problems, since Python lists have a maximum length there: a list can contain at most 536'870'912 elements. Yes, that is a huge number, but what if you, for example, want to generate all permutations of a given list? For 13 or more elements there are more than six billion permutations, so on a 32-bit system we would never be able to construct such a list.

"online" programs

In theoretical computer science, an "online algorithm" is defined by some researchers as an algorithm that receives its input gradually, and thus does not know the entire input in advance.

A practical example is a webcam that takes an image every second and sends it to a Python webserver. We do not know in advance what a picture captured by the webcam in the next 24 hours will look like, but we might be interested in detecting a burglar who aims to steal something. In that case we cannot simply build a list of all frames up front. A generator, however, supports an elegant "protocol" where we iteratively fetch an image, detect a burglar, and raise an alarm, like:

for frame in from_webcam():
    if contains_burglar(frame):
        send_alarm_email('Maurice Moss')
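
A runnable sketch of that protocol; capture_image, contains_burglar, and send_alarm_email are stand-ins here, since the real camera and detection code are not shown:

import time
from itertools import islice

def capture_image():
    # Stand-in for a real camera API; returns a fake "frame".
    return {'taken_at': time.time()}

def contains_burglar(frame):
    # Stand-in detector; a real implementation would inspect the frame.
    return False

def send_alarm_email(recipient):
    print(f"ALARM sent to {recipient}")

def from_webcam():
    # Yield one frame per second, indefinitely; the full list of frames
    # never exists anywhere.
    while True:
        yield capture_image()
        time.sleep(1)

# Limit to a few frames so the demo terminates.
for frame in islice(from_webcam(), 3):
    if contains_burglar(frame):
        send_alarm_email('Maurice Moss')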

Infinite generators

We do not need webcams or other hardware to exploit the elegance of generators. Generators can yield an "infinite" sequence. An even-number generator could, for example, look like this:

def even():
    i = 0
    while True:
        yield i
        i += 2

This is a generator that will eventually generate all even numbers. If we keep iterating over it, eventually we will yield the number 123'456'789'012'345'678 (although it might take a very long time).

The above can be useful if we want to implement a program that for example keeps yielding even numbers that are palindromes. This could look like:

for i in even():
    if is_palindrome(i):
        print(i)

We can thus assume that this program will keep working, without us ever having to "update" a list of even numbers. In some pure functional languages that make lazy evaluation transparent, programs are written as if you create a list, but in fact a generator is used in its place.
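
A runnable sketch of the palindrome example above (is_palindrome is not defined in the text, so a simple digit-reversal check is assumed here, and islice keeps the demo finite):

from itertools import islice

def even():
    i = 0
    while True:
        yield i
        i += 2

def is_palindrome(n):
    # Assumed helper: a number is a palindrome if its digits read the same backwards.
    s = str(n)
    return s == s[::-1]

# Print the first ten even palindromes: 0, 2, 4, 6, 8, 22, 44, 66, 88, 202
for i in islice((n for n in even() if is_palindrome(n)), 10):
    print(i)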

"enriched" generators: range(..) and friends

In Python a lot of classes do not construct lists when you iterate over them. For example, a range(1000) object does not first construct a list (it does in Python 2.x, but not in Python 3.x); the range(..) object simply represents the range. A range(..) object is not a generator, but it is a class that can produce an iterator object that works like a generator.

Besides iterating, we can do all kinds of things with a range(..) object that are possible with lists, but not in an efficient way.

If we for example want to know whether 1000000000 is an element of range(400, 10000000000, 2), we can write 1000000000 in range(400, 10000000000, 2). There is an algorithm in place that checks this without generating the range or constructing a list: it checks whether the element is an int, whether it lies within the bounds of the range(..) object (greater than or equal to 400 and less than 10000000000), and whether it is actually hit given the step; none of this requires iterating over the range. As a result, the membership check is done instantly.
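
For example (a constant-time check; no list is ever built):

>>> 1000000000 in range(400, 10000000000, 2)
True
>>> 1000000001 in range(400, 10000000000, 2)   # odd, so never hit by step 2
False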

If we had generated a list instead, Python would have to enumerate every element until it finally finds that element (or reaches the end of the list). For numbers like 1000000000, this can easily take minutes, hours, maybe days.

We can also "slice" the range object, which yield another range(..) object, for example:

>>> range(123, 456, 7)[1::4]
range(130, 459, 28)

With a bit of arithmetic we can thus instantly slice the range(..) object into a new range object, whereas slicing a list takes linear time, which can again (for huge lists) cost significant time and memory.

Why does a generator function not use the idle time to prepare the next yield?

Generators were designed as a simpler, shorter, easier-to-understand syntax for writing iterators. That was their use case. People who want to make iterators shorter and easier to understand do not want to introduce the headaches of thread synchronization into every iterator they write. That would be the opposite of the design goal.

As such, generators are based around the concept of coroutines and cooperative multitasking, not threads. The design tradeoffs are different; generators sacrifice parallel execution in exchange for semantics that are much easier to reason about.

Also, using separate threads for every generator would be really inefficient, and figuring out when to parallelize is a hard problem. Most generators aren't actually worth executing in another thread. Heck, they wouldn't be worth executing in another thread even in GIL-less implementations of Python, like Jython or Grumpy.

If you want something that runs in parallel, that's already handled by starting a thread or process and communicating with it through queues.
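
A minimal sketch of that alternative, using a background thread and a bounded queue as the hand-off (the names here are illustrative, not from any particular library):

import threading
import queue

def produce(q):
    # The producer runs ahead in its own thread and pushes results;
    # the consumer pulls them from the queue whenever it is ready.
    for i in range(5):
        q.put(i * i)
    q.put(None)                 # sentinel: no more items

q = queue.Queue(maxsize=2)      # bounded buffer of pre-computed items
threading.Thread(target=produce, args=(q,), daemon=True).start()

while True:
    item = q.get()
    if item is None:
        break
    print(item)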

Is it better to use generator or list when I need to go back to the start often?

There are two choices: keeping memory low or keeping runtime low.

Low memory:
Since you are not modifying any element of the generator or accessing elements by index, it is better to use the generator, as it keeps only one object in memory at a time. But then you have to re-run the generator function for every iteration of the outer loop.

Low runtime:
Use a list, generated only once before the for loop.

def sieve_of_eratosthenes(limit):
    # Initialize the primality list
    a = [False] * 2 + [True] * (limit - 2)

    for (i, isprime) in enumerate(a):
        if isprime:
            yield i
            # Mark factors non-prime
            for n in range(i*i, limit, i):
                a[n] = False

limit = 100                    # example bound (assumed; not given in the original)
numbers_list = [12, 30, 45]    # example inputs (assumed; not given in the original)

limits = list(sieve_of_eratosthenes(limit))   # build the prime list once, reuse it below
for n in numbers_list:
    s = 0
    for p in limits:
        if not n % p:          # fixed: the original tested the undefined name x
            s += p

What can you use generator functions for?

Generators give you lazy evaluation. You use them by iterating over them, either explicitly with 'for' or implicitly by passing it to any function or construct that iterates. You can think of generators as returning multiple items, as if they return a list, but instead of returning them all at once they return them one-by-one, and the generator function is paused until the next item is requested.

Generators are good for calculating large sets of results (in particular calculations involving loops themselves) where you don't know if you are going to need all results, or where you don't want to allocate the memory for all results at the same time. Or for situations where the generator uses another generator, or consumes some other resource, and it's more convenient if that happens as late as possible.

Another use for generators (that is really the same) is to replace callbacks with iteration. In some situations you want a function to do a lot of work and occasionally report back to the caller. Traditionally you'd use a callback function for this. You pass this callback to the work-function and it would periodically call this callback. The generator approach is that the work-function (now a generator) knows nothing about the callback, and merely yields whenever it wants to report something. The caller, instead of writing a separate callback and passing that to the work-function, does all the reporting work in a little 'for' loop around the generator.

For example, say you wrote a 'filesystem search' program. You could perform the search in its entirety, collect the results and then display them one at a time. All of the results would have to be collected before you showed the first, and all of the results would be in memory at the same time. Or you could display the results while you find them, which would be more memory efficient and much friendlier towards the user. The latter could be done by passing the result-printing function to the filesystem-search function, or it could be done by just making the search function a generator and iterating over the result.
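
A small sketch of that generator-based search (find_files is a made-up name; os.walk does the heavy lifting):

import os

def find_files(root, suffix):
    # Yield matching paths as they are found, instead of collecting them
    # all first or pushing them through a callback.
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(suffix):
                yield os.path.join(dirpath, name)

# The caller does the "reporting" in a plain for loop.
for path in find_files('.', '.py'):
    print(path)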

If you want to see an example of the latter two approaches, see os.path.walk() (the old filesystem-walking function with callback) and os.walk() (the new filesystem-walking generator.) Of course, if you really wanted to collect all results in a list, the generator approach is trivial to convert to the big-list approach:

big_list = list(the_generator)

What are the upsides of generators in python 3?

The biggest benefit of a generator is that it doesn't need to reserve memory for every element of a sequence; it generates each item as needed.

Because of this, a generator doesn't need to have a defined size. It can generate an infinite sequence if needed.

What should I use between generator and function with return for resume parsing in python where I need to process lots of resume at a time?

First things first: I highly recommend writing clean code; that means when you write Python, don't write it as if it were C#/Java (i.e. follow PEP 8).

Another point: try to be Pythonic (sometimes that even makes your code faster). For example, instead of your getResumeList() in the generator example, try a generator expression:

def get_resume_list(dir_path):
    files = os.listdir(dir_path)
    return (f for f in files if f.endswith(".pdf"))

Or a list comprehension, in the second example:

def get_resume_list(dir_path):
    files = os.listdir(dir_path)
    return [f for f in files if f.endswith(".pdf")]

When you open a file, try to use with, because people tend to forget to close files.

As for efficiency, it is clear that generators were created for exactly that. With a generator you can see each result as soon as it is ready, instead of waiting for the whole processing to finish.

As for performance, I don't know how many PDF files you are trying to parse, but I did a little test on 1056 PDF files, and the iterator was faster by a couple of seconds (that is usually the case when measuring raw speed).
Generators are there for efficiency; see the answer by Raymond Hettinger (Python core developer) above, explaining when not to use generators.

In conclusion: in your case it is more efficient to use the generator, and faster to use the iterator.

Different behavior of consumed Python generators depending on implementation

This is because you are not using the same generator each time; if you assign sc.classgen to a variable, it will behave as you expect.

class SimpleClass(object):
    @property
    def classgen(self):
        for i in range(3):
            yield i

mygen = (p for p in range(3))

##### Test behavior
sc = SimpleClass()
print(type(sc.classgen))
print(type(mygen))
print("")

g = sc.classgen

print("Iterating over new sc.classgen")
for i in g:
    print(i)
print("")

print("Iterating over consumed sc.classgen")
for i in g:
    print(i)
print("")

print("Iterating over new mygen")
for i in mygen:
    print(i)
print("")

print("Iterating over consumed mygen")
for i in mygen:
    print(i)

Because the classgen property is backed by a function, a new generator is created every time you access it.


