Generator Expressions Vs. List Comprehensions

Generator expressions vs. list comprehensions

John's answer is good (that list comprehensions are better when you want to iterate over something multiple times). However, it's also worth noting that you should use a list if you want to use any of the list methods. For example, the following code won't work:

def gen():
    return (something for something in get_some_stuff())

print(gen()[:2])       # TypeError: generators don't support indexing or slicing
print([5, 6] + gen())  # TypeError: generators can't be added to lists

Basically, use a generator expression if all you're doing is iterating once. If you want to store and use the generated results, then you're probably better off with a list comprehension.
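
For example, a one-pass reduction fits a generator expression, while a result you index, slice, or reuse should be a list (a small sketch of that rule of thumb, not from the original answer):

data = range(10)

# One pass, result not kept around: a generator expression is enough.
total = sum(x * x for x in data)

# The result is reused and sliced afterwards, so build a real list.
squares = [x * x for x in data]
print(squares[:3])          # slicing works on lists but not on generators
print(len(squares), total)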

Since performance is the most common reason to choose one over the other, my advice is to not worry about it and just pick one; if you find that your program is running too slowly, then and only then should you go back and worry about tuning your code.

Generator expression vs list comprehension for adding values to a set

Generator expressions are lazy: if you don't actually iterate over them, they do nothing (aside from computing the iterator for the outermost loop; in this case, work equivalent to iter(mylist), with the result stored for when the genexpr is actually iterated). To make it work, you'd have to run out the generator, e.g. using the consume recipe from the itertools documentation:

consume(my_set.add(num) for num in mylist)

# Unoptimized equivalent:
for _ in (my_set.add(num) for num in mylist):
    pass
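
For reference, consume is essentially this recipe from the itertools documentation:

from collections import deque
from itertools import islice

def consume(iterator, n=None):
    "Advance the iterator n steps ahead; if n is None, consume it entirely."
    if n is None:
        # Feed the entire iterator into a zero-length deque (runs at C speed).
        deque(iterator, maxlen=0)
    else:
        # Advance to the empty slice starting at position n.
        next(islice(iterator, n, n), None)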

In any event, this is an insane thing to do; comprehensions and generator expressions are functional programming tools, and should not have side-effects, let alone be written solely for the purpose of producing side-effects. Code maintainers (reasonably) expect that comprehensions will trigger no "spooky action at a distance"; don't violate that expectation. Just use a set comprehension:

myset = {num for num in mylist}

or, since the comprehension does nothing in this case, the set constructor:

myset = set(mylist)  # Or with modern unpacking generalizations, myset = {*mylist}

List comprehension vs generator expression's weird timeit results?

Expanding on Paulo's answer, generator expressions are often slower than list comprehensions because of the overhead of function calls. In this case, the short-circuiting behavior of in offsets that slowness if the item is found fairly early, but otherwise, the pattern holds.

I ran a simple script through the profiler for a more detailed analysis. Here's the script:

lis = [['a','b','c'], ['d','e','f'], [1,2,3], [4,5,6],
       [7,8,9], [10,11,12], [13,14,15], [16,17,18]]
def ge_d():
    return 'd' in (y for x in lis for y in x)
def lc_d():
    return 'd' in [y for x in lis for y in x]

def ge_11():
    return 11 in (y for x in lis for y in x)
def lc_11():
    return 11 in [y for x in lis for y in x]

def ge_18():
    return 18 in (y for x in lis for y in x)
def lc_18():
    return 18 in [y for x in lis for y in x]

# Python 2 (xrange); under Python 3, use range instead.
for i in xrange(100000):
    ge_d()
    lc_d()
    ge_11()
    lc_11()
    ge_18()
    lc_18()

Here are the relevant results, reordered to make the patterns clearer.

         5400002 function calls in 2.830 seconds

   Ordered by: standard name

    ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    100000    0.158    0.000    0.251    0.000 fop.py:3(ge_d)
    500000    0.092    0.000    0.092    0.000 fop.py:4(<genexpr>)
    100000    0.285    0.000    0.285    0.000 fop.py:5(lc_d)

    100000    0.356    0.000    0.634    0.000 fop.py:8(ge_11)
   1800000    0.278    0.000    0.278    0.000 fop.py:9(<genexpr>)
    100000    0.333    0.000    0.333    0.000 fop.py:10(lc_11)

    100000    0.435    0.000    0.806    0.000 fop.py:13(ge_18)
   2500000    0.371    0.000    0.371    0.000 fop.py:14(<genexpr>)
    100000    0.344    0.000    0.344    0.000 fop.py:15(lc_18)

Creating a generator expression is equivalent to creating a generator function and calling it. That accounts for one call to <genexpr>. Then, in the first case, next is called 4 times, until d is reached, for a total of 5 calls (times 100000 iterations = ncalls = 500000). In the second case, it is called 17 times, for a total of 18 calls; and in the third, 24 times, for a total of 25 calls.
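
You can watch that consumption pattern directly by making each produced value visible (an illustrative sketch, not part of the original profile):

lis = [['a','b','c'], ['d','e','f'], [1,2,3]]

def noisy(y):
    print('yielding', y)   # shows exactly which values the genexpr produces
    return y

# The membership test stops as soon as 'd' is produced: only a, b, c, d print.
print('d' in (noisy(y) for x in lis for y in x))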

The genex outperforms the list comprehension in the first case, but the extra calls to next account for most of the difference between the speed of the list comprehension and the speed of the generator expression in the second and third cases.

>>> .634 - .278 - .333
0.023
>>> .806 - .371 - .344
0.091

I'm not sure what accounts for the remaining time; it seems that generator expressions would be a hair slower even without the additional function calls. I suppose this confirms inspectorG4dget's assertion that "creating a generator comprehension has more native overhead than does a list comprehension." But in any case, this shows pretty clearly that generator expressions are slower mostly because of calls to next.

I'll add that when short-circuiting doesn't help, list comprehensions are still faster, even for very large lists. For example:

>>> import itertools
>>> counter = itertools.count()
>>> lol = [[counter.next(), counter.next(), counter.next()]
...        for _ in range(1000000)]
>>> 2999999 in (i for sublist in lol for i in sublist)
True
>>> 3000000 in (i for sublist in lol for i in sublist)
False
>>> %timeit 2999999 in [i for sublist in lol for i in sublist]
1 loops, best of 3: 312 ms per loop
>>> %timeit 2999999 in (i for sublist in lol for i in sublist)
1 loops, best of 3: 351 ms per loop
>>> %timeit any([2999999 in sublist for sublist in lol])
10 loops, best of 3: 161 ms per loop
>>> %timeit any(2999999 in sublist for sublist in lol)
10 loops, best of 3: 163 ms per loop
>>> %timeit for i in [2999999 in sublist for sublist in lol]: pass
1 loops, best of 3: 171 ms per loop
>>> %timeit for i in (2999999 in sublist for sublist in lol): pass
1 loops, best of 3: 183 ms per loop

As you can see, when short-circuiting is irrelevant, list comprehensions are consistently faster, even for a million-item-long list of lists. Obviously for actual uses of in at these scales, generators will be faster because of short-circuiting. But for other kinds of iterative tasks that are truly linear in the number of items, list comprehensions are pretty much always faster. This is especially true if you need to perform multiple tests on a list; you can iterate over an already-built list comprehension very quickly:

>>> incache = [2999999 in sublist for sublist in lol]
>>> get_list = lambda: incache
>>> get_gen = lambda: (2999999 in sublist for sublist in lol)
>>> %timeit for i in get_list(): pass
100 loops, best of 3: 18.6 ms per loop
>>> %timeit for i in get_gen(): pass
1 loops, best of 3: 187 ms per loop

In this case, the list comprehension is an order of magnitude faster!

Of course, this only remains true until you run out of memory, which brings me to my final point. There are two main reasons to use a generator: to take advantage of short-circuiting, and to save memory. For very large sequences/iterables, generators are the obvious way to go, because they save memory. But if short-circuiting is not an option, you pretty much never choose generators over lists for speed. You choose them to save memory, and it's always a trade-off.
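
The memory half of that trade-off is easy to see with sys.getsizeof (a sketch; exact numbers vary by CPython version and platform):

import sys

big_list = [n for n in range(1000000)]
big_gen = (n for n in range(1000000))

print(sys.getsizeof(big_list))  # megabytes: a pointer to every element is stored
print(sys.getsizeof(big_gen))   # a couple hundred bytes: just the suspended frame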

Why this list comprehension is faster than equivalent generator expression?

I believe the difference here is entirely in the cost of the additions sum has to perform (roughly 333K of them). Testing with 64-bit Python.org 3.3.0 on Mac OS X:

In [698]: %timeit len([None for n in range(1, 1000000) if n % 3 == 1])
10 loops, best of 3: 127 ms per loop
In [699]: %timeit sum(1 for n in range(1, 1000000) if n % 3 == 1)
10 loops, best of 3: 138 ms per loop
In [700]: %timeit sum([1 for n in range(1, 1000000) if n % 3 == 1])
10 loops, best of 3: 139 ms per loop

So, it's not that the comprehension is faster than the genexp; they both take about the same time. But calling len on a list is effectively instant, while summing the ~333K ones adds roughly 9% to the total time.

Throwing a few different numbers at it, this seems to hold up unless the list is very tiny (in which case it does seem to get faster), or large enough that memory allocation starts to become a significant factor (which it isn't yet, at 333K).
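
The asymmetry is easy to check in isolation: len just reads the size a list already stores, while sum has to visit every element (a small sketch, not part of the original timings):

nums = [1] * 333333   # about the number of hits the filter above produces

assert len(nums) == 333333   # O(1): lists carry their length as a field
assert sum(nums) == 333333   # O(n): fetches and adds all ~333K elements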

List comprehension works but not generator expression

Every generator object is truthy, even an "empty" one that will yield nothing (an empty list, by contrast, is falsy), so a truthiness check never filters it out and you always end up calling max(values) on an empty iterable. The right fix is to tell max what to return in that case:

return max(values, default=-1)
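
To see the difference (a sketch reconstructing the likely shape of the original code, which isn't shown here):

>>> max([x for x in [] if x > 0] or [-1])   # empty list is falsy; fallback applies
-1
>>> max((x for x in [] if x > 0) or [-1])   # generator is truthy; max gets no values
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: max() arg is an empty sequence
>>> max((x for x in [] if x > 0), default=-1)
-1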

Unexpected results when comparing list comprehension with generator expression

Generators aren't evaluated until you call next() on them, which is what makes them useful, while list comprehensions are evaluated immediately.

So lc is already [4, 5] before the extend, and is therefore done.

lg, on the other hand, hasn't been consumed yet when extend runs, so a is extended before the generator has finished iterating over it. When you finally iterate lg, it sees the extended list and yields the extra matching numbers too, which is why it prints out longer.

Check it out like this:

>>> a = [2, 3, 4, 5]
>>> lg = ( x for x in a if x >= 4 )
>>> next(lg)
4
>>> next(lg)
5
>>> a.extend([6,7,8,9])
>>> next(lg)
6

However, if you call an extra next() before the extend, you'll get StopIteration; once a generator is exhausted, it stays exhausted, so the later extend can't revive it.

>>> a = [2, 3, 4, 5]
>>> lg = ( x for x in a if x >= 4 )
>>> next(lg)
4
>>> next(lg)
5
>>> next(lg)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
StopIteration
>>> a.extend([6,7,8,9])
>>> next(lg)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
StopIteration

Why is summing list comprehension faster than generator expression?

I took a look at the disassembly of each construct (using dis). I did this by declaring these two functions:

def list_comprehension():
    return sum([ch in A for ch in B])

def generation_expression():
    return sum(ch in A for ch in B)

and then calling dis.dis with each function.
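
Concretely, that step looks like this (on CPython 3.7+, dis.dis also recurses into the nested <listcomp>/<genexpr> code objects, which is where the first two listings below come from):

import dis

dis.dis(list_comprehension)
dis.dis(generation_expression)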

For the list comprehension:

  0 BUILD_LIST               0
  2 LOAD_FAST                0 (.0)
  4 FOR_ITER                12 (to 18)
  6 STORE_FAST               1 (ch)
  8 LOAD_FAST                1 (ch)
 10 LOAD_GLOBAL              0 (A)
 12 COMPARE_OP               6 (in)
 14 LIST_APPEND              2
 16 JUMP_ABSOLUTE            4
 18 RETURN_VALUE

and for the generator expression:

  0 LOAD_FAST                0 (.0)
  2 FOR_ITER                14 (to 18)
  4 STORE_FAST               1 (ch)
  6 LOAD_FAST                1 (ch)
  8 LOAD_GLOBAL              0 (A)
 10 COMPARE_OP               6 (in)
 12 YIELD_VALUE
 14 POP_TOP
 16 JUMP_ABSOLUTE            2
 18 LOAD_CONST               0 (None)
 20 RETURN_VALUE

The disassembly for the actual summation is:

  0 LOAD_GLOBAL              0 (sum)
  2 LOAD_CONST               1 (<code object <genexpr> at 0x7f49dc395240, file "/home/mishac/dev/python/kintsugi/KintsugiModels/automated_tests/a.py", line 12>)
  4 LOAD_CONST               2 ('generation_expression.<locals>.<genexpr>')
  6 MAKE_FUNCTION            0
  8 LOAD_GLOBAL              1 (B)
 10 GET_ITER
 12 CALL_FUNCTION            1
 14 CALL_FUNCTION            1
 16 RETURN_VALUE

but this sum disassembly was constant between both your examples, with the only difference being the loading of generation_expression.<locals>.<genexpr> vs list_comprehension.<locals>.<listcomp> (so just loading a different code object and name).

The differing bytecode instructions between the first two disassemblies are LIST_APPEND for the list comprehension vs. the conjunction of YIELD_VALUE and POP_TOP for the generator expression.

I won't pretend I know the intrinsics of Python bytecode, but what I gather from this is that the list comprehension appends each value directly to the list it is building (LIST_APPEND), while the generator expression must yield each value out of its frame and let the consumer pop it off the stack (YIELD_VALUE followed by POP_TOP), suspending and resuming the generator each time. That per-item round trip is a small amount of overhead that the list comprehension avoids.

Now this doesn't mean that generators are always going to be slower. Generators excel at being memory-efficient, so there will be a threshold N such that list comprehensions perform slightly better below it (because memory use won't be a problem), but beyond it, generators perform significantly better.

How exactly does a generator comprehension work?

Do you understand list comprehensions? If so, a generator expression is like a list comprehension, but instead of finding all the items you're interested in and packing them into a list, it waits, and yields each item out of the expression, one by one.

>>> my_list = [1, 3, 5, 9, 2, 6]
>>> filtered_list = [item for item in my_list if item > 3]
>>> print(filtered_list)
[5, 9, 6]
>>> len(filtered_list)
3
>>> # compare to generator expression
...
>>> filtered_gen = (item for item in my_list if item > 3)
>>> print(filtered_gen) # notice it's a generator object
<generator object <genexpr> at 0x7f2ad75f89e0>
>>> len(filtered_gen) # So technically, it has no length
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: object of type 'generator' has no len()
>>> # We extract each item out individually. We'll do it manually first.
...
>>> next(filtered_gen)
5
>>> next(filtered_gen)
9
>>> next(filtered_gen)
6
>>> next(filtered_gen) # Should be all out of items and give an error
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
StopIteration
>>> # Yup, the generator is spent. No values for you!
...
>>> # Let's prove it gives the same results as our list comprehension
...
>>> filtered_gen = (item for item in my_list if item > 3)
>>> gen_to_list = list(filtered_gen)
>>> print(gen_to_list)
[5, 9, 6]
>>> filtered_list == gen_to_list
True
>>>

Because a generator expression only has to yield one item at a time, it can lead to big savings in memory usage. Generator expressions make the most sense in scenarios where you need to take one item at a time, do a lot of calculations based on that item, and then move on to the next item. If you need more than one value, you can also use a generator expression and grab a few at a time. If you need all the values before your program proceeds, use a list comprehension instead.
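
For instance (an illustrative sketch of that one-item-at-a-time pattern):

def expensive(n):
    # Stand-in for heavy per-item work.
    return n * n

# Nothing runs yet; each value is computed only when requested.
results = (expensive(n) for n in range(10**6) if n % 7 == 0)

for r in results:
    if r > 10000:
        break   # Stop early: the vast majority of the items are never computed.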


