List vs Generator Comprehension Speed with Join Function

List vs generator comprehension speed with join function

The str.join method converts its iterable parameter to a list if it's not a list or tuple already. This lets the joining logic iterate over the items multiple times (it makes one pass to calculate the size of the result string, then a second pass to actually copy the data).

You can see this in the CPython source code:

PyObject *
PyUnicode_Join(PyObject *separator, PyObject *seq)
{
    /* lots of variable declarations at the start of the function omitted */

    fseq = PySequence_Fast(seq, "can only join an iterable");

    /* ... */
}

The PySequence_Fast function in the C API does just what I described. It converts an arbitrary iterable into a list (essentially by calling list on it), unless it already is a list or tuple.
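To put that in Python terms, here is a rough sketch of what PySequence_Fast does (an illustrative analogue, not the actual implementation):

def py_sequence_fast(v):
    # Rough Python analogue of CPython's PySequence_Fast (a sketch, not
    # the real code): lists and tuples pass through untouched; any other
    # iterable is materialized into a list.
    if isinstance(v, (list, tuple)):
        return v
    return list(v)  # raises TypeError if v is not iterable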

The conversion of the generator expression to a list means that the usual benefits of generators (a smaller memory footprint and the potential for short-circuiting) don't apply to str.join, and so the (small) additional overhead that the generator has makes its performance worse.
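You can check this with a quick timing sketch (numbers will vary by machine and Python version; the margin is small):

import timeit

setup = "data = [str(n) for n in range(10000)]"
# join converts the generator to a list anyway, so the list comprehension
# usually edges out the generator expression by a small margin.
print(timeit.timeit("','.join([s for s in data])", setup=setup, number=1000))
print(timeit.timeit("','.join(s for s in data)", setup=setup, number=1000))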

List comprehension vs generator expression's weird timeit results?

Expanding on Paulo's answer, generator expressions are often slower than list comprehensions because of the overhead of function calls. In this case, the short-circuiting behavior of in offsets that slowness if the item is found fairly early, but otherwise, the pattern holds.

I ran a simple script through the profiler for a more detailed analysis. Here's the script:

lis = [['a','b','c'], ['d','e','f'], [1,2,3], [4,5,6],
       [7,8,9], [10,11,12], [13,14,15], [16,17,18]]

def ge_d():
    return 'd' in (y for x in lis for y in x)
def lc_d():
    return 'd' in [y for x in lis for y in x]

def ge_11():
    return 11 in (y for x in lis for y in x)
def lc_11():
    return 11 in [y for x in lis for y in x]

def ge_18():
    return 18 in (y for x in lis for y in x)
def lc_18():
    return 18 in [y for x in lis for y in x]

for i in xrange(100000):
    ge_d()
    lc_d()
    ge_11()
    lc_11()
    ge_18()
    lc_18()

Here are the relevant results, reordered to make the patterns clearer.

         5400002 function calls in 2.830 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
   100000    0.158    0.000    0.251    0.000  fop.py:3(ge_d)
   500000    0.092    0.000    0.092    0.000  fop.py:4(<genexpr>)
   100000    0.285    0.000    0.285    0.000  fop.py:5(lc_d)

   100000    0.356    0.000    0.634    0.000  fop.py:8(ge_11)
  1800000    0.278    0.000    0.278    0.000  fop.py:9(<genexpr>)
   100000    0.333    0.000    0.333    0.000  fop.py:10(lc_11)

   100000    0.435    0.000    0.806    0.000  fop.py:13(ge_18)
  2500000    0.371    0.000    0.371    0.000  fop.py:14(<genexpr>)
   100000    0.344    0.000    0.344    0.000  fop.py:15(lc_18)

Creating a generator expression is equivalent to creating a generator function and calling it. That accounts for one call to <genexpr>. Then, in the first case, next is called 4 times, until d is reached, for a total of 5 calls (times 100000 iterations = ncalls = 500000). In the second case, it is called 17 times, for a total of 18 calls; and in the third, 24 times, for a total of 25 calls.
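You can see the short-circuiting directly: the in test stops consuming the generator as soon as it finds a match, leaving the rest unconsumed. A minimal demonstration:

lis = [['a', 'b', 'c'], ['d', 'e', 'f']]
gen = (y for x in lis for y in x)
print('d' in gen)   # True -- consumed 'a', 'b', 'c', 'd', then stopped
print(list(gen))    # ['e', 'f'] -- the items after 'd' were never produced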

The genex outperforms the list comprehension in the first case, but the extra calls to next account for most of the difference between the speed of the list comprehension and the speed of the generator expression in the second and third cases.

>>> .634 - .278 - .333
0.023
>>> .806 - .371 - .344
0.091

I'm not sure what accounts for the remaining time; it seems that generator expressions would be a hair slower even without the additional function calls. I suppose this confirms inspectorG4dget's assertion that "creating a generator comprehension has more native overhead than does a list comprehension." But in any case, this shows pretty clearly that generator expressions are slower mostly because of calls to next.

I'll add that when short-circuiting doesn't help, list comprehensions are still faster, even for very large lists. For example:

>>> counter = itertools.count()
>>> lol = [[counter.next(), counter.next(), counter.next()]
...        for _ in range(1000000)]
>>> 2999999 in (i for sublist in lol for i in sublist)
True
>>> 3000000 in (i for sublist in lol for i in sublist)
False
>>> %timeit 2999999 in [i for sublist in lol for i in sublist]
1 loops, best of 3: 312 ms per loop
>>> %timeit 2999999 in (i for sublist in lol for i in sublist)
1 loops, best of 3: 351 ms per loop
>>> %timeit any([2999999 in sublist for sublist in lol])
10 loops, best of 3: 161 ms per loop
>>> %timeit any(2999999 in sublist for sublist in lol)
10 loops, best of 3: 163 ms per loop
>>> %timeit for i in [2999999 in sublist for sublist in lol]: pass
1 loops, best of 3: 171 ms per loop
>>> %timeit for i in (2999999 in sublist for sublist in lol): pass
1 loops, best of 3: 183 ms per loop

As you can see, when short-circuiting is irrelevant, list comprehensions are consistently faster, even for a million-item-long list of lists. Obviously for actual uses of in at these scales, generators will be faster because of short-circuiting. But for other kinds of iterative tasks that are truly linear in the number of items, list comprehensions are pretty much always faster. This is especially true if you need to perform multiple tests on a list; you can iterate over an already-built list comprehension very quickly:

>>> incache = [2999999 in sublist for sublist in lol]
>>> get_list = lambda: incache
>>> get_gen = lambda: (2999999 in sublist for sublist in lol)
>>> %timeit for i in get_list(): pass
100 loops, best of 3: 18.6 ms per loop
>>> %timeit for i in get_gen(): pass
1 loops, best of 3: 187 ms per loop

In this case, the list comprehension is an order of magnitude faster!

Of course, this only remains true until you run out of memory. Which brings me to my final point. There are two main reasons to use a generator: to take advantage of short-circuiting, and to save memory. For very large sequences/iterables, generators are the obvious way to go, because they save memory. But if short-circuiting is not an option, you pretty much never choose generators over lists for speed. You choose them to save memory, and it's always a trade-off.
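A quick way to see the memory side of that trade-off (Python 3 here; exact byte counts vary by version and platform):

import sys

lst = [n for n in range(1000000)]
gen = (n for n in range(1000000))
print(sys.getsizeof(lst))  # several megabytes: every element is referenced
print(sys.getsizeof(gen))  # on the order of 100 bytes, regardless of length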

Joining strings. Generator or list comprehension?

When you call str.join(gen) where gen is a generator, Python does the equivalent of list(gen) before going on to examine the length of the resulting sequence.

Specifically, if you look at the code implementing str.join in CPython, you'll see this call:

    fseq = PySequence_Fast(seq, "can only join an iterable");

The call to PySequence_Fast converts the seq argument into a list if it wasn't a list or tuple already.

So, the two versions of your call are handled almost identically. In the list comprehension, you're building the list yourself and passing it into join. In the generator expression version, the generator object you pass in gets turned into a list right at the start of join, and the rest of the code operates the same for both versions.
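One observable consequence: join fully consumes the generator up front, just as list would:

gen = (c for c in 'abc')
print(''.join(gen))  # 'abc'
print(''.join(gen))  # '' -- the generator was exhausted by the first join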

Why is summing list comprehension faster than generator expression?

I took a look at the disassembly of each construct (using dis). I did this by declaring these two functions:

def list_comprehension():
    return sum([ch in A for ch in B])

def generation_expression():
    return sum(ch in A for ch in B)

and then calling dis.dis with each function.

For the list comprehension:

  0 BUILD_LIST               0
  2 LOAD_FAST                0 (.0)
  4 FOR_ITER                12 (to 18)
  6 STORE_FAST               1 (ch)
  8 LOAD_FAST                1 (ch)
 10 LOAD_GLOBAL              0 (A)
 12 COMPARE_OP               6 (in)
 14 LIST_APPEND              2
 16 JUMP_ABSOLUTE            4
 18 RETURN_VALUE

and for the generator expression:

  0 LOAD_FAST                0 (.0)
  2 FOR_ITER                14 (to 18)
  4 STORE_FAST               1 (ch)
  6 LOAD_FAST                1 (ch)
  8 LOAD_GLOBAL              0 (A)
 10 COMPARE_OP               6 (in)
 12 YIELD_VALUE
 14 POP_TOP
 16 JUMP_ABSOLUTE            2
 18 LOAD_CONST               0 (None)
 20 RETURN_VALUE

The disassembly for the actual summation is:

  0 LOAD_GLOBAL              0 (sum)
  2 LOAD_CONST               1 (<code object <genexpr> at 0x7f49dc395240, file "/home/mishac/dev/python/kintsugi/KintsugiModels/automated_tests/a.py", line 12>)
  4 LOAD_CONST               2 ('generation_expression.<locals>.<genexpr>')
  6 MAKE_FUNCTION            0
  8 LOAD_GLOBAL              1 (B)
 10 GET_ITER
 12 CALL_FUNCTION            1
 14 CALL_FUNCTION            1
 16 RETURN_VALUE

but this sum disassembly was constant between both your examples, with the only difference being the loading of generation_expression.<locals>.<genexpr> vs list_comprehension.<locals>.<listcomp> (so just loading a different code object and name).

The differing bytecode instructions between the first two disassemblies are LIST_APPEND for the list comprehension vs. the conjunction of YIELD_VALUE and POP_TOP for the generator expression.

I won't pretend I know the internals of Python bytecode, but what I gather from this is that the generator expression behaves like a queue, where each value is generated and then popped. This popping doesn't have to happen in a list comprehension, which leads me to believe there'll be a slight amount of overhead in using generators.

Now this doesn't mean that generators are always going to be slower. Generators excel at being memory-efficient, so there will be a threshold N such that list comprehensions perform slightly better below it (because memory use won't be a problem), but beyond it, generators will perform significantly better.
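If you want to find where that crossover sits on your own machine, here's a minimal sketch (A and B are stand-ins matching the functions above):

import timeit

setup = "A = set('aeiou'); B = 'the quick brown fox' * 1000"
# While everything fits comfortably in memory, the list comprehension
# typically wins by a small margin; measure rather than assume.
print(timeit.timeit("sum([ch in A for ch in B])", setup=setup, number=1000))
print(timeit.timeit("sum(ch in A for ch in B)", setup=setup, number=1000))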

Why is this list comprehension faster than the equivalent generator expression?

I believe the difference here is entirely in the cost of the roughly 333K additions that sum performs. Testing with 64-bit Python.org 3.3.0 on Mac OS X:

In [698]: %timeit len ([None for n in range (1, 1000000) if n%3 == 1])
10 loops, best of 3: 127 ms per loop
In [699]: %timeit sum (1 for n in range (1, 1000000) if n%3 == 1)
10 loops, best of 3: 138 ms per loop
In [700]: %timeit sum ([1 for n in range (1, 1000000) if n%3 == 1])
10 loops, best of 3: 139 ms per loop

So, it's not that the comprehension is faster than the genexp; they both take about the same time. But calling len on a list is instant, while summing 1M numbers adds another 7% to the total time.

Throwing a few different numbers at it, this seems to hold up unless the list is very tiny (in which case it does seem to get faster), or large enough that memory allocation starts to become a significant factor (which it isn't yet, at 333K).
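The reason len is effectively free is that a list stores its length and len just reads it back in constant time. A quick sanity check (illustrative, not from the original timings):

import timeit

small = [None] * 10
big = [None] * (10 ** 7)
# Both take essentially the same time: len() reads a stored size field.
print(timeit.timeit(lambda: len(small), number=1000000))
print(timeit.timeit(lambda: len(big), number=1000000))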

Why is updating a list faster when using a list comprehension as opposed to a generator expression?

This answer concerns the CPython implementation only. Using a list comprehension is faster here, since the generator is first converted into a list anyway. This is done because the length of the sequence must be determined before the data is replaced, and a generator can't tell you its length.
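At the Python level the conversion is invisible; a generator works fine as the right-hand side, it just gets listified internally first:

L = [0, 1, 2, 3]
L[::2] = (x for x in 'ab')  # the generator is converted to a list internally
print(L)                    # ['a', 1, 'b', 3]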

For list slice assignment, this operation is handled by the amusingly named list_ass_slice. There is special-case handling here for assigning a list or tuple: they can use the PySequence_Fast ops.

This is the v3.7.4 implementation of PySequence_Fast, where you can clearly see the type check for lists and tuples:

PyObject *
PySequence_Fast(PyObject *v, const char *m)
{
    PyObject *it;

    if (v == NULL) {
        return null_error();
    }

    if (PyList_CheckExact(v) || PyTuple_CheckExact(v)) {
        Py_INCREF(v);
        return v;
    }

    it = PyObject_GetIter(v);
    if (it == NULL) {
        if (PyErr_ExceptionMatches(PyExc_TypeError))
            PyErr_SetString(PyExc_TypeError, m);
        return NULL;
    }

    v = PySequence_List(it);
    Py_DECREF(it);

    return v;
}

A generator expression will fail this type check and continue to the fallback code, where it is converted into a list object, so that the length can be predetermined.

In the general case, a predetermined length is desirable in order to allow efficient allocation of list storage, and also to provide useful error messages with extended slice assignment:

>>> vals = (x for x in 'abc')
>>> L = [1,2,3]
>>> L[::2] = vals # attempt assigning 3 values into 2 positions
---------------------------------------------------------------------------
Traceback (most recent call last)
...
ValueError: attempt to assign sequence of size 3 to extended slice of size 2
>>> L # data unchanged
[1, 2, 3]
>>> list(vals) # generator was fully consumed
[]

Speed/efficiency comparison for loop vs list comprehension vs other methods

Don't be too quick to write off the humble for loop. If you don't actually need a list, as in this case, a standard for loop can be faster than using a list comprehension. And of course it has lower memory overhead.

Here's a program to perform timing tests; it can be easily modified to add more tests.

#!/usr/bin/env python

''' Time various implementations of string diff function

From http://stackoverflow.com/q/28581218/4014959

Written by PM 2Ring 2015.02.18
'''

from itertools import imap, izip, starmap
from operator import ne

from timeit import Timer
from random import random, seed

def h_dist0(s1, s2):
    ''' For loop '''
    tot = 0
    for d1, d2 in zip(s1, s2):
        if d1 != d2:
            tot += 1
    return tot

def h_dist1(s1, s2):
    ''' List comprehension '''
    return sum([dgt1 != dgt2 for dgt1, dgt2 in zip(s1, s2)])

def h_dist2(s1, s2):
    ''' Generator expression '''
    return sum(dgt1 != dgt2 for dgt1, dgt2 in zip(s1, s2))

def h_dist3(s1, s2):
    ''' Generator expression with if '''
    return sum(1 for dgt1, dgt2 in zip(s1, s2) if dgt1 != dgt2)

def h_dist3a(s1, s2):
    ''' Generator expression with izip '''
    return sum(1 for dgt1, dgt2 in izip(s1, s2) if dgt1 != dgt2)

def h_dist4(s1, s2):
    ''' imap '''
    return sum(imap(ne, s1, s2))

def h_dist5(s1, s2):
    ''' starmap '''
    return sum(starmap(ne, izip(s1, s2)))

funcs = [
    h_dist0,
    h_dist1,
    h_dist2,
    h_dist3,
    h_dist3a,
    h_dist4,
    h_dist5,
]

# ------------------------------------

def check_full():
    print 'Testing all functions with strings of length', len(s1)
    for func in funcs:
        print '%s:%s\n%d\n' % (func.func_name, func.__doc__, func(s1, s2))

def check():
    print 'Testing all functions with strings of length', len(s1)
    print [func(s1, s2) for func in funcs], '\n'

def time_test(loops=10000, reps=3):
    ''' Print timing stats for all the functions '''
    slen = len(s1)
    print 'Length = %d, Loops = %d, Repetitions = %d' % (slen, loops, reps)

    for func in funcs:
        # Get function name and docstring
        fname = func.func_name
        fdoc = func.__doc__

        print '\n%s:%s' % (fname, fdoc)
        t = Timer('%s(s1, s2)' % fname, 'from __main__ import s1, s2, %s' % fname)
        results = t.repeat(reps, loops)
        results.sort()
        print results
    print '\n' + '- ' * 30 + '\n'

def make_strings(n, r=0.5):
    print 'r:', r
    s1 = 'a' * n
    s2 = ''.join(['b' if random() < r else 'a' for _ in xrange(n)])
    return s1, s2

# ------------------------------------

seed(37)

s1, s2 = make_strings(100)
#print '%s\n%s\n' % (s1, s2)
check()
time_test(10000)

s1, s2 = make_strings(100, 0.1)
check()
time_test(10000)

s1, s2 = make_strings(100, 0.9)
check()
time_test(10000)

s1, s2 = make_strings(10)
check()
time_test(50000)

s1, s2 = make_strings(1000)
check()
time_test(1000)

The results below are from a 32-bit 2 GHz Pentium 4 running Python 2.6.6 on Linux.

output

r: 0.5
Testing all functions with strings of length 100
[45, 45, 45, 45, 45, 45, 45]

Length = 100, Loops = 10000, Repetitions = 3

h_dist0: For loop
[0.62271595001220703, 0.63597297668457031, 0.65991997718811035]

h_dist1: List comprehension
[0.80136799812316895, 1.0849411487579346, 1.1687240600585938]

h_dist2: Generator expression
[0.81829214096069336, 0.82315492630004883, 0.85774612426757812]

h_dist3: Generator expression with if
[0.67409086227416992, 0.67418098449707031, 0.68189001083374023]

h_dist3a: Generator expression with izip
[0.54596519470214844, 0.54696321487426758, 0.54910516738891602]

h_dist4: imap
[0.4574120044708252, 0.45927596092224121, 0.46362900733947754]

h_dist5: starmap
[0.38610100746154785, 0.38653087615966797, 0.39858913421630859]

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

r: 0.1
Testing all functions with strings of length 100
[13, 13, 13, 13, 13, 13, 13]

Length = 100, Loops = 10000, Repetitions = 3

h_dist0: For loop
[0.59487199783325195, 0.61918497085571289, 0.62035894393920898]

h_dist1: List comprehension
[0.77733206748962402, 0.77883815765380859, 0.78676295280456543]

h_dist2: Generator expression
[0.8313758373260498, 0.83669614791870117, 0.8419950008392334]

h_dist3: Generator expression with if
[0.60900688171386719, 0.61443901062011719, 0.6202390193939209]

h_dist3a: Generator expression with izip
[0.48425912857055664, 0.48703289031982422, 0.49215483665466309]

h_dist4: imap
[0.45452284812927246, 0.46001195907592773, 0.4652099609375]

h_dist5: starmap
[0.37329483032226562, 0.37666082382202148, 0.40111804008483887]

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

r: 0.9
Testing all functions with strings of length 100
[94, 94, 94, 94, 94, 94, 94]

Length = 100, Loops = 10000, Repetitions = 3

h_dist0: For loop
[0.69256496429443359, 0.69339799880981445, 0.70190787315368652]

h_dist1: List comprehension
[0.80547499656677246, 0.81107187271118164, 0.81337189674377441]

h_dist2: Generator expression
[0.82524299621582031, 0.82638883590698242, 0.82899308204650879]

h_dist3: Generator expression with if
[0.80344915390014648, 0.8050081729888916, 0.80581092834472656]

h_dist3a: Generator expression with izip
[0.63276004791259766, 0.63585305213928223, 0.64699077606201172]

h_dist4: imap
[0.46122288703918457, 0.46677708625793457, 0.46921491622924805]

h_dist5: starmap
[0.38288688659667969, 0.38731098175048828, 0.38867902755737305]

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

r: 0.5
Testing all functions with strings of length 10
[5, 5, 5, 5, 5, 5, 5]

Length = 10, Loops = 50000, Repetitions = 3

h_dist0: For loop
[0.55377697944641113, 0.55385804176330566, 0.56589198112487793]

h_dist1: List comprehension
[0.69614696502685547, 0.71386599540710449, 0.71778011322021484]

h_dist2: Generator expression
[0.74240994453430176, 0.77340388298034668, 0.77429509162902832]

h_dist3: Generator expression with if
[0.66713404655456543, 0.66874384880065918, 0.67353487014770508]

h_dist3a: Generator expression with izip
[0.59427285194396973, 0.59525203704833984, 0.60147690773010254]

h_dist4: imap
[0.46971893310546875, 0.4749150276184082, 0.4831998348236084]

h_dist5: starmap
[0.46615099906921387, 0.47054886817932129, 0.47225403785705566]

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

r: 0.5
Testing all functions with strings of length 1000
[506, 506, 506, 506, 506, 506, 506]

Length = 1000, Loops = 1000, Repetitions = 3

h_dist0: For loop
[0.59869503974914551, 0.60042905807495117, 0.60753512382507324]

h_dist1: List comprehension
[0.68359518051147461, 0.70072579383850098, 0.7146599292755127]

h_dist2: Generator expression
[0.7492527961730957, 0.75325894355773926, 0.75805497169494629]

h_dist3: Generator expression with if
[0.59286904335021973, 0.59505105018615723, 0.59793591499328613]

h_dist3a: Generator expression with izip
[0.49536395072937012, 0.49821090698242188, 0.54327893257141113]

h_dist4: imap
[0.42384982109069824, 0.43060398101806641, 0.43535709381103516]

h_dist5: starmap
[0.34122705459594727, 0.35040402412414551, 0.35851287841796875]

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

Python list comprehension vs generator

In the simple case, it will be fastest to do this without a comprehension/generator:

sum(xrange(9999999))

Normally, if I need to do some sort of operation where I need to choose between a comprehension and generator expression, I do:

sum(a*b for a, b in zip(c, d))

Personally, I think that the generator expression (without the extra parentheses[1]) looks nicer, and since readability counts, this outweighs any micro performance differences between the two expressions.

Generators will frequently be faster for things like this because they avoid creating an intermediate list (and the memory allocation associated with it). The timing difference is probably more pronounced as the list gets bigger, since memory allocation and list resizing take more time for bigger lists. This isn't always the case, however (it is well documented on StackOverflow that str.join works faster with lists than with generators in CPython, because when str.join gets a generator, it constructs the list anyway...).

[1] You can omit the parentheses any time you are passing a generator expression to a function as the only argument, which happens more frequently than you might expect...
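For example (c and d are placeholder lists; the second call needs the extra parentheses because sum is given a second argument):

c, d = [1, 2, 3], [4, 5, 6]
sum(a*b for a, b in zip(c, d))       # OK: the genexp is the sole argument
sum((a*b for a, b in zip(c, d)), 0)  # extra parentheses required here
# sum(a*b for a, b in zip(c, d), 0)  # SyntaxError without them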


