Length of Generator Output

Length of generator output

There isn't one, because you can't do it in the general case: what if you have a lazy infinite generator? For example:

def fib():
    a, b = 0, 1
    while True:
        a, b = b, a + b
        yield a

This never terminates but will generate the Fibonacci numbers. You can get as many Fibonacci numbers as you want by calling next() on it.
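If you only need a finite prefix, you can slice it off lazily; a minimal sketch using itertools.islice with the fib() generator above:

from itertools import islice

# Take the first ten Fibonacci numbers from the infinite generator.
first_ten = list(islice(fib(), 10))
print(first_ten)  # [1, 1, 2, 3, 5, 8, 13, 21, 34, 55]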

If you really need to know the number of items, then you aren't going to stream through them linearly in a single pass anyway, so just use a different data structure, such as a regular list.

How to len(generator())

Generators have no length; they aren't collections, after all.

Generators are functions with an internal state (and fancy syntax). You can repeatedly call them to get a sequence of values, so you can use them in a loop. But they don't contain any elements, so asking for the length of a generator is like asking for the length of a function.
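You can see this directly: a generator object defines no __len__, so len() raises a TypeError:

def gen():
    yield 1
    yield 2

g = gen()
try:
    len(g)
except TypeError as e:
    print(e)  # object of type 'generator' has no len()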

If functions in Python are objects, couldn't I assign the length to a variable of this object that would be accessible to the new generator?

Functions are objects, and you can in fact attach attributes to them, but the generator objects they return do not accept new attributes (they have no instance __dict__). The reason is probably to keep such a basic object as efficient as possible.
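A quick check (CPython: plain functions happily accept new attributes; generator objects do not):

def f():
    yield 1

f.length = 10      # fine: functions have a __dict__
print(f.length)    # 10

g = f()
try:
    g.length = 10  # generator objects have no __dict__
except AttributeError as e:
    print(e)       # raises AttributeError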

You can however simply return (generator, length) pairs from your functions or wrap the generator in a simple object like this:

class GeneratorLen(object):
    def __init__(self, gen, length):
        self.gen = gen
        self.length = length

    def __len__(self):
        return self.length

    def __iter__(self):
        return self.gen

g = some_generator()
h = GeneratorLen(g, 1)
print len(h), list(h)
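For a concrete, runnable variant using the GeneratorLen class above (Python 3 print syntax; squares() is just an illustrative generator whose length is known up front):

def squares(n):
    for i in range(n):
        yield i * i

h = GeneratorLen(squares(5), 5)
print(len(h))   # 5
print(list(h))  # [0, 1, 4, 9, 16]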

What's the shortest way to count the number of items in a generator/iterator?

Calls to itertools.imap() in Python 2, or map() in Python 3, can be replaced by an equivalent generator expression:

sum(1 for dummy in it)

This also uses a lazy generator expression, so it avoids materializing a full list of the iterator's elements in memory. Note that counting this way consumes the iterator.
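A short sketch of both points, using a throwaway generator expression:

it = (x * x for x in range(5))
print(sum(1 for dummy in it))  # 5
print(list(it))                # [] -- counting exhausted the iterator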

Size of generator object in Python

The space a generator takes in memory is just bookkeeping info. It holds a reference to the frame object (the administrative state for the running Python code, such as its locals), a flag for whether it is running right now, and a reference to the code object. Nothing more:

>>> x=(i for i in range(1,11))
>>> dir(x)
['__class__', '__delattr__', '__doc__', '__format__', '__getattribute__', '__hash__', '__init__', '__iter__', '__name__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', 'close', 'gi_code', 'gi_frame', 'gi_running', 'next', 'send', 'throw']
>>> x.gi_frame
<frame object at 0x1053b4ad0>
>>> x.gi_running
0
>>> x.gi_code
<code object <genexpr> at 0x1051af5b0, file "<stdin>", line 1>

That's just 3 references, plus the usual Python object type info (think reference counting) and a weak-references list; so that's about 4 pointers, an integer and a struct, which on your system take 40 bytes (on my system, 64-bit OS X, it is 80 bytes). sys.getsizeof() reports on the size of just that structure as implemented in C, and it doesn't recurse over pointers.
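A quick way to see that only this fixed structure is measured (exact byte counts vary by Python version and platform):

import sys

small = (i for i in range(10))
large = (i for i in range(10**9))
# Both report the same size: getsizeof() sees only the generator struct,
# not anything the frame or code object reference.
print(sys.getsizeof(small) == sys.getsizeof(large))  # True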

As such, that amount of memory won't change when you have run through the generator. The referenced frame may change in how much memory is used (if the generator expression references large objects towards one end or the other) but you won't see that with the result of sys.getsizeof() on the generator object; look at the frame locals instead:

>>> next(x)
1
>>> x.gi_frame.f_locals
{'i': 1, '.0': <listiterator object at 0x105339dd0>}

The .0 object is the range() iterator that the generator uses in its for loop; i is the loop target. The listiterator is another iterator that holds a private reference to the list range() produced, plus a position counter, so it can yield the next element each time it is asked.

You cannot query a generator for the size of its elements; it produces them as needed, so you cannot know a priori how many it will produce, nor how much it has produced after running. sys.getsizeof() certainly won't tell you; it measures only the one object's own footprint, and you'd have to recursively measure all referenced objects if you want to know the total.
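A rough sketch of such a recursive measurement (the skip-list of shared object types is a simplification of mine; treat the result as an estimate, not an exact accounting):

import sys
import gc
import types

def total_size(obj, seen=None):
    # Recursively sum getsizeof() over gc-tracked references, skipping
    # code/type/module objects that would drag in interpreter internals.
    if seen is None:
        seen = set()
    if id(obj) in seen:
        return 0
    seen.add(id(obj))
    size = sys.getsizeof(obj)
    for ref in gc.get_referents(obj):
        if isinstance(ref, (type, types.ModuleType, types.CodeType)):
            continue
        size += total_size(ref, seen)
    return size

data = [list(range(100)) for _ in range(3)]
print(sys.getsizeof(data))  # just the outer list
print(total_size(data))     # outer list + inner lists + their ints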

You can see that the generator has completed its run from the frame; it is cleared once it is done:

>>> x.gi_frame
<frame object at 0x1053b4ad0>
>>> list(x)
[2, 3, 4, 5, 6, 7, 8, 9, 10]
>>> x.gi_frame is None
True

So in the end, the memory used for the generator resides in the structures referenced by the frame (its locals, and possibly globals, where each object in those namespaces may reference other objects in turn). When the generator is done, the frame is cleared, the generator's .gi_frame pointer is set to the None singleton, and the frame itself can be freed once its reference count drops to 0.

All this only applies to generators, not to iterables in general; generators are Python code and thus can be introspected this deeply.

How to measure length of generator sequence (list comp vs generator expression)

On this computer, the generator expression becomes faster than the list comprehension somewhere between 100,000 and 1,000,000 iterations:

$ python -m timeit "sum(1 for x in xrange(100000))"
10 loops, best of 3: 34.8 msec per loop
$ python -m timeit "sum([1 for x in xrange(100000)])"
10 loops, best of 3: 20.8 msec per loop
$ python -m timeit "sum(1 for x in xrange(1000000))"
10 loops, best of 3: 315 msec per loop
$ python -m timeit "sum([1 for x in xrange(1000000)])"
10 loops, best of 3: 469 msec per loop
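These timings are from Python 2, where xrange() is lazy; on Python 3 the same comparison is spelled with range(). Absolute numbers vary by machine and interpreter version, so re-measure on your own target:

$ python3 -m timeit "sum(1 for x in range(1000000))"
$ python3 -m timeit "sum([1 for x in range(1000000)])"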


Iterables / generators with specified length

It sounds like you're asking about something like __length_hint__. Excerpts from PEP 424 – A method for exposing a length hint:

CPython currently defines a __length_hint__ method on several types, such as various iterators. This method is then used by various other functions (such as list) to presize lists based on the estimate returned by __length_hint__. Types which are not sized, and thus should not define __len__, can then define __length_hint__, to allow estimating or computing a size (such as many iterators).

Being able to pre-allocate lists based on the expected size, as estimated by __length_hint__, can be a significant optimization. CPython has been observed to run some code faster than PyPy, purely because of this optimization being present.

For example, range iterators support this:

it = iter(range(1000))
print(it.__length_hint__()) # prints 1000
next(it)
print(it.__length_hint__()) # prints 999

And list iterators even take list length changes into account:

a = [None] * 10
it = iter(a)
print(it.__length_hint__()) # prints 10
next(it)
print(it.__length_hint__()) # prints 9
a.pop()
print(it.__length_hint__()) # prints 8
a.append(None)
print(it.__length_hint__()) # prints 9
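The public way to read a hint is operator.length_hint(), added by the same PEP; it falls back to __len__ when available, and to a default (0 unless given) otherwise:

from operator import length_hint

print(length_hint(iter(range(1000))))          # 1000, via __length_hint__
print(length_hint([1, 2, 3]))                  # 3, via __len__
print(length_hint(x for x in range(5)))        # 0 -- generators give no hint
print(length_hint((x for x in range(5)), 42))  # 42, the supplied default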

Generator iterators don't support it, but you can support it in other iterators you write. Here's a demo iterator that...

  • Produces 10,000 elements.
  • Hints at having 5,000 elements.
  • After every 1,000 elements it shows the memory size of the list being built.

import gc

beacon = object()

class MyIterator:
    def __init__(self):
        self.n = 10_000

    def __iter__(self):
        return self

    def __length_hint__(self):
        print('__length_hint__ called')
        return 5_000

    def __next__(self):
        if self.n == 0:
            raise StopIteration
        self.n -= 1
        if self.n % 1_000 == 0:
            # Find the list that list(MyIterator()) is building
            # and report its currently allocated size.
            for obj in gc.get_objects():
                if isinstance(obj, list) and obj and obj[0] is beacon:
                    print(obj.__sizeof__())
        return beacon

list(MyIterator())

Output:

__length_hint__ called
45088
45088
45088
45088
45088
50776
57168
64360
72456
81560

We see that list asks for a length hint once and pre-allocates enough memory from the start for 5,000 references of 8 bytes each plus 12.5% overallocation: 5,000 × 8 × 1.125 = 45,000 bytes, which together with the list object's fixed overhead gives the 45088 shown. After the first 5,000 elements it stops asking for length hints and grows the list bit by bit.

If my __length_hint__ instead accurately returns 10,000, then list pre-allocates 90088 bytes (the same arithmetic with 10,000 references), and that size remains until the end.


