How to len(generator())

Generators have no length; they aren't collections, after all.

Generators are functions with internal state (and fancy syntax). You can repeatedly ask them for the next value, so you can use them in a loop. But they don't contain any elements, so asking for the length of a generator is like asking for the length of a function.
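
A minimal sketch of that internal state (counter is just an illustration):

def counter():
    n = 0
    while True:
        n += 1
        yield n  # execution pauses here; n is kept alive between requests

c = counter()
print(next(c), next(c), next(c))  # 1 2 3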

If functions in Python are objects, couldn't I assign the length to a variable of this object that would be accessible to the new generator?

Functions are objects, and you can even assign attributes to them, but that doesn't help here: the generator objects a generator function returns do not accept new attributes at all (they have no __dict__, probably to keep such a basic object as efficient as possible).
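
You can see this directly:

def gen():
    yield 1

g = gen()
g.length = 3  # AttributeError: 'generator' object has no attribute 'length'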

You can, however, simply return (generator, length) pairs from your functions (a sketch of that follows below), or wrap the generator in a simple object like this:

class GeneratorLen(object):
    def __init__(self, gen, length):
        self.gen = gen
        self.length = length

    def __len__(self):
        return self.length

    def __iter__(self):
        return self.gen

g = some_generator()
h = GeneratorLen(g, 1)  # you have to know the length up front
print(len(h), list(h))
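
The pair-returning alternative mentioned above could look like this (numbers_with_len is a made-up example):

def numbers_with_len(n):
    return (i * i for i in range(n)), n

gen, length = numbers_with_len(5)
print(length, list(gen))  # 5 [0, 1, 4, 9, 16]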

Length of generator output

There is no built-in way, because it can't be done in the general case: what if you have a lazy infinite generator? For example:

def fib():
    a, b = 0, 1
    while True:
        a, b = b, a + b
        yield a

This never terminates but will generate the Fibonacci numbers. You can get as many Fibonacci numbers as you want by calling next().
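
If you only need a finite prefix of such a generator, itertools.islice takes it without ever needing a length:

import itertools

print(list(itertools.islice(fib(), 10)))  # [1, 1, 2, 3, 5, 8, 13, 21, 34, 55]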

If you really need to know the number of items there are, then you can't iterate through them linearly one time anyway, so just use a different data structure such as a regular list.

What's the shortest way to count the number of items in a generator/iterator?

One of the shortest is a generator expression fed to sum() (calls to itertools.imap() in Python 2 or map() in Python 3 can be replaced by equivalent generator expressions):

sum(1 for dummy in it)

This also uses a lazy generator, so it avoids materializing a full list of all iterator elements in memory.
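
Note that counting this way consumes the iterator:

it = (n * n for n in range(5))
print(sum(1 for dummy in it))  # 5
print(sum(1 for dummy in it))  # 0 -- the iterator is now exhausted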

Iterables / generators with specified length

It sounds like you're asking about something like __length_hint__. Excerpts from PEP 424 – A method for exposing a length hint:

CPython currently defines a __length_hint__ method on several types, such as various iterators. This method is then used by various other functions (such as list) to presize lists based on the estimate returned by __length_hint__. Types which are not sized, and thus should not define __len__, can then define __length_hint__, to allow estimating or computing a size (such as many iterators).

Being able to pre-allocate lists based on the expected size, as estimated by __length_hint__, can be a significant optimization. CPython has been observed to run some code faster than PyPy, purely because of this optimization being present.

For example, range iterators support this:

it = iter(range(1000))
print(it.__length_hint__()) # prints 1000
next(it)
print(it.__length_hint__()) # prints 999

And list iterators even take list length changes into account:

a = [None] * 10
it = iter(a)
print(it.__length_hint__()) # prints 10
next(it)
print(it.__length_hint__()) # prints 9
a.pop()
print(it.__length_hint__()) # prints 8
a.append(None)
print(it.__length_hint__()) # prints 9
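
The same PEP also added operator.length_hint(), which reads the hint generically and falls back to a default when none is available:

from operator import length_hint

it = iter(range(1000))
print(length_hint(it))       # 1000
gen = (x for x in range(1000))
print(length_hint(gen, -1))  # -1 -- generator iterators provide no hint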

Generator iterators don't support it, but you can support it in other iterators you write. Here's a demo iterator that...

  • Produces 10,000 elements.
  • Hints at having 5,000 elements.
  • After every 1,000 elements it shows the memory size of the list being built.

import gc

beacon = object()

class MyIterator:
    def __init__(self):
        self.n = 10_000
    def __iter__(self):
        return self
    def __length_hint__(self):
        print('__length_hint__ called')
        return 5_000
    def __next__(self):
        if self.n == 0:
            raise StopIteration
        self.n -= 1
        if self.n % 1_000 == 0:
            # spot the list that list() is currently building: it starts with our beacon
            for obj in gc.get_objects():
                if isinstance(obj, list) and obj and obj[0] is beacon:
                    print(obj.__sizeof__())
        return beacon

list(MyIterator())

Output:

__length_hint__ called
45088
45088
45088
45088
45088
50776
57168
64360
72456
81560

We see that list asks for a length hint and from the start pre-allocates enough memory for 5,000 references of 8 bytes each, plus 12.5% overallocation. After the first 5,000 elements, it doesn't ask for length hints anymore, and keeps increasing its size bit by bit.

If my __length_hint__ instead accurately returns 10,000, then list instead pre-allocates 90088 bytes and that remains until the end.

Resetting generator object in Python

Another option is to use the itertools.tee() function to create a second version of your generator:

import itertools
y = FunctionWithYield()
y, y_backup = itertools.tee(y)
for x in y:
    print(x)
for x in y_backup:
    print(x)

This can be beneficial from a memory-usage point of view if the original iteration might not process all the items, since tee() only needs to buffer the values that one copy has consumed ahead of the other.
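
A concrete sketch, with a throwaway generator standing in for FunctionWithYield:

import itertools

def squares():
    for i in range(4):
        yield i * i

y, y_backup = itertools.tee(squares())
print(list(y))         # [0, 1, 4, 9]
print(list(y_backup))  # [0, 1, 4, 9] -- the backup copy is still intact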

How to pick just one item from a generator?

Create a generator using

g = myfunct()

Every time you would like an item, use

next(g)

(or g.next() in Python 2.5 or below).

If the generator exits, it will raise StopIteration. You can either catch this exception if necessary, or use the default argument to next():

next(g, default_value)
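
Putting it together (myfunct here is a stand-in generator function):

def myfunct():
    yield 1
    yield 2

g = myfunct()
print(next(g))          # 1
print(next(g))          # 2
print(next(g, 'done'))  # 'done' -- the generator is exhausted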

Understanding len function with iterators

The biggest reason is that allowing it would reduce type safety.

How many programs have you written where you actually needed to consume an iterable just to know how many elements it had, throwing away anything else?

I, in quite a few years of coding in Python, never needed that. It's a nonsensical operation in normal programs. An iterator may not have a length (e.g. infinite iterators, or generators that expect inputs via send()), so asking for it doesn't make much sense. The fact that len(an_iterator) produces an error means that you can find bugs in your code: you can see that in a certain part of the program you are calling len on the wrong thing, or that your function actually needs a sequence instead of an iterator as you expected.

Removing such errors would create a new class of bugs in which people, by calling len, erroneously consume an iterator, or use an iterator as if it were a sequence without realizing it.

If you really need to know the length of an iterator, what's wrong with len(list(iterator))? The extra 6 characters? It's trivial to write your own version that works for iterators, but, as I said, 99% of the time this simply means that something is wrong with your code, because such an operation doesn't make much sense.

The second reason is that, with that change, you are violating two nice properties of len that currently hold for all (known) containers:

  • It is known to be cheap on all containers ever implemented in Python (all built-ins, the standard library, numpy & scipy, and all other big third-party libraries do this, on both dynamically and statically sized containers). So when you see len(something), you know that the len call is cheap. Making it work with iterators would mean that suddenly all programs might become inefficient due to computations of the length.

    Also note that you can, trivially, implement O(1) __len__ on every container. The cost to pre-compute the length is often negligible, and generally worth paying.
    The only exception would be if you implement immutable containers that have part of their internal representation shared with other instances (to save memory). However, I don't know of any implementation that does this, and most of the time you can achieve better than O(n) time anyway.

    In summary: currently everybody implements __len__ in O(1), and it's easy to continue to do so. So there is an expectation for calls to len to be O(1), even though it's not part of the standard. Python developers intentionally avoid C/C++-style legalese in their documentation and trust the users. In this case, if your __len__ isn't O(1), you're expected to document that.

  • It is known to be non-destructive. Any sensible implementation of __len__ doesn't change its argument. So you can be sure that len(x) == len(x), or that n = len(x); len(list(x)) == n.

    This property, too, is not defined in the documentation; nevertheless it's expected by everyone, and currently nobody violates it.

Such properties are good, because you can reason and make assumptions about code using them.
They can help you ensure the correctness of a piece of code, or understand its asymptotic complexity. The change you propose would make it much harder to look at some code and understand whether it's correct or what its complexity would be, because you would have to keep the special cases in mind.

In summary, the change you are proposing has one really small pro (saving a few characters in very particular situations), but several big disadvantages, which would impact a huge portion of existing code.


Another minor reason: if len consumed iterators, I'm sure some people would start to abuse this for its side effects (replacing the already ugly uses of map or list comprehensions). Suddenly people could write code like:

len(print(something) for ... in ...)

to print text, which is really just ugly; it doesn't read well. Stateful code should be relegated to statements, since statements provide a visual cue of side effects.

Size of generator object in Python

The space a generator takes in memory is just bookkeeping info. It holds a reference to the frame object (the administrative record for the running Python code, such as locals), a flag for whether or not it is running right now, and a reference to the code object. Nothing more:

>>> x=(i for i in range(1,11))
>>> dir(x)
['__class__', '__delattr__', '__doc__', '__format__', '__getattribute__', '__hash__', '__init__', '__iter__', '__name__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', 'close', 'gi_code', 'gi_frame', 'gi_running', 'next', 'send', 'throw']
>>> x.gi_frame
<frame object at 0x1053b4ad0>
>>> x.gi_running
0
>>> x.gi_code
<code object <genexpr> at 0x1051af5b0, file "<stdin>", line 1>

That's just 3 references, plus the usual Python object type info (think reference counting) and a weak-references list; so that's about 4 pointers, an integer and a struct, which on your system take 40 bytes (on my system, 64-bit OS X, it is 80 bytes). sys.getsizeof() reports on the size of just that structure as implemented in C, and it doesn't recurse over pointers.

As such, that amount of memory won't change when you have run through the generator. The referenced frame may change in how much memory is used (if the generator expression references large objects towards one end or the other) but you won't see that with the result of sys.getsizeof() on the generator object; look at the frame locals instead:

>>> next(x)
1
>>> x.gi_frame.f_locals
{'i': 1, '.0': <listiterator object at 0x105339dd0>}

The .0 object is the iterator that the generator is using in the for loop; i is the for loop target. Here that's a listiterator, another iterator object holding a private reference to the list range() produced, plus a position counter, so it can yield the next element each time you ask it to.

You cannot query a generator for an element size; generators produce elements as needed, so you cannot know a priori how many they'll produce, nor how many they have produced after running. sys.getsizeof() certainly won't tell you; it is a tool to measure the memory footprint of a single object, and you'd have to recursively measure all referenced objects if you want to know the total footprint.
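
A quick sketch of that (the exact byte count is platform-dependent):

import sys

x = (i for i in range(1, 11))
before = sys.getsizeof(x)          # size of the generator object itself
list(x)                            # exhaust the generator
print(sys.getsizeof(x) == before)  # True -- only the bookkeeping struct is measured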

You can see that the generator has completed its run from the frame; it is cleared once it is done:

>>> x.gi_frame
<frame object at 0x1053b4ad0>
>>> list(x)
[2, 3, 4, 5, 6, 7, 8, 9, 10]
>>> x.gi_frame is None
True

So, in the end, the memory used for the generator resides in the structures of the frame (the locals, and possibly the globals, with each object in those namespaces possibly referencing other objects again); when the generator is done, the frame is cleared and the generator's .gi_frame pointer is set to point to the None singleton, leaving the frame to be deallocated once its reference count drops to 0.

All this only applies to generators, not to iterables in general; generators are Python code and thus can be introspected this deeply.

Best way of getting both the yield'ed output and return'ed value of a generator in Python

without wrapping it inside another class?

Maybe with just a function instead?

Version 1:

def output_and_return(it):
    def with_result():
        yield (yield from it)

    *elements, result = with_result()
    return elements, result

Version 2:

def output_and_return(it):
    result = None
    def get_result():
        nonlocal result
        result = yield from it
    return list(get_result()), result
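
Either version can be used like this (gen is a stand-in for your generator):

def gen():
    yield 1
    yield 2
    return 'finished'

print(output_and_return(gen()))  # ([1, 2], 'finished')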

