How to len(generator())
Generators have no length; they aren't collections, after all.
Generators are functions with an internal state (and fancy syntax). You can repeatedly call them to get a sequence of values, so you can use them in a loop. But they don't contain any elements, so asking for the length of a generator is like asking for the length of a function.
if functions in Python are objects, couldn't I assign the length to a
variable of this object that would be accessible to the new generator?
Functions are objects, and you can in fact attach attributes to a function object; generator objects, however, accept no new attributes (they have no __dict__). The reason is probably to keep such a basic object as efficient as possible.
You can, however, simply return (generator, length) pairs from your functions, or wrap the generator in a simple object like this:
class GeneratorLen(object):
    def __init__(self, gen, length):
        self.gen = gen
        self.length = length

    def __len__(self):
        return self.length

    def __iter__(self):
        return self.gen

g = some_generator()
h = GeneratorLen(g, 1)
print(len(h), list(h))
Length of generator output
There isn't one because you can't do it in the general case - what if you have a lazy infinite generator? For example:
def fib():
    a, b = 0, 1
    while True:
        a, b = b, a + b
        yield a

This never terminates, but it will generate the Fibonacci numbers. You can get as many Fibonacci numbers as you want by calling next().
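For instance, a finite prefix of such an infinite generator can be taken with itertools.islice; a short sketch:

```python
from itertools import islice

def fib():
    a, b = 0, 1
    while True:
        a, b = b, a + b
        yield a

# Take just the first 10 Fibonacci numbers from the infinite stream.
print(list(islice(fib(), 10)))  # [1, 1, 2, 3, 5, 8, 13, 21, 34, 55]
```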
If you really need to know the number of items there are, then you can't iterate through them linearly one time anyway, so just use a different data structure such as a regular list.
What's the shortest way to count the number of items in a generator/iterator?
Calls to itertools.imap() in Python 2 or map() in Python 3 can be replaced by an equivalent generator expression:
sum(1 for dummy in it)
This also uses a lazy generator, so it avoids materializing a full list of all iterator elements in memory.
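A minimal sketch of that idiom wrapped in a helper (count_items is a name chosen here for illustration):

```python
def count_items(iterable):
    # Consumes the iterable lazily; only one element is alive at a time,
    # unlike len(list(iterable)), which materializes everything first.
    return sum(1 for _ in iterable)

print(count_items(x * x for x in range(100)))  # 100
```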
Iterables / generators with specified length
It sounds like you're asking about something like __length_hint__. Excerpts from PEP 424 – A method for exposing a length hint:

CPython currently defines a __length_hint__ method on several types, such as various iterators. This method is then used by various other functions (such as list) to presize lists based on the estimate returned by __length_hint__. Types which are not sized, and thus should not define __len__, can then define __length_hint__, to allow estimating or computing a size (such as many iterators).

Being able to pre-allocate lists based on the expected size, as estimated by __length_hint__, can be a significant optimization. CPython has been observed to run some code faster than PyPy, purely because of this optimization being present.
For example, range iterators support this:
it = iter(range(1000))
print(it.__length_hint__()) # prints 1000
next(it)
print(it.__length_hint__()) # prints 999
And list iterators even take list length changes into account:
a = [None] * 10
it = iter(a)
print(it.__length_hint__()) # prints 10
next(it)
print(it.__length_hint__()) # prints 9
a.pop()
print(it.__length_hint__()) # prints 8
a.append(None)
print(it.__length_hint__()) # prints 9
Generator iterators don't support it, but you can support it in other iterators you write. Here's a demo iterator that...
- Produces 10,000 elements.
- Hints at having 5,000 elements.
- After every 1,000 elements it shows the memory size of the list being built.
import gc

beacon = object()

class MyIterator:
    def __init__(self):
        self.n = 10_000

    def __iter__(self):
        return self

    def __length_hint__(self):
        print('__length_hint__ called')
        return 5_000

    def __next__(self):
        if self.n == 0:
            raise StopIteration
        self.n -= 1
        if self.n % 1_000 == 0:
            # Find the list being built by list(MyIterator()) via its
            # first element (the beacon) and report its current size.
            for obj in gc.get_objects():
                if isinstance(obj, list) and obj and obj[0] is beacon:
                    print(obj.__sizeof__())
        return beacon

list(MyIterator())
Output:
__length_hint__ called
45088
45088
45088
45088
45088
50776
57168
64360
72456
81560
We see that list asks for a length hint and from the start pre-allocates enough memory for 5,000 references of 8 bytes each, plus 12.5% overallocation. After the first 5,000 elements, it doesn't ask for length hints anymore, and keeps increasing its size bit by bit.

If my __length_hint__ instead accurately returns 10,000, then list instead pre-allocates 90088 bytes, and that remains until the end.
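Since Python 3.4, user code can query these hints through operator.length_hint(), which first tries __len__, then __length_hint__, and finally falls back to a default:

```python
from operator import length_hint

it = iter(range(1000))
print(length_hint(it))   # 1000
next(it)
print(length_hint(it))   # 999

def gen():
    yield 1

# Generator iterators provide no hint, so the default is returned.
print(length_hint(gen(), -1))  # -1
```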
Resetting generator object in Python
Another option is to use the itertools.tee() function to create a second version of your generator:
import itertools

y = FunctionWithYield()
y, y_backup = itertools.tee(y)

for x in y:
    print(x)
for x in y_backup:
    print(x)
This could be beneficial from a memory-usage point of view if the original iteration might not process all the items.
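A self-contained sketch, with a stand-in generator in place of the question's FunctionWithYield:

```python
import itertools

def function_with_yield():  # stand-in for the original FunctionWithYield
    for i in range(3):
        yield i

y, y_backup = itertools.tee(function_with_yield())
print(list(y))         # [0, 1, 2]
print(list(y_backup))  # [0, 1, 2] -- the copy replays the same values
```

Note that tee must buffer every item one copy has consumed until the other catches up, so draining one copy completely before touching the other costs about as much memory as building a list.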
How to pick just one item from a generator?
Create a generator using

g = myfunct()

Every time you want an item, use

next(g)

(or g.next() in Python 2.5 or below).

If the generator is exhausted, it will raise StopIteration. You can either catch this exception if necessary, or use the default argument to next():

next(g, default_value)
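Both behaviours in a short sketch (one_shot is a throwaway generator used here for illustration):

```python
def one_shot():
    yield 42

g = one_shot()
print(next(g))               # 42
print(next(g, "exhausted"))  # the default, instead of StopIteration
try:
    next(g)                  # no default given, so the exception escapes
except StopIteration:
    print("caught StopIteration")
```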
Understanding len function with iterators
The biggest reason is that it reduces type safety.
How many programs have you written where you actually needed to consume an iterable just to know how many elements it had, throwing everything else away?
I, in quite a few years of coding in Python, have never needed that. It's a nonsensical operation in normal programs. An iterator may not have a length (e.g. infinite iterators, or generators that expect input via send()), so asking for it doesn't make much sense. The fact that len(an_iterator) produces an error means that you can find bugs in your code. You can see that in a certain part of the program you are calling len on the wrong thing, or maybe your function actually needs a sequence instead of an iterator, as you expected.
Removing such errors would create a new class of bugs where people, calling len, erroneously consume an iterator, or use an iterator as if it were a sequence without realizing it.
If you really need to know the length of an iterator, what's wrong with len(list(iterator))? The extra six characters? It's trivial to write your own version that works for iterators but, as I said, 99% of the time this simply means that something in your code is wrong, because such an operation doesn't make much sense.
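Such a version is indeed trivial to write; a minimal sketch (iter_len is a hypothetical name), noting that it destroys the iterator it counts:

```python
def iter_len(iterable):
    # Counts by consuming -- the iterator is exhausted afterwards.
    return sum(1 for _ in iterable)

it = iter([1, 2, 3])
print(iter_len(it))  # 3
print(iter_len(it))  # 0 -- already consumed
```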
The second reason is that, with that change, you would be violating two nice properties of len that currently hold for all (known) containers:

- It is known to be cheap on all containers ever implemented in Python (all built-ins, the standard library, numpy & scipy, and all other big third-party libraries do this, on both dynamically and statically sized containers). So when you see len(something), you know that the len call is cheap. Making it work with iterators would mean that suddenly all programs might become inefficient due to computations of the length. Also note that you can, trivially, implement an O(1) __len__ on every container. The cost of pre-computing the length is often negligible, and generally worth paying. The only exception would be if you implement immutable containers that share part of their internal representation with other instances (to save memory). However, I don't know of any implementation that does this, and most of the time you can achieve better than O(n) time anyway. In summary: currently everybody implements __len__ in O(1), and it's easy to continue to do so, so there is an expectation for calls to len to be O(1), even if it's not part of the standard. Python developers intentionally avoid C/C++-style legalese in their documentation and trust the users. In this case, if your __len__ isn't O(1), you are expected to document that.
- It is known to be non-destructive. Any sensible implementation of __len__ doesn't change its argument, so you can be sure that len(x) == len(x), or that n = len(x); len(list(x)) == n. Even though this property is not defined in the documentation, it's expected by everyone, and currently nobody violates it.

Such properties are good, because you can reason and make assumptions about code using them. They can help you ensure the correctness of a piece of code, or understand its asymptotic complexity. The change you propose would make it much harder to look at some code and understand whether it's correct, or what its complexity would be, because you'd have to keep the special cases in mind.
In summary, the change you are proposing has one really small pro, saving a few characters in very particular situations, and several big disadvantages, which would impact a huge portion of existing code.
One other minor reason: if len consumed iterators, I'm sure some people would start abusing it for its side effects (replacing the already ugly use of map or list comprehensions for side effects). Suddenly people could write code like:

len(print(something) for ... in ...)

to print text, which is just ugly. It doesn't read well. Stateful code should be relegated to statements, since they provide a visual cue of side effects.
size of generator object in python
The space a generator takes in memory is just bookkeeping info. It holds a reference to the frame object (the administration for the running Python code, such as its locals), a flag for whether or not it is currently running, and a reference to the code object. Nothing more:
>>> x=(i for i in range(1,11))
>>> dir(x)
['__class__', '__delattr__', '__doc__', '__format__', '__getattribute__', '__hash__', '__init__', '__iter__', '__name__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', 'close', 'gi_code', 'gi_frame', 'gi_running', 'next', 'send', 'throw']
>>> x.gi_frame
<frame object at 0x1053b4ad0>
>>> x.gi_running
0
>>> x.gi_code
<code object <genexpr> at 0x1051af5b0, file "<stdin>", line 1>
That's just 3 references, plus the usual Python object type info (think reference counting) and a weak-references list; so that's about 4 pointers, an integer and a struct, which on your system take 40 bytes (on my system, 64-bit OS X, it is 80 bytes). sys.getsizeof() reports the size of just that structure as implemented in C, and it doesn't recurse over pointers.
As such, that amount of memory won't change when you have run through the generator. The referenced frame may change in how much memory is used (if the generator expression references large objects towards one end or the other) but you won't see that with the result of sys.getsizeof()
on the generator object; look at the frame locals instead:
>>> next(x)
1
>>> x.gi_frame.f_locals
{'i': 1, '.0': <listiterator object at 0x105339dd0>}
The .0
object is the range()
iterator that the generator is using in the for
loop, i
is the for
loop target. The listiterator
is another iterable object that has a private reference to the list range()
produced as well as a position counter so it can yield the next element each time you ask it to.
You cannot query a generator for an element size; generators produce elements as needed, so you cannot know a priori how much they'll produce, nor how much they have produced after running. sys.getsizeof() certainly won't tell you; it is a tool to measure the memory footprint of a single object, and you'd have to recursively measure all referenced objects if you want to know the total footprint.
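A rough recursive sketch of such a measurement (total_size is a hypothetical helper; it only descends into a few common container types and ignores everything else):

```python
import sys

def total_size(obj, seen=None):
    """Rough recursive memory footprint; a sketch, not exhaustive."""
    if seen is None:
        seen = set()
    if id(obj) in seen:      # don't count shared objects twice
        return 0
    seen.add(id(obj))
    size = sys.getsizeof(obj)
    if isinstance(obj, dict):
        size += sum(total_size(k, seen) + total_size(v, seen)
                    for k, v in obj.items())
    elif isinstance(obj, (list, tuple, set, frozenset)):
        size += sum(total_size(item, seen) for item in obj)
    return size

print(total_size([1, 2, [3, 4]]))
```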
You can see that the generator has completed its run from the frame; it is cleared once it is done:
>>> x.gi_frame
<frame object at 0x1053b4ad0>
>>> list(x)
[2, 3, 4, 5, 6, 7, 8, 9, 10]
>>> x.gi_frame is None
True
So in the end, the memory used for the generator resides in structures in the frame (locals, and possibly globals, with each object in those namespaces possibly referencing other objects again), and when the generator is done the frame is cleared and the generator .gi_frame
pointer is altered to point to the None
singleton, leaving the frame to be cleared if the reference count has dropped to 0.
All this only applies to generators, not to iterables in general; generators are Python code and thus can be introspected this deeply.
Best way of getting both the yield'ed output and return'ed value of a generator in Python
without wrapping it inside another class?
Maybe with just a function instead?
Version 1:
def output_and_return(it):
    def with_result():
        yield (yield from it)
    *elements, result = with_result()
    return elements, result
Version 2:
def output_and_return(it):
    result = None
    def get_result():
        nonlocal result
        result = yield from it
    return list(get_result()), result
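For example, applying Version 1 to a generator whose return statement carries a result (gen is a toy example):

```python
def output_and_return(it):
    # Version 1 from above: the inner generator yields the wrapped
    # generator's return value as its final element.
    def with_result():
        yield (yield from it)
    *elements, result = with_result()
    return elements, result

def gen():
    yield 1
    yield 2
    return "done"

print(output_and_return(gen()))  # ([1, 2], 'done')
```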