Length of generator output
There isn't one, because you can't compute it in the general case: what if you have a lazy, infinite generator? For example:
def fib():
    a, b = 0, 1
    while True:
        a, b = b, a + b
        yield a
This never terminates, but it will generate the Fibonacci numbers. You can get as many Fibonacci numbers as you want by calling next() on the generator.
If you really need to know the number of items there are, then you can't iterate through them linearly one time anyway, so just use a different data structure such as a regular list.
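If you only need a bounded prefix of an infinite generator, itertools.islice lets you take the first n values without ever needing a length; a minimal sketch using the fib() generator from above:

```python
from itertools import islice

def fib():
    a, b = 0, 1
    while True:
        a, b = b, a + b
        yield a

# Take the first 8 Fibonacci numbers from the infinite generator.
first_eight = list(islice(fib(), 8))
print(first_eight)  # [1, 1, 2, 3, 5, 8, 13, 21]
```

islice stops pulling from the generator after 8 items, so the infinite loop inside fib() is never a problem.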
How to len(generator())
Generators have no length; they aren't collections, after all.
Generators are functions with an internal state (and fancy syntax). You can call them repeatedly to get a sequence of values, so you can use them in a loop. But they don't contain any elements, so asking for the length of a generator is like asking for the length of a function.
if functions in Python are objects, couldn't I assign the length to a
variable of this object that would be accessible to the new generator?
Functions are objects, and you can in fact assign new attributes to them; generator objects, however, do not accept new attributes. The reason is probably to keep such a basic object as efficient as possible.
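To see the difference, here is a quick sketch (the function and attribute names are arbitrary):

```python
def make_numbers():
    yield from range(3)

# A plain function object accepts new attributes:
make_numbers.length = 3
print(make_numbers.length)  # 3

# The generator object it returns does not:
gen = make_numbers()
try:
    gen.length = 3
except AttributeError as exc:
    print("cannot set attribute on a generator:", exc)
```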
You can, however, simply return (generator, length) pairs from your functions, or wrap the generator in a simple object like this:
class GeneratorLen(object):
    def __init__(self, gen, length):
        self.gen = gen
        self.length = length

    def __len__(self):
        return self.length

    def __iter__(self):
        return self.gen

g = some_generator()
h = GeneratorLen(g, 1)
print len(h), list(h)
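A runnable Python 3 version of the same idea, with a concrete generator whose length we happen to know up front (the generator and length here are just for illustration):

```python
class GeneratorLen:
    def __init__(self, gen, length):
        self.gen = gen
        self.length = length

    def __len__(self):
        return self.length

    def __iter__(self):
        return self.gen

# Wrap a generator of 5 squares together with its known length.
squares = GeneratorLen((i * i for i in range(5)), 5)
print(len(squares))   # 5
print(list(squares))  # [0, 1, 4, 9, 16]
```

Note that the stored length is whatever you claim it is; the wrapper does not (and cannot) verify it against the generator.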
What's the shortest way to count the number of items in a generator/iterator?
Calls to itertools.imap() in Python 2 or map() in Python 3 can be replaced by an equivalent generator expression:
sum(1 for dummy in it)
This also uses a lazy generator, so it avoids materializing a full list of all iterator elements in memory.
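For example (the iterator here is just for illustration):

```python
# An iterator over the non-space characters of a string.
it = (c for c in "hello world" if c != " ")

# Count the items without building a list of them.
count = sum(1 for dummy in it)
print(count)  # 10
```

Keep in mind that counting this way consumes the iterator: after the sum(), it is exhausted.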
size of generator object in python
The space a generator takes in memory is just bookkeeping info. It keeps a reference to the frame object (the administration for the running Python code, such as locals), a flag for whether or not it is currently running, and a reference to the code object. Nothing more:
>>> x=(i for i in range(1,11))
>>> dir(x)
['__class__', '__delattr__', '__doc__', '__format__', '__getattribute__', '__hash__', '__init__', '__iter__', '__name__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', 'close', 'gi_code', 'gi_frame', 'gi_running', 'next', 'send', 'throw']
>>> x.gi_frame
<frame object at 0x1053b4ad0>
>>> x.gi_running
0
>>> x.gi_code
<code object <genexpr> at 0x1051af5b0, file "<stdin>", line 1>
That's just 3 references, plus the usual Python object type info (think reference counting) and a weak-references list; so that's about 4 pointers, an integer and a struct, which on your system take 40 bytes (on my system, 64-bit OS X, it is 80 bytes). sys.getsizeof() reports the size of just that structure as implemented in C, and it doesn't recurse over pointers.
As such, that amount of memory won't change as you run through the generator. The referenced frame may change in how much memory it uses (if the generator expression references large objects towards one end or the other), but you won't see that in the result of sys.getsizeof() on the generator object; look at the frame locals instead:
>>> next(x)
1
>>> x.gi_frame.f_locals
{'i': 1, '.0': <listiterator object at 0x105339dd0>}
The .0 object is the range() iterator that the generator is using in the for loop; i is the for loop target. The listiterator is another iterable object that holds a private reference to the list range() produced, as well as a position counter, so it can yield the next element each time you ask it to.
You cannot query a generator for an element size; it produces elements as needed, and you cannot know a priori how many it will produce, nor how many it has produced after running. sys.getsizeof() certainly won't tell you; it only measures the direct memory footprint of one object, and you'd have to recursively measure all referenced objects if you want to know the total footprint.
You can see that the generator has completed its run from the frame; it is cleared once it is done:
>>> x.gi_frame
<frame object at 0x1053b4ad0>
>>> list(x)
[2, 3, 4, 5, 6, 7, 8, 9, 10]
>>> x.gi_frame is None
True
So in the end, the memory used for the generator resides in the structures in the frame (locals, and possibly globals, with each object in those namespaces possibly referencing other objects in turn); when the generator is done, the frame is cleared and the generator's .gi_frame pointer is set to the None singleton, leaving the frame to be freed once its reference count drops to 0.
All this only applies to generators, not to iterables in general; generators are Python code and thus can be introspected this deeply.
How to measure length of generator sequence (list comp vs generator expression)
On this computer, the generator expression becomes faster somewhere between 100,000 and 1,000,000
$ python -m timeit "sum(1 for x in xrange(100000))"
10 loops, best of 3: 34.8 msec per loop
$ python -m timeit "sum([1 for x in xrange(100000)])"
10 loops, best of 3: 20.8 msec per loop
$ python -m timeit "sum(1 for x in xrange(1000000))"
10 loops, best of 3: 315 msec per loop
$ python -m timeit "sum([1 for x in xrange(1000000)])"
10 loops, best of 3: 469 msec per loop
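The timings above use Python 2's xrange. A hedged Python 3 equivalent of the same comparison, using the timeit module directly (absolute numbers and the crossover point will vary by machine):

```python
import timeit

# Count via a generator expression (no intermediate list)...
gen_time = timeit.timeit("sum(1 for x in range(100000))", number=10)

# ...versus a list comprehension that materializes all the 1s first.
list_time = timeit.timeit("sum([1 for x in range(100000)])", number=10)

print(f"genexpr:  {gen_time:.3f}s")
print(f"listcomp: {list_time:.3f}s")
```

Both expressions compute the same count; the trade-off is the list comprehension's temporary memory for 100,000 elements versus the generator's per-item overhead.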
Iterables / generators with specified length
It sounds like you're asking about something like __length_hint__. Excerpts from PEP 424 – A method for exposing a length hint:
CPython currently defines a __length_hint__ method on several types, such as various iterators. This method is then used by various other functions (such as list) to presize lists based on the estimate returned by __length_hint__. Types which are not sized, and thus should not define __len__, can then define __length_hint__, to allow estimating or computing a size (such as many iterators).
Being able to pre-allocate lists based on the expected size, as estimated by __length_hint__, can be a significant optimization. CPython has been observed to run some code faster than PyPy, purely because of this optimization being present.
For example, range iterators support this:
it = iter(range(1000))
print(it.__length_hint__()) # prints 1000
next(it)
print(it.__length_hint__()) # prints 999
And list iterators even take list length changes into account:
a = [None] * 10
it = iter(a)
print(it.__length_hint__()) # prints 10
next(it)
print(it.__length_hint__()) # prints 9
a.pop()
print(it.__length_hint__()) # prints 8
a.append(None)
print(it.__length_hint__()) # prints 9
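The standard library also exposes this protocol through operator.length_hint, which falls back to __len__ when the object is sized and to a default when neither hook is defined; a short sketch:

```python
from operator import length_hint

# range iterators implement __length_hint__, so the hint tracks consumption.
it = iter(range(1000))
print(length_hint(it))       # 1000
next(it)
print(length_hint(it))       # 999

# Generator iterators provide no hint, so the default is returned instead.
gen = (x for x in range(1000))
print(length_hint(gen))      # 0
print(length_hint(gen, 42))  # 42
```

Using length_hint instead of calling __length_hint__ directly is the idiomatic route, since it handles the fallback chain for you.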
Generator iterators don't support it, but you can support it in other iterators you write. Here's a demo iterator that...
- Produces 10,000 elements.
- Hints at having 5,000 elements.
- After every 1,000 elements it shows the memory size of the list being built.
import gc

beacon = object()

class MyIterator:
    def __init__(self):
        self.n = 10_000

    def __iter__(self):
        return self

    def __length_hint__(self):
        print('__length_hint__ called')
        return 5_000

    def __next__(self):
        if self.n == 0:
            raise StopIteration
        self.n -= 1
        if self.n % 1_000 == 0:
            for obj in gc.get_objects():
                if isinstance(obj, list) and obj and obj[0] is beacon:
                    print(obj.__sizeof__())
        return beacon

list(MyIterator())
Output:
__length_hint__ called
45088
45088
45088
45088
45088
50776
57168
64360
72456
81560
We see that list asks for a length hint and from the start pre-allocates enough memory for 5,000 references of 8 bytes each, plus 12.5% over-allocation. After the first 5,000 elements, it doesn't ask for length hints anymore and keeps increasing its size bit by bit.
If my __length_hint__ instead accurately returns 10,000, then list instead pre-allocates 90088 bytes, and that remains until the end.