Why can't I iterate twice over the same data?
It's because data is an iterator, and you can consume an iterator only once. For example:
lst = [1, 2, 3]
it = iter(lst)
next(it)
# => 1
next(it)
# => 2
next(it)
# => 3
next(it)
# => StopIteration
If we traverse some data using a for loop, that final StopIteration is what makes the loop exit after the first pass (the for statement catches it silently). If we try to iterate over the same iterator again, we'll keep getting the StopIteration exception, because the iterator has already been consumed.
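The exhaustion is easy to demonstrate; a minimal sketch:

```python
it = iter([1, 2, 3])
first = list(it)   # consumes the iterator completely
second = list(it)  # nothing left: StopIteration is raised immediately
print(first)   # [1, 2, 3]
print(second)  # []
```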
Now for the second question: what if we do need to traverse the iterator more than once? A simple solution would be to save all the elements to a list, which can be traversed as many times as needed. For instance, if data is an iterator:
data = list(data)
That is alright as long as there are few elements in the list. However, if there are many elements, it's a better idea to create independent iterators using tee():
import itertools
it1, it2 = itertools.tee(data, 2) # create as many as needed
Now we can loop over each one in turn:
for e in it1:
    print("doing this one time")
for e in it2:
    print("doing this two times")
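One caveat worth noting: after tee() the original data iterator should not be used directly any more, since any items consumed from it will never be seen by the copies. A minimal sketch of the two independent traversals:

```python
import itertools

data = iter(range(5))
it1, it2 = itertools.tee(data, 2)
# it1 and it2 advance independently of each other
print(list(it1))  # [0, 1, 2, 3, 4]
print(list(it2))  # [0, 1, 2, 3, 4]
```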
Ensure that an argument can be iterated twice
I could insert a line at the beginning of the function: x = list(x). But this might be inefficient in case x is already a list, a tuple, a range, or any other iterable that can be iterated more than once. Is there a more efficient solution?
Copying single-use iterables to a list is perfectly adequate, and reasonably efficient even for multi-use iterables. The list (and to some extent tuple) type is one of the most optimised data structures in Python. Common operations such as copying a list or tuple to a list are internally optimised;1 even for iterables that are not special-cased, copying them to a list is significantly faster than any realistic work done by two (or more) loops.
def print_twice(x):
    x = list(x)
    for i in x: print(i)
    for i in x: print(i)
Copying indiscriminately can also be advantageous in the context of concurrency, when the iterable may be modified while the function is running. Common cases are threading and weakref collections.
In case one wants to avoid needless copies, checking whether the iterable is a Collection is a reasonable guard.
from collections.abc import Collection
x = list(x) if not isinstance(x, Collection) else x
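As a sketch of how this guard behaves in practice (ensure_reiterable is a hypothetical helper name, wrapping the one-liner above):

```python
from collections.abc import Collection

def ensure_reiterable(x):
    # hypothetical helper: copy only if x is not a container
    return list(x) if not isinstance(x, Collection) else x

lst = [1, 2, 3]
print(ensure_reiterable(lst) is lst)            # True: no copy for a list
gen = (n for n in lst)
print(ensure_reiterable(gen))                   # [1, 2, 3]: generator copied
```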
Alternatively, one can check whether the iterable is in fact an iterator, since this implies statefulness and thus single-use.
from collections.abc import Iterator
x = list(x) if isinstance(x, Iterator) else x
x = list(x) if iter(x) is x else x
Notably, the builtins zip, filter, map, ... and generators are all iterators.
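A minimal sketch confirming both checks, using nothing beyond stdlib behaviour:

```python
from collections.abc import Iterator

lst = [1, 2, 3]
print(isinstance(lst, Iterator))            # False: lists hand out fresh iterators
print(isinstance(iter(lst), Iterator))      # True
print(isinstance(map(str, lst), Iterator))  # True: map objects are single-use
print(iter(lst) is lst)                     # False
gen = (n for n in lst)
print(iter(gen) is gen)                     # True: a generator is its own iterator
```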
1 Copying a list of 128 items is roughly as fast as checking whether it is a Collection.
Correct way to iterate twice over a list?
rlist = list(reversed(slist))
Then iterate as often as you want. This trick applies more generally; whenever you need to iterate over an iterator multiple times, turn it into a list. Here's a code snippet that I keep copy-pasting into different projects for exactly this purpose:
from collections.abc import Sequence

def tosequence(it):
    """Turn iterable into a sequence, avoiding a copy if possible."""
    if not isinstance(it, Sequence):
        it = list(it)
    return it
(Sequence is the abstract type of lists, tuples, and many custom list-like objects.)
Why doesn't this for loop print the same integer during each iteration?
Your code looks odd because you store the current loop value into idx[0] each time around - you can see that by printing it:
idx = [0, 0, 0]
junk = [1, 2, 3, 4, 5, 6]
for idx[0] in junk:
    print(idx[0], idx)
to get
1 [1, 0, 0]
2 [2, 0, 0]
3 [3, 0, 0]
4 [4, 0, 0]
5 [5, 0, 0]
6 [6, 0, 0]
So you are actively changing the value of idx[0] on each iteration - how many iterations are done is governed by junk's values.
You can create a "creative/inefficient" list copy this way:
junk = [1, 2, 3, 4, 5, 6]
idx = [0] * len(junk)
for i, idx[i] in enumerate(junk):
    pass
although I have no clue what one would need that for :D
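For completeness, a minimal sketch showing what the trick actually produces:

```python
junk = [1, 2, 3, 4, 5, 6]
idx = [0] * len(junk)
for i, idx[i] in enumerate(junk):
    pass  # each iteration assigns junk[i] into slot idx[i]
print(idx)  # [1, 2, 3, 4, 5, 6] - a distinct copy of junk
```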
Why are some (but not all) Python iterators summable after being exhausted?
The problem is that the custom iterator is initialised inside the __iter__ method. Even though i2 = iter(CustomIterator()) includes an explicit call to iter, the sum function (and also min, max, for, etc.) will still call i2.__iter__() again and reset i2.
There's a bunch of tutorials out there on "how to make Python iterators", and about half of them say something like "to make an iterator, you just have to define __iter__ and __next__ methods". While this is technically correct as per the documentation, it will get you into trouble sometimes. In many cases you'll also want a separate __init__ method to initialise the iterator's state.
So to fix this problem, redefine CustomIterator as:
class CustomIterator:
    def __init__(self):
        self.n = 0

    def __iter__(self):
        return self

    def __next__(self):
        self.n += 1
        if self.n > 3:
            raise StopIteration
        return self.n
i1 = iter([1,2,3])
i2 = CustomIterator() ### iter(...) is not needed here (but won't do any harm either)
Then __init__ is called once and only once, when a new iterator is created, and repeated calls to iter won't reset the iterator.
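With that fix, repeated sum calls behave like those on a built-in list iterator; a minimal sketch (restating the fixed class so it runs standalone):

```python
class CustomIterator:
    def __init__(self):
        self.n = 0

    def __iter__(self):
        return self  # does NOT reset state, so iter() calls are harmless

    def __next__(self):
        self.n += 1
        if self.n > 3:
            raise StopIteration
        return self.n

i2 = CustomIterator()
print(sum(i2))  # 6 - first pass consumes 1, 2, 3
print(sum(i2))  # 0 - exhausted, just like a built-in list iterator
```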
How to use the same iterator twice, once for counting and once for iteration?
Calling count consumes the iterator, because it actually iterates until it is done (i.e. next() returns None).
You can prevent consuming the iterator by using by_ref, but the iterator is still driven to completion (by_ref actually just returns the mutable reference to the iterator, and Iterator is also implemented for mutable references: impl<'a, I> Iterator for &'a mut I).
This still can be useful if the iterator contains other state you want to reuse after it is done, but not in this case.
You could simply try forking the iterator (iterators often implement Clone if they don't have side effects), although in this case recreating it is just as good (most of the time creating an iterator is cheap; the real work is usually only done when you drive it by calling next directly or indirectly).
So no, (in this case) you can't reset it, and yes, you need to create a new one (or clone it before using it).
Iterate twice on values (MapReduce)
You have to cache the values from the iterator if you want to iterate again. At least you can combine the first iteration with the caching:
Iterator<IntWritable> it = getIterator();
List<IntWritable> cache = new ArrayList<IntWritable>();

// first loop and caching
while (it.hasNext()) {
    IntWritable value = it.next();
    doSomethingWithValue(value);
    cache.add(value);
}

// second loop
for (IntWritable value : cache) {
    doSomethingElseThatCantBeDoneInFirstLoop(value);
}
(just to add an answer with code, knowing that you mentioned this solution in your own comment ;) )
Why it's impossible without caching: an Iterator is simply something that implements an interface, and there is no requirement that the Iterator object actually stores its values. To iterate twice you either have to reset the iterator (not possible) or clone it (again: not possible).
To give an example for an iterator where cloning/resetting wouldn't make any sense:
public class Randoms implements Iterator<Double> {
    private int counter = 10;

    @Override
    public boolean hasNext() {
        return counter > 0;
    }

    @Override
    public Double next() {
        counter--;
        return Math.random();
    }

    @Override
    public void remove() {
        throw new UnsupportedOperationException("delete not supported");
    }
}