Why Can't I Iterate Twice Over the Same Data

Why can't I iterate twice over the same data?

It's because data is an iterator, and you can consume an iterator only once. For example:

lst = [1, 2, 3]
it = iter(lst)

next(it)
# => 1
next(it)
# => 2
next(it)
# => 3
next(it)
# => StopIteration

If we are traversing some data using a for loop, that last StopIteration is what makes the loop exit the first time. If we try to iterate over it again, next() raises StopIteration immediately, so the second loop ends without running its body even once, because the iterator has already been consumed.
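To see the same thing with actual loops, here is a minimal sketch (the names first_pass and second_pass are made up for this illustration). The for machinery swallows the StopIteration, so the second pass is simply empty rather than an error:

```python
data = iter([1, 2, 3])

first_pass = [x for x in data]   # consumes the iterator
second_pass = [x for x in data]  # the iterator is already exhausted

print(first_pass)   # [1, 2, 3]
print(second_pass)  # []
```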

Now for the second question: What if we do need to traverse the iterator more than once? A simple solution would be to save all the elements to a list, which can be traversed as many times as needed. For instance, if data is an iterator:

data = list(data)

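That is alright as long as the elements fit comfortably in memory. Another option is to create independent iterators using tee(). Note, though, that tee() buffers every element one iterator has produced until the others have consumed it, so it only saves memory when the iterators advance roughly in step; if one loop finishes before the next starts (as in the example below), the itertools documentation says list() is the faster choice: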

import itertools
it1, it2 = itertools.tee(data, 2) # create as many as needed

Now we can loop over each one in turn:

for e in it1:
    print("doing this one time")

for e in it2:
    print("doing this two times")
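tee() pays off when the iterators are consumed in step rather than one after the other. A minimal sketch, assuming a made-up single-use generator called readings:

```python
import itertools

def readings():
    # hypothetical single-use data source
    yield from (3, 5, 9, 10)

current, following = itertools.tee(readings(), 2)
next(following, None)  # advance the second iterator by one element

# Both iterators now move in step, so tee() only buffers one element at a time.
differences = [b - a for a, b in zip(current, following)]
print(differences)  # [2, 4, 1]
```

Once an iterator has been passed to tee(), it should not be used directly any more, since advancing it would make the tee'd iterators skip elements.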

Ensure that an argument can be iterated twice

I could insert a line at the beginning of the function: x = list(x). But this might be inefficient in case x is already a list, a tuple, a range, or any other iterable that can be iterated more than once. Is there a more efficient solution?

Copying single-use iterables to a list is perfectly adequate, and reasonably efficient even for multi-use iterables.

The list (and to some extent tuple) type is one of the most optimised data structures in Python. Common operations such as copying a list or tuple to a list are internally optimised;¹ even for iterables that are not special-cased, copying them to a list is significantly faster than any realistic work done by two (or more) loops.

def print_twice(x):
    x = list(x)
    for i in x: print(i)
    for i in x: print(i)

Copying indiscriminately can also be advantageous in the context of concurrency, when the iterable may be modified while the function is running. Common cases are threading and weakref collections.

In case one wants to avoid needless copies, checking whether the iterable is a Collection is a reasonable guard.

from collections.abc import Collection

x = list(x) if not isinstance(x, Collection) else x

Alternatively, one can check whether the iterable is in fact an iterator, since this implies statefulness and thus single-use.

from collections.abc import Iterator

x = list(x) if isinstance(x, Iterator) else x
x = list(x) if iter(x) is x else x

Notably, the builtins zip, filter, map, ... and generators all are iterators.
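As a rough sketch, the iterator check can be wrapped in a small helper. The name ensure_reiterable is hypothetical, not something from the standard library:

```python
from collections.abc import Iterator

def ensure_reiterable(x):
    """Return x unchanged if it is not an iterator, else a list copy of it."""
    return list(x) if isinstance(x, Iterator) else x

squares = map(lambda n: n * n, [1, 2, 3])  # map objects are iterators
points = (1, 2, 3)                          # tuples are not

safe_squares = ensure_reiterable(squares)
safe_points = ensure_reiterable(points)

print(safe_squares)                    # [1, 4, 9]
print(safe_points is points)           # True: no copy was made
print(isinstance(range(3), Iterator))  # False: a range can be iterated repeatedly
```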

¹Copying a list of 128 items is roughly as fast as checking whether it is a Collection.

Correct way to iterate twice over a list?

rlist = list(reversed(slist))

Then iterate as often as you want. This trick applies more generally; whenever you need to iterate over an iterator multiple times, turn it into a list. Here's a code snippet that I keep copy-pasting into different projects for exactly this purpose:

from collections.abc import Sequence

def tosequence(it):
    """Turn iterable into a sequence, avoiding a copy if possible."""
    if not isinstance(it, Sequence):
        it = list(it)
    return it

(Sequence is the abstract type of lists, tuples and many custom list-like objects.)
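Keep in mind that Sequence is narrower than "can be iterated more than once": sets and dicts can be iterated repeatedly but are not sequences, so a Sequence check copies them anyway (which is what you want if you also need indexing). A quick check, using a made-up samples dictionary:

```python
from collections.abc import Collection, Sequence

samples = {
    "list": [1, 2],
    "tuple": (1, 2),
    "range": range(2),
    "set": {1, 2},
    "dict": {1: "a"},
    "generator": (n for n in [1, 2]),
}

is_sequence = {name: isinstance(obj, Sequence) for name, obj in samples.items()}
is_collection = {name: isinstance(obj, Collection) for name, obj in samples.items()}

for name in samples:
    print(name, is_sequence[name], is_collection[name])
```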

Why doesn't this for loop print the same integer during each iteration?

Your code looks weird. Using idx[0] as the loop target stores each item of junk into idx[0]; you can see that by printing it:

idx = [0,0,0]

junk = [1, 2, 3, 4, 5,6]

for idx[0] in junk:
    print(idx[0], idx)

to get

1 [1, 0, 0]
2 [2, 0, 0]
3 [3, 0, 0]
4 [4, 0, 0]
5 [5, 0, 0]
6 [6, 0, 0]

So you are actively changing the value of idx[0] on each iteration; the number of iterations is determined by the items in junk.

You can even abuse this to make a "creative" (and inefficient) copy of a list:

junk = [1, 2, 3, 4, 5,6]
idx = [0] * len(junk)
for i,idx[i] in enumerate(junk): 
    pass

although I have no clue what one would need that for :D
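The underlying rule is that the for statement assigns each item to its target just as an ordinary assignment would, so any assignable expression works, not only a plain name. A small sketch with a dictionary key as the target (state and seen are names made up for this example):

```python
state = {"current": None}
seen = []

for state["current"] in "abc":
    # each iteration effectively performs: state["current"] = <next item>
    seen.append(dict(state))

print(seen)   # [{'current': 'a'}, {'current': 'b'}, {'current': 'c'}]
print(state)  # {'current': 'c'}
```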

Why are some (but not all) Python iterators summable after being exhausted?

The problem is that the custom iterator is initialising inside the __iter__ method. Even though i2 = iter(CustomIterator()) includes an explicit call to iter, the sum function (and also min, max, for, etc) will still call i2.__iter__() again and reset i2.

There's a bunch of tutorials out there on "how to make Python iterators", and about half of them say something like "to make an iterator, you just have to define __iter__ and __next__ methods". While this is technically correct as per the documentation, it will get you into trouble sometimes. In many cases you'll also want a separate __init__ method to initialise the iterator.

So to fix this problem, redefine CustomIterator as:

class CustomIterator:
  def __init__(self):
    self.n=0

  def __iter__(self):
    return self

  def __next__(self):
    self.n += 1
    if self.n > 3:
      raise StopIteration
    return self.n

i1 = iter([1,2,3])
i2 = CustomIterator() ### iter(...) is not needed here (but won't do any harm either)

Then __init__ is called once and once only on creating a new iterator, and repeated calls to iter won't reset the iterator.
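To make the difference concrete, here is a side-by-side sketch. The class from the question is not reproduced here, so ResettingIterator is an assumed reconstruction of the problematic pattern (state initialised inside __iter__), not the asker's actual code:

```python
class ResettingIterator:
    """Assumed broken version: state is initialised inside __iter__."""

    def __iter__(self):
        self.n = 0
        return self

    def __next__(self):
        self.n += 1
        if self.n > 3:
            raise StopIteration
        return self.n


class CustomIterator:
    """Fixed version: state is initialised once, in __init__."""

    def __init__(self):
        self.n = 0

    def __iter__(self):
        return self

    def __next__(self):
        self.n += 1
        if self.n > 3:
            raise StopIteration
        return self.n


broken = iter(ResettingIterator())
fixed = CustomIterator()

list(broken)  # exhaust both iterators
list(fixed)

broken_sum = sum(broken)  # sum() calls __iter__ again, which resets n
fixed_sum = sum(fixed)    # the iterator stays exhausted

print(broken_sum)  # 6
print(fixed_sum)   # 0
```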

How to use the same iterator twice, once for counting and once for iteration?

Calling count consumes the iterator, because it actually iterates until it is done (i.e. next() returns None).

You can prevent consuming the iterator by using by_ref, but the iterator is still driven to its completion (by_ref actually just returns the mutable reference to the iterator, and Iterator is also implemented for the mutable reference: impl<'a, I> Iterator for &'a mut I).

This still can be useful if the iterator contains other state you want to reuse after it is done, but not in this case.

You could simply try forking the iterator (they often implement Clone if they don't have side effects), although in this case recreating it is just as good (most of the time creating an iterator is cheap; the real work is usually only done when you drive it by calling next directly or indirectly).

So no, (in this case) you can't reset it, and yes, you need to create a new one (or clone it before using it).

Iterate twice on values (MapReduce)

You have to cache the values from the iterator if you want to iterate again. At least you can combine the first iteration and the caching:

Iterator<IntWritable> it = getIterator();
List<IntWritable> cache = new ArrayList<IntWritable>();

// first loop and caching
while (it.hasNext()) {
   IntWritable value = it.next();
   doSomethingWithValue(value);
   cache.add(value);
}

// second loop
for(IntWritable value:cache) {
   doSomethingElseThatCantBeDoneInFirstLoop(value);
}

(just to add an answer with code, knowing that you mentioned this solution in your own comment ;) )

Why it's impossible without caching: an Iterator is just an object that implements an interface, and nothing requires it to actually store its values. To iterate twice you would have to either reset the iterator (not possible) or clone it (again: not possible).

To give an example for an iterator where cloning/resetting wouldn't make any sense:

import java.util.Iterator;

public class Randoms implements Iterator<Double> {

  private int counter = 10;

  @Override
  public boolean hasNext() {
     return counter > 0;
  }

  @Override
  public Double next() {
     counter--;
     return Math.random();
  }

  @Override
  public void remove() {
     throw new UnsupportedOperationException("delete not supported");
  }
}
