Iterables, iterators and generators, oh my! Part 1

Iterators and generators are among my favorite programming tools—they're also some of the most powerful. These constructs enable us to write cleaner, more flexible, and higher-performance code, making them an invaluable addition to any programmer's toolbox. They're also an elegant means of working with large and potentially infinite data structures, which comes in handy for data science. However, they can be some of the more perplexing concepts to grasp at first.

In this article, I'd like to deliver a gentle but in-depth introduction to iterators and generators in Python, although they're prevalent in other languages too. But to appreciate generators, we first need a good handle on iterators. And to understand iterators, we need to start with iterables.

Table of contents

  1. Unraveling the secrets of a for loop
  2. Putting theory into practice
  3. Constructing custom iterators
  4. Infinite iterators
  5. Why iterators are so powerful

1. Unraveling the secrets of a for loop

What are iterables?

Most programmers have worked with an iterable before even if they weren't consciously aware of doing so. An iterable is an object whose contents can be traversed or looped over. For example, when implementing a for loop (for item in obj: ...), any object that can take the place of obj is an iterable. Many Python containers are examples of iterables, including list, set, frozenset, tuple, str and dict. Even NumPy arrays, Pandas DataFrames, and open file objects are iterables since we can loop over them. All iterables share a common thread: they implement the __iter__() method, which we'll get to in the next section.
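
For instance, here's a quick sketch (not part of the original walkthrough) confirming that a few familiar containers really do sport an __iter__() method:

In [ ]:
# Each of these common containers implements __iter__(), which is what makes
# it an iterable and lets it take the place of obj in a for loop
for obj in [[1, 2, 3], (1, 2, 3), {1, 2, 3}, {'a': 1, 'b': 2}, 'abc']:
    print(type(obj).__name__, hasattr(obj, '__iter__'))
# Prints each type name followed by True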

What are iterators?

An iterable provides the for loop with its contents, but on its own it isn't very useful; an iterable can't tell the loop how to traverse or iterate over its contents, or what the next item is. Therefore, when a for loop starts, it cleverly calls iter(obj) behind the scenes, which in turn calls obj.__iter__(). If obj doesn't support iteration or sport an __iter__() method (meaning it isn't an iterable), a TypeError is raised. Otherwise, obj.__iter__() hands the loop an iterator: an object that knows how to perform the iteration and determine what the next item is. Logically, this means an iterable is any object that, when passed to iter(), is capable of producing an iterator.
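
To make this concrete, here's a minimal sketch of what happens up front: iter() hands back an iterator for an iterable and raises a TypeError for anything else (the int below is just an arbitrary non-iterable).

In [ ]:
# iter() on an iterable produces an iterator object
print(iter([4, 8, 15]))    # e.g. <list_iterator object at 0x...>

# iter() on a non-iterable raises a TypeError
try:
    iter(42)
except TypeError as err:
    print(err)             # 'int' object is not iterable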

How does the for loop actually employ the iterator to perform the iteration and identify the next item? The answer is after retrieving an iterator, the loop secretly calls next(iterator), which in turn calls iterator.__next__()—don't worry, every iterator implements this method by definition. __next__() holds instructions that determine the subsequent item, which is then handed to the loop. During each cycle, the loop secretly calls next(iterator) again; the iterator "remembers" its state and rather than yielding the first item, it resumes where it left off.

Depending on the iterable, its iterator may use a different strategy for traversing the contents and identifying the next item. Here are the instructions for some common iterables (a quick demo follows the list):

  • list or tuple - loop over each element
  • dict - loop over each key (insertion order in Python 3.7+, arbitrary order before that)
  • set or frozenset - loop over each element (arbitrary order)
  • str - loop over each "character"
  • open file object - loop over each line
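
To see a couple of these behaviors first-hand, here's a small demo iterating over a dict and a str:

In [ ]:
# A dict yields its keys, one per cycle
for key in {'a': 1, 'b': 2}:
    print(key, end=' ')      # a b

# A str yields its characters, one per cycle
for char in 'hi!':
    print(char, end=' ')     # h i !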

In a nutshell, the for loop first passes an iterable to iter() to produce an iterator and then in each cycle, implicitly calls next(iterator) to determine the subsequent item—just one item per cycle. Ha, we now know the inner workings of a for loop!

2. Putting theory into practice

The best way to make sense of iterables and iterators is to experiment with them. Let's put what we've learned to the test and start by creating an iterable.

In [1]:
myIterable = [4, 8, 15, 16, 23, 42]

Now that we have an iterable, we need instructions for traversing its contents, so let's produce an iterator from myIterable and confirm its type.

In [2]:
myIterator = iter(myIterable)

type(myIterator)
Out[2]:
list_iterator

We could call next(myIterator) and explore what happens, but before jumping in, what would happen if we used an iterator in a for loop instead of an iterable? Would the loop raise a TypeError when it calls iter(myIterator)? Let's find out.

In [3]:
for item in myIterator:
    print(item, end=' ')
4 8 15 16 23 42 

Surprise! It looks like everything went smoothly. The reason is that myIterator, or any iterator produced via iter(), implements its own __iter__() method to satisfy the for loop. In fact, calling myIterator.__iter__() returns self. Simply put, if we produce an iterator from an iterable and then pass this iterator to iter() (thus calling its __iter__() method), we get back the same iterator. We can quickly verify this.

In [4]:
iter(myIterator) == myIterator
Out[4]:
True

We've already mentioned that an iterable is any object that, when passed to iter(), is capable of producing an iterator. According to this definition, myIterator must also be an iterable, which explains why it plays nicely with the for loop.

Traversing contents manually

Now let's continue where we left off and verify that calling next(myIterator) yields the subsequent item one at a time.

In [5]:
# Yield the subsequent item
next(myIterator)
---------------------------------------------------------------------------
StopIteration                             Traceback (most recent call last)
<ipython-input-5-4b59040f9a4a> in <module>()
      1 # Yield the subsequent item
----> 2 next(myIterator)

StopIteration: 

Oh no, what happened? Well, the for loop from earlier had implicitly called next(myIterator) over and over until the iterator yielded its last item—the loop exhausted the iterator. When we attempted to manually call next(myIterator) afterwards, a StopIteration exception was raised to tell us myIterator has no more items left to yield. Unbeknownst to us, the loop also ran into the same StopIteration exception after exhausting the iterator; for loops gracefully handle and conceal this exception, and know when to stop calling next().

Once an iterator is exhausted, it's practically useless. However, we can "restock" the contents by recreating the iterator, enabling it to yield items again, starting with the first one.

In [6]:
myIterator = iter(myIterable)

next(myIterator)
Out[6]:
4
In [7]:
next(myIterator)
Out[7]:
8
In [8]:
next(myIterator)
Out[8]:
15

Wonderful, myIterator has been "restocked" and next() does its job. Let's keep going and confirm a StopIteration exception is raised if the iterator is exhausted again.

In [9]:
next(myIterator)
Out[9]:
16
In [10]:
next(myIterator)
Out[10]:
23
In [11]:
next(myIterator)
Out[11]:
42
In [12]:
next(myIterator)
---------------------------------------------------------------------------
StopIteration                             Traceback (most recent call last)
<ipython-input-12-dc04b44d6581> in <module>()
----> 1 next(myIterator)

StopIteration: 

As expected, after again exhausting the iterator, we run into another StopIteration exception when attempting to call next(myIterator).

Revealing the true identity of a for loop

Gathering everything we've learned so far about how iterables and iterators work, we can construct a while loop that emulates for item in obj: expression.

In [ ]:
obj_iter = iter(obj)            # produce an iterator from the iterable
while True:
    try:
        item = next(obj_iter)   # yield the subsequent item
    except StopIteration:       # the iterator is exhausted
        break
    expression                  # the loop body ("expression") runs here with item

We've revealed that the for loop in Python is actually just a special case of a while loop! We've also shown that an iterator yields the subsequent item only when passed to next(), either manually or implicitly within a for loop; otherwise, the iterator just...sits there. Pretty lazy, no? That's actually the correct terminology to describe this behavior: lazy evaluation. We can think of an iterator as an idle conveyor belt in a factory that churns out a single product each time it's switched on and then automatically shuts off. Fortunately, the conveyor belt remembers where it left off each time and never runs backward.

3. Constructing custom iterators

So far we know only one way to produce iterators: by passing an iterable to iter(). However, we've determined that the building blocks of an iterator are the __iter__() and __next__() methods; we can use this information to construct a class that produces our very own iterators! But before getting fancy, let's build one that produces iterators similar to the one we're already familiar with: myIterator.

In [13]:
class TheNumbersMaker():    
    def __init__(self):
        self.contents = [4, 8, 15, 16, 23, 42]
        self.curr_index = 0

    def __iter__(self):
        return self

    def __next__(self):
        if self.curr_index < len(self.contents):
            curr_index = self.curr_index
            self.curr_index += 1
            return self.contents[curr_index]
        else:
            raise StopIteration()

This class implements what's known as the iterator protocol—the __iter__() and __next__() methods are everything required to produce a custom iterator. Let's go through what each of its methods is doing.

  • __init__(): Instantiates the iterator using our favorite list of numbers and stores it in self.contents. Since an iterator needs a way to hold state and recall where it left off, this method keeps track of which index we're on using self.curr_index.
  • __iter__(): Recall that in order for an iterator to be compatible with a for loop, the loop must be able to call this method; that's why it's included here, and it simply returns the iterator itself. Technically, we don't need this method if we only want to traverse the iterator manually, but why restrict ourselves?
  • __next__(): This method provides the instructions needed to determine the subsequent item: move down the contents of the list by one index, updating the state each time, and upon reaching the end, raise an exception to indicate the iterator is exhausted.

Let's use this class to produce a "clone" of myIterator and test it out in a for loop.

In [14]:
myIteratorClone = TheNumbersMaker()
for item in myIteratorClone:
    print(item, end=' ')
4 8 15 16 23 42 

The for loop should have exhausted the iterator so let's see what happens when we call next() manually.

In [15]:
# Confirm that an exception is raised when calling next() after the iterator is
# exhausted
next(myIteratorClone)
---------------------------------------------------------------------------
StopIteration                             Traceback (most recent call last)
<ipython-input-15-386b0a8c8ce3> in <module>()
      1 # Confirm that an exception is raised when calling next() after the iterator is
      2 # exhausted
----> 3 next(myIteratorClone)

<ipython-input-13-9aafb5109040> in __next__(self)
     25             return self.contents[curr_index]
     26         else:
---> 27             raise StopIteration()

StopIteration: 

Our custom iterator works like a charm! Now that we know how to build our own iterator, we could make more complex ones by changing the class. Perhaps we'd like to produce an iterator from a list of our choosing—we'd just need to add a contents parameter to __init__() and assign it to self.contents. Or what if we wanted to yield every third item?
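
As a rough sketch of both ideas (the class name CustomNumbersMaker and the step parameter are my own, purely for illustration), it might look something like this:

In [ ]:
# A hypothetical variation on TheNumbersMaker: the contents come from a list
# of our choosing, and an optional step lets us skip items
class CustomNumbersMaker():
    def __init__(self, contents, step=1):
        self.contents = contents
        self.step = step               # e.g. step=3 yields every third item
        self.curr_index = 0

    def __iter__(self):
        return self

    def __next__(self):
        if self.curr_index < len(self.contents):
            curr_index = self.curr_index
            self.curr_index += self.step
            return self.contents[curr_index]
        else:
            raise StopIteration()

for item in CustomNumbersMaker([4, 8, 15, 16, 23, 42], step=3):
    print(item, end=' ')               # 4 16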

4. Infinite iterators

Every iterator we've produced thus far is finite—each one can eventually be exhausted, either manually or by a for loop. However, we can amend the class to produce iterators that never stop yielding items. Let's investigate by building a class that produces one of these uncanny-sounding iterators: one that yields Fibonacci numbers. The instructions for yielding the nth Fibonacci number are:

$$F_1 = 0, \qquad F_2 = 1, \qquad F_n = F_{n-1} + F_{n-2}$$

This translates to: starting with 0 and 1, each subsequent Fibonacci number is the sum of the previous two, leading to the sequence 0, 1, 1, 2, 3, 5, 8, 13, 21...

In [16]:
class FibNumMaker():
    def __init__(self):
        self.last, self.curr = 0, 1

    def __iter__(self):
        return self
    
    def __next__(self):
        number = self.curr
        self.curr += self.last   # the next number is the sum of the two most recent
        self.last = number       # shift the stored states forward
        return number

Again, let's go through each method of this iterator protocol.

  • __init__(): Instantiates the iterator with 0 and 1 (the first two Fibonacci numbers). Since the next Fibonacci number depends on the previous two numbers, we need to keep track of both the current and previous states of this iterator, using self.curr and self.last, respectively.
  • __iter__(): Enables the iterator to be compatible with for loops
  • __next__(): Instructions for determining and yielding the subsequent Fibonacci number

Let's produce an iterator using this iterator protocol and explore what happens as we manually attempt to yield items.

In [17]:
fib_seq = FibNumMaker()
In [18]:
next(fib_seq)
Out[18]:
1
In [19]:
next(fib_seq)
Out[19]:
1
In [20]:
next(fib_seq)
Out[20]:
2
In [21]:
next(fib_seq)
Out[21]:
3
In [22]:
next(fib_seq)
Out[22]:
5

So far so good, but to test if fib_seq will continue yielding items, let's ask it for the next 15 numbers via a for loop.

In [23]:
for item in range(15):
    print(next(fib_seq), end=' ')
8 13 21 34 55 89 144 233 377 610 987 1597 2584 4181 6765 

Fantastic, we've built an iterator that successively yields Fibonacci numbers; since there are infinitely many Fibonacci numbers, fib_seq is an infinite iterator. We can keep calling next(fib_seq) and we'll never exhaust the iterator! However, there are a few things to keep in mind:

  • Notice we didn't use fib_seq as obj in for item in obj: .... If we were to do so, the loop would implicitly call next(fib_seq) each cycle; because fib_seq can't be exhausted, the loop would never end! As an alternative, we elected to use the for loop to manually call next(fib_seq) 15 times (another option is sketched just after this list).
  • fib_seq is an infinite iterator because it's not derived from a finite iterable such as myIterator or myIteratorClone. In fact, fib_seq isn't derived from any iterable; rather, each time we call next(fib_seq), we simply compute a rolling sum using our two stored states (self.curr and self.last) and then update the states accordingly.
  • Because fib_seq can't be exhausted, we didn't need a condition that raises a StopIteration exception in the __next__() method of the iterator protocol.
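
If you're curious, the standard library also offers a safe way to consume an infinite iterator in a for loop; here's a short sketch using itertools.islice, which caps the iterator at a fixed number of items.

In [ ]:
from itertools import islice

# islice stops after 10 items, so the loop terminates even though the
# underlying iterator is infinite
for item in islice(FibNumMaker(), 10):
    print(item, end=' ')    # 1 1 2 3 5 8 13 21 34 55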

5. Why iterators are so powerful

Let's say we wanted the sum of all prime numbers smaller than a maximum value. We could first create a list of all integers up to maximum, then use our favorite algorithm to keep only the prime numbers, and then pass the list to sum(). What if maximum was 1 billion? The filtered list would not only take a long time to construct but would also require a needlessly large amount of memory, when all we really care about is its sum. Instead, we could construct an iterator that successively yields prime numbers and compute a rolling sum; when the next prime number exceeds maximum, we'd know to stop and return the current sum.
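
Here's a rough sketch of that rolling-sum idea (the PrimeMaker class and its simple trial-division primality check are just illustrative, not an optimized implementation):

In [ ]:
# An iterator that successively yields prime numbers
class PrimeMaker():
    def __init__(self):
        self.curr = 1

    def __iter__(self):
        return self

    def __next__(self):
        # Step through the integers until the next prime turns up
        while True:
            self.curr += 1
            if all(self.curr % d != 0 for d in range(2, int(self.curr ** 0.5) + 1)):
                return self.curr

# Keep a rolling sum and break as soon as a prime exceeds the maximum;
# no list of a billion integers ever needs to exist
maximum = 50
total = 0
for prime in PrimeMaker():
    if prime > maximum:
        break
    total += prime
print(total)    # 328, the sum of the primes smaller than 50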

Not surprisingly, iterators can save a tremendous amount of memory and time when working with very large sequences or datasets, or even infinite ones. In fact, iterators enable us to represent an infinite number of items with finite memory. Even with smaller sequences or datasets, using iterators can help us write more efficient code. Here are a few scenarios that come to mind where using an iterator may be advantageous (a short file-processing sketch follows the list):

  • Performing computations on a stream of data
  • Accessing certain items of a sequence without first storing the entire sequence
  • Processing large files or datasets in manageable chunks (buffering)
  • Performing computations on a sequence without knowing if all items in the sequence will be needed
  • Building custom data samplers or random number generators
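
As a small sketch of the "manageable chunks" scenario (the filename below is just a placeholder), an open file object is itself an iterator over its lines, so we can process a file of any size while holding only one line in memory at a time:

In [ ]:
# Tally the total number of characters without loading the whole file;
# 'some_large_file.txt' is a placeholder path
total_chars = 0
with open('some_large_file.txt') as f:
    for line in f:
        total_chars += len(line)
print(total_chars)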

The biggest drawback to using iterators is the numerous lines of code required to implement a complete iterator protocol, even if the resulting iterator is rudimentary (like myIteratorClone). Of course, Python wouldn't live up to its reputation if it didn't offer a few ways to simplify the process of producing iterators, conceivably down to a single, straightforward line of code! We'll learn about these shortcuts and some handy built-in iterators in an upcoming blog post.

If you'd like to play around with the code, here's the GitHub repo. As always, don't hesitate to leave your comments below.
