Iterables, iterators and generators, oh my! Part 1
Iterators and generators are among my favorite programming tools—they're also some of the most powerful. These constructs enable us to write cleaner, more flexible and higher performance code; undoubtedly an invaluable addition to any programmer's toolbox. In addition, iterators and generators are an elegant means to work with large and potentially infinite data structures, coming in handy for data science. However, they can be some of the more perplexing concepts to grasp at first.
In this article, I'd like to deliver a gentle but in-depth introduction to iterators and generators in Python, although they're prevalent in other languages too. Nevertheless, in order to appreciate generators, we need to first have a good handle on iterators. And to understand iterators, we need to start with iterables.
Table of contents¶
- Unraveling the secrets of a for loop
- Putting theory into practice
- Constructing custom iterators
- Infinite iterators
- Why iterators are so powerful
1. Unraveling the secrets of a for loop¶
What are iterables?¶
Most programmers have worked with an iterable before even if they weren't consciously aware of doing so. An iterable is an object whose contents can be traversed or looped over. For example, when implementing a for
loop (for item in obj: ...
), any object that can take the place of obj
is an iterable. Many Python containers are examples of iterables, including list
, set
, frozenset
, tuple
, str
and dict
. Even NumPy arrays, Pandas DataFrames, and open file objects are iterables since we can loop over them. All iterables share a common thread: they implement the __iter__()
method, which we'll get to in the next section.
What are iterators?¶
An iterable provides the for
loop its contents but on its own isn't very useful; an iterable can't tell the loop how to traverse or iterate over its contents, or what the next item is. Therefore, upon initiating, a for
loop cleverly calls iter(obj)
behind-the-scenes, which in turn calls obj.__iter__()
. If obj
doesn't support iteration or sport the __iter__()
method (meaning not an iterable), a TypeError
is raised. Otherwise, obj.__iter__()
hands the loop an iterator; an iterator is an object that knows how to perform the iteration and determines what the next item is. Logically, this means an iterable is any object when passed to iter()
, that's capable of producing an iterator.
How does the for
loop actually employ the iterator to perform the iteration and identify the next item? The answer is after retrieving an iterator, the loop secretly calls next(iterator)
, which in turn calls iterator.__next__()
—don't worry, every iterator implements this method by definition. __next__()
holds instructions that determine the subsequent item, which is then handed to the loop. During each cycle, the loop secretly calls next(iterator)
again; the iterator "remembers" its state and rather than yielding the first item, it resumes where it left off.
Depending on the iterable, its iterator may have differing methodologies for traversing the contents and identifying the next item. Here are instructions for some common iterables:
-
list
ortuple
- loop over each element -
dict
- loop over each key (arbitrary order) -
set
orfrozenset
- loop over each element (arbitrary order) -
str
- loop over each "character" - open file object - loop over each line
In a nutshell, the for
loop first passes an iterable to iter()
to produce an iterator and then in each cycle, implicitly calls next(iterator)
to determine the subsequent item—just one item per cycle. Ha, we now know the inner workings of a for
loop!
2. Putting theory into practice¶
The best way to make sense of iterables and iterators is to experiment with them. Let's put what we've learned to the test and start by creating an iterable.
myIterable = [4, 8, 15, 16, 23, 42]
Now that we have an iterable, we need instructions for traversing its contents so let's produce an iterator from myIterable
and confirm its type.
myIterator = iter(myIterable)
type(myIterator)
We could call next(myIterator)
and explore what happens, but before jumping in, what would happen if we used an iterator in a for
loop instead of an iterable? Will the loop raise a TypeError
when it calls iter(myIterator)
? Let's find out.
for item in myIterator:
print(item, end=' ')
Surprise! It looks like everything went smoothly. The reason is because myIterator
, or any iterator produced via iter()
, implements its own __iter__()
method to satisfy the for
loop. In fact, calling myIterator.__iter__()
returns self
. Simply put, if we produce an iterator from an iterable and then pass this iterator to iter()
(thus calling its __iter__()
method), we get back the same iterator. We can quickly verify this.
iter(myIterator) == myIterator
We've already mentioned that an iterable is any object when passed to iter()
, that's capable of producing an iterator. According to this definition, myIterator
must also be an iterable, which explains why it plays nicely with the for
loop.
Traversing contents manually¶
Now let's continue where we left off and verify that calling next(myIterator)
yields the subsequent item one at a time.
next(myIterator)
Oh no, what happened? Well, the for
loop from earlier had implicitly called next(myIterator)
over and over until the iterator yielded its last item—the loop exhausted the iterator. When we attempted to manually call next(myIterator)
afterwards, a StopIteration
exception was raised to tell us myIterator
has no more items left to yield. Unbeknownst to us, the loop also ran into the same StopIteration
exception after exhausting the iterator; for
loops gracefully handle and conceal this exception, and know when to stop calling next()
.
Once an iterator is exhausted, it's practically useless. However, we can "restock" the contents by recreating the iterator, enabling it to yield items again, starting with the first one.
myIterator = iter(myIterable)
next(myIterator)
next(myIterator)
next(myIterator)
Wonderful, myIterator
has been "restocked" and next()
does its job. Let's keep going and confirm a StopIteration
exception is raised if the iterator is exhausted again.
next(myIterator)
next(myIterator)
next(myIterator)
next(myIterator)
As expected, after again exhausting the iterator, we run into another StopIteration
exception when attempting to call next(myIterator)
.
Revealing the true identity of a for loop¶
Gathering everything we've learned so far about how iterables and iterators work, we can construct a while
loop that emulates for item in obj: expression
.
obj_iter = iter(obj)
while True:
try:
item = next(obj_iter)
except StopIteration:
break
We've divulged the for
loop in Python is actually just a special case of a while
loop! We've also shown that an iterator yields the subsequent item only when passed to next()
, either manually or implicitly within a for
loop; otherwise, the iterator just...sits there. Pretty lazy, no? That's actually the correct terminology to describe this behavior: lazy evaluation. We can think of an iterator as an idle conveyor belt in a factory that churns out a single product when switched on, but then it automatically shuts off. Fortunately, the conveyor belt knows where it left off each time and never runs backward.
3. Constructing custom iterators¶
So far we know only one way to produce iterators: by passing an iterable to iter()
. However, we've determined the building blocks of an iterator include the __iter__()
and __next__()
methods; we can use this information to construct a class
that produces our very own iterators! But before getting fancy, let's build one that produces iterators similar to one we're familiar with: myIterator
.
class TheNumbersMaker():
def __init__(self):
self.contents = [4, 8, 15, 16, 23, 42]
self.curr_index = 0
def __iter__(self):
return self
def __next__(self):
if self.curr_index < len(self.contents):
curr_index = self.curr_index
self.curr_index += 1
return self.contents[curr_index]
else:
raise StopIteration()
This class
is known as an iterator protocol—everything required to produce a custom iterator. Let's go through what each of its methods are doing.
-
__init__()
: Instantiates the iterator using our favorite list of numbers and stores it inself.contents
. Since an iterator needs a way to hold state and recall where it left off, this method keeps track of which index we're on usingself.curr_index
. -
__iter__()
: Recall that in order for an iterator to be compatible with afor
loop, the loop must be able call this method, so that's why it's included to simply return the iterator itself. Technically, we don't need this method if we only want to traverse the iterator manually, but why restrict ourselves? -
__next__()
: This method provides the instructions needed to determine the subsequent item, specifically, move down the contents of the list by one index and update the state each time and upon reaching the end, raise an exception to indicate the iterator is exhausted.
Let's use this iterator protocol to produce a "clone" of myIterator
and test it out in a for
loop.
myIteratorClone = TheNumbersMaker()
for item in myIteratorClone:
print(item, end=' ')
The for
loop should have exhausted the iterator so let's see what happens when we call next()
manually.
next(myIteratorClone)
Our custom iterator works like a charm! Now that we know how to build our own iterator, we could make more complex ones by changing the iterator protocol. Perhaps we'd like to produce an iterator from a list of our choosing—we just need to add a contents
parameter to __init__()
and assign it to self.contents
. Or instead we'd want to yield every third item?
4. Infinite iterators¶
Every iterator we've produced thus far is finite—they can eventually be exhausted, either manually or using a for
loop. However, we can amend the iterator protocol to produce iterators that never stop yielding items. Let's investigate by constructing an iterator protocol that produces one of these uncanny-sounding iterators: one that yields Fibonacci numbers. The instructions for yielding the nth Fibonacci number are:
This translates to: starting with 0 and 1, the next Fibonacci number is the sum of the previous two; thus, leading to the sequence 0, 1, 1, 2, 3, 5, 8, 13, 21...
class FibNumMaker():
def __init__(self):
self.last, self.curr = 0, 1
def __iter__(self):
return self
def __next__(self):
number = self.curr
self.curr += self.last
self.last = number
return number
Again, let's go through each method of this iterator protocol.
-
__init__()
: Instantiates the iterator with 0 and 1 (the first two Fibonacci numbers). Since the next Fibonacci number depends on the previous two numbers, we need to keep track of both the current and previous states of this iterator, usingself.curr
andself.last
, respectively. -
__iter__()
: Enables the iterator to be compatible withfor
loops -
__next__()
: Instructions for determining and yielding the subsequent Fibonacci number
Let's produce an iterator using this iterator protocol and explore what happens as we manually attempt to yield items.
fib_seq = FibNumMaker()
next(fib_seq)
next(fib_seq)
next(fib_seq)
next(fib_seq)
next(fib_seq)
So far so good, but to test if fib_seq
will continue yielding items, let's ask it for the next 15 numbers via a for
loop.
for item in range(15):
print(next(fib_seq), end=' ')
Fantastic, we've built an iterator that successively yields Fibonacci numbers; since there are infinite Fibonacci numbers, fib_seq
is an infinite iterator. We can keep calling next(fib_seq)
and we'll never exhaust the iterator! However, there's a few things to keep in mind:
- Notice we didn't use
fib_seq
asobj
infor item in obj: ...
. If we were to do so, the loop would implicitly callnext(fib_seq)
each cycle; becausefib_seq
can't be exhausted, the loop would never end! As an alternative, we elected to use thefor
loop to manually callnext(feb_seq)
15 times. -
fib_seq
is an infinite iterator because it's not derived from a finite iterable such asmyIterator
ormyIteratorClone
. In fact,fib_seq
isn't derived from any iterable; rather, each time we callnext(fib_seq)
, we simply compute a rolling sum using our two stored states (self.curr
andself.last
) and then update the states accordingly. - Because
fib_seq
can't be exhausted, we didn't need a condition that raises aStopIteration
exception in the__next__()
method of the iterator protocol.
5. Why iterators are so powerful¶
Let's say we wanted the sum of any prime number smaller than a maximum
value. We could first create a list
of all integers up to maximum
, then use our favorite algorithm to keep only the prime numbers, and then pass list
to sum()
. What if maximum
was 1 billion? The filtered list
would not only take a long time just to construct but require a needlessly large amount of memory when all we really care about is its sum. Instead, we could construct an iterator that successively yields prime numbers and compute a rolling sum; when the next prime number exceeds maximum
, we'd know to stop and return the current sum.
Not surprisingly, iterators can save a tremendous amount of memory and time when working very large sequences or datasets, or even infinite ones. In fact, iterators enable us to represent an infinite number of items with finite memory. Even with smaller sequences or datasets, using iterators can help us write more efficient code. Here are a few scenarios that come to mind where using an iterator may be advantageous:
- Performing computations on a stream of data
- Accessing certain items of a sequence without first storing the entire sequence
- Processing large files or datasets in manageable chunks (buffering)
- Performing computations on a sequence without knowing if all items in the sequence will be needed
- Building custom data samplers or random number generators
The biggest drawback to using iterators is the numerous lines of code required to implement a complete iterator protocol, even if the resulting iterator is rudimentary (like myIteratorClone
). Of course, Python wouldn't live up to its reputation if it didn't offer a few ways to simplify the process of producing iterators, conceivably down to a single, straightforward line of code! We'll learn about these shortcuts and some handy built-in iterators in an upcoming blog post.
If you'd like to play around with the code, here's the GitHub repo. As always, don't hesitate to leave your comments below.
Comments
Comments powered by Disqus