Iterables, iterators and generators, oh my! Part 2
In a previous post, we learned about iterators—one of the most powerful programming constructs. Our discussion divulged their role as a fundamental but hidden component of Python's for
loop, which led to a startling revelation regarding the for
loop itself (no spoilers here). We also discovered how to implement the iterator protocol to create our very own iterators, even constructing ones that represent infinite data structures. In this post, I'd like to build upon our knowledge and introduce a more elegant and efficient means for producing iterators. However, if you're not comfortable with the iterator protocol and the inner workings of iterators, I strongly recommend familiarizing yourself with Part 1 first.
Table of contents¶
- A shortcut for constructing custom iterators
- Infinite generators
- Generator expressions
- Built-in iterators and generators
1. A shortcut for constructing custom iterators¶
Let's start by examining how to build an iterator that yields the Fibonacci numbers. As a refresher, the instructions for computing the nth Fibonacci number are
$$F_1 = 0 \\ F_2 = 1 \\ F_n = F_{n-1} + F_{n-2}$$To construct this iterator, we need to implement the iterator protocol.
class FibNumMaker():
def __init__(self):
self.last, self.curr = 0, 1
def __iter__(self):
return self
def __next__(self):
temp = self.curr
self.curr += self.last
self.last = temp
return temp
It's difficult to ignore how much boilerplate code is involved in the iterator protocol. Python has always prided itself on being highly readable and concise; not surprisingly, there's an easier way to create our own iterators.
Using generator functions¶
You may have noticed whenever we referred to retrieving the subsequent item from an iterator, we used the term yield—that was no arbitrary choice of words. It turns out that yield
is a keyword linked to another method for building iterators called a generator function.
The best way to understand how to use a generator function is to see one in action. Let's implement one that creates an iterator that successively yields a sequence of three numbers: 10, 20 and 30.
def gen_func():
print('First item')
yield 10
print('Second item')
yield 20
print('Third item')
yield 30
print('No more items left!')
To produce an iterator, we call the generator function and assign it to an object.
seq = gen_func()
We call next()
on the iterator to yield its subsequent items.
next(seq)
next(seq)
next(seq)
next(seq)
Look familiar? We see that seq
behaved exactly like an iterator built using an iterator protocol; once seq
was exhausted, it even raised a StopIteration
exception! An iterator created by a generator function is called a generator, but it operates in a similar manner to an iterator created by an iterator protocol. The major difference lies within the construction procedure so let's dive into the generator function and the yield
keyword.
How to use the yield keyword¶
In an iterator protocol, we used class attributes to hold state and determine the next item. In a generator function, this responsibility is delegated to the yield
keyword.
In the above example, we produced the generator seq
using the generator function gen_func()
. When we call next()
on the generator to yield its subsequent item, here is what's going on behind the scenes:
- Run any lines of code in the generator function, such as the
print()
function in our example, until we hit ayield
statement - Return the value or variable dictated by the
yield
statement - Pause the generator function and relinquish control
Upon subsequent calls to next()
on the generator, we resume where we left off in the generator function and continue to run any lines of code until we hit another yield
statement, and thus repeating the process. If there are no more yield
statements left in the generator function, calling next()
raises a StopIteration
exception, which indicates we've exhausted the generator. Clearly, there's quite a few things going on in the background, but the yield
statement does a terrific job delineating the generator's next item.
Unlike a traditional function that always starts at its first line after being called, a generator function resembles "a function that remembers its place"; the yield
statement serves as a bookmark of sorts. In fact, the presence of a yield
statement concretely distinguishes a generator function from a traditional function.
One reason generators can be challenging to grasp is because the generator function is doing two things at once:
- Produce a generator
- Determine and yield the subsequent item in an already produced generator
The first role is equivalent to the __init__()
method of the iterator protocol; calling a generator function produces a new generator.
seq = gen_func()
The second role is equivalent to the __next__()
method of the iterator protocol; calling next()
on a generator yields its subsequent item as dictated by the yield
statements within the generator function.
next(seq)
In the iterator protocol, these two roles were explicitly and distinctly defined, but a generator function achieves the same result while doing away with the boilerplate code. For anyone implementing iterator protocols, generator functions are a godsend! On the other hand, if you approach generator functions and the yield
statement without first understanding how iterators and the iterator protocol work, these constructs can be very perplexing.
2. Infinite generators¶
We were able to produce infinite iterators by modifying the iterator protocol; similarly, we can tweak a generator function to produce infinite generators. Let's demonstrate by converting our iterator protocol for Fibonacci numbers into a generator function.
def fib_genfunc():
last, curr = 0, 1
while True:
yield curr
last, curr = curr, last + curr
We're employing two tricks in this generator function. First, we ensure that we never run out yield
statements by incorporating an infinite while
loop. Second, since Python handles assignment statements from left to right, we don't need to utilize a temporary variable as we did in the iterator protocol. The power concealed in these few lines of code is boggling; a truly phenomenal example of the elegance of generators.
Let's now produce a generator using fib_genfunc()
and display the first 50 Fibonacci numbers.
fib_seq = fib_genfunc()
for _ in range(50):
print(next(fib_seq), end=' ')
We could also get clever and provide the generator function a parameter that governs the maximum value that can be yielded by the generator. In turn, this provides us with granular control of infinite generators.
def fib_genfunc2(limit):
last, curr = 0, 1
while curr < limit:
yield curr
last, curr = curr, last + curr
For example, we can use fib_genfunc2()
to produce a generator that yields every Fibonacci number below 1 million.
fib_million = fib_genfunc2(1000000)
for item in fib_million:
print(item, end=' ')
3. Generator expressions¶
Believe it or not, there's an even more straightforward method for building generators, related to list comprehensions. As a reminder, list comprehensions are a convenient way to construct a customized list object. For example, let's create a list containing the cubes of even integers between 0 and 20 inclusive and display each element.
cubes_list = [x ** 3 for x in range(21) if x % 2 == 0]
for i in cubes_list:
print(i, end=' ')
If we swap out the brackets for parentheses in the list comprehension, we have a generator expression, which produces a generator that successively yields the same sequence of numbers.
cubes_gen = (x ** 3 for x in range(21) if x % 2 == 0)
for i in cubes_gen:
print(i, end=' ')
Of course, if we add too many conditions in the generator expression, it can become incomprehensible (a big no-no in idiomatic Python). Therefore, a generator function may be more appropriate for building complex generators. However, whenever we're interested in performing an operation on a sequence or a stream of data, it's typically a better idea to use a generator than first creating a list. For example, let's say we wanted to find the sum of the cubes of even integers below 1 million. Let's first do so using a list.
cubes_list2 = [x ** 3 for x in range(1000000) if x % 2 == 0]
sum(cubes_list2)
And now, using a generator.
cubes_gen2 = (x ** 3 for x in range(1000000) if x % 2 == 0)
sum(cubes_gen2)
Both methods get the job done but let's take a peek at the memory footprint of each construct.
from sys import getsizeof
getsizeof(cubes_list2)
getsizeof(cubes_gen2)
We see the list taking up dramatically more memory than the generator. In fact, whether the generator were to yield ten or $10^{100}$ numbers, its memory footprint would always be 88 bytes! Such is the beauty of lazy evaluation.
If wanted to perform the same computation using a generator in a single line of code, here's a tip: the set of parentheses for implementing a generator expression isn't required.
sum(x ** 3 for x in range(1000000) if x % 2 == 0)
Finally, it's very important to note that a generator expression is not a tuple comprehension. That is, a generator expression yields a generator object, not a tuple object.
type(cubes_gen2)
4. Built-in iterators and generators¶
Even with all of these convenient shortcuts for producing iterators and generators, Python includes a module called itertools
that contains some useful prepackaged ones. There's also several ingenious recipes that demonstrate ways to get creative with custom iterators. There's simply way too many to describe here but they're worth browsing through; you'll never know when they could come in handy!
In addition, if you're performing data science using the Pandas library, you've probably wondered if there was a simple way to loop over the rows or columns of a DataFrame. The following methods can do the job:
-
iterrows()
- produces a generator that loops over the indices and rows of a DataFrame -
iteritems()
- produces a generator that loops over the column labels and columns of a DataFrame
By now, you should have an in-depth understanding of iterators and generators, how to build them, and use them effectively. I hope these two posts helped shed light on these abstract constructs and illustrated their utility. At first, I found it quite challenging to wrap my head around iterators and generators, but after appreciating what they can do, they've helped me write cleaner and more efficient code. In fact, now whenever I implement a loop, I'm naturally inclined to think about whether a generator could take its place.
If you'd like to play around with the code, here's the GitHub repo. As always, don't hesitate to leave your comments below.
Comments
Comments powered by Disqus