Iterables, iterators and generators, oh my! Part 2

In a previous post, we learned about iterators—one of the most powerful programming constructs. Our discussion revealed their role as a fundamental but hidden component of Python's for loop, which led to a startling revelation regarding the for loop itself (no spoilers here). We also discovered how to implement the iterator protocol to create our very own iterators, even constructing ones that represent infinite data structures. In this post, I'd like to build upon that knowledge and introduce a more elegant and efficient means of producing iterators. If you're not comfortable with the iterator protocol and the inner workings of iterators, I strongly recommend familiarizing yourself with Part 1 first.

Table of contents

  1. A shortcut for constructing custom iterators
  2. Infinite generators
  3. Generator expressions
  4. Built-in iterators and generators

1. A shortcut for constructing custom iterators

Let's start by examining how to build an iterator that yields the Fibonacci numbers. As a refresher, the instructions for computing the nth Fibonacci number are

$$F_1 = 1, \quad F_2 = 1, \quad F_n = F_{n-1} + F_{n-2}$$

To construct this iterator, we need to implement the iterator protocol.

In [1]:
class FibNumMaker:
    def __init__(self):
        # Seed the sequence: curr holds the next number to yield
        self.last, self.curr = 0, 1

    def __iter__(self):
        return self

    def __next__(self):
        temp = self.curr        # stash the number to yield
        self.curr += self.last  # advance to the next Fibonacci number
        self.last = temp        # the stashed number becomes the new "last"
        return temp
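
As a quick sanity check (this cell isn't from the original post), we can pull the first few items off our iterator and confirm they match the recurrence above:

fib = FibNumMaker()
print([next(fib) for _ in range(10)])
# [1, 1, 2, 3, 5, 8, 13, 21, 34, 55]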

It's difficult to ignore how much boilerplate code is involved in the iterator protocol. Python has always prided itself on being highly readable and concise; not surprisingly, there's an easier way to create our own iterators.

Using generator functions

You may have noticed that whenever we referred to retrieving the subsequent item from an iterator, we used the term yield—that was no arbitrary choice of words. It turns out that yield is a keyword at the heart of another method for building iterators: the generator function.

The best way to understand how to use a generator function is to see one in action. Let's implement one that creates an iterator that successively yields a sequence of three numbers: 10, 20 and 30.

In [2]:
def gen_func():
    print('First item')
    yield 10
    print('Second item')
    yield 20
    print('Third item')
    yield 30
    print('No more items left!')

To produce an iterator, we call the generator function and assign the result to a variable.

In [3]:
seq = gen_func()

We call next() on the iterator to yield its subsequent items.

In [4]:
next(seq)
First item
Out[4]:
10
In [5]:
next(seq)
Second item
Out[5]:
20
In [6]:
next(seq)
Third item
Out[6]:
30
In [7]:
next(seq)
No more items left!
---------------------------------------------------------------------------
StopIteration                             Traceback (most recent call last)
<ipython-input-7-1e4fcafbc0ee> in <module>()
----> 1 next(seq)

StopIteration: 

Look familiar? We see that seq behaved exactly like an iterator built via the iterator protocol; once seq was exhausted, it even raised a StopIteration exception! An iterator created by a generator function is called a generator, but it operates just like an iterator created via the iterator protocol. The major difference lies in the construction procedure, so let's dive into the generator function and the yield keyword.

How to use the yield keyword

In the iterator protocol, we used instance attributes to hold state and determine the next item. In a generator function, this responsibility is delegated to the yield keyword.

In the above example, we produced the generator seq using the generator function gen_func(). When we call next() on the generator to yield its subsequent item, here is what's going on behind the scenes:

  1. Run any lines of code in the generator function, such as the print() function in our example, until we hit a yield statement
  2. Return the value or variable dictated by the yield statement
  3. Pause the generator function and relinquish control

Upon subsequent calls to next() on the generator, we resume where we left off in the generator function and continue running any lines of code until we hit another yield statement, thus repeating the process. If there are no more yield statements left in the generator function, calling next() raises a StopIteration exception, which indicates we've exhausted the generator. Clearly, there are quite a few things going on in the background, but the yield statement does a terrific job of delineating the generator's next item.
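
We can actually peek at this pause-and-resume behavior using inspect.getgeneratorstate() from the standard library (a quick demonstration, not from the original post):

from inspect import getgeneratorstate

demo = gen_func()
print(getgeneratorstate(demo))  # GEN_CREATED: no code has run yet
next(demo)                      # prints 'First item' and yields 10
print(getgeneratorstate(demo))  # GEN_SUSPENDED: paused at the first yield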

Unlike a traditional function that always starts at its first line after being called, a generator function resembles "a function that remembers its place"; the yield statement serves as a bookmark of sorts. In fact, the presence of a yield statement concretely distinguishes a generator function from a traditional function.
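
In fact, the standard library can tell the two kinds of functions apart (again, a supplementary sketch rather than part of the original post):

import inspect

def traditional_func():
    return 10

print(inspect.isgeneratorfunction(traditional_func))  # False
print(inspect.isgeneratorfunction(gen_func))          # True: its body contains yield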

One reason generators can be challenging to grasp is that the generator function is doing two things at once:

  1. Producing a generator
  2. Determining and yielding the subsequent item of an already produced generator

The first role is equivalent to the __init__() method of the iterator protocol; calling a generator function produces a new generator.

In [8]:
seq = gen_func()

The second role is equivalent to the __next__() method of the iterator protocol; calling next() on a generator yields its subsequent item as dictated by the yield statements within the generator function.

In [9]:
next(seq)
First item
Out[9]:
10

In the iterator protocol, these two roles were explicitly and distinctly defined, but a generator function achieves the same result while doing away with the boilerplate code. For anyone who has implemented the iterator protocol, generator functions are a godsend! On the other hand, if you approach generator functions and the yield statement without first understanding how iterators and the iterator protocol work, these constructs can be very perplexing.
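
And since a generator is just an iterator, it plugs straight into a for loop, which quietly handles the calls to next() and the StopIteration exception for us, exactly as we saw in Part 1 (a brief illustration, not from the original cells):

for item in gen_func():
    print(item)
# First item
# 10
# Second item
# 20
# Third item
# 30
# No more items left!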

2. Infinite generators

We were able to produce infinite iterators with the iterator protocol; similarly, we can tweak a generator function to produce infinite generators. Let's demonstrate by converting our Fibonacci iterator class into a generator function.

In [10]:
def fib_genfunc():
    last, curr = 0, 1
    while True:
        yield curr
        last, curr = curr, last + curr

We're employing two tricks in this generator function. First, we ensure that we never run out of yield statements by incorporating an infinite while loop. Second, since Python evaluates the entire right-hand side of an assignment before binding the names on the left, we don't need a temporary variable as we did in the iterator protocol. The power packed into these few lines of code is mind-boggling; a truly phenomenal example of the elegance of generators.
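
To see why no temporary variable is needed, note that Python builds the entire tuple on the right-hand side before binding any names on the left (a small illustration, not from the original post):

last, curr = 0, 1
last, curr = curr, last + curr  # the tuple (1, 0 + 1) is built first, then unpacked
print(last, curr)               # 1 1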

Let's now produce a generator using fib_genfunc() and display the first 50 Fibonacci numbers.

In [11]:
fib_seq = fib_genfunc()

for _ in range(50):
    print(next(fib_seq), end=' ')
1 1 2 3 5 8 13 21 34 55 89 144 233 377 610 987 1597 2584 4181 6765 10946 17711 28657 46368 75025 121393 196418 317811 514229 832040 1346269 2178309 3524578 5702887 9227465 14930352 24157817 39088169 63245986 102334155 165580141 267914296 433494437 701408733 1134903170 1836311903 2971215073 4807526976 7778742049 12586269025 

We could also get clever and provide the generator function a parameter that governs the maximum value that can be yielded by the generator. In turn, this provides us with granular control of infinite generators.

In [12]:
def fib_genfunc2(limit):
    last, curr = 0, 1
    while curr < limit:
        yield curr
        last, curr = curr, last + curr

For example, we can use fib_genfunc2() to produce a generator that yields every Fibonacci number below 1 million.

In [13]:
fib_million = fib_genfunc2(1000000)

for item in fib_million:
    print(item, end=' ')
1 1 2 3 5 8 13 21 34 55 89 144 233 377 610 987 1597 2584 4181 6765 10946 17711 28657 46368 75025 121393 196418 317811 514229 832040 

3. Generator expressions

Believe it or not, there's an even more straightforward method for building generators, related to list comprehensions. As a reminder, list comprehensions are a convenient way to construct a customized list object. For example, let's create a list containing the cubes of even integers between 0 and 20 inclusive and display each element.

In [14]:
cubes_list = [x ** 3 for x in range(21) if x % 2 == 0]

for i in cubes_list:
    print(i, end=' ')
0 8 64 216 512 1000 1728 2744 4096 5832 8000 

If we swap out the brackets for parentheses in the list comprehension, we have a generator expression, which produces a generator that successively yields the same sequence of numbers.

In [15]:
cubes_gen = (x ** 3 for x in range(21) if x % 2 == 0)

for i in cubes_gen:
    print(i, end=' ')
0 8 64 216 512 1000 1728 2744 4096 5832 8000 

Of course, if we cram too many conditions into a generator expression, it can become incomprehensible (a big no-no in idiomatic Python), so a generator function may be more appropriate for building complex generators. However, whenever we're interested in performing an operation on a sequence or a stream of data, it's typically better to use a generator than to first create a list. For example, let's say we wanted to find the sum of the cubes of even integers below 1 million. Let's first do so using a list.

In [16]:
cubes_list2 = [x ** 3 for x in range(1000000) if x % 2 == 0]
sum(cubes_list2)
Out[16]:
124999500000500000000000

And now, using a generator.

In [17]:
cubes_gen2 = (x ** 3 for x in range(1000000) if x % 2 == 0)
sum(cubes_gen2)
Out[17]:
124999500000500000000000

Both methods get the job done but let's take a peek at the memory footprint of each construct.

In [18]:
from sys import getsizeof

getsizeof(cubes_list2)
Out[18]:
4290016
In [19]:
getsizeof(cubes_gen2)
Out[19]:
88

We see the list taking up dramatically more memory than the generator. In fact, whether the generator were to yield ten or $10^{100}$ numbers, its memory footprint would remain a constant 88 bytes on this interpreter! A generator only stores its current state and computes each item on demand; such is the beauty of lazy evaluation.
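
To drive the point home, here's a quick check that the footprint doesn't grow with the size of the underlying sequence (a supplementary cell; exact byte counts may vary across Python versions and platforms):

from sys import getsizeof  # already imported above

small_gen = (x ** 3 for x in range(10))
huge_gen = (x ** 3 for x in range(10 ** 18))
print(getsizeof(small_gen) == getsizeof(huge_gen))  # True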

If we wanted to perform the same computation using a generator in a single line of code, here's a tip: when a generator expression is the sole argument to a function call, its enclosing parentheses can be omitted.

In [20]:
sum(x ** 3 for x in range(1000000) if x % 2 == 0)
Out[20]:
124999500000500000000000
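
One caveat: the parentheses can only be dropped when the generator expression is the sole argument. If the function call takes additional arguments, the generator expression must keep its own parentheses (a quick illustration using sum()'s optional start argument):

sum(x ** 3 for x in range(10))       # OK: sole argument
sum((x ** 3 for x in range(10)), 0)  # OK: extra argument, so parentheses required
# sum(x ** 3 for x in range(10), 0)  # SyntaxError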

Finally, it's very important to note that a generator expression is not a tuple comprehension. That is, a generator expression yields a generator object, not a tuple object.

In [21]:
type(cubes_gen2)
Out[21]:
generator
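
One more caveat worth keeping in mind (not covered in the original cells): a generator is a single-pass construct. Once it's exhausted, iterating over it again yields nothing:

once = (x ** 3 for x in range(5))
print(sum(once))  # 100, i.e. 0 + 1 + 8 + 27 + 64
print(sum(once))  # 0: the generator is already exhausted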

4. Built-in iterators and generators

Even with all of these convenient shortcuts for producing iterators and generators, Python includes a module called itertools that contains some useful prepackaged ones. The documentation also offers several ingenious recipes that demonstrate ways to get creative with custom iterators. There are far too many to describe here, but they're worth browsing through; you never know when they might come in handy!
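
For instance, itertools.islice() can take a slice of any iterator, including the infinite Fibonacci generator we built earlier (a quick sketch, not from the original post):

from itertools import islice

first_ten = list(islice(fib_genfunc(), 10))
print(first_ten)  # [1, 1, 2, 3, 5, 8, 13, 21, 34, 55]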

In addition, if you're performing data science with the pandas library, you've probably wondered whether there's a simple way to loop over the rows or columns of a DataFrame. The following methods do the job (sketched below):

  • iterrows() - produces a generator that yields each row of a DataFrame as an (index, row) pair, where the row is a Series
  • iteritems() - produces a generator that yields each column of a DataFrame as a (label, column) pair, where the column is a Series (renamed items() in more recent pandas releases)
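
Here's a minimal sketch of iterrows() in action (the DataFrame and its column names are made up for illustration):

import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [10, 20]})

for idx, row in df.iterrows():
    # idx is the row's index label; row is a Series holding that row's values
    print(idx, row['a'], row['b'])
# 0 1 10
# 1 2 20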

By now, you should have an in-depth understanding of iterators and generators, how to build them, and how to use them effectively. I hope these two posts helped shed light on these abstract constructs and illustrated their utility. At first, I found it quite challenging to wrap my head around iterators and generators, but after appreciating what they can do, they've helped me write cleaner and more efficient code. In fact, whenever I implement a loop now, I'm naturally inclined to consider whether a generator could take its place.

If you'd like to play around with the code, here's the GitHub repo. As always, don't hesitate to leave your comments below.
