Python Advanced – Generators & Iterators

Introduction

If you have ever tried to process a 10 GB log file by reading it entirely into memory, you already know why generators and iterators matter. They are Python’s answer to a fundamental problem: how do you work with sequences of data without materializing everything in memory at once?

An iterator is any object that produces values one at a time through a standard protocol. A generator is a special kind of iterator that you create with a function containing yield statements. Together, they let you build lazy pipelines that process data element by element, consuming only the memory needed for a single item at a time.

This is not just an academic concept. Every for loop in Python uses the iterator protocol under the hood. When you iterate over a file, a database cursor, or a range of numbers, you are already using iterators. Understanding how they work gives you the ability to write code that scales to datasets of any size without blowing up your memory footprint.

In this tutorial, we will cover the iterator protocol from the ground up, build custom iterators and generators, chain them into processing pipelines, and explore the itertools module. By the end, you will have a complete mental model for lazy evaluation in Python.


1. The Iterator Protocol

The iterator protocol is deceptively simple. It consists of two methods:

  • __iter__() — Returns the iterator object itself. This is what makes an object usable in a for loop.
  • __next__() — Returns the next value in the sequence. When there are no more values, it raises StopIteration.

That is the entire contract. Any object that implements both methods is an iterator. Any object that implements __iter__() (even if it returns a separate iterator object) is an iterable.

The distinction matters: a list is an iterable (it has __iter__() that returns a list iterator), but it is not itself an iterator (it does not have __next__()). The iterator is a separate object that tracks the current position.
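You can verify this distinction at runtime with the abstract base classes in collections.abc:

```python
from collections.abc import Iterable, Iterator

numbers = [1, 2, 3]
print(isinstance(numbers, Iterable))        # True  — has __iter__
print(isinstance(numbers, Iterator))        # False — no __next__
print(isinstance(iter(numbers), Iterator))  # True  — the list_iterator object
print(isinstance(iter(numbers), Iterable))  # True  — every iterator is also iterable
```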

# The iterator protocol in action
numbers = [10, 20, 30]

# Get an iterator from the iterable
it = iter(numbers)       # Calls numbers.__iter__()

print(next(it))          # 10  — Calls it.__next__()
print(next(it))          # 20
print(next(it))          # 30
# print(next(it))        # Raises StopIteration

# This is exactly what a for loop does internally:
# 1. Calls iter() on the iterable to get an iterator
# 2. Calls next() repeatedly until StopIteration
# 3. Catches StopIteration silently and exits the loop

for num in [10, 20, 30]:
    print(num)
# Equivalent to the manual iter()/next() calls above

Understanding StopIteration is key. It is not an error — it is the signal that tells Python the sequence is exhausted. The for loop catches it automatically, but if you call next() manually, you need to handle it yourself or pass a default value:

# Handling StopIteration manually
it = iter([1, 2])

print(next(it))           # 1
print(next(it))           # 2
print(next(it, "done"))   # "done" — default value instead of StopIteration

# Without a default, you must catch the exception
it = iter([1])
try:
    print(next(it))       # 1
    print(next(it))       # StopIteration raised here
except StopIteration:
    print("Iterator exhausted")

Making a Class Iterable

To make your own class work with for loops, implement the iterator protocol. Here is a class that counts up from a start value to a stop value:

class CountUp:
    """An iterator that counts from start to stop (inclusive)."""
    
    def __init__(self, start, stop):
        self.start = start
        self.stop = stop
        self.current = start
    
    def __iter__(self):
        return self
    
    def __next__(self):
        if self.current > self.stop:
            raise StopIteration
        value = self.current
        self.current += 1
        return value

# Use it in a for loop
for num in CountUp(1, 5):
    print(num, end=" ")  # 1 2 3 4 5

# Use it with list() to materialize all values
print(list(CountUp(10, 15)))  # [10, 11, 12, 13, 14, 15]

# Use it with sum(), max(), any(), etc.
print(sum(CountUp(1, 100)))   # 5050

2. Built-in Iterators

Python’s built-in collection types — lists, strings, dictionaries, sets, tuples, and files — are all iterable. The iter() function extracts an iterator from any iterable, and next() advances it one step.

# Lists
list_iter = iter([1, 2, 3])
print(next(list_iter))  # 1
print(next(list_iter))  # 2

# Strings (iterate character by character)
str_iter = iter("Python")
print(next(str_iter))  # 'P'
print(next(str_iter))  # 'y'

# Dictionaries (iterate over keys by default)
data = {"name": "Alice", "age": 30, "role": "engineer"}
dict_iter = iter(data)
print(next(dict_iter))  # 'name'
print(next(dict_iter))  # 'age'

# Iterate over values or key-value pairs
for value in data.values():
    print(value, end=" ")  # Alice 30 engineer

for key, value in data.items():
    print(f"{key}={value}", end=" ")  # name=Alice age=30 role=engineer

# Sets (order is not guaranteed)
set_iter = iter({3, 1, 4, 1, 5})
print(next(set_iter))  # Could be any element

# Files are iterators (they yield lines)
with open("example.txt", "w") as f:
    f.write("line 1\nline 2\nline 3\n")

with open("example.txt") as f:
    for line in f:  # f is its own iterator
        print(line.strip())
    # line 1
    # line 2
    # line 3

Notice that files are their own iterators — calling iter(f) returns f itself. This is why you can iterate over a file directly in a for loop. It also means you can only iterate through a file once without resetting the file pointer.
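You can see both behaviors directly (reusing the example.txt convention from above): iter(f) returns the file itself, and seek(0) rewinds it for a second pass.

```python
# A file object is its own iterator; seek(0) resets it
with open("example.txt", "w") as f:
    f.write("line 1\nline 2\n")

with open("example.txt") as f:
    print(iter(f) is f)       # True — the file IS its own iterator
    first_pass = list(f)      # consumes the whole file
    second_pass = list(f)     # [] — already exhausted
    f.seek(0)                 # rewind the underlying file pointer
    third_pass = list(f)      # full contents again

print(first_pass)   # ['line 1\n', 'line 2\n']
print(second_pass)  # []
```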


3. Creating Custom Iterators

Let us build a few more custom iterators to solidify the pattern. Each one implements __iter__() and __next__().

Fibonacci Iterator

class Fibonacci:
    """An iterator that produces Fibonacci numbers up to a maximum value."""
    
    def __init__(self, max_value):
        self.max_value = max_value
        self.a = 0
        self.b = 1
    
    def __iter__(self):
        return self
    
    def __next__(self):
        if self.a > self.max_value:
            raise StopIteration
        value = self.a
        self.a, self.b = self.b, self.a + self.b
        return value

print(list(Fibonacci(100)))
# [0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89]

# Works with any function that consumes an iterable
print(sum(Fibonacci(1000)))  # 2583

Range Reimplementation

class MyRange:
    """A simplified reimplementation of range()."""
    
    def __init__(self, start, stop=None, step=1):
        if stop is None:
            self.start = 0
            self.stop = start
        else:
            self.start = start
            self.stop = stop
        self.step = step
    
    def __iter__(self):
        # Return a new iterator each time — this allows reuse
        current = self.start
        while (self.step > 0 and current < self.stop) or \
              (self.step < 0 and current > self.stop):
            yield current  # Using yield here makes __iter__ a generator
            current += self.step
    
    def __len__(self):
        # Separate formulas for positive and negative steps, mirroring range()
        if self.step > 0:
            return max(0, (self.stop - self.start + self.step - 1) // self.step)
        return max(0, (self.stop - self.start + self.step + 1) // self.step)
    
    def __repr__(self):
        return f"MyRange({self.start}, {self.stop}, {self.step})"

# Forward range
print(list(MyRange(5)))         # [0, 1, 2, 3, 4]
print(list(MyRange(2, 8)))      # [2, 3, 4, 5, 6, 7]
print(list(MyRange(0, 10, 3)))  # [0, 3, 6, 9]

# Reverse range
print(list(MyRange(10, 0, -2))) # [10, 8, 6, 4, 2]

# Reusable (unlike a plain iterator)
r = MyRange(3)
print(list(r))  # [0, 1, 2]
print(list(r))  # [0, 1, 2] — works again because __iter__ creates a new generator

Notice the MyRange trick: instead of implementing __next__() directly, the __iter__() method uses yield, which makes it a generator function. Each call to __iter__() creates a fresh generator object, so the range is reusable. This is a common and powerful pattern.
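The same pattern works for any container-like class. Here is a minimal sketch (the Sentence class is illustrative, not from any library) where __iter__ is a generator, so each iteration is independent:

```python
class Sentence:
    """Iterable over the words of a sentence."""

    def __init__(self, text):
        self.text = text

    def __iter__(self):
        # A fresh generator per call makes the object reusable
        for word in self.text.split():
            yield word

s = Sentence("the quick brown fox")
print(list(s))  # ['the', 'quick', 'brown', 'fox']
print(list(s))  # ['the', 'quick', 'brown', 'fox'] — works again
```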


4. Generator Functions

Writing custom iterator classes is verbose. You need __init__, __iter__, __next__, manual state management, and StopIteration handling. Generators solve this by letting you write iterator logic as a simple function with yield statements.

When Python encounters a yield in a function body, that function becomes a generator function. Calling it does not execute the body — it returns a generator object that implements the iterator protocol automatically.

def count_up(start, stop):
    """A generator that counts from start to stop."""
    current = start
    while current <= stop:
        yield current       # Pause here, return current value
        current += 1        # Resume here on next() call

# Calling the function returns a generator object (does NOT run the body)
gen = count_up(1, 5)
print(type(gen))  # <class 'generator'>

# The generator implements the iterator protocol
print(next(gen))  # 1
print(next(gen))  # 2
print(next(gen))  # 3

# Use in a for loop
for num in count_up(1, 5):
    print(num, end=" ")  # 1 2 3 4 5

How Generators Work Internally

When you call next() on a generator, execution proceeds from the current position until it hits a yield statement. At that point, the yielded value is returned to the caller, and the generator's entire state (local variables, instruction pointer) is frozen. The next next() call resumes from exactly where it left off.

def demonstrate_state():
    print("Step 1: Starting")
    yield "first"
    print("Step 2: Resumed after first yield")
    yield "second"
    print("Step 3: Resumed after second yield")
    yield "third"
    print("Step 4: About to finish")
    # No more yields — StopIteration will be raised

gen = demonstrate_state()

print(next(gen))
# Step 1: Starting
# 'first'

print(next(gen))
# Step 2: Resumed after first yield
# 'second'

print(next(gen))
# Step 3: Resumed after second yield
# 'third'

# print(next(gen))
# Step 4: About to finish
# Raises StopIteration

Generator State

You can inspect a generator's state using the inspect module:

import inspect

def simple_gen():
    yield 1
    yield 2

gen = simple_gen()
print(inspect.getgeneratorstate(gen))  # GEN_CREATED

next(gen)
print(inspect.getgeneratorstate(gen))  # GEN_SUSPENDED

next(gen)
print(inspect.getgeneratorstate(gen))  # GEN_SUSPENDED

try:
    next(gen)
except StopIteration:
    pass
print(inspect.getgeneratorstate(gen))  # GEN_CLOSED

A generator moves through four states: GEN_CREATED (just created, not started), GEN_RUNNING (currently executing), GEN_SUSPENDED (paused at a yield), and GEN_CLOSED (finished or closed).

Fibonacci as a Generator

Compare the class-based Fibonacci iterator from earlier with the generator version:

# Generator version — drastically simpler
def fibonacci(max_value=None):
    a, b = 0, 1
    while max_value is None or a <= max_value:
        yield a
        a, b = b, a + b

# Finite sequence
print(list(fibonacci(100)))
# [0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89]

# Infinite sequence (use itertools.islice to take a finite portion)
import itertools
print(list(itertools.islice(fibonacci(), 15)))
# [0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233, 377]

The generator version is 4 lines of logic compared to 12+ lines for the class. No __init__, no __iter__, no __next__, no StopIteration — Python handles all of it.


5. Generator Expressions

Generator expressions are to generators what list comprehensions are to lists. They use the same syntax as list comprehensions, but with parentheses instead of square brackets. The critical difference is that a generator expression produces values lazily — one at a time — while a list comprehension builds the entire list in memory.

import sys

# List comprehension — builds entire list in memory
squares_list = [x ** 2 for x in range(1_000_000)]
print(f"List size: {sys.getsizeof(squares_list):,} bytes")  # ~8,448,728 bytes

# Generator expression — produces values on demand
squares_gen = (x ** 2 for x in range(1_000_000))
print(f"Generator size: {sys.getsizeof(squares_gen):,} bytes")  # ~200 bytes

# Both support filtering
even_squares = (x ** 2 for x in range(20) if x % 2 == 0)
print(list(even_squares))  # [0, 4, 16, 36, 64, 100, 144, 196, 256, 324]

# Generator expressions can be passed directly to functions
# (no extra parentheses needed when it is the only argument)
total = sum(x ** 2 for x in range(1000))
print(total)  # 332833500

max_val = max(len(word) for word in ["Python", "generators", "are", "powerful"])
print(max_val)  # 10

has_negative = any(x < 0 for x in [1, -2, 3, 4])
print(has_negative)  # True

Memory Comparison

import sys

def compare_memory(n):
    """Compare memory usage of list vs generator for n elements."""
    
    # List comprehension
    data_list = [x * 2 for x in range(n)]
    list_size = sys.getsizeof(data_list)
    
    # Generator expression
    data_gen = (x * 2 for x in range(n))
    gen_size = sys.getsizeof(data_gen)
    
    print(f"n={n:>12,}  |  List: {list_size:>12,} bytes  |  Generator: {gen_size:>6,} bytes  |  Ratio: {list_size/gen_size:.0f}x")

compare_memory(100)
compare_memory(10_000)
compare_memory(1_000_000)
compare_memory(10_000_000)

# Output:
# n=         100  |  List:          920 bytes  |  Generator:    200 bytes  |  Ratio: 5x
# n=      10,000  |  List:       87,624 bytes  |  Generator:    200 bytes  |  Ratio: 438x
# n=   1,000,000  |  List:    8,448,728 bytes  |  Generator:    200 bytes  |  Ratio: 42244x
# n=  10,000,000  |  List:   80,000,056 bytes  |  Generator:    200 bytes  |  Ratio: 400000x

The generator's memory footprint is constant regardless of how many elements it produces. This is the fundamental advantage of lazy evaluation.
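Because each stage is lazy, generator expressions also chain without intermediate storage — nothing downstream is computed until a value is requested:

```python
# Two chained generator expressions over a huge range — still O(1) memory
nums = range(10_000_000)
squares = (x * x for x in nums)
odd_squares = (s for s in squares if s % 2 == 1)

print(next(odd_squares))  # 1 — only the first few elements were ever computed
print(next(odd_squares))  # 9
```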


6. yield from

The yield from expression, introduced in Python 3.3, delegates iteration to a sub-generator or any iterable. It is cleaner than manually looping over a sub-iterable and yielding each element.

# Without yield from
def chain_manual(*iterables):
    for iterable in iterables:
        for item in iterable:
            yield item

# With yield from — cleaner
def chain_elegant(*iterables):
    for iterable in iterables:
        yield from iterable

# Both produce the same result
result = list(chain_elegant([1, 2, 3], "abc", (10, 20)))
print(result)  # [1, 2, 3, 'a', 'b', 'c', 10, 20]

Flattening Nested Structures

def flatten(nested):
    """Recursively flatten a nested structure."""
    for item in nested:
        if isinstance(item, (list, tuple)):
            yield from flatten(item)  # Delegate to recursive call
        else:
            yield item

data = [1, [2, 3], [4, [5, 6, [7, 8]]], 9]
print(list(flatten(data)))  # [1, 2, 3, 4, 5, 6, 7, 8, 9]

# Works with mixed nesting
mixed = [1, (2, [3, 4]), [5, (6,)], 7]
print(list(flatten(mixed)))  # [1, 2, 3, 4, 5, 6, 7]

Delegating to Sub-generators

def header_rows():
    yield "Name,Age,City"

def data_rows():
    yield "Alice,30,New York"
    yield "Bob,25,San Francisco"
    yield "Charlie,35,Chicago"

def footer_rows():
    yield "---END OF REPORT---"

def full_report():
    yield from header_rows()
    yield from data_rows()
    yield from footer_rows()

for line in full_report():
    print(line)
# Name,Age,City
# Alice,30,New York
# Bob,25,San Francisco
# Charlie,35,Chicago
# ---END OF REPORT---
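yield from does more than forward values: the expression itself evaluates to the sub-generator's return value, which lets multi-part generators pass results back up the chain. A small sketch:

```python
def subtotal(items):
    """Yield each item, then return the section total."""
    total = 0
    for item in items:
        yield item
        total += item
    return total  # becomes the value of the yield from expression

def report():
    first = yield from subtotal([1, 2, 3])
    second = yield from subtotal([10, 20])
    yield f"totals: {first} and {second}"

print(list(report()))  # [1, 2, 3, 10, 20, 'totals: 6 and 30']
```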

7. Sending Values to Generators

Generators are not just producers — they can also receive values. The send() method resumes a generator and sends a value that becomes the result of the yield expression inside the generator. This turns generators into coroutines that can both produce and consume data.

def running_average():
    """A generator that computes a running average."""
    total = 0
    count = 0
    average = None
    while True:
        value = yield average   # Receive a value, yield the current average
        if value is None:
            break
        total += value
        count += 1
        average = total / count

# Usage
avg = running_average()
next(avg)              # Prime the generator (advance to first yield)

print(avg.send(10))    # 10.0
print(avg.send(20))    # 15.0
print(avg.send(30))    # 20.0
print(avg.send(40))    # 25.0

The first next() call is necessary to "prime" the generator — it advances execution to the first yield expression, where the generator is ready to receive a value. After that, send() both sends a value in and gets the next yielded value out.
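Priming is easy to forget, so a common convenience is a decorator that advances the coroutine for you. This primed decorator is a sketch, not part of the standard library:

```python
import functools

def primed(func):
    """Decorator that advances a coroutine to its first yield."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        gen = func(*args, **kwargs)
        next(gen)  # prime: run to the first yield
        return gen
    return wrapper

@primed
def doubler():
    result = None
    while True:
        value = yield result
        result = value * 2

d = doubler()      # already primed — send() works immediately
print(d.send(5))   # 10
print(d.send(21))  # 42
```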

Coroutine Pattern

def accumulator():
    """A coroutine that accumulates values and reports the running total."""
    total = 0
    while True:
        value = yield total
        if value is None:
            return total        # return value becomes StopIteration.value
        total += value

acc = accumulator()
next(acc)              # Prime

print(acc.send(5))     # 5
print(acc.send(10))    # 15
print(acc.send(3))     # 18

# Close the generator gracefully
try:
    acc.send(None)     # Triggers the return statement
except StopIteration as e:
    print(f"Final total: {e.value}")  # Final total: 18

# Practical coroutine: a filter that receives items and forwards matches
def grep_coroutine(pattern):
    """A coroutine that filters lines matching a pattern."""
    print(f"Looking for: {pattern}")
    matches = []
    while True:
        line = yield
        if line is None:
            break
        if pattern in line:
            matches.append(line)
            print(f"  Match: {line}")
    return matches

# Usage
searcher = grep_coroutine("error")
next(searcher)  # Prime

searcher.send("INFO: Server started")
searcher.send("ERROR: Connection timeout")   # Match
searcher.send("DEBUG: Request received")
searcher.send("ERROR: Disk full")             # Match
searcher.send("INFO: Shutting down")

try:
    searcher.send(None)  # Signal completion
except StopIteration as e:
    print(f"All matches: {e.value}")
# Match: ERROR: Connection timeout
# Match: ERROR: Disk full
# All matches: ['ERROR: Connection timeout', 'ERROR: Disk full']

8. Generator Pipelines

One of the most powerful patterns in Python is chaining generators into a processing pipeline. Each generator reads from the previous one, transforms the data, and passes it along. This works like Unix pipes — data flows through a chain of transformations without any intermediate lists being created in memory.

# Pipeline: Read lines -> filter non-empty -> strip whitespace -> convert to uppercase

def read_lines(text):
    """Stage 1: Split text into lines."""
    for line in text.split("\n"):
        yield line

def filter_non_empty(lines):
    """Stage 2: Remove empty lines."""
    for line in lines:
        if line.strip():
            yield line

def strip_whitespace(lines):
    """Stage 3: Strip leading/trailing whitespace."""
    for line in lines:
        yield line.strip()

def to_uppercase(lines):
    """Stage 4: Convert to uppercase."""
    for line in lines:
        yield line.upper()

# Chain the pipeline
raw_text = """
  hello world  
  
  Python generators  
  are powerful  
  
  and memory efficient  
"""

pipeline = to_uppercase(
    strip_whitespace(
        filter_non_empty(
            read_lines(raw_text)
        )
    )
)

for line in pipeline:
    print(line)
# HELLO WORLD
# PYTHON GENERATORS
# ARE POWERFUL
# AND MEMORY EFFICIENT

Data Processing Pipeline

# A more realistic pipeline: process log entries

def parse_log_entries(lines):
    """Parse each line into a structured dict."""
    for line in lines:
        parts = line.split(" | ")
        if len(parts) == 3:
            yield {
                "timestamp": parts[0],
                "level": parts[1],
                "message": parts[2]
            }

def filter_errors(entries):
    """Keep only ERROR entries."""
    for entry in entries:
        if entry["level"] == "ERROR":
            yield entry

def format_alerts(entries):
    """Format entries as alert strings."""
    for entry in entries:
        yield f"ALERT [{entry['timestamp']}]: {entry['message']}"

# Simulate log data
log_data = [
    "2024-01-15 10:00:01 | INFO | Server started",
    "2024-01-15 10:00:05 | ERROR | Database connection failed",
    "2024-01-15 10:00:10 | INFO | Retry attempt 1",
    "2024-01-15 10:00:15 | ERROR | Database connection failed again",
    "2024-01-15 10:00:20 | INFO | Connection restored",
    "2024-01-15 10:00:25 | ERROR | Disk space low",
]

# Build the pipeline
alerts = format_alerts(filter_errors(parse_log_entries(log_data)))

for alert in alerts:
    print(alert)
# ALERT [2024-01-15 10:00:05]: Database connection failed
# ALERT [2024-01-15 10:00:15]: Database connection failed again
# ALERT [2024-01-15 10:00:25]: Disk space low

Each stage processes one item at a time. No intermediate lists are created. This means you could pipe a 100 GB log file through this pipeline and it would use a trivial amount of memory.
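The nested-call syntax gets hard to read as stages accumulate. A small helper (shown here as a hypothetical pipeline function) composes stages left to right instead:

```python
def pipeline(source, *stages):
    """Feed source through each generator stage in order."""
    for stage in stages:
        source = stage(source)
    return source

def keep_even(nums):
    for n in nums:
        if n % 2 == 0:
            yield n

def square(nums):
    for n in nums:
        yield n * n

result = list(pipeline(range(10), keep_even, square))
print(result)  # [0, 4, 16, 36, 64]
```

The stages are listed in the order data flows through them, matching how you would read a Unix pipe.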


9. The itertools Module

The itertools module is Python's standard library for efficient iterator operations. Every function in it returns an iterator, so they compose naturally with generators and pipelines. Here are the functions you will use most often.

Infinite Iterators

import itertools

# count: count from a start value with a step
for i in itertools.islice(itertools.count(10, 2), 5):
    print(i, end=" ")  # 10 12 14 16 18
print()

# cycle: repeat an iterable forever
colors = itertools.cycle(["red", "green", "blue"])
for _ in range(7):
    print(next(colors), end=" ")  # red green blue red green blue red
print()

# repeat: repeat a value n times (or forever)
fives = list(itertools.repeat(5, 4))
print(fives)  # [5, 5, 5, 5]

# Practical use of repeat: initialize a grid
row = list(itertools.repeat(0, 5))
grid = [list(itertools.repeat(0, 5)) for _ in range(3)]
print(grid)  # [[0, 0, 0, 0, 0], [0, 0, 0, 0, 0], [0, 0, 0, 0, 0]]

Terminating Iterators

import itertools

# chain: concatenate multiple iterables
combined = list(itertools.chain([1, 2], [3, 4], [5, 6]))
print(combined)  # [1, 2, 3, 4, 5, 6]

# chain.from_iterable: chain from a single iterable of iterables
nested = [[1, 2], [3, 4], [5, 6]]
flat = list(itertools.chain.from_iterable(nested))
print(flat)  # [1, 2, 3, 4, 5, 6]

# islice: slice an iterator (like list slicing but for iterators)
print(list(itertools.islice(range(100), 5)))         # [0, 1, 2, 3, 4]
print(list(itertools.islice(range(100), 10, 20, 3))) # [10, 13, 16, 19]

# takewhile / dropwhile: take/drop based on a predicate
nums = [1, 3, 5, 7, 2, 4, 6, 8]
print(list(itertools.takewhile(lambda x: x < 6, nums)))  # [1, 3, 5]
print(list(itertools.dropwhile(lambda x: x < 6, nums)))  # [7, 2, 4, 6, 8]

# groupby: group consecutive elements by a key function
data = [("A", 1), ("A", 2), ("B", 3), ("B", 4), ("A", 5)]
for key, group in itertools.groupby(data, key=lambda x: x[0]):
    print(f"{key}: {list(group)}")
# A: [('A', 1), ('A', 2)]
# B: [('B', 3), ('B', 4)]
# A: [('A', 5)]           <-- Note: only groups CONSECUTIVE matches
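Because groupby only merges consecutive runs, the usual recipe is to sort by the same key first so that all matching elements end up grouped together:

```python
import itertools

# Sort by the key, then group — collects ALL matches, not just runs
data = [("A", 1), ("B", 3), ("A", 2), ("B", 4), ("A", 5)]
grouped = {
    key: [pair[1] for pair in group]
    for key, group in itertools.groupby(
        sorted(data, key=lambda x: x[0]), key=lambda x: x[0]
    )
}
print(grouped)  # {'A': [1, 2, 5], 'B': [3, 4]}
```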

Combinatoric Iterators

import itertools

# combinations: all r-length combinations (no repeats, order doesn't matter)
print(list(itertools.combinations("ABCD", 2)))
# [('A','B'), ('A','C'), ('A','D'), ('B','C'), ('B','D'), ('C','D')]

# combinations_with_replacement: combinations allowing repeats
print(list(itertools.combinations_with_replacement("AB", 3)))
# [('A','A','A'), ('A','A','B'), ('A','B','B'), ('B','B','B')]

# permutations: all r-length arrangements (order matters)
print(list(itertools.permutations("ABC", 2)))
# [('A','B'), ('A','C'), ('B','A'), ('B','C'), ('C','A'), ('C','B')]

# product: Cartesian product (like nested for loops)
print(list(itertools.product("AB", [1, 2])))
# [('A',1), ('A',2), ('B',1), ('B',2)]

# Practical: generate all possible configs
sizes = ["small", "medium", "large"]
colors = ["red", "blue"]
materials = ["cotton", "silk"]

for combo in itertools.product(sizes, colors, materials):
    print(combo)
# ('small', 'red', 'cotton')
# ('small', 'red', 'silk')
# ('small', 'blue', 'cotton')
# ... (12 total combinations)

10. Practical Examples

Reading Large Files Line by Line

This is the canonical use case for generators. Instead of loading an entire file into memory, you process it one line at a time.

def read_large_file(file_path):
    """Read a file line by line using a generator."""
    with open(file_path, "r") as f:
        for line in f:
            yield line.strip()

def count_errors_in_log(file_path):
    """Count error lines in a log file without loading it into memory."""
    error_count = 0
    for line in read_large_file(file_path):
        if "ERROR" in line:
            error_count += 1
    return error_count

# For a 10 GB log file, this uses ~1 line of memory at a time
# Instead of loading all 10 GB:
# count = count_errors_in_log("/var/log/huge_application.log")

# Alternative using generator expression:
# error_count = sum(1 for line in read_large_file(path) if "ERROR" in line)

Infinite Sequence Generators

import itertools

def primes():
    """Generate prime numbers indefinitely using a sieve approach."""
    yield 2
    composites = {}  # Maps composite number -> list of primes that divide it
    candidate = 3
    while True:
        if candidate not in composites:
            # candidate is prime
            yield candidate
            composites[candidate * candidate] = [candidate]
        else:
            # candidate is composite; advance each prime factor to its next
            # odd multiple (candidate + prime is even and never visited,
            # so step by 2 * prime)
            for prime in composites[candidate]:
                composites.setdefault(candidate + 2 * prime, []).append(prime)
            del composites[candidate]
        candidate += 2  # Skip even numbers

# Get the first 20 prime numbers
first_20_primes = list(itertools.islice(primes(), 20))
print(first_20_primes)
# [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71]

# Sum of the first 1000 primes
print(sum(itertools.islice(primes(), 1000)))  # 3682913

Data Pipeline: Read CSV, Filter, Transform, Aggregate

import csv
from io import StringIO

# Simulated CSV data
csv_data = """name,department,salary
Alice,Engineering,120000
Bob,Marketing,85000
Charlie,Engineering,135000
Diana,Marketing,90000
Eve,Engineering,110000
Frank,HR,75000
Grace,Engineering,140000
"""

def read_csv_rows(csv_text):
    """Stage 1: Parse CSV into dictionaries."""
    reader = csv.DictReader(StringIO(csv_text))
    for row in reader:
        yield row

def filter_department(rows, dept):
    """Stage 2: Keep only rows matching the department."""
    for row in rows:
        if row["department"] == dept:
            yield row

def transform_salary(rows):
    """Stage 3: Convert salary to int and add a bonus field."""
    for row in rows:
        salary = int(row["salary"])
        row["salary"] = salary
        row["bonus"] = salary * 0.1  # 10% bonus
        yield row

def aggregate(rows):
    """Stage 4: Compute total salary and average."""
    total = 0
    count = 0
    for row in rows:
        total += row["salary"]
        count += 1
        yield row  # Pass through for downstream consumers
    # After iteration, print the summary
    if count > 0:
        print(f"\nTotal salary: ${total:,}")
        print(f"Average salary: ${total/count:,.0f}")
        print(f"Headcount: {count}")

# Build and run the pipeline
pipeline = aggregate(
    transform_salary(
        filter_department(
            read_csv_rows(csv_data),
            "Engineering"
        )
    )
)

for emp in pipeline:
    print(f"{emp['name']}: ${emp['salary']:,} (bonus: ${emp['bonus']:,.0f})")

# Alice: $120,000 (bonus: $12,000)
# Charlie: $135,000 (bonus: $13,500)
# Eve: $110,000 (bonus: $11,000)
# Grace: $140,000 (bonus: $14,000)
#
# Total salary: $505,000
# Average salary: $126,250
# Headcount: 4

Pagination Generator for API Results

import time

def paginated_api_fetch(base_url, page_size=100):
    """
    Generator that fetches paginated API results.
    Yields individual items across all pages.
    """
    page = 1
    while True:
        # Simulate API call (replace with real requests.get())
        url = f"{base_url}?page={page}&size={page_size}"
        print(f"Fetching: {url}")
        
        # Simulated response
        if page <= 3:
            results = [{"id": i, "name": f"Item {i}"} 
                       for i in range((page-1)*page_size + 1, page*page_size + 1)]
        else:
            results = []  # No more data
        
        if not results:
            break  # No more pages
        
        yield from results  # Yield each item individually
        page += 1
        time.sleep(0.1)  # Rate limiting

# The consumer does not need to know about pagination
for item in paginated_api_fetch("https://api.example.com/items", page_size=2):
    print(f"  Processing: {item}")
    if item["id"] >= 5:
        break  # Stop early — remaining pages are never fetched!

# Output:
# Fetching: https://api.example.com/items?page=1&size=2
#   Processing: {'id': 1, 'name': 'Item 1'}
#   Processing: {'id': 2, 'name': 'Item 2'}
# Fetching: https://api.example.com/items?page=2&size=2
#   Processing: {'id': 3, 'name': 'Item 3'}
#   Processing: {'id': 4, 'name': 'Item 4'}
# Fetching: https://api.example.com/items?page=3&size=2
#   Processing: {'id': 5, 'name': 'Item 5'}

Notice the key advantage: when the consumer breaks out of the loop, the generator stops fetching. Pages 4, 5, 6, etc. are never requested. Lazy evaluation means you only do the work that is actually needed.


11. Performance Comparison

Let us put hard numbers on the difference between lists and generators.

import sys
import time
import tracemalloc

def benchmark_list_vs_generator(n):
    """Compare list vs generator for summing n squared numbers."""
    
    # List approach
    tracemalloc.start()
    start = time.perf_counter()
    result_list = sum([x ** 2 for x in range(n)])
    list_time = time.perf_counter() - start
    list_peak = tracemalloc.get_traced_memory()[1]
    tracemalloc.stop()
    
    # Generator approach
    tracemalloc.start()
    start = time.perf_counter()
    result_gen = sum(x ** 2 for x in range(n))
    gen_time = time.perf_counter() - start
    gen_peak = tracemalloc.get_traced_memory()[1]
    tracemalloc.stop()
    
    assert result_list == result_gen
    
    print(f"n = {n:>12,}")
    print(f"  List:      {list_time:.4f}s | Peak memory: {list_peak:>12,} bytes")
    print(f"  Generator: {gen_time:.4f}s  | Peak memory: {gen_peak:>12,} bytes")
    print(f"  Memory saved: {(1 - gen_peak/list_peak)*100:.1f}%")
    print()

benchmark_list_vs_generator(100_000)
benchmark_list_vs_generator(1_000_000)
benchmark_list_vs_generator(10_000_000)

# Typical output:
# n =      100,000
#   List:      0.0234s | Peak memory:      824,464 bytes
#   Generator: 0.0228s | Peak memory:          464 bytes
#   Memory saved: 99.9%
#
# n =    1,000,000
#   List:      0.2451s | Peak memory:    8,448,688 bytes
#   Generator: 0.2389s | Peak memory:          464 bytes
#   Memory saved: 100.0%
#
# n =   10,000,000
#   List:      2.5102s | Peak memory:   80,000,048 bytes
#   Generator: 2.4231s | Peak memory:          464 bytes
#   Memory saved: 100.0%

Key takeaways from the benchmark:

  • Memory: Generators use a constant ~464 bytes regardless of dataset size. Lists grow linearly.
  • Speed: For aggregation operations like sum(), generators are comparable to lists — slightly faster at large sizes in this benchmark because they skip allocating and populating a large list, though for small inputs the suspend/resume overhead can make them marginally slower.
  • When lists win: If you need random access, multiple passes over the data, or the dataset fits comfortably in memory, a list is simpler and sometimes faster due to cache locality.
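When you need two passes but still want to avoid materializing a full list, itertools.tee splits one iterator into independent copies (note that tee buffers internally, so memory grows with how far apart the copies drift):

```python
import itertools

gen = (x * x for x in range(5))
a, b = itertools.tee(gen)

total = sum(a)    # first independent pass
largest = max(b)  # second independent pass
print(total, largest)  # 30 16
```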

12. Common Pitfalls

Generators have some surprising behaviors that trip up even experienced developers. Here are the ones you must know.

Generator Exhaustion

# Generators can only be consumed ONCE
gen = (x ** 2 for x in range(5))

print(list(gen))  # [0, 1, 4, 9, 16]
print(list(gen))  # [] — exhausted! No error, just empty.

# This is a common bug:
def get_numbers():
    yield 1
    yield 2
    yield 3

nums = get_numbers()
print(sum(nums))  # 6
print(sum(nums))  # 0 — the generator is already exhausted!

# Fix: recreate the generator each time, or use a list if you need multiple passes
nums_list = list(get_numbers())
print(sum(nums_list))  # 6
print(sum(nums_list))  # 6
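When two consumers each need the full stream, `itertools.tee` is another option. Keep in mind that it buffers every value one copy reads ahead of the other, so for two complete passes a plain list is usually just as cheap and simpler:

```python
import itertools

def get_numbers():
    yield 1
    yield 2
    yield 3

# tee splits one iterator into two independent ones
a, b = itertools.tee(get_numbers())
print(sum(a))  # 6
print(sum(b))  # 6 — b has its own view of the stream
```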

Cannot Index, Slice, or Get Length

gen = (x for x in range(10))

# These all fail:
# gen[0]      # TypeError: 'generator' object is not subscriptable
# gen[2:5]    # TypeError: 'generator' object is not subscriptable
# len(gen)    # TypeError: object of type 'generator' has no len()

# Workarounds:
import itertools

# Get the nth element (consumes n elements)
def nth(iterable, n, default=None):
    return next(itertools.islice(iterable, n, None), default)

gen = (x ** 2 for x in range(10))
print(nth(gen, 3))  # 9 (the 4th element, 0-indexed)

# Slice an iterator
gen = (x ** 2 for x in range(10))
print(list(itertools.islice(gen, 2, 5)))  # [4, 9, 16]

The Reuse Gotcha

# A subtle bug: storing a generator and trying to use it in multiple places

def get_even_numbers(n):
    return (x for x in range(n) if x % 2 == 0)

evens = get_even_numbers(20)

# First use works fine
for x in evens:
    if x > 6:
        break
print(f"Stopped at {x}")  # Stopped at 8

# Second use — CONTINUES from where we left off, not from the beginning!
remaining = list(evens)
print(remaining)  # [10, 12, 14, 16, 18]

# If you expected [0, 2, 4, 6, 8, 10, 12, 14, 16, 18], you have a bug.
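The fix is the same as for exhaustion: treat the factory function as the reusable thing, not the generator it returns. A minimal sketch:

```python
def get_even_numbers(n):
    return (x for x in range(n) if x % 2 == 0)

evens = get_even_numbers(20)
for x in evens:
    if x > 6:
        break  # evens is now partially consumed

# Fix: ask the factory for a FRESH generator instead of reusing the old one
fresh = list(get_even_numbers(20))
print(fresh)  # [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
```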

Late Binding in Generator Expressions

# Closures (and generator expression bodies) look up variables late —
# when the code actually runs, not when it is defined
funcs = []
for i in range(5):
    funcs.append(lambda: i)  # All lambdas capture the SAME variable i

print([f() for f in funcs])  # [4, 4, 4, 4, 4] — not [0, 1, 2, 3, 4]!

# Fix: use a default argument to capture the current value
funcs = []
for i in range(5):
    funcs.append(lambda i=i: i)  # Each lambda gets its own copy

print([f() for f in funcs])  # [0, 1, 2, 3, 4]
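Generator expressions hit the same late-binding issue in their own way: only the outermost iterable is evaluated immediately, while the body runs when the generator is consumed, picking up whatever values the free variables hold at that moment. A small sketch:

```python
multiplier = 2
gen = (x * multiplier for x in range(3))  # body not evaluated yet

multiplier = 10  # changed before consumption

print(list(gen))  # [0, 10, 20] — uses the NEW value of multiplier
```

If you need the original value, capture it before building the expression (e.g. bind it to a local name or a function parameter), just as with the lambda fix above.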

13. Best Practices

Here are the guidelines I follow when deciding how to use generators in production code.

Use Generators for Large or Potentially Infinite Datasets

# GOOD: generator for processing a large file
def process_log_file(path):
    with open(path) as f:
        for line in f:
            if "ERROR" in line:
                yield parse_error(line)

# BAD: loading entire file into memory
def process_log_file_bad(path):
    with open(path) as f:
        lines = f.readlines()  # Entire file in memory!
    return [parse_error(line) for line in lines if "ERROR" in line]

Prefer Generator Expressions for Simple Transformations

# GOOD: generator expression passed directly to sum()
total = sum(order.total for order in orders if order.status == "completed")

# UNNECESSARY: creating an intermediate list
total = sum([order.total for order in orders if order.status == "completed"])

Use itertools Instead of Reinventing the Wheel

import itertools

# GOOD: use itertools.chain to iterate several sequences as one stream
all_items = itertools.chain(list_a, list_b, list_c)

# GOOD: use itertools.groupby for grouping (input must already be sorted by the same key)
for key, group in itertools.groupby(sorted_data, key=extract_key):
    process_group(key, list(group))

# GOOD: use itertools.islice for taking the first N items from an iterator
first_ten = list(itertools.islice(infinite_generator(), 10))
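To make these concrete, here is a small self-contained sketch combining the three tools (the data and helper names are illustrative, not from a real codebase):

```python
import itertools

def count_up():
    """An infinite generator: 0, 1, 2, ..."""
    n = 0
    while True:
        yield n
        n += 1

# chain two sources, then take only the first eight items with islice
merged = itertools.chain([10, 20], count_up())
print(list(itertools.islice(merged, 8)))  # [10, 20, 0, 1, 2, 3, 4, 5]

# groupby requires input already sorted by the grouping key
words = sorted(["apple", "ant", "bee", "bat", "cat"])
for letter, group in itertools.groupby(words, key=lambda w: w[0]):
    print(letter, list(group))
# a ['ant', 'apple']
# b ['bat', 'bee']
# c ['cat']
```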

Make Reusable Iterables When Needed

# If you need to iterate multiple times, use a class with __iter__
class DataSource:
    def __init__(self, path):
        self.path = path
    
    def __iter__(self):
        with open(self.path) as f:
            for line in f:
                yield line.strip()

# Each for loop gets a fresh iterator
source = DataSource("data.txt")
count = sum(1 for _ in source)        # First pass: count lines
total = sum(len(line) for line in source)  # Second pass: total chars

Document Generator Exhaustion Behavior

def fetch_records(query):
    """
    Yield records matching the query from the database.
    
    WARNING: This generator can only be consumed once.
    If you need multiple passes, materialize with list().
    """
    cursor = db.execute(query)
    for row in cursor:
        yield transform(row)

14. Key Takeaways

  • Iterators are objects that implement __iter__() and __next__(). They produce values one at a time and raise StopIteration when done. Every for loop in Python uses this protocol.
  • Generators are iterators created with yield. They are dramatically simpler to write than class-based iterators. The function's state is automatically saved and restored between next() calls.
  • Generator expressions provide a compact syntax for simple generators: (expr for x in iterable if condition). They use constant memory regardless of the source size.
  • yield from delegates to sub-generators and is essential for flattening nested structures and composing generators cleanly.
  • send() turns generators into coroutines that can receive values as well as produce them. This is a powerful pattern for stateful data processing.
  • Generator pipelines chain multiple generators together like Unix pipes. Data flows through the pipeline one element at a time, keeping memory usage flat.
  • itertools provides battle-tested, C-optimized iterator utilities. Use chain, islice, groupby, combinations, permutations, and product instead of writing your own.
  • Memory matters. For datasets that do not fit in memory, generators are not optional — they are the only way. Even for smaller datasets, generators avoid unnecessary allocations.
  • Generators exhaust. You can only iterate through a generator once. If you need multiple passes, either recreate the generator or materialize it into a list.
  • Use generators by default when processing sequences of data. Switch to lists only when you need random access, multiple iterations, or the dataset is small enough that the simplicity of a list outweighs the memory cost.


