If you have ever tried to process a 10 GB log file by reading it entirely into memory, you already know why generators and iterators matter. They are Python’s answer to a fundamental problem: how do you work with sequences of data without materializing everything in memory at once?
An iterator is any object that produces values one at a time through a standard protocol. A generator is a special kind of iterator that you create with a function containing yield statements. Together, they let you build lazy pipelines that process data element by element, consuming only the memory needed for a single item at a time.
This is not just an academic concept. Every for loop in Python uses the iterator protocol under the hood. When you iterate over a file, a database cursor, or a range of numbers, you are already using iterators. Understanding how they work gives you the ability to write code that scales to datasets of any size without blowing up your memory footprint.
In this tutorial, we will cover the iterator protocol from the ground up, build custom iterators and generators, chain them into processing pipelines, and explore the itertools module. By the end, you will have a complete mental model for lazy evaluation in Python.
The iterator protocol is deceptively simple. It consists of two methods:
- __iter__() — Returns the iterator object itself. This is what makes an object usable in a for loop.
- __next__() — Returns the next value in the sequence. When there are no more values, it raises StopIteration.

That is the entire contract. Any object that implements both methods is an iterator. Any object that implements __iter__() (even if it returns a separate iterator object) is an iterable.
The distinction matters: a list is an iterable (it has __iter__() that returns a list iterator), but it is not itself an iterator (it does not have __next__()). The iterator is a separate object that tracks the current position.
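You can verify this distinction directly with hasattr — a quick sketch:

```python
# A list is iterable but not an iterator; the iterator is a separate object.
numbers = [10, 20, 30]
print(hasattr(numbers, "__iter__"))  # True: it is iterable
print(hasattr(numbers, "__next__"))  # False: it is not an iterator
it = iter(numbers)                   # The separate iterator object
print(hasattr(it, "__next__"))       # True
print(iter(it) is it)                # True: an iterator's __iter__ returns itself
```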
# The iterator protocol in action
numbers = [10, 20, 30]
# Get an iterator from the iterable
it = iter(numbers) # Calls numbers.__iter__()
print(next(it)) # 10 — Calls it.__next__()
print(next(it)) # 20
print(next(it)) # 30
# print(next(it)) # Raises StopIteration
# This is exactly what a for loop does internally:
# 1. Calls iter() on the iterable to get an iterator
# 2. Calls next() repeatedly until StopIteration
# 3. Catches StopIteration silently and exits the loop
for num in [10, 20, 30]:
print(num)
# Equivalent to the manual iter()/next() calls above
Understanding StopIteration is key. It is not an error — it is the signal that tells Python the sequence is exhausted. The for loop catches it automatically, but if you call next() manually, you need to handle it yourself or pass a default value:
# Handling StopIteration manually
it = iter([1, 2])
print(next(it)) # 1
print(next(it)) # 2
print(next(it, "done")) # "done" — default value instead of StopIteration
# Without a default, you must catch the exception
it = iter([1])
try:
print(next(it)) # 1
print(next(it)) # StopIteration raised here
except StopIteration:
print("Iterator exhausted")
To make your own class work with for loops, implement the iterator protocol. Here is a class that counts up from a start value to a stop value:
class CountUp:
"""An iterator that counts from start to stop (inclusive)."""
def __init__(self, start, stop):
self.start = start
self.stop = stop
self.current = start
def __iter__(self):
return self
def __next__(self):
if self.current > self.stop:
raise StopIteration
value = self.current
self.current += 1
return value
# Use it in a for loop
for num in CountUp(1, 5):
print(num, end=" ") # 1 2 3 4 5
# Use it with list() to materialize all values
print(list(CountUp(10, 15))) # [10, 11, 12, 13, 14, 15]
# Use it with sum(), max(), any(), etc.
print(sum(CountUp(1, 100))) # 5050
Python’s built-in types are all iterable. The iter() function extracts an iterator from any iterable, and next() advances it one step.
# Lists
list_iter = iter([1, 2, 3])
print(next(list_iter)) # 1
print(next(list_iter)) # 2
# Strings (iterate character by character)
str_iter = iter("Python")
print(next(str_iter)) # 'P'
print(next(str_iter)) # 'y'
# Dictionaries (iterate over keys by default)
data = {"name": "Alice", "age": 30, "role": "engineer"}
dict_iter = iter(data)
print(next(dict_iter)) # 'name'
print(next(dict_iter)) # 'age'
# Iterate over values or key-value pairs
for value in data.values():
print(value, end=" ") # Alice 30 engineer
for key, value in data.items():
print(f"{key}={value}", end=" ") # name=Alice age=30 role=engineer
# Sets (order is not guaranteed)
set_iter = iter({3, 1, 4, 1, 5})
print(next(set_iter)) # Could be any element
# Files are iterators (they yield lines)
with open("example.txt", "w") as f:
f.write("line 1\nline 2\nline 3\n")
with open("example.txt") as f:
for line in f: # f is its own iterator
print(line.strip())
# line 1
# line 2
# line 3
Notice that files are their own iterators — calling iter(f) returns f itself. This is why you can iterate over a file directly in a for loop. It also means you can only iterate through a file once without resetting the file pointer.
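A quick sketch of this one-pass behavior, using io.StringIO as a stand-in for a real file (both follow the same protocol):

```python
import io

# io.StringIO stands in for a real file here; both follow the same protocol.
f = io.StringIO("line 1\nline 2\n")
print(iter(f) is f)                  # True: the file object is its own iterator
print([line.strip() for line in f])  # ['line 1', 'line 2']
print([line.strip() for line in f])  # []: already exhausted
f.seek(0)                            # Reset the file pointer
print([line.strip() for line in f])  # ['line 1', 'line 2'] again
```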
Let us build a few more custom iterators to solidify the pattern. Each one implements __iter__() and __next__().
class Fibonacci:
"""An iterator that produces Fibonacci numbers up to a maximum value."""
def __init__(self, max_value):
self.max_value = max_value
self.a = 0
self.b = 1
def __iter__(self):
return self
def __next__(self):
if self.a > self.max_value:
raise StopIteration
value = self.a
self.a, self.b = self.b, self.a + self.b
return value
print(list(Fibonacci(100)))
# [0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89]
# Works with any function that consumes an iterable
print(sum(Fibonacci(1000))) # 2583
class MyRange:
"""A simplified reimplementation of range()."""
def __init__(self, start, stop=None, step=1):
if stop is None:
self.start = 0
self.stop = start
else:
self.start = start
self.stop = stop
self.step = step
def __iter__(self):
# Return a new iterator each time — this allows reuse
current = self.start
while (self.step > 0 and current < self.stop) or \
(self.step < 0 and current > self.stop):
yield current # Using yield here makes __iter__ a generator
current += self.step
    def __len__(self):
        # The ceiling-division formula must account for the sign of step;
        # the positive-step formula alone overcounts for negative steps.
        if self.step > 0:
            return max(0, (self.stop - self.start + self.step - 1) // self.step)
        return max(0, (self.start - self.stop - self.step - 1) // -self.step)
def __repr__(self):
return f"MyRange({self.start}, {self.stop}, {self.step})"
# Forward range
print(list(MyRange(5))) # [0, 1, 2, 3, 4]
print(list(MyRange(2, 8))) # [2, 3, 4, 5, 6, 7]
print(list(MyRange(0, 10, 3))) # [0, 3, 6, 9]
# Reverse range
print(list(MyRange(10, 0, -2))) # [10, 8, 6, 4, 2]
# Reusable (unlike a plain iterator)
r = MyRange(3)
print(list(r)) # [0, 1, 2]
print(list(r)) # [0, 1, 2] — works again because __iter__ creates a new generator
Notice the MyRange trick: instead of implementing __next__() directly, the __iter__() method uses yield, which makes it a generator function. Each call to __iter__() creates a fresh generator object, so the range is reusable. This is a common and powerful pattern.
Writing custom iterator classes is verbose. You need __init__, __iter__, __next__, manual state management, and StopIteration handling. Generators solve this by letting you write iterator logic as a simple function with yield statements.
When Python encounters a yield in a function body, that function becomes a generator function. Calling it does not execute the body — it returns a generator object that implements the iterator protocol automatically.
def count_up(start, stop):
"""A generator that counts from start to stop."""
current = start
while current <= stop:
yield current # Pause here, return current value
current += 1 # Resume here on next() call
# Calling the function returns a generator object (does NOT run the body)
gen = count_up(1, 5)
print(type(gen)) # <class 'generator'>
# The generator implements the iterator protocol
print(next(gen)) # 1
print(next(gen)) # 2
print(next(gen)) # 3
# Use in a for loop
for num in count_up(1, 5):
print(num, end=" ") # 1 2 3 4 5
When you call next() on a generator, execution proceeds from the current position until it hits a yield statement. At that point, the yielded value is returned to the caller, and the generator's entire state (local variables, instruction pointer) is frozen. The next next() call resumes from exactly where it left off.
def demonstrate_state():
print("Step 1: Starting")
yield "first"
print("Step 2: Resumed after first yield")
yield "second"
print("Step 3: Resumed after second yield")
yield "third"
print("Step 4: About to finish")
# No more yields — StopIteration will be raised
gen = demonstrate_state()
print(next(gen))
# Step 1: Starting
# 'first'
print(next(gen))
# Step 2: Resumed after first yield
# 'second'
print(next(gen))
# Step 3: Resumed after second yield
# 'third'
# print(next(gen))
# Step 4: About to finish
# Raises StopIteration
You can inspect a generator's state using the inspect module:
import inspect
def simple_gen():
yield 1
yield 2
gen = simple_gen()
print(inspect.getgeneratorstate(gen)) # GEN_CREATED
next(gen)
print(inspect.getgeneratorstate(gen)) # GEN_SUSPENDED
next(gen)
print(inspect.getgeneratorstate(gen)) # GEN_SUSPENDED
try:
next(gen)
except StopIteration:
pass
print(inspect.getgeneratorstate(gen)) # GEN_CLOSED
A generator moves through four states: GEN_CREATED (just created, not started), GEN_RUNNING (currently executing), GEN_SUSPENDED (paused at a yield), and GEN_CLOSED (finished or closed).
Compare the class-based Fibonacci iterator from earlier with the generator version:
# Generator version — drastically simpler
def fibonacci(max_value=None):
a, b = 0, 1
while max_value is None or a <= max_value:
yield a
a, b = b, a + b
# Finite sequence
print(list(fibonacci(100)))
# [0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89]
# Infinite sequence (use itertools.islice to take a finite portion)
import itertools
print(list(itertools.islice(fibonacci(), 15)))
# [0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233, 377]
The generator version is 4 lines of logic compared to 12+ lines for the class. No __init__, no __iter__, no __next__, no StopIteration — Python handles all of it.
Generator expressions are to generators what list comprehensions are to lists. They use the same syntax as list comprehensions, but with parentheses instead of square brackets. The critical difference is that a generator expression produces values lazily — one at a time — while a list comprehension builds the entire list in memory.
import sys
# List comprehension — builds entire list in memory
squares_list = [x ** 2 for x in range(1_000_000)]
print(f"List size: {sys.getsizeof(squares_list):,} bytes") # ~8,448,728 bytes
# Generator expression — produces values on demand
squares_gen = (x ** 2 for x in range(1_000_000))
print(f"Generator size: {sys.getsizeof(squares_gen):,} bytes") # ~200 bytes
# Both support filtering
even_squares = (x ** 2 for x in range(20) if x % 2 == 0)
print(list(even_squares)) # [0, 4, 16, 36, 64, 100, 144, 196, 256, 324]
# Generator expressions can be passed directly to functions
# (no extra parentheses needed when it is the only argument)
total = sum(x ** 2 for x in range(1000))
print(total) # 332833500
max_val = max(len(word) for word in ["Python", "generators", "are", "powerful"])
print(max_val) # 10
has_negative = any(x < 0 for x in [1, -2, 3, 4])
print(has_negative) # True
import sys
def compare_memory(n):
"""Compare memory usage of list vs generator for n elements."""
# List comprehension
data_list = [x * 2 for x in range(n)]
list_size = sys.getsizeof(data_list)
# Generator expression
data_gen = (x * 2 for x in range(n))
gen_size = sys.getsizeof(data_gen)
print(f"n={n:>12,} | List: {list_size:>12,} bytes | Generator: {gen_size:>6,} bytes | Ratio: {list_size/gen_size:.0f}x")
compare_memory(100)
compare_memory(10_000)
compare_memory(1_000_000)
compare_memory(10_000_000)
# Output:
# n= 100 | List: 920 bytes | Generator: 200 bytes | Ratio: 5x
# n= 10,000 | List: 87,624 bytes | Generator: 200 bytes | Ratio: 438x
# n= 1,000,000 | List: 8,448,728 bytes | Generator: 200 bytes | Ratio: 42244x
# n= 10,000,000 | List: 80,000,056 bytes | Generator: 200 bytes | Ratio: 400000x
The generator's memory footprint is constant regardless of how many elements it produces. This is the fundamental advantage of lazy evaluation.
The yield from expression, introduced in Python 3.3, delegates iteration to a sub-generator or any iterable. It is cleaner than manually looping over a sub-iterable and yielding each element.
# Without yield from
def chain_manual(*iterables):
for iterable in iterables:
for item in iterable:
yield item
# With yield from — cleaner
def chain_elegant(*iterables):
for iterable in iterables:
yield from iterable
# Both produce the same result
result = list(chain_elegant([1, 2, 3], "abc", (10, 20)))
print(result) # [1, 2, 3, 'a', 'b', 'c', 10, 20]
def flatten(nested):
"""Recursively flatten a nested structure."""
for item in nested:
if isinstance(item, (list, tuple)):
yield from flatten(item) # Delegate to recursive call
else:
yield item
data = [1, [2, 3], [4, [5, 6, [7, 8]]], 9]
print(list(flatten(data))) # [1, 2, 3, 4, 5, 6, 7, 8, 9]
# Works with mixed nesting
mixed = [1, (2, [3, 4]), [5, (6,)], 7]
print(list(flatten(mixed))) # [1, 2, 3, 4, 5, 6, 7]
def header_rows():
yield "Name,Age,City"
def data_rows():
yield "Alice,30,New York"
yield "Bob,25,San Francisco"
yield "Charlie,35,Chicago"
def footer_rows():
yield "---END OF REPORT---"
def full_report():
yield from header_rows()
yield from data_rows()
yield from footer_rows()
for line in full_report():
print(line)
# Name,Age,City
# Alice,30,New York
# Bob,25,San Francisco
# Charlie,35,Chicago
# ---END OF REPORT---
Generators are not just producers — they can also receive values. The send() method resumes a generator and sends a value that becomes the result of the yield expression inside the generator. This turns generators into coroutines that can both produce and consume data.
def running_average():
"""A generator that computes a running average."""
total = 0
count = 0
average = None
while True:
value = yield average # Receive a value, yield the current average
if value is None:
break
total += value
count += 1
average = total / count
# Usage
avg = running_average()
next(avg) # Prime the generator (advance to first yield)
print(avg.send(10)) # 10.0
print(avg.send(20)) # 15.0
print(avg.send(30)) # 20.0
print(avg.send(40)) # 25.0
The first next() call is necessary to "prime" the generator — it advances execution to the first yield expression, where the generator is ready to receive a value. After that, send() both sends a value in and gets the next yielded value out.
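A minimal sketch of what happens when you forget to prime: calling send() with a non-None value on a just-created generator raises TypeError. (The echo coroutine here is a hypothetical example, not one from the text above.)

```python
def echo():
    """Hypothetical coroutine: receives values and prints them."""
    while True:
        received = yield
        print(f"Got: {received}")

g = echo()
try:
    g.send("hello")   # Fails: the generator has not reached its first yield yet
except TypeError as exc:
    print(type(exc).__name__)  # TypeError
next(g)               # Prime: run up to the first yield
g.send("hello")       # Got: hello
g.close()             # Shut down cleanly (raises GeneratorExit inside the body)
```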
def accumulator():
"""A coroutine that accumulates values and reports the running total."""
total = 0
while True:
value = yield total
if value is None:
return total # return value becomes StopIteration.value
total += value
acc = accumulator()
next(acc) # Prime
print(acc.send(5)) # 5
print(acc.send(10)) # 15
print(acc.send(3)) # 18
# Close the generator gracefully
try:
acc.send(None) # Triggers the return statement
except StopIteration as e:
print(f"Final total: {e.value}") # Final total: 18
# Practical coroutine: a filter that receives items and forwards matches
def grep_coroutine(pattern):
"""A coroutine that filters lines matching a pattern."""
print(f"Looking for: {pattern}")
matches = []
while True:
line = yield
if line is None:
break
if pattern in line:
matches.append(line)
print(f" Match: {line}")
return matches
# Usage
searcher = grep_coroutine("error")
next(searcher) # Prime
searcher.send("INFO: Server started")
searcher.send("ERROR: Connection timeout") # Match
searcher.send("DEBUG: Request received")
searcher.send("ERROR: Disk full") # Match
searcher.send("INFO: Shutting down")
try:
searcher.send(None) # Signal completion
except StopIteration as e:
print(f"All matches: {e.value}")
# Match: ERROR: Connection timeout
# Match: ERROR: Disk full
# All matches: ['ERROR: Connection timeout', 'ERROR: Disk full']
One of the most powerful patterns in Python is chaining generators into a processing pipeline. Each generator reads from the previous one, transforms the data, and passes it along. This works like Unix pipes — data flows through a chain of transformations without any intermediate lists being created in memory.
# Pipeline: Read lines -> filter non-empty -> strip whitespace -> convert to uppercase
def read_lines(text):
"""Stage 1: Split text into lines."""
for line in text.split("\n"):
yield line
def filter_non_empty(lines):
"""Stage 2: Remove empty lines."""
for line in lines:
if line.strip():
yield line
def strip_whitespace(lines):
"""Stage 3: Strip leading/trailing whitespace."""
for line in lines:
yield line.strip()
def to_uppercase(lines):
"""Stage 4: Convert to uppercase."""
for line in lines:
yield line.upper()
# Chain the pipeline
raw_text = """
hello world
Python generators
are powerful
and memory efficient
"""
pipeline = to_uppercase(
strip_whitespace(
filter_non_empty(
read_lines(raw_text)
)
)
)
for line in pipeline:
print(line)
# HELLO WORLD
# PYTHON GENERATORS
# ARE POWERFUL
# AND MEMORY EFFICIENT
# A more realistic pipeline: process log entries
def parse_log_entries(lines):
"""Parse each line into a structured dict."""
for line in lines:
parts = line.split(" | ")
if len(parts) == 3:
yield {
"timestamp": parts[0],
"level": parts[1],
"message": parts[2]
}
def filter_errors(entries):
"""Keep only ERROR entries."""
for entry in entries:
if entry["level"] == "ERROR":
yield entry
def format_alerts(entries):
"""Format entries as alert strings."""
for entry in entries:
yield f"ALERT [{entry['timestamp']}]: {entry['message']}"
# Simulate log data
log_data = [
"2024-01-15 10:00:01 | INFO | Server started",
"2024-01-15 10:00:05 | ERROR | Database connection failed",
"2024-01-15 10:00:10 | INFO | Retry attempt 1",
"2024-01-15 10:00:15 | ERROR | Database connection failed again",
"2024-01-15 10:00:20 | INFO | Connection restored",
"2024-01-15 10:00:25 | ERROR | Disk space low",
]
# Build the pipeline
alerts = format_alerts(filter_errors(parse_log_entries(log_data)))
for alert in alerts:
print(alert)
# ALERT [2024-01-15 10:00:05]: Database connection failed
# ALERT [2024-01-15 10:00:15]: Database connection failed again
# ALERT [2024-01-15 10:00:25]: Disk space low
Each stage processes one item at a time. No intermediate lists are created. This means you could pipe a 100 GB log file through this pipeline and it would use a trivial amount of memory.
The itertools module is Python's standard library for efficient iterator operations. Every function in it returns an iterator, so they compose naturally with generators and pipelines. Here are the functions you will use most often.
import itertools
# count: count from a start value with a step
for i in itertools.islice(itertools.count(10, 2), 5):
print(i, end=" ") # 10 12 14 16 18
print()
# cycle: repeat an iterable forever
colors = itertools.cycle(["red", "green", "blue"])
for _ in range(7):
print(next(colors), end=" ") # red green blue red green blue red
print()
# repeat: repeat a value n times (or forever)
fives = list(itertools.repeat(5, 4))
print(fives) # [5, 5, 5, 5]
# Practical use of repeat: initialize a grid
row = list(itertools.repeat(0, 5))
grid = [list(itertools.repeat(0, 5)) for _ in range(3)]
print(grid) # [[0, 0, 0, 0, 0], [0, 0, 0, 0, 0], [0, 0, 0, 0, 0]]
import itertools
# chain: concatenate multiple iterables
combined = list(itertools.chain([1, 2], [3, 4], [5, 6]))
print(combined) # [1, 2, 3, 4, 5, 6]
# chain.from_iterable: chain from a single iterable of iterables
nested = [[1, 2], [3, 4], [5, 6]]
flat = list(itertools.chain.from_iterable(nested))
print(flat) # [1, 2, 3, 4, 5, 6]
# islice: slice an iterator (like list slicing but for iterators)
print(list(itertools.islice(range(100), 5))) # [0, 1, 2, 3, 4]
print(list(itertools.islice(range(100), 10, 20, 3))) # [10, 13, 16, 19]
# takewhile / dropwhile: take/drop based on a predicate
nums = [1, 3, 5, 7, 2, 4, 6, 8]
print(list(itertools.takewhile(lambda x: x < 6, nums))) # [1, 3, 5]
print(list(itertools.dropwhile(lambda x: x < 6, nums))) # [7, 2, 4, 6, 8]
# groupby: group consecutive elements by a key function
data = [("A", 1), ("A", 2), ("B", 3), ("B", 4), ("A", 5)]
for key, group in itertools.groupby(data, key=lambda x: x[0]):
print(f"{key}: {list(group)}")
# A: [('A', 1), ('A', 2)]
# B: [('B', 3), ('B', 4)]
# A: [('A', 5)] <-- Note: only groups CONSECUTIVE matches
import itertools
# combinations: all r-length combinations (no repeats, order doesn't matter)
print(list(itertools.combinations("ABCD", 2)))
# [('A','B'), ('A','C'), ('A','D'), ('B','C'), ('B','D'), ('C','D')]
# combinations_with_replacement: combinations allowing repeats
print(list(itertools.combinations_with_replacement("AB", 3)))
# [('A','A','A'), ('A','A','B'), ('A','B','B'), ('B','B','B')]
# permutations: all r-length arrangements (order matters)
print(list(itertools.permutations("ABC", 2)))
# [('A','B'), ('A','C'), ('B','A'), ('B','C'), ('C','A'), ('C','B')]
# product: Cartesian product (like nested for loops)
print(list(itertools.product("AB", [1, 2])))
# [('A',1), ('A',2), ('B',1), ('B',2)]
# Practical: generate all possible configs
sizes = ["small", "medium", "large"]
colors = ["red", "blue"]
materials = ["cotton", "silk"]
for combo in itertools.product(sizes, colors, materials):
print(combo)
# ('small', 'red', 'cotton')
# ('small', 'red', 'silk')
# ('small', 'blue', 'cotton')
# ... (12 total combinations)
This is the canonical use case for generators. Instead of loading an entire file into memory, you process it one line at a time.
def read_large_file(file_path):
"""Read a file line by line using a generator."""
with open(file_path, "r") as f:
for line in f:
yield line.strip()
def count_errors_in_log(file_path):
"""Count error lines in a log file without loading it into memory."""
error_count = 0
for line in read_large_file(file_path):
if "ERROR" in line:
error_count += 1
return error_count
# For a 10 GB log file, this uses ~1 line of memory at a time
# Instead of loading all 10 GB:
# count = count_errors_in_log("/var/log/huge_application.log")
# Alternative using generator expression:
# error_count = sum(1 for line in read_large_file(path) if "ERROR" in line)
import itertools
def primes():
"""Generate prime numbers indefinitely using a sieve approach."""
yield 2
composites = {} # Maps composite number -> list of primes that divide it
candidate = 3
while True:
if candidate not in composites:
# candidate is prime
yield candidate
composites[candidate * candidate] = [candidate]
else:
# candidate is composite; advance its prime factors
for prime in composites[candidate]:
composites.setdefault(candidate + prime, []).append(prime)
del composites[candidate]
candidate += 2 # Skip even numbers
# Get the first 20 prime numbers
first_20_primes = list(itertools.islice(primes(), 20))
print(first_20_primes)
# [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71]
# Sum of the first 1000 primes
print(sum(itertools.islice(primes(), 1000))) # 3682913
import csv
from io import StringIO
# Simulated CSV data
csv_data = """name,department,salary
Alice,Engineering,120000
Bob,Marketing,85000
Charlie,Engineering,135000
Diana,Marketing,90000
Eve,Engineering,110000
Frank,HR,75000
Grace,Engineering,140000
"""
def read_csv_rows(csv_text):
"""Stage 1: Parse CSV into dictionaries."""
reader = csv.DictReader(StringIO(csv_text))
for row in reader:
yield row
def filter_department(rows, dept):
"""Stage 2: Keep only rows matching the department."""
for row in rows:
if row["department"] == dept:
yield row
def transform_salary(rows):
"""Stage 3: Convert salary to int and add a bonus field."""
for row in rows:
salary = int(row["salary"])
row["salary"] = salary
row["bonus"] = salary * 0.1 # 10% bonus
yield row
def aggregate(rows):
"""Stage 4: Compute total salary and average."""
total = 0
count = 0
for row in rows:
total += row["salary"]
count += 1
yield row # Pass through for downstream consumers
# After iteration, print the summary
if count > 0:
print(f"\nTotal salary: ${total:,}")
print(f"Average salary: ${total/count:,.0f}")
print(f"Headcount: {count}")
# Build and run the pipeline
pipeline = aggregate(
transform_salary(
filter_department(
read_csv_rows(csv_data),
"Engineering"
)
)
)
for emp in pipeline:
print(f"{emp['name']}: ${emp['salary']:,} (bonus: ${emp['bonus']:,.0f})")
# Alice: $120,000 (bonus: $12,000)
# Charlie: $135,000 (bonus: $13,500)
# Eve: $110,000 (bonus: $11,000)
# Grace: $140,000 (bonus: $14,000)
#
# Total salary: $505,000
# Average salary: $126,250
# Headcount: 4
import time
def paginated_api_fetch(base_url, page_size=100):
"""
Generator that fetches paginated API results.
Yields individual items across all pages.
"""
page = 1
while True:
# Simulate API call (replace with real requests.get())
url = f"{base_url}?page={page}&size={page_size}"
print(f"Fetching: {url}")
# Simulated response
if page <= 3:
results = [{"id": i, "name": f"Item {i}"}
for i in range((page-1)*page_size + 1, page*page_size + 1)]
else:
results = [] # No more data
if not results:
break # No more pages
yield from results # Yield each item individually
page += 1
time.sleep(0.1) # Rate limiting
# The consumer does not need to know about pagination
for item in paginated_api_fetch("https://api.example.com/items", page_size=2):
print(f" Processing: {item}")
if item["id"] >= 5:
break # Stop early — remaining pages are never fetched!
# Output:
# Fetching: https://api.example.com/items?page=1&size=2
# Processing: {'id': 1, 'name': 'Item 1'}
# Processing: {'id': 2, 'name': 'Item 2'}
# Fetching: https://api.example.com/items?page=2&size=2
# Processing: {'id': 3, 'name': 'Item 3'}
# Processing: {'id': 4, 'name': 'Item 4'}
# Fetching: https://api.example.com/items?page=3&size=2
# Processing: {'id': 5, 'name': 'Item 5'}
Notice the key advantage: when the consumer breaks out of the loop, the generator stops fetching. Pages 4, 5, 6, etc. are never requested. Lazy evaluation means you only do the work that is actually needed.
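The same early-exit behavior can be sketched with itertools.takewhile over a hypothetical producer that logs each computation it performs (note that takewhile must read exactly one item past the cutoff to test its predicate):

```python
import itertools

def priced_items():
    """Hypothetical producer that logs each computation it performs."""
    n = 0
    while True:
        n += 1
        print(f"  computing item {n}")
        yield n

# takewhile pulls items only until the predicate fails. It reads one item
# past the cutoff (to test the predicate); nothing after that is computed.
wanted = list(itertools.takewhile(lambda x: x <= 3, priced_items()))
print(wanted)  # [1, 2, 3]  (items 1-4 were computed, 5 onward never were)
```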
Let us put hard numbers on the difference between lists and generators.
import sys
import time
import tracemalloc
def benchmark_list_vs_generator(n):
"""Compare list vs generator for summing n squared numbers."""
# List approach
tracemalloc.start()
start = time.perf_counter()
result_list = sum([x ** 2 for x in range(n)])
list_time = time.perf_counter() - start
list_peak = tracemalloc.get_traced_memory()[1]
tracemalloc.stop()
# Generator approach
tracemalloc.start()
start = time.perf_counter()
result_gen = sum(x ** 2 for x in range(n))
gen_time = time.perf_counter() - start
gen_peak = tracemalloc.get_traced_memory()[1]
tracemalloc.stop()
assert result_list == result_gen
print(f"n = {n:>12,}")
print(f" List: {list_time:.4f}s | Peak memory: {list_peak:>12,} bytes")
print(f" Generator: {gen_time:.4f}s | Peak memory: {gen_peak:>12,} bytes")
print(f" Memory saved: {(1 - gen_peak/list_peak)*100:.1f}%")
print()
benchmark_list_vs_generator(100_000)
benchmark_list_vs_generator(1_000_000)
benchmark_list_vs_generator(10_000_000)
# Typical output:
# n = 100,000
# List: 0.0234s | Peak memory: 824,464 bytes
# Generator: 0.0228s | Peak memory: 464 bytes
# Memory saved: 99.9%
#
# n = 1,000,000
# List: 0.2451s | Peak memory: 8,448,688 bytes
# Generator: 0.2389s | Peak memory: 464 bytes
# Memory saved: 100.0%
#
# n = 10,000,000
# List: 2.5102s | Peak memory: 80,000,048 bytes
# Generator: 2.4231s | Peak memory: 464 bytes
# Memory saved: 100.0%
Key takeaways from the benchmark:

- Memory: the list's peak memory grows linearly with n, while the generator's stays constant.
- Speed: when feeding an aggregate function like sum(), generators are slightly faster because they avoid the overhead of allocating and populating a list.

Generators have some surprising behaviors that trip up even experienced developers. Here are the ones you must know.
# Generators can only be consumed ONCE
gen = (x ** 2 for x in range(5))
print(list(gen)) # [0, 1, 4, 9, 16]
print(list(gen)) # [] — exhausted! No error, just empty.
# This is a common bug:
def get_numbers():
yield 1
yield 2
yield 3
nums = get_numbers()
print(sum(nums)) # 6
print(sum(nums)) # 0 — the generator is already exhausted!
# Fix: recreate the generator each time, or use a list if you need multiple passes
nums_list = list(get_numbers())
print(sum(nums_list)) # 6
print(sum(nums_list)) # 6
gen = (x for x in range(10))
# These all fail:
# gen[0] # TypeError: 'generator' object is not subscriptable
# gen[2:5] # TypeError: 'generator' object is not subscriptable
# len(gen) # TypeError: object of type 'generator' has no len()
# Workarounds:
import itertools
# Get the nth element (consumes n elements)
def nth(iterable, n, default=None):
return next(itertools.islice(iterable, n, None), default)
gen = (x ** 2 for x in range(10))
print(nth(gen, 3)) # 9 (the element at index 3)
# Slice an iterator
gen = (x ** 2 for x in range(10))
print(list(itertools.islice(gen, 2, 5))) # [4, 9, 16]
# A subtle bug: storing a generator and trying to use it in multiple places
def get_even_numbers(n):
return (x for x in range(n) if x % 2 == 0)
evens = get_even_numbers(20)
# First use works fine
for x in evens:
if x > 6:
break
print(f"Stopped at {x}") # Stopped at 8
# Second use — CONTINUES from where we left off, not from the beginning!
remaining = list(evens)
print(remaining) # [10, 12, 14, 16, 18]
# If you expected [0, 2, 4, 6, 8, 10, 12, 14, 16, 18], you have a bug.
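If you genuinely need two passes over a single generator, itertools.tee can split it into independent iterators, at the cost of buffering items that one branch has consumed and the other has not. A minimal sketch:

```python
import itertools

# tee splits one generator into independent iterators. Items consumed by one
# branch but not the other are buffered, so memory use can grow if the
# branches advance at very different rates.
gen = (x * 2 for x in range(5))
first, second = itertools.tee(gen)
print(list(first))   # [0, 2, 4, 6, 8]
print(list(second))  # [0, 2, 4, 6, 8]
# Note: after tee(), do not use the original `gen` directly anymore.
```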
# Late binding: closures (like the bodies of generator expressions) look up
# free variables when they run, not when they are defined
funcs = []
for i in range(5):
funcs.append(lambda: i) # All lambdas capture the SAME variable i
print([f() for f in funcs]) # [4, 4, 4, 4, 4] — not [0, 1, 2, 3, 4]!
# Fix: use a default argument to capture the current value
funcs = []
for i in range(5):
funcs.append(lambda i=i: i) # Each lambda gets its own copy
print([f() for f in funcs]) # [0, 1, 2, 3, 4]
Here are the guidelines I follow when deciding how to use generators in production code.
# GOOD: generator for processing a large file
def process_log_file(path):
with open(path) as f:
for line in f:
if "ERROR" in line:
yield parse_error(line)
# BAD: loading entire file into memory
def process_log_file_bad(path):
with open(path) as f:
lines = f.readlines() # Entire file in memory!
return [parse_error(line) for line in lines if "ERROR" in line]
# GOOD: generator expression passed directly to sum()
total = sum(order.total for order in orders if order.status == "completed")
# UNNECESSARY: creating an intermediate list
total = sum([order.total for order in orders if order.status == "completed"])
import itertools
# GOOD: use itertools.chain instead of nested loops
all_items = itertools.chain(list_a, list_b, list_c)
# GOOD: use itertools.groupby for grouping
for key, group in itertools.groupby(sorted_data, key=extract_key):
process_group(key, list(group))
# GOOD: use itertools.islice for taking the first N items from an iterator
first_ten = list(itertools.islice(infinite_generator(), 10))
# If you need to iterate multiple times, use a class with __iter__
class DataSource:
def __init__(self, path):
self.path = path
def __iter__(self):
with open(self.path) as f:
for line in f:
yield line.strip()
# Each for loop gets a fresh iterator
source = DataSource("data.txt")
count = sum(1 for _ in source) # First pass: count lines
total = sum(len(line) for line in source) # Second pass: total chars
def fetch_records(query):
"""
Yield records matching the query from the database.
WARNING: This generator can only be consumed once.
If you need multiple passes, materialize with list().
"""
cursor = db.execute(query)
for row in cursor:
yield transform(row)
- Iterators implement __iter__() and __next__(). They produce values one at a time and raise StopIteration when done. Every for loop in Python uses this protocol.
- Generators are functions with yield. They are dramatically simpler to write than class-based iterators. The function's state is automatically saved and restored between next() calls.
- Generator expressions use the syntax (expr for x in iterable if condition). They use constant memory regardless of the source size.
- The itertools module provides composable iterator building blocks: reach for chain, islice, groupby, combinations, permutations, and product instead of writing your own.