If you are preparing for a Python developer interview, whether for a junior, mid-level, or senior role, this guide is designed to help you sharpen your understanding of the language from the ground up. Python interviews tend to go beyond syntax trivia. Interviewers want to see that you understand why things work the way they do, not just how to use them. The questions below are organized by difficulty level and cover the concepts that come up most frequently in real-world technical interviews. Each question includes a thorough explanation, a practical code example, and insight into what the interviewer is really testing.
These questions test foundational Python knowledge. You should be able to answer these confidently for any Python role.
Python is a high-level, interpreted, dynamically-typed programming language created by Guido van Rossum. It emphasizes code readability through its clean syntax and significant whitespace. Python is widely used because of its gentle learning curve, massive standard library, and strong ecosystem for web development, data science, automation, and machine learning.
Why interviewers ask this: They want to see that you understand Python’s design philosophy and can articulate its strengths beyond just saying “it’s easy.”
PEP 8 is Python’s official style guide. It defines conventions for naming, indentation, line length, imports, and whitespace. Following PEP 8 matters because Python is a language that values readability, and consistent formatting across a codebase reduces cognitive load for every developer who reads it.
# PEP 8 compliant
def calculate_total_price(unit_price, quantity, tax_rate=0.08):
    """Calculate the total price including tax."""
    subtotal = unit_price * quantity
    return subtotal * (1 + tax_rate)

# Not PEP 8 compliant
def calculateTotalPrice(unitPrice,quantity,taxRate=0.08):
    subtotal=unitPrice*quantity
    return subtotal*(1+taxRate)
Why interviewers ask this: They want to know if you write professional, team-friendly code or if you treat formatting as an afterthought.
Lists are mutable sequences (you can add, remove, or change elements), while tuples are immutable (once created, they cannot be modified). Lists use square brackets and tuples use parentheses. Because tuples are immutable, they are hashable and can be used as dictionary keys. Tuples also have a slight performance advantage due to their fixed size.
my_list = [1, 2, 3]
my_tuple = (1, 2, 3)
my_list[0] = 10 # Valid - lists are mutable
print(my_list) # [10, 2, 3]
# my_tuple[0] = 10 # TypeError: 'tuple' object does not support item assignment
# Tuples can be dictionary keys; lists cannot
coordinates = {(0, 0): "origin", (1, 2): "point_a"}
print(coordinates[(0, 0)]) # "origin"
Why interviewers ask this: This tests whether you understand mutability, which is fundamental to avoiding bugs in Python.
Single-line comments use the # symbol. Multi-line comments are typically done with consecutive # lines or with triple-quoted strings (docstrings). Note that triple-quoted strings used outside of a function or class definition are not true comments; they are string literals that Python evaluates and discards.
# This is a single-line comment
# This is a multi-line comment
# spread across multiple lines
# using the hash symbol
def calculate_area(radius):
    """
    Calculate the area of a circle.

    This is a docstring, not a comment. It becomes
    part of the function's __doc__ attribute.
    """
    import math
    return math.pi * radius ** 2
print(calculate_area.__doc__)
Why interviewers ask this: They are checking whether you understand the difference between comments and docstrings, and whether you use documentation properly.
== compares values (equality). is compares identity (whether two references point to the exact same object in memory). This distinction is critical when working with mutable objects.
a = [1, 2, 3]
b = [1, 2, 3]
c = a

print(a == b)  # True - same values
print(a is b)  # False - different objects in memory
print(a is c)  # True - c references the same object as a

# CPython interns small integers, so this can be surprising:
x = 256
y = 256
print(x is y)  # True - CPython caches integers -5 to 256

x = 257
y = 257
print(x is y)  # False - outside the cached range (in most contexts)
Why interviewers ask this: Confusing is with == is a common source of subtle bugs. Interviewers want to see that you understand object identity vs. equality.
Lambda functions are small, anonymous functions defined with the lambda keyword. They can take any number of arguments but contain only a single expression. They are most useful as short callbacks or key functions passed to higher-order functions like sorted(), map(), or filter().
# Basic lambda
add = lambda x, y: x + y
print(add(3, 5)) # 8
# Practical use: sorting a list of tuples by the second element
students = [("Alice", 88), ("Bob", 95), ("Charlie", 72)]
sorted_students = sorted(students, key=lambda s: s[1], reverse=True)
print(sorted_students)
# [('Bob', 95), ('Alice', 88), ('Charlie', 72)]
# Using with filter
numbers = [1, 2, 3, 4, 5, 6, 7, 8]
evens = list(filter(lambda n: n % 2 == 0, numbers))
print(evens) # [2, 4, 6, 8]
Why interviewers ask this: They want to see if you know when lambdas are appropriate and when a regular function would be clearer.
Python uses try, except, else, and finally blocks for exception handling. The try block contains code that might raise an exception. The except block catches specific exceptions. The else block runs only if no exception was raised. The finally block always runs, regardless of whether an exception occurred.
def divide(a, b):
    try:
        result = a / b
    except ZeroDivisionError:
        print("Cannot divide by zero.")
        return None
    except TypeError as e:
        print(f"Invalid types: {e}")
        return None
    else:
        print(f"Division successful: {result}")
        return result
    finally:
        print("Operation complete.")

divide(10, 2)
# Division successful: 5.0
# Operation complete.

divide(10, 0)
# Cannot divide by zero.
# Operation complete.

# Raising custom exceptions
class InsufficientFundsError(Exception):
    def __init__(self, balance, amount):
        self.balance = balance
        self.amount = amount
        super().__init__(f"Cannot withdraw ${amount}. Balance: ${balance}")

def withdraw(balance, amount):
    if amount > balance:
        raise InsufficientFundsError(balance, amount)
    return balance - amount
Why interviewers ask this: They are testing whether you write defensive code and understand the full exception handling flow, including the often-overlooked else and finally blocks.
The pass statement is a no-op placeholder. It does nothing but satisfies Python’s requirement for a statement in a block. It is commonly used when defining empty classes, functions, or conditional branches that you plan to implement later.
# Placeholder for a function you haven't implemented yet
def process_payment(order):
    pass  # TODO: implement payment processing

# Empty class used as a custom exception
class ValidationError(Exception):
    pass

# Placeholder in conditional logic
status = "pending"
if status == "approved":
    pass  # Handle approved case later
elif status == "rejected":
    print("Order rejected")
Why interviewers ask this: This is a basic syntax question. They want to confirm you understand Python’s block structure.
These questions dig into Python’s internals, patterns, and standard library. Expect these in mid-level and senior interviews.
Both allow you to create sequences from iterables using a concise syntax, but they differ in memory behavior. A list comprehension builds the entire list in memory at once. A generator expression produces values lazily, one at a time, which is far more memory-efficient for large datasets.
import sys
# List comprehension - builds entire list in memory
squares_list = [x ** 2 for x in range(1_000_000)]
print(sys.getsizeof(squares_list)) # ~8 MB
# Generator expression - produces values on demand
squares_gen = (x ** 2 for x in range(1_000_000))
print(sys.getsizeof(squares_gen)) # ~200 bytes (just the generator object)
# Both support filtering
even_squares = [x ** 2 for x in range(20) if x % 2 == 0]
print(even_squares) # [0, 4, 16, 36, 64, 100, 144, 196, 256, 324]
# Dictionary and set comprehensions
names = ["Alice", "Bob", "Charlie", "Alice", "Bob"]
name_lengths = {name: len(name) for name in names}
unique_names = {name for name in names}
print(name_lengths) # {'Alice': 5, 'Bob': 3, 'Charlie': 7}
print(unique_names) # {'Alice', 'Bob', 'Charlie'}
Why interviewers ask this: They want to see if you think about memory efficiency and understand lazy evaluation, which is critical for processing large datasets.
*args collects positional arguments into a tuple. **kwargs collects keyword arguments into a dictionary. Together, they allow functions to accept any number of arguments, which is essential for writing flexible APIs, decorators, and wrapper functions.
def log_call(func_name, *args, **kwargs):
    print(f"Calling {func_name}")
    print(f"  Positional args: {args}")
    print(f"  Keyword args: {kwargs}")

log_call("create_user", "Alice", 30, role="admin", active=True)
# Calling create_user
#   Positional args: ('Alice', 30)
#   Keyword args: {'role': 'admin', 'active': True}

# Common pattern: forwarding arguments to another function
def make_request(method, url, **kwargs):
    timeout = kwargs.pop("timeout", 30)
    retries = kwargs.pop("retries", 3)
    print(f"{method} {url} (timeout={timeout}, retries={retries})")
    print(f"Additional options: {kwargs}")

make_request("GET", "/api/users", timeout=10, verify=False)
# GET /api/users (timeout=10, retries=3)
# Additional options: {'verify': False}
Why interviewers ask this: This is fundamental to writing Pythonic code. If you cannot explain *args and **kwargs, it signals a gap in your understanding of function signatures.
A decorator is a function that takes another function as input and returns a new function that extends or modifies its behavior. Decorators are Python’s implementation of the Decorator pattern and are used extensively in frameworks like Flask, Django, and pytest. The @decorator syntax is syntactic sugar for func = decorator(func).
import functools
import time

# A well-written decorator preserves the original function's metadata
def timing(func):
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        elapsed = time.perf_counter() - start
        print(f"{func.__name__} took {elapsed:.4f}s")
        return result
    return wrapper

@timing
def slow_function():
    """This function simulates slow work."""
    time.sleep(0.5)
    return "done"

result = slow_function()
# slow_function took 0.5012s

# The @functools.wraps decorator preserves metadata
print(slow_function.__name__)  # "slow_function" (not "wrapper")
print(slow_function.__doc__)   # "This function simulates slow work."

# Decorator with arguments
def retry(max_attempts=3):
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except Exception as e:
                    print(f"Attempt {attempt} failed: {e}")
                    if attempt == max_attempts:
                        raise
        return wrapper
    return decorator

@retry(max_attempts=3)
def unreliable_api_call():
    import random
    if random.random() < 0.7:
        raise ConnectionError("Server unavailable")
    return {"status": "ok"}
Why interviewers ask this: Decorators are one of Python's most powerful patterns. Interviewers want to see that you understand closures, higher-order functions, and functools.wraps.
An iterator is any object that implements the __iter__ and __next__ methods. A generator is a specific type of iterator created using a function with yield statements. Generators are simpler to write than manual iterators and automatically maintain their state between calls.
# Manual iterator (verbose)
class Countdown:
    def __init__(self, start):
        self.current = start

    def __iter__(self):
        return self

    def __next__(self):
        if self.current <= 0:
            raise StopIteration
        value = self.current
        self.current -= 1
        return value

# Generator (clean and concise)
def countdown(start):
    while start > 0:
        yield start
        start -= 1

# Both produce the same result
for n in Countdown(5):
    print(n, end=" ")  # 5 4 3 2 1
print()
for n in countdown(5):
    print(n, end=" ")  # 5 4 3 2 1

# Generators are lazy - great for large or infinite sequences
def fibonacci():
    a, b = 0, 1
    while True:
        yield a
        a, b = b, a + b

# Get the first 10 Fibonacci numbers
import itertools
first_10 = list(itertools.islice(fibonacci(), 10))
print(first_10)  # [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]
Why interviewers ask this: Generators reveal your understanding of lazy evaluation, memory management, and the iterator protocol. Senior developers use them heavily for data pipelines.
Context managers handle resource setup and teardown automatically. The with statement guarantees that cleanup code runs even if an exception occurs. You can create context managers using the __enter__/__exit__ protocol or the contextlib.contextmanager decorator.
from contextlib import contextmanager

# Using the with statement for file handling
with open("example.txt", "w") as f:
    f.write("Hello, World!")
# File is automatically closed here, even if an exception occurred

# Custom context manager using a class
class DatabaseConnection:
    def __init__(self, connection_string):
        self.connection_string = connection_string
        self.connection = None

    def __enter__(self):
        print(f"Connecting to {self.connection_string}")
        self.connection = {"status": "connected"}  # Simulated
        return self.connection

    def __exit__(self, exc_type, exc_val, exc_tb):
        print("Closing database connection")
        self.connection = None
        return False  # Do not suppress exceptions

with DatabaseConnection("postgresql://localhost/mydb") as conn:
    print(f"Connection status: {conn['status']}")

# Custom context manager using a generator (simpler)
@contextmanager
def timer(label):
    import time
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        print(f"{label}: {elapsed:.4f}s")

with timer("Data processing"):
    total = sum(range(1_000_000))
Why interviewers ask this: Context managers are essential for resource management. Interviewers want to know if you handle connections, locks, and files safely.
__str__ returns a human-readable string intended for end users. __repr__ returns an unambiguous string intended for developers, ideally one that could recreate the object. When you call print(), Python uses __str__. When you inspect an object in the REPL or in a debugger, Python uses __repr__. If __str__ is not defined, Python falls back to __repr__.
class Money:
    def __init__(self, amount, currency="USD"):
        self.amount = amount
        self.currency = currency

    def __str__(self):
        return f"${self.amount:.2f} {self.currency}"

    def __repr__(self):
        return f"Money({self.amount!r}, {self.currency!r})"

price = Money(19.99)
print(str(price))   # $19.99 USD (for end users)
print(repr(price))  # Money(19.99, 'USD') (for developers)

# In a list, Python uses __repr__
prices = [Money(9.99), Money(24.99, "EUR")]
print(prices)  # [Money(9.99, 'USD'), Money(24.99, 'EUR')]
Why interviewers ask this: This checks whether you write classes that are easy to debug and log. Good __repr__ implementations save hours of debugging time.
A shallow copy creates a new object but inserts references to the same nested objects. A deep copy creates a new object and recursively copies all nested objects. This distinction matters when you have mutable objects nested inside other mutable objects.
import copy

# Shallow copy
original = [[1, 2, 3], [4, 5, 6]]
shallow = copy.copy(original)
shallow[0][0] = 999
print(original[0][0])  # 999 - the nested list is shared!

# Deep copy
original = [[1, 2, 3], [4, 5, 6]]
deep = copy.deepcopy(original)
deep[0][0] = 999
print(original[0][0])  # 1 - completely independent copy

# Common shallow copy shortcuts
my_list = [1, 2, 3]
copy_1 = my_list[:]      # Slice
copy_2 = list(my_list)   # Constructor
copy_3 = my_list.copy()  # .copy() method
# All three are shallow copies
# For flat lists (no nested mutables), shallow copy is fine
Why interviewers ask this: Confusing shallow and deep copies causes some of the most frustrating bugs in Python. This question tests whether you understand reference semantics.
A class is a blueprint that defines attributes and methods. An object (or instance) is a specific realization of that blueprint with actual data. In Python, classes are themselves objects (everything in Python is an object), which is why you can pass classes around as arguments and store them in variables.
class BankAccount:
    """A class is the blueprint."""

    interest_rate = 0.02  # Class attribute - shared by all instances

    def __init__(self, owner, balance=0):
        self.owner = owner  # Instance attribute - unique to each object
        self.balance = balance

    def deposit(self, amount):
        self.balance += amount
        return self.balance

    def __repr__(self):
        return f"BankAccount({self.owner!r}, balance={self.balance})"

# Objects are instances of the class
account_1 = BankAccount("Alice", 1000)
account_2 = BankAccount("Bob", 500)
account_1.deposit(250)
print(account_1)  # BankAccount('Alice', balance=1250)
print(account_2)  # BankAccount('Bob', balance=500)

# Both share the class attribute
print(account_1.interest_rate)  # 0.02
print(account_2.interest_rate)  # 0.02
Why interviewers ask this: This is foundational OOP. They want to confirm you understand instantiation and the relationship between class-level and instance-level attributes.
Python supports single inheritance, multiple inheritance, and multilevel inheritance. The super() function delegates method calls to a parent class in the Method Resolution Order (MRO). Python uses the C3 linearization algorithm to determine the MRO, which prevents the diamond problem ambiguity found in some other languages.
# Single inheritance
class Animal:
    def __init__(self, name):
        self.name = name

    def speak(self):
        raise NotImplementedError("Subclasses must implement speak()")

class Dog(Animal):
    def speak(self):
        return f"{self.name} says Woof!"

class Cat(Animal):
    def speak(self):
        return f"{self.name} says Meow!"

# Multiple inheritance
class Pet:
    def __init__(self, owner):
        self.owner = owner

class PetDog(Dog, Pet):
    def __init__(self, name, owner):
        Dog.__init__(self, name)
        Pet.__init__(self, owner)

    def info(self):
        return f"{self.name} belongs to {self.owner}"

buddy = PetDog("Buddy", "Alice")
print(buddy.speak())  # Buddy says Woof!
print(buddy.info())   # Buddy belongs to Alice

# Check the Method Resolution Order
print(PetDog.__mro__)
# (PetDog, Dog, Animal, Pet, object)
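Calling each parent's __init__ explicitly works, but cooperative inheritance with super() is usually preferred because a single call walks the entire MRO automatically. A minimal sketch of the cooperative pattern (class names are illustrative, and each __init__ forwards unused keyword arguments up the chain):

import inspect

class Named:
    def __init__(self, name, **kwargs):
        self.name = name
        super().__init__(**kwargs)  # Pass remaining kwargs to the next class in the MRO

class Owned:
    def __init__(self, owner, **kwargs):
        self.owner = owner
        super().__init__(**kwargs)

class NamedPet(Named, Owned):
    def __init__(self, name, owner):
        # One super() call initializes the whole chain:
        # NamedPet -> Named -> Owned -> object
        super().__init__(name=name, owner=owner)

rex = NamedPet("Rex", "Alice")
print(rex.name, rex.owner)  # Rex Alice
print([cls.__name__ for cls in NamedPet.__mro__])
# ['NamedPet', 'Named', 'Owned', 'object']

The **kwargs forwarding is what makes the pattern cooperative: each class consumes the arguments it knows about and passes the rest along, so the chain works regardless of where a class lands in the MRO.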
Why interviewers ask this: They want to verify you understand the MRO and can reason about method resolution in complex inheritance hierarchies.
Always use the with statement for file operations to guarantee proper resource cleanup. Python supports reading, writing, and appending in both text and binary modes.
# Writing to a file
with open("output.txt", "w") as f:
    f.write("Line 1\n")
    f.write("Line 2\n")

# Reading the entire file
with open("output.txt", "r") as f:
    content = f.read()
print(content)

# Reading line by line (memory efficient for large files)
with open("output.txt", "r") as f:
    for line in f:
        print(line.strip())

# Appending to a file
with open("output.txt", "a") as f:
    f.write("Line 3\n")

# Working with JSON
import json

data = {"name": "Alice", "scores": [95, 87, 92]}
with open("data.json", "w") as f:
    json.dump(data, f, indent=2)

with open("data.json", "r") as f:
    loaded = json.load(f)
print(loaded["name"])  # Alice
Why interviewers ask this: File handling is a daily task. They want to see that you use context managers and know the difference between read modes.
These questions test deep understanding of Python internals, concurrency, design patterns, and performance. They separate experienced developers from those who have only scratched the surface.
The GIL is a mutex in CPython that allows only one thread to execute Python bytecode at a time. It exists because CPython's memory management (reference counting) is not thread-safe. The GIL means that CPU-bound multi-threaded Python programs do not achieve true parallelism. However, the GIL is released during I/O operations, so multi-threaded programs that are I/O-bound (network calls, file reads) can still benefit from threading.
import threading
import time

# CPU-bound task - GIL prevents true parallelism with threads
def cpu_bound(n):
    total = 0
    for i in range(n):
        total += i * i
    return total

# Single-threaded
start = time.perf_counter()
cpu_bound(10_000_000)
cpu_bound(10_000_000)
single_time = time.perf_counter() - start
print(f"Single-threaded: {single_time:.2f}s")

# Multi-threaded (NOT faster due to the GIL)
start = time.perf_counter()
t1 = threading.Thread(target=cpu_bound, args=(10_000_000,))
t2 = threading.Thread(target=cpu_bound, args=(10_000_000,))
t1.start()
t2.start()
t1.join()
t2.join()
threaded_time = time.perf_counter() - start
print(f"Multi-threaded: {threaded_time:.2f}s")  # Similar or slower!
Why interviewers ask this: The GIL is one of the most important things to understand about CPython's concurrency model. Senior developers must know when to use threads vs. processes.
Use threading for I/O-bound tasks (waiting for network responses, reading files, database queries) because the GIL is released during I/O. Use multiprocessing for CPU-bound tasks (data processing, computation) because each process has its own Python interpreter and GIL, enabling true parallelism across CPU cores.
import threading
import multiprocessing
import time

import requests

# I/O-bound: threading is effective
def fetch_url(url):
    response = requests.get(url, timeout=5)
    return len(response.content)

urls = ["https://example.com"] * 5

# Threaded I/O (fast - threads release GIL during network I/O)
start = time.perf_counter()
threads = [threading.Thread(target=fetch_url, args=(url,)) for url in urls]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(f"Threaded I/O: {time.perf_counter() - start:.2f}s")

# CPU-bound: multiprocessing achieves true parallelism
def heavy_computation(n):
    return sum(i * i for i in range(n))

# Using multiprocessing Pool
if __name__ == "__main__":
    with multiprocessing.Pool(processes=4) as pool:
        results = pool.map(heavy_computation, [5_000_000] * 4)
    print(f"Results: {[r // 1_000_000 for r in results]}")
Why interviewers ask this: This tests whether you can design concurrent systems appropriately. Choosing the wrong concurrency model leads to performance problems or bugs.
Python uses two mechanisms for memory management. The primary mechanism is reference counting: every object has a count of references pointing to it, and when that count reaches zero, the memory is immediately freed. The secondary mechanism is a cyclic garbage collector that detects and cleans up reference cycles (objects that reference each other but are no longer reachable from the program).
import sys
import gc

# Reference counting
a = [1, 2, 3]
print(sys.getrefcount(a))  # 2 (one for 'a', one for getrefcount's argument)
b = a
print(sys.getrefcount(a))  # 3
del b
print(sys.getrefcount(a))  # 2

# Circular references require the garbage collector
class Node:
    def __init__(self, value):
        self.value = value
        self.next = None

# Create a circular reference
node1 = Node(1)
node2 = Node(2)
node1.next = node2
node2.next = node1  # Circular!

# Even after deleting references, refcount won't reach 0
del node1, node2
# The cyclic GC will eventually clean this up

# You can manually trigger garbage collection
collected = gc.collect()
print(f"Garbage collector freed {collected} objects")

# Check GC thresholds
print(gc.get_threshold())  # (700, 10, 10) - default thresholds
Why interviewers ask this: Senior developers need to understand memory behavior to write scalable applications and diagnose memory leaks.
By default, Python objects store their attributes in a __dict__ dictionary, which is flexible but memory-intensive. Defining __slots__ tells Python to use a fixed-size internal structure instead. This saves significant memory when creating millions of instances and provides slightly faster attribute access. The tradeoff is that you cannot add arbitrary attributes to instances.
import sys

class PointDict:
    def __init__(self, x, y):
        self.x = x
        self.y = y

class PointSlots:
    __slots__ = ("x", "y")

    def __init__(self, x, y):
        self.x = x
        self.y = y

# Memory comparison
p1 = PointDict(1, 2)
p2 = PointSlots(1, 2)
print(sys.getsizeof(p1) + sys.getsizeof(p1.__dict__))  # ~200 bytes
print(sys.getsizeof(p2))  # ~56 bytes

# __slots__ prevents adding arbitrary attributes
# p2.z = 3  # AttributeError: 'PointSlots' object has no attribute 'z'
Why interviewers ask this: This tests your understanding of Python's object model and your ability to optimize memory usage for performance-critical applications.
A metaclass is the class of a class. Just as a class defines how an instance behaves, a metaclass defines how a class behaves. The default metaclass is type. Metaclasses are an advanced feature used in frameworks (like Django's ORM and SQLAlchemy) to customize class creation, enforce constraints, or register classes automatically.
# Every class is an instance of 'type'
print(type(int))  # <class 'type'>
print(type(str))  # <class 'type'>

# Custom metaclass
class SingletonMeta(type):
    _instances = {}

    def __call__(cls, *args, **kwargs):
        if cls not in cls._instances:
            cls._instances[cls] = super().__call__(*args, **kwargs)
        return cls._instances[cls]

class Database(metaclass=SingletonMeta):
    def __init__(self):
        self.connection = "connected"
        print("Database initialized")

# Only one instance is ever created
db1 = Database()  # "Database initialized"
db2 = Database()  # No output - returns existing instance
print(db1 is db2)  # True
Why interviewers ask this: Metaclasses are rarely needed in everyday code, but understanding them demonstrates deep knowledge of Python's object model. Senior candidates should at least be able to explain what they are.
Descriptors are objects that define __get__, __set__, or __delete__ methods. They control what happens when an attribute is accessed, set, or deleted on another object. Properties, class methods, and static methods are all implemented using descriptors under the hood.
class Validated:
    """A descriptor that validates assigned values."""

    def __init__(self, min_value=None, max_value=None):
        self.min_value = min_value
        self.max_value = max_value

    def __set_name__(self, owner, name):
        self.name = name

    def __get__(self, obj, objtype=None):
        if obj is None:
            return self
        return getattr(obj, f"_{self.name}", None)

    def __set__(self, obj, value):
        if self.min_value is not None and value < self.min_value:
            raise ValueError(f"{self.name} must be >= {self.min_value}")
        if self.max_value is not None and value > self.max_value:
            raise ValueError(f"{self.name} must be <= {self.max_value}")
        setattr(obj, f"_{self.name}", value)

class Product:
    price = Validated(min_value=0)
    quantity = Validated(min_value=0, max_value=10000)

    def __init__(self, name, price, quantity):
        self.name = name
        self.price = price  # Triggers Validated.__set__
        self.quantity = quantity

item = Product("Widget", 9.99, 100)
print(item.price)  # 9.99
# item.price = -5  # ValueError: price must be >= 0
Why interviewers ask this: Descriptors are the mechanism behind @property, @classmethod, and @staticmethod. Understanding them shows you grasp how Python's attribute access works internally.
unittest is Python's built-in testing framework, modeled after Java's JUnit. It requires subclassing TestCase and using assertion methods like assertEqual(). pytest is a third-party framework that uses plain assert statements, has a powerful fixture system, and supports plugins for parallel execution, coverage, and more. Most modern Python projects prefer pytest.
# unittest style
import unittest

class TestCalculator(unittest.TestCase):
    def setUp(self):
        self.calc_data = [1, 2, 3, 4, 5]

    def test_sum(self):
        self.assertEqual(sum(self.calc_data), 15)

    def test_max(self):
        self.assertEqual(max(self.calc_data), 5)

# pytest style (much cleaner)
import pytest

@pytest.fixture
def calc_data():
    return [1, 2, 3, 4, 5]

def test_sum(calc_data):
    assert sum(calc_data) == 15

def test_max(calc_data):
    assert max(calc_data) == 5

# pytest parametrize - test multiple inputs cleanly
@pytest.mark.parametrize("input_val, expected", [
    (1, 1),
    (2, 4),
    (3, 9),
    (4, 16),
])
def test_square(input_val, expected):
    assert input_val ** 2 == expected
Why interviewers ask this: Testing is non-negotiable in professional software development. They want to see that you have hands-on experience writing tests, not just running them.
Virtual environments create isolated Python installations where you can install packages without affecting the system Python or other projects. This prevents dependency conflicts and ensures reproducible builds. Every professional Python project should use one.
# Creating and using a virtual environment
# $ python3 -m venv myproject_env
# $ source myproject_env/bin/activate   (Linux/Mac)
# $ myproject_env\Scripts\activate      (Windows)

# Inside the venv, pip installs packages locally
# $ pip install requests flask
# $ pip freeze > requirements.txt

# requirements.txt captures exact versions
# requests==2.31.0
# flask==3.0.0

# Another developer reproduces the environment
# $ python3 -m venv myproject_env
# $ source myproject_env/bin/activate
# $ pip install -r requirements.txt
Why interviewers ask this: If you cannot explain virtual environments, it signals that you have not worked on professional Python projects with dependency management.
Magic methods (or dunder methods, short for "double underscore") are special methods that Python calls implicitly. They let your objects work with built-in operators and functions. Some important ones beyond __init__, __str__, and __repr__:
class Vector:
    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __add__(self, other):
        return Vector(self.x + other.x, self.y + other.y)

    def __mul__(self, scalar):
        return Vector(self.x * scalar, self.y * scalar)

    def __abs__(self):
        return (self.x ** 2 + self.y ** 2) ** 0.5

    def __eq__(self, other):
        return self.x == other.x and self.y == other.y

    def __len__(self):
        return 2  # A 2D vector always has 2 components

    def __getitem__(self, index):
        if index == 0:
            return self.x
        elif index == 1:
            return self.y
        raise IndexError("Vector index out of range")

    def __repr__(self):
        return f"Vector({self.x}, {self.y})"

v1 = Vector(3, 4)
v2 = Vector(1, 2)
print(v1 + v2)   # Vector(4, 6) - uses __add__
print(v1 * 3)    # Vector(9, 12) - uses __mul__
print(abs(v1))   # 5.0 - uses __abs__
print(v1 == v2)  # False - uses __eq__
print(len(v1))   # 2 - uses __len__
print(v1[0])     # 3 - uses __getitem__
Why interviewers ask this: Dunder methods define the Pythonic way to build objects that integrate seamlessly with the language. Mastery of these separates Python developers from people who write Python-flavored Java.
The async/await syntax enables cooperative multitasking for I/O-bound operations using a single thread. Unlike threads, coroutines give up control explicitly at await points, which avoids race conditions. Use asyncio when you need to handle many concurrent I/O operations (web servers, API clients, chat systems).
import asyncio

async def fetch_data(url, delay):
    """Simulate an async HTTP request."""
    print(f"Fetching {url}...")
    await asyncio.sleep(delay)  # Non-blocking sleep
    print(f"Done fetching {url}")
    return {"url": url, "status": 200}

async def main():
    # Run multiple I/O operations concurrently
    tasks = [
        fetch_data("https://api.example.com/users", 2),
        fetch_data("https://api.example.com/orders", 1),
        fetch_data("https://api.example.com/products", 3),
    ]
    # asyncio.gather runs all tasks concurrently
    results = await asyncio.gather(*tasks)
    for result in results:
        print(f"  {result['url']} -> {result['status']}")

# Total time: ~3 seconds (not 6), because tasks run concurrently
asyncio.run(main())
Why interviewers ask this: Async programming is essential for high-performance Python applications. Interviewers want to see that you understand the event loop and know when async is the right tool.
Beyond these core questions, interviewers look for the habits of experienced Python developers:

- Use f-strings instead of string concatenation. These details signal experience.
- Use yield for lazy iteration and with for resource management.
- Know functools.wraps, including decorators that accept arguments.
- Be comfortable with unittest and pytest, and be able to write fixtures and parameterized tests.
- Details like __slots__ are what separate Python developers from Python users.

If you have ever tried to process a 10 GB log file by reading it entirely into memory, you already know why generators and iterators matter. They are Python’s answer to a fundamental problem: how do you work with sequences of data without materializing everything in memory at once?
An iterator is any object that produces values one at a time through a standard protocol. A generator is a special kind of iterator that you create with a function containing yield statements. Together, they let you build lazy pipelines that process data element by element, consuming only the memory needed for a single item at a time.
This is not just an academic concept. Every for loop in Python uses the iterator protocol under the hood. When you iterate over a file, a database cursor, or a range of numbers, you are already using iterators. Understanding how they work gives you the ability to write code that scales to datasets of any size without blowing up your memory footprint.
In this tutorial, we will cover the iterator protocol from the ground up, build custom iterators and generators, chain them into processing pipelines, and explore the itertools module. By the end, you will have a complete mental model for lazy evaluation in Python.
The iterator protocol is deceptively simple. It consists of two methods:
- __iter__() — Returns the iterator object itself. This is what makes an object usable in a for loop.
- __next__() — Returns the next value in the sequence. When there are no more values, it raises StopIteration.

That is the entire contract. Any object that implements both methods is an iterator. Any object that implements __iter__() (even if it returns a separate iterator object) is an iterable.
The distinction matters: a list is an iterable (it has __iter__() that returns a list iterator), but it is not itself an iterator (it does not have __next__()). The iterator is a separate object that tracks the current position.
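You can verify this distinction directly by checking for the protocol methods on a list and on the iterator it produces:

```python
nums = [1, 2, 3]
print(hasattr(nums, "__iter__"))  # True  — a list is iterable
print(hasattr(nums, "__next__"))  # False — but it is not an iterator

it = iter(nums)
print(hasattr(it, "__next__"))    # True  — the list_iterator tracks position
print(iter(it) is it)             # True  — iterators return themselves from __iter__
```

This is also why you can loop over the same list twice but not over the same iterator twice: each call to iter(nums) hands you a fresh position tracker.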
# The iterator protocol in action
numbers = [10, 20, 30]
# Get an iterator from the iterable
it = iter(numbers) # Calls numbers.__iter__()
print(next(it)) # 10 — Calls it.__next__()
print(next(it)) # 20
print(next(it)) # 30
# print(next(it)) # Raises StopIteration
# This is exactly what a for loop does internally:
# 1. Calls iter() on the iterable to get an iterator
# 2. Calls next() repeatedly until StopIteration
# 3. Catches StopIteration silently and exits the loop
for num in [10, 20, 30]:
    print(num)
# Equivalent to the manual iter()/next() calls above
Understanding StopIteration is key. It is not an error — it is the signal that tells Python the sequence is exhausted. The for loop catches it automatically, but if you call next() manually, you need to handle it yourself or pass a default value:
# Handling StopIteration manually
it = iter([1, 2])
print(next(it)) # 1
print(next(it)) # 2
print(next(it, "done")) # "done" — default value instead of StopIteration
# Without a default, you must catch the exception
it = iter([1])
try:
    print(next(it))  # 1
    print(next(it))  # StopIteration raised here
except StopIteration:
    print("Iterator exhausted")
To make your own class work with for loops, implement the iterator protocol. Here is a class that counts up from a start value to a stop value:
class CountUp:
    """An iterator that counts from start to stop (inclusive)."""
    def __init__(self, start, stop):
        self.start = start
        self.stop = stop
        self.current = start

    def __iter__(self):
        return self

    def __next__(self):
        if self.current > self.stop:
            raise StopIteration
        value = self.current
        self.current += 1
        return value

# Use it in a for loop
for num in CountUp(1, 5):
    print(num, end=" ")  # 1 2 3 4 5
# Use it with list() to materialize all values
print(list(CountUp(10, 15))) # [10, 11, 12, 13, 14, 15]
# Use it with sum(), max(), any(), etc.
print(sum(CountUp(1, 100))) # 5050
Python’s built-in types are all iterable. The iter() function extracts an iterator from any iterable, and next() advances it one step.
# Lists
list_iter = iter([1, 2, 3])
print(next(list_iter)) # 1
print(next(list_iter)) # 2
# Strings (iterate character by character)
str_iter = iter("Python")
print(next(str_iter)) # 'P'
print(next(str_iter)) # 'y'
# Dictionaries (iterate over keys by default)
data = {"name": "Alice", "age": 30, "role": "engineer"}
dict_iter = iter(data)
print(next(dict_iter)) # 'name'
print(next(dict_iter)) # 'age'
# Iterate over values or key-value pairs
for value in data.values():
    print(value, end=" ")  # Alice 30 engineer
for key, value in data.items():
    print(f"{key}={value}", end=" ")  # name=Alice age=30 role=engineer
# Sets (order is not guaranteed)
set_iter = iter({3, 1, 4, 1, 5})
print(next(set_iter)) # Could be any element
# Files are iterators (they yield lines)
with open("example.txt", "w") as f:
    f.write("line 1\nline 2\nline 3\n")
with open("example.txt") as f:
    for line in f:  # f is its own iterator
        print(line.strip())
# line 1
# line 2
# line 3
Notice that files are their own iterators — calling iter(f) returns f itself. This is why you can iterate over a file directly in a for loop. It also means you can only iterate through a file once without resetting the file pointer.
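A quick sketch makes this concrete (using a throwaway temporary file so the example is self-contained):

```python
import os
import tempfile

# Write a small throwaway file to demonstrate with
fd, path = tempfile.mkstemp(text=True)
with os.fdopen(fd, "w") as f:
    f.write("alpha\nbeta\n")

with open(path) as f:
    print(iter(f) is f)                  # True — a file object is its own iterator
    print([line.strip() for line in f])  # ['alpha', 'beta'] — first pass
    print([line.strip() for line in f])  # [] — already exhausted
    f.seek(0)                            # reset the file pointer
    print([line.strip() for line in f])  # ['alpha', 'beta'] — works again

os.remove(path)
```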
Let us build a few more custom iterators to solidify the pattern. Each one implements __iter__() and __next__().
class Fibonacci:
    """An iterator that produces Fibonacci numbers up to a maximum value."""
    def __init__(self, max_value):
        self.max_value = max_value
        self.a = 0
        self.b = 1

    def __iter__(self):
        return self

    def __next__(self):
        if self.a > self.max_value:
            raise StopIteration
        value = self.a
        self.a, self.b = self.b, self.a + self.b
        return value
print(list(Fibonacci(100)))
# [0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89]
# Works with any function that consumes an iterable
print(sum(Fibonacci(1000))) # 2583
class MyRange:
    """A simplified reimplementation of range()."""
    def __init__(self, start, stop=None, step=1):
        if stop is None:
            self.start = 0
            self.stop = start
        else:
            self.start = start
            self.stop = stop
        self.step = step

    def __iter__(self):
        # Return a new iterator each time — this allows reuse
        current = self.start
        while (self.step > 0 and current < self.stop) or \
              (self.step < 0 and current > self.stop):
            yield current  # Using yield here makes __iter__ a generator
            current += self.step

    def __len__(self):
        if self.step > 0:
            return max(0, (self.stop - self.start + self.step - 1) // self.step)
        # Negative step: count downward (mirrors the positive-step formula)
        return max(0, (self.start - self.stop - self.step - 1) // -self.step)

    def __repr__(self):
        return f"MyRange({self.start}, {self.stop}, {self.step})"
# Forward range
print(list(MyRange(5))) # [0, 1, 2, 3, 4]
print(list(MyRange(2, 8))) # [2, 3, 4, 5, 6, 7]
print(list(MyRange(0, 10, 3))) # [0, 3, 6, 9]
# Reverse range
print(list(MyRange(10, 0, -2))) # [10, 8, 6, 4, 2]
# Reusable (unlike a plain iterator)
r = MyRange(3)
print(list(r)) # [0, 1, 2]
print(list(r)) # [0, 1, 2] — works again because __iter__ creates a new generator
Notice the MyRange trick: instead of implementing __next__() directly, the __iter__() method uses yield, which makes it a generator function. Each call to __iter__() creates a fresh generator object, so the range is reusable. This is a common and powerful pattern.
Writing custom iterator classes is verbose. You need __init__, __iter__, __next__, manual state management, and StopIteration handling. Generators solve this by letting you write iterator logic as a simple function with yield statements.
When Python encounters a yield in a function body, that function becomes a generator function. Calling it does not execute the body — it returns a generator object that implements the iterator protocol automatically.
def count_up(start, stop):
    """A generator that counts from start to stop."""
    current = start
    while current <= stop:
        yield current  # Pause here, return current value
        current += 1   # Resume here on next() call
# Calling the function returns a generator object (does NOT run the body)
gen = count_up(1, 5)
print(type(gen)) # <class 'generator'>
# The generator implements the iterator protocol
print(next(gen)) # 1
print(next(gen)) # 2
print(next(gen)) # 3
# Use in a for loop
for num in count_up(1, 5):
    print(num, end=" ")  # 1 2 3 4 5
When you call next() on a generator, execution proceeds from the current position until it hits a yield statement. At that point, the yielded value is returned to the caller, and the generator's entire state (local variables, instruction pointer) is frozen. The next next() call resumes from exactly where it left off.
def demonstrate_state():
    print("Step 1: Starting")
    yield "first"
    print("Step 2: Resumed after first yield")
    yield "second"
    print("Step 3: Resumed after second yield")
    yield "third"
    print("Step 4: About to finish")
    # No more yields — StopIteration will be raised
gen = demonstrate_state()
print(next(gen))
# Step 1: Starting
# 'first'
print(next(gen))
# Step 2: Resumed after first yield
# 'second'
print(next(gen))
# Step 3: Resumed after second yield
# 'third'
# print(next(gen))
# Step 4: About to finish
# Raises StopIteration
You can inspect a generator's state using the inspect module:
import inspect
def simple_gen():
    yield 1
    yield 2
gen = simple_gen()
print(inspect.getgeneratorstate(gen)) # GEN_CREATED
next(gen)
print(inspect.getgeneratorstate(gen)) # GEN_SUSPENDED
next(gen)
print(inspect.getgeneratorstate(gen)) # GEN_SUSPENDED
try:
    next(gen)
except StopIteration:
    pass
print(inspect.getgeneratorstate(gen)) # GEN_CLOSED
A generator moves through four states: GEN_CREATED (just created, not started), GEN_RUNNING (currently executing), GEN_SUSPENDED (paused at a yield), and GEN_CLOSED (finished or closed).
Compare the class-based Fibonacci iterator from earlier with the generator version:
# Generator version — drastically simpler
def fibonacci(max_value=None):
    a, b = 0, 1
    while max_value is None or a <= max_value:
        yield a
        a, b = b, a + b
# Finite sequence
print(list(fibonacci(100)))
# [0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89]
# Infinite sequence (use itertools.islice to take a finite portion)
import itertools
print(list(itertools.islice(fibonacci(), 15)))
# [0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233, 377]
The generator version is 4 lines of logic compared to 12+ lines for the class. No __init__, no __iter__, no __next__, no StopIteration — Python handles all of it.
Generator expressions are to generators what list comprehensions are to lists. They use the same syntax as list comprehensions, but with parentheses instead of square brackets. The critical difference is that a generator expression produces values lazily — one at a time — while a list comprehension builds the entire list in memory.
import sys
# List comprehension — builds entire list in memory
squares_list = [x ** 2 for x in range(1_000_000)]
print(f"List size: {sys.getsizeof(squares_list):,} bytes") # ~8,448,728 bytes
# Generator expression — produces values on demand
squares_gen = (x ** 2 for x in range(1_000_000))
print(f"Generator size: {sys.getsizeof(squares_gen):,} bytes") # ~200 bytes
# Both support filtering
even_squares = (x ** 2 for x in range(20) if x % 2 == 0)
print(list(even_squares)) # [0, 4, 16, 36, 64, 100, 144, 196, 256, 324]
# Generator expressions can be passed directly to functions
# (no extra parentheses needed when it is the only argument)
total = sum(x ** 2 for x in range(1000))
print(total) # 332833500
max_val = max(len(word) for word in ["Python", "generators", "are", "powerful"])
print(max_val) # 10
has_negative = any(x < 0 for x in [1, -2, 3, 4])
print(has_negative) # True
import sys
def compare_memory(n):
    """Compare memory usage of list vs generator for n elements."""
    # List comprehension
    data_list = [x * 2 for x in range(n)]
    list_size = sys.getsizeof(data_list)
    # Generator expression
    data_gen = (x * 2 for x in range(n))
    gen_size = sys.getsizeof(data_gen)
    print(f"n={n:>12,} | List: {list_size:>12,} bytes | "
          f"Generator: {gen_size:>6,} bytes | Ratio: {list_size/gen_size:.0f}x")
compare_memory(100)
compare_memory(10_000)
compare_memory(1_000_000)
compare_memory(10_000_000)
# Output:
# n= 100 | List: 920 bytes | Generator: 200 bytes | Ratio: 5x
# n= 10,000 | List: 87,624 bytes | Generator: 200 bytes | Ratio: 438x
# n= 1,000,000 | List: 8,448,728 bytes | Generator: 200 bytes | Ratio: 42244x
# n= 10,000,000 | List: 80,000,056 bytes | Generator: 200 bytes | Ratio: 400000x
The generator's memory footprint is constant regardless of how many elements it produces. This is the fundamental advantage of lazy evaluation.
The yield from expression, introduced in Python 3.3, delegates iteration to a sub-generator or any iterable. It is cleaner than manually looping over a sub-iterable and yielding each element.
# Without yield from
def chain_manual(*iterables):
    for iterable in iterables:
        for item in iterable:
            yield item

# With yield from — cleaner
def chain_elegant(*iterables):
    for iterable in iterables:
        yield from iterable
# Both produce the same result
result = list(chain_elegant([1, 2, 3], "abc", (10, 20)))
print(result) # [1, 2, 3, 'a', 'b', 'c', 10, 20]
def flatten(nested):
    """Recursively flatten a nested structure."""
    for item in nested:
        if isinstance(item, (list, tuple)):
            yield from flatten(item)  # Delegate to recursive call
        else:
            yield item
data = [1, [2, 3], [4, [5, 6, [7, 8]]], 9]
print(list(flatten(data))) # [1, 2, 3, 4, 5, 6, 7, 8, 9]
# Works with mixed nesting
mixed = [1, (2, [3, 4]), [5, (6,)], 7]
print(list(flatten(mixed))) # [1, 2, 3, 4, 5, 6, 7]
def header_rows():
    yield "Name,Age,City"

def data_rows():
    yield "Alice,30,New York"
    yield "Bob,25,San Francisco"
    yield "Charlie,35,Chicago"

def footer_rows():
    yield "---END OF REPORT---"

def full_report():
    yield from header_rows()
    yield from data_rows()
    yield from footer_rows()

for line in full_report():
    print(line)
# Name,Age,City
# Alice,30,New York
# Bob,25,San Francisco
# Charlie,35,Chicago
# ---END OF REPORT---
Generators are not just producers — they can also receive values. The send() method resumes a generator and sends a value that becomes the result of the yield expression inside the generator. This turns generators into coroutines that can both produce and consume data.
def running_average():
    """A generator that computes a running average."""
    total = 0
    count = 0
    average = None
    while True:
        value = yield average  # Receive a value, yield the current average
        if value is None:
            break
        total += value
        count += 1
        average = total / count
# Usage
avg = running_average()
next(avg) # Prime the generator (advance to first yield)
print(avg.send(10)) # 10.0
print(avg.send(20)) # 15.0
print(avg.send(30)) # 20.0
print(avg.send(40)) # 25.0
The first next() call is necessary to "prime" the generator — it advances execution to the first yield expression, where the generator is ready to receive a value. After that, send() both sends a value in and gets the next yielded value out.
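Forgetting to prime is a classic mistake: Python refuses to deliver a value to a generator that has not yet reached its first yield. A minimal sketch (the echo coroutine here is a made-up example):

```python
def echo():
    """A trivial coroutine that prints whatever it receives."""
    while True:
        received = yield
        print(f"Got: {received}")

gen = echo()
try:
    gen.send("hello")  # Fails: the generator has not started yet
except TypeError as e:
    print(e)  # can't send non-None value to a just-started generator

next(gen)          # Prime: advance execution to the first yield
gen.send("hello")  # Got: hello
```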
def accumulator():
    """A coroutine that accumulates values and reports the running total."""
    total = 0
    while True:
        value = yield total
        if value is None:
            return total  # return value becomes StopIteration.value
        total += value
acc = accumulator()
next(acc) # Prime
print(acc.send(5)) # 5
print(acc.send(10)) # 15
print(acc.send(3)) # 18
# Close the generator gracefully
try:
    acc.send(None)  # Triggers the return statement
except StopIteration as e:
    print(f"Final total: {e.value}")  # Final total: 18
# Practical coroutine: a filter that receives items and forwards matches
def grep_coroutine(pattern):
    """A coroutine that filters lines matching a pattern."""
    print(f"Looking for: {pattern}")
    matches = []
    while True:
        line = yield
        if line is None:
            break
        if pattern in line:
            matches.append(line)
            print(f"  Match: {line}")
    return matches
# Usage
searcher = grep_coroutine("error")
next(searcher) # Prime
searcher.send("INFO: Server started")
searcher.send("ERROR: Connection timeout") # Match
searcher.send("DEBUG: Request received")
searcher.send("ERROR: Disk full") # Match
searcher.send("INFO: Shutting down")
try:
    searcher.send(None)  # Signal completion
except StopIteration as e:
    print(f"All matches: {e.value}")
# Match: ERROR: Connection timeout
# Match: ERROR: Disk full
# All matches: ['ERROR: Connection timeout', 'ERROR: Disk full']
One of the most powerful patterns in Python is chaining generators into a processing pipeline. Each generator reads from the previous one, transforms the data, and passes it along. This works like Unix pipes — data flows through a chain of transformations without any intermediate lists being created in memory.
# Pipeline: Read lines -> filter non-empty -> strip whitespace -> convert to uppercase
def read_lines(text):
    """Stage 1: Split text into lines."""
    for line in text.split("\n"):
        yield line

def filter_non_empty(lines):
    """Stage 2: Remove empty lines."""
    for line in lines:
        if line.strip():
            yield line

def strip_whitespace(lines):
    """Stage 3: Strip leading/trailing whitespace."""
    for line in lines:
        yield line.strip()

def to_uppercase(lines):
    """Stage 4: Convert to uppercase."""
    for line in lines:
        yield line.upper()
# Chain the pipeline
raw_text = """
hello world
Python generators
are powerful
and memory efficient
"""
pipeline = to_uppercase(
    strip_whitespace(
        filter_non_empty(
            read_lines(raw_text)
        )
    )
)
for line in pipeline:
    print(line)
# HELLO WORLD
# PYTHON GENERATORS
# ARE POWERFUL
# AND MEMORY EFFICIENT
# A more realistic pipeline: process log entries
def parse_log_entries(lines):
    """Parse each line into a structured dict."""
    for line in lines:
        parts = line.split(" | ")
        if len(parts) == 3:
            yield {
                "timestamp": parts[0],
                "level": parts[1],
                "message": parts[2],
            }

def filter_errors(entries):
    """Keep only ERROR entries."""
    for entry in entries:
        if entry["level"] == "ERROR":
            yield entry

def format_alerts(entries):
    """Format entries as alert strings."""
    for entry in entries:
        yield f"ALERT [{entry['timestamp']}]: {entry['message']}"
# Simulate log data
log_data = [
    "2024-01-15 10:00:01 | INFO | Server started",
    "2024-01-15 10:00:05 | ERROR | Database connection failed",
    "2024-01-15 10:00:10 | INFO | Retry attempt 1",
    "2024-01-15 10:00:15 | ERROR | Database connection failed again",
    "2024-01-15 10:00:20 | INFO | Connection restored",
    "2024-01-15 10:00:25 | ERROR | Disk space low",
]
# Build the pipeline
alerts = format_alerts(filter_errors(parse_log_entries(log_data)))
for alert in alerts:
    print(alert)
# ALERT [2024-01-15 10:00:05]: Database connection failed
# ALERT [2024-01-15 10:00:15]: Database connection failed again
# ALERT [2024-01-15 10:00:25]: Disk space low
Each stage processes one item at a time. No intermediate lists are created. This means you could pipe a 100 GB log file through this pipeline and it would use a trivial amount of memory.
The itertools module is Python's standard library for efficient iterator operations. Every function in it returns an iterator, so they compose naturally with generators and pipelines. Here are the functions you will use most often.
import itertools
# count: count from a start value with a step
for i in itertools.islice(itertools.count(10, 2), 5):
    print(i, end=" ")  # 10 12 14 16 18
print()

# cycle: repeat an iterable forever
colors = itertools.cycle(["red", "green", "blue"])
for _ in range(7):
    print(next(colors), end=" ")  # red green blue red green blue red
print()
# repeat: repeat a value n times (or forever)
fives = list(itertools.repeat(5, 4))
print(fives) # [5, 5, 5, 5]
# Practical use of repeat: initialize a grid
row = list(itertools.repeat(0, 5))
grid = [list(itertools.repeat(0, 5)) for _ in range(3)]
print(grid) # [[0, 0, 0, 0, 0], [0, 0, 0, 0, 0], [0, 0, 0, 0, 0]]
import itertools
# chain: concatenate multiple iterables
combined = list(itertools.chain([1, 2], [3, 4], [5, 6]))
print(combined) # [1, 2, 3, 4, 5, 6]
# chain.from_iterable: chain from a single iterable of iterables
nested = [[1, 2], [3, 4], [5, 6]]
flat = list(itertools.chain.from_iterable(nested))
print(flat) # [1, 2, 3, 4, 5, 6]
# islice: slice an iterator (like list slicing but for iterators)
print(list(itertools.islice(range(100), 5))) # [0, 1, 2, 3, 4]
print(list(itertools.islice(range(100), 10, 20, 3))) # [10, 13, 16, 19]
# takewhile / dropwhile: take/drop based on a predicate
nums = [1, 3, 5, 7, 2, 4, 6, 8]
print(list(itertools.takewhile(lambda x: x < 6, nums))) # [1, 3, 5]
print(list(itertools.dropwhile(lambda x: x < 6, nums))) # [7, 2, 4, 6, 8]
# groupby: group consecutive elements by a key function
data = [("A", 1), ("A", 2), ("B", 3), ("B", 4), ("A", 5)]
for key, group in itertools.groupby(data, key=lambda x: x[0]):
    print(f"{key}: {list(group)}")
# A: [('A', 1), ('A', 2)]
# B: [('B', 3), ('B', 4)]
# A: [('A', 5)] <-- Note: only groups CONSECUTIVE matches
import itertools
# combinations: all r-length combinations (no repeats, order doesn't matter)
print(list(itertools.combinations("ABCD", 2)))
# [('A','B'), ('A','C'), ('A','D'), ('B','C'), ('B','D'), ('C','D')]
# combinations_with_replacement: combinations allowing repeats
print(list(itertools.combinations_with_replacement("AB", 3)))
# [('A','A','A'), ('A','A','B'), ('A','B','B'), ('B','B','B')]
# permutations: all r-length arrangements (order matters)
print(list(itertools.permutations("ABC", 2)))
# [('A','B'), ('A','C'), ('B','A'), ('B','C'), ('C','A'), ('C','B')]
# product: Cartesian product (like nested for loops)
print(list(itertools.product("AB", [1, 2])))
# [('A',1), ('A',2), ('B',1), ('B',2)]
# Practical: generate all possible configs
sizes = ["small", "medium", "large"]
colors = ["red", "blue"]
materials = ["cotton", "silk"]
for combo in itertools.product(sizes, colors, materials):
    print(combo)
# ('small', 'red', 'cotton')
# ('small', 'red', 'silk')
# ('small', 'blue', 'cotton')
# ... (12 total combinations)
This is the canonical use case for generators. Instead of loading an entire file into memory, you process it one line at a time.
def read_large_file(file_path):
    """Read a file line by line using a generator."""
    with open(file_path, "r") as f:
        for line in f:
            yield line.strip()

def count_errors_in_log(file_path):
    """Count error lines in a log file without loading it into memory."""
    error_count = 0
    for line in read_large_file(file_path):
        if "ERROR" in line:
            error_count += 1
    return error_count
# For a 10 GB log file, this uses ~1 line of memory at a time
# Instead of loading all 10 GB:
# count = count_errors_in_log("/var/log/huge_application.log")
# Alternative using generator expression:
# error_count = sum(1 for line in read_large_file(path) if "ERROR" in line)
import itertools
def primes():
    """Generate prime numbers indefinitely using an incremental sieve."""
    yield 2
    composites = {}  # Maps composite number -> list of primes that divide it
    candidate = 3
    while True:
        if candidate not in composites:
            # candidate is prime
            yield candidate
            composites[candidate * candidate] = [candidate]
        else:
            # candidate is composite; advance each prime factor to its
            # next ODD multiple (we skip evens, so step by 2 * prime)
            for prime in composites[candidate]:
                composites.setdefault(candidate + 2 * prime, []).append(prime)
            del composites[candidate]
        candidate += 2  # Skip even numbers
# Get the first 20 prime numbers
first_20_primes = list(itertools.islice(primes(), 20))
print(first_20_primes)
# [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71]
# Sum of the first 1000 primes
print(sum(itertools.islice(primes(), 1000))) # 3682913
import csv
from io import StringIO
# Simulated CSV data
csv_data = """name,department,salary
Alice,Engineering,120000
Bob,Marketing,85000
Charlie,Engineering,135000
Diana,Marketing,90000
Eve,Engineering,110000
Frank,HR,75000
Grace,Engineering,140000
"""
def read_csv_rows(csv_text):
    """Stage 1: Parse CSV into dictionaries."""
    reader = csv.DictReader(StringIO(csv_text))
    for row in reader:
        yield row

def filter_department(rows, dept):
    """Stage 2: Keep only rows matching the department."""
    for row in rows:
        if row["department"] == dept:
            yield row

def transform_salary(rows):
    """Stage 3: Convert salary to int and add a bonus field."""
    for row in rows:
        salary = int(row["salary"])
        row["salary"] = salary
        row["bonus"] = salary * 0.1  # 10% bonus
        yield row

def aggregate(rows):
    """Stage 4: Compute total salary and average."""
    total = 0
    count = 0
    for row in rows:
        total += row["salary"]
        count += 1
        yield row  # Pass through for downstream consumers
    # After iteration, print the summary
    if count > 0:
        print(f"\nTotal salary: ${total:,}")
        print(f"Average salary: ${total/count:,.0f}")
        print(f"Headcount: {count}")
# Build and run the pipeline
pipeline = aggregate(
    transform_salary(
        filter_department(
            read_csv_rows(csv_data),
            "Engineering"
        )
    )
)
for emp in pipeline:
    print(f"{emp['name']}: ${emp['salary']:,} (bonus: ${emp['bonus']:,.0f})")
# Alice: $120,000 (bonus: $12,000)
# Charlie: $135,000 (bonus: $13,500)
# Eve: $110,000 (bonus: $11,000)
# Grace: $140,000 (bonus: $14,000)
#
# Total salary: $505,000
# Average salary: $126,250
# Headcount: 4
import time
def paginated_api_fetch(base_url, page_size=100):
    """
    Generator that fetches paginated API results.
    Yields individual items across all pages.
    """
    page = 1
    while True:
        # Simulate API call (replace with real requests.get())
        url = f"{base_url}?page={page}&size={page_size}"
        print(f"Fetching: {url}")
        # Simulated response
        if page <= 3:
            results = [{"id": i, "name": f"Item {i}"}
                       for i in range((page - 1) * page_size + 1, page * page_size + 1)]
        else:
            results = []  # No more data
        if not results:
            break  # No more pages
        yield from results  # Yield each item individually
        page += 1
        time.sleep(0.1)  # Rate limiting
# The consumer does not need to know about pagination
for item in paginated_api_fetch("https://api.example.com/items", page_size=2):
    print(f"  Processing: {item}")
    if item["id"] >= 5:
        break  # Stop early — remaining pages are never fetched!
# Output:
# Fetching: https://api.example.com/items?page=1&size=2
# Processing: {'id': 1, 'name': 'Item 1'}
# Processing: {'id': 2, 'name': 'Item 2'}
# Fetching: https://api.example.com/items?page=2&size=2
# Processing: {'id': 3, 'name': 'Item 3'}
# Processing: {'id': 4, 'name': 'Item 4'}
# Fetching: https://api.example.com/items?page=3&size=2
# Processing: {'id': 5, 'name': 'Item 5'}
Notice the key advantage: when the consumer breaks out of the loop, the generator stops fetching. Pages 4, 5, 6, etc. are never requested. Lazy evaluation means you only do the work that is actually needed.
Let us put hard numbers on the difference between lists and generators.
import sys
import time
import tracemalloc
def benchmark_list_vs_generator(n):
    """Compare list vs generator for summing n squared numbers."""
    # List approach
    tracemalloc.start()
    start = time.perf_counter()
    result_list = sum([x ** 2 for x in range(n)])
    list_time = time.perf_counter() - start
    list_peak = tracemalloc.get_traced_memory()[1]
    tracemalloc.stop()

    # Generator approach
    tracemalloc.start()
    start = time.perf_counter()
    result_gen = sum(x ** 2 for x in range(n))
    gen_time = time.perf_counter() - start
    gen_peak = tracemalloc.get_traced_memory()[1]
    tracemalloc.stop()

    assert result_list == result_gen
    print(f"n = {n:>12,}")
    print(f"  List:      {list_time:.4f}s | Peak memory: {list_peak:>12,} bytes")
    print(f"  Generator: {gen_time:.4f}s | Peak memory: {gen_peak:>12,} bytes")
    print(f"  Memory saved: {(1 - gen_peak/list_peak)*100:.1f}%")
    print()
benchmark_list_vs_generator(100_000)
benchmark_list_vs_generator(1_000_000)
benchmark_list_vs_generator(10_000_000)
# Typical output:
# n = 100,000
# List: 0.0234s | Peak memory: 824,464 bytes
# Generator: 0.0228s | Peak memory: 464 bytes
# Memory saved: 99.9%
#
# n = 1,000,000
# List: 0.2451s | Peak memory: 8,448,688 bytes
# Generator: 0.2389s | Peak memory: 464 bytes
# Memory saved: 100.0%
#
# n = 10,000,000
# List: 2.5102s | Peak memory: 80,000,048 bytes
# Generator: 2.4231s | Peak memory: 464 bytes
# Memory saved: 100.0%
Key takeaways from the benchmark:
- When feeding a single-pass aggregation like sum(), generators are slightly faster because they avoid the overhead of allocating and populating a list.
- The generator's peak memory stays constant no matter how large n grows, while the list's memory scales linearly with n.

Generators have some surprising behaviors that trip up even experienced developers. Here are the ones you must know.
# Generators can only be consumed ONCE
gen = (x ** 2 for x in range(5))
print(list(gen)) # [0, 1, 4, 9, 16]
print(list(gen)) # [] — exhausted! No error, just empty.
# This is a common bug:
def get_numbers():
    yield 1
    yield 2
    yield 3
nums = get_numbers()
print(sum(nums)) # 6
print(sum(nums)) # 0 — the generator is already exhausted!
# Fix: recreate the generator each time, or use a list if you need multiple passes
nums_list = list(get_numbers())
print(sum(nums_list)) # 6
print(sum(nums_list)) # 6
gen = (x for x in range(10))
# These all fail:
# gen[0] # TypeError: 'generator' object is not subscriptable
# gen[2:5] # TypeError: 'generator' object is not subscriptable
# len(gen) # TypeError: object of type 'generator' has no len()
# Workarounds:
import itertools
# Get the nth element (consumes n elements)
def nth(iterable, n, default=None):
    return next(itertools.islice(iterable, n, None), default)
gen = (x ** 2 for x in range(10))
print(nth(gen, 3)) # 9 (the 4th element, 0-indexed)
# Slice an iterator
gen = (x ** 2 for x in range(10))
print(list(itertools.islice(gen, 2, 5))) # [4, 9, 16]
# A subtle bug: storing a generator and trying to use it in multiple places
def get_even_numbers(n):
    return (x for x in range(n) if x % 2 == 0)

evens = get_even_numbers(20)
# First use works fine
for x in evens:
    if x > 6:
        break
print(f"Stopped at {x}")  # Stopped at 8
# Second use — CONTINUES from where we left off, not from the beginning!
remaining = list(evens)
print(remaining) # [10, 12, 14, 16, 18]
# If you expected [0, 2, 4, 6, 8, 10, 12, 14, 16, 18], you have a bug.
# Late binding: closures capture variables, not values. The same trap
# bites generator expressions that reference outer loop variables.
funcs = []
for i in range(5):
    funcs.append(lambda: i)  # All lambdas capture the SAME variable i
print([f() for f in funcs])  # [4, 4, 4, 4, 4] — not [0, 1, 2, 3, 4]!

# Fix: use a default argument to capture the current value
funcs = []
for i in range(5):
    funcs.append(lambda i=i: i)  # Each lambda gets its own copy
print([f() for f in funcs])  # [0, 1, 2, 3, 4]
Here are the guidelines I follow when deciding how to use generators in production code.
# GOOD: generator for processing a large file
def process_log_file(path):
    with open(path) as f:
        for line in f:
            if "ERROR" in line:
                yield parse_error(line)

# BAD: loading entire file into memory
def process_log_file_bad(path):
    with open(path) as f:
        lines = f.readlines()  # Entire file in memory!
    return [parse_error(line) for line in lines if "ERROR" in line]
# GOOD: generator expression passed directly to sum()
total = sum(order.total for order in orders if order.status == "completed")

# UNNECESSARY: creating an intermediate list
total = sum([order.total for order in orders if order.status == "completed"])
import itertools
# GOOD: use itertools.chain instead of nested loops
all_items = itertools.chain(list_a, list_b, list_c)
# GOOD: use itertools.groupby for grouping
for key, group in itertools.groupby(sorted_data, key=extract_key):
    process_group(key, list(group))
# GOOD: use itertools.islice for taking the first N items from an iterator
first_ten = list(itertools.islice(infinite_generator(), 10))
# If you need to iterate multiple times, use a class with __iter__
class DataSource:
    def __init__(self, path):
        self.path = path

    def __iter__(self):
        with open(self.path) as f:
            for line in f:
                yield line.strip()

# Each for loop gets a fresh iterator
source = DataSource("data.txt")
count = sum(1 for _ in source)             # First pass: count lines
total = sum(len(line) for line in source)  # Second pass: total chars
def fetch_records(query):
    """
    Yield records matching the query from the database.

    WARNING: This generator can only be consumed once.
    If you need multiple passes, materialize with list().
    """
    cursor = db.execute(query)
    for row in cursor:
        yield transform(row)
To recap what we have covered:

- Iterators implement __iter__() and __next__(). They produce values one at a time and raise StopIteration when done. Every for loop in Python uses this protocol.
- Generators are functions that contain yield. They are dramatically simpler to write than class-based iterators. The function's state is automatically saved and restored between next() calls.
- Generator expressions use the syntax (expr for x in iterable if condition). They use constant memory regardless of the source size.
- Reach for itertools before writing your own iterator utilities: chain, islice, groupby, combinations, permutations, and product cover most needs.

If you have ever tried to process a 10 GB log file by reading it entirely into memory, you already know why generators and iterators matter — and virtual environments are the other half of professional Python hygiene. If you have been writing Python for any length of time, you have almost certainly run into the moment where installing a package for one project breaks another. Maybe you upgraded requests for Project A, and suddenly Project B throws import errors because it depends on an older version. Or worse, you installed something system-wide with sudo pip install and corrupted your operating system’s Python environment. These are not edge cases — they are inevitable consequences of working without virtual environments.
Virtual environments solve this problem by giving each project its own isolated Python installation with its own set of packages. Combined with pip, Python’s package manager, they form the foundation of every professional Python workflow. Whether you are building a Flask API, training a machine learning model, or writing automation scripts, understanding virtual environments and pip is non-negotiable. This tutorial covers everything from the basics to advanced tooling that senior engineers use daily in production.
To appreciate what virtual environments give you, consider what happens without them. Every Python installation has a single site-packages directory where third-party packages get installed. When you run pip install flask without a virtual environment, Flask and all its dependencies land in that global site-packages folder. Every Python script on your system now sees that version of Flask.
Here is where things go wrong:
Dependency conflicts. Project A requires SQLAlchemy==1.4 and Project B requires SQLAlchemy==2.0. Since there is only one site-packages, you cannot have both versions installed simultaneously. Installing one overwrites the other, and one of your projects breaks.
System Python pollution. On macOS and most Linux distributions, the operating system ships with a Python installation that system tools depend on. Installing packages into system Python with pip install (especially with sudo) can overwrite libraries that your OS needs. I have seen developers render their terminal unusable by upgrading six or urllib3 system-wide.
Reproducibility failures. Without an isolated environment, you have no reliable way to know which packages your project actually needs versus what happens to be installed on your machine. When your teammate clones the repo and runs it, it fails with mysterious import errors because they do not have the same random collection of packages you accumulated over months.
Version ambiguity. Running python on different machines might invoke Python 2.7, 3.8, or 3.12. Without explicit environment management, you are guessing which interpreter and which package versions your code will encounter in production.
# This is what chaos looks like
sudo pip install flask     # Installs into system Python
pip install django==3.2    # Might conflict with existing packages
pip install requests       # Which project needs this? All of them? Some?
pip list                   # 200+ packages, no idea which project uses what
Virtual environments eliminate every one of these problems.
Python 3.3+ includes the venv module in the standard library, so you do not need to install anything extra. This is the recommended way to create virtual environments.
# Navigate to your project directory
cd ~/projects/my-flask-app

# Create a virtual environment
python3 -m venv venv
This creates a venv directory inside your project containing a copy of the Python interpreter, the pip package manager, and an empty site-packages directory. The directory structure looks like this:
venv/
├── bin/                  # Scripts (activate, pip, python) — Linux/macOS
│   ├── activate          # Bash/Zsh activation script
│   ├── activate.csh      # C shell activation
│   ├── activate.fish     # Fish shell activation
│   ├── pip
│   ├── pip3
│   ├── python -> python3
│   └── python3 -> /usr/bin/python3
├── include/              # C headers for compiling extensions
├── lib/                  # Installed packages go here
│   └── python3.12/
│       └── site-packages/
├── lib64 -> lib          # Symlink on some systems
└── pyvenv.cfg            # Configuration file
The most common names for virtual environment directories are venv, .venv, and env. I recommend venv or .venv because they are immediately recognizable, and every .gitignore template for Python already includes them. The dot prefix in .venv hides it from normal directory listings, which some developers prefer.
# All of these are common and acceptable
python3 -m venv venv
python3 -m venv .venv
python3 -m venv env

# You can also name it after the project, though this is less common
python3 -m venv myproject-env
Always create the virtual environment inside your project’s root directory. This keeps everything self-contained and makes it obvious which environment belongs to which project. Some developers prefer to store all virtual environments in a central location like ~/.virtualenvs/, but this adds complexity without much benefit unless you are using virtualenvwrapper.
If you have multiple Python versions installed, you can specify which one to use:
# Use a specific Python version
python3.11 -m venv venv
python3.12 -m venv venv

# On Windows
py -3.11 -m venv venv
In rare cases, such as Docker containers where you want a minimal environment, you can create a virtual environment without pip:
# Create without pip (smaller, faster)
python3 -m venv --without-pip venv
Creating a virtual environment does not automatically use it. You must activate it first, which modifies your shell’s PATH so that python and pip commands point to the virtual environment’s binaries instead of the system ones.
# macOS / Linux (Bash or Zsh)
source venv/bin/activate

# macOS / Linux (Fish shell)
source venv/bin/activate.fish

# macOS / Linux (Csh / Tcsh)
source venv/bin/activate.csh

# Windows (Command Prompt)
venv\Scripts\activate.bat

# Windows (PowerShell)
venv\Scripts\Activate.ps1
When a virtual environment is active, your shell prompt changes to show the environment name in parentheses:
# Before activation
$ whoami
folau

# After activation
(venv) $ whoami
folau

# Verify Python is using the venv
(venv) $ which python
/home/folau/projects/my-flask-app/venv/bin/python
(venv) $ which pip
/home/folau/projects/my-flask-app/venv/bin/pip
Activation is simpler than it sounds. It prepends the virtual environment’s bin/ (or Scripts/ on Windows) directory to your PATH environment variable. That is it. When you type python, your shell finds the venv’s Python before the system Python because it appears earlier in PATH.
# Before activation
$ echo $PATH
/usr/local/bin:/usr/bin:/bin

# After activation
(venv) $ echo $PATH
/home/folau/projects/my-flask-app/venv/bin:/usr/local/bin:/usr/bin:/bin
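You can also confirm the active environment from inside Python itself: in a virtual environment, `sys.prefix` points at the venv while `sys.base_prefix` points at the base installation. A small sketch:

```python
import sys


def in_virtualenv() -> bool:
    # In a venv, sys.prefix differs from sys.base_prefix (Python 3.3+)
    return sys.prefix != sys.base_prefix


print("interpreter:", sys.executable)
print("virtualenv active:", in_virtualenv())
```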
When you are done working on a project, deactivate the environment to return to your system Python:
# Works on all platforms
(venv) $ deactivate
$
You do not strictly need to activate a virtual environment. You can call the venv’s Python or pip directly by using the full path:
# Run Python from the venv without activating
./venv/bin/python my_script.py

# Install a package without activating
./venv/bin/pip install requests
This is particularly useful in shell scripts, cron jobs, and CI/CD pipelines where activating is unnecessary overhead.
pip is the standard package manager for Python. It downloads and installs packages from the Python Package Index (PyPI), which hosts over 500,000 packages. When you work inside a virtual environment, pip installs packages only into that environment’s site-packages, keeping everything isolated.
# Install the latest version
pip install requests

# Install a specific version
pip install requests==2.31.0

# Install a minimum version
pip install "requests>=2.28.0"

# Install a version range
pip install "requests>=2.28.0,<3.0.0"

# Install multiple packages at once
pip install flask sqlalchemy redis

# Install with extras (optional dependencies)
pip install "fastapi[all]"
pip install "celery[redis]"
# Upgrade to the latest version
pip install --upgrade requests
pip install -U requests  # Short form

# Upgrade pip itself
pip install --upgrade pip
# Uninstall a package
pip uninstall requests

# Uninstall without confirmation prompt
pip uninstall -y requests

# Uninstall multiple packages
pip uninstall flask sqlalchemy redis
Note that pip uninstall only removes the specified package. It does not remove that package's dependencies, even if nothing else needs them. This can leave orphaned packages in your environment.
# List all installed packages
pip list

# List outdated packages
pip list --outdated

# Show detailed info about a specific package
pip show requests
The output of pip show is useful for debugging dependency issues:
(venv) $ pip show requests
Name: requests
Version: 2.31.0
Summary: Python HTTP for Humans.
Home-page: https://requests.readthedocs.io
Author: Kenneth Reitz
License: Apache 2.0
Location: /home/folau/projects/my-app/venv/lib/python3.12/site-packages
Requires: certifi, charset-normalizer, idna, urllib3
Required-by: httpx, some-other-package
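The same information is available programmatically through the standard library's `importlib.metadata` module (Python 3.8+), which is handy in health-check endpoints or diagnostic scripts. A sketch (the helper name is mine):

```python
from importlib.metadata import version, PackageNotFoundError


def installed_version(name: str):
    """Return the installed version of a distribution, or None if absent."""
    try:
        return version(name)
    except PackageNotFoundError:
        return None


print(installed_version("pip"))                  # e.g. '24.0' if pip is present
print(installed_version("no-such-package-xyz"))  # None
```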
The pip freeze command outputs every installed package and its exact version in a format that can be fed back into pip. This is how you capture your project's dependencies:
# Output all installed packages with versions
pip freeze

# Save to a requirements file
pip freeze > requirements.txt
The output looks like this:
certifi==2024.2.2
charset-normalizer==3.3.2
flask==3.0.2
idna==3.6
jinja2==3.1.3
markupsafe==2.1.5
requests==2.31.0
urllib3==2.2.1
werkzeug==3.0.1
# Install all packages from requirements.txt
pip install -r requirements.txt

# Install from multiple requirement files
pip install -r requirements.txt -r requirements-dev.txt
The requirements.txt file is the traditional way to declare Python project dependencies. It is a plain text file where each line specifies a package and optionally a version constraint.
# Pinned versions (recommended for applications)
flask==3.0.2
requests==2.31.0
sqlalchemy==2.0.27

# Minimum version
requests>=2.28.0

# Version range
requests>=2.28.0,<3.0.0

# Compatible release (>=2.31.0, <2.32.0)
requests~=2.31.0

# Any version (avoid this)
requests

# Comments
flask==3.0.2  # Web framework

# Include another requirements file
-r requirements-base.txt
A common pattern is to maintain separate requirement files for production and development:
# requirements.txt (production)
flask==3.0.2
gunicorn==21.2.0
psycopg2-binary==2.9.9
redis==5.0.1

# requirements-dev.txt (development)
-r requirements.txt
pytest==8.0.2
pytest-cov==4.1.0
black==24.2.0
flake8==7.0.0
mypy==1.8.0
ipdb==0.13.13
Notice how requirements-dev.txt includes requirements.txt with the -r flag. This means installing dev dependencies automatically installs production dependencies as well, avoiding duplication.
For applications (web apps, APIs, services), always pin exact versions with ==. This guarantees that every environment — your laptop, your teammate's laptop, staging, production — runs identical code. Unpinned or loosely pinned dependencies are one of the most common sources of “works on my machine” bugs.
For libraries (packages you publish for others to install), use flexible version constraints like >= or ~=. If your library pins exact versions, it creates conflicts when users install it alongside other packages that need different versions of the same dependency.
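To make the compatible-release operator concrete: `~=2.31.0` means `>=2.31.0,<2.32.0` (the last listed segment may grow, the rest are frozen). A toy check, purely for illustration — real resolution is done by pip via the `packaging` library, not code like this:

```python
def compatible(candidate: str, spec: str) -> bool:
    """Toy model of PEP 440's ~= operator: spec '2.31.0' accepts
    >=2.31.0,<2.32.0. Illustration only — use `packaging` for real work."""
    cand = [int(p) for p in candidate.split(".")]
    base = [int(p) for p in spec.split(".")]
    lower = base
    # Bump the second-to-last segment to form the exclusive upper bound
    upper = base[:-2] + [base[-2] + 1, 0]
    return lower <= cand < upper


print(compatible("2.31.4", "2.31.0"))  # True  — patch bump allowed
print(compatible("2.32.0", "2.31.0"))  # False — minor bump rejected
```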
Raw pip freeze has a significant limitation: it dumps every installed package, including transitive dependencies (dependencies of your dependencies). This makes it hard to tell which packages you actually chose to install versus which ones came along for the ride. pip-tools solves this elegantly.
pip install pip-tools
With pip-tools, you maintain a requirements.in file that lists only your direct dependencies. Then pip-compile resolves all transitive dependencies and writes a fully pinned requirements.txt.
# requirements.in (what YOU want)
flask
requests
sqlalchemy
celery[redis]
# Generate the pinned requirements.txt
pip-compile requirements.in
The generated requirements.txt pins every package and includes comments showing where each dependency came from (add the --generate-hashes flag if you also want package hashes):
#
# This file is autogenerated by pip-compile with Python 3.12
# by the following command:
#
# pip-compile requirements.in
#
certifi==2024.2.2
    # via requests
charset-normalizer==3.3.2
    # via requests
flask==3.0.2
    # via -r requirements.in
idna==3.6
    # via requests
jinja2==3.1.3
    # via flask
requests==2.31.0
    # via -r requirements.in
sqlalchemy==2.0.27
    # via -r requirements.in
pip-sync goes a step further: it installs exactly the packages in requirements.txt and removes anything else. This ensures your environment matches the lock file precisely.
# Sync your environment to match requirements.txt exactly
pip-sync requirements.txt

# Sync with multiple requirement files
pip-sync requirements.txt requirements-dev.txt
# Upgrade all packages
pip-compile --upgrade requirements.in

# Upgrade a specific package
pip-compile --upgrade-package requests requirements.in

# Then sync your environment
pip-sync requirements.txt
The Python ecosystem has several tools beyond venv and pip for environment and dependency management. Here is when to reach for each one.
Pipenv combines virtual environment management and dependency resolution into a single tool. It uses a Pipfile instead of requirements.txt and generates a Pipfile.lock for deterministic builds.
# Install pipenv
pip install pipenv

# Create environment and install a package
pipenv install flask

# Install dev dependency
pipenv install --dev pytest

# Activate the shell
pipenv shell

# Run a command without activating
pipenv run python app.py
Pipenv was once the officially recommended tool, but its development stalled for years. It has since resumed active development, but many teams have moved to other tools. Use it if your team already uses it or if you want a simple all-in-one solution.
Poetry is the most popular modern alternative. It handles dependency management, virtual environments, building, and publishing — all through a pyproject.toml file.
# Install Poetry
curl -sSL https://install.python-poetry.org | python3 -

# Create a new project
poetry new my-project

# Add dependencies
poetry add flask
poetry add --group dev pytest

# Install dependencies
poetry install

# Run commands in the environment
poetry run python app.py
poetry shell
Poetry is excellent for projects that are both applications and libraries. Its dependency resolver is more sophisticated than pip's, and pyproject.toml is cleaner than requirements.txt. Use Poetry for greenfield projects where you want a modern, complete toolchain.
Conda is a cross-language package manager popular in data science. Unlike pip, it can install non-Python dependencies (C libraries, R packages, system tools), which is critical for scientific computing packages like NumPy, SciPy, and TensorFlow that depend on compiled native code.
# Create a conda environment
conda create -n myenv python=3.12

# Activate
conda activate myenv

# Install packages
conda install numpy pandas scikit-learn

# Export environment
conda env export > environment.yml

# Recreate from file
conda env create -f environment.yml
Use conda if you are doing data science or machine learning work, especially if you need packages with complex native dependencies. For web development and general-purpose Python, stick with venv + pip or Poetry.
pyproject.toml is the modern standard for Python project configuration, defined in PEP 518 and PEP 621. It replaces setup.py, setup.cfg, and even requirements.txt as the single source of truth for project metadata and dependencies.
# pyproject.toml
[build-system]
requires = ["setuptools>=68.0", "wheel"]
build-backend = "setuptools.build_meta"
[project]
name = "my-flask-app"
version = "1.0.0"
description = "A production Flask application"
requires-python = ">=3.10"
authors = [
    {name = "Folau Kaveinga", email = "folau@example.com"}
]
dependencies = [
    "flask>=3.0,<4.0",
    "sqlalchemy>=2.0",
    "requests>=2.28",
    "gunicorn>=21.0",
]
[project.optional-dependencies]
dev = [
    "pytest>=8.0",
    "black>=24.0",
    "mypy>=1.8",
    "ruff>=0.2",
]
[tool.black]
line-length = 88
target-version = ["py312"]
[tool.ruff]
line-length = 88
select = ["E", "F", "I"]
[tool.mypy]
python_version = "3.12"
strict = true
[tool.pytest.ini_options]
testpaths = ["tests"]
addopts = "-v --tb=short"
The advantage of pyproject.toml is consolidation. Your project metadata, dependencies, and tool configuration all live in one file instead of being scattered across setup.py, requirements.txt, mypy.ini, pytest.ini, .flake8, and more.
# Install the project in development mode
pip install -e .

# Install with dev dependencies
pip install -e ".[dev]"

# Build the project
python -m build
Virtual environments isolate packages, but they do not solve the problem of needing different Python versions for different projects. pyenv fills that gap by letting you install and switch between multiple Python versions seamlessly.
# macOS (via Homebrew)
brew install pyenv

# Linux
curl https://pyenv.run | bash

# Add to your shell profile (~/.bashrc or ~/.zshrc)
export PYENV_ROOT="$HOME/.pyenv"
export PATH="$PYENV_ROOT/bin:$PATH"
eval "$(pyenv init -)"
# List available Python versions
pyenv install --list | grep "^  3"

# Install specific versions
pyenv install 3.11.8
pyenv install 3.12.2

# Set global default
pyenv global 3.12.2

# Set version for a specific project directory
cd ~/projects/legacy-app
pyenv local 3.11.8   # Creates .python-version file

# Now create a venv with the correct version
python -m venv venv  # Uses 3.11.8 because of .python-version
The combination of pyenv (for Python version management) and venv (for package isolation) gives you complete control over your Python environments.
Most modern IDEs detect and integrate with virtual environments automatically, providing code completion, linting, and debugging support based on the packages installed in your venv.
VS Code's Python extension automatically detects virtual environments in your workspace. To configure it:
Open the Command Palette (Cmd+Shift+P on macOS, Ctrl+Shift+P on Windows/Linux), run "Python: Select Interpreter", and choose the interpreter at venv/bin/python. You can also set it in .vscode/settings.json:
{
    "python.defaultInterpreterPath": "${workspaceFolder}/venv/bin/python",
    "python.terminal.activateEnvironment": true
}
When python.terminal.activateEnvironment is true, VS Code automatically activates the virtual environment whenever you open a new terminal.
PyCharm has first-class virtual environment support:
Open Settings → Project → Python Interpreter, add a new interpreter, and point it at venv/bin/python. PyCharm can also create virtual environments for you when starting a new project. It detects requirements.txt files and offers to install dependencies automatically.
A common question is whether you need virtual environments inside Docker containers. After all, each container is already an isolated environment. The answer is nuanced.
If your Docker container runs a single Python application and nothing else, a virtual environment adds no practical benefit. The container itself provides the isolation:
# Dockerfile without venv (acceptable for simple apps)
FROM python:3.12-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
CMD ["gunicorn", "app:app", "--bind", "0.0.0.0:8000"]
There are legitimate reasons to use virtual environments inside containers:
Multi-stage builds. Virtual environments make it easy to copy only the installed packages from a build stage to a slim runtime stage:
# Dockerfile with venv (recommended for production)
FROM python:3.12-slim AS builder
WORKDIR /app
RUN python -m venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

FROM python:3.12-slim AS runtime
COPY --from=builder /opt/venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"
WORKDIR /app
COPY . .
CMD ["gunicorn", "app:app", "--bind", "0.0.0.0:8000"]
Avoiding system package conflicts. Some base images include Python packages that the OS depends on. Installing your dependencies into a venv prevents overwriting these system packages.
Cleaner separation. When your container runs multiple processes or includes system Python tools, a venv keeps your application packages cleanly separated.
Here is the complete workflow for starting a new Python project with proper environment management:
# 1. Create project directory
mkdir ~/projects/my-api && cd ~/projects/my-api

# 2. Initialize git
git init

# 3. Create virtual environment
python3 -m venv venv

# 4. Add venv to .gitignore
echo "venv/" >> .gitignore
echo "__pycache__/" >> .gitignore
echo "*.pyc" >> .gitignore
echo ".env" >> .gitignore

# 5. Activate the environment
source venv/bin/activate

# 6. Upgrade pip
pip install --upgrade pip

# 7. Install your dependencies
pip install flask sqlalchemy pytest

# 8. Freeze dependencies
pip freeze > requirements.txt

# 9. Make your initial commit
git add .
git commit -m "Initial project setup with Flask, SQLAlchemy"
When you clone a project that uses virtual environments, here is how to get up and running:
# 1. Clone the repository
git clone https://github.com/team/project.git
cd project

# 2. Create a fresh virtual environment
python3 -m venv venv

# 3. Activate it
source venv/bin/activate

# 4. Install exact dependencies from the lock file
pip install -r requirements.txt

# 5. Verify everything works
python -m pytest
If the project uses pyproject.toml instead:
# Install the project and its dependencies
pip install -e ".[dev]"
Upgrading dependencies in a production project requires discipline. Never blindly upgrade everything at once.
# 1. Check what is outdated
pip list --outdated

# 2. Upgrade one package at a time
pip install --upgrade requests

# 3. Run your test suite
python -m pytest

# 4. If tests pass, update requirements.txt
pip freeze > requirements.txt

# 5. Commit the change with a clear message
git add requirements.txt
git commit -m "Upgrade requests from 2.28.0 to 2.31.0"
For a safer approach using pip-tools:
# Upgrade a specific package and re-resolve all dependencies
pip-compile --upgrade-package requests requirements.in
pip-sync requirements.txt
python -m pytest
git add requirements.txt
git commit -m "Upgrade requests to 2.31.0"
Here is a typical GitHub Actions workflow that uses virtual environments:
# .github/workflows/ci.yml
name: CI

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

jobs:
  test:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        python-version: ["3.11", "3.12"]
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python ${{ matrix.python-version }}
        uses: actions/setup-python@v5
        with:
          python-version: ${{ matrix.python-version }}

      - name: Create virtual environment
        run: python -m venv venv

      - name: Install dependencies
        run: |
          source venv/bin/activate
          pip install --upgrade pip
          pip install -r requirements.txt
          pip install -r requirements-dev.txt

      - name: Run linters
        run: |
          source venv/bin/activate
          ruff check .
          mypy .

      - name: Run tests
        run: |
          source venv/bin/activate
          pytest --cov=src --cov-report=xml

      - name: Upload coverage
        uses: codecov/codecov-action@v4
        with:
          file: coverage.xml
Virtual environments contain thousands of files, are platform-specific (a venv created on macOS will not work on Linux), and include hardcoded paths. Never commit them. Add this to your .gitignore:
# .gitignore
venv/
.venv/
env/
*.pyc
__pycache__/
Running pip install outside a virtual environment installs packages globally, which eventually leads to conflicts. On macOS and Linux, some people use sudo pip install, which is even worse because it modifies files owned by the operating system.
# NEVER do this
sudo pip install flask

# ALWAYS activate a venv first
source venv/bin/activate
pip install flask
If you install packages without activating your virtual environment, they go into the global Python. The most common symptom is: “I installed the package, but Python says it cannot find it.”
# Check which pip you are using
which pip
# Should show: /path/to/your/project/venv/bin/pip
# NOT: /usr/bin/pip or /usr/local/bin/pip
Installing a new package and forgetting to update requirements.txt means your teammates and CI/CD pipeline will not have that package. Make it a habit to freeze after every install:
# Install and freeze in one command
pip install requests && pip freeze > requirements.txt
The version of pip bundled with python -m venv is often outdated. Old pip versions have slower dependency resolution and may fail to install packages that require newer features. Always upgrade pip immediately after creating a new environment.
# First thing after activation
pip install --upgrade pip
If you are using conda, avoid installing packages with pip unless the package is not available through conda. Mixing the two can lead to dependency conflicts that are extremely difficult to debug. If you must use pip inside a conda environment, install conda packages first.
Commit requirements.txt, not the environment itself.
Pin exact versions with == in requirements.txt for deployable applications. Use flexible ranges only for libraries.
Keep separate requirements.txt and requirements-dev.txt files (or use pyproject.toml optional dependencies).
Run pip install --upgrade pip right after creating a new virtual environment.
pip freeze works for simple projects, but pip-compile gives you traceable, reproducible dependency resolution.
Record the required Python version in a .python-version file, pyproject.toml's requires-python, or at minimum a note in your README.
If an environment breaks, delete the venv directory and create a fresh one. They are disposable by design.
You can skip activation entirely by running /path/to/venv/bin/python script.py.
Use python -m venv venv to create environments and source venv/bin/activate to activate them. This is built into Python — no extra tools required.
pip is the standard package manager. The core commands you will use daily are pip install, pip freeze, and pip install -r requirements.txt.
Pin dependencies in requirements.txt for applications. Use pip-tools or Poetry for better dependency management on larger projects.
pyproject.toml is the modern replacement for setup.py and requirements.txt. New projects should adopt it.
Use pyenv when you need different Python versions for different projects.
Never install with sudo pip. Never skip creating a venv because your project is “too small.”
Almost every real-world application needs to persist data, and relational databases remain the backbone of most production systems. MySQL, the world’s most popular open-source relational database, pairs naturally with Python — one of the world’s most popular programming languages. Whether you are building a web application with Flask or Django, automating data pipelines, or writing microservices, knowing how to talk to MySQL from Python is a non-negotiable skill.
In this tutorial you will learn everything from establishing a basic connection to managing transactions, pooling connections for production workloads, and even mapping your tables to Python objects with SQLAlchemy. Every example is production-minded: parameterized queries, proper error handling, and clean resource management from the start.
The most common MySQL driver for Python is mysql-connector-python, maintained by Oracle. Install it with pip:
pip install mysql-connector-python
A popular alternative is PyMySQL, a pure-Python driver that requires no C extensions:
pip install pymysql
Both libraries follow the Python DB-API 2.0 specification (PEP 249), so the core patterns — connect, cursor, execute, fetch — are nearly identical. This tutorial uses mysql-connector-python for all examples. If you are using PyMySQL, swap the import and connection call and the rest of your code stays the same.
You will also need a running MySQL server. If you do not have one, the quickest path is Docker:
# Pull and run MySQL 8 in a container
docker run --name mysql-tutorial \
  -e MYSQL_ROOT_PASSWORD=rootpass \
  -p 3306:3306 \
  -d mysql:8
Every interaction starts with a connection. You provide the host, port, user, password, and optionally a database name:
import mysql.connector
# Establish a connection
conn = mysql.connector.connect(
    host="127.0.0.1",
    port=3306,
    user="root",
    password="rootpass"
)
print("Connected:", conn.is_connected()) # True
# Always close when done
conn.close()
If the connection fails — wrong password, server not running, network issue — mysql.connector.Error is raised. Always wrap your connection logic in a try/except block:
import mysql.connector
from mysql.connector import Error
try:
    conn = mysql.connector.connect(
        host="127.0.0.1",
        user="root",
        password="rootpass"
    )
    if conn.is_connected():
        info = conn.get_server_info()
        print(f"Connected to MySQL Server version {info}")
except Error as e:
    print(f"Error connecting to MySQL: {e}")
finally:
    if 'conn' in locals() and conn.is_connected():
        conn.close()
        print("Connection closed")
Once connected, use a cursor to execute SQL statements. Let us create a database and a table:
import mysql.connector
from mysql.connector import Error
conn = mysql.connector.connect(
host="127.0.0.1",
user="root",
password="rootpass"
)
cursor = conn.cursor()
# Create database
cursor.execute("CREATE DATABASE IF NOT EXISTS tutorial_db")
cursor.execute("USE tutorial_db")
# Create table
create_table_sql = """
CREATE TABLE IF NOT EXISTS users (
id INT AUTO_INCREMENT PRIMARY KEY,
username VARCHAR(50) NOT NULL UNIQUE,
email VARCHAR(100) NOT NULL,
age INT,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
)
"""
cursor.execute(create_table_sql)
print("Database and table created successfully")
cursor.close()
conn.close()
You can also connect directly to a database by passing the database parameter:
conn = mysql.connector.connect(
host="127.0.0.1",
user="root",
password="rootpass",
database="tutorial_db"
)
CRUD — Create, Read, Update, Delete — covers the four fundamental data operations. Let us walk through each one.
Single insert:
import mysql.connector
conn = mysql.connector.connect(
host="127.0.0.1",
user="root",
password="rootpass",
database="tutorial_db"
)
cursor = conn.cursor()
sql = "INSERT INTO users (username, email, age) VALUES (%s, %s, %s)"
values = ("alice", "alice@example.com", 30)
cursor.execute(sql, values)
conn.commit() # IMPORTANT: commit the transaction
print(f"Inserted user with ID: {cursor.lastrowid}")
cursor.close()
conn.close()
Batch insert with executemany():
sql = "INSERT INTO users (username, email, age) VALUES (%s, %s, %s)"
users = [
("bob", "bob@example.com", 25),
("charlie", "charlie@example.com", 35),
("diana", "diana@example.com", 28),
("eve", "eve@example.com", 32),
]
cursor.executemany(sql, users)
conn.commit()
print(f"Inserted {cursor.rowcount} rows")
executemany() is significantly faster than looping with individual execute() calls because the driver can optimize the network round-trips.
The cursor provides three fetch methods:
fetchone() — returns the next row as a tuple, or None
fetchall() — returns all remaining rows as a list of tuples
fetchmany(size) — returns up to size rows

# Fetch all users
cursor.execute("SELECT id, username, email, age FROM users")
rows = cursor.fetchall()
for row in rows:
    print(f"ID: {row[0]}, Username: {row[1]}, Email: {row[2]}, Age: {row[3]}")
For more readable code, use a dictionary cursor so each row is a dict instead of a tuple:
cursor = conn.cursor(dictionary=True)
cursor.execute("SELECT * FROM users WHERE age > %s", (28,))
for user in cursor.fetchall():
    print(f"{user['username']} ({user['email']}) - Age {user['age']}")
Fetching one row at a time is memory-efficient for large result sets:
cursor.execute("SELECT * FROM users ORDER BY created_at DESC")
row = cursor.fetchone()
while row:
    print(row)
    row = cursor.fetchone()
Fetching in batches balances memory and performance:
cursor.execute("SELECT * FROM users")
while True:
    batch = cursor.fetchmany(size=2)
    if not batch:
        break
    for row in batch:
        print(row)
sql = "UPDATE users SET email = %s, age = %s WHERE username = %s"
values = ("alice_new@example.com", 31, "alice")
cursor.execute(sql, values)
conn.commit()
print(f"Rows affected: {cursor.rowcount}")
sql = "DELETE FROM users WHERE username = %s"
cursor.execute(sql, ("eve",))
conn.commit()
print(f"Deleted {cursor.rowcount} row(s)")
Always check cursor.rowcount after UPDATE and DELETE to confirm the operation affected the expected number of rows.
This is not optional — it is a hard requirement for any production code. Parameterized queries prevent SQL injection, one of the most dangerous and most common web vulnerabilities.
Never do this:
# DANGEROUS — SQL injection vulnerability!
username = input("Enter username: ")
cursor.execute(f"SELECT * FROM users WHERE username = '{username}'")
If a user enters ' OR '1'='1, that query returns every row in the table. Worse, they could enter '; DROP TABLE users; -- and destroy your data.
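You can see the damage without touching a database: string formatting happily splices the attack into the SQL text. A sketch (values are hypothetical, for illustration only):

```python
# What the driver receives when you build SQL with an f-string
malicious = "' OR '1'='1"
query = f"SELECT * FROM users WHERE username = '{malicious}'"
print(query)
# SELECT * FROM users WHERE username = '' OR '1'='1'
# The WHERE clause is now always true — every row comes back.
```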
Always do this:
# SAFE — parameterized query
username = input("Enter username: ")
cursor.execute("SELECT * FROM users WHERE username = %s", (username,))
user = cursor.fetchone()
The %s placeholder tells the driver to properly escape and quote the value. This works regardless of what the user types — the database sees it as a literal value, not executable SQL.
Key rules:
- Use %s as the placeholder (not ? — that is for SQLite).
- Pass parameters as a tuple, even for a single value: (value,).
- Never use string formatting (f"", .format(), or %) to build SQL.
A transaction groups multiple SQL statements into a single atomic unit. Either all of them succeed, or none of them do. MySQL with InnoDB supports full ACID transactions.
import mysql.connector
from mysql.connector import Error
conn = mysql.connector.connect(
host="127.0.0.1",
user="root",
password="rootpass",
database="tutorial_db"
)
try:
cursor = conn.cursor()
# Transfer "credits" from alice to bob (both must succeed)
cursor.execute(
"UPDATE users SET age = age - 1 WHERE username = %s", ("alice",)
)
cursor.execute(
"UPDATE users SET age = age + 1 WHERE username = %s", ("bob",)
)
conn.commit() # Both updates are saved
print("Transaction committed")
except Error as e:
conn.rollback() # Undo everything if any statement fails
print(f"Transaction rolled back: {e}")
finally:
cursor.close()
conn.close()
By default, mysql-connector-python does not auto-commit. You must call conn.commit() explicitly. If you want auto-commit behavior (not recommended for multi-statement operations), set it at connection time:
# Auto-commit mode — each statement is its own transaction
conn = mysql.connector.connect(
host="127.0.0.1",
user="root",
password="rootpass",
database="tutorial_db",
autocommit=True
)
When to use explicit transactions: whenever several statements must succeed or fail together, such as money transfers, multi-table writes, and batch updates where partial completion would leave the data inconsistent.
Opening and closing database connections is expensive. In a web application handling hundreds of requests per second, creating a new connection for every request wastes time and resources. Connection pooling solves this by maintaining a pool of reusable connections.
from mysql.connector import pooling
# Create a connection pool
pool = pooling.MySQLConnectionPool(
pool_name="tutorial_pool",
pool_size=5,
pool_reset_session=True,
host="127.0.0.1",
user="root",
password="rootpass",
database="tutorial_db"
)
# Get a connection from the pool
conn = pool.get_connection()
cursor = conn.cursor(dictionary=True)
cursor.execute("SELECT * FROM users")
for user in cursor.fetchall():
print(user)
cursor.close()
conn.close() # Returns the connection to the pool, does not destroy it
When you call conn.close() on a pooled connection, it goes back to the pool instead of being destroyed. The next call to pool.get_connection() can reuse it immediately.
Pool sizing guidelines:
- Start with pool_size=5 and increase based on load testing.
- Monitor active connections with SHOW STATUS LIKE 'Threads_connected' in MySQL.
Here is a thread-safe pattern for a web application:
from mysql.connector import pooling, Error
class Database:
"""Thread-safe database access using connection pooling."""
def __init__(self, **kwargs):
self.pool = pooling.MySQLConnectionPool(
pool_name="app_pool",
pool_size=10,
**kwargs
)
def execute_query(self, query, params=None, fetch=False):
conn = self.pool.get_connection()
try:
cursor = conn.cursor(dictionary=True)
cursor.execute(query, params)
if fetch:
result = cursor.fetchall()
else:
conn.commit()
result = cursor.rowcount
return result
except Error as e:
conn.rollback()
raise e
finally:
cursor.close()
conn.close()
# Usage
db = Database(
host="127.0.0.1",
user="root",
password="rootpass",
database="tutorial_db"
)
users = db.execute_query("SELECT * FROM users WHERE age > %s", (25,), fetch=True)
print(users)
Context managers (the with statement) guarantee that resources are cleaned up even if an exception occurs. Let us build a reusable context manager for database operations:
from contextlib import contextmanager
import mysql.connector
from mysql.connector import Error
@contextmanager
def get_db_connection(config):
"""Context manager that provides a database connection."""
conn = mysql.connector.connect(**config)
try:
yield conn
except Error as e:
conn.rollback()
raise e
finally:
conn.close()
@contextmanager
def get_db_cursor(conn, dictionary=True):
"""Context manager that provides a cursor and commits on success."""
cursor = conn.cursor(dictionary=dictionary)
try:
yield cursor
conn.commit()
except Error as e:
conn.rollback()
raise e
finally:
cursor.close()
# Configuration
DB_CONFIG = {
"host": "127.0.0.1",
"user": "root",
"password": "rootpass",
"database": "tutorial_db"
}
# Usage — clean and exception-safe
with get_db_connection(DB_CONFIG) as conn:
with get_db_cursor(conn) as cursor:
cursor.execute(
"INSERT INTO users (username, email, age) VALUES (%s, %s, %s)",
("frank", "frank@example.com", 29)
)
print(f"Inserted row ID: {cursor.lastrowid}")
# Connection and cursor are automatically closed here
This pattern is the recommended way to manage database resources in production Python applications. It eliminates an entire class of bugs — leaked connections, uncommitted transactions, and unclosed cursors.
For pooled connections, combine the two patterns:
from mysql.connector import pooling
from contextlib import contextmanager
pool = pooling.MySQLConnectionPool(
pool_name="app_pool",
pool_size=5,
host="127.0.0.1",
user="root",
password="rootpass",
database="tutorial_db"
)
@contextmanager
def get_connection():
conn = pool.get_connection()
try:
yield conn
finally:
conn.close() # Returns to pool
@contextmanager
def get_cursor(conn):
cursor = conn.cursor(dictionary=True)
try:
yield cursor
conn.commit()
except Exception:
conn.rollback()
raise
finally:
cursor.close()
# Usage
with get_connection() as conn:
with get_cursor(conn) as cursor:
cursor.execute("SELECT COUNT(*) AS total FROM users")
result = cursor.fetchone()
print(f"Total users: {result['total']}")
So far, every example has used raw SQL. That works well for simple applications and gives you full control. But as your application grows — more tables, more relationships, more complex queries — writing raw SQL becomes tedious and error-prone. That is where an ORM (Object-Relational Mapper) shines.
SQLAlchemy is Python’s most powerful and most widely used ORM. Install it alongside the MySQL driver:
pip install sqlalchemy mysql-connector-python
SQLAlchemy needs three things to get started: an engine (the connection source), a session factory, and a declarative base class for your models:
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker, DeclarativeBase
# Connection URL format: mysql+connector://user:password@host:port/database
engine = create_engine(
"mysql+mysqlconnector://root:rootpass@127.0.0.1:3306/tutorial_db",
echo=False, # Set True to log all SQL statements
pool_size=5,
max_overflow=10
)
# Create a session factory
SessionLocal = sessionmaker(bind=engine)
# Base class for models
class Base(DeclarativeBase):
pass
Each model class maps to a database table. Columns become class attributes:
from sqlalchemy import Column, Integer, String, DateTime, ForeignKey
from sqlalchemy.orm import relationship
from sqlalchemy.sql import func
class User(Base):
__tablename__ = "orm_users"
id = Column(Integer, primary_key=True, autoincrement=True)
username = Column(String(50), unique=True, nullable=False)
email = Column(String(100), nullable=False)
age = Column(Integer)
created_at = Column(DateTime, server_default=func.now())
# One-to-many relationship
posts = relationship("Post", back_populates="author",
cascade="all, delete-orphan")
def __repr__(self):
return f"<User(id={self.id}, username='{self.username}')>"
class Post(Base):
__tablename__ = "orm_posts"
id = Column(Integer, primary_key=True, autoincrement=True)
title = Column(String(200), nullable=False)
body = Column(String(5000))
user_id = Column(Integer, ForeignKey("orm_users.id"), nullable=False)
created_at = Column(DateTime, server_default=func.now())
author = relationship("User", back_populates="posts")
def __repr__(self):
return f"<Post(id={self.id}, title='{self.title}')>"
# Create all tables
Base.metadata.create_all(engine)
# CREATE
session = SessionLocal()
new_user = User(username="grace", email="grace@example.com", age=27)
session.add(new_user)
session.commit()
print(f"Created: {new_user}")
# Add a post for this user
new_post = Post(title="My First Post", body="Hello from SQLAlchemy!",
user_id=new_user.id)
session.add(new_post)
session.commit()
# READ
user = session.query(User).filter_by(username="grace").first()
print(f"Found: {user}")
print(f"Posts: {user.posts}") # Lazy-loaded relationship
# All users older than 25
users_over_25 = session.query(User).filter(User.age > 25).all()
for u in users_over_25:
print(u)
# UPDATE
user.email = "grace_updated@example.com"
session.commit()
# DELETE
session.delete(user) # Also deletes posts due to cascade
session.commit()
session.close()
With the ORM, notice how you never write a single line of SQL. SQLAlchemy generates it for you, handles parameterization, and manages the transaction lifecycle.
from contextlib import contextmanager
@contextmanager
def get_session():
session = SessionLocal()
try:
yield session
session.commit()
except Exception:
session.rollback()
raise
finally:
session.close()
# Usage
with get_session() as session:
user = User(username="henry", email="henry@example.com", age=33)
session.add(user)
# Automatically committed when the block exits without error
| Use ORM When | Use Raw SQL When |
|---|---|
| Building a CRUD-heavy application | Running complex analytical queries |
| You need relationship management | You need maximum query performance |
| Rapid prototyping and iteration | Migrating or bulk-loading data |
| Working with multiple database backends | Using database-specific features |
| Team members vary in SQL skill | Debugging performance issues |
Many production applications use both — ORM for standard CRUD and raw SQL (via session.execute()) for complex queries and reporting.
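As a sketch of that hybrid approach, raw SQL goes through the same session via SQLAlchemy's text() construct, still fully parameterized. This example uses an in-memory SQLite engine so it runs without a MySQL server; in the real application you would use the MySQL engine defined earlier:

```python
from sqlalchemy import create_engine, text
from sqlalchemy.orm import sessionmaker

# In-memory SQLite engine, chosen only so the snippet is self-contained
engine = create_engine("sqlite:///:memory:")
SessionLocal = sessionmaker(bind=engine)

with SessionLocal() as session:
    session.execute(text("CREATE TABLE users (id INTEGER PRIMARY KEY, age INTEGER)"))
    # Passing a list of dicts performs an executemany-style batch insert
    session.execute(
        text("INSERT INTO users (age) VALUES (:age)"),
        [{"age": 30}, {"age": 22}, {"age": 41}],
    )
    # Raw SQL for a report-style query, parameterized with named placeholders
    total = session.execute(
        text("SELECT COUNT(*) FROM users WHERE age > :min_age"),
        {"min_age": 25},
    ).scalar()
    print(total)  # 2
```

Note that text() uses named :param placeholders regardless of the underlying driver, so queries written this way stay portable across backends.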
A complete user management module with registration, authentication, and profile updates:
import mysql.connector
from mysql.connector import pooling, Error
import hashlib
import os
from contextlib import contextmanager
# --- Database Setup ---
pool = pooling.MySQLConnectionPool(
pool_name="user_mgmt_pool",
pool_size=5,
host="127.0.0.1",
user="root",
password="rootpass",
database="tutorial_db"
)
@contextmanager
def get_connection():
conn = pool.get_connection()
try:
yield conn
finally:
conn.close()
@contextmanager
def get_cursor(conn, dictionary=True):
cursor = conn.cursor(dictionary=dictionary)
try:
yield cursor
conn.commit()
except Error:
conn.rollback()
raise
finally:
cursor.close()
def init_db():
"""Create the accounts table if it does not exist."""
with get_connection() as conn:
with get_cursor(conn) as cursor:
cursor.execute("""
CREATE TABLE IF NOT EXISTS accounts (
id INT AUTO_INCREMENT PRIMARY KEY,
username VARCHAR(50) NOT NULL UNIQUE,
email VARCHAR(100) NOT NULL UNIQUE,
password_hash VARCHAR(128) NOT NULL,
salt VARCHAR(64) NOT NULL,
full_name VARCHAR(100),
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
ON UPDATE CURRENT_TIMESTAMP
)
""")
def hash_password(password, salt=None):
"""Hash a password with a random salt."""
if salt is None:
salt = os.urandom(32).hex()
hashed = hashlib.sha256((salt + password).encode()).hexdigest()
return hashed, salt
def register_user(username, email, password, full_name=None):
"""Register a new user. Returns user ID on success."""
password_hash, salt = hash_password(password)
with get_connection() as conn:
with get_cursor(conn) as cursor:
try:
cursor.execute(
"""INSERT INTO accounts
(username, email, password_hash, salt, full_name)
VALUES (%s, %s, %s, %s, %s)""",
(username, email, password_hash, salt, full_name)
)
print(f"User '{username}' registered with ID {cursor.lastrowid}")
return cursor.lastrowid
except Error as e:
if e.errno == 1062: # Duplicate entry
print("Registration failed: username or email already exists")
return None
raise
def login(username, password):
"""Authenticate a user. Returns user dict or None."""
with get_connection() as conn:
with get_cursor(conn) as cursor:
cursor.execute(
"""SELECT id, username, email, password_hash, salt, full_name
FROM accounts WHERE username = %s""",
(username,)
)
user = cursor.fetchone()
if user is None:
print("Login failed: user not found")
return None
hashed, _ = hash_password(password, user["salt"])
if hashed != user["password_hash"]:
print("Login failed: incorrect password")
return None
print(f"Welcome back, {user['full_name'] or user['username']}!")
return {
"id": user["id"],
"username": user["username"],
"email": user["email"],
"full_name": user["full_name"]
}
def update_profile(user_id, **kwargs):
"""Update user profile fields. Only updates provided fields."""
allowed_fields = {"email", "full_name"}
updates = {k: v for k, v in kwargs.items() if k in allowed_fields}
if not updates:
print("No valid fields to update")
return False
set_clause = ", ".join(f"{field} = %s" for field in updates)
values = list(updates.values()) + [user_id]
with get_connection() as conn:
with get_cursor(conn) as cursor:
cursor.execute(
f"UPDATE accounts SET {set_clause} WHERE id = %s",
tuple(values)
)
if cursor.rowcount > 0:
print(f"Profile updated for user ID {user_id}")
return True
print("User not found")
return False
# --- Demo ---
if __name__ == "__main__":
init_db()
# Register
user_id = register_user(
"johndoe", "john@example.com", "s3cur3P@ss", "John Doe"
)
# Login
user = login("johndoe", "s3cur3P@ss")
# Update profile
if user:
update_profile(
user["id"],
email="john.doe@newmail.com",
full_name="John A. Doe"
)
import mysql.connector
from mysql.connector import pooling, Error
from contextlib import contextmanager
from decimal import Decimal
pool = pooling.MySQLConnectionPool(
pool_name="inventory_pool",
pool_size=5,
host="127.0.0.1",
user="root",
password="rootpass",
database="tutorial_db"
)
@contextmanager
def db_cursor(dictionary=True):
conn = pool.get_connection()
cursor = conn.cursor(dictionary=dictionary)
try:
yield cursor
conn.commit()
except Error:
conn.rollback()
raise
finally:
cursor.close()
conn.close()
def init_inventory():
with db_cursor() as cursor:
cursor.execute("""
CREATE TABLE IF NOT EXISTS products (
id INT AUTO_INCREMENT PRIMARY KEY,
name VARCHAR(100) NOT NULL,
sku VARCHAR(50) NOT NULL UNIQUE,
price DECIMAL(10, 2) NOT NULL,
quantity INT NOT NULL DEFAULT 0,
category VARCHAR(50),
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
)
""")
def add_product(name, sku, price, quantity=0, category=None):
with db_cursor() as cursor:
cursor.execute(
"""INSERT INTO products (name, sku, price, quantity, category)
VALUES (%s, %s, %s, %s, %s)""",
(name, sku, price, quantity, category)
)
return cursor.lastrowid
def restock(sku, amount):
"""Add stock to an existing product."""
with db_cursor() as cursor:
cursor.execute(
"UPDATE products SET quantity = quantity + %s WHERE sku = %s",
(amount, sku)
)
if cursor.rowcount == 0:
raise ValueError(f"Product with SKU '{sku}' not found")
print(f"Restocked {amount} units of {sku}")
def sell(sku, amount):
"""Reduce stock. Raises error if insufficient stock."""
with db_cursor() as cursor:
# Check current stock
cursor.execute(
"SELECT quantity FROM products WHERE sku = %s", (sku,)
)
product = cursor.fetchone()
if product is None:
raise ValueError(f"Product '{sku}' not found")
if product["quantity"] < amount:
raise ValueError(
f"Insufficient stock: {product['quantity']} available, "
f"{amount} requested"
)
cursor.execute(
"UPDATE products SET quantity = quantity - %s WHERE sku = %s",
(amount, sku)
)
print(f"Sold {amount} units of {sku}")
def get_low_stock(threshold=10):
"""Find products that need restocking."""
with db_cursor() as cursor:
cursor.execute(
"""SELECT name, sku, quantity FROM products
WHERE quantity <= %s ORDER BY quantity ASC""",
(threshold,)
)
return cursor.fetchall()
def get_inventory_value():
"""Calculate total inventory value."""
with db_cursor() as cursor:
cursor.execute(
"SELECT SUM(price * quantity) AS total_value FROM products"
)
result = cursor.fetchone()
return result["total_value"] or Decimal("0.00")
def search_products(keyword):
"""Search products by name or category."""
with db_cursor() as cursor:
pattern = f"%{keyword}%"
cursor.execute(
"""SELECT * FROM products
WHERE name LIKE %s OR category LIKE %s""",
(pattern, pattern)
)
return cursor.fetchall()
# --- Demo ---
if __name__ == "__main__":
init_inventory()
# Add products
add_product("Mechanical Keyboard", "KB-001", 89.99, 50, "Electronics")
add_product("USB-C Cable", "CB-001", 12.99, 200, "Accessories")
add_product("Monitor Stand", "MS-001", 45.00, 15, "Furniture")
add_product("Webcam HD", "WC-001", 59.99, 8, "Electronics")
# Sell some items
sell("KB-001", 5)
restock("WC-001", 20)
# Reports
print("\nLow stock items:")
for item in get_low_stock(threshold=20):
print(f" {item['name']} (SKU: {item['sku']}): {item['quantity']} left")
print(f"\nTotal inventory value: ${get_inventory_value():,.2f}")
print("\nElectronics products:")
for p in search_products("Electronics"):
print(f" {p['name']} - ${p['price']} ({p['quantity']} in stock)")
A reusable data access layer that any application can build on — similar to a repository pattern used in web frameworks:
import mysql.connector
from mysql.connector import pooling, Error
from contextlib import contextmanager
class DataAccessLayer:
"""A generic, reusable data access layer for MySQL."""
def __init__(self, host, user, password, database, pool_size=5):
self.pool = pooling.MySQLConnectionPool(
pool_name="dal_pool",
pool_size=pool_size,
host=host,
user=user,
password=password,
database=database
)
@contextmanager
def _get_cursor(self):
conn = self.pool.get_connection()
cursor = conn.cursor(dictionary=True)
try:
yield cursor, conn
finally:
cursor.close()
conn.close()
def fetch_all(self, query, params=None):
"""Execute a SELECT and return all rows."""
with self._get_cursor() as (cursor, conn):
cursor.execute(query, params)
return cursor.fetchall()
def fetch_one(self, query, params=None):
"""Execute a SELECT and return the first row."""
with self._get_cursor() as (cursor, conn):
cursor.execute(query, params)
return cursor.fetchone()
def execute(self, query, params=None):
"""Execute INSERT, UPDATE, or DELETE. Returns affected row count."""
with self._get_cursor() as (cursor, conn):
cursor.execute(query, params)
conn.commit()
return cursor.rowcount
def insert(self, query, params=None):
"""Execute an INSERT and return the new row's ID."""
with self._get_cursor() as (cursor, conn):
cursor.execute(query, params)
conn.commit()
return cursor.lastrowid
def execute_many(self, query, params_list):
"""Execute a batch operation. Returns affected row count."""
with self._get_cursor() as (cursor, conn):
cursor.executemany(query, params_list)
conn.commit()
return cursor.rowcount
def execute_transaction(self, operations):
"""
Execute multiple operations in a single transaction.
operations: list of (query, params) tuples
"""
with self._get_cursor() as (cursor, conn):
try:
for query, params in operations:
cursor.execute(query, params)
conn.commit()
return True
except Error:
conn.rollback()
raise
# --- Usage Example ---
dal = DataAccessLayer(
host="127.0.0.1",
user="root",
password="rootpass",
database="tutorial_db"
)
# Insert
user_id = dal.insert(
"INSERT INTO users (username, email, age) VALUES (%s, %s, %s)",
("ivy", "ivy@example.com", 26)
)
# Read
users = dal.fetch_all("SELECT * FROM users WHERE age > %s", (25,))
for user in users:
print(user)
# Update
affected = dal.execute(
"UPDATE users SET age = %s WHERE username = %s",
(27, "ivy")
)
# Transaction
dal.execute_transaction([
("UPDATE users SET age = age - 1 WHERE username = %s", ("alice",)),
("UPDATE users SET age = age + 1 WHERE username = %s", ("bob",)),
])
These are the mistakes that burn developers most often. Learn them here so you do not learn them in a production outage.
We covered this above, but it bears repeating. Never build SQL strings with user input. Always use parameterized queries. This is the number-one security vulnerability in web applications, and it is completely preventable.
If your INSERTs and UPDATEs seem to work but the data disappears, you forgot to call conn.commit(). The default mode is manual commit — every write must be explicitly committed.
# This does NOTHING to the database without commit()
cursor.execute(
"INSERT INTO users (username, email) VALUES (%s, %s)",
("ghost", "ghost@example.com")
)
# conn.commit() <-- Missing! Data is lost when connection closes.
If you open connections without closing them, your application eventually exhausts the MySQL connection limit (default: 151). Use context managers or try/finally blocks to guarantee cleanup:
# BAD — if an exception occurs, connection is never closed
conn = mysql.connector.connect(**config)
cursor = conn.cursor()
cursor.execute("SELECT * FROM users")
# ... exception here means conn.close() never runs
conn.close()
# GOOD — finally block guarantees cleanup
conn = mysql.connector.connect(**config)
try:
cursor = conn.cursor()
cursor.execute("SELECT * FROM users")
results = cursor.fetchall()
finally:
conn.close()
This is especially common with ORMs. If you load a list of users, then loop through them loading each user's posts individually, you make 1 + N queries instead of a single JOIN:
# BAD — N+1 queries
users = session.query(User).all() # 1 query
for user in users:
print(user.posts) # N queries (1 per user)
# GOOD — eager loading with joinedload
from sqlalchemy.orm import joinedload
users = (
session.query(User)
.options(joinedload(User.posts))
.all()
) # 1 query
for user in users:
print(user.posts) # No additional queries
Database operations can fail for many reasons: deadlocks, timeouts, constraint violations, server restarts. Always wrap database calls in try/except and handle failures gracefully.
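Transient failures like deadlocks and dropped connections can often be retried safely. Here is a minimal, generic retry sketch; the helper name, retry counts, and the flaky stand-in function are illustrative, not part of mysql-connector-python:

```python
import time

def with_retries(operation, retries=3, delay=0.1, retriable=(Exception,)):
    """Run operation(), retrying on transient errors with a short backoff."""
    for attempt in range(1, retries + 1):
        try:
            return operation()
        except retriable:
            if attempt == retries:
                raise  # Out of attempts: let the caller handle it
            time.sleep(delay * attempt)  # Back off a little more each time

# Illustrative stand-in for a database call that fails twice, then succeeds
calls = {"count": 0}
def flaky_query():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("lost connection to MySQL server")
    return "ok"

print(with_retries(flaky_query, retriable=(ConnectionError,)))  # ok
```

In a real application you would pass mysql.connector.Error (or a narrower subset of its error codes) as the retriable exception type, and only retry operations that are safe to repeat.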
Never store raw passwords. Always hash them with a salt. Use bcrypt or argon2 in production — our example used SHA-256 for simplicity, but dedicated password hashing libraries are much more secure.
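As a sketch of a stronger stdlib option, hashlib.pbkdf2_hmac applies many iterations of HMAC-SHA256, which slows brute-force attacks dramatically compared with a single salted SHA-256 pass (the iteration count below is reduced for demonstration; production guidance suggests several hundred thousand):

```python
import hashlib
import hmac
import os

def derive_key(password, salt=None, iterations=100_000):
    """Derive a slow, salted key from a password using PBKDF2-HMAC-SHA256."""
    if salt is None:
        salt = os.urandom(16)
    key = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, iterations)
    return key, salt

def verify_password(password, key, salt):
    candidate, _ = derive_key(password, salt)
    return hmac.compare_digest(candidate, key)  # Constant-time comparison

key, salt = derive_key("s3cur3P@ss")
print(verify_password("s3cur3P@ss", key, salt))  # True
print(verify_password("wrong", key, salt))       # False
```

The constant-time comparison via hmac.compare_digest matters too: comparing hashes with == can leak timing information.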
- Catch mysql.connector.Error, log the details, and fail gracefully. Do not let raw database errors leak to your users.
- Keep credentials out of source code; load them from os.environ or a secrets manager.
- Parameterized queries (%s placeholders) are mandatory — they prevent SQL injection and should be your default.
- Explicit transactions (commit() / rollback()) ensure data consistency for multi-statement operations.
With these patterns and practices in your toolkit, you can confidently build Python applications backed by MySQL — from quick scripts to production web services.
NumPy (Numerical Python) is the foundational library for numerical computing in Python. If you’ve worked with data science, machine learning, image processing, or scientific computing in Python, you’ve almost certainly used NumPy — whether directly or through libraries built on top of it like pandas, scikit-learn, TensorFlow, and OpenCV.
Here’s why NumPy matters: its arrays are stored in contiguous memory and processed by compiled C code, making them far faster and more memory-efficient than Python lists; its vectorized operations eliminate explicit loops; and it serves as the common data format for the entire scientific Python ecosystem.
In this tutorial, we’ll go deep on NumPy arrays — from creation to manipulation, from indexing to linear algebra. By the end, you’ll have a solid, practical understanding of the library that underpins nearly all of Python’s data stack.
NumPy is available via pip. If you don’t have it installed yet:
pip install numpy
If you’re using Anaconda, NumPy comes pre-installed. You can verify your installation:
import numpy as np
print(np.__version__)
The convention of importing NumPy as np is universal in the Python ecosystem. Stick with it — every tutorial, Stack Overflow answer, and library documentation assumes this alias.
NumPy arrays (ndarray objects) are the core data structure. There are several ways to create them, each suited to different situations.
The most straightforward way to create a NumPy array is from an existing Python list or tuple:
import numpy as np
# 1D array
a = np.array([1, 2, 3, 4, 5])
print(a)
# Output: [1 2 3 4 5]
# 2D array (matrix)
b = np.array([[1, 2, 3],
[4, 5, 6]])
print(b)
# Output:
# [[1 2 3]
# [4 5 6]]
# 3D array
c = np.array([[[1, 2], [3, 4]],
[[5, 6], [7, 8]]])
print(c.shape)
# Output: (2, 2, 2)
# Specifying data type explicitly
d = np.array([1, 2, 3], dtype=np.float64)
print(d)
# Output: [1. 2. 3.]
When you need arrays pre-filled with zeros or ones (common for initializing weight matrices, accumulators, or masks):
# 1D array of zeros
zeros_1d = np.zeros(5)
print(zeros_1d)
# Output: [0. 0. 0. 0. 0.]
# 2D array of zeros (3 rows, 4 columns)
zeros_2d = np.zeros((3, 4))
print(zeros_2d)
# Output:
# [[0. 0. 0. 0.]
#  [0. 0. 0. 0.]
#  [0. 0. 0. 0.]]
# 1D array of ones
ones_1d = np.ones(4)
print(ones_1d)
# Output: [1. 1. 1. 1.]
# 2D array of ones with integer type
ones_int = np.ones((2, 3), dtype=np.int32)
print(ones_int)
# Output:
# [[1 1 1]
#  [1 1 1]]
# Full array with a custom fill value
filled = np.full((2, 3), 7)
print(filled)
# Output:
# [[7 7 7]
#  [7 7 7]]
# Identity matrix
eye = np.eye(3)
print(eye)
# Output:
# [[1. 0. 0.]
#  [0. 1. 0.]
#  [0. 0. 1.]]
np.arange() works like Python’s range() but returns an array. np.linspace() creates evenly spaced values between two endpoints — extremely useful for plotting and numerical methods.
# arange: start, stop (exclusive), step
a = np.arange(0, 10, 2)
print(a)
# Output: [0 2 4 6 8]
# arange with float step
b = np.arange(0, 1, 0.2)
print(b)
# Output: [0.  0.2 0.4 0.6 0.8]
# linspace: start, stop (inclusive), number of points
c = np.linspace(0, 1, 5)
print(c)
# Output: [0.   0.25 0.5  0.75 1.  ]
# linspace is ideal for generating x-values for plots
x = np.linspace(0, 2 * np.pi, 100)  # 100 points from 0 to 2π
NumPy’s random module is essential for simulations, testing, and machine learning initialization:
# Uniform random values between 0 and 1
rand_uniform = np.random.rand(3, 3)
print(rand_uniform)
# Output: 3x3 matrix of random floats in [0, 1)
# Standard normal distribution (mean=0, std=1)
rand_normal = np.random.randn(3, 3)
print(rand_normal)
# Output: 3x3 matrix of values from normal distribution
# Random integers
rand_int = np.random.randint(1, 100, size=(2, 4))
print(rand_int)
# Output: 2x4 matrix of random ints between 1 and 99
# Reproducible random numbers with seed
np.random.seed(42)
reproducible = np.random.rand(3)
print(reproducible)
# Output: [0.37454012 0.95071431 0.73199394]
# Using the newer Generator API (recommended for new code)
rng = np.random.default_rng(seed=42)
values = rng.random(5)
print(values)
# Output: [0.77395605 0.43887844 0.85859792 0.69736803 0.09417735]
# Random choice from an array
choices = rng.choice([10, 20, 30, 40, 50], size=3, replace=False)
print(choices)
# Output: 3 random elements without replacement
Understanding array properties is essential for debugging and writing correct NumPy code. Every ndarray carries metadata about its structure:
import numpy as np
arr = np.array([[1, 2, 3, 4],
[5, 6, 7, 8],
[9, 10, 11, 12]])
# shape: dimensions as a tuple (rows, columns)
print(f"Shape: {arr.shape}")
# Output: Shape: (3, 4)
# ndim: number of dimensions (axes)
print(f"Dimensions: {arr.ndim}")
# Output: Dimensions: 2
# size: total number of elements
print(f"Total elements: {arr.size}")
# Output: Total elements: 12
# dtype: data type of elements
print(f"Data type: {arr.dtype}")
# Output: Data type: int64
# itemsize: size of each element in bytes
print(f"Bytes per element: {arr.itemsize}")
# Output: Bytes per element: 8
# nbytes: total memory consumed
print(f"Total bytes: {arr.nbytes}")
# Output: Total bytes: 96
# Practical example: understanding memory usage
large_arr = np.zeros((1000, 1000), dtype=np.float64)
print(f"Memory: {large_arr.nbytes / 1024 / 1024:.1f} MB")
# Output: Memory: 7.6 MB
# Same array with float32 uses half the memory
small_arr = np.zeros((1000, 1000), dtype=np.float32)
print(f"Memory: {small_arr.nbytes / 1024 / 1024:.1f} MB")
# Output: Memory: 3.8 MB
The dtype attribute is particularly important. NumPy supports many data types: int8, int16, int32, int64, float16, float32, float64, complex64, complex128, bool, and more. Choosing the right dtype can significantly impact both memory usage and computation speed.
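Be careful with small integer types, though: values that exceed the type's range wrap around silently. A quick sketch of converting dtypes with astype() and what overflow looks like:

```python
import numpy as np

a = np.array([1, 2, 3], dtype=np.int64)
print(a.dtype)  # int64

# astype() returns a NEW array with the requested dtype
b = a.astype(np.float32)
print(b)        # [1. 2. 3.]
print(b.dtype)  # float32

# Small integer types have small ranges: int8 holds -128..127
c = np.array([127], dtype=np.int8)
print(c + np.int8(1))  # [-128] (silent wraparound, not an error)
```

A common workflow is to load data at full precision, verify its range, and then downcast with astype() to save memory.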
NumPy’s indexing is more powerful than Python list indexing. Mastering it will save you from writing unnecessary loops.
arr = np.array([10, 20, 30, 40, 50, 60, 70, 80])
# Basic indexing (0-based)
print(arr[0])    # 10
print(arr[-1])   # 80
print(arr[-2])   # 70
# Slicing: start:stop:step
print(arr[2:5])   # [30 40 50]
print(arr[:3])    # [10 20 30]
print(arr[5:])    # [60 70 80]
print(arr[::2])   # [10 30 50 70] — every other element
print(arr[::-1])  # [80 70 60 50 40 30 20 10] — reversed
matrix = np.array([[1, 2, 3, 4],
[5, 6, 7, 8],
[9, 10, 11, 12],
[13, 14, 15, 16]])
# Single element: [row, col]
print(matrix[0, 0]) # 1
print(matrix[2, 3]) # 12
# Entire row
print(matrix[1]) # [5 6 7 8]
print(matrix[1, :]) # [5 6 7 8] — equivalent
# Entire column
print(matrix[:, 2]) # [ 3 7 11 15]
# Sub-matrix (rows 0-1, columns 1-2)
print(matrix[0:2, 1:3])
# Output:
# [[2 3]
# [6 7]]
# Every other row, every other column
print(matrix[::2, ::2])
# Output:
# [[ 1 3]
# [ 9 11]]
Boolean indexing is one of NumPy’s most powerful features. You create a boolean mask and use it to filter elements:
arr = np.array([15, 22, 8, 41, 3, 67, 29, 55])
# Elements greater than 20
mask = arr > 20
print(mask)
# Output: [False  True False  True False  True  True  True]
print(arr[mask])
# Output: [22 41 67 29 55]
# Shorthand — most common pattern
print(arr[arr > 20])
# Output: [22 41 67 29 55]
# Combining conditions (use & for AND, | for OR, ~ for NOT)
print(arr[(arr > 10) & (arr < 50)])
# Output: [15 22 41 29]
print(arr[(arr < 10) | (arr > 50)])
# Output: [ 8  3 67 55]
# Boolean indexing on 2D arrays
matrix = np.array([[1, 2], [3, 4], [5, 6]])
print(matrix[matrix % 2 == 0])
# Output: [2 4 6] — returns a flat array of even numbers
Fancy indexing lets you use arrays of indices to access multiple elements at once:
arr = np.array([10, 20, 30, 40, 50])
# Select elements at indices 0, 2, and 4
indices = np.array([0, 2, 4])
print(arr[indices])
# Output: [10 30 50]
# Works with 2D arrays too
matrix = np.array([[1, 2, 3],
[4, 5, 6],
[7, 8, 9],
[10, 11, 12]])
# Select specific rows
print(matrix[[0, 2, 3]])
# Output:
# [[ 1 2 3]
# [ 7 8 9]
# [10 11 12]]
# Select specific elements: (row0,col1), (row1,col2), (row2,col0)
rows = np.array([0, 1, 2])
cols = np.array([1, 2, 0])
print(matrix[rows, cols])
# Output: [2 6 7]
NumPy’s real power shows up in array operations. Everything is vectorized — no loops needed.
a = np.array([1, 2, 3, 4])
b = np.array([10, 20, 30, 40])
# Arithmetic is element-wise
print(a + b)   # [11 22 33 44]
print(a - b)   # [ -9 -18 -27 -36]
print(a * b)   # [ 10  40  90 160]
print(b / a)   # [10. 10. 10. 10.]
print(a ** 2)  # [ 1  4  9 16]
# Comparison operators return boolean arrays
print(a > 2)   # [False False  True  True]
print(a == b)  # [False False False False]
# Scalar operations are broadcast to every element
print(a + 100) # [101 102 103 104]
print(a * 3)   # [ 3  6  9 12]
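The payoff of vectorization is speed: the arithmetic runs in compiled C rather than in the Python interpreter loop. A rough comparison (exact timings vary by machine):

```python
import time
import numpy as np

n = 1_000_000
data = np.arange(n, dtype=np.float64)

# Pure-Python loop: one interpreter iteration per element
start = time.perf_counter()
loop_total = 0.0
for x in data:
    loop_total += x * x
loop_time = time.perf_counter() - start

# Vectorized equivalent: one call into compiled code
start = time.perf_counter()
vec_total = float(np.sum(data * data))
vec_time = time.perf_counter() - start

print(f"loop: {loop_time:.3f}s, vectorized: {vec_time:.3f}s")
print(np.isclose(loop_total, vec_total))  # Same result, different speed
```

On a typical machine the vectorized version is one to two orders of magnitude faster, which is why idiomatic NumPy code avoids Python-level loops over array elements.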
Broadcasting is the mechanism that lets NumPy perform operations on arrays of different shapes. It’s one of the most important concepts to understand:
# Broadcasting a scalar across an array
arr = np.array([[1, 2, 3],
[4, 5, 6]])
print(arr * 10)
# Output:
# [[10 20 30]
# [40 50 60]]
# Broadcasting a 1D array across rows of a 2D array
row = np.array([100, 200, 300])
print(arr + row)
# Output:
# [[101 202 303]
# [104 205 306]]
# Broadcasting a column vector across columns
col = np.array([[10],
[20]])
print(arr + col)
# Output:
# [[11 12 13]
# [24 25 26]]
# Practical example: centering data (subtracting column means)
data = np.array([[1.0, 200, 3000],
[2.0, 400, 6000],
[3.0, 600, 9000]])
col_means = data.mean(axis=0)
print(f"Column means: {col_means}")
# Output: Column means: [2.000e+00 4.000e+02 6.000e+03]
centered = data - col_means
print(centered)
# Output:
# [[-1.000e+00 -2.000e+02 -3.000e+03]
# [ 0.000e+00 0.000e+00 0.000e+00]
# [ 1.000e+00 2.000e+02 3.000e+03]]
Broadcasting rules:
- Shapes are compared element-wise, starting from the trailing (rightmost) dimension.
- Two dimensions are compatible when they are equal, or when one of them is 1.
- If the shapes cannot be made compatible, NumPy raises a ValueError.
arr = np.array([[1, 2, 3],
[4, 5, 6],
[7, 8, 9]])
# Global aggregations
print(f"Sum: {arr.sum()}") # 45
print(f"Mean: {arr.mean()}") # 5.0
print(f"Min: {arr.min()}") # 1
print(f"Max: {arr.max()}") # 9
print(f"Std Dev: {arr.std():.4f}") # 2.5820
# Aggregation along axes
# axis=0 → collapse rows (compute across rows → one value per column)
# axis=1 → collapse columns (compute across columns → one value per row)
print(f"Column sums: {arr.sum(axis=0)}") # [12 15 18]
print(f"Row sums: {arr.sum(axis=1)}") # [ 6 15 24]
print(f"Column means: {arr.mean(axis=0)}") # [4. 5. 6.]
print(f"Row means: {arr.mean(axis=1)}") # [2. 5. 8.]
# Other useful aggregations
print(f"Cumulative sum: {np.array([1,2,3,4]).cumsum()}")
# Output: [ 1 3 6 10]
print(f"Product: {np.array([1,2,3,4]).prod()}")
# Output: 24
# argmin and argmax — index of min/max value
scores = np.array([82, 91, 76, 95, 88])
print(f"Best score index: {scores.argmax()}") # 3
print(f"Worst score index: {scores.argmin()}") # 2
Reshaping lets you change the dimensions of an array without changing its data. This is critical when preparing data for machine learning models or matrix operations.
arr = np.arange(12)
print(arr)
# Output: [ 0  1  2  3  4  5  6  7  8  9 10 11]
# Reshape to 3 rows × 4 columns
reshaped = arr.reshape(3, 4)
print(reshaped)
# Output:
# [[ 0  1  2  3]
#  [ 4  5  6  7]
#  [ 8  9 10 11]]
# Reshape to 4 rows × 3 columns
print(arr.reshape(4, 3))
# Output:
# [[ 0  1  2]
#  [ 3  4  5]
#  [ 6  7  8]
#  [ 9 10 11]]
# Use -1 to let NumPy infer one dimension
print(arr.reshape(2, -1))  # 2 rows, auto-compute columns → (2, 6)
print(arr.reshape(-1, 3))  # auto-compute rows, 3 columns → (4, 3)
# Reshape to 3D
print(arr.reshape(2, 2, 3).shape)
# Output: (2, 2, 3)
# IMPORTANT: total elements must match
# arr.reshape(3, 5)  # ValueError: cannot reshape array of size 12 into shape (3,5)
matrix = np.array([[1, 2, 3],
[4, 5, 6]])
# flatten() — always returns a copy
flat = matrix.flatten()
print(flat)
# Output: [1 2 3 4 5 6]
flat[0] = 999
print(matrix[0, 0]) # 1 — original unchanged (it's a copy)
# ravel() — returns a view when possible (more memory efficient)
raveled = matrix.ravel()
print(raveled)
# Output: [1 2 3 4 5 6]
raveled[0] = 999
print(matrix[0, 0]) # 999 — original IS changed (it's a view)
matrix = np.array([[1, 2, 3],
[4, 5, 6]])
print(f"Original shape: {matrix.shape}")
# Output: Original shape: (2, 3)
transposed = matrix.T
print(f"Transposed shape: {transposed.shape}")
# Output: Transposed shape: (3, 2)
print(transposed)
# Output:
# [[1 4]
# [2 5]
# [3 6]]
# np.transpose() and .T are equivalent for 2D arrays
# For higher dimensions, np.transpose() lets you specify axis order
arr_3d = np.arange(24).reshape(2, 3, 4)
print(arr_3d.shape) # (2, 3, 4)
print(np.transpose(arr_3d, (1, 0, 2)).shape) # (3, 2, 4)
Combining and dividing arrays is a common operation when preparing datasets or assembling results.
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
# Vertical stack — adds rows
vs = np.vstack([a, b])
print(vs)
# Output:
# [[1 2 3]
#  [4 5 6]]
# Horizontal stack — concatenates side by side
hs = np.hstack([a, b])
print(hs)
# Output: [1 2 3 4 5 6]
# 2D stacking
m1 = np.array([[1, 2], [3, 4]])
m2 = np.array([[5, 6], [7, 8]])
print(np.vstack([m1, m2]))
# Output:
# [[1 2]
#  [3 4]
#  [5 6]
#  [7 8]]
print(np.hstack([m1, m2]))
# Output:
# [[1 2 5 6]
#  [3 4 7 8]]
# np.concatenate — general purpose (specify axis)
print(np.concatenate([m1, m2], axis=0))  # same as vstack
print(np.concatenate([m1, m2], axis=1))  # same as hstack
# Column stack — treats 1D arrays as columns
c1 = np.array([1, 2, 3])
c2 = np.array([4, 5, 6])
print(np.column_stack([c1, c2]))
# Output:
# [[1 4]
#  [2 5]
#  [3 6]]
arr = np.arange(16).reshape(4, 4)
print(arr)
# Output:
# [[ 0 1 2 3]
# [ 4 5 6 7]
# [ 8 9 10 11]
# [12 13 14 15]]
# Split into 2 equal parts along rows (axis=0)
top, bottom = np.vsplit(arr, 2)
print("Top:\n", top)
# Output:
# [[0 1 2 3]
# [4 5 6 7]]
print("Bottom:\n", bottom)
# Output:
# [[ 8 9 10 11]
# [12 13 14 15]]
# Split into 2 equal parts along columns (axis=1)
left, right = np.hsplit(arr, 2)
print("Left:\n", left)
# Output:
# [[ 0 1]
# [ 4 5]
# [ 8 9]
# [12 13]]
# Split at specific indices
first, second, third = np.split(arr, [1, 3], axis=0)
print(f"First (row 0): {first}")
print(f"Second (rows 1-2):\n{second}")
print(f"Third (row 3): {third}")
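Note that `np.split`, `np.vsplit`, and `np.hsplit` require the array to divide evenly into the requested number of sections. When it doesn't, `np.array_split` is the forgiving alternative; a quick sketch:

```python
import numpy as np

arr = np.arange(10)

# np.split raises ValueError when 10 elements can't split evenly into 3
try:
    np.split(arr, 3)
except ValueError as e:
    print(f"Error: {e}")

# np.array_split distributes the remainder across the leading chunks
chunks = np.array_split(arr, 3)
for c in chunks:
    print(c)
# Chunk sizes: 4, 3, 3
```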
NumPy provides a comprehensive set of mathematical functions — all vectorized and optimized.
arr = np.array([1, 4, 9, 16, 25])
# Square root
print(np.sqrt(arr))
# Output: [1. 2. 3. 4. 5.]
# Exponential (e^x)
print(np.exp(np.array([0, 1, 2])))
# Output: [1.         2.71828183 7.3890561 ]
# Natural logarithm
print(np.log(np.array([1, np.e, np.e**2])))
# Output: [0. 1. 2.]
# Log base 10 and base 2
print(np.log10(np.array([1, 10, 100, 1000])))
# Output: [0. 1. 2. 3.]
print(np.log2(np.array([1, 2, 4, 8])))
# Output: [0. 1. 2. 3.]
# Trigonometric functions
angles = np.array([0, np.pi/6, np.pi/4, np.pi/3, np.pi/2])
print(np.sin(angles))
# Output: [0.         0.5        0.70710678 0.8660254  1.        ]
print(np.cos(angles))
# Output: [1.00000000e+00 8.66025404e-01 7.07106781e-01 5.00000000e-01 6.12323400e-17]
# Absolute value
print(np.abs(np.array([-3, -1, 0, 2, 5])))
# Output: [3 1 0 2 5]
# Rounding
vals = np.array([1.23, 2.67, 3.5, 4.89])
print(np.round(vals, 1))  # [1.2 2.7 3.5 4.9]
print(np.floor(vals))     # [1. 2. 3. 4.]
print(np.ceil(vals))      # [2. 3. 4. 5.]
# Dot product of 1D arrays (scalar result)
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
print(np.dot(a, b))
# Output: 32 (1*4 + 2*5 + 3*6)
# Matrix multiplication
A = np.array([[1, 2],
[3, 4]])
B = np.array([[5, 6],
[7, 8]])
# Three equivalent ways to multiply matrices
print(np.dot(A, B))
print(A @ B) # @ operator (Python 3.5+)
print(np.matmul(A, B))
# All output:
# [[19 22]
# [43 50]]
# IMPORTANT: * is element-wise, NOT matrix multiplication
print(A * B)
# Output:
# [[ 5 12]
# [21 32]]
# Cross product
print(np.cross(np.array([1, 0, 0]), np.array([0, 1, 0])))
# Output: [0 0 1]
A = np.array([[1, 2],
[3, 4]])
# Determinant
print(f"Determinant: {np.linalg.det(A):.1f}")
# Output: Determinant: -2.0
# Inverse
A_inv = np.linalg.inv(A)
print(f"Inverse:\n{A_inv}")
# Output:
# [[-2. 1. ]
# [ 1.5 -0.5]]
# Verify: A × A_inv = Identity
print(np.round(A @ A_inv))
# Output:
# [[1. 0.]
# [0. 1.]]
# Eigenvalues and eigenvectors
eigenvalues, eigenvectors = np.linalg.eig(A)
print(f"Eigenvalues: {eigenvalues}")
print(f"Eigenvectors:\n{eigenvectors}")
# Matrix rank
print(f"Rank: {np.linalg.matrix_rank(A)}")
# Output: Rank: 2
# Norm
print(f"Frobenius norm: {np.linalg.norm(A):.4f}")
# Output: Frobenius norm: 5.4772
Understanding why NumPy is faster than Python lists is important for making good design decisions.
import numpy as np
import time
size = 1_000_000
# Python list approach
py_list = list(range(size))
start = time.time()
py_result = [x ** 2 for x in py_list]
py_time = time.time() - start
print(f"Python list: {py_time:.4f} seconds")
# NumPy approach
np_arr = np.arange(size)
start = time.time()
np_result = np_arr ** 2
np_time = time.time() - start
print(f"NumPy array: {np_time:.4f} seconds")
print(f"NumPy is {py_time / np_time:.0f}x faster")
# Typical output:
# Python list: 0.1654 seconds
# NumPy array: 0.0012 seconds
# NumPy is 138x faster
import sys
# Python list of 1000 integers
py_list = list(range(1000))
py_size = sys.getsizeof(py_list) + sum(sys.getsizeof(x) for x in py_list)
print(f"Python list: {py_size:,} bytes")
# NumPy array of 1000 integers
np_arr = np.arange(1000, dtype=np.int64)
print(f"NumPy array: {np_arr.nbytes:,} bytes")
print(f"Python list uses {py_size / np_arr.nbytes:.1f}x more memory")
# Typical output:
# Python list: 36,056 bytes
# NumPy array: 8,000 bytes
# Python list uses 4.5x more memory
Why is NumPy faster? Three reasons stand out. First, a NumPy array stores elements of a single fixed dtype in one contiguous block of memory, while a Python list stores pointers to individually boxed objects scattered across the heap. Second, NumPy's loops execute in compiled C code rather than the bytecode interpreter, eliminating per-element type checks and dispatch. Third, contiguous, homogeneous storage lets the CPU exploit caches and SIMD vector instructions.
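The fixed-dtype, contiguous layout is visible directly on any array through its attributes; a small sketch:

```python
import numpy as np

arr = np.arange(1000, dtype=np.int64)

print(arr.dtype)     # int64 — one fixed type shared by every element
print(arr.itemsize)  # 8 — bytes per element
print(arr.nbytes)    # 8000 — itemsize × number of elements, no per-object overhead
print(arr.flags['C_CONTIGUOUS'])  # True — one contiguous memory block
```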
Digital images are just NumPy arrays. A grayscale image is a 2D array; a color image is 3D (height × width × channels).
import numpy as np
# Simulate a small 5x5 grayscale image (values 0-255)
image = np.array([
[50, 80, 120, 160, 200],
[55, 85, 125, 165, 205],
[60, 90, 130, 170, 210],
[65, 95, 135, 175, 215],
[70, 100, 140, 180, 220]
], dtype=np.uint8)
print(f"Image shape: {image.shape}")
print(f"Pixel value range: {image.min()} - {image.max()}")
# Invert the image (negative)
inverted = 255 - image
print(f"Inverted:\n{inverted}")
# Increase brightness (clamp to 255)
brightened = np.clip(image.astype(np.int16) + 50, 0, 255).astype(np.uint8)
print(f"Brightened:\n{brightened}")
# Threshold to binary (black/white)
threshold = 128
binary = (image > threshold).astype(np.uint8) * 255
print(f"Binary:\n{binary}")
# Normalize to [0, 1] range (common preprocessing step)
normalized = image.astype(np.float32) / 255.0
print(f"Normalized range: {normalized.min():.2f} - {normalized.max():.2f}")
# Simulate RGB image processing
rgb_image = np.random.randint(0, 256, size=(100, 100, 3), dtype=np.uint8)
print(f"RGB shape: {rgb_image.shape}") # (100, 100, 3)
# Convert to grayscale using weighted average
weights = np.array([0.2989, 0.5870, 0.1140]) # Standard luminance weights
grayscale = np.dot(rgb_image[...,:3], weights).astype(np.uint8)
print(f"Grayscale shape: {grayscale.shape}") # (100, 100)
import numpy as np
# Simulate exam scores for 5 subjects, 100 students
np.random.seed(42)
scores = np.random.normal(loc=72, scale=12, size=(100, 5))
scores = np.clip(scores, 0, 100).round(1)
subjects = ['Math', 'Science', 'English', 'History', 'Art']
print("=== Class Statistics ===\n")
# Per-subject statistics
for i, subject in enumerate(subjects):
col = scores[:, i]
print(f"{subject:>10}: mean={col.mean():.1f}, "
f"std={col.std():.1f}, "
f"min={col.min():.1f}, "
f"max={col.max():.1f}, "
f"median={np.median(col):.1f}")
print(f"\n{'Overall':>10}: mean={scores.mean():.1f}, std={scores.std():.1f}")
# Find top 5 students by average score
student_averages = scores.mean(axis=1)
top_5_indices = np.argsort(student_averages)[-5:][::-1]
print(f"\nTop 5 students (by index): {top_5_indices}")
for idx in top_5_indices:
print(f" Student {idx}: avg = {student_averages[idx]:.1f}")
# Correlation between subjects
correlation = np.corrcoef(scores.T)
print(f"\nCorrelation matrix shape: {correlation.shape}")
print(f"Math-Science correlation: {correlation[0, 1]:.3f}")
# Percentile analysis
print(f"\n90th percentile per subject:")
for i, subject in enumerate(subjects):
p90 = np.percentile(scores[:, i], 90)
print(f" {subject}: {p90:.1f}")
# Students scoring above 90 in all subjects
high_achievers = np.all(scores > 90, axis=1)
print(f"\nStudents scoring >90 in ALL subjects: {high_achievers.sum()}")
Solving systems of linear equations is a fundamental operation in engineering and data science. Consider:
import numpy as np
# Solve the system:
# 2x + 3y - z = 1
# 4x + y + 2z = 2
# -2x + 7y - 3z = -1
# Coefficient matrix
A = np.array([[2, 3, -1],
[4, 1, 2],
[-2, 7, -3]])
# Constants vector
b = np.array([1, 2, -1])
# Solve using np.linalg.solve (faster and more stable than computing inverse)
x = np.linalg.solve(A, b)
print(f"Solution: x={x[0]:.4f}, y={x[1]:.4f}, z={x[2]:.4f}")
# Verify the solution
residual = A @ x - b
print(f"Residual (should be ~0): {residual}")
print(f"Max error: {np.abs(residual).max():.2e}")
# Least squares solution for overdetermined systems
# (more equations than unknowns — common in data fitting)
# Fit y = mx + c to noisy data
np.random.seed(42)
x_data = np.linspace(0, 10, 50)
y_data = 2.5 * x_data + 1.3 + np.random.normal(0, 1, 50)
# Set up matrix A for y = mx + c
A_fit = np.column_stack([x_data, np.ones(len(x_data))])
# Solve via least squares
result, residuals, rank, sv = np.linalg.lstsq(A_fit, y_data, rcond=None)
m, c = result
print(f"\nLeast squares fit: y = {m:.4f}x + {c:.4f}")
print(f"(True values: y = 2.5000x + 1.3000)")
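For straight-line and polynomial fits like this, `np.polyfit` wraps the same least-squares machinery in a single call. A sketch that regenerates the same synthetic data so the block stands alone:

```python
import numpy as np

# Same noisy linear data as above (seed 42, true line y = 2.5x + 1.3)
np.random.seed(42)
x_data = np.linspace(0, 10, 50)
y_data = 2.5 * x_data + 1.3 + np.random.normal(0, 1, 50)

# Degree-1 polynomial fit: returns coefficients [slope, intercept]
m, c = np.polyfit(x_data, y_data, deg=1)
print(f"polyfit: y = {m:.4f}x + {c:.4f}")

# Agrees with np.linalg.lstsq on the equivalent design matrix
A_fit = np.column_stack([x_data, np.ones(len(x_data))])
(m2, c2), *_ = np.linalg.lstsq(A_fit, y_data, rcond=None)
print(np.allclose([m, c], [m2, c2]))  # True
```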
Normalization and standardization are essential preprocessing steps in machine learning. NumPy makes them trivial:
import numpy as np
# Sample dataset: 5 samples with 3 features of different scales
data = np.array([
[25.0, 50000, 3.5],
[30.0, 60000, 4.2],
[22.0, 45000, 3.1],
[35.0, 80000, 4.8],
[28.0, 55000, 3.9]
])
feature_names = ['Age', 'Salary', 'GPA']
print("Original data:")
print(data)
# Min-Max Normalization: scale to [0, 1]
min_vals = data.min(axis=0)
max_vals = data.max(axis=0)
normalized = (data - min_vals) / (max_vals - min_vals)
print(f"\nMin-Max Normalized (range [0, 1]):")
for i, name in enumerate(feature_names):
print(f" {name}: min={normalized[:, i].min():.2f}, max={normalized[:, i].max():.2f}")
print(normalized)
# Z-Score Standardization: mean=0, std=1
mean_vals = data.mean(axis=0)
std_vals = data.std(axis=0)
standardized = (data - mean_vals) / std_vals
print(f"\nZ-Score Standardized (mean≈0, std≈1):")
for i, name in enumerate(feature_names):
print(f" {name}: mean={standardized[:, i].mean():.4f}, std={standardized[:, i].std():.4f}")
print(standardized)
# Robust scaling (using median and IQR — resistant to outliers)
median_vals = np.median(data, axis=0)
q75 = np.percentile(data, 75, axis=0)
q25 = np.percentile(data, 25, axis=0)
iqr = q75 - q25
robust_scaled = (data - median_vals) / iqr
print(f"\nRobust Scaled (using median and IQR):")
print(robust_scaled)
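When model outputs need to be reported in original units, the scaling must be invertible. Min-max normalization inverts cleanly as long as you keep the per-feature min and max; a minimal sketch reusing the same formulas:

```python
import numpy as np

data = np.array([[25.0, 50000, 3.5],
                 [30.0, 60000, 4.2],
                 [35.0, 80000, 4.8]])

# Forward: scale each feature to [0, 1], keeping the parameters
min_vals = data.min(axis=0)
max_vals = data.max(axis=0)
normalized = (data - min_vals) / (max_vals - min_vals)

# Inverse: multiply by the range and add the minimum back
restored = normalized * (max_vals - min_vals) + min_vals
print(np.allclose(restored, data))  # True
```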
Even experienced developers trip over these. Save yourself the debugging time.
This is the single most common source of bugs in NumPy code:
import numpy as np

original = np.array([1, 2, 3, 4, 5])
# Slicing creates a VIEW, not a copy
view = original[1:4]
view[0] = 999
print(original)
# Output: [  1 999   3   4   5] — original is modified!
# To create an independent copy, use .copy()
original = np.array([1, 2, 3, 4, 5])
safe_copy = original[1:4].copy()
safe_copy[0] = 999
print(original)
# Output: [1 2 3 4 5] — original is safe
# How to check: use np.shares_memory()
a = np.array([1, 2, 3, 4, 5])
b = a[1:4]
c = a[1:4].copy()
print(np.shares_memory(a, b))  # True — b is a view
print(np.shares_memory(a, c))  # False — c is a copy
# Boolean and fancy indexing ALWAYS return copies
d = a[a > 2]
print(np.shares_memory(a, d))  # False
import numpy as np
a = np.array([[1, 2, 3],
[4, 5, 6]]) # shape (2, 3)
# This works — (3,) broadcasts to (2, 3)
row = np.array([10, 20, 30])
print(a + row)
# This FAILS — shapes (2, 3) and (2,) are incompatible
col_wrong = np.array([10, 20])
try:
print(a + col_wrong)
except ValueError as e:
print(f"Error: {e}")
# Error: operands could not be broadcast together with shapes (2,3) (2,)
# Fix: reshape to column vector (2, 1)
col_right = np.array([[10], [20]]) # shape (2, 1)
print(a + col_right)
# Output:
# [[11 12 13]
# [24 25 26]]
# Alternatively, use np.newaxis (or None — they're the same)
col_also_right = np.array([10, 20])[:, np.newaxis]
print(col_also_right.shape) # (2, 1)
print(a + col_also_right) # same result
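When the column vector comes from a reduction along an axis, passing `keepdims=True` preserves the reduced axis as size 1, so the result broadcasts back against the original array with no manual reshaping:

```python
import numpy as np

a = np.array([[1, 2, 3],
              [4, 5, 6]])

# Without keepdims: shape (2,) — incompatible with (2, 3)
row_means = a.mean(axis=1)
print(row_means.shape)  # (2,)

# With keepdims: shape (2, 1) — broadcasts cleanly across columns
row_means_kd = a.mean(axis=1, keepdims=True)
print(row_means_kd.shape)  # (2, 1)
print(a - row_means_kd)
# Output:
# [[-1.  0.  1.]
#  [-1.  0.  1.]]
```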
import numpy as np

# int8 can only hold values from -128 to 127
arr = np.array([100, 120, 130], dtype=np.int8)
print(arr)
# Output: [ 100  120 -126] — 130 overflowed silently!
result = arr + np.int8(50)
print(result)
# Output: [-106  -86  -76] — completely wrong, no warning!
# Fix: use a larger dtype
arr_safe = np.array([100, 120, 130], dtype=np.int32)
result_safe = arr_safe + 50
print(result_safe)
# Output: [150 170 180] — correct
# Watch out with uint8 (common for image data, range 0-255)
img_pixel = np.array([250], dtype=np.uint8)
print(img_pixel + np.uint8(10))
# Output: [4] — wrapped around! (250 + 10 = 260 → 260 % 256 = 4)
# Fix: cast before arithmetic
print(img_pixel.astype(np.int16) + 10)
# Output: [260] — correct
import numpy as np
arr = np.array([[1, 2, 3],
[4, 5, 6]])
# DON'T: Chained indexing may not work for setting values
# arr[arr > 3][0] = 99 # This might NOT modify arr
# DO: Use direct indexing
arr[arr > 3] = 99
print(arr)
# Output:
# [[ 1 2 3]
# [99 99 99]]
# Or use np.where for conditional replacement
arr2 = np.array([[1, 2, 3],
[4, 5, 6]])
result = np.where(arr2 > 3, 99, arr2)
print(result)
# Output:
# [[ 1 2 3]
# [99 99 99]]
Follow these guidelines to write efficient, maintainable NumPy code.
import numpy as np
data = np.random.rand(1_000_000)
# BAD: Python loop
result_slow = np.empty(len(data))
for i in range(len(data)):
result_slow[i] = data[i] ** 2 + 2 * data[i] + 1
# GOOD: Vectorized operation (10-100x faster)
result_fast = data ** 2 + 2 * data + 1
# For custom functions, use np.vectorize (still not as fast as native ufuncs)
def custom_func(x):
if x > 0.5:
return x ** 2
else:
return 0
vectorized_func = np.vectorize(custom_func)
result = vectorized_func(data)
# BEST: Use np.where instead of vectorize
result_best = np.where(data > 0.5, data ** 2, 0)
import numpy as np

# Use the smallest dtype that fits your data
# Integers
small_ints = np.array([1, 2, 3, 4], dtype=np.int8)    # -128 to 127
medium_ints = np.array([1, 2, 3, 4], dtype=np.int32)  # -2B to 2B
big_ints = np.array([1, 2, 3, 4], dtype=np.int64)     # default, but 2x memory
# Floats — float32 is usually sufficient for ML
weights = np.random.randn(1000, 1000).astype(np.float32)  # 3.8 MB
# vs np.float64 which would be 7.6 MB
# Boolean arrays for masks
mask = np.zeros(1000, dtype=np.bool_)  # 1 byte per element vs 8 for int64
import numpy as np

data = np.random.rand(1000, 3)
means = data.mean(axis=0)  # shape (3,)
# BAD: manually tiling to match shapes
means_tiled = np.tile(means, (1000, 1))  # creates unnecessary copy
centered_slow = data - means_tiled
# GOOD: let broadcasting handle it (no extra memory)
centered_fast = data - means  # (1000, 3) - (3,) → broadcasting
import numpy as np
n = 10000
# BAD: growing an array with append (copies entire array each time)
result = np.array([])
for i in range(n):
result = np.append(result, i ** 2)
# GOOD: preallocate and fill
result = np.empty(n)
for i in range(n):
result[i] = i ** 2
# BEST: vectorize completely
result = np.arange(n) ** 2
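When the computation genuinely can't be vectorized (for example, each value depends on an external call), a common fallback is to accumulate results in a Python list and convert once at the end. Unlike `np.append`, which copies the entire array on every call, list appends are amortized O(1); a sketch:

```python
import numpy as np

n = 10000

# Acceptable fallback: build a Python list, convert once at the end
values = []
for i in range(n):
    values.append(i ** 2)  # stand-in for a non-vectorizable computation
result = np.array(values)

print(result[:5])  # [ 0  1  4  9 16]
```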
import numpy as np

arr = np.random.rand(1_000_000)
# Creates a new array (uses extra memory)
arr = arr * 2
# In-place operation (modifies existing array, saves memory)
arr *= 2
# NumPy also provides in-place functions
np.multiply(arr, 2, out=arr)
np.add(arr, 1, out=arr)
Remember that slicing returns a view; call .copy() when you need independence.
Choose the smallest dtype that fits your data: float32 instead of float64 halves memory usage. Watch out for integer overflow with small dtypes like int8 and uint8.
Prefer np.linalg.solve() for linear systems; it is faster and more numerically stable than computing matrix inverses manually.
Never grow arrays with np.append() in a loop. Preallocate with np.empty() or np.zeros(), or better yet, vectorize the computation entirely.
NumPy is one of those libraries where the investment in learning it well pays dividends across your entire Python career. The patterns and concepts here — vectorization, broadcasting, memory-aware programming — are transferable to GPU computing, distributed computing, and any high-performance numerical work.