NumPy (Numerical Python) is the foundational library for numerical computing in Python. If you’ve worked with data science, machine learning, image processing, or scientific computing in Python, you’ve almost certainly used NumPy — whether directly or through libraries built on top of it like pandas, scikit-learn, TensorFlow, and OpenCV.
Here’s why NumPy matters:
- Speed: operations run in optimized, pre-compiled C code instead of interpreted Python loops.
- Memory efficiency: data lives in compact, contiguous typed buffers rather than lists of Python objects.
- Expressiveness: vectorized operations and broadcasting replace most explicit loops.
- Ecosystem: pandas, scikit-learn, TensorFlow, OpenCV, and much of the scientific Python stack are built on top of its ndarray.
In this tutorial, we’ll go deep on NumPy arrays — from creation to manipulation, from indexing to linear algebra. By the end, you’ll have a solid, practical understanding of the library that underpins nearly all of Python’s data stack.
NumPy is available via pip. If you don’t have it installed yet:
pip install numpy
If you’re using Anaconda, NumPy comes pre-installed. You can verify your installation:
import numpy as np
print(np.__version__)
The convention of importing NumPy as np is universal in the Python ecosystem. Stick with it — every tutorial, Stack Overflow answer, and library documentation assumes this alias.
NumPy arrays (ndarray objects) are the core data structure. There are several ways to create them, each suited to different situations.
The most straightforward way to create a NumPy array is from an existing Python list or tuple:
import numpy as np
# 1D array
a = np.array([1, 2, 3, 4, 5])
print(a)
# Output: [1 2 3 4 5]
# 2D array (matrix)
b = np.array([[1, 2, 3],
[4, 5, 6]])
print(b)
# Output:
# [[1 2 3]
# [4 5 6]]
# 3D array
c = np.array([[[1, 2], [3, 4]],
[[5, 6], [7, 8]]])
print(c.shape)
# Output: (2, 2, 2)
# Specifying data type explicitly
d = np.array([1, 2, 3], dtype=np.float64)
print(d)
# Output: [1. 2. 3.]
When you need arrays pre-filled with zeros or ones (common for initializing weight matrices, accumulators, or masks):
# 1D array of zeros
zeros_1d = np.zeros(5)
print(zeros_1d)
# Output: [0. 0. 0. 0. 0.]

# 2D array of zeros (3 rows, 4 columns)
zeros_2d = np.zeros((3, 4))
print(zeros_2d)
# Output:
# [[0. 0. 0. 0.]
#  [0. 0. 0. 0.]
#  [0. 0. 0. 0.]]

# 1D array of ones
ones_1d = np.ones(4)
print(ones_1d)
# Output: [1. 1. 1. 1.]

# 2D array of ones with integer type
ones_int = np.ones((2, 3), dtype=np.int32)
print(ones_int)
# Output:
# [[1 1 1]
#  [1 1 1]]

# Full array with a custom fill value
filled = np.full((2, 3), 7)
print(filled)
# Output:
# [[7 7 7]
#  [7 7 7]]

# Identity matrix
eye = np.eye(3)
print(eye)
# Output:
# [[1. 0. 0.]
#  [0. 1. 0.]
#  [0. 0. 1.]]
np.arange() works like Python’s range() but returns an array. np.linspace() creates evenly spaced values between two endpoints — extremely useful for plotting and numerical methods.
# arange: start, stop (exclusive), step
a = np.arange(0, 10, 2)
print(a)
# Output: [0 2 4 6 8]

# arange with float step
b = np.arange(0, 1, 0.2)
print(b)
# Output: [0.  0.2 0.4 0.6 0.8]

# linspace: start, stop (inclusive), number of points
c = np.linspace(0, 1, 5)
print(c)
# Output: [0.   0.25 0.5  0.75 1.  ]

# linspace is ideal for generating x-values for plots
x = np.linspace(0, 2 * np.pi, 100)  # 100 points from 0 to 2π
NumPy’s random module is essential for simulations, testing, and machine learning initialization:
# Uniform random values between 0 and 1
rand_uniform = np.random.rand(3, 3)
print(rand_uniform)
# Output: 3x3 matrix of random floats in [0, 1)

# Standard normal distribution (mean=0, std=1)
rand_normal = np.random.randn(3, 3)
print(rand_normal)
# Output: 3x3 matrix of values from normal distribution

# Random integers
rand_int = np.random.randint(1, 100, size=(2, 4))
print(rand_int)
# Output: 2x4 matrix of random ints between 1 and 99

# Reproducible random numbers with seed
np.random.seed(42)
reproducible = np.random.rand(3)
print(reproducible)
# Output: [0.37454012 0.95071431 0.73199394]

# Using the newer Generator API (recommended for new code)
rng = np.random.default_rng(seed=42)
values = rng.random(5)
print(values)
# Output: [0.77395605 0.43887844 0.85859792 0.69736803 0.09417735]

# Random choice from an array
choices = rng.choice([10, 20, 30, 40, 50], size=3, replace=False)
print(choices)
# Output: 3 random elements without replacement
Understanding array properties is essential for debugging and writing correct NumPy code. Every ndarray carries metadata about its structure:
import numpy as np
arr = np.array([[1, 2, 3, 4],
[5, 6, 7, 8],
[9, 10, 11, 12]])
# shape: dimensions as a tuple (rows, columns)
print(f"Shape: {arr.shape}")
# Output: Shape: (3, 4)
# ndim: number of dimensions (axes)
print(f"Dimensions: {arr.ndim}")
# Output: Dimensions: 2
# size: total number of elements
print(f"Total elements: {arr.size}")
# Output: Total elements: 12
# dtype: data type of elements
print(f"Data type: {arr.dtype}")
# Output: Data type: int64
# itemsize: size of each element in bytes
print(f"Bytes per element: {arr.itemsize}")
# Output: Bytes per element: 8
# nbytes: total memory consumed
print(f"Total bytes: {arr.nbytes}")
# Output: Total bytes: 96
# Practical example: understanding memory usage
large_arr = np.zeros((1000, 1000), dtype=np.float64)
print(f"Memory: {large_arr.nbytes / 1024 / 1024:.1f} MB")
# Output: Memory: 7.6 MB
# Same array with float32 uses half the memory
small_arr = np.zeros((1000, 1000), dtype=np.float32)
print(f"Memory: {small_arr.nbytes / 1024 / 1024:.1f} MB")
# Output: Memory: 3.8 MB
The dtype attribute is particularly important. NumPy supports many data types: int8, int16, int32, int64, float16, float32, float64, complex64, complex128, bool, and more. Choosing the right dtype can significantly impact both memory usage and computation speed.
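As a quick sketch of working with dtypes: np.iinfo() and np.finfo() report the limits of integer and float types, and astype() converts an array to a new dtype (the variable names here are illustrative):

```python
import numpy as np

# Inspect the representable range before picking a dtype
print(np.iinfo(np.int8).min, np.iinfo(np.int8).max)  # -128 127
print(np.finfo(np.float32).max)                      # largest float32 value

# astype() returns a converted copy; the original is untouched
a = np.array([1, 2, 3], dtype=np.int64)
b = a.astype(np.float32)
print(b.dtype)  # float32
print(a.dtype)  # int64
```

Checking limits up front is a cheap way to avoid the silent overflow problems covered later in this tutorial.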
NumPy’s indexing is more powerful than Python list indexing. Mastering it will save you from writing unnecessary loops.
arr = np.array([10, 20, 30, 40, 50, 60, 70, 80])

# Basic indexing (0-based)
print(arr[0])    # 10
print(arr[-1])   # 80
print(arr[-2])   # 70

# Slicing: start:stop:step
print(arr[2:5])   # [30 40 50]
print(arr[:3])    # [10 20 30]
print(arr[5:])    # [60 70 80]
print(arr[::2])   # [10 30 50 70] — every other element
print(arr[::-1])  # [80 70 60 50 40 30 20 10] — reversed
matrix = np.array([[1, 2, 3, 4],
[5, 6, 7, 8],
[9, 10, 11, 12],
[13, 14, 15, 16]])
# Single element: [row, col]
print(matrix[0, 0]) # 1
print(matrix[2, 3]) # 12
# Entire row
print(matrix[1]) # [5 6 7 8]
print(matrix[1, :]) # [5 6 7 8] — equivalent
# Entire column
print(matrix[:, 2]) # [ 3 7 11 15]
# Sub-matrix (rows 0-1, columns 1-2)
print(matrix[0:2, 1:3])
# Output:
# [[2 3]
# [6 7]]
# Every other row, every other column
print(matrix[::2, ::2])
# Output:
# [[ 1 3]
# [ 9 11]]
Boolean indexing is one of NumPy’s most powerful features. You create a boolean mask and use it to filter elements:
arr = np.array([15, 22, 8, 41, 3, 67, 29, 55])

# Elements greater than 20
mask = arr > 20
print(mask)
# Output: [False  True False  True False  True  True  True]
print(arr[mask])
# Output: [22 41 67 29 55]

# Shorthand — most common pattern
print(arr[arr > 20])
# Output: [22 41 67 29 55]

# Combining conditions (use & for AND, | for OR, ~ for NOT)
print(arr[(arr > 10) & (arr < 50)])
# Output: [15 22 41 29]
print(arr[(arr < 10) | (arr > 50)])
# Output: [ 8  3 67 55]

# Boolean indexing on 2D arrays
matrix = np.array([[1, 2], [3, 4], [5, 6]])
print(matrix[matrix % 2 == 0])
# Output: [2 4 6] — returns a flat array of even numbers
Fancy indexing lets you use arrays of indices to access multiple elements at once:
arr = np.array([10, 20, 30, 40, 50])
# Select elements at indices 0, 2, and 4
indices = np.array([0, 2, 4])
print(arr[indices])
# Output: [10 30 50]
# Works with 2D arrays too
matrix = np.array([[1, 2, 3],
[4, 5, 6],
[7, 8, 9],
[10, 11, 12]])
# Select specific rows
print(matrix[[0, 2, 3]])
# Output:
# [[ 1 2 3]
# [ 7 8 9]
# [10 11 12]]
# Select specific elements: (row0,col1), (row1,col2), (row2,col0)
rows = np.array([0, 1, 2])
cols = np.array([1, 2, 0])
print(matrix[rows, cols])
# Output: [2 6 7]
NumPy’s real power shows up in array operations. Everything is vectorized — no loops needed.
a = np.array([1, 2, 3, 4])
b = np.array([10, 20, 30, 40])

# Arithmetic is element-wise
print(a + b)   # [11 22 33 44]
print(a - b)   # [ -9 -18 -27 -36]
print(a * b)   # [ 10  40  90 160]
print(b / a)   # [10. 10. 10. 10.]
print(a ** 2)  # [ 1  4  9 16]

# Comparison operators return boolean arrays
print(a > 2)   # [False False  True  True]
print(a == b)  # [False False False False]

# Scalar operations are broadcast to every element
print(a + 100) # [101 102 103 104]
print(a * 3)   # [ 3  6  9 12]
Broadcasting is the mechanism that lets NumPy perform operations on arrays of different shapes. It’s one of the most important concepts to understand:
# Broadcasting a scalar across an array
arr = np.array([[1, 2, 3],
[4, 5, 6]])
print(arr * 10)
# Output:
# [[10 20 30]
# [40 50 60]]
# Broadcasting a 1D array across rows of a 2D array
row = np.array([100, 200, 300])
print(arr + row)
# Output:
# [[101 202 303]
# [104 205 306]]
# Broadcasting a column vector across columns
col = np.array([[10],
[20]])
print(arr + col)
# Output:
# [[11 12 13]
# [24 25 26]]
# Practical example: centering data (subtracting column means)
data = np.array([[1.0, 200, 3000],
[2.0, 400, 6000],
[3.0, 600, 9000]])
col_means = data.mean(axis=0)
print(f"Column means: {col_means}")
# Output: Column means: [2.000e+00 4.000e+02 6.000e+03]
centered = data - col_means
print(centered)
# Output:
# [[-1.000e+00 -2.000e+02 -3.000e+03]
# [ 0.000e+00 0.000e+00 0.000e+00]
# [ 1.000e+00 2.000e+02 3.000e+03]]
Broadcasting rules:
1. If the arrays differ in number of dimensions, the shape of the smaller one is padded with 1s on its left side.
2. Two dimensions are compatible when they are equal, or when one of them is 1.
3. If any dimension pair is incompatible, NumPy raises a ValueError.

Aggregation functions reduce an array to summary values, either globally or along a chosen axis:

arr = np.array([[1, 2, 3],
[4, 5, 6],
[7, 8, 9]])
# Global aggregations
print(f"Sum: {arr.sum()}") # 45
print(f"Mean: {arr.mean()}") # 5.0
print(f"Min: {arr.min()}") # 1
print(f"Max: {arr.max()}") # 9
print(f"Std Dev: {arr.std():.4f}") # 2.5820
# Aggregation along axes
# axis=0 → collapse rows (compute across rows → one value per column)
# axis=1 → collapse columns (compute across columns → one value per row)
print(f"Column sums: {arr.sum(axis=0)}") # [12 15 18]
print(f"Row sums: {arr.sum(axis=1)}") # [ 6 15 24]
print(f"Column means: {arr.mean(axis=0)}") # [4. 5. 6.]
print(f"Row means: {arr.mean(axis=1)}") # [2. 5. 8.]
# Other useful aggregations
print(f"Cumulative sum: {np.array([1,2,3,4]).cumsum()}")
# Output: [ 1 3 6 10]
print(f"Product: {np.array([1,2,3,4]).prod()}")
# Output: 24
# argmin and argmax — index of min/max value
scores = np.array([82, 91, 76, 95, 88])
print(f"Best score index: {scores.argmax()}") # 3
print(f"Worst score index: {scores.argmin()}") # 2
Reshaping lets you change the dimensions of an array without changing its data. This is critical when preparing data for machine learning models or matrix operations.
arr = np.arange(12)
print(arr)
# Output: [ 0  1  2  3  4  5  6  7  8  9 10 11]

# Reshape to 3 rows × 4 columns
reshaped = arr.reshape(3, 4)
print(reshaped)
# Output:
# [[ 0  1  2  3]
#  [ 4  5  6  7]
#  [ 8  9 10 11]]

# Reshape to 4 rows × 3 columns
print(arr.reshape(4, 3))
# Output:
# [[ 0  1  2]
#  [ 3  4  5]
#  [ 6  7  8]
#  [ 9 10 11]]

# Use -1 to let NumPy infer one dimension
print(arr.reshape(2, -1))  # 2 rows, auto-compute columns → (2, 6)
print(arr.reshape(-1, 3))  # auto-compute rows, 3 columns → (4, 3)

# Reshape to 3D
print(arr.reshape(2, 2, 3).shape)
# Output: (2, 2, 3)

# IMPORTANT: total elements must match
# arr.reshape(3, 5)  # ValueError: cannot reshape array of size 12 into shape (3,5)
matrix = np.array([[1, 2, 3],
[4, 5, 6]])
# flatten() — always returns a copy
flat = matrix.flatten()
print(flat)
# Output: [1 2 3 4 5 6]
flat[0] = 999
print(matrix[0, 0]) # 1 — original unchanged (it's a copy)
# ravel() — returns a view when possible (more memory efficient)
raveled = matrix.ravel()
print(raveled)
# Output: [1 2 3 4 5 6]
raveled[0] = 999
print(matrix[0, 0]) # 999 — original IS changed (it's a view)
matrix = np.array([[1, 2, 3],
[4, 5, 6]])
print(f"Original shape: {matrix.shape}")
# Output: Original shape: (2, 3)
transposed = matrix.T
print(f"Transposed shape: {transposed.shape}")
# Output: Transposed shape: (3, 2)
print(transposed)
# Output:
# [[1 4]
# [2 5]
# [3 6]]
# np.transpose() and .T are equivalent for 2D arrays
# For higher dimensions, np.transpose() lets you specify axis order
arr_3d = np.arange(24).reshape(2, 3, 4)
print(arr_3d.shape) # (2, 3, 4)
print(np.transpose(arr_3d, (1, 0, 2)).shape) # (3, 2, 4)
Combining and dividing arrays is a common operation when preparing datasets or assembling results.
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

# Vertical stack — adds rows
vs = np.vstack([a, b])
print(vs)
# Output:
# [[1 2 3]
#  [4 5 6]]

# Horizontal stack — concatenates side by side
hs = np.hstack([a, b])
print(hs)
# Output: [1 2 3 4 5 6]

# 2D stacking
m1 = np.array([[1, 2], [3, 4]])
m2 = np.array([[5, 6], [7, 8]])
print(np.vstack([m1, m2]))
# Output:
# [[1 2]
#  [3 4]
#  [5 6]
#  [7 8]]
print(np.hstack([m1, m2]))
# Output:
# [[1 2 5 6]
#  [3 4 7 8]]

# np.concatenate — general purpose (specify axis)
print(np.concatenate([m1, m2], axis=0))  # same as vstack
print(np.concatenate([m1, m2], axis=1))  # same as hstack

# Column stack — treats 1D arrays as columns
c1 = np.array([1, 2, 3])
c2 = np.array([4, 5, 6])
print(np.column_stack([c1, c2]))
# Output:
# [[1 4]
#  [2 5]
#  [3 6]]
arr = np.arange(16).reshape(4, 4)
print(arr)
# Output:
# [[ 0 1 2 3]
# [ 4 5 6 7]
# [ 8 9 10 11]
# [12 13 14 15]]
# Split into 2 equal parts along rows (axis=0)
top, bottom = np.vsplit(arr, 2)
print("Top:\n", top)
# Output:
# [[0 1 2 3]
# [4 5 6 7]]
print("Bottom:\n", bottom)
# Output:
# [[ 8 9 10 11]
# [12 13 14 15]]
# Split into 2 equal parts along columns (axis=1)
left, right = np.hsplit(arr, 2)
print("Left:\n", left)
# Output:
# [[ 0 1]
# [ 4 5]
# [ 8 9]
# [12 13]]
# Split at specific indices
first, second, third = np.split(arr, [1, 3], axis=0)
print(f"First (row 0): {first}")
print(f"Second (rows 1-2):\n{second}")
print(f"Third (row 3): {third}")
NumPy provides a comprehensive set of mathematical functions — all vectorized and optimized.
arr = np.array([1, 4, 9, 16, 25])

# Square root
print(np.sqrt(arr))
# Output: [1. 2. 3. 4. 5.]

# Exponential (e^x)
print(np.exp(np.array([0, 1, 2])))
# Output: [1.         2.71828183 7.3890561 ]

# Natural logarithm
print(np.log(np.array([1, np.e, np.e**2])))
# Output: [0. 1. 2.]

# Log base 10 and base 2
print(np.log10(np.array([1, 10, 100, 1000])))
# Output: [0. 1. 2. 3.]
print(np.log2(np.array([1, 2, 4, 8])))
# Output: [0. 1. 2. 3.]

# Trigonometric functions
angles = np.array([0, np.pi/6, np.pi/4, np.pi/3, np.pi/2])
print(np.sin(angles))
# Output: [0.         0.5        0.70710678 0.8660254  1.        ]
print(np.cos(angles))
# Output: [1.00000000e+00 8.66025404e-01 7.07106781e-01 5.00000000e-01 6.12323400e-17]

# Absolute value
print(np.abs(np.array([-3, -1, 0, 2, 5])))
# Output: [3 1 0 2 5]

# Rounding
vals = np.array([1.23, 2.67, 3.5, 4.89])
print(np.round(vals, 1))  # [1.2 2.7 3.5 4.9]
print(np.floor(vals))     # [1. 2. 3. 4.]
print(np.ceil(vals))      # [2. 3. 4. 5.]
# Dot product of 1D arrays (scalar result)
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
print(np.dot(a, b))
# Output: 32 (1*4 + 2*5 + 3*6)
# Matrix multiplication
A = np.array([[1, 2],
[3, 4]])
B = np.array([[5, 6],
[7, 8]])
# Three equivalent ways to multiply matrices
print(np.dot(A, B))
print(A @ B) # @ operator (Python 3.5+)
print(np.matmul(A, B))
# All output:
# [[19 22]
# [43 50]]
# IMPORTANT: * is element-wise, NOT matrix multiplication
print(A * B)
# Output:
# [[ 5 12]
# [21 32]]
# Cross product
print(np.cross(np.array([1, 0, 0]), np.array([0, 1, 0])))
# Output: [0 0 1]
A = np.array([[1, 2],
[3, 4]])
# Determinant
print(f"Determinant: {np.linalg.det(A):.1f}")
# Output: Determinant: -2.0
# Inverse
A_inv = np.linalg.inv(A)
print(f"Inverse:\n{A_inv}")
# Output:
# [[-2. 1. ]
# [ 1.5 -0.5]]
# Verify: A × A_inv = Identity
print(np.round(A @ A_inv))
# Output:
# [[1. 0.]
# [0. 1.]]
# Eigenvalues and eigenvectors
eigenvalues, eigenvectors = np.linalg.eig(A)
print(f"Eigenvalues: {eigenvalues}")
print(f"Eigenvectors:\n{eigenvectors}")
# Matrix rank
print(f"Rank: {np.linalg.matrix_rank(A)}")
# Output: Rank: 2
# Norm
print(f"Frobenius norm: {np.linalg.norm(A):.4f}")
# Output: Frobenius norm: 5.4772
Understanding why NumPy is faster than Python lists is important for making good design decisions.
import numpy as np
import time
size = 1_000_000
# Python list approach
py_list = list(range(size))
start = time.time()
py_result = [x ** 2 for x in py_list]
py_time = time.time() - start
print(f"Python list: {py_time:.4f} seconds")
# NumPy approach
np_arr = np.arange(size)
start = time.time()
np_result = np_arr ** 2
np_time = time.time() - start
print(f"NumPy array: {np_time:.4f} seconds")
print(f"NumPy is {py_time / np_time:.0f}x faster")
# Typical output:
# Python list: 0.1654 seconds
# NumPy array: 0.0012 seconds
# NumPy is 138x faster
import sys
# Python list of 1000 integers
py_list = list(range(1000))
py_size = sys.getsizeof(py_list) + sum(sys.getsizeof(x) for x in py_list)
print(f"Python list: {py_size:,} bytes")
# NumPy array of 1000 integers
np_arr = np.arange(1000, dtype=np.int64)
print(f"NumPy array: {np_arr.nbytes:,} bytes")
print(f"Python list uses {py_size / np_arr.nbytes:.1f}x more memory")
# Typical output:
# Python list: 36,056 bytes
# NumPy array: 8,000 bytes
# Python list uses 4.5x more memory
Why is NumPy faster?
- Contiguous memory: array elements are stored in one homogeneous block, which is cache-friendly; a Python list stores pointers to scattered, full-fledged objects.
- Fixed dtype: the element type is checked once per array, not once per element.
- Compiled loops: vectorized operations execute in pre-compiled C code, bypassing the Python interpreter entirely.
- Hardware vectorization: many operations can take advantage of SIMD instructions on modern CPUs.
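You can inspect the "contiguous, homogeneous storage" part directly, since every ndarray exposes its memory layout (a small illustrative sketch):

```python
import numpy as np

arr = np.arange(10, dtype=np.int64)

# Elements live in one contiguous C-ordered buffer of fixed-size items
print(arr.flags['C_CONTIGUOUS'])  # True
print(arr.itemsize)               # 8 — bytes per element, no per-object overhead

# A transposed view of a 2D array is no longer C-contiguous, which is
# one reason some operations on transposed data trigger an internal copy
m = np.arange(6).reshape(2, 3)
print(m.T.flags['C_CONTIGUOUS'])  # False
```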
Digital images are just NumPy arrays. A grayscale image is a 2D array; a color image is 3D (height × width × channels).
import numpy as np
# Simulate a small 5x5 grayscale image (values 0-255)
image = np.array([
[50, 80, 120, 160, 200],
[55, 85, 125, 165, 205],
[60, 90, 130, 170, 210],
[65, 95, 135, 175, 215],
[70, 100, 140, 180, 220]
], dtype=np.uint8)
print(f"Image shape: {image.shape}")
print(f"Pixel value range: {image.min()} - {image.max()}")
# Invert the image (negative)
inverted = 255 - image
print(f"Inverted:\n{inverted}")
# Increase brightness (clamp to 255)
brightened = np.clip(image.astype(np.int16) + 50, 0, 255).astype(np.uint8)
print(f"Brightened:\n{brightened}")
# Threshold to binary (black/white)
threshold = 128
binary = (image > threshold).astype(np.uint8) * 255
print(f"Binary:\n{binary}")
# Normalize to [0, 1] range (common preprocessing step)
normalized = image.astype(np.float32) / 255.0
print(f"Normalized range: {normalized.min():.2f} - {normalized.max():.2f}")
# Simulate RGB image processing
rgb_image = np.random.randint(0, 256, size=(100, 100, 3), dtype=np.uint8)
print(f"RGB shape: {rgb_image.shape}") # (100, 100, 3)
# Convert to grayscale using weighted average
weights = np.array([0.2989, 0.5870, 0.1140]) # Standard luminance weights
grayscale = np.dot(rgb_image[...,:3], weights).astype(np.uint8)
print(f"Grayscale shape: {grayscale.shape}") # (100, 100)
import numpy as np
# Simulate exam scores for 5 subjects, 100 students
np.random.seed(42)
scores = np.random.normal(loc=72, scale=12, size=(100, 5))
scores = np.clip(scores, 0, 100).round(1)
subjects = ['Math', 'Science', 'English', 'History', 'Art']
print("=== Class Statistics ===\n")
# Per-subject statistics
for i, subject in enumerate(subjects):
col = scores[:, i]
print(f"{subject:>10}: mean={col.mean():.1f}, "
f"std={col.std():.1f}, "
f"min={col.min():.1f}, "
f"max={col.max():.1f}, "
f"median={np.median(col):.1f}")
print(f"\n{'Overall':>10}: mean={scores.mean():.1f}, std={scores.std():.1f}")
# Find top 5 students by average score
student_averages = scores.mean(axis=1)
top_5_indices = np.argsort(student_averages)[-5:][::-1]
print(f"\nTop 5 students (by index): {top_5_indices}")
for idx in top_5_indices:
print(f" Student {idx}: avg = {student_averages[idx]:.1f}")
# Correlation between subjects
correlation = np.corrcoef(scores.T)
print(f"\nCorrelation matrix shape: {correlation.shape}")
print(f"Math-Science correlation: {correlation[0, 1]:.3f}")
# Percentile analysis
print(f"\n90th percentile per subject:")
for i, subject in enumerate(subjects):
p90 = np.percentile(scores[:, i], 90)
print(f" {subject}: {p90:.1f}")
# Students scoring above 90 in all subjects
high_achievers = np.all(scores > 90, axis=1)
print(f"\nStudents scoring >90 in ALL subjects: {high_achievers.sum()}")
Solving systems of linear equations is a fundamental operation in engineering and data science. Consider:
import numpy as np
# Solve the system:
# 2x + 3y - z = 1
# 4x + y + 2z = 2
# -2x + 7y - 3z = -1
# Coefficient matrix
A = np.array([[2, 3, -1],
[4, 1, 2],
[-2, 7, -3]])
# Constants vector
b = np.array([1, 2, -1])
# Solve using np.linalg.solve (faster and more stable than computing inverse)
x = np.linalg.solve(A, b)
print(f"Solution: x={x[0]:.4f}, y={x[1]:.4f}, z={x[2]:.4f}")
# Verify the solution
residual = A @ x - b
print(f"Residual (should be ~0): {residual}")
print(f"Max error: {np.abs(residual).max():.2e}")
# Least squares solution for overdetermined systems
# (more equations than unknowns — common in data fitting)
# Fit y = mx + c to noisy data
np.random.seed(42)
x_data = np.linspace(0, 10, 50)
y_data = 2.5 * x_data + 1.3 + np.random.normal(0, 1, 50)
# Set up matrix A for y = mx + c
A_fit = np.column_stack([x_data, np.ones(len(x_data))])
# Solve via least squares
result, residuals, rank, sv = np.linalg.lstsq(A_fit, y_data, rcond=None)
m, c = result
print(f"\nLeast squares fit: y = {m:.4f}x + {c:.4f}")
print(f"(True values: y = 2.5000x + 1.3000)")
Normalization and standardization are essential preprocessing steps in machine learning. NumPy makes them trivial:
import numpy as np
# Sample dataset: 5 samples with 3 features of different scales
data = np.array([
[25.0, 50000, 3.5],
[30.0, 60000, 4.2],
[22.0, 45000, 3.1],
[35.0, 80000, 4.8],
[28.0, 55000, 3.9]
])
feature_names = ['Age', 'Salary', 'GPA']
print("Original data:")
print(data)
# Min-Max Normalization: scale to [0, 1]
min_vals = data.min(axis=0)
max_vals = data.max(axis=0)
normalized = (data - min_vals) / (max_vals - min_vals)
print(f"\nMin-Max Normalized (range [0, 1]):")
for i, name in enumerate(feature_names):
print(f" {name}: min={normalized[:, i].min():.2f}, max={normalized[:, i].max():.2f}")
print(normalized)
# Z-Score Standardization: mean=0, std=1
mean_vals = data.mean(axis=0)
std_vals = data.std(axis=0)
standardized = (data - mean_vals) / std_vals
print(f"\nZ-Score Standardized (mean≈0, std≈1):")
for i, name in enumerate(feature_names):
print(f" {name}: mean={standardized[:, i].mean():.4f}, std={standardized[:, i].std():.4f}")
print(standardized)
# Robust scaling (using median and IQR — resistant to outliers)
median_vals = np.median(data, axis=0)
q75 = np.percentile(data, 75, axis=0)
q25 = np.percentile(data, 25, axis=0)
iqr = q75 - q25
robust_scaled = (data - median_vals) / iqr
print(f"\nRobust Scaled (using median and IQR):")
print(robust_scaled)
Even experienced developers trip over these. Save yourself the debugging time.
This is the single most common source of bugs in NumPy code:
import numpy as np

original = np.array([1, 2, 3, 4, 5])

# Slicing creates a VIEW, not a copy
view = original[1:4]
view[0] = 999
print(original)
# Output: [  1 999   3   4   5] — original is modified!

# To create an independent copy, use .copy()
original = np.array([1, 2, 3, 4, 5])
safe_copy = original[1:4].copy()
safe_copy[0] = 999
print(original)
# Output: [1 2 3 4 5] — original is safe

# How to check: use np.shares_memory()
a = np.array([1, 2, 3, 4, 5])
b = a[1:4]
c = a[1:4].copy()
print(np.shares_memory(a, b))  # True — b is a view
print(np.shares_memory(a, c))  # False — c is a copy

# Boolean and fancy indexing ALWAYS return copies
d = a[a > 2]
print(np.shares_memory(a, d))  # False
import numpy as np
a = np.array([[1, 2, 3],
[4, 5, 6]]) # shape (2, 3)
# This works — (3,) broadcasts to (2, 3)
row = np.array([10, 20, 30])
print(a + row)
# This FAILS — shapes (2, 3) and (2,) are incompatible
col_wrong = np.array([10, 20])
try:
print(a + col_wrong)
except ValueError as e:
print(f"Error: {e}")
# Error: operands could not be broadcast together with shapes (2,3) (2,)
# Fix: reshape to column vector (2, 1)
col_right = np.array([[10], [20]]) # shape (2, 1)
print(a + col_right)
# Output:
# [[11 12 13]
# [24 25 26]]
# Alternatively, use np.newaxis (or None — they're the same)
col_also_right = np.array([10, 20])[:, np.newaxis]
print(col_also_right.shape) # (2, 1)
print(a + col_also_right) # same result
import numpy as np

# int8 can only hold values from -128 to 127
arr = np.array([100, 120, 130], dtype=np.int8)
print(arr)
# Output: [ 100  120 -126] — 130 overflowed silently!
# Note: NumPy 2.x refuses out-of-range Python ints at construction time
# (OverflowError); the silent wraparound shown here is NumPy 1.x behavior.
# Arithmetic on existing arrays still wraps silently in both versions.

result = arr + np.int8(50)
print(result)
# Output: [-106  -86  -76] — completely wrong, no warning!

# Fix: use a larger dtype
arr_safe = np.array([100, 120, 130], dtype=np.int32)
result_safe = arr_safe + 50
print(result_safe)
# Output: [150 170 180] — correct

# Watch out with uint8 (common for image data, range 0-255)
img_pixel = np.array([250], dtype=np.uint8)
print(img_pixel + np.uint8(10))
# Output: [4] — wrapped around! (250 + 10 = 260 → 260 % 256 = 4)

# Fix: cast before arithmetic
print(img_pixel.astype(np.int16) + 10)
# Output: [260] — correct
import numpy as np
arr = np.array([[1, 2, 3],
[4, 5, 6]])
# DON'T: Chained indexing may not work for setting values
# arr[arr > 3][0] = 99 # This might NOT modify arr
# DO: Use direct indexing
arr[arr > 3] = 99
print(arr)
# Output:
# [[ 1 2 3]
# [99 99 99]]
# Or use np.where for conditional replacement
arr2 = np.array([[1, 2, 3],
[4, 5, 6]])
result = np.where(arr2 > 3, 99, arr2)
print(result)
# Output:
# [[ 1 2 3]
# [99 99 99]]
Follow these guidelines to write efficient, maintainable NumPy code.
import numpy as np
data = np.random.rand(1_000_000)
# BAD: Python loop
result_slow = np.empty(len(data))
for i in range(len(data)):
result_slow[i] = data[i] ** 2 + 2 * data[i] + 1
# GOOD: Vectorized operation (10-100x faster)
result_fast = data ** 2 + 2 * data + 1
# For custom functions, use np.vectorize (still not as fast as native ufuncs)
def custom_func(x):
if x > 0.5:
return x ** 2
else:
return 0
vectorized_func = np.vectorize(custom_func)
result = vectorized_func(data)
# BEST: Use np.where instead of vectorize
result_best = np.where(data > 0.5, data ** 2, 0)
import numpy as np

# Use the smallest dtype that fits your data

# Integers
small_ints = np.array([1, 2, 3, 4], dtype=np.int8)    # -128 to 127
medium_ints = np.array([1, 2, 3, 4], dtype=np.int32)  # -2B to 2B
big_ints = np.array([1, 2, 3, 4], dtype=np.int64)     # default, but 2x memory

# Floats — float32 is usually sufficient for ML
weights = np.random.randn(1000, 1000).astype(np.float32)  # 3.8 MB
# vs np.float64 which would be 7.6 MB

# Boolean arrays for masks
mask = np.zeros(1000, dtype=np.bool_)  # 1 byte per element vs 8 for int64
import numpy as np

data = np.random.rand(1000, 3)
means = data.mean(axis=0)  # shape (3,)

# BAD: manually tiling to match shapes
means_tiled = np.tile(means, (1000, 1))  # creates unnecessary copy
centered_slow = data - means_tiled

# GOOD: let broadcasting handle it (no extra memory)
centered_fast = data - means  # (1000, 3) - (3,) → broadcasting
import numpy as np
n = 10000
# BAD: growing an array with append (copies entire array each time)
result = np.array([])
for i in range(n):
result = np.append(result, i ** 2)
# GOOD: preallocate and fill
result = np.empty(n)
for i in range(n):
result[i] = i ** 2
# BEST: vectorize completely
result = np.arange(n) ** 2
import numpy as np

arr = np.random.rand(1_000_000)

# Creates a new array (uses extra memory)
arr = arr * 2

# In-place operation (modifies existing array, saves memory)
arr *= 2

# NumPy also provides in-place functions
np.multiply(arr, 2, out=arr)
np.add(arr, 1, out=arr)
Key takeaways:
- Slices return views, not copies. Use .copy() when you need independence.
- Choose dtypes deliberately: float32 instead of float64 halves memory usage. Watch out for integer overflow with small dtypes like int8 and uint8.
- np.linalg.solve() is faster and more numerically stable than computing matrix inverses manually.
- Avoid growing arrays with np.append() in a loop. Preallocate with np.empty() or np.zeros(), or better yet, vectorize the computation entirely.

NumPy is one of those libraries where the investment in learning it well pays dividends across your entire Python career. The patterns and concepts here — vectorization, broadcasting, memory-aware programming — are transferable to GPU computing, distributed computing, and any high-performance numerical work.
Serialization is the process of converting an in-memory data structure (objects, dictionaries, lists) into a format that can be stored on disk, transmitted over a network, or cached for later retrieval. Deserialization is the reverse — reconstructing the original data structure from the serialized format.
If you have ever saved application state to a file, sent JSON to a REST API, or read a YAML configuration file, you have already been using serialization. It is one of the most fundamental operations in software engineering, and Python gives you several powerful modules to handle it.
Why serialization matters:
- Persistence: save application state or computed results to disk and reload them later.
- Communication: REST APIs, message queues, and microservices exchange serialized data (most commonly JSON).
- Caching: store expensive results in a serialized form for fast retrieval.
- Configuration: YAML and JSON files keep settings outside your code.
- Interoperability: language-agnostic formats let Python talk to services written in other languages.
In this tutorial, we will cover the most important serialization formats and libraries in Python: JSON, pickle, YAML, XML, dataclasses, and marshmallow. Each has its strengths, trade-offs, and ideal use cases.
JSON (JavaScript Object Notation) is the most widely used serialization format on the web. It is human-readable, language-agnostic, and supported by virtually every programming language. Python’s built-in json module handles JSON serialization and deserialization out of the box.
Use json.dumps() to serialize a Python object to a JSON string, and json.loads() to deserialize a JSON string back to a Python object.
import json
# Serialize Python dict to JSON string
user = {
"name": "Folau",
"age": 30,
"email": "folau@example.com",
"skills": ["Python", "Java", "AWS"],
"active": True
}
json_string = json.dumps(user)
print(json_string)
# {"name": "Folau", "age": 30, "email": "folau@example.com", "skills": ["Python", "Java", "AWS"], "active": true}
print(type(json_string))
# <class 'str'>
# Deserialize JSON string back to Python dict
parsed = json.loads(json_string)
print(parsed["name"]) # Folau
print(parsed["skills"]) # ['Python', 'Java', 'AWS']
print(type(parsed)) # <class 'dict'>
Notice that Python’s True becomes JSON’s true, and None becomes null. The json module handles these conversions automatically.
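A quick sketch of these conversions in both directions, plus one subtlety worth knowing: JSON object keys are always strings, so non-string dict keys are converted on the way out.

```python
import json

# Python singletons map to their JSON equivalents and back
payload = json.dumps({"ok": True, "missing": None})
print(payload)   # {"ok": true, "missing": null}

restored = json.loads(payload)
print(restored)  # {'ok': True, 'missing': None}

# Dict keys become strings — the round trip is NOT lossless here
print(json.loads(json.dumps({1: "a"})))  # {'1': 'a'}
```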
When you need to write JSON directly to a file or read from one, use json.dump() and json.load() (without the trailing “s”).
import json
user = {
"name": "Folau",
"age": 30,
"roles": ["admin", "developer"]
}
# Write to file
with open("user.json", "w") as f:
json.dump(user, f, indent=2)
# Read from file
with open("user.json", "r") as f:
loaded_user = json.load(f)
print(loaded_user)
# {'name': 'Folau', 'age': 30, 'roles': ['admin', 'developer']}
Tip: Always use with statements for file operations. It guarantees the file is properly closed even if an exception occurs.
The json.dumps() function accepts several formatting options that make output more readable.
import json
config = {
"database": {
"host": "localhost",
"port": 5432,
"name": "myapp_db"
},
"cache": {
"enabled": True,
"ttl_seconds": 300
},
"debug": False
}
# Pretty print with 4-space indentation
pretty = json.dumps(config, indent=4)
print(pretty)
# Sort keys alphabetically
sorted_json = json.dumps(config, indent=2, sort_keys=True)
print(sorted_json)
# Compact output (minimize whitespace)
compact = json.dumps(config, separators=(",", ":"))
print(compact)
# {"database":{"host":"localhost","port":5432,"name":"myapp_db"},"cache":{"enabled":true,"ttl_seconds":300},"debug":false}
Use indent for config files and logs where readability matters. Use separators=(",", ":") when you need minimal payload size (e.g., sending data over a network).
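To see the size difference concretely, you can compare the lengths of the two encodings of the same (here, hypothetical) config:

```python
import json

# Illustrative payload, just to compare output sizes
config = {"database": {"host": "localhost", "port": 5432}, "debug": False}

pretty = json.dumps(config, indent=4)
compact = json.dumps(config, separators=(",", ":"))

# Same data, different byte counts — compact is always the smaller one
print(len(pretty), len(compact))
```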
The json module can only serialize basic Python types: dict, list, str, int, float, bool, and None. Anything else will raise a TypeError. This commonly happens with datetime objects, sets, custom classes, and bytes.
import json
from datetime import datetime
data = {
"event": "deployment",
"timestamp": datetime.now()
}
# This will FAIL
try:
json.dumps(data)
except TypeError as e:
print(f"Error: {e}")
# Error: Object of type datetime is not JSON serializable
The simplest fix is the default parameter, which provides a fallback serializer for unsupported types.
import json
from datetime import datetime, date
from decimal import Decimal
def json_serializer(obj):
"""Custom serializer for objects not handled by default json encoder."""
if isinstance(obj, (datetime, date)):
return obj.isoformat()
if isinstance(obj, Decimal):
return float(obj)
if isinstance(obj, set):
return list(obj)
if isinstance(obj, bytes):
return obj.decode("utf-8")
raise TypeError(f"Type {type(obj)} is not JSON serializable")
data = {
"event": "deployment",
"timestamp": datetime.now(),
"cost": Decimal("49.99"),
"tags": {"urgent", "production"},
"payload": b"raw bytes here"
}
result = json.dumps(data, default=json_serializer, indent=2)
print(result)
For more control, subclass json.JSONEncoder. This is cleaner when you have complex serialization logic that you want to reuse across your application.
import json
from datetime import datetime, date
from decimal import Decimal
class AppJSONEncoder(json.JSONEncoder):
"""Custom JSON encoder for application-specific types."""
def default(self, obj):
if isinstance(obj, (datetime, date)):
return obj.isoformat()
if isinstance(obj, Decimal):
return str(obj)
if isinstance(obj, set):
return sorted(list(obj))
if isinstance(obj, bytes):
return obj.decode("utf-8")
# Let the base class raise TypeError for unknown types
return super().default(obj)
data = {
"user": "Folau",
"created_at": datetime(2024, 1, 15, 10, 30, 0),
"balance": Decimal("1250.75"),
"permissions": {"read", "write", "admin"}
}
print(json.dumps(data, cls=AppJSONEncoder, indent=2))
When to use which approach:
- default parameter — Quick, one-off serialization
- JSONEncoder subclass — Reusable across your codebase, better for large projects
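Decoding has a counterpart: json.loads() accepts an object_hook callable that is invoked for every decoded JSON object, which lets you revive custom types on the way back in. A sketch that restores ISO-8601 timestamps (the "timestamp" key name is just this example's convention, not a json feature):

```python
import json
from datetime import datetime

def datetime_hook(obj):
    # Called once per decoded JSON object; revive ISO-8601 strings
    # stored under the (example-specific) "timestamp" key.
    if isinstance(obj.get("timestamp"), str):
        try:
            obj["timestamp"] = datetime.fromisoformat(obj["timestamp"])
        except ValueError:
            pass  # leave non-ISO strings untouched
    return obj

raw = '{"event": "deployment", "timestamp": "2024-01-15T10:30:00"}'
decoded = json.loads(raw, object_hook=datetime_hook)
print(type(decoded["timestamp"]))  # <class 'datetime.datetime'>
```

Pairing a default serializer with an object_hook gives you a full round trip for types like datetime.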
While JSON handles basic data types, Python’s pickle module can serialize almost any Python object — including classes, functions, nested structures, and even lambda expressions. The trade-off is that pickle output is binary (not human-readable) and Python-specific (other languages cannot read it).
import pickle
# A complex Python object that JSON cannot handle
class User:
def __init__(self, name, age, scores):
self.name = name
self.age = age
self.scores = scores
def average_score(self):
return sum(self.scores) / len(self.scores)
def __repr__(self):
return f"User(name={self.name}, age={self.age})"
user = User("Folau", 30, [95, 88, 72, 90])
# Serialize to bytes
pickled = pickle.dumps(user)
print(type(pickled)) # <class 'bytes'>
print(len(pickled)) # varies
# Deserialize back to object
restored = pickle.loads(pickled)
print(restored) # User(name=Folau, age=30)
print(restored.average_score()) # 86.25
import pickle
user = User("Folau", 30, [95, 88, 72, 90])
# Write to file (binary mode!)
with open("user.pkl", "wb") as f:
pickle.dump(user, f)
# Read from file
with open("user.pkl", "rb") as f:
loaded_user = pickle.load(f)
print(loaded_user.name) # Folau
print(loaded_user.average_score()) # 86.25
Important: Always open pickle files in binary mode ("wb" and "rb"). Pickle produces bytes, not text.
| Feature | JSON | pickle |
|---|---|---|
| Human-readable | Yes | No (binary) |
| Language support | Universal | Python only |
| Custom objects | Requires custom encoder | Works out of the box |
| Security | Safe to deserialize | Can execute arbitrary code |
| Speed | Moderate | Fast for Python objects |
| Best for | APIs, config files, data exchange | Caching, internal Python storage |
WARNING: Never unpickle data from untrusted sources! Pickle can execute arbitrary code during deserialization. A malicious pickle payload can run system commands, delete files, or open network connections. Only use pickle with data you created yourself or from a fully trusted source.
import pickle
import os
# This is what a MALICIOUS pickle payload looks like.
# DO NOT run this — it demonstrates the danger.
class Malicious:
def __reduce__(self):
# This would execute a system command when unpickled!
return (os.system, ("echo 'You have been hacked!'",))
# If someone sends you a pickle file, it could contain code like this.
# NEVER do: pickle.loads(untrusted_data)
# SAFE alternatives for untrusted data:
# - Use json.loads() for JSON data
# - Use yaml.safe_load() for YAML data
# - Use pickle only for data YOU created
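If you genuinely must load pickle data, the Python documentation describes restricting which globals can be resolved during unpickling by overriding Unpickler.find_class. A minimal sketch of that pattern (the allowlist contents here are illustrative):

```python
import io
import os
import pickle

class RestrictedUnpickler(pickle.Unpickler):
    """Refuse to resolve any global not on an explicit allowlist."""
    ALLOWED = {("collections", "OrderedDict")}  # illustrative allowlist

    def find_class(self, module, name):
        if (module, name) in self.ALLOWED:
            return super().find_class(module, name)
        raise pickle.UnpicklingError(f"global {module}.{name} is forbidden")

def restricted_loads(data: bytes):
    return RestrictedUnpickler(io.BytesIO(data)).load()

# Plain containers never go through find_class, so they still round-trip
print(restricted_loads(pickle.dumps({"a": [1, 2, 3]})))

# A payload referencing os.system is rejected instead of executed
try:
    restricted_loads(pickle.dumps(os.system))
except pickle.UnpicklingError as e:
    print(f"Blocked: {e}")
```

This narrows the attack surface but is still not a substitute for simply using JSON for untrusted input.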
YAML (YAML Ain’t Markup Language) is popular for configuration files because it is more human-friendly than JSON — no braces, no quotes around keys, and it supports comments. Python uses the PyYAML library to work with YAML.
# Install first: pip install pyyaml
import yaml
# Python dict to YAML string
config = {
"database": {
"host": "localhost",
"port": 5432,
"name": "myapp_db",
"credentials": {
"username": "admin",
"password": "secret"
}
},
"logging": {
"level": "INFO",
"file": "/var/log/app.log"
},
"features": ["auth", "caching", "rate_limiting"]
}
yaml_string = yaml.dump(config, default_flow_style=False, sort_keys=False)
print(yaml_string)
Output:
database:
host: localhost
port: 5432
name: myapp_db
credentials:
username: admin
password: secret
logging:
level: INFO
file: /var/log/app.log
features:
- auth
- caching
- rate_limiting
import yaml

yaml_content = """
server:
  host: 0.0.0.0
  port: 8080
  workers: 4

database:
  url: postgresql://localhost:5432/myapp
  pool_size: 10
  # Timeout in seconds
  timeout: 30

features:
  - authentication
  - rate_limiting
  - caching
"""

# ALWAYS use safe_load, never yaml.load() without a Loader
config = yaml.safe_load(yaml_content)
print(config["server"]["port"])    # 8080
print(config["database"]["url"])   # postgresql://localhost:5432/myapp
print(config["features"])          # ['authentication', 'rate_limiting', 'caching']
import yaml
import os
def load_config(config_path="config.yaml"):
"""Load application configuration from YAML file."""
if not os.path.exists(config_path):
raise FileNotFoundError(f"Config file not found: {config_path}")
with open(config_path, "r") as f:
config = yaml.safe_load(f)
# Override with environment variables if set
if os.environ.get("DB_HOST"):
config["database"]["host"] = os.environ["DB_HOST"]
if os.environ.get("DB_PASSWORD"):
config["database"]["password"] = os.environ["DB_PASSWORD"]
return config
def save_config(config, config_path="config.yaml"):
"""Save configuration back to YAML file."""
with open(config_path, "w") as f:
yaml.dump(config, f, default_flow_style=False, sort_keys=False)
# Usage
# config = load_config("config.yaml")
# print(config["database"]["host"])
Why YAML over JSON for config? YAML supports comments, is easier to read and edit by hand, and does not require quotes around string keys. JSON is better for data interchange because it is stricter and more widely supported programmatically.
XML (eXtensible Markup Language) is less common for new projects but still widely used in enterprise systems, SOAP APIs, and legacy codebases. Python’s standard library includes xml.etree.ElementTree for working with XML.
import xml.etree.ElementTree as ET
# Create XML programmatically
root = ET.Element("users")
user1 = ET.SubElement(root, "user", id="1")
ET.SubElement(user1, "name").text = "Folau"
ET.SubElement(user1, "email").text = "folau@example.com"
ET.SubElement(user1, "role").text = "admin"
user2 = ET.SubElement(root, "user", id="2")
ET.SubElement(user2, "name").text = "Jane"
ET.SubElement(user2, "email").text = "jane@example.com"
ET.SubElement(user2, "role").text = "developer"
# Convert to string
xml_string = ET.tostring(root, encoding="unicode", xml_declaration=True)
print(xml_string)
import xml.etree.ElementTree as ET
# Parse XML string
xml_data = """
<users>
<user id="1">
<name>Folau</name>
<email>folau@example.com</email>
<role>admin</role>
</user>
<user id="2">
<name>Jane</name>
<email>jane@example.com</email>
<role>developer</role>
</user>
</users>
"""
root = ET.fromstring(xml_data)
for user in root.findall("user"):
user_id = user.get("id")
name = user.find("name").text
email = user.find("email").text
role = user.find("role").text
print(f"ID: {user_id}, Name: {name}, Email: {email}, Role: {role}")
# Output:
# ID: 1, Name: Folau, Email: folau@example.com, Role: admin
# ID: 2, Name: Jane, Email: jane@example.com, Role: developer
When to use XML: SOAP web services, configuration files for Java-based systems (Maven pom.xml, Android manifests), RSS/Atom feeds, and legacy integrations. For new Python projects, JSON or YAML are almost always better choices.
Python’s dataclasses module (introduced in Python 3.7) provides a clean way to define data-holding classes. Combined with the dataclasses.asdict() function, they integrate well with JSON serialization.
import json
from dataclasses import dataclass, asdict, field
from typing import List
@dataclass
class Address:
street: str
city: str
state: str
zip_code: str
@dataclass
class Employee:
name: str
age: int
department: str
skills: List[str] = field(default_factory=list)
address: Address = None
def to_json(self):
"""Serialize to JSON string."""
return json.dumps(asdict(self), indent=2)
@classmethod
def from_json(cls, json_string):
"""Deserialize from JSON string."""
data = json.loads(json_string)
# Handle nested Address object
if data.get("address"):
data["address"] = Address(**data["address"])
return cls(**data)
# Create and serialize
employee = Employee(
name="Folau",
age=30,
department="Engineering",
skills=["Python", "AWS", "Docker"],
address=Address("123 Main St", "San Francisco", "CA", "94102")
)
json_output = employee.to_json()
print(json_output)
# Deserialize back
restored = Employee.from_json(json_output)
print(restored.name) # Folau
print(restored.address.city) # San Francisco
print(restored.skills) # ['Python', 'AWS', 'Docker']
Why dataclasses for serialization?
asdict() provides automatic conversion to a dictionary (ready for json.dumps())
For production applications that need validation, type coercion, and well-defined schemas, the marshmallow library is the gold standard. It separates your data model from your serialization logic, which keeps things clean as your application grows.
# Install first: pip install marshmallow
from marshmallow import Schema, fields, validate, post_load
class User:
def __init__(self, name, email, age, role="viewer"):
self.name = name
self.email = email
self.age = age
self.role = role
def __repr__(self):
return f"User(name={self.name}, email={self.email}, role={self.role})"
class UserSchema(Schema):
name = fields.Str(required=True, validate=validate.Length(min=1, max=100))
email = fields.Email(required=True)
age = fields.Int(required=True, validate=validate.Range(min=0, max=150))
role = fields.Str(validate=validate.OneOf(["admin", "editor", "viewer"]))
@post_load
def make_user(self, data, **kwargs):
return User(**data)
schema = UserSchema()
# Deserialize (load) — validates and creates object
user_data = {"name": "Folau", "email": "folau@example.com", "age": 30, "role": "admin"}
user = schema.load(user_data)
print(user) # User(name=Folau, email=folau@example.com, role=admin)
# Serialize (dump) — converts object to dict
output = schema.dump(user)
print(output) # {'name': 'Folau', 'email': 'folau@example.com', 'age': 30, 'role': 'admin'}
# Validation error example
try:
bad_data = {"name": "", "email": "not-an-email", "age": -5}
schema.load(bad_data)
except Exception as e:
print(f"Validation errors: {e}")
Key benefits of marshmallow:
- Validation and type coercion happen at load time, so bad data is rejected before it reaches your code
- Schemas keep serialization rules separate from your data model
- Field-level validation errors make failures easy to diagnose
This is one of the most common real-world serialization tasks: fetching data from a REST API, processing it, and serializing the results.
import json
import urllib.request
from dataclasses import dataclass, asdict
from typing import List, Optional
@dataclass
class Todo:
id: int
title: str
completed: bool
user_id: int
@classmethod
def from_api_response(cls, data: dict) -> "Todo":
"""Create Todo from API response dict."""
return cls(
id=data["id"],
title=data["title"],
completed=data["completed"],
user_id=data["userId"]
)
def fetch_todos(limit: int = 10) -> List[Todo]:
"""Fetch todos from JSONPlaceholder API."""
url = f"https://jsonplaceholder.typicode.com/todos?_limit={limit}"
with urllib.request.urlopen(url) as response:
data = json.loads(response.read().decode())
return [Todo.from_api_response(item) for item in data]
def save_todos(todos: List[Todo], filepath: str):
"""Serialize todos to JSON file."""
data = [asdict(todo) for todo in todos]
with open(filepath, "w") as f:
json.dump(data, f, indent=2)
print(f"Saved {len(todos)} todos to {filepath}")
def load_todos(filepath: str) -> List[Todo]:
"""Deserialize todos from JSON file."""
with open(filepath, "r") as f:
data = json.load(f)
return [Todo(**item) for item in data]
# Fetch from API, process, and save
todos = fetch_todos(limit=5)
completed = [t for t in todos if t.completed]
print(f"Completed: {len(completed)} / {len(todos)}")
save_todos(todos, "todos.json")
restored = load_todos("todos.json")
print(f"Loaded {len(restored)} todos from file")
import json
import os
from datetime import datetime
class ConfigManager:
"""Manage application configuration with JSON persistence."""
def __init__(self, config_path="app_config.json"):
self.config_path = config_path
self.config = self._load_or_create()
def _load_or_create(self):
"""Load existing config or create default."""
if os.path.exists(self.config_path):
with open(self.config_path, "r") as f:
return json.load(f)
return self._default_config()
def _default_config(self):
"""Return default configuration."""
return {
"app_name": "MyApp",
"version": "1.0.0",
"database": {
"host": "localhost",
"port": 5432,
"name": "myapp_db"
},
"logging": {
"level": "INFO",
"file": "app.log"
},
"last_modified": datetime.now().isoformat()
}
def get(self, key, default=None):
"""Get a config value using dot notation: 'database.host'."""
keys = key.split(".")
value = self.config
for k in keys:
if isinstance(value, dict) and k in value:
value = value[k]
else:
return default
return value
def set(self, key, value):
"""Set a config value using dot notation."""
keys = key.split(".")
config = self.config
for k in keys[:-1]:
config = config.setdefault(k, {})
config[keys[-1]] = value
self.config["last_modified"] = datetime.now().isoformat()
self._save()
def _save(self):
"""Persist config to disk."""
with open(self.config_path, "w") as f:
json.dump(self.config, f, indent=2)
# Usage
config = ConfigManager("app_config.json")
print(config.get("database.host")) # localhost
print(config.get("logging.level")) # INFO
config.set("database.host", "db.production.com")
config.set("logging.level", "WARNING")
print(config.get("database.host")) # db.production.com
import json
import csv
import os
class DataExporter:
"""Export and import data between JSON and CSV formats."""
@staticmethod
def json_to_csv(json_path, csv_path):
"""Convert a JSON array of objects to CSV."""
with open(json_path, "r") as f:
data = json.load(f)
if not data:
print("No data to export")
return
# Use keys from first record as CSV headers
headers = list(data[0].keys())
with open(csv_path, "w", newline="") as f:
writer = csv.DictWriter(f, fieldnames=headers)
writer.writeheader()
writer.writerows(data)
print(f"Exported {len(data)} records to {csv_path}")
@staticmethod
def csv_to_json(csv_path, json_path):
"""Convert CSV to JSON array of objects."""
records = []
with open(csv_path, "r") as f:
reader = csv.DictReader(f)
for row in reader:
records.append(dict(row))
with open(json_path, "w") as f:
json.dump(records, f, indent=2)
print(f"Imported {len(records)} records to {json_path}")
@staticmethod
def export_summary(data, output_path):
"""Export a summary report as JSON."""
summary = {
"total_records": len(data),
"exported_at": __import__("datetime").datetime.now().isoformat(),
"sample": data[:3] if len(data) >= 3 else data
}
with open(output_path, "w") as f:
json.dump(summary, f, indent=2)
print(f"Summary saved to {output_path}")
# Example usage
employees = [
{"name": "Folau", "department": "Engineering", "salary": 95000},
{"name": "Jane", "department": "Marketing", "salary": 85000},
{"name": "Bob", "department": "Engineering", "salary": 90000},
]
# Save as JSON
with open("employees.json", "w") as f:
json.dump(employees, f, indent=2)
# Convert JSON to CSV
exporter = DataExporter()
exporter.json_to_csv("employees.json", "employees.csv")
exporter.csv_to_json("employees.csv", "employees_restored.json")
import pickle
import os
import time
import hashlib
from functools import wraps
def pickle_cache(cache_dir=".cache"):
"""Decorator that caches function results using pickle."""
def decorator(func):
@wraps(func)
def wrapper(*args, **kwargs):
# Create cache directory if needed
os.makedirs(cache_dir, exist_ok=True)
# Generate a unique cache key from function name and arguments
key_data = f"{func.__name__}:{args}:{sorted(kwargs.items())}"
cache_key = hashlib.md5(key_data.encode()).hexdigest()
cache_path = os.path.join(cache_dir, f"{cache_key}.pkl")
# Return cached result if available
if os.path.exists(cache_path):
print(f"Cache HIT for {func.__name__}")
with open(cache_path, "rb") as f:
return pickle.load(f)
# Compute and cache the result
print(f"Cache MISS for {func.__name__} — computing...")
result = func(*args, **kwargs)
with open(cache_path, "wb") as f:
pickle.dump(result, f)
return result
return wrapper
return decorator
@pickle_cache()
def expensive_computation(n):
"""Simulate a slow computation."""
time.sleep(2) # Pretend this takes a long time
return {i: i ** 3 for i in range(n)}
# First call: takes 2 seconds (cache MISS)
start = time.time()
result1 = expensive_computation(1000)
print(f"First call: {time.time() - start:.2f}s")
# Second call: instant (cache HIT)
start = time.time()
result2 = expensive_computation(1000)
print(f"Second call: {time.time() - start:.2f}s")
print(f"Results match: {result1 == result2}")
This is the single most important pitfall. As demonstrated earlier, pickle.loads() can execute arbitrary code. Never use pickle to deserialize data from user input, external APIs, or any untrusted source. Use JSON instead.
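When pickle is still the right tool (say, a local cache file that could be swapped out from under you), one common mitigation is to sign the pickled bytes with hmac and verify the signature before unpickling. A sketch; SECRET_KEY is a placeholder you would load from a real secret store:

```python
import hashlib
import hmac
import pickle

SECRET_KEY = b"replace-with-a-real-secret"  # placeholder key

def dumps_signed(obj) -> bytes:
    """Prefix the pickle payload with an HMAC-SHA256 signature."""
    payload = pickle.dumps(obj)
    sig = hmac.new(SECRET_KEY, payload, hashlib.sha256).digest()
    return sig + payload

def loads_signed(blob: bytes):
    """Verify the signature before touching pickle.loads()."""
    sig, payload = blob[:32], blob[32:]  # SHA-256 digest is 32 bytes
    expected = hmac.new(SECRET_KEY, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(sig, expected):
        raise ValueError("Signature mismatch: refusing to unpickle")
    return pickle.loads(payload)

data = {"cached": [1, 2, 3]}
blob = dumps_signed(data)
print(loads_signed(blob))  # {'cached': [1, 2, 3]}
```

This only proves the bytes were produced by someone holding the key; it does nothing against an attacker who has the key.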
import json
# Problem: non-ASCII characters
data = {"city": "Sao Paulo", "greeting": "Hola, como estas?"}
# Default behavior escapes non-ASCII
print(json.dumps(data))
# {"city": "Sao Paulo", "greeting": "Hola, \u00bfcomo est\u00e1s?"}
# Fix: use ensure_ascii=False
print(json.dumps(data, ensure_ascii=False))
# {"city": "Sao Paulo", "greeting": "Hola, como estas?"}
# When writing to files, always specify encoding
with open("data.json", "w", encoding="utf-8") as f:
json.dump(data, f, ensure_ascii=False, indent=2)
import json
# This will raise ValueError: Circular reference detected
a = {}
b = {"ref": a}
a["ref"] = b
try:
json.dumps(a)
except ValueError as e:
print(f"Error: {e}") # Circular reference detected
# Solution: break circular references before serializing
# or use a custom encoder that tracks visited objects
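Here is what "break circular references before serializing" can look like in practice: a recursive copy that swaps any container already seen on the current path for a placeholder string. A sketch; the "&lt;circular&gt;" marker is an arbitrary choice:

```python
import json

def break_cycles(obj, _seen=frozenset()):
    """Recursively copy obj, replacing any container already visited on
    the current path with a placeholder so json.dumps cannot loop forever."""
    if id(obj) in _seen:
        return "<circular>"
    if isinstance(obj, dict):
        seen = _seen | {id(obj)}
        return {k: break_cycles(v, seen) for k, v in obj.items()}
    if isinstance(obj, list):
        seen = _seen | {id(obj)}
        return [break_cycles(v, seen) for v in obj]
    return obj

a = {}
b = {"ref": a}
a["ref"] = b
print(json.dumps(break_cycles(a)))  # {"ref": {"ref": "<circular>"}}
```

Tracking ids per path (rather than globally) means shared-but-acyclic structures are still serialized in full.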
import json
from datetime import datetime
# Problem: datetime is not JSON-serializable
event = {"name": "Deploy", "timestamp": datetime.now()}
# Solution 1: Convert to ISO format string
event["timestamp"] = event["timestamp"].isoformat()
print(json.dumps(event))
# Solution 2: Use the default parameter
def default_handler(obj):
if hasattr(obj, "isoformat"):
return obj.isoformat()
raise TypeError(f"Cannot serialize {type(obj)}")
event2 = {"name": "Deploy", "timestamp": datetime.now()}
print(json.dumps(event2, default=default_handler))
# Deserializing back to datetime
json_str = '{"name": "Deploy", "timestamp": "2024-01-15T10:30:00"}'
data = json.loads(json_str)
data["timestamp"] = datetime.fromisoformat(data["timestamp"])
print(type(data["timestamp"])) # <class 'datetime.datetime'>
import json
# Python allows non-string keys in dicts
data = {1: "one", 2: "two", (3, 4): "tuple_key"}
# JSON only allows string keys — this converts int keys to strings
result = json.dumps({1: "one", 2: "two"})
print(result) # {"1": "one", "2": "two"}
parsed = json.loads(result)
print(parsed["1"]) # "one" — note the key is now a string!
# print(parsed[1]) # KeyError! The key is "1", not 1
# Tuple keys will raise TypeError
try:
json.dumps(data)
except TypeError as e:
print(f"Error: {e}")
After years of working with serialization in production systems, here are the practices that matter most:
encoding="utf-8" when opening files, and use ensure_ascii=False if your data contains non-ASCII characters.yaml.safe_load(), never yaml.load() without a Loader. The full yaml.load() can execute arbitrary Python code, similar to pickle.to_dict() / from_dict() methods on your classes, or use schemas (marshmallow) to define exactly what gets serialized and how..get() with defaults rather than direct key access. Data schemas evolve, and old serialized data may lack newer fields.json.dumps() calls throughout your code. Centralize serialization in dedicated methods or schema classes.
- JSON (the json module) is the go-to format for APIs and human-readable data. Use dumps/loads for strings, dump/load for files.
- YAML (PyYAML) excels at configuration files. Always use safe_load().
- XML (ElementTree) is for enterprise/legacy integrations.
- Dataclasses with asdict() provide a clean, zero-dependency path from Python objects to JSON.
- Handle datetime, encoding, and non-string keys explicitly — they are the most common sources of serialization bugs.
Functional programming is a paradigm that treats computation as the evaluation of mathematical functions. Rather than telling the computer how to do something step by step (imperative style), you describe what you want to achieve by composing pure functions that transform data without side effects.
Python is not a purely functional language, but it borrows heavily from the functional tradition. Three of the most important functional tools in Python are map(), filter(), and reduce(). These functions let you process collections of data in a declarative, composable way — and understanding them will make you a stronger Python developer.
Here is why these three functions matter:
Together, they form the backbone of data processing pipelines. Whether you are cleaning datasets, transforming API responses, or building ETL jobs, you will reach for these tools constantly.
map(function, iterable, *iterables)
map() applies a function to every item in one or more iterables and returns a map object (an iterator). It does not modify the original data — it produces a new sequence of transformed values.
# Basic usage
numbers = [1, 2, 3, 4, 5]
squared = map(lambda x: x ** 2, numbers)
print(list(squared))
# Output: [1, 4, 9, 16, 25]
Notice that map() returns an iterator, not a list. You need to wrap it in list() to see all the values at once. This lazy evaluation is by design — it is memory efficient for large datasets.
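One consequence of map() returning an iterator: it can only be consumed once. If you need the values more than once, materialize them with list() first:

```python
squared = map(lambda x: x ** 2, [1, 2, 3])
print(list(squared))  # [1, 4, 9]
print(list(squared))  # [] -- the iterator is exhausted after the first pass
```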
def celsius_to_fahrenheit(celsius):
return (celsius * 9/5) + 32
temperatures_c = [0, 20, 37, 100]
temperatures_f = list(map(celsius_to_fahrenheit, temperatures_c))
print(temperatures_f)
# Output: [32.0, 68.0, 98.6, 212.0]
This is clean, readable, and intention-revealing. The function name tells you exactly what transformation is happening. No loop boilerplate, no index management.
This is a pattern you will use all the time when working with API responses or database results.
employees = [
{"name": "Alice", "department": "Engineering", "salary": 95000},
{"name": "Bob", "department": "Marketing", "salary": 72000},
{"name": "Charlie", "department": "Engineering", "salary": 105000},
{"name": "Diana", "department": "HR", "salary": 68000},
]
# Extract just the names
names = list(map(lambda emp: emp["name"], employees))
print(names)
# Output: ['Alice', 'Bob', 'Charlie', 'Diana']
# Extract name and salary as tuples
name_salary = list(map(lambda emp: (emp["name"], emp["salary"]), employees))
print(name_salary)
# Output: [('Alice', 95000), ('Bob', 72000), ('Charlie', 105000), ('Diana', 68000)]
When you pass multiple iterables to map(), the function must accept that many arguments. The iteration stops when the shortest iterable is exhausted.
# Add corresponding elements from two lists
list_a = [1, 2, 3, 4]
list_b = [10, 20, 30, 40]
sums = list(map(lambda a, b: a + b, list_a, list_b))
print(sums)
# Output: [11, 22, 33, 44]
# Calculate weighted scores
scores = [85, 92, 78, 95]
weights = [0.2, 0.3, 0.25, 0.25]
weighted = list(map(lambda s, w: round(s * w, 2), scores, weights))
print(weighted)
# Output: [17.0, 27.6, 19.5, 23.75]
total_weighted_score = sum(weighted)
print(f"Total weighted score: {total_weighted_score}")
# Output: Total weighted score: 87.85
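The "stops at the shortest iterable" rule means mismatched lengths are silently truncated rather than raising an error, which is worth remembering when the inputs come from different sources:

```python
short = [1, 2]
long = [10, 20, 30, 40]

# No error: the extra elements of `long` are simply ignored
print(list(map(lambda a, b: a + b, short, long)))  # [11, 22]
```

If truncation would hide a bug, check the lengths up front instead of relying on this behavior.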
In Python, list comprehensions can do everything map() does and are often considered more Pythonic.
numbers = [1, 2, 3, 4, 5]
# Using map
squared_map = list(map(lambda x: x ** 2, numbers))
# Using list comprehension
squared_comp = [x ** 2 for x in numbers]
# Both produce: [1, 4, 9, 16, 25]
When to use map():
- You already have a named function to apply: list(map(str, numbers)) is cleaner than [str(x) for x in numbers].
- You want lazy evaluation and don't need the whole result at once (skip the list() call).

When to use list comprehension:
- The transformation needs a lambda anyway: [x ** 2 for x in numbers] reads better than the map() version.
- You want to filter and transform in a single expression.
filter(function, iterable)
filter() takes a function that returns True or False (a predicate) and an iterable. It returns an iterator containing only the elements for which the predicate returned True.
# Basic usage
numbers = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
evens = list(filter(lambda x: x % 2 == 0, numbers))
print(evens)
# Output: [2, 4, 6, 8, 10]
numbers = range(1, 21) # 1 through 20
evens = list(filter(lambda x: x % 2 == 0, numbers))
odds = list(filter(lambda x: x % 2 != 0, numbers))
print(f"Even: {evens}")
# Output: Even: [2, 4, 6, 8, 10, 12, 14, 16, 18, 20]
print(f"Odd: {odds}")
# Output: Odd: [1, 3, 5, 7, 9, 11, 13, 15, 17, 19]
Here is a practical example you might encounter when processing user input or cleaning data.
import re
def is_valid_email(email):
"""Basic email validation."""
pattern = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
return bool(re.match(pattern, email))
emails = [
"alice@example.com",
"bob@company.org",
"not-an-email",
"charlie@",
"diana@domain.co.uk",
"@missing-local.com",
"eve@valid.io",
]
valid_emails = list(filter(is_valid_email, emails))
print(valid_emails)
# Output: ['alice@example.com', 'bob@company.org', 'diana@domain.co.uk', 'eve@valid.io']
invalid_emails = list(filter(lambda e: not is_valid_email(e), emails))
print(invalid_emails)
# Output: ['not-an-email', 'charlie@', '@missing-local.com']
class Product:
def __init__(self, name, price, in_stock):
self.name = name
self.price = price
self.in_stock = in_stock
def __repr__(self):
return f"Product({self.name}, ${self.price}, {'In Stock' if self.in_stock else 'Out of Stock'})"
products = [
Product("Laptop", 999.99, True),
Product("Mouse", 29.99, True),
Product("Keyboard", 79.99, False),
Product("Monitor", 349.99, True),
Product("Webcam", 69.99, False),
Product("Headset", 149.99, True),
]
# Filter products that are in stock and under $200
affordable_in_stock = list(filter(
lambda p: p.in_stock and p.price < 200,
products
))
print(affordable_in_stock)
# Output: [Product(Mouse, $29.99, In Stock), Product(Headset, $149.99, In Stock)]
If you pass None as the function, filter() removes all falsy values from the iterable.
mixed = [0, 1, "", "hello", None, True, False, [], [1, 2], {}, {"key": "val"}]
truthy_values = list(filter(None, mixed))
print(truthy_values)
# Output: [1, 'hello', True, [1, 2], {'key': 'val'}]
This is a clean way to strip out empty strings, zeros, None values, and empty collections in one shot.
numbers = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
# Using filter
evens_filter = list(filter(lambda x: x % 2 == 0, numbers))
# Using list comprehension
evens_comp = [x for x in numbers if x % 2 == 0]
# Both produce: [2, 4, 6, 8, 10]
The list comprehension is arguably more readable here. But filter() shines when you already have a named predicate function — list(filter(is_valid_email, emails)) reads almost like English.
from functools import reduce
reduce(function, iterable[, initializer])
reduce() applies a function of two arguments cumulatively to the items in an iterable, from left to right, reducing the iterable to a single value. Unlike map() and filter(), reduce() is not a built-in — you must import it from the functools module.
Here is how it works step by step:
from functools import reduce
numbers = [1, 2, 3, 4, 5]
# Step-by-step: reduce(lambda a, b: a + b, [1, 2, 3, 4, 5])
# Step 1: a=1, b=2 -> 3
# Step 2: a=3, b=3 -> 6
# Step 3: a=6, b=4 -> 10
# Step 4: a=10, b=5 -> 15
total = reduce(lambda a, b: a + b, numbers)
print(total)
# Output: 15
from functools import reduce
# Sum of all numbers
numbers = [10, 20, 30, 40, 50]
total = reduce(lambda acc, x: acc + x, numbers)
print(f"Sum: {total}")
# Output: Sum: 150
# Of course, Python has a built-in sum() for this.
# But reduce() generalizes to any binary operation.
print(f"Sum (built-in): {sum(numbers)}")
# Output: Sum (built-in): 150
from functools import reduce
numbers = [34, 12, 89, 45, 67, 23, 91, 56]
maximum = reduce(lambda a, b: a if a > b else b, numbers)
print(f"Maximum: {maximum}")
# Output: Maximum: 91
minimum = reduce(lambda a, b: a if a < b else b, numbers)
print(f"Minimum: {minimum}")
# Output: Minimum: 12
Again, Python has max() and min() built-ins for this. But this demonstrates the pattern: reduce() compresses a collection by repeatedly applying a binary operation.
from functools import reduce
nested = [[1, 2, 3], [4, 5], [6, 7, 8, 9], [10]]
flattened = reduce(lambda acc, lst: acc + lst, nested)
print(flattened)
# Output: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
This works because the + operator concatenates lists. The accumulator starts as [1, 2, 3], is concatenated with [4, 5] to give [1, 2, 3, 4, 5], and so on.
from functools import reduce
words = ["Python", "is", "a", "powerful", "language"]
sentence = reduce(lambda acc, word: acc + " " + word, words)
print(sentence)
# Output: Python is a powerful language
# In practice, you would use str.join() for this:
print(" ".join(words))
# Output: Python is a powerful language
The optional third argument to reduce() is the initializer. It serves as the starting value for the accumulation and is used as the default if the iterable is empty.
from functools import reduce
# Without initializer - fails on empty list
try:
result = reduce(lambda a, b: a + b, [])
except TypeError as e:
print(f"Error: {e}")
# Output: Error: reduce() of empty sequence with no initial value
# With initializer - returns the initializer for empty list
result = reduce(lambda a, b: a + b, [], 0)
print(f"Empty list with initializer: {result}")
# Output: Empty list with initializer: 0
# Counting word frequencies with reduce
words = ["apple", "banana", "apple", "cherry", "banana", "apple"]
word_counts = reduce(
lambda acc, word: {**acc, word: acc.get(word, 0) + 1},
words,
{} # initializer: empty dictionary
)
print(word_counts)
# Output: {'apple': 3, 'banana': 2, 'cherry': 1}
The initializer is critical when you need the accumulator to be a different type than the elements. In the word-counting example above, the elements are strings but the accumulator is a dictionary.
The real power of these functions emerges when you chain them together into data processing pipelines. Here is a real-world example: processing employee data to compute total salary expenditure for active engineering staff.
from functools import reduce
employees = [
{"name": "Alice", "department": "Engineering", "salary": 95000, "active": True},
{"name": "Bob", "department": "Marketing", "salary": 72000, "active": True},
{"name": "Charlie", "department": "Engineering", "salary": 105000, "active": False},
{"name": "Diana", "department": "HR", "salary": 68000, "active": True},
{"name": "Eve", "department": "Engineering", "salary": 112000, "active": True},
{"name": "Frank", "department": "Engineering", "salary": 89000, "active": True},
{"name": "Grace", "department": "Marketing", "salary": 78000, "active": False},
]
# Pipeline: filter active engineers -> extract salaries -> compute total
active_engineers = filter(
lambda emp: emp["active"] and emp["department"] == "Engineering",
employees
)
salaries = map(lambda emp: emp["salary"], active_engineers)
total_salary = reduce(lambda acc, sal: acc + sal, salaries, 0)
print(f"Total salary for active engineers: ${total_salary:,}")
# Output: Total salary for active engineers: $296,000
Notice how each step has a single responsibility:
Because filter() and map() return iterators, no intermediate lists are created. The data flows through the pipeline lazily, one element at a time.
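You can see that laziness directly: pulling a single element from the end of a pipeline computes only that element, even over a huge input. A small demonstration using next():

```python
big = range(10_000_000)

evens = filter(lambda x: x % 2 == 0, big)
halved = map(lambda x: x // 2, evens)

# Nothing has been computed yet; each next() pulls one element through
print(next(halved))  # 0
print(next(halved))  # 1
```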
Here is another example — computing the average score of students who passed:
from functools import reduce
students = [
{"name": "Alice", "score": 92},
{"name": "Bob", "score": 45},
{"name": "Charlie", "score": 78},
{"name": "Diana", "score": 34},
{"name": "Eve", "score": 88},
{"name": "Frank", "score": 65},
{"name": "Grace", "score": 55},
]
# Step 1: Filter students who passed (score >= 60)
passed = list(filter(lambda s: s["score"] >= 60, students))
# Step 2: Extract scores
scores = list(map(lambda s: s["score"], passed))
# Step 3: Compute average using reduce
total = reduce(lambda acc, s: acc + s, scores, 0)
average = total / len(scores)
print(f"Passing students: {[s['name'] for s in passed]}")
# Output: Passing students: ['Alice', 'Charlie', 'Eve', 'Frank']
print(f"Average passing score: {average:.1f}")
# Output: Average passing score: 80.8
Lambda functions are anonymous, single-expression functions. They are the natural companion to map(), filter(), and reduce() because they let you define small transformation or predicate logic inline without naming a separate function.
# Lambda syntax: lambda arguments: expression

# Square numbers
list(map(lambda x: x ** 2, [1, 2, 3, 4]))  # [1, 4, 9, 16]

# Filter strings longer than 3 characters
list(filter(lambda s: len(s) > 3, ["hi", "hello", "hey", "howdy"]))  # ['hello', 'howdy']

# Multiply all numbers together
from functools import reduce
reduce(lambda a, b: a * b, [1, 2, 3, 4, 5])  # 120 (factorial of 5)
A word of caution: Lambdas are great for simple, obvious operations. But if your lambda spans multiple conditions or is hard to read at a glance, extract it into a named function. Readability always wins.
# Bad: complex lambda is hard to parse
result = list(filter(
lambda x: x["active"] and x["age"] > 25 and x["department"] in ["Engineering", "Product"],
employees
))
# Better: named function with a clear name
def is_eligible_engineer(emp):
    return (
        emp["active"]
        and emp["age"] > 25
        and emp["department"] in ["Engineering", "Product"]
    )
result = list(filter(is_eligible_engineer, employees))
Here is a practical decision guide for choosing between these tools. First, for map() versus a list comprehension:
| Scenario | Prefer |
|---|---|
| Applying an existing named function | map(str, numbers) |
| Simple inline transformation | [x * 2 for x in numbers] |
| Multiple iterables | map(func, iter1, iter2) |
| Need lazy evaluation | map(func, iterable) |
| Transformation + filtering together | [x * 2 for x in numbers if x > 0] |
And for filter() versus a conditional comprehension:
| Scenario | Prefer |
|---|---|
| Applying an existing predicate function | filter(is_valid, items) |
| Simple inline condition | [x for x in items if x > 0] |
| Removing falsy values | filter(None, items) |
| Need lazy evaluation | filter(func, iterable) |
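One row in the table deserves a demonstration: passing None as the predicate makes filter() drop every falsy element (zeros, empty strings, None, empty containers, False):

```python
items = [0, 1, "", "hello", None, [], [2, 3], False]

truthy = list(filter(None, items))
print(truthy)  # [1, 'hello', [2, 3]]
```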
For reduce(), the guidance is simpler: skip it for basic aggregations (Python already ships sum() and math.prod() for those), and reach for itertools.accumulate() if you need the intermediate results of a running fold.

In Python 3, both map() and filter() return iterators, not lists. This means they compute values on demand, which has significant memory benefits for large datasets.
import sys
# List comprehension creates entire list in memory
big_list = [x ** 2 for x in range(1_000_000)]
print(f"List size: {sys.getsizeof(big_list):,} bytes")
# Output: List size: 8,448,728 bytes
# map() returns a tiny iterator object
big_map = map(lambda x: x ** 2, range(1_000_000))
print(f"Map size: {sys.getsizeof(big_map)} bytes")
# Output: Map size: 48 bytes
The map object is only 48 bytes regardless of how many elements it will produce. The values are computed only when you iterate over them.
For complex transformations, generator expressions offer the same lazy evaluation benefits as map() and filter() with more readable syntax.
# Generator expression - lazy, like map/filter
squared_gen = (x ** 2 for x in range(1_000_000))
# You can chain filter and map logic in one generator
result = (
x ** 2
for x in range(1_000_000)
if x % 2 == 0
)
# Process lazily - never loads everything into memory
for value in result:
    if value > 100:
        break
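The combined generator above behaves exactly like a filter()/map() chain; a quick equivalence check on a smaller range:

```python
gen = (x ** 2 for x in range(10) if x % 2 == 0)
chained = map(lambda x: x ** 2, filter(lambda x: x % 2 == 0, range(10)))

gen_result = list(gen)
chained_result = list(chained)
print(gen_result)                    # [0, 4, 16, 36, 64]
print(gen_result == chained_result)  # True
```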
How do these approaches compare in speed? A quick benchmark with timeit:
import timeit
numbers = list(range(10_000))
# map with lambda
t1 = timeit.timeit(lambda: list(map(lambda x: x * 2, numbers)), number=1000)
# list comprehension
t2 = timeit.timeit(lambda: [x * 2 for x in numbers], number=1000)
# map with named function
def double(x):
    return x * 2
t3 = timeit.timeit(lambda: list(map(double, numbers)), number=1000)
print(f"map + lambda: {t1:.4f}s")
print(f"comprehension: {t2:.4f}s")
print(f"map + named func: {t3:.4f}s")
# Typical results:
# map + lambda: 0.8500s
# comprehension: 0.5200s
# map + named func: 0.7100s
# List comprehensions are usually fastest for simple operations
The takeaway: list comprehensions tend to be slightly faster than map() with a lambda, because they avoid the overhead of a function call on each iteration. However, the difference is negligible for most applications — choose based on readability.
# This will fail in Python 3
# reduce(lambda a, b: a + b, [1, 2, 3])
# NameError: name 'reduce' is not defined

# Correct: import it first
from functools import reduce
reduce(lambda a, b: a + b, [1, 2, 3])  # 6
In Python 2, reduce() was a built-in. Guido van Rossum moved it to functools in Python 3 because he felt it was overused and often less readable than a simple loop.
# This might surprise you
result = map(lambda x: x * 2, [1, 2, 3])
print(result)
# Output: <map object at 0x...>

# You need to consume the iterator
print(list(result))
# Output: [2, 4, 6]

# CAUTION: iterators are exhausted after one pass
result = map(lambda x: x * 2, [1, 2, 3])
print(list(result))  # [2, 4, 6]
print(list(result))  # [] -- empty! The iterator is spent.
This is a frequent source of bugs. If you need to iterate over the result multiple times, convert it to a list first.
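If materializing a full list is too expensive, itertools.tee() is another option: it splits one iterator into independent iterators that share a buffer. A sketch:

```python
import itertools

doubled = map(lambda x: x * 2, [1, 2, 3])
a, b = itertools.tee(doubled)

# Each copy can be consumed independently
first_pass = list(a)
second_pass = list(b)
print(first_pass, second_pass)  # [2, 4, 6] [2, 4, 6]
```

Note that tee() still buffers values internally, so it only saves memory when the copies are consumed roughly in step with each other.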
# Overly clever - hard to debug and understand
result = list(map(lambda x: (lambda y: y ** 2 + 2 * y + 1)(x), range(10)))
# Just use a regular function
def transform(x):
    return x ** 2 + 2 * x + 1
result = list(map(transform, range(10)))
# Or better yet:
result = [x ** 2 + 2 * x + 1 for x in range(10)]
from functools import reduce
import math

numbers = [1, 2, 3, 4, 5]

# Unnecessary reduce usage
total = reduce(lambda a, b: a + b, numbers)        # Use sum(numbers)
product = reduce(lambda a, b: a * b, numbers)      # Use math.prod(numbers)
biggest = reduce(lambda a, b: max(a, b), numbers)  # Use max(numbers)
joined = reduce(lambda a, b: a + " " + b, ["a", "b", "c"])  # Use " ".join(...)

# Python has built-ins for all of these. Use them.
The goal is code that your teammates (and future you) can understand at a glance. Functional style should make code clearer, not more obscure.
# Clear and readable
active_users = [user for user in users if user.is_active]
usernames = [user.name for user in active_users]

# Also clear, different style
active_users = filter(lambda u: u.is_active, users)
usernames = list(map(lambda u: u.name, active_users))
# Comprehension handles both in one expression
result = [x ** 2 for x in numbers if x > 0]

# map + filter requires nesting or chaining
result = list(map(lambda x: x ** 2, filter(lambda x: x > 0, numbers)))
The comprehension is almost always more readable when you need both transformation and filtering.
def calculate_tax(income):
    if income < 30000:
        return income * 0.1
    elif income < 70000:
        return income * 0.2
    else:
        return income * 0.3
incomes = [25000, 45000, 85000, 60000, 120000]
taxes = list(map(calculate_tax, incomes))
print(taxes)
# Output: [2500.0, 9000.0, 25500.0, 12000.0, 36000.0]
Named functions are testable, documentable, and reusable. Lambda functions are none of these.
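That testability is concrete: the named calculate_tax from above can be exercised directly with plain assertions (repeated here so the snippet is self-contained):

```python
def calculate_tax(income):
    if income < 30000:
        return income * 0.1
    elif income < 70000:
        return income * 0.2
    else:
        return income * 0.3

# Unit-test style checks -- impossible to write against an inline lambda
assert calculate_tax(25000) == 2500.0
assert calculate_tax(45000) == 9000.0
assert calculate_tax(85000) == 25500.0
print("all tax checks passed")
```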
from functools import reduce
# Processing a log file: extract errors, get timestamps, find the latest
log_entries = [
{"level": "INFO", "timestamp": "2024-01-15 10:30:00", "message": "Started"},
{"level": "ERROR", "timestamp": "2024-01-15 10:31:00", "message": "DB timeout"},
{"level": "INFO", "timestamp": "2024-01-15 10:32:00", "message": "Retrying"},
{"level": "ERROR", "timestamp": "2024-01-15 10:33:00", "message": "DB timeout again"},
{"level": "INFO", "timestamp": "2024-01-15 10:34:00", "message": "Recovered"},
]
errors = filter(lambda e: e["level"] == "ERROR", log_entries)
timestamps = map(lambda e: e["timestamp"], errors)
latest_error = reduce(lambda a, b: max(a, b), timestamps)
print(f"Latest error at: {latest_error}")
# Output: Latest error at: 2024-01-15 10:33:00
A quick summary of the guidance so far:
- Import reduce() from functools and use it for non-trivial aggregations.
- Wrap an iterator in list() when you need a list.
- Reach for map()/filter() when you have named functions or need lazy evaluation.
- Prefer the built-ins (sum(), max(), min(), str.join()) when they fit — do not reinvent the wheel with reduce().

Imagine you are a detective investigating a crime scene. Without evidence — fingerprints, security camera footage, witness statements — you would have no way to reconstruct what happened. Logging is the evidence trail for your application. It records what your program did, when it did it, and what went wrong.
Logging is the practice of recording messages from your application during runtime. These messages capture events, errors, state changes, and diagnostic information that help you understand your application’s behavior — especially when things go wrong in production at 3 AM and you cannot attach a debugger.
Every Java developer starts with System.out.println() for debugging. It works, but it is the equivalent of using a flashlight when you need a full surveillance system. Here is why it falls short in real applications:
| Feature | System.out.println | Logging Framework |
|---|---|---|
| Severity levels | None — everything looks the same | TRACE, DEBUG, INFO, WARN, ERROR |
| On/off control | Must delete or comment out lines | Change config file, no code changes |
| Output destination | Console only | Console, files, databases, remote servers |
| Timestamps | Must add manually | Automatic |
| Thread info | Must add manually | Automatic |
| Class/method info | Must add manually | Automatic |
| File rotation | Not possible | Automatic (e.g., daily, by size) |
| Performance | Always executes string building | Lazy evaluation, skip if level disabled |
| Production ready | No | Yes |
A simple rule for log content, regardless of which tool you use: if you would not want it on a billboard, do not log it. Passwords, tokens, and personal data have no place in log files.
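In practice the billboard rule means masking sensitive values before they reach the logger. The maskCard helper below is a hypothetical illustration, not a library method:

```java
public class SafeLogging {
    // Hypothetical helper: keep only the last 4 characters visible
    static String maskCard(String cardNumber) {
        int keep = 4;
        if (cardNumber == null || cardNumber.length() <= keep) {
            return "****";
        }
        return "*".repeat(cardNumber.length() - keep)
                + cardNumber.substring(cardNumber.length() - keep);
    }

    public static void main(String[] args) {
        // With SLF4J this would be: log.info("Charging card {}", maskCard(raw));
        System.out.println("Charging card " + maskCard("4111111111111111"));
    }
}
```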
// BAD: System.out.println for debugging
public class BadDebugging {
public void processOrder(Order order) {
System.out.println("Processing order: " + order.getId()); // No timestamp
System.out.println("Order total: " + order.getTotal()); // No severity
System.out.println("Sending to payment..."); // Cannot turn off
// These println calls will clutter production logs forever
}
}
// GOOD: Proper logging
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
public class GoodLogging {
private static final Logger log = LoggerFactory.getLogger(GoodLogging.class);
public void processOrder(Order order) {
log.info("Processing order id={}, total={}", order.getId(), order.getTotal());
log.debug("Sending order to payment gateway");
// Output: 2026-02-28 10:15:32.451 [main] INFO GoodLogging - Processing order id=12345, total=99.99
// In production, DEBUG messages are automatically suppressed
}
}
Java has multiple logging frameworks, which can be confusing for newcomers. Here is the landscape and how the pieces fit together:
| Framework | Type | Description | Status |
|---|---|---|---|
| java.util.logging (JUL) | Implementation | Built into the JDK since Java 1.4. No external dependencies needed. | Active but rarely used in modern projects |
| Log4j 1.x | Implementation | Was the de facto standard for years. Uses log4j.properties or log4j.xml. | END OF LIFE — Critical security vulnerability CVE-2021-44228. DO NOT USE. |
| Log4j 2 | Implementation | Complete rewrite of Log4j. Async logging, plugin architecture, modern design. | Active, maintained by Apache |
| Logback | Implementation | Created by the founder of Log4j as its successor. Native SLF4J implementation. | Active, default in Spring Boot |
| SLF4J | Facade (API) | Simple Logging Facade for Java. An abstraction layer — you code against SLF4J and swap implementations without changing code. | Active, industry standard |
Think of SLF4J like a universal remote control. You press the same buttons regardless of whether your TV is Samsung, LG, or Sony. Similarly, you write logging code using SLF4J’s API, and the actual logging is handled by whichever implementation (Logback, Log4j2) is on the classpath.
This means:
- Your code imports only org.slf4j.Logger — never a specific implementation class.
- You can swap the logging engine by changing a dependency, with no code changes.

For most Java applications in 2026, use: SLF4J (facade) + Logback (implementation). This is the default in Spring Boot and the most widely adopted combination. This tutorial will focus primarily on this stack, but we will also cover JUL and Log4j2.
Log levels let you categorize messages by severity. You can then configure your application to show only messages at or above a certain level — for example, showing everything in development but only WARN and ERROR in production.
| Level | Purpose | When to Use | Example |
|---|---|---|---|
| TRACE | Extremely detailed diagnostic information | Step-by-step algorithm execution, variable values in loops, entering/exiting methods | log.trace("Entering calculateTax with amount={}", amount) |
| DEBUG | Detailed information useful during development | SQL queries executed, cache hit/miss, intermediate calculation results, request/response payloads | log.debug("Query returned {} rows in {}ms", count, elapsed) |
| INFO | Important business or application events | Application started, user logged in, order processed, scheduled job completed | log.info("Order {} placed successfully by user {}", orderId, userId) |
| WARN | Potentially harmful situations that are recoverable | Retry attempts, deprecated API usage, approaching disk/memory limits, fallback to default | log.warn("Payment gateway timeout, retrying (attempt {}/3)", attempt) |
| ERROR | Serious failures that need attention | Unhandled exceptions, failed database connections, data corruption, business rule violations that halt processing | log.error("Failed to process payment for order {}", orderId, exception) |
Levels form a hierarchy. When you set the log level to a certain value, all messages at that level and above are logged. Messages below that level are suppressed.
| Configured Level | TRACE | DEBUG | INFO | WARN | ERROR |
|---|---|---|---|---|---|
| TRACE | Yes | Yes | Yes | Yes | Yes |
| DEBUG | No | Yes | Yes | Yes | Yes |
| INFO | No | No | Yes | Yes | Yes |
| WARN | No | No | No | Yes | Yes |
| ERROR | No | No | No | No | Yes |
Rule of thumb: Development uses DEBUG or TRACE. Production uses INFO (or WARN for very high-throughput systems). You should be able to understand what your application is doing from INFO logs alone.
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
public class LogLevelDemo {
private static final Logger log = LoggerFactory.getLogger(LogLevelDemo.class);
public void processPayment(String orderId, double amount) {
log.trace("Entering processPayment(orderId={}, amount={})", orderId, amount);
log.debug("Validating payment amount: {}", amount);
if (amount <= 0) {
log.warn("Invalid payment amount {} for order {}, using minimum $0.01", amount, orderId);
amount = 0.01;
}
try {
log.info("Processing payment of ${} for order {}", amount, orderId);
// ... payment logic ...
log.info("Payment successful for order {}", orderId);
} catch (Exception e) {
log.error("Payment failed for order {} with amount ${}", orderId, amount, e);
// The exception 'e' is passed as the LAST argument -- SLF4J will print the full stack trace
}
log.trace("Exiting processPayment for order {}", orderId);
}
}
// If level is set to INFO, output would be:
// 2026-02-28 10:30:00.123 [main] INFO LogLevelDemo - Processing payment of $49.99 for order ORD-001
// 2026-02-28 10:30:00.456 [main] INFO LogLevelDemo - Payment successful for order ORD-001
// (TRACE and DEBUG messages are suppressed)
Java includes a built-in logging framework in the java.util.logging package. It requires no external dependencies, which makes it a good starting point for learning and for simple applications where you want zero third-party libraries.
JUL uses its own level names, which differ from SLF4J:
| JUL Level | SLF4J Equivalent | Description |
|---|---|---|
| FINEST | TRACE | Highly detailed tracing |
| FINER | TRACE | Fairly detailed tracing |
| FINE | DEBUG | General debugging |
| CONFIG | - | Configuration info |
| INFO | INFO | Informational messages |
| WARNING | WARN | Potential problems |
| SEVERE | ERROR | Serious failures |
import java.util.logging.Level;
import java.util.logging.Logger;
public class JulExample {
// Create a logger named after the class
private static final Logger logger = Logger.getLogger(JulExample.class.getName());
public static void main(String[] args) {
// Basic logging at different levels
logger.info("Application starting");
logger.warning("Configuration file not found, using defaults");
logger.severe("Database connection failed!");
// Parameterized logging (JUL uses {0}, {1} style -- not {} like SLF4J)
String user = "alice";
int loginAttempts = 3;
logger.log(Level.INFO, "User {0} logged in after {1} attempts", new Object[]{user, loginAttempts});
// Logging an exception
try {
int result = 10 / 0;
} catch (ArithmeticException e) {
logger.log(Level.SEVERE, "Division error occurred", e);
}
// Check if level is enabled before expensive operations
if (logger.isLoggable(Level.FINE)) {
logger.fine("Debug data: " + expensiveToString());
}
}
private static String expensiveToString() {
// Imagine this method is costly to call
return "detailed debug information";
}
}
// Output:
// Feb 28, 2026 10:45:00 AM JulExample main
// INFO: Application starting
// Feb 28, 2026 10:45:00 AM JulExample main
// WARNING: Configuration file not found, using defaults
// Feb 28, 2026 10:45:00 AM JulExample main
// SEVERE: Database connection failed!
While JUL works for simple cases, it has significant drawbacks compared to modern frameworks:
- Verbose parameterized logging: logger.log(Level.INFO, "msg {0}", new Object[]{val}) vs. SLF4J's log.info("msg {}", val)
- Configuration through a global logging.properties file that is awkward to customize per-package

Verdict: Use JUL for quick scripts or when you truly cannot add dependencies. For any real application, use SLF4J + Logback.
SLF4J (Simple Logging Facade for Java) + Logback is the most popular logging stack in the Java ecosystem. Spring Boot uses it by default. SLF4J provides the API you code against; Logback provides the engine that does the actual logging.
<!-- Maven: add to pom.xml -->
<dependency>
    <groupId>org.slf4j</groupId>
    <artifactId>slf4j-api</artifactId>
    <version>2.0.16</version>
</dependency>
<dependency>
    <groupId>ch.qos.logback</groupId>
    <artifactId>logback-classic</artifactId>
    <version>1.5.15</version>
</dependency>
// Gradle: Add to build.gradle
dependencies {
implementation 'org.slf4j:slf4j-api:2.0.16'
implementation 'ch.qos.logback:logback-classic:1.5.15'
}
The setup follows a consistent two-step pattern in every class:
1. Import org.slf4j.Logger and org.slf4j.LoggerFactory.
2. Declare a private static final Logger field using LoggerFactory.getLogger(YourClass.class).

Passing the class to getLogger() means the logger is named after your class, so log output shows exactly which class produced each message.
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
public class UserService {
// Step 1: Declare the logger -- always private static final
private static final Logger log = LoggerFactory.getLogger(UserService.class);
public User findUserById(long id) {
log.info("Looking up user with id={}", id);
User user = userRepository.findById(id);
if (user == null) {
log.warn("User not found for id={}", id);
return null;
}
log.debug("Found user: name={}, email={}", user.getName(), user.getEmail());
return user;
}
}
// Output with INFO level:
// 2026-02-28 10:30:00.123 [main] INFO c.e.service.UserService - Looking up user with id=42
// 2026-02-28 10:30:00.125 [main] WARN c.e.service.UserService - User not found for id=42
This is one of SLF4J's most important features. Never use string concatenation in log statements. Use {} placeholders instead.
Why? With string concatenation, Java builds the string every time, even if the log level is disabled. With placeholders, SLF4J only builds the string if the message will actually be logged.
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
public class ParameterizedLogging {
private static final Logger log = LoggerFactory.getLogger(ParameterizedLogging.class);
public void demonstrate(Order order) {
// BAD: String concatenation -- always builds the string, even if DEBUG is off
log.debug("Processing order " + order.getId() + " for user " + order.getUserId()
+ " with " + order.getItems().size() + " items");
// GOOD: Parameterized logging -- string built ONLY if DEBUG is enabled
log.debug("Processing order {} for user {} with {} items",
order.getId(), order.getUserId(), order.getItems().size());
// Multiple placeholders -- they are filled in order
log.info("User {} placed order {} with total ${}", "alice", "ORD-123", 99.99);
// Output: User alice placed order ORD-123 with total $99.99
// Logging exceptions -- exception is ALWAYS the last argument
try {
processPayment(order);
} catch (Exception e) {
// The exception goes last -- SLF4J recognizes it and prints the full stack trace
log.error("Payment failed for order {}", order.getId(), e);
// Output:
// 2026-02-28 10:30:00.123 [main] ERROR ParameterizedLogging - Payment failed for order ORD-123
// java.lang.RuntimeException: Insufficient funds
// at ParameterizedLogging.processPayment(ParameterizedLogging.java:35)
// at ParameterizedLogging.demonstrate(ParameterizedLogging.java:22)
// ...
}
}
}
When logging exceptions, always pass the exception object as the last argument. SLF4J will automatically print the full stack trace. This is the single most important logging pattern to get right.
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
public class ExceptionLogging {
private static final Logger log = LoggerFactory.getLogger(ExceptionLogging.class);
public void demonstrateExceptionLogging() {
try {
riskyOperation();
} catch (Exception e) {
// BAD: Loses the stack trace entirely
log.error("Something failed");
// BAD: Only logs the exception message, no stack trace
log.error("Something failed: " + e.getMessage());
// BAD: Converts stack trace to string manually -- ugly and loses structure
log.error("Something failed: " + e.toString());
// GOOD: Pass exception as the last argument -- full stack trace is printed
log.error("Something failed", e);
// GOOD: With context AND exception -- placeholders first, exception last
log.error("Failed to process order {} for user {}", orderId, userId, e);
// SLF4J knows the last argument is an exception because {} count (2) < argument count (3)
}
}
}
Log4j2 is the modern successor to Log4j 1.x, built from the ground up by Apache. It is a completely different codebase from Log4j 1.x.
Critical Warning: Log4j 1.x (versions 1.2.x) reached end of life in 2015 and has the critical Log4Shell vulnerability (CVE-2021-44228), one of the most severe security vulnerabilities in Java history. If you are using Log4j 1.x, you must migrate immediately. Log4j2 (versions 2.x) is the safe, modern version.
<!-- Maven: add to pom.xml -->
<dependency>
    <groupId>org.slf4j</groupId>
    <artifactId>slf4j-api</artifactId>
    <version>2.0.16</version>
</dependency>
<dependency>
    <groupId>org.apache.logging.log4j</groupId>
    <artifactId>log4j-slf4j2-impl</artifactId>
    <version>2.24.3</version>
</dependency>
<dependency>
    <groupId>org.apache.logging.log4j</groupId>
    <artifactId>log4j-core</artifactId>
    <version>2.24.3</version>
</dependency>
Place this file in src/main/resources/log4j2.xml:
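A minimal console-only starting point might look like this (the pattern and root level are illustrative choices, not mandates):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<Configuration status="WARN">
    <Appenders>
        <Console name="Console" target="SYSTEM_OUT">
            <PatternLayout pattern="%d{yyyy-MM-dd HH:mm:ss.SSS} [%t] %-5level %logger{36} - %msg%n"/>
        </Console>
    </Appenders>
    <Loggers>
        <Root level="info">
            <AppenderRef ref="Console"/>
        </Root>
    </Loggers>
</Configuration>
```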
Log4j2's standout feature is its async logging capability using the LMAX Disruptor library. This can dramatically improve performance in high-throughput applications by logging on a separate thread.
<!-- Maven: the Disruptor is required for Log4j2 async loggers -->
<dependency>
    <groupId>com.lmax</groupId>
    <artifactId>disruptor</artifactId>
    <version>4.0.0</version>
</dependency>
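With the Disruptor on the classpath, all loggers can be switched to asynchronous mode with a single JVM system property (property name per the Log4j2 manual; my-app.jar is a placeholder):

```shell
# Route ALL loggers through the async context selector
java -Dlog4j2.contextSelector=org.apache.logging.log4j.core.async.AsyncLoggerContextSelector \
     -jar my-app.jar
```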
| Feature | Logback | Log4j2 |
|---|---|---|
| Spring Boot default | Yes | No (requires exclusion + config) |
| Async performance | Good (AsyncAppender) | Excellent (LMAX Disruptor) |
| Garbage-free logging | No | Yes (reduces GC pauses) |
| Lambda support | No | Yes (lazy message construction) |
| Plugin architecture | Limited | Extensive |
| Community adoption | Higher (Spring ecosystem) | Strong (Apache ecosystem) |
| Configuration reload | Yes | Yes (automatic) |
Bottom line: Use Logback for most applications, especially with Spring Boot. Choose Log4j2 if you need maximum throughput with async logging (e.g., high-frequency trading, real-time data pipelines).
Logback is configured via an XML file named logback.xml (or logback-spring.xml in Spring Boot) placed in src/main/resources/. The configuration has three main components: appenders (where log output goes), encoders with their patterns (how each message is formatted), and loggers (which levels apply to which packages).
<configuration>

    <!-- Console appender -->
    <appender name="CONSOLE" class="ch.qos.logback.core.ConsoleAppender">
        <encoder>
            <pattern>%d{yyyy-MM-dd HH:mm:ss.SSS} [%thread] %-5level %logger{36} - %msg%n</pattern>
        </encoder>
    </appender>

    <!-- Simple file appender -->
    <appender name="FILE" class="ch.qos.logback.core.FileAppender">
        <file>logs/application.log</file>
        <encoder>
            <pattern>%d{yyyy-MM-dd HH:mm:ss.SSS} [%thread] %-5level %logger{36} - %msg%n</pattern>
        </encoder>
    </appender>

    <!-- Rolling file appender: roll daily or at 10MB, keep 30 days, cap total at 1GB -->
    <appender name="ROLLING" class="ch.qos.logback.core.rolling.RollingFileAppender">
        <file>logs/application.log</file>
        <rollingPolicy class="ch.qos.logback.core.rolling.SizeAndTimeBasedRollingPolicy">
            <fileNamePattern>logs/application.%d{yyyy-MM-dd}.%i.log.gz</fileNamePattern>
            <maxFileSize>10MB</maxFileSize>
            <maxHistory>30</maxHistory>
            <totalSizeCap>1GB</totalSizeCap>
        </rollingPolicy>
        <encoder>
            <pattern>%d{yyyy-MM-dd HH:mm:ss.SSS} [%thread] %-5level %logger{36} - %msg%n</pattern>
        </encoder>
    </appender>

    <root level="INFO">
        <appender-ref ref="CONSOLE"/>
        <appender-ref ref="ROLLING"/>
    </root>

</configuration>
The pattern string controls how each log message is formatted. Here are the most common conversion specifiers:
| Specifier | Output | Example |
|---|---|---|
| %d{pattern} | Date/time | %d{yyyy-MM-dd HH:mm:ss.SSS} = 2026-02-28 10:30:00.123 |
| %level or %-5level | Log level (padded to 5 chars) | INFO, DEBUG, ERROR |
| %logger{n} | Logger name (abbreviated to n chars) | %logger{36} = c.e.service.UserService |
| %msg | The log message | User logged in successfully |
| %n | Newline (platform-specific) | \n or \r\n |
| %thread | Thread name | main, http-nio-8080-exec-1 |
| %class | Full class name (slow) | com.example.service.UserService |
| %method | Method name (slow) | findUserById |
| %line | Line number (slow) | 42 |
| %X{key} | MDC value | %X{requestId} = abc-123 |
| %highlight() | ANSI color by level (console only) | ERROR in red, WARN in yellow |
Performance note: %class, %method, and %line are computed by generating a stack trace, which is expensive. Avoid them in production patterns.
// Development pattern (human-readable with colors)
%d{HH:mm:ss.SSS} %highlight(%-5level) %cyan(%logger{36}) - %msg%n
// Output: 10:30:00.123 INFO c.e.service.UserService - Order placed
// Production pattern (full detail, no color)
%d{yyyy-MM-dd HH:mm:ss.SSS} [%thread] %-5level %logger{36} - %msg%n
// Output: 2026-02-28 10:30:00.123 [http-nio-8080-exec-1] INFO c.e.service.UserService - Order placed
// Production with MDC (request tracking)
%d{yyyy-MM-dd HH:mm:ss.SSS} [%thread] %-5level %logger{36} [requestId=%X{requestId}] - %msg%n
// Output: 2026-02-28 10:30:00.123 [http-nio-8080-exec-1] INFO c.e.service.UserService [requestId=abc-123-def] - Order placed
// JSON pattern for ELK/Splunk (see Section 12)
{"timestamp":"%d{yyyy-MM-dd'T'HH:mm:ss.SSSZ}","level":"%level","logger":"%logger","thread":"%thread","message":"%msg","requestId":"%X{requestId}"}%n
One of the most powerful configuration features is setting different log levels for different packages. This lets you see detailed logs from your code while keeping framework noise quiet.
<configuration>
    <appender name="CONSOLE" class="ch.qos.logback.core.ConsoleAppender">
        <encoder>
            <pattern>%d{HH:mm:ss.SSS} %-5level %logger{36} - %msg%n</pattern>
        </encoder>
    </appender>

    <!-- Package names below are illustrative: verbose for your code, quiet for frameworks -->
    <logger name="com.myapp" level="DEBUG"/>
    <logger name="org.springframework" level="WARN"/>
    <logger name="org.hibernate" level="WARN"/>

    <root level="INFO">
        <appender-ref ref="CONSOLE"/>
    </root>
</configuration>
The format of your log messages matters more than you might think. In development, you want human-readable output. In production, you often want structured (JSON) output that can be parsed by log aggregation tools like the ELK stack (Elasticsearch, Logstash, Kibana) or Splunk.
| Environment | Pattern | Why |
|---|---|---|
| Development | %d{HH:mm:ss} %-5level %logger{20} - %msg%n | Short, readable, fast to scan |
| Staging | %d{yyyy-MM-dd HH:mm:ss.SSS} [%thread] %-5level %logger{36} - %msg%n | Full detail for debugging issues that match production |
| Production (text) | %d{ISO8601} [%thread] %-5level %logger{36} [%X{requestId}] - %msg%n | ISO timestamps, MDC context, full logger names |
| Production (JSON) | Use Logstash encoder (see below) | Machine-parseable for log aggregation |
For production environments using ELK stack, Splunk, or Datadog, structured JSON logs are essential. Each log line is a valid JSON object that these tools can parse, index, and search.
<!-- Maven: add to pom.xml -->
<dependency>
    <groupId>net.logstash.logback</groupId>
    <artifactId>logstash-logback-encoder</artifactId>
    <version>8.0</version>
</dependency>
<appender name="JSON" class="ch.qos.logback.core.ConsoleAppender">
    <encoder class="net.logstash.logback.encoder.LogstashEncoder">
        <!-- Only these MDC keys are included in the JSON output -->
        <includeMdcKeyName>requestId</includeMdcKeyName>
        <includeMdcKeyName>userId</includeMdcKeyName>
    </encoder>
</appender>
With JSON logging, each log line looks like this:
{"@timestamp":"2026-02-28T10:30:00.123Z","@version":"1","message":"Order ORD-123 placed successfully","logger_name":"com.myapp.service.OrderService","thread_name":"http-nio-8080-exec-1","level":"INFO","requestId":"abc-123-def","userId":"user-42"}
This structured output means you can search for all logs where userId="user-42" or find all ERROR-level messages for a specific requestId -- something that is extremely difficult with plain text logs.
Imagine you are a doctor in a busy emergency room, treating 20 patients simultaneously. Without patient wristbands (IDs), you would have no way to tell which vitals belong to which patient. MDC is the wristband for your application's requests.
MDC (Mapped Diagnostic Context) lets you attach key-value pairs to the current thread. These values are then automatically included in every log message produced by that thread. This is invaluable in multi-threaded web applications where dozens of requests are processed concurrently.
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.slf4j.MDC;
public class MdcExample {
private static final Logger log = LoggerFactory.getLogger(MdcExample.class);
public void handleRequest(String requestId, String userId) {
// Put values into MDC at the start of the request
MDC.put("requestId", requestId);
MDC.put("userId", userId);
try {
log.info("Request received");
processOrder();
sendConfirmation();
log.info("Request completed successfully");
} finally {
// CRITICAL: Always clear MDC when the request is done
// Threads are reused in thread pools -- leftover MDC values leak into other requests!
MDC.clear();
}
}
private void processOrder() {
// This log line automatically includes requestId and userId from MDC
log.info("Processing order");
// Output: 2026-02-28 10:30:00.123 [http-exec-1] INFO MdcExample [requestId=abc-123, userId=user-42] - Processing order
}
private void sendConfirmation() {
log.info("Sending confirmation email");
// Output: 2026-02-28 10:30:00.456 [http-exec-1] INFO MdcExample [requestId=abc-123, userId=user-42] - Sending confirmation email
}
}
In real applications, you set up MDC in a servlet filter or Spring interceptor so that every request automatically gets a unique ID. You never have to manually add MDC in individual controllers or services.
import org.slf4j.MDC;
import jakarta.servlet.*;
import jakarta.servlet.http.HttpServletRequest;
import java.io.IOException;
import java.util.UUID;
public class LoggingFilter implements Filter {
private static final String REQUEST_ID = "requestId";
private static final String USER_ID = "userId";
@Override
public void doFilter(ServletRequest request, ServletResponse response, FilterChain chain)
throws IOException, ServletException {
HttpServletRequest httpRequest = (HttpServletRequest) request;
try {
// Generate or extract request ID
String requestId = httpRequest.getHeader("X-Request-ID");
if (requestId == null || requestId.isBlank()) {
requestId = UUID.randomUUID().toString().substring(0, 8);
}
// Set MDC values
MDC.put(REQUEST_ID, requestId);
// Extract user from security context (if authenticated)
String userId = extractUserId(httpRequest);
if (userId != null) {
MDC.put(USER_ID, userId);
}
// Continue processing the request
chain.doFilter(request, response);
} finally {
// Always clean up to prevent thread pool contamination
MDC.clear();
}
}
private String extractUserId(HttpServletRequest request) {
// In a real app, extract from security context or JWT token
return request.getRemoteUser();
}
}
To display MDC values in your log output, use the %X{key} specifier in your pattern:
%d{yyyy-MM-dd HH:mm:ss.SSS} [%thread] %-5level %logger{36} [req=%X{requestId} user=%X{userId}] - %msg%n
These are the logging practices that separate junior developers from senior developers. Follow these in every Java project.
Always code against the SLF4J API, never a specific implementation. This gives you the freedom to switch between Logback, Log4j2, or any future implementation without touching your code.
// BAD: Coupling to a specific implementation
import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;
private static final Logger log = LogManager.getLogger(MyClass.class);

// BAD: Using java.util.logging directly
import java.util.logging.Logger;
private static final Logger log = Logger.getLogger(MyClass.class.getName());

// GOOD: SLF4J facade -- works with ANY implementation
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
private static final Logger log = LoggerFactory.getLogger(MyClass.class);
This is the single most common logging mistake in Java code reviews. Never concatenate strings in log statements.
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
public class ParameterizedBestPractice {
private static final Logger log = LoggerFactory.getLogger(ParameterizedBestPractice.class);
public void process(Order order) {
// BAD: String concatenation -- always builds the string even if DEBUG is off
log.debug("Order " + order.getId() + " has " + order.getItems().size() + " items totaling $" + order.getTotal());
// This calls order.getId(), order.getItems().size(), and order.getTotal()
// PLUS concatenates 5 strings -- all wasted work if DEBUG is disabled
// GOOD: Parameterized -- only builds string if DEBUG is enabled
log.debug("Order {} has {} items totaling ${}", order.getId(), order.getItems().size(), order.getTotal());
}
}
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
public class AppropriateLevel {
private static final Logger log = LoggerFactory.getLogger(AppropriateLevel.class);
public void processOrder(Order order) {
// TRACE: Very fine-grained, method entry/exit
log.trace("Entering processOrder with order={}", order);
// DEBUG: Technical detail helpful during development
log.debug("Validating order items against inventory");
// INFO: Business event -- this is what operations teams monitor
log.info("Order {} placed by user {} for ${}", order.getId(), order.getUserId(), order.getTotal());
// WARN: Something unusual but recoverable
if (order.getTotal() > 10000) {
log.warn("High-value order {} for ${} -- flagged for review", order.getId(), order.getTotal());
}
// ERROR: Something failed -- needs human attention
try {
chargePayment(order);
} catch (PaymentException e) {
log.error("Payment failed for order {} with amount ${}", order.getId(), order.getTotal(), e);
}
}
}
A log message without context is like a clue without a case number. Always include the relevant IDs and values that will help you investigate.
// BAD: No context -- useless for debugging
log.error("Payment failed");
log.info("User logged in");
log.warn("Retry attempt");
// GOOD: Context-rich -- you can trace exactly what happened
log.error("Payment failed for order={} user={} amount=${} gateway={}", orderId, userId, amount, gateway);
log.info("User {} logged in from IP {} using {}", userId, ipAddress, userAgent);
log.warn("Retry attempt {}/{} for order={} after {}ms delay", attempt, maxRetries, orderId, delay);
While parameterized logging avoids string concatenation overhead, it does not avoid the cost of computing the arguments. If computing an argument is expensive, guard the log statement.
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
public class ExpensiveLogging {
private static final Logger log = LoggerFactory.getLogger(ExpensiveLogging.class);
public void processLargeDataSet(List<Record> records) {
// BAD: computeStats() is called EVERY TIME, even when DEBUG is off
log.debug("Dataset statistics: {}", computeStats(records));
// computeStats() might iterate over millions of records
// GOOD: Guard expensive computation
if (log.isDebugEnabled()) {
log.debug("Dataset statistics: {}", computeStats(records));
}
// ALSO GOOD for simple arguments -- no guard needed
log.debug("Processing {} records", records.size());
// records.size() is O(1) and trivially cheap
}
private String computeStats(List<Record> records) {
// Imagine this iterates the entire list, computes averages, etc.
return "min=1, max=100, avg=42.5, stddev=12.3";
}
}
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
public class SensitiveDataLogging {
private static final Logger log = LoggerFactory.getLogger(SensitiveDataLogging.class);
public void authenticateUser(String username, String password) {
// BAD: NEVER log passwords
log.info("Login attempt: user={}, password={}", username, password);
// GOOD: Log the event without sensitive data
log.info("Login attempt for user={}", username);
}
public void processPayment(String creditCardNumber, double amount) {
// BAD: NEVER log full credit card numbers
log.info("Charging card {} for ${}", creditCardNumber, amount);
// GOOD: Mask the sensitive data
String masked = maskCreditCard(creditCardNumber);
log.info("Charging card {} for ${}", masked, amount);
// Output: Charging card ****-****-****-4242 for $99.99
}
private String maskCreditCard(String number) {
if (number == null || number.length() < 4) return "****";
return "****-****-****-" + number.substring(number.length() - 4);
}
}
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
public class LoopLogging {
private static final Logger log = LoggerFactory.getLogger(LoopLogging.class);
public void processRecords(List<Record> records) {
// BAD: Logging inside a loop with 1 million records = 1 million log lines
for (Record record : records) {
log.debug("Processing record: {}", record.getId());
process(record);
}
// GOOD: Log summary information
log.info("Starting to process {} records", records.size());
int successCount = 0;
int failCount = 0;
for (Record record : records) {
try {
process(record);
successCount++;
} catch (Exception e) {
failCount++;
// Only log individual failures -- these are exceptional
log.warn("Failed to process record {}: {}", record.getId(), e.getMessage());
}
}
log.info("Completed processing: {} succeeded, {} failed out of {} total",
successCount, failCount, records.size());
}
}
| Practice | Do | Do Not |
|---|---|---|
| API | Use SLF4J facade | Use implementation-specific API (JUL, Log4j directly) |
| Parameters | `log.info("User {}", userId)` | `log.info("User " + userId)` |
| Exceptions | `log.error("Msg", exception)` | `log.error("Msg: " + e.getMessage())` |
| Levels | INFO for business events, DEBUG for technical details | Everything at INFO or everything at DEBUG |
| Context | Include IDs, amounts, counts | Vague messages like "Error occurred" |
| MDC | Set requestId/userId in filter | Manually add IDs to every message |
| Sensitive data | Mask or omit | Log passwords, credit cards, tokens |
| Loops | Log summary before/after | Log every iteration |
| Guards | `if (log.isDebugEnabled())` for expensive computation | Call expensive methods as log arguments |
| Logger declaration | `private static final Logger` | Creating a new Logger per method call |
Every experienced Java developer has made these mistakes. Recognizing them in code reviews will make you a better developer.
// MISTAKE: System.out.println scattered through production code
public class OrderService {
public void placeOrder(Order order) {
System.out.println("Placing order: " + order); // No level, no timestamp, no thread
System.out.println("Validating..."); // Cannot turn off without deleting
System.out.println("Done!"); // Goes to stdout only
}
}
// FIX: Use a proper logger
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
public class OrderService {
private static final Logger log = LoggerFactory.getLogger(OrderService.class);
public void placeOrder(Order order) {
log.info("Placing order {}", order.getId());
log.debug("Validating order items");
log.info("Order {} placed successfully", order.getId());
}
}
// MISTAKE: String concatenation is evaluated even when the level is disabled
log.debug("User " + user.getName() + " has " + user.getOrders().size() + " orders"
+ " totaling $" + calculateTotal(user.getOrders()));
// If DEBUG is off, Java still:
// 1. Calls user.getName()
// 2. Calls user.getOrders().size()
// 3. Calls calculateTotal() -- potentially expensive!
// 4. Concatenates 5 strings
// 5. Throws the result away
// FIX: Use parameterized logging
log.debug("User {} has {} orders totaling ${}",
user.getName(), user.getOrders().size(), calculateTotal(user.getOrders()));
// With parameterized logging, if DEBUG is off, SLF4J skips building the string.
// NOTE: The arguments are still evaluated. For expensive arguments, use isDebugEnabled() guard.
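The evaluation-order point can be demonstrated without any logging framework. The following self-contained sketch uses a hypothetical StubLogger (an illustration stand-in, not SLF4J) and a counter to show that an unguarded argument expression runs even when DEBUG is off, while an isDebugEnabled()-style guard skips it entirely:

```java
import java.util.Arrays;
import java.util.concurrent.atomic.AtomicInteger;

class LazyArgDemo {
    // Hypothetical stub logger -- mimics a real logger with DEBUG disabled
    static class StubLogger {
        boolean isDebugEnabled() { return false; }
        void debug(String msg, Object... args) {
            if (isDebugEnabled()) System.out.println(msg + " " + Arrays.toString(args));
        }
    }

    static final AtomicInteger STATS_CALLS = new AtomicInteger();

    // Stands in for an expensive computation (imagine it scans millions of records)
    static String expensiveStats() {
        STATS_CALLS.incrementAndGet();
        return "min=1, max=100, avg=42.5";
    }

    public static void main(String[] args) {
        StubLogger log = new StubLogger();

        // Unguarded: the argument expression is evaluated before debug() is even called
        log.debug("Dataset statistics: {}", expensiveStats());
        System.out.println("calls after unguarded debug: " + STATS_CALLS.get()); // prints 1

        // Guarded: the argument is never evaluated when DEBUG is off
        if (log.isDebugEnabled()) {
            log.debug("Dataset statistics: {}", expensiveStats());
        }
        System.out.println("calls after guarded debug: " + STATS_CALLS.get()); // still 1
    }
}
```

The counter ends at 1, not 2: Java evaluates method arguments eagerly, so only the explicit guard prevents the expensive call.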
try {
connectToDatabase();
} catch (SQLException e) {
// MISTAKE 1: Swallowing the exception entirely
// (empty catch block -- the worst possible thing)
// MISTAKE 2: Only logging the message, losing the stack trace
log.error("Database error: " + e.getMessage());
// Output: Database error: Connection refused
// WHERE did it fail? Which line? What was the root cause? All lost.
// MISTAKE 3: Using printStackTrace() instead of logging
e.printStackTrace();
// This goes to System.err, bypassing the logging framework entirely.
// No timestamp, no level, no file output, no MDC.
// CORRECT: Pass the exception as the last argument
log.error("Failed to connect to database", e);
// Output includes the full stack trace:
// 2026-02-28 10:30:00.123 [main] ERROR DatabaseService - Failed to connect to database
// java.sql.SQLException: Connection refused
// at com.mysql.cj.jdbc.ConnectionImpl.createNewIO(ConnectionImpl.java:839)
// at com.mysql.cj.jdbc.ConnectionImpl.<init>(ConnectionImpl.java:453)
// at DatabaseService.connectToDatabase(DatabaseService.java:42)
// ...
// Caused by: java.net.ConnectException: Connection refused (Connection refused)
// at java.base/java.net.PlainSocketImpl.socketConnect(Native Method)
// ...
}
// MISTAKE: Logging too much -- "log diarrhea"
public double calculateTax(double amount, String state) {
log.info("calculateTax called"); // Noise
log.info("amount = " + amount); // Noise + concatenation
log.info("state = " + state); // Noise + concatenation
double rate = getTaxRate(state);
log.info("tax rate = " + rate); // Noise
double tax = amount * rate;
log.info("tax = " + tax); // Noise
log.info("returning tax"); // Noise
return tax;
}
// This method generates 6 log lines for a simple calculation.
// Multiply by 1000 requests/second and you have 6000 lines/second of noise.
// MISTAKE: Logging too little
public double calculateTax(double amount, String state) {
return amount * getTaxRate(state);
// No logging at all. If tax calculations are wrong, where do you start?
}
// CORRECT: Log meaningful events at the right level
public double calculateTax(double amount, String state) {
log.debug("Calculating tax for amount={} state={}", amount, state);
double rate = getTaxRate(state);
double tax = amount * rate;
log.debug("Tax calculated: amount={} state={} rate={} tax={}", amount, state, rate, tax);
return tax;
}
// Two DEBUG lines that can be turned off in production but enabled when needed.
// MISTAKE: Logging user data verbatim
public void registerUser(UserRegistration reg) {
log.info("Registering user: {}", reg);
// If UserRegistration.toString() includes password, SSN, or credit card... game over.
// Log files are often stored in plain text, backed up to multiple servers,
// and accessed by many team members.
}
// CORRECT: Log only safe, relevant fields
public void registerUser(UserRegistration reg) {
log.info("Registering user: email={}", reg.getEmail());
// Or override toString() to exclude sensitive fields:
// @Override public String toString() {
// return "UserRegistration{email='" + email + "', name='" + name + "'}";
// // password, ssn, creditCard intentionally excluded
// }
}
// MISTAKE: Still using Log4j 1.x (versions 1.2.x)
import org.apache.log4j.Logger; // <-- This is Log4j 1.x -- end-of-life and unpatched!
// Log4j 1.x reached End of Life in August 2015 and no longer receives security fixes.
// It has serious known vulnerabilities of its own (e.g. CVE-2019-17571, CVE-2021-4104).
// Its successor, Log4j 2.x (before 2.17.x), was hit by CVE-2021-44228 (Log4Shell),
// a Remote Code Execution flaw rated 10.0 out of 10.0 on the CVSS scale.
// Either way: abandoned or outdated logging dependencies are an attack surface.
// FIX: Migrate to SLF4J + Logback (or an up-to-date Log4j2)
// Step 1: Remove the log4j 1.x dependency
// Step 2: Add SLF4J + Logback dependencies (see Section 5)
// Step 3: Replace imports:
import org.slf4j.Logger; // <-- SLF4J facade
import org.slf4j.LoggerFactory;
// Step 4: Replace logger creation:
// OLD: private static final Logger log = Logger.getLogger(MyClass.class);
// NEW: private static final Logger log = LoggerFactory.getLogger(MyClass.class);
// Step 5: Replace log4j.properties with logback.xml (see Section 7)
// Step 6: The logging method calls (log.info, log.error, etc.) are almost identical
Production logging has different requirements than development logging. In production, your logs are the primary tool for understanding what is happening across hundreds of servers processing thousands of requests per second.
In production, logs should be machine-parseable. Plain text logs like 2026-02-28 10:30 INFO OrderService - Order placed are hard for log aggregation tools to parse reliably. JSON format solves this.
With JSON logging, tools like Elasticsearch, Splunk, Datadog, and Grafana Loki can index every field, letting you filter and aggregate by level, logger, requestId, or any other field instead of grepping raw text.
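As a sketch of what such structured output looks like (field names follow the defaults of the commonly used logstash-logback-encoder; all values here are illustrative):

```json
{
  "@timestamp": "2026-02-28T10:30:00.123Z",
  "level": "INFO",
  "thread_name": "http-nio-8080-exec-1",
  "logger_name": "com.example.orders.OrderService",
  "message": "Order ORD-001 placed by user alice for $77.73",
  "requestId": "a1b2c3d4",
  "userId": "alice"
}
```

MDC entries (requestId, userId) appear as top-level fields, which is exactly what makes per-request filtering in a log aggregator trivial.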
The ELK stack (Elasticsearch, Logstash, Kibana) is the most popular open-source log aggregation platform:
| Component | Role | Description |
|---|---|---|
| Elasticsearch | Store and search | Distributed search engine that indexes log data for fast queries |
| Logstash | Collect and transform | Ingests logs from multiple sources, parses them, and sends to Elasticsearch |
| Kibana | Visualize | Web UI for searching logs, building dashboards, and setting up alerts |
Without log rotation, log files grow until they fill the disk and your application crashes. Always configure rolling policies:
<appender name="FILE" class="ch.qos.logback.core.rolling.RollingFileAppender">
    <file>logs/application.log</file>
    <rollingPolicy class="ch.qos.logback.core.rolling.SizeAndTimeBasedRollingPolicy">
        <fileNamePattern>logs/application.%d{yyyy-MM-dd}.%i.log.gz</fileNamePattern>
        <maxFileSize>50MB</maxFileSize>
        <maxHistory>90</maxHistory>
        <totalSizeCap>5GB</totalSizeCap>
    </rollingPolicy>
    <encoder>
        <pattern>%d{yyyy-MM-dd HH:mm:ss.SSS} [%thread] %-5level %logger{36} [%X{requestId}] - %msg%n</pattern>
    </encoder>
</appender>
| Concern | Solution |
|---|---|
| High-throughput logging blocks threads | Use async appenders (Logback's AsyncAppender or Log4j2's AsyncLogger) |
| Disk I/O bottleneck | Write to a local buffer, ship to remote collector (Logstash, Fluentd) |
| Large stack traces | Logback automatically shortens repeated stack frames with ... 42 common frames omitted |
| GC pressure from log string building | Use parameterized logging ({}), consider Log4j2's garbage-free mode |
| Log file size | Use GZIP compression on rolled files (.log.gz) |
| Too many DEBUG/TRACE in production | Set root level to INFO, use DEBUG only for your packages when investigating |
<appender name="FILE" class="ch.qos.logback.core.rolling.RollingFileAppender">
    <file>logs/application.log</file>
    <rollingPolicy class="ch.qos.logback.core.rolling.TimeBasedRollingPolicy">
        <fileNamePattern>logs/application.%d{yyyy-MM-dd}.log.gz</fileNamePattern>
        <maxHistory>30</maxHistory>
    </rollingPolicy>
    <encoder>
        <pattern>%d{yyyy-MM-dd HH:mm:ss.SSS} [%thread] %-5level %logger{36} - %msg%n</pattern>
    </encoder>
</appender>
<appender name="ASYNC" class="ch.qos.logback.classic.AsyncAppender">
    <appender-ref ref="FILE"/>
    <queueSize>1024</queueSize>
    <discardingThreshold>0</discardingThreshold>
    <neverBlock>false</neverBlock>
</appender>
Let us tie everything together with a realistic, production-quality example. This OrderService demonstrates all the logging concepts we have covered: appropriate log levels, parameterized messages, exception handling, MDC for request tracking, and best practices throughout.
<dependency>
    <groupId>org.slf4j</groupId>
    <artifactId>slf4j-api</artifactId>
    <version>2.0.16</version>
</dependency>
<dependency>
    <groupId>ch.qos.logback</groupId>
    <artifactId>logback-classic</artifactId>
    <version>1.5.15</version>
</dependency>
<configuration>
    <appender name="CONSOLE" class="ch.qos.logback.core.ConsoleAppender">
        <encoder>
            <pattern>%d{yyyy-MM-dd HH:mm:ss.SSS} [%thread] %-5level %logger{30} [orderId=%X{orderId} user=%X{userId}] - %msg%n</pattern>
        </encoder>
    </appender>
    <appender name="FILE" class="ch.qos.logback.core.rolling.RollingFileAppender">
        <file>logs/orders.log</file>
        <rollingPolicy class="ch.qos.logback.core.rolling.SizeAndTimeBasedRollingPolicy">
            <fileNamePattern>logs/orders.%d{yyyy-MM-dd}.%i.log.gz</fileNamePattern>
            <maxFileSize>10MB</maxFileSize>
            <maxHistory>30</maxHistory>
            <totalSizeCap>1GB</totalSizeCap>
        </rollingPolicy>
        <encoder>
            <pattern>%d{yyyy-MM-dd HH:mm:ss.SSS} [%thread] %-5level %logger{36} [orderId=%X{orderId} user=%X{userId}] - %msg%n</pattern>
        </encoder>
    </appender>
    <root level="DEBUG">
        <appender-ref ref="CONSOLE"/>
        <appender-ref ref="FILE"/>
    </root>
</configuration>
package com.example.orders;
import java.util.List;
public class Order {
private final String id;
private final String userId;
private final List<OrderItem> items;
private double total;
private OrderStatus status;
public Order(String id, String userId, List<OrderItem> items) {
this.id = id;
this.userId = userId;
this.items = items;
this.total = items.stream().mapToDouble(OrderItem::getSubtotal).sum();
this.status = OrderStatus.PENDING;
}
public String getId() { return id; }
public String getUserId() { return userId; }
public List<OrderItem> getItems() { return items; }
public double getTotal() { return total; }
public OrderStatus getStatus() { return status; }
public void setStatus(OrderStatus status) { this.status = status; }
public void setTotal(double total) { this.total = total; }
// toString excludes any sensitive user data
@Override
public String toString() {
return "Order{id='" + id + "', items=" + items.size() + ", total=" + total + ", status=" + status + "}";
}
}
enum OrderStatus { PENDING, VALIDATED, PAID, SHIPPED, CANCELLED }
class OrderItem {
private final String productName;
private final int quantity;
private final double price;
public OrderItem(String productName, int quantity, double price) {
this.productName = productName;
this.quantity = quantity;
this.price = price;
}
public String getProductName() { return productName; }
public int getQuantity() { return quantity; }
public double getPrice() { return price; }
public double getSubtotal() { return quantity * price; }
}
package com.example.orders;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.slf4j.MDC;
import java.util.List;
public class OrderService {
private static final Logger log = LoggerFactory.getLogger(OrderService.class);
private static final double HIGH_VALUE_THRESHOLD = 1000.0;
private static final double TAX_RATE = 0.08;
private static final double DISCOUNT_THRESHOLD = 500.0;
private static final double DISCOUNT_RATE = 0.10;
/**
* Process an order end-to-end with proper logging at every stage.
*/
public void processOrder(Order order) {
// Set MDC context for this order -- all subsequent log lines include these values
MDC.put("orderId", order.getId());
MDC.put("userId", order.getUserId());
long startTime = System.currentTimeMillis();
try {
// INFO: Business event -- order processing started
log.info("Order processing started: {} items, total=${}",
order.getItems().size(), order.getTotal());
// Step 1: Validate
validateOrder(order);
// Step 2: Apply discounts
applyDiscounts(order);
// Step 3: Calculate tax
calculateTax(order);
// Step 4: Process payment
processPayment(order);
// Step 5: Ship
shipOrder(order);
long elapsed = System.currentTimeMillis() - startTime;
// INFO: Business event -- order completed with timing
log.info("Order processing completed successfully in {}ms, finalTotal=${}",
elapsed, order.getTotal());
} catch (Exception e) {
long elapsed = System.currentTimeMillis() - startTime;
order.setStatus(OrderStatus.CANCELLED);
// ERROR: Something went wrong -- include the exception for stack trace
log.error("Order processing failed after {}ms", elapsed, e);
} finally {
// CRITICAL: Always clear MDC to prevent thread contamination
MDC.clear();
}
}
private void validateOrder(Order order) {
log.debug("Validating order");
if (order.getItems() == null || order.getItems().isEmpty()) {
// ERROR: Invalid input -- this should not happen if upstream validation works
log.error("Order has no items");
throw new IllegalArgumentException("Order must have at least one item");
}
for (OrderItem item : order.getItems()) {
if (item.getQuantity() <= 0) {
log.error("Invalid quantity {} for product '{}'",
item.getQuantity(), item.getProductName());
throw new IllegalArgumentException("Quantity must be positive for: " + item.getProductName());
}
if (item.getPrice() < 0) {
log.error("Negative price ${} for product '{}'",
item.getPrice(), item.getProductName());
throw new IllegalArgumentException("Price cannot be negative for: " + item.getProductName());
}
}
order.setStatus(OrderStatus.VALIDATED);
// DEBUG: Technical detail about validation result
log.debug("Order validated: {} items passed all checks", order.getItems().size());
}
private void applyDiscounts(Order order) {
double originalTotal = order.getTotal();
log.debug("Checking discounts for total=${}", originalTotal);
if (originalTotal >= DISCOUNT_THRESHOLD) {
double discount = originalTotal * DISCOUNT_RATE;
order.setTotal(originalTotal - discount);
// INFO: Business event -- discount applied (operations wants to track this)
log.info("Discount applied: {}% off ${} = -${}, newTotal=${}",
(int)(DISCOUNT_RATE * 100), originalTotal, discount, order.getTotal());
} else {
log.debug("No discount applied: total ${} below threshold ${}",
originalTotal, DISCOUNT_THRESHOLD);
}
}
private void calculateTax(Order order) {
double beforeTax = order.getTotal();
double tax = beforeTax * TAX_RATE;
order.setTotal(beforeTax + tax);
// DEBUG: Technical calculation detail
log.debug("Tax calculated: ${} * {} = ${}, newTotal=${}",
beforeTax, TAX_RATE, tax, order.getTotal());
}
private void processPayment(Order order) {
// INFO: Business event -- payment attempt
log.info("Processing payment of ${}", order.getTotal());
// WARN: Flag high-value orders
if (order.getTotal() > HIGH_VALUE_THRESHOLD) {
log.warn("High-value order detected: ${} exceeds threshold ${}",
order.getTotal(), HIGH_VALUE_THRESHOLD);
}
// Simulate payment processing
try {
simulatePaymentGateway(order);
order.setStatus(OrderStatus.PAID);
log.info("Payment processed successfully for ${}", order.getTotal());
} catch (RuntimeException e) {
// ERROR: Payment failed -- include the exception
log.error("Payment gateway rejected transaction for ${}", order.getTotal(), e);
throw e;
}
}
private void simulatePaymentGateway(Order order) {
// Simulate: orders with total over $5000 fail (for demo purposes)
if (order.getTotal() > 5000) {
throw new RuntimeException("Payment declined: exceeds single transaction limit");
}
log.debug("Payment gateway returned: APPROVED");
}
private void shipOrder(Order order) {
log.info("Initiating shipment");
order.setStatus(OrderStatus.SHIPPED);
log.info("Order shipped to user {}", order.getUserId());
}
}
package com.example.orders;
import java.util.List;
public class OrderApp {
public static void main(String[] args) {
OrderService service = new OrderService();
// Scenario 1: Normal order
System.out.println("=== Scenario 1: Normal Order ===");
Order normalOrder = new Order("ORD-001", "alice",
List.of(new OrderItem("Laptop Stand", 1, 45.99),
new OrderItem("USB-C Cable", 2, 12.99)));
service.processOrder(normalOrder);
System.out.println();
// Scenario 2: High-value order with discount
System.out.println("=== Scenario 2: High-Value Order ===");
Order highValue = new Order("ORD-002", "bob",
List.of(new OrderItem("MacBook Pro", 1, 2499.00),
new OrderItem("AppleCare+", 1, 399.00)));
service.processOrder(highValue);
System.out.println();
// Scenario 3: Order that exceeds payment limit (will fail)
System.out.println("=== Scenario 3: Failed Payment ===");
Order tooExpensive = new Order("ORD-003", "charlie",
List.of(new OrderItem("Server Rack", 3, 2500.00)));
service.processOrder(tooExpensive);
System.out.println();
// Scenario 4: Invalid order (empty items)
System.out.println("=== Scenario 4: Invalid Order ===");
Order emptyOrder = new Order("ORD-004", "dave", List.of());
service.processOrder(emptyOrder);
}
}
=== Scenario 1: Normal Order ===
2026-02-28 10:30:00.001 [main] INFO c.e.orders.OrderService [orderId=ORD-001 user=alice] - Order processing started: 2 items, total=$71.97
2026-02-28 10:30:00.002 [main] DEBUG c.e.orders.OrderService [orderId=ORD-001 user=alice] - Validating order
2026-02-28 10:30:00.002 [main] DEBUG c.e.orders.OrderService [orderId=ORD-001 user=alice] - Order validated: 2 items passed all checks
2026-02-28 10:30:00.002 [main] DEBUG c.e.orders.OrderService [orderId=ORD-001 user=alice] - Checking discounts for total=$71.97
2026-02-28 10:30:00.002 [main] DEBUG c.e.orders.OrderService [orderId=ORD-001 user=alice] - No discount applied: total $71.97 below threshold $500.0
2026-02-28 10:30:00.003 [main] DEBUG c.e.orders.OrderService [orderId=ORD-001 user=alice] - Tax calculated: $71.97 * 0.08 = $5.7576, newTotal=$77.7276
2026-02-28 10:30:00.003 [main] INFO c.e.orders.OrderService [orderId=ORD-001 user=alice] - Processing payment of $77.7276
2026-02-28 10:30:00.003 [main] DEBUG c.e.orders.OrderService [orderId=ORD-001 user=alice] - Payment gateway returned: APPROVED
2026-02-28 10:30:00.003 [main] INFO c.e.orders.OrderService [orderId=ORD-001 user=alice] - Payment processed successfully for $77.7276
2026-02-28 10:30:00.003 [main] INFO c.e.orders.OrderService [orderId=ORD-001 user=alice] - Initiating shipment
2026-02-28 10:30:00.003 [main] INFO c.e.orders.OrderService [orderId=ORD-001 user=alice] - Order shipped to user alice
2026-02-28 10:30:00.004 [main] INFO c.e.orders.OrderService [orderId=ORD-001 user=alice] - Order processing completed successfully in 3ms, finalTotal=$77.7276
=== Scenario 2: High-Value Order ===
2026-02-28 10:30:00.005 [main] INFO c.e.orders.OrderService [orderId=ORD-002 user=bob] - Order processing started: 2 items, total=$2898.0
2026-02-28 10:30:00.005 [main] DEBUG c.e.orders.OrderService [orderId=ORD-002 user=bob] - Validating order
2026-02-28 10:30:00.005 [main] DEBUG c.e.orders.OrderService [orderId=ORD-002 user=bob] - Order validated: 2 items passed all checks
2026-02-28 10:30:00.005 [main] DEBUG c.e.orders.OrderService [orderId=ORD-002 user=bob] - Checking discounts for total=$2898.0
2026-02-28 10:30:00.005 [main] INFO c.e.orders.OrderService [orderId=ORD-002 user=bob] - Discount applied: 10% off $2898.0 = -$289.8, newTotal=$2608.2
2026-02-28 10:30:00.006 [main] DEBUG c.e.orders.OrderService [orderId=ORD-002 user=bob] - Tax calculated: $2608.2 * 0.08 = $208.656, newTotal=$2816.856
2026-02-28 10:30:00.006 [main] INFO c.e.orders.OrderService [orderId=ORD-002 user=bob] - Processing payment of $2816.856
2026-02-28 10:30:00.006 [main] WARN c.e.orders.OrderService [orderId=ORD-002 user=bob] - High-value order detected: $2816.856 exceeds threshold $1000.0
2026-02-28 10:30:00.006 [main] DEBUG c.e.orders.OrderService [orderId=ORD-002 user=bob] - Payment gateway returned: APPROVED
2026-02-28 10:30:00.006 [main] INFO c.e.orders.OrderService [orderId=ORD-002 user=bob] - Payment processed successfully for $2816.856
2026-02-28 10:30:00.006 [main] INFO c.e.orders.OrderService [orderId=ORD-002 user=bob] - Initiating shipment
2026-02-28 10:30:00.006 [main] INFO c.e.orders.OrderService [orderId=ORD-002 user=bob] - Order shipped to user bob
2026-02-28 10:30:00.007 [main] INFO c.e.orders.OrderService [orderId=ORD-002 user=bob] - Order processing completed successfully in 2ms, finalTotal=$2816.856
=== Scenario 3: Failed Payment ===
2026-02-28 10:30:00.008 [main] INFO c.e.orders.OrderService [orderId=ORD-003 user=charlie] - Order processing started: 3 items, total=$7500.0
2026-02-28 10:30:00.008 [main] DEBUG c.e.orders.OrderService [orderId=ORD-003 user=charlie] - Validating order
2026-02-28 10:30:00.008 [main] DEBUG c.e.orders.OrderService [orderId=ORD-003 user=charlie] - Order validated: 3 items passed all checks
2026-02-28 10:30:00.008 [main] DEBUG c.e.orders.OrderService [orderId=ORD-003 user=charlie] - Checking discounts for total=$7500.0
2026-02-28 10:30:00.008 [main] INFO c.e.orders.OrderService [orderId=ORD-003 user=charlie] - Discount applied: 10% off $7500.0 = -$750.0, newTotal=$6750.0
2026-02-28 10:30:00.009 [main] DEBUG c.e.orders.OrderService [orderId=ORD-003 user=charlie] - Tax calculated: $6750.0 * 0.08 = $540.0, newTotal=$7290.0
2026-02-28 10:30:00.009 [main] INFO c.e.orders.OrderService [orderId=ORD-003 user=charlie] - Processing payment of $7290.0
2026-02-28 10:30:00.009 [main] WARN c.e.orders.OrderService [orderId=ORD-003 user=charlie] - High-value order detected: $7290.0 exceeds threshold $1000.0
2026-02-28 10:30:00.009 [main] ERROR c.e.orders.OrderService [orderId=ORD-003 user=charlie] - Payment gateway rejected transaction for $7290.0
java.lang.RuntimeException: Payment declined: exceeds single transaction limit
at com.example.orders.OrderService.simulatePaymentGateway(OrderService.java:112)
...
2026-02-28 10:30:00.010 [main] ERROR c.e.orders.OrderService [orderId=ORD-003 user=charlie] - Order processing failed after 2ms
=== Scenario 4: Invalid Order ===
2026-02-28 10:30:00.011 [main] INFO c.e.orders.OrderService [orderId=ORD-004 user=dave] - Order processing started: 0 items, total=$0.0
2026-02-28 10:30:00.011 [main] DEBUG c.e.orders.OrderService [orderId=ORD-004 user=dave] - Validating order
2026-02-28 10:30:00.011 [main] ERROR c.e.orders.OrderService [orderId=ORD-004 user=dave] - Order has no items
2026-02-28 10:30:00.011 [main] ERROR c.e.orders.OrderService [orderId=ORD-004 user=dave] - Order processing failed after 0ms
java.lang.IllegalArgumentException: Order must have at least one item
at com.example.orders.OrderService.validateOrder(OrderService.java:70)
...
| # | Concept | Where in Code |
|---|---|---|
| 1 | Logger declaration (private static final) | OrderService class field |
| 2 | MDC for request tracking | processOrder() -- MDC.put/MDC.clear |
| 3 | MDC cleanup in finally block | processOrder() -- prevents thread contamination |
| 4 | INFO for business events | "Order processing started", "Payment processed", "Order shipped" |
| 5 | DEBUG for technical details | "Validating order", "Tax calculated", "Payment gateway returned" |
| 6 | WARN for recoverable issues | "High-value order detected" |
| 7 | ERROR with exception | "Payment gateway rejected" -- exception passed as last argument |
| 8 | Parameterized logging ({}) | Every log statement uses {} instead of string concatenation |
| 9 | Context in messages | Order ID, user ID, amounts, item counts included |
| 10 | Performance tracking | Elapsed time measured and logged on completion/failure |
| 11 | No sensitive data logged | toString() excludes user details; no passwords/tokens |
| 12 | Separate logback.xml configuration | Console + rolling file, package-level filtering, MDC in pattern |
| Topic | Key Point |
|---|---|
| Recommended stack | SLF4J (facade) + Logback (implementation) |
| Logger declaration | private static final Logger log = LoggerFactory.getLogger(MyClass.class) |
| Parameterized logging | log.info("User {} placed order {}", userId, orderId) |
| Exception logging | log.error("Something failed for order {}", orderId, exception) -- exception is always the last argument |
| Log levels | TRACE < DEBUG < INFO < WARN < ERROR. Use INFO for business events, DEBUG for technical details. |
| MDC | MDC.put("requestId", id) in filter/interceptor, %X{requestId} in pattern, MDC.clear() in finally |
| Configuration file | logback.xml in src/main/resources |
| Production format | JSON via logstash-logback-encoder for ELK/Splunk/Datadog |
| Log rotation | SizeAndTimeBasedRollingPolicy with maxFileSize, maxHistory, totalSizeCap |
| Async logging | Logback AsyncAppender or Log4j2 AsyncLogger for high throughput |
| Never log | Passwords, credit cards, SSNs, API keys, session tokens |
| Never use | Log4j 1.x (CVE-2021-44228), System.out.println, string concatenation in log calls |
Hasura allows you to define role-based access control rules for each of the models/tables that you use. Access control rules help in restricting querying on a table based on certain conditions.
Roles
Every table/view can have permission rules for users based on their role. By default, there is an admin role that can perform any operation on any table. You can define roles and then create permissions for each of those roles.
Permission rules can also refer to session variables. Session variables are key-value pairs in the JWT custom claims. These session variables are used to define permissions for rows and columns of a table. Permissions are essentially a combination of boolean expressions and column selections that impose constraints on the data being returned or modified.
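For example, a JWT carrying Hasura session variables in its custom claims looks roughly like this (the user values here are illustrative; the claims namespace is Hasura's default):

```json
{
  "sub": "1234567890",
  "name": "Alice",
  "https://hasura.io/jwt/claims": {
    "x-hasura-allowed-roles": ["user", "admin"],
    "x-hasura-default-role": "user",
    "x-hasura-user-id": "alice-42"
  }
}
```

Each `x-hasura-*` key becomes a session variable that permission rules can reference.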
Row level permission
Limit access to a subset of the rows in the table based on this permission. Row-level permissions are essentially boolean expressions that, when evaluated against any row, determine access to it. These permissions are constructed from the values in columns, session variables and static values to build this boolean expression.
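For instance, a common rule restricts users to rows they own by comparing a column against a session variable. Assuming a table with a user_id column (an illustrative schema), the boolean expression is:

```json
{
  "user_id": {
    "_eq": "X-Hasura-User-Id"
  }
}
```

A row is returned only when its user_id column equals the X-Hasura-User-Id session variable from the requester's JWT.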
Column level permission
For the rows that are accessible based on the above, limit access to a subset of the columns based on this permission rule.
You can also allow users to access all rows and all columns of a particular table by selecting the “Without any checks” option.