Serialization is the process of converting an in-memory data structure (objects, dictionaries, lists) into a format that can be stored on disk, transmitted over a network, or cached for later retrieval. Deserialization is the reverse — reconstructing the original data structure from the serialized format.
If you have ever saved application state to a file, sent JSON to a REST API, or read a YAML configuration file, you have already been using serialization. It is one of the most fundamental operations in software engineering, and Python gives you several powerful modules to handle it.
Why serialization matters:

- Persistence — save program state to disk and restore it later
- Data exchange — send structured data between services, languages, and machines
- Caching — store the results of expensive computations for fast retrieval
In this tutorial, we will cover the most important serialization formats and libraries in Python: JSON, pickle, YAML, XML, dataclasses, and marshmallow. Each has its strengths, trade-offs, and ideal use cases.
JSON (JavaScript Object Notation) is the most widely used serialization format on the web. It is human-readable, language-agnostic, and supported by virtually every programming language. Python’s built-in json module handles JSON serialization and deserialization out of the box.
Use json.dumps() to serialize a Python object to a JSON string, and json.loads() to deserialize a JSON string back to a Python object.
import json
# Serialize Python dict to JSON string
user = {
    "name": "Folau",
    "age": 30,
    "email": "folau@example.com",
    "skills": ["Python", "Java", "AWS"],
    "active": True
}
json_string = json.dumps(user)
print(json_string)
# {"name": "Folau", "age": 30, "email": "folau@example.com", "skills": ["Python", "Java", "AWS"], "active": true}
print(type(json_string))
# <class 'str'>
# Deserialize JSON string back to Python dict
parsed = json.loads(json_string)
print(parsed["name"]) # Folau
print(parsed["skills"]) # ['Python', 'Java', 'AWS']
print(type(parsed)) # <class 'dict'>
Notice that Python’s True becomes JSON’s true, and None becomes null. The json module handles these conversions automatically.
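The mapping is easy to verify directly, in both directions:

```python
import json

# Python's True/False/None become JSON's true/false/null
print(json.dumps({"active": True, "deleted": False, "middle_name": None}))
# {"active": true, "deleted": false, "middle_name": null}

# ...and round-trip back to the original Python values
restored = json.loads('{"active": true, "middle_name": null}')
print(restored)
# {'active': True, 'middle_name': None}
```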
When you need to write JSON directly to a file or read from one, use json.dump() and json.load() (without the trailing “s”).
import json
user = {
    "name": "Folau",
    "age": 30,
    "roles": ["admin", "developer"]
}
# Write to file
with open("user.json", "w") as f:
    json.dump(user, f, indent=2)
# Read from file
with open("user.json", "r") as f:
    loaded_user = json.load(f)
print(loaded_user)
# {'name': 'Folau', 'age': 30, 'roles': ['admin', 'developer']}
Tip: Always use with statements for file operations. It guarantees the file is properly closed even if an exception occurs.
The json.dumps() function accepts several formatting options that make output more readable.
import json
config = {
    "database": {
        "host": "localhost",
        "port": 5432,
        "name": "myapp_db"
    },
    "cache": {
        "enabled": True,
        "ttl_seconds": 300
    },
    "debug": False
}
# Pretty print with 4-space indentation
pretty = json.dumps(config, indent=4)
print(pretty)
# Sort keys alphabetically
sorted_json = json.dumps(config, indent=2, sort_keys=True)
print(sorted_json)
# Compact output (minimize whitespace)
compact = json.dumps(config, separators=(",", ":"))
print(compact)
# {"database":{"host":"localhost","port":5432,"name":"myapp_db"},"cache":{"enabled":true,"ttl_seconds":300},"debug":false}
Use indent for config files and logs where readability matters. Use separators=(",", ":") when you need minimal payload size (e.g., sending data over a network).
The json module can only serialize basic Python types: dict, list, str, int, float, bool, and None. Anything else will raise a TypeError. This commonly happens with datetime objects, sets, custom classes, and bytes.
import json
from datetime import datetime
data = {
    "event": "deployment",
    "timestamp": datetime.now()
}
# This will FAIL
try:
    json.dumps(data)
except TypeError as e:
    print(f"Error: {e}")
# Error: Object of type datetime is not JSON serializable
The simplest fix is the default parameter, which provides a fallback serializer for unsupported types.
import json
from datetime import datetime, date
from decimal import Decimal
def json_serializer(obj):
    """Custom serializer for objects not handled by the default JSON encoder."""
    if isinstance(obj, (datetime, date)):
        return obj.isoformat()
    if isinstance(obj, Decimal):
        return float(obj)
    if isinstance(obj, set):
        return list(obj)
    if isinstance(obj, bytes):
        return obj.decode("utf-8")
    raise TypeError(f"Type {type(obj)} is not JSON serializable")
data = {
    "event": "deployment",
    "timestamp": datetime.now(),
    "cost": Decimal("49.99"),
    "tags": {"urgent", "production"},
    "payload": b"raw bytes here"
}
result = json.dumps(data, default=json_serializer, indent=2)
print(result)
For more control, subclass json.JSONEncoder. This is cleaner when you have complex serialization logic that you want to reuse across your application.
import json
from datetime import datetime, date
from decimal import Decimal
class AppJSONEncoder(json.JSONEncoder):
    """Custom JSON encoder for application-specific types."""

    def default(self, obj):
        if isinstance(obj, (datetime, date)):
            return obj.isoformat()
        if isinstance(obj, Decimal):
            return str(obj)
        if isinstance(obj, set):
            return sorted(list(obj))
        if isinstance(obj, bytes):
            return obj.decode("utf-8")
        # Let the base class raise TypeError for unknown types
        return super().default(obj)
data = {
    "user": "Folau",
    "created_at": datetime(2024, 1, 15, 10, 30, 0),
    "balance": Decimal("1250.75"),
    "permissions": {"read", "write", "admin"}
}
print(json.dumps(data, cls=AppJSONEncoder, indent=2))
When to use which approach:
- default parameter — quick, one-off serialization
- JSONEncoder subclass — reusable across your codebase, better for large projects
While JSON handles basic data types, Python’s pickle module can serialize almost any Python object — including classes, functions, nested structures, and even lambda expressions. The trade-off is that pickle output is binary (not human-readable) and Python-specific (other languages cannot read it).
import pickle
# A complex Python object that JSON cannot handle
class User:
    def __init__(self, name, age, scores):
        self.name = name
        self.age = age
        self.scores = scores

    def average_score(self):
        return sum(self.scores) / len(self.scores)

    def __repr__(self):
        return f"User(name={self.name}, age={self.age})"
user = User("Folau", 30, [95, 88, 72, 90])
# Serialize to bytes
pickled = pickle.dumps(user)
print(type(pickled)) # <class 'bytes'>
print(len(pickled)) # varies
# Deserialize back to object
restored = pickle.loads(pickled)
print(restored) # User(name=Folau, age=30)
print(restored.average_score()) # 86.25
import pickle
user = User("Folau", 30, [95, 88, 72, 90])
# Write to file (binary mode!)
with open("user.pkl", "wb") as f:
    pickle.dump(user, f)
# Read from file
with open("user.pkl", "rb") as f:
    loaded_user = pickle.load(f)
print(loaded_user.name) # Folau
print(loaded_user.average_score()) # 86.25
Important: Always open pickle files in binary mode ("wb" and "rb"). Pickle produces bytes, not text.
| Feature | JSON | pickle |
|---|---|---|
| Human-readable | Yes | No (binary) |
| Language support | Universal | Python only |
| Custom objects | Requires custom encoder | Works out of the box |
| Security | Safe to deserialize | Can execute arbitrary code |
| Speed | Moderate | Fast for Python objects |
| Best for | APIs, config files, data exchange | Caching, internal Python storage |
WARNING: Never unpickle data from untrusted sources! Pickle can execute arbitrary code during deserialization. A malicious pickle payload can run system commands, delete files, or open network connections. Only use pickle with data you created yourself or from a fully trusted source.
import pickle
import os
# This is what a MALICIOUS pickle payload looks like.
# DO NOT run this — it demonstrates the danger.
class Malicious:
    def __reduce__(self):
        # This would execute a system command when unpickled!
        return (os.system, ("echo 'You have been hacked!'",))
# If someone sends you a pickle file, it could contain code like this.
# NEVER do: pickle.loads(untrusted_data)
# SAFE alternatives for untrusted data:
# - Use json.loads() for JSON data
# - Use yaml.safe_load() for YAML data
# - Use pickle only for data YOU created
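When you truly cannot avoid pickle, the pickle module documentation describes one hardening technique: subclass pickle.Unpickler and override find_class() with an allow-list, so unpickling can only resolve types you explicitly trust. A minimal sketch (the allow-list below is illustrative; JSON remains the safer default for anything crossing a trust boundary):

```python
import builtins
import io
import os
import pickle

class RestrictedUnpickler(pickle.Unpickler):
    """Unpickler that only resolves an allow-list of harmless built-in types."""
    SAFE_BUILTINS = {"list", "dict", "set", "tuple", "str", "int", "float"}

    def find_class(self, module, name):
        if module == "builtins" and name in self.SAFE_BUILTINS:
            return getattr(builtins, name)
        # Anything else (os.system, subprocess.Popen, ...) is refused
        raise pickle.UnpicklingError(f"global '{module}.{name}' is forbidden")

def restricted_loads(data: bytes):
    return RestrictedUnpickler(io.BytesIO(data)).load()

# Plain data structures still round-trip fine
print(restricted_loads(pickle.dumps({"a": [1, 2, 3]})))  # {'a': [1, 2, 3]}

# ...but a malicious payload like the one above is rejected before it can run
class Malicious:
    def __reduce__(self):
        return (os.system, ("echo 'You have been hacked!'",))

try:
    restricted_loads(pickle.dumps(Malicious()))
except pickle.UnpicklingError as e:
    print(f"Blocked: {e}")
```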
YAML (YAML Ain’t Markup Language) is popular for configuration files because it is more human-friendly than JSON — no braces, no quotes around keys, and it supports comments. Python uses the PyYAML library to work with YAML.
# Install first: pip install pyyaml
import yaml
# Python dict to YAML string
config = {
    "database": {
        "host": "localhost",
        "port": 5432,
        "name": "myapp_db",
        "credentials": {
            "username": "admin",
            "password": "secret"
        }
    },
    "logging": {
        "level": "INFO",
        "file": "/var/log/app.log"
    },
    "features": ["auth", "caching", "rate_limiting"]
}
yaml_string = yaml.dump(config, default_flow_style=False, sort_keys=False)
print(yaml_string)
Output:
database:
  host: localhost
  port: 5432
  name: myapp_db
  credentials:
    username: admin
    password: secret
logging:
  level: INFO
  file: /var/log/app.log
features:
- auth
- caching
- rate_limiting
import yaml

yaml_content = """
server:
  host: 0.0.0.0
  port: 8080
  workers: 4
database:
  url: postgresql://localhost:5432/myapp
  pool_size: 10
  # Timeout in seconds
  timeout: 30
features:
  - authentication
  - rate_limiting
  - caching
"""

# ALWAYS use safe_load, never yaml.load() without a Loader
config = yaml.safe_load(yaml_content)
print(config["server"]["port"])  # 8080
print(config["database"]["url"])  # postgresql://localhost:5432/myapp
print(config["features"])  # ['authentication', 'rate_limiting', 'caching']
import yaml
import os
def load_config(config_path="config.yaml"):
    """Load application configuration from a YAML file."""
    if not os.path.exists(config_path):
        raise FileNotFoundError(f"Config file not found: {config_path}")
    with open(config_path, "r") as f:
        config = yaml.safe_load(f)
    # Override with environment variables if set
    if os.environ.get("DB_HOST"):
        config["database"]["host"] = os.environ["DB_HOST"]
    if os.environ.get("DB_PASSWORD"):
        config["database"]["password"] = os.environ["DB_PASSWORD"]
    return config
def save_config(config, config_path="config.yaml"):
    """Save configuration back to a YAML file."""
    with open(config_path, "w") as f:
        yaml.dump(config, f, default_flow_style=False, sort_keys=False)
# Usage
# config = load_config("config.yaml")
# print(config["database"]["host"])
Why YAML over JSON for config? YAML supports comments, is easier to read and edit by hand, and does not require quotes around string keys. JSON is better for data interchange because it is stricter and more widely supported programmatically.
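For example, a YAML config file can document itself inline; none of this is possible in standard JSON (the file name and settings below are illustrative):

```yaml
# config.yaml: tune these per environment
server:
  host: 0.0.0.0      # bind on all interfaces
  port: 8080
features:            # keys and simple strings need no quotes
  - auth
  - caching
```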
XML (eXtensible Markup Language) is less common for new projects but still widely used in enterprise systems, SOAP APIs, and legacy codebases. Python’s standard library includes xml.etree.ElementTree for working with XML.
import xml.etree.ElementTree as ET
# Create XML programmatically
root = ET.Element("users")
user1 = ET.SubElement(root, "user", id="1")
ET.SubElement(user1, "name").text = "Folau"
ET.SubElement(user1, "email").text = "folau@example.com"
ET.SubElement(user1, "role").text = "admin"
user2 = ET.SubElement(root, "user", id="2")
ET.SubElement(user2, "name").text = "Jane"
ET.SubElement(user2, "email").text = "jane@example.com"
ET.SubElement(user2, "role").text = "developer"
# Convert to string
xml_string = ET.tostring(root, encoding="unicode", xml_declaration=True)
print(xml_string)
import xml.etree.ElementTree as ET
# Parse XML string
xml_data = """
<users>
    <user id="1">
        <name>Folau</name>
        <email>folau@example.com</email>
        <role>admin</role>
    </user>
    <user id="2">
        <name>Jane</name>
        <email>jane@example.com</email>
        <role>developer</role>
    </user>
</users>
"""
root = ET.fromstring(xml_data)
for user in root.findall("user"):
    user_id = user.get("id")
    name = user.find("name").text
    email = user.find("email").text
    role = user.find("role").text
    print(f"ID: {user_id}, Name: {name}, Email: {email}, Role: {role}")
# Output:
# ID: 1, Name: Folau, Email: folau@example.com, Role: admin
# ID: 2, Name: Jane, Email: jane@example.com, Role: developer
When to use XML: SOAP web services, configuration files for Java-based systems (Maven pom.xml, Android manifests), RSS/Atom feeds, and legacy integrations. For new Python projects, JSON or YAML are almost always better choices.
Python’s dataclasses module (introduced in Python 3.7) provides a clean way to define data-holding classes. Combined with the dataclasses.asdict() function, they integrate well with JSON serialization.
import json
from dataclasses import dataclass, asdict, field
from typing import List
@dataclass
class Address:
    street: str
    city: str
    state: str
    zip_code: str

@dataclass
class Employee:
    name: str
    age: int
    department: str
    skills: List[str] = field(default_factory=list)
    address: Address = None

    def to_json(self):
        """Serialize to JSON string."""
        return json.dumps(asdict(self), indent=2)

    @classmethod
    def from_json(cls, json_string):
        """Deserialize from JSON string."""
        data = json.loads(json_string)
        # Handle nested Address object
        if data.get("address"):
            data["address"] = Address(**data["address"])
        return cls(**data)
# Create and serialize
employee = Employee(
    name="Folau",
    age=30,
    department="Engineering",
    skills=["Python", "AWS", "Docker"],
    address=Address("123 Main St", "San Francisco", "CA", "94102")
)
json_output = employee.to_json()
print(json_output)
# Deserialize back
restored = Employee.from_json(json_output)
print(restored.name) # Folau
print(restored.address.city) # San Francisco
print(restored.skills) # ['Python', 'AWS', 'Docker']
Why dataclasses for serialization?
- asdict() provides automatic conversion to a dictionary (ready for json.dumps())
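Note that asdict() recurses into nested dataclasses and containers, so even composite objects flatten cleanly (Point and Route here are illustrative names):

```python
from dataclasses import dataclass, asdict, field
from typing import List

@dataclass
class Point:
    x: int
    y: int

@dataclass
class Route:
    name: str
    points: List[Point] = field(default_factory=list)

route = Route("commute", [Point(0, 0), Point(3, 4)])

# Nested dataclasses inside the list become plain dicts too
print(asdict(route))
# {'name': 'commute', 'points': [{'x': 0, 'y': 0}, {'x': 3, 'y': 4}]}
```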
For production applications that need validation, type coercion, and well-defined schemas, the marshmallow library is the gold standard. It separates your data model from your serialization logic, which keeps things clean as your application grows.
# Install first: pip install marshmallow
from marshmallow import Schema, fields, validate, post_load
class User:
    def __init__(self, name, email, age, role="viewer"):
        self.name = name
        self.email = email
        self.age = age
        self.role = role

    def __repr__(self):
        return f"User(name={self.name}, email={self.email}, role={self.role})"

class UserSchema(Schema):
    name = fields.Str(required=True, validate=validate.Length(min=1, max=100))
    email = fields.Email(required=True)
    age = fields.Int(required=True, validate=validate.Range(min=0, max=150))
    role = fields.Str(validate=validate.OneOf(["admin", "editor", "viewer"]))

    @post_load
    def make_user(self, data, **kwargs):
        return User(**data)
schema = UserSchema()
# Deserialize (load) — validates and creates object
user_data = {"name": "Folau", "email": "folau@example.com", "age": 30, "role": "admin"}
user = schema.load(user_data)
print(user) # User(name=Folau, email=folau@example.com, role=admin)
# Serialize (dump) — converts object to dict
output = schema.dump(user)
print(output) # {'name': 'Folau', 'email': 'folau@example.com', 'age': 30, 'role': 'admin'}
# Validation error example
try:
    bad_data = {"name": "", "email": "not-an-email", "age": -5}
    schema.load(bad_data)
except Exception as e:
    print(f"Validation errors: {e}")
Key benefits of marshmallow:

- Validation happens at load time — bad data raises errors instead of silently becoming objects
- Type coercion and explicit, well-defined schemas for every field
- Serialization logic stays separate from your data model, which keeps things clean as the application grows
This is one of the most common real-world serialization tasks: fetching data from a REST API, processing it, and serializing the results.
import json
import urllib.request
from dataclasses import dataclass, asdict
from typing import List

@dataclass
class Todo:
    id: int
    title: str
    completed: bool
    user_id: int

    @classmethod
    def from_api_response(cls, data: dict) -> "Todo":
        """Create a Todo from an API response dict."""
        return cls(
            id=data["id"],
            title=data["title"],
            completed=data["completed"],
            user_id=data["userId"]
        )

def fetch_todos(limit: int = 10) -> List[Todo]:
    """Fetch todos from the JSONPlaceholder API."""
    url = f"https://jsonplaceholder.typicode.com/todos?_limit={limit}"
    with urllib.request.urlopen(url) as response:
        data = json.loads(response.read().decode())
    return [Todo.from_api_response(item) for item in data]

def save_todos(todos: List[Todo], filepath: str):
    """Serialize todos to a JSON file."""
    data = [asdict(todo) for todo in todos]
    with open(filepath, "w") as f:
        json.dump(data, f, indent=2)
    print(f"Saved {len(todos)} todos to {filepath}")

def load_todos(filepath: str) -> List[Todo]:
    """Deserialize todos from a JSON file."""
    with open(filepath, "r") as f:
        data = json.load(f)
    return [Todo(**item) for item in data]
# Fetch from API, process, and save
todos = fetch_todos(limit=5)
completed = [t for t in todos if t.completed]
print(f"Completed: {len(completed)} / {len(todos)}")
save_todos(todos, "todos.json")
restored = load_todos("todos.json")
print(f"Loaded {len(restored)} todos from file")
import json
import os
from datetime import datetime
class ConfigManager:
    """Manage application configuration with JSON persistence."""

    def __init__(self, config_path="app_config.json"):
        self.config_path = config_path
        self.config = self._load_or_create()

    def _load_or_create(self):
        """Load existing config or create default."""
        if os.path.exists(self.config_path):
            with open(self.config_path, "r") as f:
                return json.load(f)
        return self._default_config()

    def _default_config(self):
        """Return default configuration."""
        return {
            "app_name": "MyApp",
            "version": "1.0.0",
            "database": {
                "host": "localhost",
                "port": 5432,
                "name": "myapp_db"
            },
            "logging": {
                "level": "INFO",
                "file": "app.log"
            },
            "last_modified": datetime.now().isoformat()
        }

    def get(self, key, default=None):
        """Get a config value using dot notation: 'database.host'."""
        keys = key.split(".")
        value = self.config
        for k in keys:
            if isinstance(value, dict) and k in value:
                value = value[k]
            else:
                return default
        return value

    def set(self, key, value):
        """Set a config value using dot notation."""
        keys = key.split(".")
        config = self.config
        for k in keys[:-1]:
            config = config.setdefault(k, {})
        config[keys[-1]] = value
        self.config["last_modified"] = datetime.now().isoformat()
        self._save()

    def _save(self):
        """Persist config to disk."""
        with open(self.config_path, "w") as f:
            json.dump(self.config, f, indent=2)
# Usage
config = ConfigManager("app_config.json")
print(config.get("database.host")) # localhost
print(config.get("logging.level")) # INFO
config.set("database.host", "db.production.com")
config.set("logging.level", "WARNING")
print(config.get("database.host")) # db.production.com
import json
import csv
from datetime import datetime

class DataExporter:
    """Export and import data between JSON and CSV formats."""

    @staticmethod
    def json_to_csv(json_path, csv_path):
        """Convert a JSON array of objects to CSV."""
        with open(json_path, "r") as f:
            data = json.load(f)
        if not data:
            print("No data to export")
            return
        # Use keys from first record as CSV headers
        headers = list(data[0].keys())
        with open(csv_path, "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=headers)
            writer.writeheader()
            writer.writerows(data)
        print(f"Exported {len(data)} records to {csv_path}")

    @staticmethod
    def csv_to_json(csv_path, json_path):
        """Convert CSV to a JSON array of objects."""
        records = []
        with open(csv_path, "r") as f:
            reader = csv.DictReader(f)
            for row in reader:
                records.append(dict(row))
        with open(json_path, "w") as f:
            json.dump(records, f, indent=2)
        print(f"Imported {len(records)} records to {json_path}")

    @staticmethod
    def export_summary(data, output_path):
        """Export a summary report as JSON."""
        summary = {
            "total_records": len(data),
            "exported_at": datetime.now().isoformat(),
            "sample": data[:3]
        }
        with open(output_path, "w") as f:
            json.dump(summary, f, indent=2)
        print(f"Summary saved to {output_path}")
# Example usage
employees = [
    {"name": "Folau", "department": "Engineering", "salary": 95000},
    {"name": "Jane", "department": "Marketing", "salary": 85000},
    {"name": "Bob", "department": "Engineering", "salary": 90000},
]
# Save as JSON
with open("employees.json", "w") as f:
    json.dump(employees, f, indent=2)
# Convert JSON to CSV
exporter = DataExporter()
exporter.json_to_csv("employees.json", "employees.csv")
exporter.csv_to_json("employees.csv", "employees_restored.json")
import pickle
import os
import time
import hashlib
from functools import wraps
def pickle_cache(cache_dir=".cache"):
    """Decorator that caches function results using pickle."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            # Create cache directory if needed
            os.makedirs(cache_dir, exist_ok=True)
            # Generate a unique cache key from function name and arguments
            key_data = f"{func.__name__}:{args}:{sorted(kwargs.items())}"
            cache_key = hashlib.md5(key_data.encode()).hexdigest()
            cache_path = os.path.join(cache_dir, f"{cache_key}.pkl")
            # Return cached result if available
            if os.path.exists(cache_path):
                print(f"Cache HIT for {func.__name__}")
                with open(cache_path, "rb") as f:
                    return pickle.load(f)
            # Compute and cache the result
            print(f"Cache MISS for {func.__name__} — computing...")
            result = func(*args, **kwargs)
            with open(cache_path, "wb") as f:
                pickle.dump(result, f)
            return result
        return wrapper
    return decorator

@pickle_cache()
def expensive_computation(n):
    """Simulate a slow computation."""
    time.sleep(2)  # Pretend this takes a long time
    return {i: i ** 3 for i in range(n)}
# First call: takes 2 seconds (cache MISS)
start = time.time()
result1 = expensive_computation(1000)
print(f"First call: {time.time() - start:.2f}s")
# Second call: instant (cache HIT)
start = time.time()
result2 = expensive_computation(1000)
print(f"Second call: {time.time() - start:.2f}s")
print(f"Results match: {result1 == result2}")
This is the single most important pitfall. As demonstrated earlier, pickle.loads() can execute arbitrary code. Never use pickle to deserialize data from user input, external APIs, or any untrusted source. Use JSON instead.
import json
# Problem: non-ASCII characters
data = {"city": "Sao Paulo", "greeting": "Hola, como estas?"}
# Default behavior escapes non-ASCII
print(json.dumps(data))
# {"city": "Sao Paulo", "greeting": "Hola, \u00bfcomo est\u00e1s?"}
# Fix: use ensure_ascii=False
print(json.dumps(data, ensure_ascii=False))
# {"city": "Sao Paulo", "greeting": "Hola, como estas?"}
# When writing to files, always specify encoding
with open("data.json", "w", encoding="utf-8") as f:
json.dump(data, f, ensure_ascii=False, indent=2)
import json
# This will raise ValueError: Circular reference detected
a = {}
b = {"ref": a}
a["ref"] = b
try:
    json.dumps(a)
except ValueError as e:
    print(f"Error: {e}")  # Circular reference detected
# Solution: break circular references before serializing
# or use a custom encoder that tracks visited objects
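One way to implement the second suggestion is a small pre-processing pass that copies the structure and swaps any container it has already visited for a placeholder. This is a sketch, not a general solution: because the visited set is shared down the traversal, it also flags repeated shared references, not only true cycles.

```python
import json

def break_cycles(obj, _seen=None):
    """Recursively copy a structure, replacing back-references with a placeholder."""
    if _seen is None:
        _seen = set()
    if isinstance(obj, dict):
        if id(obj) in _seen:
            return "<circular>"
        _seen.add(id(obj))
        return {k: break_cycles(v, _seen) for k, v in obj.items()}
    if isinstance(obj, list):
        if id(obj) in _seen:
            return "<circular>"
        _seen.add(id(obj))
        return [break_cycles(v, _seen) for v in obj]
    return obj

a = {}
b = {"ref": a}
a["ref"] = b

# The cycle is cut before json.dumps ever sees it
print(json.dumps(break_cycles(a)))
# {"ref": {"ref": "<circular>"}}
```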
import json
from datetime import datetime
# Problem: datetime is not JSON-serializable
event = {"name": "Deploy", "timestamp": datetime.now()}
# Solution 1: Convert to ISO format string
event["timestamp"] = event["timestamp"].isoformat()
print(json.dumps(event))
# Solution 2: Use the default parameter
def default_handler(obj):
    if hasattr(obj, "isoformat"):
        return obj.isoformat()
    raise TypeError(f"Cannot serialize {type(obj)}")
event2 = {"name": "Deploy", "timestamp": datetime.now()}
print(json.dumps(event2, default=default_handler))
# Deserializing back to datetime
json_str = '{"name": "Deploy", "timestamp": "2024-01-15T10:30:00"}'
data = json.loads(json_str)
data["timestamp"] = datetime.fromisoformat(data["timestamp"])
print(type(data["timestamp"])) # <class 'datetime.datetime'>
import json
# Python allows non-string keys in dicts
data = {1: "one", 2: "two", (3, 4): "tuple_key"}
# JSON only allows string keys — this converts int keys to strings
result = json.dumps({1: "one", 2: "two"})
print(result) # {"1": "one", "2": "two"}
parsed = json.loads(result)
print(parsed["1"]) # "one" — note the key is now a string!
# print(parsed[1]) # KeyError! The key is "1", not 1
# Tuple keys will raise TypeError
try:
    json.dumps(data)
except TypeError as e:
    print(f"Error: {e}")
After years of working with serialization in production systems, here are the practices that matter most:
- Always specify encoding="utf-8" when opening files, and use ensure_ascii=False if your data contains non-ASCII characters.
- Always use yaml.safe_load(), never yaml.load() without a Loader. The full yaml.load() can execute arbitrary Python code, similar to pickle.
- Define to_dict() / from_dict() methods on your classes, or use schemas (marshmallow) to define exactly what gets serialized and how.
- Read deserialized data with .get() with defaults rather than direct key access. Data schemas evolve, and old serialized data may lack newer fields.
- Avoid scattering raw json.dumps() calls throughout your code. Centralize serialization in dedicated methods or schema classes.
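The last point can be as simple as one module-level helper that every caller goes through; the names below are illustrative, but the idea is that app-wide settings and fallbacks live in exactly one place:

```python
import json
from datetime import datetime, date
from decimal import Decimal

def _app_default(obj):
    """Single fallback serializer shared by the whole application."""
    if isinstance(obj, (datetime, date)):
        return obj.isoformat()
    if isinstance(obj, Decimal):
        return str(obj)
    if isinstance(obj, set):
        return sorted(obj)
    raise TypeError(f"Type {type(obj)} is not JSON serializable")

def to_json(data, **kwargs):
    """The one place json.dumps is called, with app-wide settings applied."""
    return json.dumps(data, default=_app_default, ensure_ascii=False, **kwargs)

print(to_json({"when": date(2024, 1, 15), "tags": {"b", "a"}}))
# {"when": "2024-01-15", "tags": ["a", "b"]}
```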
- JSON (the json module) is the go-to format for APIs and human-readable data. Use dumps/loads for strings, dump/load for files.
- YAML (PyYAML) excels at configuration files. Always use safe_load().
- XML (ElementTree) is for enterprise/legacy integrations.
- Dataclasses with asdict() provide a clean, zero-dependency path from Python objects to JSON.
- Handle datetime, encoding, and non-string keys explicitly — they are the most common sources of serialization bugs.