Python Advanced – Serialization

Introduction

Serialization is the process of converting an in-memory data structure (objects, dictionaries, lists) into a format that can be stored on disk, transmitted over a network, or cached for later retrieval. Deserialization is the reverse — reconstructing the original data structure from the serialized format.

If you have ever saved application state to a file, sent JSON to a REST API, or read a YAML configuration file, you have already been using serialization. It is one of the most fundamental operations in software engineering, and Python gives you several powerful modules to handle it.

Why serialization matters:

  • Data persistence — Save program state between sessions (e.g., user preferences, application data)
  • API communication — Exchange structured data between services over HTTP (JSON is the lingua franca of modern APIs)
  • Caching — Store expensive computation results and reload them instantly
  • Inter-process communication — Share data between different programs, languages, or machines
  • Configuration management — Store and load application settings in human-readable formats

In this tutorial, we will cover the most important serialization formats and libraries in Python: JSON, pickle, YAML, XML, dataclasses, and marshmallow. Each has its strengths, trade-offs, and ideal use cases.

 


1. JSON Serialization

JSON (JavaScript Object Notation) is the most widely used serialization format on the web. It is human-readable, language-agnostic, and supported by virtually every programming language. Python’s built-in json module handles JSON serialization and deserialization out of the box.

1.1 — json.dumps() and json.loads() (Working with Strings)

Use json.dumps() to serialize a Python object to a JSON string, and json.loads() to deserialize a JSON string back to a Python object.

import json

# Serialize Python dict to JSON string
user = {
    "name": "Folau",
    "age": 30,
    "email": "folau@example.com",
    "skills": ["Python", "Java", "AWS"],
    "active": True
}

json_string = json.dumps(user)
print(json_string)
# {"name": "Folau", "age": 30, "email": "folau@example.com", "skills": ["Python", "Java", "AWS"], "active": true}

print(type(json_string))
# <class 'str'>

# Deserialize JSON string back to Python dict
parsed = json.loads(json_string)
print(parsed["name"])    # Folau
print(parsed["skills"])  # ['Python', 'Java', 'AWS']
print(type(parsed))      # <class 'dict'>

Notice that Python’s True becomes JSON’s true, and None becomes null. The json module handles these conversions automatically.
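One more mapping worth knowing: tuples are serialized as JSON arrays, so a round trip does not always return an identical object. A quick sketch:

```python
import json

# Round trip shows the type mappings: None -> null, tuple -> array
data = {"value": None, "point": (1, 2), "flags": [True, False]}

encoded = json.dumps(data)
print(encoded)
# {"value": null, "point": [1, 2], "flags": [true, false]}

decoded = json.loads(encoded)
print(decoded["point"])  # [1, 2] -- the tuple comes back as a list
```

Keep this in mind when comparing objects before and after serialization.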

1.2 — json.dump() and json.load() (Working with Files)

When you need to write JSON directly to a file or read from one, use json.dump() and json.load() (without the trailing “s”).

import json

user = {
    "name": "Folau",
    "age": 30,
    "roles": ["admin", "developer"]
}

# Write to file
with open("user.json", "w") as f:
    json.dump(user, f, indent=2)

# Read from file
with open("user.json", "r") as f:
    loaded_user = json.load(f)

print(loaded_user)
# {'name': 'Folau', 'age': 30, 'roles': ['admin', 'developer']}

Tip: Always use with statements for file operations. It guarantees the file is properly closed even if an exception occurs.

1.3 — Pretty Printing, sort_keys, and indent

The json.dumps() function accepts several formatting options that make output more readable.

import json

config = {
    "database": {
        "host": "localhost",
        "port": 5432,
        "name": "myapp_db"
    },
    "cache": {
        "enabled": True,
        "ttl_seconds": 300
    },
    "debug": False
}

# Pretty print with 4-space indentation
pretty = json.dumps(config, indent=4)
print(pretty)

# Sort keys alphabetically
sorted_json = json.dumps(config, indent=2, sort_keys=True)
print(sorted_json)

# Compact output (minimize whitespace)
compact = json.dumps(config, separators=(",", ":"))
print(compact)
# {"database":{"host":"localhost","port":5432,"name":"myapp_db"},"cache":{"enabled":true,"ttl_seconds":300},"debug":false}

Use indent for config files and logs where readability matters. Use separators=(",", ":") when you need minimal payload size (e.g., sending data over a network).

1.4 — Handling Non-Serializable Types

The json module can only serialize basic Python types: dict, list, tuple, str, int, float, bool, and None. Anything else will raise a TypeError. This commonly happens with datetime objects, sets, custom classes, and bytes.

import json
from datetime import datetime

data = {
    "event": "deployment",
    "timestamp": datetime.now()
}

# This will FAIL
try:
    json.dumps(data)
except TypeError as e:
    print(f"Error: {e}")
    # Error: Object of type datetime is not JSON serializable

The simplest fix is the default parameter, which provides a fallback serializer for unsupported types.

import json
from datetime import datetime, date
from decimal import Decimal

def json_serializer(obj):
    """Custom serializer for objects not handled by default json encoder."""
    if isinstance(obj, (datetime, date)):
        return obj.isoformat()
    if isinstance(obj, Decimal):
        return float(obj)
    if isinstance(obj, set):
        return list(obj)
    if isinstance(obj, bytes):
        return obj.decode("utf-8")
    raise TypeError(f"Type {type(obj)} is not JSON serializable")

data = {
    "event": "deployment",
    "timestamp": datetime.now(),
    "cost": Decimal("49.99"),
    "tags": {"urgent", "production"},
    "payload": b"raw bytes here"
}

result = json.dumps(data, default=json_serializer, indent=2)
print(result)

1.5 — Custom JSONEncoder

For more control, subclass json.JSONEncoder. This is cleaner when you have complex serialization logic that you want to reuse across your application.

import json
from datetime import datetime, date
from decimal import Decimal

class AppJSONEncoder(json.JSONEncoder):
    """Custom JSON encoder for application-specific types."""

    def default(self, obj):
        if isinstance(obj, (datetime, date)):
            return obj.isoformat()
        if isinstance(obj, Decimal):
            return str(obj)
        if isinstance(obj, set):
            return sorted(list(obj))
        if isinstance(obj, bytes):
            return obj.decode("utf-8")
        # Let the base class raise TypeError for unknown types
        return super().default(obj)

data = {
    "user": "Folau",
    "created_at": datetime(2024, 1, 15, 10, 30, 0),
    "balance": Decimal("1250.75"),
    "permissions": {"read", "write", "admin"}
}

print(json.dumps(data, cls=AppJSONEncoder, indent=2))

When to use which approach:

  • default parameter — Quick one-off serialization
  • JSONEncoder subclass — Reusable across your codebase, better for large projects

 


2. pickle Module — Binary Serialization

While JSON handles basic data types, Python’s pickle module can serialize almost any Python object, including class instances, sets, and deeply nested structures. (Functions and classes are pickled by reference, so module-level functions work, but lambdas and locally defined functions do not.) The trade-off is that pickle output is binary (not human-readable) and Python-specific (other languages cannot read it).

2.1 — pickle.dumps()/loads() and dump()/load()

import pickle

# A complex Python object that JSON cannot handle
class User:
    def __init__(self, name, age, scores):
        self.name = name
        self.age = age
        self.scores = scores

    def average_score(self):
        return sum(self.scores) / len(self.scores)

    def __repr__(self):
        return f"User(name={self.name}, age={self.age})"

user = User("Folau", 30, [95, 88, 72, 90])

# Serialize to bytes
pickled = pickle.dumps(user)
print(type(pickled))  # <class 'bytes'>
print(len(pickled))   # varies

# Deserialize back to object
restored = pickle.loads(pickled)
print(restored)               # User(name=Folau, age=30)
print(restored.average_score())  # 86.25
# pickle.dump() and pickle.load() work with file objects:
import pickle

user = User("Folau", 30, [95, 88, 72, 90])

# Write to file (binary mode!)
with open("user.pkl", "wb") as f:
    pickle.dump(user, f)

# Read from file
with open("user.pkl", "rb") as f:
    loaded_user = pickle.load(f)

print(loaded_user.name)           # Folau
print(loaded_user.average_score())  # 86.25

Important: Always open pickle files in binary mode ("wb" and "rb"). Pickle produces bytes, not text.

2.2 — pickle vs JSON: When to Use Each

Feature           JSON                                pickle
----------------  ----------------------------------  --------------------------------
Human-readable    Yes                                 No (binary)
Language support  Universal                           Python only
Custom objects    Requires custom encoder             Works out of the box
Security          Safe to deserialize                 Can execute arbitrary code
Speed             Moderate                            Fast for Python objects
Best for          APIs, config files, data exchange   Caching, internal Python storage
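One knob the comparison above does not mention is the pickle protocol version. Newer protocols produce smaller, faster output but cannot be read by older Python versions; a quick sketch:

```python
import pickle

data = {"scores": list(range(100))}

# Protocol 0 is the original ASCII format; HIGHEST_PROTOCOL is the newest
# this interpreter supports (protocol 5 since Python 3.8).
old = pickle.dumps(data, protocol=0)
new = pickle.dumps(data, protocol=pickle.HIGHEST_PROTOCOL)

print(f"protocol 0: {len(old)} bytes")
print(f"protocol {pickle.HIGHEST_PROTOCOL}: {len(new)} bytes")

# Both deserialize to the same object
assert pickle.loads(old) == pickle.loads(new) == data
```

If pickled files must be shared across Python versions, pin an explicit protocol number rather than relying on the default.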

2.3 — Security Warning

WARNING: Never unpickle data from untrusted sources! Pickle can execute arbitrary code during deserialization. A malicious pickle payload can run system commands, delete files, or open network connections. Only use pickle with data you created yourself or from a fully trusted source.

import pickle
import os

# This is what a MALICIOUS pickle payload looks like.
# DO NOT run this — it demonstrates the danger.
class Malicious:
    def __reduce__(self):
        # This would execute a system command when unpickled!
        return (os.system, ("echo 'You have been hacked!'",))

# If someone sends you a pickle file, it could contain code like this.
# NEVER do: pickle.loads(untrusted_data)

# SAFE alternatives for untrusted data:
# - Use json.loads() for JSON data
# - Use yaml.safe_load() for YAML data
# - Use pickle only for data YOU created
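If you must persist pickles (for example, a cache on shared storage), one common mitigation is to sign the payload with an HMAC and verify the signature before unpickling. This is a sketch of the idea, not a substitute for avoiding untrusted input; the key name and helper functions are our own:

```python
import hashlib
import hmac
import pickle

SECRET_KEY = b"change-me"  # in real code, load this from outside source control

def signed_dumps(obj):
    """Pickle an object and prepend a SHA-256 HMAC of the payload."""
    payload = pickle.dumps(obj)
    signature = hmac.new(SECRET_KEY, payload, hashlib.sha256).digest()
    return signature + payload

def signed_loads(blob):
    """Verify the HMAC before unpickling; refuse tampered data."""
    signature, payload = blob[:32], blob[32:]  # SHA-256 digest is 32 bytes
    expected = hmac.new(SECRET_KEY, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(signature, expected):
        raise ValueError("Pickle signature mismatch, refusing to load")
    return pickle.loads(payload)

blob = signed_dumps({"user": "Folau"})
print(signed_loads(blob))  # {'user': 'Folau'}
```

Tampering with even one byte of the payload makes `signed_loads` raise before `pickle.loads` ever runs.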

 


3. YAML Serialization with PyYAML

YAML (YAML Ain’t Markup Language) is popular for configuration files because it is more human-friendly than JSON: no braces, no quotes around keys, and it supports comments. YAML support is not in the standard library; the de facto choice in Python is the third-party PyYAML package.

# Install first: pip install pyyaml
import yaml

# Python dict to YAML string
config = {
    "database": {
        "host": "localhost",
        "port": 5432,
        "name": "myapp_db",
        "credentials": {
            "username": "admin",
            "password": "secret"
        }
    },
    "logging": {
        "level": "INFO",
        "file": "/var/log/app.log"
    },
    "features": ["auth", "caching", "rate_limiting"]
}

yaml_string = yaml.dump(config, default_flow_style=False, sort_keys=False)
print(yaml_string)

Output:

database:
  host: localhost
  port: 5432
  name: myapp_db
  credentials:
    username: admin
    password: secret
logging:
  level: INFO
  file: /var/log/app.log
features:
- auth
- caching
- rate_limiting

3.1 — Reading YAML Files (Always Use safe_load)

import yaml

yaml_content = """
server:
  host: 0.0.0.0
  port: 8080
  workers: 4

database:
  url: postgresql://localhost:5432/myapp
  pool_size: 10
  # Timeout in seconds
  timeout: 30

features:
  - authentication
  - rate_limiting
  - caching
"""

# ALWAYS use safe_load, never yaml.load() without a Loader
config = yaml.safe_load(yaml_content)

print(config["server"]["port"])      # 8080
print(config["database"]["url"])     # postgresql://localhost:5432/myapp
print(config["features"])            # ['authentication', 'rate_limiting', 'caching']

3.2 — Config File Use Case

import yaml
import os

def load_config(config_path="config.yaml"):
    """Load application configuration from YAML file."""
    if not os.path.exists(config_path):
        raise FileNotFoundError(f"Config file not found: {config_path}")

    with open(config_path, "r") as f:
        config = yaml.safe_load(f)

    # Override with environment variables if set
    if os.environ.get("DB_HOST"):
        config["database"]["host"] = os.environ["DB_HOST"]
    if os.environ.get("DB_PASSWORD"):
        config["database"]["password"] = os.environ["DB_PASSWORD"]

    return config

def save_config(config, config_path="config.yaml"):
    """Save configuration back to YAML file."""
    with open(config_path, "w") as f:
        yaml.dump(config, f, default_flow_style=False, sort_keys=False)

# Usage
# config = load_config("config.yaml")
# print(config["database"]["host"])

Why YAML over JSON for config? YAML supports comments, is easier to read and edit by hand, and does not require quotes around string keys. JSON is better for data interchange because it is stricter and more widely supported programmatically.
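To make the comment difference concrete, here is a small sketch (assuming PyYAML is installed): YAML parses a commented document happily, while the same trick is a syntax error in JSON.

```python
import json
import yaml  # pip install pyyaml

text = """
# Comments are legal in YAML
retries: 3  # even inline
"""

config = yaml.safe_load(text)
print(config)  # {'retries': 3}

# JSON has no comment syntax at all; anything after the object is an error
try:
    json.loads('{"retries": 3}  // not allowed')
except json.JSONDecodeError as e:
    print(f"JSON rejects comments: {e}")
```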

 


4. XML Basics with ElementTree

XML (eXtensible Markup Language) is less common for new projects but still widely used in enterprise systems, SOAP APIs, and legacy codebases. Python’s standard library includes xml.etree.ElementTree for working with XML.

import xml.etree.ElementTree as ET

# Create XML programmatically
root = ET.Element("users")

user1 = ET.SubElement(root, "user", id="1")
ET.SubElement(user1, "name").text = "Folau"
ET.SubElement(user1, "email").text = "folau@example.com"
ET.SubElement(user1, "role").text = "admin"

user2 = ET.SubElement(root, "user", id="2")
ET.SubElement(user2, "name").text = "Jane"
ET.SubElement(user2, "email").text = "jane@example.com"
ET.SubElement(user2, "role").text = "developer"

# Convert to string
xml_string = ET.tostring(root, encoding="unicode", xml_declaration=True)
print(xml_string)

# ElementTree can also parse existing XML:
import xml.etree.ElementTree as ET

# Parse XML string
xml_data = """
<users>
    <user id="1">
        <name>Folau</name>
        <email>folau@example.com</email>
        <role>admin</role>
    </user>
    <user id="2">
        <name>Jane</name>
        <email>jane@example.com</email>
        <role>developer</role>
    </user>
</users>
"""

root = ET.fromstring(xml_data)

for user in root.findall("user"):
    user_id = user.get("id")
    name = user.find("name").text
    email = user.find("email").text
    role = user.find("role").text
    print(f"ID: {user_id}, Name: {name}, Email: {email}, Role: {role}")

# Output:
# ID: 1, Name: Folau, Email: folau@example.com, Role: admin
# ID: 2, Name: Jane, Email: jane@example.com, Role: developer

When to use XML: SOAP web services, configuration files for Java-based systems (Maven pom.xml, Android manifests), RSS/Atom feeds, and legacy integrations. For new Python projects, JSON or YAML are almost always better choices.

 


5. dataclasses and Serialization

Python’s dataclasses module (introduced in Python 3.7) provides a clean way to define data-holding classes. Combined with the dataclasses.asdict() function, they integrate well with JSON serialization.

import json
from dataclasses import dataclass, asdict, field
from typing import List, Optional

@dataclass
class Address:
    street: str
    city: str
    state: str
    zip_code: str

@dataclass
class Employee:
    name: str
    age: int
    department: str
    skills: List[str] = field(default_factory=list)
    address: Optional[Address] = None

    def to_json(self):
        """Serialize to JSON string."""
        return json.dumps(asdict(self), indent=2)

    @classmethod
    def from_json(cls, json_string):
        """Deserialize from JSON string."""
        data = json.loads(json_string)
        # Handle nested Address object
        if data.get("address"):
            data["address"] = Address(**data["address"])
        return cls(**data)

# Create and serialize
employee = Employee(
    name="Folau",
    age=30,
    department="Engineering",
    skills=["Python", "AWS", "Docker"],
    address=Address("123 Main St", "San Francisco", "CA", "94102")
)

json_output = employee.to_json()
print(json_output)

# Deserialize back
restored = Employee.from_json(json_output)
print(restored.name)              # Folau
print(restored.address.city)      # San Francisco
print(restored.skills)            # ['Python', 'AWS', 'Docker']

Why dataclasses for serialization?

  • Type hints serve as documentation for your data structure
  • asdict() provides automatic conversion to a dictionary (ready for json.dumps())
  • Default values, field factories, and frozen instances are built in
  • No external dependencies required
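One detail worth spelling out: asdict() recurses into nested dataclasses automatically, so nested objects become nested dicts with no extra code. A minimal sketch (the class names here are just for illustration):

```python
from dataclasses import dataclass, asdict

@dataclass
class City:
    name: str
    state: str

@dataclass
class Trip:
    traveler: str
    destination: City

trip = Trip("Folau", City("San Francisco", "CA"))
print(asdict(trip))
# {'traveler': 'Folau', 'destination': {'name': 'San Francisco', 'state': 'CA'}}
```

The reverse direction (dict back to nested dataclasses) is not automatic, which is why from_json-style constructors rebuild nested objects by hand.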

 


6. marshmallow — Schema-Based Serialization

For production applications that need validation, type coercion, and well-defined schemas, the marshmallow library is the gold standard. It separates your data model from your serialization logic, which keeps things clean as your application grows.

# Install first: pip install marshmallow
from marshmallow import Schema, fields, validate, post_load

class User:
    def __init__(self, name, email, age, role="viewer"):
        self.name = name
        self.email = email
        self.age = age
        self.role = role

    def __repr__(self):
        return f"User(name={self.name}, email={self.email}, role={self.role})"

class UserSchema(Schema):
    name = fields.Str(required=True, validate=validate.Length(min=1, max=100))
    email = fields.Email(required=True)
    age = fields.Int(required=True, validate=validate.Range(min=0, max=150))
    role = fields.Str(validate=validate.OneOf(["admin", "editor", "viewer"]))

    @post_load
    def make_user(self, data, **kwargs):
        return User(**data)

schema = UserSchema()

# Deserialize (load) — validates and creates object
user_data = {"name": "Folau", "email": "folau@example.com", "age": 30, "role": "admin"}
user = schema.load(user_data)
print(user)  # User(name=Folau, email=folau@example.com, role=admin)

# Serialize (dump) — converts object to dict
output = schema.dump(user)
print(output)  # {'name': 'Folau', 'email': 'folau@example.com', 'age': 30, 'role': 'admin'}

# Validation error example
try:
    bad_data = {"name": "", "email": "not-an-email", "age": -5}
    schema.load(bad_data)
except Exception as e:  # marshmallow raises its own ValidationError here
    print(f"Validation errors: {e}")

Key benefits of marshmallow:

  • Validation — Enforce constraints on incoming data
  • Type coercion — Automatically convert strings to integers, dates, etc.
  • Nested schemas — Handle complex, nested data structures
  • Partial loading — Allow updates with only some fields
  • Custom fields — Define your own field types and validators

 


7. Practical Examples

7.1 — REST API Data Processing

This is one of the most common real-world serialization tasks: fetching data from a REST API, processing it, and serializing the results.

import json
import urllib.request
from dataclasses import dataclass, asdict
from typing import List

@dataclass
class Todo:
    id: int
    title: str
    completed: bool
    user_id: int

    @classmethod
    def from_api_response(cls, data: dict) -> "Todo":
        """Create Todo from API response dict."""
        return cls(
            id=data["id"],
            title=data["title"],
            completed=data["completed"],
            user_id=data["userId"]
        )

def fetch_todos(limit: int = 10) -> List[Todo]:
    """Fetch todos from JSONPlaceholder API."""
    url = f"https://jsonplaceholder.typicode.com/todos?_limit={limit}"
    with urllib.request.urlopen(url) as response:
        data = json.loads(response.read().decode())
    return [Todo.from_api_response(item) for item in data]

def save_todos(todos: List[Todo], filepath: str):
    """Serialize todos to JSON file."""
    data = [asdict(todo) for todo in todos]
    with open(filepath, "w") as f:
        json.dump(data, f, indent=2)
    print(f"Saved {len(todos)} todos to {filepath}")

def load_todos(filepath: str) -> List[Todo]:
    """Deserialize todos from JSON file."""
    with open(filepath, "r") as f:
        data = json.load(f)
    return [Todo(**item) for item in data]

# Fetch from API, process, and save
todos = fetch_todos(limit=5)
completed = [t for t in todos if t.completed]
print(f"Completed: {len(completed)} / {len(todos)}")

save_todos(todos, "todos.json")
restored = load_todos("todos.json")
print(f"Loaded {len(restored)} todos from file")

7.2 — JSON-Based Config File Manager

import json
import os
from datetime import datetime

class ConfigManager:
    """Manage application configuration with JSON persistence."""

    def __init__(self, config_path="app_config.json"):
        self.config_path = config_path
        self.config = self._load_or_create()

    def _load_or_create(self):
        """Load existing config or create default."""
        if os.path.exists(self.config_path):
            with open(self.config_path, "r") as f:
                return json.load(f)
        return self._default_config()

    def _default_config(self):
        """Return default configuration."""
        return {
            "app_name": "MyApp",
            "version": "1.0.0",
            "database": {
                "host": "localhost",
                "port": 5432,
                "name": "myapp_db"
            },
            "logging": {
                "level": "INFO",
                "file": "app.log"
            },
            "last_modified": datetime.now().isoformat()
        }

    def get(self, key, default=None):
        """Get a config value using dot notation: 'database.host'."""
        keys = key.split(".")
        value = self.config
        for k in keys:
            if isinstance(value, dict) and k in value:
                value = value[k]
            else:
                return default
        return value

    def set(self, key, value):
        """Set a config value using dot notation."""
        keys = key.split(".")
        config = self.config
        for k in keys[:-1]:
            config = config.setdefault(k, {})
        config[keys[-1]] = value
        self.config["last_modified"] = datetime.now().isoformat()
        self._save()

    def _save(self):
        """Persist config to disk."""
        with open(self.config_path, "w") as f:
            json.dump(self.config, f, indent=2)

# Usage
config = ConfigManager("app_config.json")
print(config.get("database.host"))     # localhost
print(config.get("logging.level"))     # INFO

config.set("database.host", "db.production.com")
config.set("logging.level", "WARNING")
print(config.get("database.host"))     # db.production.com

7.3 — Data Export/Import System (CSV + JSON)

import json
import csv
import os

class DataExporter:
    """Export and import data between JSON and CSV formats."""

    @staticmethod
    def json_to_csv(json_path, csv_path):
        """Convert a JSON array of objects to CSV."""
        with open(json_path, "r") as f:
            data = json.load(f)

        if not data:
            print("No data to export")
            return

        # Use keys from first record as CSV headers
        headers = list(data[0].keys())

        with open(csv_path, "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=headers)
            writer.writeheader()
            writer.writerows(data)

        print(f"Exported {len(data)} records to {csv_path}")

    @staticmethod
    def csv_to_json(csv_path, json_path):
        """Convert CSV to JSON array of objects."""
        records = []
        with open(csv_path, "r") as f:
            reader = csv.DictReader(f)
            for row in reader:
                records.append(dict(row))

        with open(json_path, "w") as f:
            json.dump(records, f, indent=2)

        print(f"Imported {len(records)} records to {json_path}")

    @staticmethod
    def export_summary(data, output_path):
        """Export a summary report as JSON."""
        summary = {
            "total_records": len(data),
            "exported_at": __import__("datetime").datetime.now().isoformat(),
            "sample": data[:3] if len(data) >= 3 else data
        }
        with open(output_path, "w") as f:
            json.dump(summary, f, indent=2)
        print(f"Summary saved to {output_path}")

# Example usage
employees = [
    {"name": "Folau", "department": "Engineering", "salary": 95000},
    {"name": "Jane", "department": "Marketing", "salary": 85000},
    {"name": "Bob", "department": "Engineering", "salary": 90000},
]

# Save as JSON
with open("employees.json", "w") as f:
    json.dump(employees, f, indent=2)

# Convert JSON to CSV
exporter = DataExporter()
exporter.json_to_csv("employees.json", "employees.csv")
exporter.csv_to_json("employees.csv", "employees_restored.json")

7.4 — Caching Expensive Computations with pickle

import pickle
import os
import time
import hashlib
from functools import wraps

def pickle_cache(cache_dir=".cache"):
    """Decorator that caches function results using pickle."""

    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            # Create cache directory if needed
            os.makedirs(cache_dir, exist_ok=True)

            # Generate a unique cache key from function name and arguments
            key_data = f"{func.__name__}:{args}:{sorted(kwargs.items())}"
            cache_key = hashlib.md5(key_data.encode()).hexdigest()
            cache_path = os.path.join(cache_dir, f"{cache_key}.pkl")

            # Return cached result if available
            if os.path.exists(cache_path):
                print(f"Cache HIT for {func.__name__}")
                with open(cache_path, "rb") as f:
                    return pickle.load(f)

            # Compute and cache the result
            print(f"Cache MISS for {func.__name__} — computing...")
            result = func(*args, **kwargs)

            with open(cache_path, "wb") as f:
                pickle.dump(result, f)

            return result
        return wrapper
    return decorator

@pickle_cache()
def expensive_computation(n):
    """Simulate a slow computation."""
    time.sleep(2)  # Pretend this takes a long time
    return {i: i ** 3 for i in range(n)}

# First call: takes 2 seconds (cache MISS)
start = time.time()
result1 = expensive_computation(1000)
print(f"First call: {time.time() - start:.2f}s")

# Second call: instant (cache HIT)
start = time.time()
result2 = expensive_computation(1000)
print(f"Second call: {time.time() - start:.2f}s")

print(f"Results match: {result1 == result2}")

 


8. Common Pitfalls

8.1 — Security: pickle and Untrusted Data

This is the single most important pitfall. As demonstrated earlier, pickle.loads() can execute arbitrary code. Never use pickle to deserialize data from user input, external APIs, or any untrusted source. Use JSON instead.

8.2 — Encoding Issues

import json

# Problem: non-ASCII characters
data = {"city": "São Paulo", "greeting": "Hola, ¿cómo estás?"}

# Default behavior escapes non-ASCII
print(json.dumps(data))
# {"city": "S\u00e3o Paulo", "greeting": "Hola, \u00bfc\u00f3mo est\u00e1s?"}

# Fix: use ensure_ascii=False
print(json.dumps(data, ensure_ascii=False))
# {"city": "São Paulo", "greeting": "Hola, ¿cómo estás?"}

# When writing to files, always specify encoding
with open("data.json", "w", encoding="utf-8") as f:
    json.dump(data, f, ensure_ascii=False, indent=2)

8.3 — Circular References

import json

# This will raise ValueError: Circular reference detected
a = {}
b = {"ref": a}
a["ref"] = b

try:
    json.dumps(a)
except ValueError as e:
    print(f"Error: {e}")  # Circular reference detected

# Solution: break circular references before serializing
# or use a custom encoder that tracks visited objects
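Here is a sketch of that second option, an encoder that tracks the objects on the current path and swaps repeats for a placeholder (the "<circular>" marker is our own convention, not anything standard):

```python
import json

class CycleSafeEncoder(json.JSONEncoder):
    """Replaces any container already on the current path with a placeholder."""

    def encode(self, obj):
        return super().encode(self._strip_cycles(obj, set()))

    def _strip_cycles(self, obj, seen):
        if isinstance(obj, (dict, list)):
            if id(obj) in seen:
                return "<circular>"  # our own marker for a back-reference
            seen = seen | {id(obj)}  # copy, so sibling branches are unaffected
            if isinstance(obj, dict):
                return {k: self._strip_cycles(v, seen) for k, v in obj.items()}
            return [self._strip_cycles(v, seen) for v in obj]
        return obj

a = {}
a["self"] = a
print(json.dumps(a, cls=CycleSafeEncoder))
# {"self": "<circular>"}
```

Because `seen` tracks only the current path, an object that legitimately appears twice (a shared reference without a cycle) is still serialized in full both times.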

8.4 — datetime Handling

import json
from datetime import datetime

# Problem: datetime is not JSON-serializable
event = {"name": "Deploy", "timestamp": datetime.now()}

# Solution 1: Convert to ISO format string
event["timestamp"] = event["timestamp"].isoformat()
print(json.dumps(event))

# Solution 2: Use the default parameter
def default_handler(obj):
    if hasattr(obj, "isoformat"):
        return obj.isoformat()
    raise TypeError(f"Cannot serialize {type(obj)}")

event2 = {"name": "Deploy", "timestamp": datetime.now()}
print(json.dumps(event2, default=default_handler))

# Deserializing back to datetime
json_str = '{"name": "Deploy", "timestamp": "2024-01-15T10:30:00"}'
data = json.loads(json_str)
data["timestamp"] = datetime.fromisoformat(data["timestamp"])
print(type(data["timestamp"]))  # <class 'datetime.datetime'>

8.5 — JSON Keys Must Be Strings

import json

# Python allows non-string keys in dicts
data = {1: "one", 2: "two", (3, 4): "tuple_key"}

# JSON only allows string keys — this converts int keys to strings
result = json.dumps({1: "one", 2: "two"})
print(result)  # {"1": "one", "2": "two"}

parsed = json.loads(result)
print(parsed["1"])   # "one" — note the key is now a string!
# print(parsed[1])   # KeyError! The key is "1", not 1

# Tuple keys will raise TypeError
try:
    json.dumps(data)
except TypeError as e:
    print(f"Error: {e}")

 


9. Best Practices

After years of working with serialization in production systems, here are the practices that matter most:

  1. Use JSON for human-readable data exchange. It is the standard for APIs, configuration files that humans edit, and any data shared between different languages or systems.
  2. Use pickle only for Python-internal storage. Caching computation results, saving ML models, or storing session data between runs of the same Python application. Never expose pickle data to the outside world.
  3. Validate on deserialization. Never trust incoming data. Validate structure, types, and ranges after deserializing — whether from a file, API, or user input. Libraries like marshmallow and pydantic make this easy.
  4. Handle encoding explicitly. Always specify encoding="utf-8" when opening files, and use ensure_ascii=False if your data contains non-ASCII characters.
  5. Use yaml.safe_load(), never yaml.load() without a Loader. The full yaml.load() can execute arbitrary Python code, similar to pickle.
  6. Define clear serialization boundaries. Use to_dict() / from_dict() methods on your classes, or use schemas (marshmallow) to define exactly what gets serialized and how.
  7. Version your serialized formats. Include a version field in your serialized data so you can handle format changes gracefully over time.
  8. Handle missing fields gracefully. When deserializing, use .get() with defaults rather than direct key access. Data schemas evolve, and old serialized data may lack newer fields.
  9. Keep serialization logic separate from business logic. Do not scatter json.dumps() calls throughout your code. Centralize serialization in dedicated methods or schema classes.
  10. Use appropriate formats for the job. YAML for config files that humans edit. JSON for API communication. pickle for Python-internal caching. CSV for tabular data that needs spreadsheet compatibility. XML only when integrating with systems that require it.
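Practice 7 (versioning) is worth a concrete sketch. The envelope shape is a common convention; the field names and the version-1 migration below are hypothetical, just to show the moving parts:

```python
import json

SCHEMA_VERSION = 2

def save_state(state, path):
    # Wrap the payload in an envelope so future readers know how to parse it
    with open(path, "w", encoding="utf-8") as f:
        json.dump({"version": SCHEMA_VERSION, "data": state}, f, indent=2)

def load_state(path):
    with open(path, "r", encoding="utf-8") as f:
        doc = json.load(f)
    version = doc.get("version", 1)  # files without a version are treated as v1
    data = doc.get("data", doc)      # hypothetical: v1 stored the payload bare
    if version < 2 and "username" in data:
        data["name"] = data.pop("username")  # hypothetical v1 -> v2 rename
    return data

# A legacy version-1 file still loads correctly
with open("state_v1.json", "w", encoding="utf-8") as f:
    json.dump({"username": "Folau"}, f)
print(load_state("state_v1.json"))  # {'name': 'Folau'}
```

Each time the format changes, bump SCHEMA_VERSION and add a migration branch; old files keep loading without a manual conversion step.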

 


10. Key Takeaways

  • Serialization converts Python objects to a storable/transmittable format; deserialization reverses the process.
  • JSON (json module) is the go-to format for APIs and human-readable data. Use dumps/loads for strings, dump/load for files.
  • pickle handles any Python object but produces binary, Python-only output. Never unpickle untrusted data.
  • YAML (PyYAML) excels at configuration files. Always use safe_load().
  • XML (ElementTree) is for enterprise/legacy integrations.
  • dataclasses + asdict() provide a clean, zero-dependency path from Python objects to JSON.
  • marshmallow adds validation and schema enforcement for production applications.
  • Handle datetime, encoding, and non-string keys explicitly — they are the most common sources of serialization bugs.
  • Always validate deserialized data. Never trust the source blindly.

 



