Research: Error Handling & Retry Best Practices¶

Date: 2025-11-09 Researcher: Phase 5 Unit 1 Status: Complete Related: ADR-027

Overview¶

This research document explores best practices for error handling and retry strategies in distributed systems, with focus on API integrations, network failures, and rate limiting. The goal is to design a robust error handling system for Inkwell that gracefully handles transient failures while providing excellent user experience.

1. Retry Strategies Overview¶

Why Retry?¶

Distributed systems experience transient failures that resolve themselves: - Network timeouts - Temporary service unavailability - Rate limiting - Intermittent connectivity issues

Goal: Retry automatically for transient errors, fail fast for permanent errors.

Common Retry Patterns¶

1. Fixed Delay¶

# Wait 1 second between each retry
retry(wait=wait_fixed(1))

Pros: Simple, predictable Cons: Inefficient, can overwhelm recovering services

2. Linear Backoff¶

# Wait 1s, 2s, 3s, 4s...
retry(wait=wait_incrementing(start=1, increment=1))

Pros: Gives service time to recover Cons: Still creates predictable waves

3. Exponential Backoff¶

# Wait 1s, 2s, 4s, 8s, 16s...
retry(wait=wait_exponential(multiplier=1, min=1, max=60))

Pros: Quickly backs off, industry standard Cons: Multiple clients retry simultaneously (thundering herd)

4. Exponential Backoff with Jitter (BEST)¶

# Wait random(1-2s), random(2-4s), random(4-8s)...
retry(wait=wait_random_exponential(multiplier=1, max=60))

Pros: Spreads load, prevents thundering herd Cons: Slightly more complex

2. Exponential Backoff with Jitter (Deep Dive)¶

The Thundering Herd Problem¶

Scenario: 1000 clients make a request. Service fails. All retry at exactly 1s, 2s, 4s...

Result: - 0s: 1000 requests → service crashes - 1s: 1000 requests → service still down - 2s: 1000 requests → service still recovering - 4s: 1000 requests → service might be up but immediately crashes again

Problem: Synchronized retries create waves that prevent service recovery.

Solution: Jitter (Randomization)¶

With jitter: Each client waits a random time within the backoff window: - Client 1: waits 1.2s - Client 2: waits 1.7s - Client 3: waits 0.8s - Client 4: waits 1.5s

Result: Requests spread out over time, allowing service to recover gracefully.

Jitter Strategies¶

Full Jitter (AWS Recommendation)¶

wait_time = random(0, min(cap, base * 2^attempt))

Example: Attempt 3, base=1s, cap=60s - Max wait: min(60, 1 * 2^3) = 8s - Actual wait: random(0, 8s)

Pros: Maximum spread Cons: Might retry too quickly

Equal Jitter (Balanced)¶

temp = min(cap, base * 2^attempt)
wait_time = temp/2 + random(0, temp/2)

Example: Attempt 3, base=1s, cap=60s - Max wait: 8s - Actual wait: 4s + random(0, 4s) = 4-8s

Pros: Guaranteed minimum wait, good spread Cons: Slightly less optimal than full jitter

Decorrelated Jitter (AWS Advanced)¶

wait_time = min(cap, random(base, previous_wait * 3))

Pros: Even better spread, adapts to system behavior Cons: More complex

Recommendation: Use equal jitter for balance of simplicity and effectiveness.

3. Python Retry Libraries¶

Tenacity (Recommended)¶

Repository: https://github.com/jd/tenacity Stars: ~6.5k Maintainance: Active (2025)

Why Tenacity?¶

Comprehensive - Supports all retry strategies
Flexible - Decorator or imperative API
Async-friendly - Works with asyncio
Well-tested - Production-proven
Readable - Clean, declarative syntax

Basic Usage¶

from tenacity import retry, wait_exponential, stop_after_attempt

@retry(
    wait=wait_exponential(multiplier=1, min=1, max=10),
    stop=stop_after_attempt(3)
)
async def fetch_data():
    # Your code here
    pass

Advanced Usage¶

from tenacity import (
    retry,
    wait_random_exponential,
    stop_after_attempt,
    retry_if_exception_type,
    before_sleep_log,
)
import logging

logger = logging.getLogger(__name__)

@retry(
    # Retry strategy
    wait=wait_random_exponential(multiplier=1, max=60),

    # Stop condition
    stop=stop_after_attempt(3),

    # Only retry on specific exceptions
    retry=retry_if_exception_type((TimeoutError, ConnectionError)),

    # Logging
    before_sleep=before_sleep_log(logger, logging.WARNING),

    # Reraise final exception
    reraise=True
)
async def api_call():
    pass

Alternative: backoff¶

Repository: https://github.com/litl/backoff Stars: ~2k

Pros: Simpler API, decorator-focused Cons: Less flexible than Tenacity

import backoff

@backoff.on_exception(
    backoff.expo,
    (TimeoutError, ConnectionError),
    max_tries=3,
    jitter=backoff.full_jitter
)
async def api_call():
    pass

Recommendation: Use Tenacity for more control and flexibility.

4. Error Classification¶

Errors to Retry (Transient)¶

Status Code	Description	Retry?	Wait
408	Request Timeout	✅ Yes	Exponential
429	Rate Limit	✅ Yes	Respect `Retry-After` header
500	Internal Server Error	✅ Yes	Exponential
502	Bad Gateway	✅ Yes	Exponential
503	Service Unavailable	✅ Yes	Exponential
504	Gateway Timeout	✅ Yes	Exponential

Errors NOT to Retry (Permanent)¶

Status Code	Description	Retry?	Action
400	Bad Request	❌ No	Fix request
401	Unauthorized	❌ No	Check API key
403	Forbidden	❌ No	Check permissions
404	Not Found	❌ No	Check URL
422	Unprocessable Entity	❌ No	Fix input data

Network Errors¶

Error Type	Retry?	Wait
`ConnectionError`	✅ Yes	Exponential
`TimeoutError`	✅ Yes	Exponential
`NetworkError`	✅ Yes	Exponential
`SSLError`	❌ No	Fix SSL config
`DNSError`	⚠️ Maybe	Check network

5. Best Practices¶

1. Always Set Stop Conditions¶

❌ Bad:

@retry(wait=wait_exponential())
def api_call():
    pass  # Retries forever!

✅ Good:

@retry(
    wait=wait_exponential(max=60),
    stop=stop_after_attempt(3)  # Max 3 attempts
)
def api_call():
    pass

2. Respect `Retry-After` Header¶

Many APIs include a Retry-After header indicating when to retry:

import httpx
from tenacity import retry, wait_chain, wait_fixed
from datetime import datetime, timedelta

def get_retry_after_wait(response):
    """Extract wait time from Retry-After header."""
    if "Retry-After" in response.headers:
        retry_after = response.headers["Retry-After"]

        # Could be seconds (integer)
        try:
            return int(retry_after)
        except ValueError:
            pass

        # Or HTTP-date
        try:
            retry_date = datetime.strptime(retry_after, "%a, %d %b %Y %H:%M:%S GMT")
            return max(0, (retry_date - datetime.utcnow()).total_seconds())
        except ValueError:
            pass

    return None

@retry(wait=wait_exponential(max=60), stop=stop_after_attempt(3))
async def api_call_with_retry_after():
    response = await httpx.get("https://api.example.com/")

    if response.status_code == 429:
        wait_time = get_retry_after_wait(response)
        if wait_time:
            await asyncio.sleep(wait_time)

    response.raise_for_status()
    return response.json()

3. Log Retry Attempts¶

Users want to know what's happening:

import logging
from tenacity import retry, before_sleep_log, after_log

logger = logging.getLogger(__name__)

@retry(
    wait=wait_random_exponential(max=60),
    stop=stop_after_attempt(3),
    before_sleep=before_sleep_log(logger, logging.WARNING),
    after=after_log(logger, logging.INFO)
)
async def api_call():
    pass

4. Provide User Feedback¶

For CLI tools, show retry progress:

from rich.console import Console

console = Console()

async def api_call_with_feedback():
    for attempt in range(1, 4):
        try:
            return await api_call()
        except TransientError as e:
            if attempt < 3:
                wait_time = 2 ** attempt
                console.print(f"[yellow]Retry {attempt}/3 - waiting {wait_time}s...[/yellow]")
                await asyncio.sleep(wait_time)
            else:
                console.print(f"[red]Failed after 3 attempts[/red]")
                raise

5. Use Token Bucket for Rate Limiting¶

Prevent overwhelming your own rate limits:

import asyncio
from collections import deque
from time import time

class TokenBucket:
    """Rate limiter using token bucket algorithm."""

    def __init__(self, rate: float, capacity: int):
        self.rate = rate  # tokens per second
        self.capacity = capacity
        self.tokens = capacity
        self.last_update = time()

    async def acquire(self):
        """Wait until a token is available."""
        while True:
            now = time()
            elapsed = now - self.last_update
            self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
            self.last_update = now

            if self.tokens >= 1:
                self.tokens -= 1
                return

            # Wait until next token
            wait_time = (1 - self.tokens) / self.rate
            await asyncio.sleep(wait_time)

# Usage
bucket = TokenBucket(rate=10, capacity=20)  # 10 req/s, burst of 20

@retry(wait=wait_exponential(max=60), stop=stop_after_attempt(3))
async def api_call():
    await bucket.acquire()
    # Make API call
    pass

6. Handle Idempotency¶

Idempotent operation: Can be safely retried without side effects.

Examples: - ✅ GET requests (reading data) - ✅ PUT requests (updating to specific state) - ❌ POST requests (creating resources) - ❌ Payments, emails, notifications

Solution for non-idempotent operations:

import uuid

def create_resource_with_idempotency(data):
    # Generate idempotency key
    idempotency_key = str(uuid.uuid4())

    response = api.post(
        "/resources",
        json=data,
        headers={"Idempotency-Key": idempotency_key}
    )

    return response

# If retry occurs, server recognizes idempotency key
# and returns the original response instead of creating duplicate

7. Circuit Breaker Pattern¶

For repeated failures, stop retrying temporarily:

from datetime import datetime, timedelta

class CircuitBreaker:
    """Prevents repeated calls to failing service."""

    def __init__(self, failure_threshold: int = 5, timeout: int = 60):
        self.failure_threshold = failure_threshold
        self.timeout = timedelta(seconds=timeout)
        self.failures = 0
        self.last_failure_time = None
        self.state = "closed"  # closed, open, half-open

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            # Check if timeout has passed
            if datetime.now() - self.last_failure_time > self.timeout:
                self.state = "half-open"
            else:
                raise Exception("Circuit breaker is OPEN")

        try:
            result = func(*args, **kwargs)
            self.on_success()
            return result
        except Exception as e:
            self.on_failure()
            raise

    def on_success(self):
        """Reset on successful call."""
        self.failures = 0
        self.state = "closed"

    def on_failure(self):
        """Increment failures and open if threshold reached."""
        self.failures += 1
        self.last_failure_time = datetime.now()

        if self.failures >= self.failure_threshold:
            self.state = "open"

# Usage
breaker = CircuitBreaker(failure_threshold=5, timeout=60)

def protected_api_call():
    return breaker.call(api_call)

6. Error Messages Best Practices¶

Helpful Error Messages¶