ADR-027: Retry and Error Handling Strategy¶
Date: 2025-11-09 Status: Accepted Context: Phase 5 Unit 1 - Research & Architecture Related: Research: Error Handling Best Practices
Context¶
Inkwell makes multiple external API calls that can fail due to: - Network timeouts and connectivity issues - API rate limiting (429 responses) - Temporary service unavailability (500, 502, 503, 504) - Intermittent failures
Current implementation has no retry logic, meaning transient failures require users to manually re-run commands. This results in poor user experience and wasted time.
We need a robust error handling system that: 1. Automatically retries transient failures 2. Fails fast on permanent errors 3. Provides helpful error messages 4. Shows progress during retries 5. Respects API rate limits
Research Summary¶
See Research: Error Handling Best Practices for detailed findings.
Key findings:
1. Exponential backoff with jitter is industry standard
2. Tenacity is the best Python retry library
3. Must classify errors: retry transient, fail on permanent
4. User feedback during retries improves experience
5. Respect Retry-After headers for rate limiting
Decision¶
We will implement a comprehensive retry and error handling system with these components:
1. Retry Strategy: Exponential Backoff with Equal Jitter¶
Implementation:
from tenacity import (
retry,
wait_random_exponential,
stop_after_attempt,
retry_if_exception_type
)
@retry(
wait=wait_random_exponential(multiplier=1, min=1, max=60),
stop=stop_after_attempt(3),
retry=retry_if_exception_type((ConnectionError, TimeoutError, RateLimitError)),
reraise=True
)
async def api_call():
# API logic
pass
Parameters: - Multiplier: 1 second base - Min wait: 1 second - Max wait: 10 seconds (reduced from 60s for better UX - see TODO #004) - Max attempts: 3 (configurable via config) - Jitter: ±25% jitter (base ± random(0.25 * base))
Wait times: - Attempt 1: 0s (initial) - Attempt 2: 1s ± 25% = 0.75-1.25s - Attempt 3: 2s ± 25% = 1.5-2.5s - Attempt 4: 4s ± 25% = 3-5s - Total worst case: ~9s (vs 150s with old config)
Rationale: ±25% jitter is industry standard (AWS, Google) and prevents thundering herd. 10s max wait provides fast user experience while allowing transient issues to resolve.
2. Error Classification¶
Errors to Retry (Transient):
| Error Type | HTTP Status | Retry? | Max Attempts |
|---|---|---|---|
| Network timeout | - | ✅ Yes | 3 |
| Connection error | - | ✅ Yes | 3 |
| Request timeout | 408 | ✅ Yes | 3 |
| Rate limit | 429 | ✅ Yes | 3 (with Retry-After) |
| Internal server error | 500 | ✅ Yes | 3 |
| Bad gateway | 502 | ✅ Yes | 3 |
| Service unavailable | 503 | ✅ Yes | 3 |
| Gateway timeout | 504 | ✅ Yes | 3 |
Errors NOT to Retry (Permanent):
| Error Type | HTTP Status | Retry? | User Action |
|---|---|---|---|
| Bad request | 400 | ❌ No | Fix input |
| Unauthorized | 401 | ❌ No | Check API key |
| Forbidden | 403 | ❌ No | Check permissions |
| Not found | 404 | ❌ No | Check URL |
| Unprocessable entity | 422 | ❌ No | Fix data format |
3. Retry Utility Module¶
Location: src/inkwell/utils/retry.py
"""Retry utilities with exponential backoff."""
import asyncio
import logging
from typing import Callable, TypeVar, Optional
from functools import wraps
from tenacity import (
retry,
wait_random_exponential,
stop_after_attempt,
retry_if_exception_type,
before_sleep_log,
RetryCallState
)
from rich.console import Console
from inkwell.utils.errors import (
TransientError,
RateLimitError,
NetworkError
)
logger = logging.getLogger(__name__)
console = Console()
T = TypeVar('T')
def with_retry(
max_attempts: int = 3,
max_wait: int = 60,
retry_on: tuple = (ConnectionError, TimeoutError, TransientError),
show_progress: bool = True
):
"""
Decorator for retrying async functions with exponential backoff.
Args:
max_attempts: Maximum number of retry attempts (default: 3)
max_wait: Maximum wait time between retries in seconds (default: 60)
retry_on: Tuple of exception types to retry on
show_progress: Show retry progress in console (default: True)
Example:
@with_retry(max_attempts=3, retry_on=(ConnectionError, TimeoutError))
async def fetch_data():
return await api_call()
"""
def before_sleep_callback(retry_state: RetryCallState):
"""Log retry attempts and show user feedback."""
attempt = retry_state.attempt_number
exception = retry_state.outcome.exception()
wait_time = retry_state.next_action.sleep
logger.warning(
f"Retry attempt {attempt}/{max_attempts} after {exception.__class__.__name__}: "
f"{str(exception)[:100]}. Waiting {wait_time:.1f}s..."
)
if show_progress:
console.print(
f"[yellow]⚠ Retry {attempt}/{max_attempts} - waiting {wait_time:.1f}s...[/yellow]"
)
def decorator(func: Callable[..., T]) -> Callable[..., T]:
@wraps(func)
@retry(
wait=wait_random_exponential(multiplier=1, min=1, max=max_wait),
stop=stop_after_attempt(max_attempts),
retry=retry_if_exception_type(retry_on),
before_sleep=before_sleep_callback,
reraise=True
)
async def wrapper(*args, **kwargs) -> T:
return await func(*args, **kwargs)
return wrapper
return decorator
async def retry_with_rate_limit(
func: Callable,
*args,
max_attempts: int = 3,
**kwargs
) -> T:
"""
Retry function with respect for Retry-After header on 429 responses.
This function handles rate limiting specially by respecting the
Retry-After header when present.
Args:
func: Async function to call
*args: Positional arguments for func
max_attempts: Maximum retry attempts
**kwargs: Keyword arguments for func
Returns:
Result from func
Raises:
RateLimitError: If rate limit exceeded after all retries
"""
for attempt in range(1, max_attempts + 1):
try:
return await func(*args, **kwargs)
except RateLimitError as e:
if attempt >= max_attempts:
raise
# Use Retry-After header if available
wait_time = e.retry_after if e.retry_after else (2 ** attempt)
logger.warning(f"Rate limited. Waiting {wait_time}s before retry {attempt}/{max_attempts}")
console.print(f"[yellow]⚠ Rate limited - waiting {wait_time}s...[/yellow]")
await asyncio.sleep(wait_time)
raise RateLimitError("Max retries exceeded")
4. Enhanced Error Classes¶
Location: src/inkwell/utils/errors.py (enhance existing)
"""Enhanced error classes with helpful messages."""
from typing import Optional, Dict, Any
class InkwellError(Exception):
"""Base exception for all Inkwell errors."""
def __init__(
self,
message: str,
details: Optional[Dict[str, Any]] = None,
suggestion: Optional[str] = None,
docs_url: Optional[str] = None
):
self.message = message
self.details = details or {}
self.suggestion = suggestion
self.docs_url = docs_url
super().__init__(self.message)
def __str__(self):
"""Format error message for display."""
parts = [f"❌ {self.message}"]
if self.details:
parts.append(f"\nDetails: {self.details}")
if self.suggestion:
parts.append(f"\n💡 Suggestion: {self.suggestion}")
if self.docs_url:
parts.append(f"\n📖 Learn more: {self.docs_url}")
return "\n".join(parts)
class TransientError(InkwellError):
"""Transient error that should be retried."""
pass
class NetworkError(TransientError):
"""Network-related error."""
def __init__(self, operation: str, original_error: Exception):
super().__init__(
message=f"Network error during {operation}",
details={"original_error": str(original_error)},
suggestion="Check your internet connection and try again"
)
class RateLimitError(TransientError):
"""API rate limit exceeded."""
def __init__(
self,
provider: str,
retry_after: Optional[int] = None
):
if retry_after:
suggestion = f"Wait {retry_after} seconds before retrying"
else:
suggestion = "Wait a few minutes before retrying or check your API usage"
super().__init__(
message=f"{provider} API rate limit exceeded",
details={"retry_after": retry_after},
suggestion=suggestion,
docs_url="https://docs.inkwell.cli/troubleshooting/rate-limits"
)
self.retry_after = retry_after
class APIKeyError(InkwellError):
"""API key missing or invalid."""
def __init__(self, provider: str):
super().__init__(
message=f"Missing or invalid API key for {provider}",
suggestion=f"Set your API key with: inkwell config set {provider.lower()}_api_key YOUR_KEY",
docs_url="https://docs.inkwell.cli/setup/api-keys"
)
class TranscriptionError(InkwellError):
"""Transcription failed."""
pass
class ExtractionError(InkwellError):
"""Extraction failed."""
pass
class InterviewError(InkwellError):
"""Interview mode error."""
pass
5. Application Points¶
Apply retry logic to these components:
TranscriptionManager¶
# src/inkwell/transcription/gemini.py
from inkwell.utils.retry import with_retry
from inkwell.utils.errors import TranscriptionError, RateLimitError
class GeminiTranscriber:
@with_retry(
max_attempts=3,
retry_on=(ConnectionError, TimeoutError, RateLimitError)
)
async def transcribe(self, audio_path: Path) -> Transcript:
try:
# Transcription logic
pass
except Exception as e:
if "rate limit" in str(e).lower():
raise RateLimitError("Gemini")
raise TranscriptionError(f"Gemini transcription failed: {e}")
ExtractionEngine¶
# src/inkwell/extraction/engine.py
from inkwell.utils.retry import with_retry
from inkwell.utils.errors import ExtractionError, RateLimitError
class ExtractionEngine:
@with_retry(
max_attempts=3,
retry_on=(ConnectionError, TimeoutError, RateLimitError)
)
async def extract(self, template: Template, transcript: str) -> ExtractionResult:
try:
# Extraction logic
pass
except Exception as e:
if "rate limit" in str(e).lower():
raise RateLimitError(self.provider)
raise ExtractionError(f"Extraction failed: {e}")
InterviewAgent¶
# src/inkwell/interview/agent.py
from inkwell.utils.retry import with_retry
from inkwell.utils.errors import InterviewError, RateLimitError
class InterviewAgent:
@with_retry(
max_attempts=3,
retry_on=(ConnectionError, TimeoutError, RateLimitError)
)
async def generate_question(self, context: InterviewContext) -> Question:
try:
# Question generation logic
pass
except Exception as e:
if "rate limit" in str(e).lower():
raise RateLimitError("Claude")
raise InterviewError(f"Question generation failed: {e}")
FeedParser¶
# src/inkwell/feeds/parser.py
from inkwell.utils.retry import with_retry
from inkwell.utils.errors import FeedError
class FeedParser:
@with_retry(
max_attempts=3,
retry_on=(ConnectionError, TimeoutError)
)
async def fetch_feed(self, url: str) -> Feed:
try:
# RSS parsing logic
pass
except Exception as e:
raise FeedError(f"Failed to fetch feed: {e}")
6. Configuration¶
Allow users to configure retry behavior:
# ~/.config/inkwell/config.yaml
retry:
enabled: true
max_attempts: 3 # Maximum retry attempts
max_wait: 10 # Maximum wait time between retries (seconds)
show_progress: true # Show retry progress in terminal
logging:
log_level: INFO # DEBUG to see all retry attempts
log_retries: true # Log retry attempts
Implementation Plan¶
Phase 5 Unit 6: Error Handling & Retry Logic (1 day)¶
-
Install Tenacity (15 min)
-
Create retry utility module (2 hours)
- Implement
with_retrydecorator - Implement
retry_with_rate_limitfunction -
Add progress display with Rich
-
Enhance error classes (1 hour)
- Add helpful error messages
- Add suggestions and docs URLs
-
Categorize transient vs permanent errors
-
Apply to existing code (3 hours)
- TranscriptionManager (Gemini API)
- ExtractionEngine (Claude/Gemini APIs)
- InterviewAgent (Claude Agent SDK)
-
FeedParser (RSS fetch)
-
Testing (2 hours)
- Unit tests for retry logic
- Mock transient failures
- Test backoff timing
- Test error messages
Consequences¶
Positive¶
- Better UX - Automatic retry on transient failures saves user time
- Resilience - Handles network issues and API hiccups gracefully
- Visibility - Users see retry progress, understand what's happening
- Helpful errors - Clear messages with actionable suggestions
- Rate limit respect - Honors
Retry-Afterheaders - Configurable - Users can adjust retry behavior
Negative¶
- Complexity - More code to maintain and test
- Longer waits - Failed operations take longer (up to ~9s with 3 retries)
- Dependency - Adds Tenacity library dependency
- Logging overhead - More log messages (can be noisy in DEBUG mode)
Risks & Mitigations¶
| Risk | Mitigation |
|---|---|
| Users frustrated by long waits | Show clear progress, allow Ctrl-C to cancel |
| Retrying non-idempotent operations | Only retry safe operations (reads, idempotent writes) |
| Overwhelming recovering service | Use jitter, respect Retry-After headers |
| Masking permanent issues | Fail after max attempts, log all errors |
| Cost increase from retries | Retry only transient errors, cache results |
Alternatives Considered¶
Alternative 1: Manual Retry (Status Quo)¶
Pros: Simple, no code changes Cons: Poor UX, users must manually retry Decision: Rejected - automatic retry is essential
Alternative 2: backoff Library¶
Pros: Simpler API Cons: Less flexible than Tenacity, weaker async support Decision: Rejected - Tenacity is more powerful
Alternative 3: Custom Implementation¶
Pros: No dependency, full control Cons: Reinventing wheel, prone to bugs Decision: Rejected - Tenacity is battle-tested
Alternative 4: Circuit Breaker Pattern¶
Pros: Prevents cascading failures Cons: Overkill for CLI tool, complex Decision: Deferred to future if needed
Success Metrics¶
- ✅ 95%+ of transient failures succeed after retry
- ✅ Average retry count: <2 attempts
- ✅ No user complaints about "network errors"
- ✅ Error messages rated "helpful" in user testing
- ✅ Retry progress visible in terminal
- ✅ No retries on permanent errors (400, 401, 403, 404)
Testing Strategy¶
Unit Tests¶
@pytest.mark.asyncio
async def test_retry_on_transient_error():
"""Test that function retries on transient errors."""
mock_api = AsyncMock()
mock_api.side_effect = [
ConnectionError("Network error"),
ConnectionError("Network error"),
{"result": "success"}
]
@with_retry(max_attempts=3, show_progress=False)
async def api_call():
return await mock_api()
result = await api_call()
assert result == {"result": "success"}
assert mock_api.call_count == 3
@pytest.mark.asyncio
async def test_no_retry_on_permanent_error():
"""Test that function doesn't retry on permanent errors."""
mock_api = AsyncMock()
mock_api.side_effect = ValueError("Bad input")
@with_retry(
max_attempts=3,
retry_on=(ConnectionError,),
show_progress=False
)
async def api_call():
return await mock_api()
with pytest.raises(ValueError):
await api_call()
assert mock_api.call_count == 1 # No retries
@pytest.mark.asyncio
async def test_exponential_backoff_timing():
"""Test that backoff timing increases exponentially."""
times = []
@with_retry(max_attempts=3, show_progress=False)
async def failing_call():
times.append(asyncio.get_event_loop().time())
raise ConnectionError("Fail")
with pytest.raises(ConnectionError):
await failing_call()
# Verify increasing wait times
assert len(times) == 3
# Wait between attempt 1 and 2 should be ~1-2s
# Wait between attempt 2 and 3 should be ~2-4s
Integration Tests¶
- Test with real API failures (mock server returning 503)
- Test rate limiting (mock server returning 429)
- Test network timeout scenarios
Implementation Checklist¶
- Add tenacity dependency
- Create
src/inkwell/utils/retry.py - Enhance
src/inkwell/utils/errors.py - Apply retry to TranscriptionManager
- Apply retry to ExtractionEngine
- Apply retry to InterviewAgent
- Apply retry to FeedParser
- Add retry configuration to config schema
- Write unit tests (>90% coverage)
- Write integration tests
- Update user documentation
- Test with real API failures
References¶
- Research: Error Handling Best Practices
- AWS Builders Library - Backoff with Jitter
- Tenacity Documentation
- Google Cloud - Exponential Backoff
Decision Made By: Phase 5 Team Status: Accepted Next Review: After Phase 5 Unit 6 completion