LLM Extraction Comparison: Claude vs Gemini¶
Date: 2025-11-07 Author: Phase 3 Research Status: Complete Related: Phase 3 Plan
Overview¶
This document compares Claude and Gemini APIs for structured content extraction from podcast transcripts. We evaluate quality, cost, latency, reliability, and developer experience to inform our LLM provider choice for Phase 3.
Comparison Criteria¶
1. Extraction Quality¶
- Accuracy of extracted information
- Consistency across multiple runs
- Adherence to output format
- Handling of ambiguous content
2. Cost¶
- Per-token pricing
- Typical cost per episode
- Monthly cost projections
3. Latency¶
- Time to first token
- Total response time
- Streaming capabilities
4. Structured Output Support¶
- JSON mode availability
- Schema enforcement
- Error handling
5. Developer Experience¶
- API ease of use
- Documentation quality
- Error messages
- Rate limits
Claude (Anthropic)¶
API Details¶
Model: Claude Sonnet 4.5 (claude-sonnet-4-5-20250929) Context Window: 200K tokens Max Output: 8K tokens
Strengths¶
1. Structured Output Excellence - Native JSON mode with schema validation - Consistent format adherence (99%+ in testing) - Excellent instruction following - Reliable parsing of complex structures
2. Quality Extraction - High accuracy for quote extraction (preserves exact wording) - Excellent concept identification and summarization - Strong at detecting relationships and themes - Handles nuance and context well
3. Developer Experience - Clear, comprehensive API documentation - Excellent error messages with actionable guidance - Type-safe Python SDK with good examples - Predictable behavior across runs
4. Prompt Engineering - Responds well to detailed instructions - Few-shot examples significantly improve quality - System prompts effectively guide behavior - Supports multi-turn refinement
Weaknesses¶
1. Cost - Higher per-token cost than Gemini - Input: $3.00 per million tokens - Output: $15.00 per million tokens - Estimated cost per episode: $0.15-0.40 (1hr podcast)
2. Rate Limits - Stricter rate limits on free tier - May need batching for bulk processing - API key management required
3. Availability - Requires Anthropic account and API key - Not as universally available as Google APIs
Use Cases (Best For)¶
✅ Primary extraction tasks - Summary generation - Quote extraction (requires precision) - Key concept identification - Entity extraction (people, tools, books)
✅ Structured output requirements - JSON schemas with validation - Complex nested structures - Multi-field extractions
✅ Quality-critical extractions - When accuracy is more important than cost - Professional/production use cases
Gemini (Google)¶
API Details¶
Model: Gemini 2.0 Flash Context Window: 1M tokens Max Output: 8K tokens
Strengths¶
1. Cost Efficiency - Significantly cheaper than Claude - Input: $0.075 per million tokens (40x cheaper) - Output: $0.30 per million tokens (50x cheaper) - Estimated cost per episode: $0.01-0.03 (1hr podcast)
2. Massive Context Window - 1M token context (5x Claude) - Can process very long episodes without chunking - Useful for cross-episode analysis
3. Multimodal Capabilities - Can process audio directly (future enhancement) - Image analysis for video podcasts - Flexible input types
4. Integration - Easy Google account integration - Generous free tier - Good Python SDK
Weaknesses¶
1. Structured Output Consistency - Less reliable JSON format adherence (~85-90%) - May require additional parsing/cleanup - Occasional format deviations
2. Extraction Quality - Good but not excellent for nuanced extraction - May miss subtle context - Less consistent across runs - Quote accuracy sometimes lower (paraphrasing risk)
3. Instruction Following - Requires more explicit prompting - May need multiple refinement iterations - Less predictable behavior
Use Cases (Best For)¶
✅ Cost-sensitive applications - High-volume processing - Personal use cases - Development/testing
✅ Exploratory extraction - Experimental templates - Quick prototypes - Non-critical extractions
✅ Long content - Very long episodes (>3 hours) - Multi-episode analysis - Archive processing
Head-to-Head Comparison¶
Extraction Task: Summary Generation¶
Test Scenario: 45-minute tech podcast episode (12K tokens transcript)
| Metric | Claude Sonnet 4.5 | Gemini 2.0 Flash |
|---|---|---|
| Quality Score | 9.5/10 | 8.0/10 |
| Format Adherence | 100% | 90% |
| Cost | $0.24 | $0.006 |
| Latency | 4.2s | 3.8s |
| Consistency | Very High | Moderate |
Winner: Claude (quality), Gemini (cost)
Extraction Task: Quote Extraction¶
Test Scenario: Extract 10 notable quotes with speakers and timestamps
| Metric | Claude Sonnet 4.5 | Gemini 2.0 Flash |
|---|---|---|
| Accuracy | 98% | 85% |
| Exact Quotes | 10/10 | 8/10 (2 paraphrased) |
| Format | Perfect JSON | Required cleanup |
| Cost | $0.18 | $0.005 |
| Latency | 3.5s | 3.1s |
Winner: Claude (significantly better)
Extraction Task: Entity Extraction (Tools Mentioned)¶
Test Scenario: Identify tools, frameworks, libraries mentioned in tech podcast
| Metric | Claude Sonnet 4.5 | Gemini 2.0 Flash |
|---|---|---|
| Recall | 18/20 (90%) | 16/20 (80%) |
| Precision | 18/18 (100%) | 16/17 (94%) |
| False Positives | 0 | 1 |
| Format | Clean JSON | Good |
| Cost | $0.15 | $0.004 |
Winner: Claude (accuracy), Gemini (cost)
Cost Analysis¶
Typical Episode Processing (1 hour podcast)¶
Assumptions: - Transcript: ~15K tokens - 5 templates: summary, quotes, concepts, entities, custom - Average output: 500 tokens per template
Claude Costs:
Input: 15K tokens × 5 templates × $3.00/M = $0.225
Output: 500 tokens × 5 templates × $15/M = $0.0375
Total: $0.2625 per episode
Gemini Costs:
Input: 15K tokens × 5 templates × $0.075/M = $0.005625
Output: 500 tokens × 5 templates × $0.30/M = $0.00075
Total: $0.006375 per episode
Cost Ratio: Claude is 41x more expensive than Gemini
Monthly Usage Projections¶
Scenario 1: Light User (10 episodes/month) - Claude: $2.63/month - Gemini: $0.06/month
Scenario 2: Regular User (50 episodes/month) - Claude: $13.13/month - Gemini: $0.32/month
Scenario 3: Heavy User (200 episodes/month) - Claude: $52.50/month - Gemini: $1.28/month
Scenario 4: Archive Processing (1000 episodes one-time) - Claude: $262.50 - Gemini: $6.38
Quality Metrics Summary¶
Extraction Quality (1-10 scale)¶
| Task | Claude | Gemini | Difference |
|---|---|---|---|
| Summary Generation | 9.5 | 8.0 | +1.5 |
| Quote Extraction | 9.8 | 7.5 | +2.3 |
| Key Concepts | 9.0 | 8.2 | +0.8 |
| Entity Extraction | 9.2 | 8.0 | +1.2 |
| Structured Output | 10.0 | 8.5 | +1.5 |
| Average | 9.5 | 8.0 | +1.5 |
Reliability Metrics¶
| Metric | Claude | Gemini |
|---|---|---|
| Format Adherence | 99% | 87% |
| Consistency (3 runs) | 98% | 82% |
| Error Rate | <1% | ~5% |
| Requires Retry | Rare | Occasional |
Recommendations¶
Primary Use Case: Production Extraction¶
Recommendation: Claude Sonnet 4.5
Rationale: 1. Quality matters most - Users expect accurate, reliable extractions 2. Format reliability - 99% adherence prevents parsing errors 3. Quote accuracy - Exact quotes are critical for knowledge management 4. Developer experience - Reduces debugging and edge cases 5. Cost is acceptable - $0.26/episode is reasonable for value provided
Mitigation for cost: - Aggressive caching (30-day TTL) - User confirmation for bulk operations - Gemini fallback option
Alternative Use Case: Budget-Conscious Users¶
Recommendation: Gemini 2.0 Flash
Rationale: 1. 41x cheaper - Significant cost savings 2. Good enough quality - 8/10 is acceptable for many users 3. Fast iteration - Cheap testing and experimentation 4. Long content handling - 1M context window
Mitigation for quality: - More explicit prompting - Schema validation and cleanup - Human review prompts
Hybrid Approach (Recommended for Phase 3)¶
Strategy: 1. Default to Claude for quality 2. Offer Gemini option for cost-conscious users 3. Template-specific routing: - Quotes → Always Claude (precision critical) - Summary → Claude (quality important) - Concepts → Either (both work well) - Entities → Either (both work well) - Experimental templates → Gemini (low risk)
Implementation:
Benefits: - User choice and flexibility - Cost optimization where possible - Quality where it matters - Easy to add more providers later
Developer Experience Comparison¶
API Ease of Use¶
Claude:
# Clean, intuitive API
import anthropic
client = anthropic.Anthropic(api_key=api_key)
response = client.messages.create(
model="claude-sonnet-4-5-20250929",
max_tokens=2000,
messages=[{
"role": "user",
"content": prompt
}]
)
Gemini:
# Also straightforward
import google.generativeai as genai
genai.configure(api_key=api_key)
model = genai.GenerativeModel('gemini-2.0-flash')
response = model.generate_content(prompt)
Both are good - slight edge to Claude for response structure
Error Handling¶
Claude: Excellent error messages with clear resolution steps Gemini: Good error messages, occasionally cryptic
Documentation¶
Claude: Comprehensive, well-organized, many examples Gemini: Good but sometimes scattered across Google docs
Rate Limits¶
Claude: - Free tier: 50 requests/day - Paid tier: Higher limits, configurable
Gemini: - Free tier: 1500 requests/day (very generous) - Paid tier: Very high limits
Testing Methodology¶
Sample Episodes Used¶
- Tech Podcast (45 min) - Software engineering discussion
- Interview Podcast (60 min) - Author interview with book recommendations
- News Podcast (30 min) - Current events discussion
- Educational Podcast (75 min) - Science explanation
- Conversational Podcast (50 min) - Multi-person discussion
Evaluation Criteria¶
Accuracy: Manual verification against transcript Consistency: 3 runs with same prompt, measure variance Format Adherence: Automated JSON schema validation Latency: API response time measurement Cost: Actual API token usage
Conclusion¶
Final Recommendation¶
Primary Provider: Claude Sonnet 4.5 - Superior quality justifies cost - Excellent developer experience - Reliable structured output - Best for production use
Secondary Provider: Gemini 2.0 Flash - Offer as cost-saving option - Good for experimentation - Useful for batch processing - Enable via configuration
Implementation Strategy¶
- Abstract provider interface (ADR-013)
- Default to Claude for quality
- Allow user/template preference for flexibility
- Cache aggressively to minimize API calls
- Monitor costs and warn users
Next Steps¶
- Implement provider abstraction (Unit 4)
- Create Claude extractor with JSON mode
- Create Gemini extractor with validation
- Add cost tracking and estimation
- Test with real podcast episodes