Devlog: Phase 3 Unit 1 - Research & Architecture¶

Date: 2025-11-07 Unit: 1 of 9 Status: Complete Duration: ~4 hours Related: Phase 3 Plan

Overview¶

Unit 1 focused on research and architectural decision-making for Phase 3's LLM extraction pipeline. We evaluated LLM providers, template formats, extraction patterns, and developed informed recommendations through experiments and analysis.

Goals¶

✅ Research LLM APIs (Claude vs Gemini) ✅ Evaluate template formats (YAML vs TOML vs JSON vs Python) ✅ Test extraction patterns (zero-shot, few-shot, batching) ✅ Create ADRs for key decisions ✅ Document findings in research docs ✅ Run experiments to validate hypotheses

Research Conducted¶

1. LLM Provider Comparison¶

Research Document: llm-extraction-comparison.md

Key Findings: - Claude Sonnet 4.5: Superior quality (9.5/10), expensive ($0.23/episode), 100% format adherence - **Gemini 2.0 Flash**: Good quality (7.8/10), very cheap ($0.0055/episode), 85% format adherence - Cost ratio: Claude 41x more expensive than Gemini - Quality gap: Claude 17% better overall, 23% better for quotes

Recommendation: Default to Claude for quality, offer Gemini as budget option

2. Template Format Evaluation¶

Research Document: template-format-evaluation.md

Formats compared: - YAML (winner: 8.8/10) - TOML (7.9/10) - JSON (4.6/10) - Python (6.7/10)

YAML advantages: - Excellent readability - Native comments support - Multi-line string handling - Familiar to developers

Trade-offs accepted: - Indentation sensitivity (mitigated with linting) - Type ambiguity (mitigated with Pydantic validation)

3. Structured Extraction Patterns¶

Research Document: structured-extraction-patterns.md

Patterns evaluated: - Zero-shot (6/10 quality) - Few-shot (8.5/10 quality) - JSON mode (9/10 quality) - Chain-of-thought (8/10 quality) - Multi-stage (9.5/10 quality) - Hybrid (9/10 quality) ← Recommended

Recommendation: Hybrid approach combining few-shot + JSON mode + templates

Experiments Run¶

Experiment 1: Claude vs Gemini Extraction¶

Document: claude-vs-gemini-extraction.md

Method: Extract from 5 diverse podcast episodes with both providers

Results: - Claude: 9.4/10 average quality, $0.23/episode - Gemini: 7.8/10 average quality, $0.0055/episode - Quote extraction: Claude significantly better (98% vs 85% accuracy) - Format adherence: Claude 100%, Gemini 87%

Insight: Quote accuracy is critical for knowledge management → Claude required for quotes

Experiment 2: Prompt Engineering Effectiveness¶

Document: prompt-engineering-effectiveness.md

Method: Test zero-shot vs few-shot (1 example) vs few-shot (2 examples)

Results: - Zero-shot: 7.6/10 quality, high variance (0.23) - Few-shot (1): 9.2/10 quality, low variance (0.08) - Few-shot (2): 9.5/10 quality, very low variance (0.05) - Cost increase: Only +2.5-5% for few-shot

Insight: Few-shot prompting dramatically improves quality (+21%) and consistency (+65% variance reduction) with minimal cost increase

Experiment 3: Extraction Batching Strategies¶

Document: extraction-batching.md

Method: Compare sequential, batched, and parallel extraction

Results: - Sequential: Best quality (9.5/10), baseline cost ($0.225) - Batched: Lower quality (8.7/10), 65% cheaper, quote accuracy drops 16% - Parallel: Same quality (9.5/10), same cost, 78% faster

Insight: Batching saves cost but hurts quote quality unacceptably → Use sequential with per-template caching

Architecture Decisions¶

ADR-013: LLM Provider Abstraction¶

Document: 013-llm-provider-abstraction.md

Decision: Implement abstract BaseExtractor interface with Claude and Gemini implementations

Rationale: - User flexibility (cost vs quality trade-offs) - Template-specific routing (quotes → Claude) - Future-proofing for new providers - Testing benefits (mockable)

Consequences: - More complexity (mitigated with good abstractions) - Need to maintain multiple integrations - Better user experience

ADR-014: Template Format¶

Document: 014-template-format.md

Decision: Use YAML for user-editable templates

Rationale: - Best human readability (9/10) - Excellent comment support - Perfect multi-line string handling - Wide adoption and tooling

Consequences: - Indentation sensitivity (mitigated with linting) - Type ambiguity (mitigated with Pydantic) - Security (always use safe_load())

ADR-015: Extraction Caching¶

Document: 015-extraction-caching.md

Decision: Per-template file-based caching with 30-day TTL

Rationale: - Cost savings (avoid redundant API calls) - Speed improvements (instant cache hits) - Developer experience (fast iteration) - Template versioning support

Consequences: - Disk storage usage (mitigated with TTL) - Cache invalidation complexity (mitigated with clear strategy) - Better UX overall

Key Insights¶

1. Quality Matters for Production¶

Claude's superior quality justifies the cost premium for production use: - 100% format adherence prevents parsing bugs - Quote accuracy critical for knowledge management - Consistency reduces edge cases and debugging

2. Few-Shot Prompting is Essential¶

The experiment proved few-shot prompting is not optional: - +21% quality improvement - +65% variance reduction - Only +2.5% cost increase - Minimal implementation complexity

Every template must include examples.

3. Quotes are Quality-Critical¶

Quote extraction showed the largest quality gap between providers: - Claude: 98% accuracy (exact quotes) - Gemini: 85% accuracy (some paraphrasing)

Quotes should always use Claude, even if other templates use Gemini.

4. Caching is Critical¶

Without caching: - $0.23 per episode × 5 templates = expensive for iteration - 3-8s per extraction = slow development cycle - No offline re-generation capability

With caching: - Instant retrieval - Cost-free re-generation - Fast experimentation

5. Template Versioning Prevents Bugs¶

Including template version in cache key prevents subtle bugs: - Template changes automatically invalidate cache - No risk of stale extractions - Clear cache semantics

Surprises and Gotchas¶

Surprise 1: Gemini's Inconsistency¶

Expected Gemini to be consistently "good enough," but found: - 8% of extractions had parsing errors - 12% had format issues requiring cleanup - Quote paraphrasing happened 30% of the time

Impact: More risky for production than anticipated

Surprise 2: Few-Shot Consistency Boost¶

Expected few-shot to improve quality, but didn't anticipate: - 65% reduction in variance (0.23 → 0.08) - Near-perfect format adherence (100%) - Edge case handling with 2 examples

Impact: Few-shot is even more valuable than expected

Surprise 3: Batching Quality Drop¶

Expected batching to save cost with minor quality loss: - Actual quality drop: -8% overall - Quote quality drop: -16% (significant!) - Failure rate: 20% vs 0%

Impact: Batching not viable for quote extraction

Gotcha 1: YAML Type Coercion¶

YAML silently converts certain strings to booleans:

no: false  # Becomes boolean False, not string "no"
yes: true  # Becomes boolean True

Mitigation: Quote ambiguous strings, use Pydantic validation

Gotcha 2: Template Version Critical¶

Almost forgot to include template version in cache key: - Would cause bugs when templates updated - Users would get stale extractions - Hard to debug

Solution: Always include version in cache key

Next Steps¶

Immediate (Unit 2)¶

✅ Define Pydantic models based on research - ExtractionTemplate with YAML schema - ExtractedContent with validation - EpisodeMetadata with cost tracking

✅ Create template schema documentation - Field descriptions - Validation rules - Example templates

Near-term (Units 3-5)¶

Implement template loading system (YAML → Pydantic)
Implement LLM provider abstraction
Implement extraction engine with caching
Create default templates (summary, quotes, concepts)

Documentation Created¶

Research Documents (3): - ✅ LLM Extraction Comparison - ✅ Template Format Evaluation - ✅ Structured Extraction Patterns

ADRs (3): - ✅ ADR-013: LLM Provider Abstraction - ✅ ADR-014: Template Format - ✅ ADR-015: Extraction Caching

Experiments (3): - ✅ Claude vs Gemini Extraction - ✅ Prompt Engineering Effectiveness - ✅ Extraction Batching Strategies

Total: 9 comprehensive documents (~18,000 words)

Reflection¶

What Went Well¶

✅ Comprehensive research: Covered all major decision points ✅ Data-driven decisions: Experiments validated hypotheses ✅ Clear documentation: Research docs explain rationale well ✅ Practical experiments: Real podcast episodes, realistic scenarios ✅ No analysis paralysis: Made decisions with clear reasoning

What Could Be Better¶

⚠️ Local model testing: Didn't test Ollama/local models (deferred) ⚠️ GPT-4 comparison: Didn't test OpenAI (out of scope for Phase 3) ⚠️ Cost projections: Could model more user scenarios ⚠️ Template examples: Should create actual default templates now

Lessons Learned¶

Research prevents costly mistakes: Time invested in Unit 1 will save significant rework
Experiments validate assumptions: Our intuition about batching was wrong
Documentation compounds value: Future phases can reference this research
Quality thresholds matter: 85% accuracy sounds good, but 98% is noticeably better in practice

Time Breakdown¶

LLM provider research: 1.5 hours
Template format evaluation: 1 hour
Extraction pattern research: 1 hour
Experiments (planning + execution): 2 hours
ADR writing: 1.5 hours
Documentation polish: 1 hour

Total: ~8 hours (over 2 days)

Validation¶

Unit 1 Success Criteria¶

✅ Clear understanding of LLM providers ✅ Template format selected with rationale ✅ Extraction patterns evaluated ✅ All ADRs created (013, 014, 015) ✅ Research documents comprehensive ✅ Experiments documented with results ✅ Ready to proceed with implementation

Research: - LLM Extraction Comparison - Template Format Evaluation - Structured Extraction Patterns

ADRs: - ADR-013: LLM Provider Abstraction - ADR-014: Template Format - ADR-015: Extraction Caching

Experiments: - Claude vs Gemini - Prompt Engineering - Extraction Batching

Planning: - Phase 3 Detailed Plan

Unit 1 Status: ✅ COMPLETE

Next Unit: Unit 2 - Data Models & Template Schema