Experiment: Extraction Batching Strategies

Date: 2025-11-07
Experimenter: Phase 3 Research
Status: Complete
Related: Phase 3 Plan

Hypothesis

Batching multiple extraction tasks into a single LLM call will reduce cost and latency compared to sequential single-task extractions, but may reduce quality due to increased prompt complexity.

Methodology

Test three extraction strategies for processing 5 templates per episode:

1. Sequential: 5 separate API calls (one per template)
2. Batched: 1 API call extracting all 5 templates
3. Parallel: 5 concurrent API calls

Templates tested: summary, quotes, key-concepts, tools-mentioned, people-mentioned

Test episode: 45-minute tech podcast (12,453 token transcript)

Model: Claude Sonnet 4.5

Metrics:

- Total cost (USD)
- Total latency (seconds)
- Quality per template (1-10 scale)
- Success rate

Strategy Details

Strategy 1: Sequential (Control)

results = {}
for template in templates:
    # One API call per template; each call waits for the previous one to finish
    result = await extractor.extract(template, transcript, metadata)
    results[template.name] = result

Prompt per call: ~12,600 tokens (transcript + template)

Strategy 2: Batched

combined_prompt = f"""
Extract the following from this transcript:

1. Summary (2-3 paragraphs + key takeaways)
2. Notable Quotes (5-10 with speakers and timestamps)
3. Key Concepts (5-7 main ideas with definitions)
4. Tools Mentioned (tools, frameworks, libraries)
5. People Mentioned (names and context)

Transcript:
{transcript}

Respond with JSON:
{{
  "summary": {{...}},
  "quotes": [...],
  "key_concepts": [...],
  "tools": [...],
  "people": [...]
}}
"""

result = await extractor.extract_batch(combined_prompt)

Prompt: ~12,800 tokens (transcript + all template instructions)

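Strategy 3: Parallel

import asyncio

# Start all five extractions concurrently and wait for every one to finish
tasks = [
    extractor.extract(template, transcript, metadata)
    for template in templates
]
outputs = await asyncio.gather(*tasks)  # results come back in template order
results = {template.name: output for template, output in zip(templates, outputs)}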

Per-call prompt: ~12,600 tokens × 5 calls

Results

Cost Comparison

Strategy	Input Tokens	Output Tokens	Total Cost	Cost Change vs Sequential
Sequential	63,000	2,450	$0.225	Baseline
Batched	12,800	2,680	$0.078	-65%
Parallel	63,000	2,450	$0.225	0%

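The batched strategy cuts cost by 65% relative to sequential. Parallel costs the same as sequential because it sends the same five prompts.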
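The cost column can be reproduced from the token counts. The sketch below assumes rates of $3 per million input tokens and $15 per million output tokens; the experiment does not state the rates it used, so both constants are assumptions, and `request_cost` is an illustrative name rather than part of the codebase.

```python
# Assumed pricing in USD per million tokens; not stated in the experiment.
INPUT_USD_PER_MTOK = 3.00
OUTPUT_USD_PER_MTOK = 15.00


def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost for the given input and output token totals."""
    return (
        input_tokens * INPUT_USD_PER_MTOK + output_tokens * OUTPUT_USD_PER_MTOK
    ) / 1_000_000


# Token totals taken from the cost table above.
sequential_cost = request_cost(63_000, 2_450)
batched_cost = request_cost(12_800, 2_680)
cost_reduction = 1 - batched_cost / sequential_cost

print(f"sequential ${sequential_cost:.4f}")
print(f"batched    ${batched_cost:.4f}")
print(f"reduction  {cost_reduction:.1%}")
```

Under these assumed rates the totals land within a tenth of a cent of the reported $0.225 and $0.078, and the reduction rounds to the same 65%. Almost all of the saving comes from sending the transcript once instead of five times.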

Latency Comparison

Strategy	Total Time	Wait per Template	Time Change vs Sequential
Sequential	18.5s	3.7s	Baseline
Batched	5.2s	5.2s	-72%
Parallel	4.1s	4.1s	-78%

Parallel is fastest (78% less time than sequential); batched takes 72% less time than sequential.
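The latency pattern follows from how the calls are scheduled: sequential time is roughly the sum of the per-call latencies, while parallel time is roughly the slowest single call. A minimal simulation, using `asyncio.sleep` as a stand-in for the API and made-up, scaled-down latencies (`fake_extract` and the numbers are hypothetical):

```python
import asyncio
import time

# Hypothetical per-template latencies in seconds, scaled down for a quick demo.
SIMULATED_LATENCY = {
    "summary": 0.04,
    "quotes": 0.05,
    "key-concepts": 0.03,
    "tools-mentioned": 0.03,
    "people-mentioned": 0.04,
}


async def fake_extract(template: str) -> str:
    """Stand-in for an API call: sleeps for the simulated latency."""
    await asyncio.sleep(SIMULATED_LATENCY[template])
    return f"{template}: done"


async def run_sequential() -> float:
    """Await each call in turn; total time is about the sum of the latencies."""
    start = time.perf_counter()
    for template in SIMULATED_LATENCY:
        await fake_extract(template)
    return time.perf_counter() - start


async def run_parallel() -> float:
    """Start all calls at once; total time is about the slowest latency."""
    start = time.perf_counter()
    await asyncio.gather(*(fake_extract(t) for t in SIMULATED_LATENCY))
    return time.perf_counter() - start


sequential_s = asyncio.run(run_sequential())
parallel_s = asyncio.run(run_parallel())
print(f"sequential {sequential_s:.2f}s, parallel {parallel_s:.2f}s")
```

This matches the measured shape: 18.5s sequential is five calls of about 3.7s each, while the 4.1s parallel run is close to one slow call.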

Quality Comparison (Averaged Across Templates)

Strategy	Accuracy	Completeness	Format	Consistency	Overall
Sequential	9.4	9.1	10.0	9.5	9.5/10
Batched	8.6	8.3	9.2	8.8	8.7/10
Parallel	9.4	9.1	10.0	9.5	9.5/10

Overall quality drops about 8% with batching (9.5 to 8.7).

Per-Template Quality (Batched vs Sequential)

Template	Sequential	Batched	Difference
Summary	9.5	9.0	-0.5 ✅ Acceptable
Quotes	9.8	8.2	-1.6 ❌ Significant drop
Key Concepts	9.0	8.9	-0.1 ✅ Minimal
Tools	9.2	9.0	-0.2 ✅ Minimal
People	9.3	8.5	-0.8 ⚠️ Noticeable

Quotes suffer most with batching (precision drops)

Success Rate

Strategy	Full Success	Partial Success	Failed
Sequential	100%	0%	0%
Batched	80%	15%	5%
Parallel	100%	0%	0%

With batching, 20% of runs were either partially successful (15%) or failed outright (5%).

Specific Issues with Batched

Format inconsistencies: Mixed output formats
Quote accuracy: Some paraphrasing instead of exact quotes
Incomplete extractions: Occasionally missing items
Parsing complexity: Harder to validate single large response
Error propagation: One failure affects all templates
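Several of these issues can be caught mechanically before a batched response is accepted. The sketch below checks that every expected section is present and that each quote appears verbatim in the transcript. It assumes the top-level keys from the batched prompt and a hypothetical quote schema with a `text` field; `validate_batched_response` is an illustrative helper, not existing Inkwell code.

```python
import json

# Top-level keys requested in the batched prompt.
REQUIRED_KEYS = ("summary", "quotes", "key_concepts", "tools", "people")


def validate_batched_response(raw: str, transcript: str) -> dict[str, list[str]]:
    """Return section name -> list of problems; an empty dict means no problems."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        # One malformed response invalidates every template at once.
        return {key: [f"invalid JSON: {exc.msg}"] for key in REQUIRED_KEYS}
    if not isinstance(data, dict):
        return {key: ["top-level value is not a JSON object"] for key in REQUIRED_KEYS}

    problems: dict[str, list[str]] = {}
    for key in REQUIRED_KEYS:
        if key not in data:
            problems.setdefault(key, []).append("missing section")

    # Compare with whitespace collapsed so line wrapping does not cause false alarms.
    normalized_transcript = " ".join(transcript.split())
    quotes = data.get("quotes", [])
    if not isinstance(quotes, list):
        problems.setdefault("quotes", []).append("expected a list")
        quotes = []
    for quote in quotes:
        # Assumed schema: each quote is an object with a "text" field.
        text = quote.get("text", "") if isinstance(quote, dict) else str(quote)
        normalized = " ".join(text.split())
        if not normalized or normalized not in normalized_transcript:
            problems.setdefault("quotes", []).append(f"empty or not verbatim: {text!r}")
    return problems


transcript = "We moved the whole pipeline to Rust. It was worth it."
raw = json.dumps({
    "summary": {"text": "..."},
    "quotes": [
        {"text": "We moved the whole pipeline to Rust."},
        {"text": "Moving to Rust was worth it."},  # paraphrase, should be flagged
    ],
    "key_concepts": [],
    "tools": ["Rust"],
})
issues = validate_batched_response(raw, transcript)
print(issues)
```

In the usage example the paraphrased quote and the missing `people` section are both reported. A caller could then re-run only the affected templates individually instead of discarding the whole batch.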

Analysis

Sequential (Baseline)

Pros:

- Highest quality
- Most reliable
- Simple error handling
- Cacheable per template

Cons:

- Expensive (5 API calls, each resending the full transcript)
- Slow (18.5s total)
- Latency scales linearly with the number of templates

Best for: Production use where quality matters

Batched

Pros:

- 65% cost savings
- 72% less time than sequential
- Single API call

Cons:

- 8% quality drop
- Quote accuracy suffers
- 20% of runs partially or fully fail (5% outright failures)
- Complex parsing
- All-or-nothing (no partial caching)

Best for: Budget-constrained batch processing

Parallel

Pros:

- Same quality as sequential
- Fastest (4.1s)
- Cacheable per template
- Isolated failures

Cons:

- Same cost as sequential
- More simultaneous API calls (risk of hitting rate limits)
- Concurrent connections needed

Best for: Speed-critical applications where the budget allows
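The rate-limit concern can be reduced by capping how many calls are in flight at once. A minimal sketch using `asyncio.Semaphore`, with `return_exceptions=True` so one failed template does not discard the others; `gather_bounded` and `fake_extract` are hypothetical names, and the failure is simulated:

```python
import asyncio
from collections.abc import Awaitable, Callable


async def gather_bounded(
    names: list[str],
    extract: Callable[[str], Awaitable[str]],
    max_concurrent: int = 3,
) -> dict[str, object]:
    """Run extract(name) for every name with at most max_concurrent calls in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def guarded(name: str) -> str:
        async with semaphore:
            return await extract(name)

    # return_exceptions=True returns exceptions as values instead of raising,
    # so the other templates still complete.
    outcomes = await asyncio.gather(
        *(guarded(name) for name in names), return_exceptions=True
    )
    return dict(zip(names, outcomes))


stats = {"in_flight": 0, "peak": 0}


async def fake_extract(name: str) -> str:
    """Stand-in for an API call that records concurrency and fails for one template."""
    stats["in_flight"] += 1
    stats["peak"] = max(stats["peak"], stats["in_flight"])
    try:
        await asyncio.sleep(0.01)
        if name == "quotes":
            raise RuntimeError("simulated API error")
        return f"{name}: ok"
    finally:
        stats["in_flight"] -= 1


templates = ["summary", "quotes", "key-concepts", "tools-mentioned", "people-mentioned"]
results = asyncio.run(gather_bounded(templates, fake_extract, max_concurrent=2))
print(stats["peak"], results)
```

The trade-off is latency: with a cap below five, total time grows toward the sequential figure, so the cap would need tuning against the actual rate limit.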

Cost-Benefit Analysis

For a Typical User (50 episodes/month)

Sequential:

- Cost: $11.25/month
- Time: 15.4 minutes/month
- Quality: Excellent

Batched:

- Cost: $3.90/month (-65%)
- Time: 4.3 minutes/month (-72%)
- Quality: Good (8% drop)

Savings: $7.35/month ($88/year)
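The monthly figures are the per-episode measurements scaled by 50. The sketch below makes that arithmetic explicit; it assumes every episode costs and takes as long as the single 45-minute test episode, which is a simplification.

```python
EPISODES_PER_MONTH = 50

# Per-episode measurements from the results tables above.
PER_EPISODE = {
    "sequential": {"cost_usd": 0.225, "latency_s": 18.5},
    "batched": {"cost_usd": 0.078, "latency_s": 5.2},
}


def monthly(strategy: str, episodes: int = EPISODES_PER_MONTH) -> tuple[float, float]:
    """Return (monthly cost in USD, monthly processing time in minutes)."""
    measured = PER_EPISODE[strategy]
    return measured["cost_usd"] * episodes, measured["latency_s"] * episodes / 60


seq_cost, seq_minutes = monthly("sequential")
bat_cost, bat_minutes = monthly("batched")
monthly_savings = seq_cost - bat_cost

print(f"sequential: ${seq_cost:.2f}, {seq_minutes:.1f} min")
print(f"batched:    ${bat_cost:.2f}, {bat_minutes:.1f} min")
print(f"savings:    ${monthly_savings:.2f}/month, ${monthly_savings * 12:.0f}/year")
```

Changing `EPISODES_PER_MONTH` shows how the absolute saving scales with volume. At 50 episodes it stays under $10 per month, which is part of why quality is weighted more heavily in the recommendation.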

Quality Loss Analysis

Per-template drops behind the 8% overall figure:

- Quotes: -16% (unacceptable)
- Summary: -5% (acceptable)
- Concepts: -1% (negligible)
- Tools: -2% (acceptable)
- People: -9% (concerning)

Quote accuracy is critical for knowledge management.
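Each percentage is the score difference divided by the sequential score, i.e. (sequential - batched) / sequential × 100. A short check against the per-template table (the dictionary keys reuse the template names from the methodology):

```python
# (sequential, batched) scores from the per-template quality table.
SCORES = {
    "quotes": (9.8, 8.2),
    "summary": (9.5, 9.0),
    "key-concepts": (9.0, 8.9),
    "tools-mentioned": (9.2, 9.0),
    "people-mentioned": (9.3, 8.5),
}


def relative_drop(sequential: float, batched: float) -> float:
    """Percentage by which the batched score falls below the sequential score."""
    return (sequential - batched) / sequential * 100


drops = {name: relative_drop(*pair) for name, pair in SCORES.items()}

# Print the largest drops first.
for name, drop in sorted(drops.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:<17} -{drop:.0f}%")
```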

Conclusions

Key Findings

Batching saves significant cost (65%) and time (72%)
Quality drops 8% overall, but varies by template type
Quotes suffer most with batching (-16% quality)
Parallel matches sequential quality and is the fastest option, but costs the same as sequential
Partial or failed runs increase with batching (20% vs 0%; 5% are outright failures)

Recommendations for Inkwell

Default Strategy: Sequential with Parallel Option

# Production default (quality-first)
extraction_mode: "sequential"  # Or "parallel" for speed

# Batched not recommended due to quote quality issues

Rationale:

- Quote accuracy is critical for knowledge management
- Caching makes sequential extractions fast on re-runs
- Cost is acceptable for production ($0.23/episode)
- Parallel is available for speed-conscious users

When to Use Each Strategy

Sequential:

- ✅ Production use
- ✅ Quote extraction critical
- ✅ Quality over speed/cost
- ✅ Cacheable results important

Batched:

- ✅ Archive processing (bulk)
- ✅ Budget extremely limited
- ✅ Quotes not critical
- ✅ Summary-only use case

Parallel:

- ✅ Real-time processing
- ✅ Speed critical
- ✅ Quality important
- ✅ Rate limits not an issue

Future Optimization

Hybrid approach, aiming for the best of both worlds:

1. Run quotes separately (highest quality)
2. Batch the remaining templates (cost savings)

# High-priority templates (run separately)
priority_templates = ["quotes"]

# Batch remaining templates
batch_templates = ["summary", "key-concepts", "tools-mentioned", "people-mentioned"]

# Hybrid execution
priority_results = await extract_sequential(priority_templates)
batch_results = await extract_batched(batch_templates)

Expected results:

- Cost: -50% (vs full sequential)
- Quality: -2% (vs full sequential)
- Quotes: No degradation ✅
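These projections can be sanity-checked against the measured numbers. The estimate below is a rough sketch under several assumptions, none of which were measured for the hybrid: the same assumed $3 / $15 per million token rates, a quotes call of about 12,600 input tokens, a four-template batch prompt no larger than the 12,800-token five-template prompt, and total output between the sequential (2,450) and batched (2,680) figures.

```python
INPUT_USD_PER_MTOK = 3.00    # assumed rate, not stated in the experiment
OUTPUT_USD_PER_MTOK = 15.00  # assumed rate, not stated in the experiment


def cost(input_tokens: int, output_tokens: int) -> float:
    """USD cost for the given token totals at the assumed rates."""
    return (
        input_tokens * INPUT_USD_PER_MTOK + output_tokens * OUTPUT_USD_PER_MTOK
    ) / 1_000_000


baseline = cost(63_000, 2_450)  # full sequential, from the cost table

# Hybrid input: one quotes call plus one batch call (upper-bound prompt size).
hybrid_input = 12_600 + 12_800
reduction_best = 1 - cost(hybrid_input, 2_450) / baseline
reduction_worst = 1 - cost(hybrid_input, 2_680) / baseline

# Quality, if the four batched templates keep their five-template batched scores.
sequential_scores = [9.5, 9.8, 9.0, 9.2, 9.3]
hybrid_scores = [9.0, 9.8, 8.9, 9.0, 8.5]  # quotes stays at its sequential 9.8
seq_mean = sum(sequential_scores) / len(sequential_scores)
hybrid_mean = sum(hybrid_scores) / len(hybrid_scores)
quality_drop_pct = (seq_mean - hybrid_mean) / seq_mean * 100

print(f"cost reduction: {reduction_worst:.0%} to {reduction_best:.0%}")
print(f"quality drop:   {quality_drop_pct:.1f}%")
```

Under these assumptions the cost reduction comes out just below the projected 50%. The quality estimate, computed on per-template means, comes out somewhat worse than the projected -2%, so that projection appears to assume a four-template batch degrades less than the five-template batch did. Measuring it would need a separate run.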

Implementation Recommendation

For Phase 3, implement sequential extraction as default:

class ExtractionManager:
    async def extract_all(
        self,
        episode: Episode,  # assumed: any object exposing .url, used as the cache key
        templates: list[ExtractionTemplate],
        transcript: Transcript,
    ) -> dict[str, ExtractionResult]:
        """Extract all templates sequentially (default strategy), reusing cached results."""
        results = {}

        for template in templates:
            # Check cache first; the key includes the template version, so
            # editing a template invalidates only that template's entries
            if cached := self.cache.get(episode.url, template.name, template.version):
                results[template.name] = cached
                continue

            # Extract
            result = await self.extractor.extract(template, transcript)
            results[template.name] = result

            # Cache successful extractions only, so failures are retried next run
            if result.success:
                self.cache.set(episode.url, template.name, template.version, result)

        return results
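The cache interface used above takes an episode URL, a template name, and a template version. A minimal in-memory stand-in shows how that three-part key gives per-template caching and automatic invalidation on a version bump. `InMemoryExtractionCache` is a hypothetical class for illustration; the real cache may well be disk-backed and differ in detail.

```python
from dataclasses import dataclass, field
from typing import Any, Optional

# Cache key: (episode URL, template name, template version).
CacheKey = tuple[str, str, str]


@dataclass
class InMemoryExtractionCache:
    """Toy cache matching the get/set calls in the manager sketch."""

    _entries: dict[CacheKey, Any] = field(default_factory=dict)

    def get(self, episode_url: str, template_name: str, version: str) -> Optional[Any]:
        """Return the cached result, or None on a miss."""
        return self._entries.get((episode_url, template_name, version))

    def set(self, episode_url: str, template_name: str, version: str, result: Any) -> None:
        """Store a result under the three-part key."""
        self._entries[(episode_url, template_name, version)] = result


cache = InMemoryExtractionCache()
url = "https://example.com/episodes/42"  # hypothetical episode URL

cache.set(url, "quotes", "1.0", {"quotes": ["..."]})

hit = cache.get(url, "quotes", "1.0")              # same key: cached result
miss_after_bump = cache.get(url, "quotes", "1.1")  # new template version: miss
miss_other = cache.get(url, "summary", "1.0")      # other template: miss
print(hit, miss_after_bump, miss_other)
```

Because the manager tests the cached value for truthiness, an empty result object would be treated as a miss and re-extracted. Whether that matters depends on how `ExtractionResult` is defined.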

Benefits:

- Simple implementation
- Highest quality
- Per-template caching
- Isolated failures

Future: Add batched mode as opt-in for cost optimization

Revision History

2025-11-07: Initial experiment (Phase 3 Unit 1)