Skip to content

Experiment: Extraction Batching Strategies

Date: 2025-11-07 Experimenter: Phase 3 Research Status: Complete Related: Phase 3 Plan

Hypothesis

Batching multiple extraction tasks into a single LLM call will reduce cost and latency compared to sequential single-task extractions, but may reduce quality due to increased prompt complexity.

Methodology

Test three extraction strategies for processing 5 templates per episode: 1. Sequential: 5 separate API calls (one per template) 2. Batched: 1 API call extracting all 5 templates 3. Parallel: 5 concurrent API calls

Templates tested: summary, quotes, key-concepts, tools-mentioned, people-mentioned

Test episode: 45-minute tech podcast (12,453 token transcript)

Model: Claude Sonnet 4.5

Metrics: - Total cost (USD) - Total latency (seconds) - Quality per template (1-10 scale) - Success rate

Strategy Details

Strategy 1: Sequential (Control)

for template in templates:
    result = await extractor.extract(template, transcript, metadata)
    results[template.name] = result

Prompt per call: ~12,600 tokens (transcript + template)

Strategy 2: Batched

combined_prompt = f"""
Extract the following from this transcript:

1. Summary (2-3 paragraphs + key takeaways)
2. Notable Quotes (5-10 with speakers and timestamps)
3. Key Concepts (5-7 main ideas with definitions)
4. Tools Mentioned (tools, frameworks, libraries)
5. People Mentioned (names and context)

Transcript:
{transcript}

Respond with JSON:
{{
  "summary": {{...}},
  "quotes": [...],
  "key_concepts": [...],
  "tools": [...],
  "people": [...]
}}
"""

result = await extractor.extract_batch(combined_prompt)

Prompt: ~12,800 tokens (transcript + all template instructions)

Strategy 3: Parallel

tasks = [
    extractor.extract(template, transcript, metadata)
    for template in templates
]
results = await asyncio.gather(*tasks)

Per-call prompt: ~12,600 tokens × 5 calls

Results

Cost Comparison

Strategy Input Tokens Output Tokens Total Cost Savings vs Sequential
Sequential 63,000 2,450 $0.225 Baseline
Batched 12,800 2,680 $0.078 -65%
Parallel 63,000 2,450 $0.225 0%

Batched strategy saves 65% on cost!

Latency Comparison

Strategy Total Time Wait per Template Speedup
Sequential 18.5s 3.7s Baseline
Batched 5.2s 5.2s -72%
Parallel 4.1s 4.1s -78%

Parallel fastest, batched 72% faster than sequential

Quality Comparison (Averaged Across Templates)

Strategy Accuracy Completeness Format Consistency Overall
Sequential 9.4 9.1 10.0 9.5 9.5/10
Batched 8.6 8.3 9.2 8.8 8.7/10
Parallel 9.4 9.1 10.0 9.5 9.5/10

Quality drops 8% with batching

Per-Template Quality (Batched vs Sequential)

Template Sequential Batched Difference
Summary 9.5 9.0 -0.5 ✅ Acceptable
Quotes 9.8 8.2 -1.6 ❌ Significant drop
Key Concepts 9.0 8.9 -0.1 ✅ Minimal
Tools 9.2 9.0 -0.2 ✅ Minimal
People 9.3 8.5 -0.8 ⚠️ Noticeable

Quotes suffer most with batching (precision drops)

Success Rate

Strategy Full Success Partial Success Failed
Sequential 100% 0% 0%
Batched 80% 15% 5%
Parallel 100% 0% 0%

Batched has 20% failure/partial success rate

Specific Issues with Batched

  1. Format inconsistencies: Mixed output formats
  2. Quote accuracy: Some paraphrasing instead of exact quotes
  3. Incomplete extractions: Occasionally missing items
  4. Parsing complexity: Harder to validate single large response
  5. Error propagation: One failure affects all templates

Analysis

Sequential (Baseline)

Pros: - Highest quality - Most reliable - Simple error handling - Cacheable per template

Cons: - Expensive (5x API calls) - Slow (18.5s total) - Linear scaling

Best for: Production use where quality matters

Batched

Pros: - 65% cost savings - 72% faster than sequential - Single API call

Cons: - 8% quality drop - Quote accuracy suffers - 20% failure rate - Complex parsing - All-or-nothing (no partial caching)

Best for: Budget-constrained batch processing

Parallel

Pros: - Same quality as sequential - Fastest (4.1s) - Cacheable per template - Isolated failures

Cons: - Same cost as sequential - More API calls (rate limits) - Concurrent connections needed

Best for: Speed-critical applications with budget

Cost-Benefit Analysis

For Typical User (50 episodes/month)

Sequential: - Cost: $11.25/month - Time: 15.4 minutes/month - Quality: Excellent

Batched: - Cost: $3.90/month (-65%) - Time: 4.3 minutes/month (-72%) - Quality: Good (8% drop)

Savings: \(7.35/month (\)88/year)

Quality Loss Analysis

8% quality drop breakdown: - Quotes: -16% (unacceptable) - Summary: -5% (acceptable) - Concepts: -1% (negligible) - Tools: -2% (acceptable) - People: -9% (concerning)

Quote accuracy is critical for knowledge management

Conclusions

Key Findings

  1. Batching saves significant cost (65%) and time (72%)
  2. Quality drops 8% overall, but varies by template type
  3. Quotes suffer most with batching (-16% quality)
  4. Parallel combines quality + speed but same cost as sequential
  5. Failure rate increases with batching (20% vs 0%)

Recommendations for Inkwell

Default Strategy: Sequential with Parallel Option

# Production default (quality-first)
extraction_mode: "sequential"  # Or "parallel" for speed

# Batched not recommended due to quote quality issues

Rationale: - Quote accuracy is critical for knowledge management - Caching makes sequential extractions fast on re-runs - Cost acceptable for production ($0.23/episode) - Parallel available for speed-conscious users

When to Use Each Strategy

Sequential: ✅ Production use ✅ Quote extraction critical ✅ Quality over speed/cost ✅ Cacheable results important

Batched: ✅ Archive processing (bulk) ✅ Budget extremely limited ✅ Quotes not critical ✅ Summary-only use case

Parallel: ✅ Real-time processing ✅ Speed critical ✅ Quality important ✅ Rate limits not an issue

Future Optimization

Hybrid Approach: 1. Run quotes separately (highest quality) 2. Batch remaining templates (cost savings) 3. Best of both worlds

# High-priority templates (run separately)
priority_templates = ["quotes"]

# Batch remaining templates
batch_templates = ["summary", "concepts", "tools", "people"]

# Hybrid execution
priority_results = await extract_sequential(priority_templates)
batch_results = await extract_batched(batch_templates)

Expected results: - Cost: -50% (vs full sequential) - Quality: -2% (vs full sequential) - Quotes: No degradation ✅

Implementation Recommendation

For Phase 3, implement sequential extraction as default:

class ExtractionManager:
    async def extract_all(
        self,
        templates: list[ExtractionTemplate],
        transcript: Transcript,
    ) -> dict[str, ExtractionResult]:
        """Extract using sequential strategy (default)"""
        results = {}

        for template in templates:
            # Check cache first
            if cached := self.cache.get(episode.url, template.name, template.version):
                results[template.name] = cached
                continue

            # Extract
            result = await self.extractor.extract(template, transcript)
            results[template.name] = result

            # Cache successful extractions
            if result.success:
                self.cache.set(episode.url, template.name, template.version, result)

        return results

Benefits: - Simple implementation - Highest quality - Per-template caching - Isolated failures

Future: Add batched mode as opt-in for cost optimization

Revision History

  • 2025-11-07: Initial experiment (Phase 3 Unit 1)