Structured Extraction Patterns for LLMs¶
Date: 2025-11-07 Author: Phase 3 Research Status: Complete Related: Phase 3 Plan
Overview¶
This document explores prompt engineering techniques and patterns for reliable structured content extraction from podcast transcripts using LLMs. We evaluate different approaches for consistency, quality, and ease of implementation.
Extraction Challenges¶
1. Format Consistency¶
- LLMs may deviate from requested output format
- JSON syntax errors (missing quotes, trailing commas)
- Markdown formatting inconsistencies
- Extra conversational text before/after desired output
2. Content Quality¶
- Missing required fields
- Hallucinated information not in transcript
- Incomplete extractions
- Paraphrasing instead of exact quotes
3. Scalability¶
- Long transcripts exceed context windows
- Multiple templates require multiple API calls
- Cost accumulation with many extractions
- Rate limiting on bulk processing
Pattern 1: Zero-Shot Extraction¶
Description¶
Direct instruction without examples, relying on LLM's general knowledge.
Example Prompt¶
Extract the main topics discussed in this podcast transcript.
Transcript:
{{ transcript }}
Provide your response as a JSON object with a "topics" array. Each topic should have "name" and "description" fields.
Pros¶
✅ Simple to implement ✅ No example data needed ✅ Works well for common tasks ✅ Shorter prompts (lower cost)
Cons¶
❌ Inconsistent format adherence ❌ May miss nuanced requirements ❌ Quality varies by task complexity ❌ Requires explicit format instructions
Quality Score: 6/10¶
Consistency: 70%¶
Best For: Simple, well-defined extractions¶
Pattern 2: Few-Shot Extraction¶
Description¶
Provide 1-3 examples of desired input/output to guide LLM behavior.
Example Prompt¶
Extract notable quotes from podcast transcripts with speaker and timestamp.
Example 1:
Input: "[00:05:30] John: I think the key to productivity is focus. You can't multitask effectively."
Output:
{
"quotes": [
{
"text": "I think the key to productivity is focus. You can't multitask effectively.",
"speaker": "John",
"timestamp": "00:05:30"
}
]
}
Example 2:
Input: "[00:12:45] Sarah: The research shows that deep work produces better results. [00:13:10] As Cal Newport says, 'Focus is the new IQ.'"
Output:
{
"quotes": [
{
"text": "The research shows that deep work produces better results.",
"speaker": "Sarah",
"timestamp": "00:12:45"
},
{
"text": "Focus is the new IQ.",
"speaker": "Cal Newport (quoted by Sarah)",
"timestamp": "00:13:10"
}
]
}
Now extract quotes from this transcript:
{{ transcript }}
Pros¶
✅ Much more consistent ✅ Shows exact desired format ✅ Handles edge cases in examples ✅ Better quality output
Cons¶
❌ Longer prompts (higher cost) ❌ Need to create good examples ❌ Examples add ~200-500 tokens ❌ May over-fit to example patterns
Quality Score: 8.5/10¶
Consistency: 90%¶
Best For: Complex extractions, format-sensitive tasks¶
Pattern 3: JSON Mode / Structured Output¶
Description¶
Use LLM's native structured output mode (Claude, GPT-4, Gemini 1.5+).
Example (Claude)¶
response = client.messages.create(
model="claude-sonnet-4-5-20250929",
max_tokens=2000,
messages=[{
"role": "user",
"content": f"""Extract tools mentioned in this tech podcast.
Transcript:
{transcript}
Respond with a JSON object matching this schema:
{{
"tools": [
{{
"name": "string",
"category": "string",
"context": "string"
}}
]
}}"""
}],
# Note: Claude doesn't have explicit JSON mode yet,
# but follows JSON instructions very reliably
)
Example (GPT-4 with JSON Mode)¶
response = openai.chat.completions.create(
model="gpt-4-turbo",
response_format={"type": "json_object"},
messages=[{
"role": "user",
"content": f"Extract tools from: {transcript}"
}]
)
Pros¶
✅ Guaranteed valid JSON ✅ No parsing errors ✅ Most consistent format ✅ Clean separation of data/format
Cons¶
❌ Not all models support it ❌ May still miss required fields ❌ Limited to JSON (not markdown) ❌ Less flexible for narrative output
Quality Score: 9/10¶
Consistency: 98%¶
Best For: Structured data extraction (entities, lists)¶
Pattern 4: Chain-of-Thought Extraction¶
Description¶
Ask LLM to reason through the extraction step-by-step before producing output.
Example Prompt¶
Extract the key concepts from this podcast transcript.
Transcript:
{{ transcript }}
Think through this step-by-step:
1. First, identify the main topics discussed
2. For each topic, extract the core concept
3. Note any definitions or explanations provided
4. Identify relationships between concepts
Then provide your final answer as a JSON object with:
- concepts: array of {name, definition, context}
Pros¶
✅ Higher quality reasoning ✅ Better handling of complex content ✅ Fewer hallucinations ✅ More thorough extraction
Cons¶
❌ Longer outputs (higher cost) ❌ Slower response times ❌ Need to parse reasoning from answer ❌ Verbosity adds tokens
Quality Score: 8/10¶
Consistency: 80%¶
Best For: Complex analysis, ambiguous content¶
Pattern 5: Multi-Stage Extraction¶
Description¶
Break extraction into multiple steps with intermediate validation.
Example Flow¶
# Stage 1: Extract raw information
stage1_prompt = """
Identify all people mentioned in this podcast.
List them with approximate timestamps.
Transcript:
{transcript}
"""
people_raw = llm.extract(stage1_prompt)
# Stage 2: Enrich with context
stage2_prompt = """
For each person mentioned, extract:
- Full name
- Role/title (if mentioned)
- Why they were discussed
- Exact quote about them (if any)
People: {people_raw}
Transcript: {transcript}
"""
people_enriched = llm.extract(stage2_prompt)
# Stage 3: Validate and format
stage3_prompt = """
Review this extracted data for accuracy.
Remove any hallucinated information.
Format as JSON array.
Data: {people_enriched}
"""
final_output = llm.extract(stage3_prompt)
Pros¶
✅ Highest quality output ✅ Catch and fix errors mid-process ✅ More focused prompts ✅ Can parallelize stages
Cons¶
❌ Multiple API calls (expensive) ❌ Slower overall ❌ More complex orchestration ❌ Potential error propagation
Quality Score: 9.5/10¶
Consistency: 95%¶
Best For: Critical extractions, complex multi-field data¶
Pattern 6: Template-Based Extraction with Variables¶
Description¶
Use template variables to customize prompts dynamically.
Example Template¶
system_prompt: |
You are a {{ role }} analyzing a {{ podcast_type }} podcast.
Focus on extracting {{ focus_areas }}.
user_prompt_template: |
{% if include_context %}
Podcast: {{ metadata.podcast_name }}
Episode: {{ metadata.episode_title }}
{% endif %}
{{ instruction }}
Transcript:
{{ transcript }}
{% if output_format == "json" %}
Respond with valid JSON matching this schema:
{{ schema }}
{% elif output_format == "markdown" %}
Format your response as markdown with:
{{ markdown_requirements }}
{% endif %}
Instantiated Example¶
You are a technical expert analyzing a technology podcast.
Focus on extracting tools, frameworks, and best practices.
Podcast: The Changelog
Episode: Building Better Software
Extract all tools mentioned with context.
Transcript:
[...]
Respond with valid JSON matching this schema:
{
"tools": [
{"name": "string", "category": "string", "context": "string"}
]
}
Pros¶
✅ Reusable across episodes ✅ Customizable per use case ✅ Maintainable prompts ✅ Type-safe with validation
Cons¶
❌ Requires template engine (Jinja2) ❌ More complex setup ❌ Debugging harder ❌ Learning curve for template syntax
Quality Score: 8/10¶
Consistency: 85%¶
Best For: Productized extraction systems¶
Pattern 7: Hybrid Extraction (Recommended)¶
Description¶
Combine multiple patterns for optimal results: - Few-shot for guidance - JSON mode for structure - Templates for reusability - Chain-of-thought for quality
Example Implementation¶
# Template with few-shot examples
template = """
System: You are an expert podcast analyst. Extract information accurately.
Examples:
{examples}
Task: Extract {extraction_type} from this transcript.
Process:
1. Read the transcript carefully
2. Identify relevant {extraction_type}
3. For each item, note exact quotes and context
4. Verify information is actually in transcript (no hallucinations)
Transcript:
{transcript}
Respond with valid JSON:
{schema}
"""
# Use Claude with JSON-like prompting
response = client.messages.create(
model="claude-sonnet-4-5-20250929",
system=system_prompt,
max_tokens=2000,
temperature=0.2, # Lower for consistency
messages=[{
"role": "user",
"content": template.format(
examples=few_shot_examples,
extraction_type="quotes",
transcript=transcript,
schema=json_schema,
)
}]
)
# Parse and validate
try:
data = parse_json_from_response(response.content)
validate_schema(data, schema)
return ExtractedContent(content=data, confidence=1.0)
except ValidationError as e:
# Retry with more explicit instructions
return retry_extraction(template, transcript, error=e)
Pros¶
✅ Best quality + consistency balance ✅ Leverages strengths of each pattern ✅ Production-ready reliability ✅ Handles edge cases gracefully
Cons¶
❌ More complex implementation ❌ Higher initial setup cost ❌ Requires good examples
Quality Score: 9/10¶
Consistency: 92%¶
Best For: Production systems (Inkwell)¶
Comparison Matrix¶
| Pattern | Quality | Consistency | Cost | Complexity | Speed |
|---|---|---|---|---|---|
| Zero-Shot | 6/10 | 70% | Low | Low | Fast |
| Few-Shot | 8.5/10 | 90% | Medium | Low | Fast |
| JSON Mode | 9/10 | 98% | Low | Low | Fast |
| Chain-of-Thought | 8/10 | 80% | High | Medium | Slow |
| Multi-Stage | 9.5/10 | 95% | Very High | High | Slow |
| Template-Based | 8/10 | 85% | Medium | Medium | Fast |
| Hybrid | 9/10 | 92% | Medium | Medium | Fast |
Best Practices¶
1. Prompt Engineering¶
Be Specific
Provide Format Examples
Set Constraints
✅ "Only extract information explicitly stated in the transcript.
Do not infer or add information not present."
Request Thinking
2. Response Parsing¶
Extract JSON from Markdown
def parse_json_from_response(text: str) -> dict:
"""Handle LLM responses that wrap JSON in markdown"""
# Check for markdown code block
if "```json" in text:
json_str = text.split("```json")[1].split("```")[0]
elif "```" in text:
json_str = text.split("```")[1].split("```")[0]
else:
json_str = text
return json.loads(json_str.strip())
Fallback Parsing
def extract_with_fallback(response: str, schema: dict) -> dict:
"""Try multiple parsing strategies"""
strategies = [
parse_direct_json,
parse_json_from_markdown,
parse_yaml_format,
parse_with_regex,
]
for strategy in strategies:
try:
data = strategy(response)
validate_schema(data, schema)
return data
except Exception:
continue
raise ExtractionError("Could not parse response")
3. Quality Validation¶
Required Field Checking
def validate_extraction(data: dict, template: ExtractionTemplate) -> list[str]:
"""Return list of validation warnings"""
warnings = []
# Check required fields
if template.output_schema:
try:
jsonschema.validate(data, template.output_schema)
except ValidationError as e:
warnings.append(f"Schema validation: {e.message}")
# Check for hallucination indicators
if "I don't have" in str(data) or "not mentioned" in str(data):
warnings.append("Possible hallucination detected")
# Check for reasonable data
if isinstance(data, dict) and len(data) == 0:
warnings.append("Empty extraction")
return warnings
4. Cost Optimization¶
Aggressive Caching
# Cache extracted content for 30 days
cache_key = hash(episode_url + template_name + template_version)
if cached := cache.get(cache_key):
return cached
Batch Similar Extractions
# Extract multiple simple fields in one call
prompt = """
Extract from this transcript:
1. Summary (2-3 sentences)
2. Main topics (3-5 items)
3. Notable quotes (top 3)
Respond as JSON with keys: summary, topics, quotes
"""
Use Appropriate Models
# Simple extractions -> cheaper model
if template.complexity == "simple":
model = "gemini-2.0-flash" # Cheap
else:
model = "claude-sonnet-4-5" # Quality
Recommended Implementation for Inkwell¶
Default Extraction Pattern¶
Use Hybrid Approach: 1. ✅ Template-based prompts (Jinja2) 2. ✅ Few-shot examples (1-2 per template) 3. ✅ JSON mode preference (when available) 4. ✅ Schema validation (Pydantic) 5. ✅ Fallback parsing strategies 6. ✅ Quality validation with warnings
Template Structure¶
name: quotes
version: "1.0"
system_prompt: |
You are an expert at extracting notable quotes from podcast transcripts.
Focus on memorable, insightful, or important statements.
few_shot_examples:
- input: "[00:05:30] John: The key to success is persistence."
output:
quotes:
- text: "The key to success is persistence."
speaker: "John"
timestamp: "00:05:30"
user_prompt_template: |
Extract notable quotes from this transcript.
Requirements:
- Exact quotes (no paraphrasing)
- Speaker name
- Timestamp in MM:SS format
- 5-10 quotes maximum
Transcript:
{{ transcript }}
Respond with JSON matching this schema:
{{ schema }}
expected_format: json
output_schema:
type: object
properties:
quotes:
type: array
items:
type: object
required: [text, speaker, timestamp]
properties:
text: {type: string}
speaker: {type: string}
timestamp: {type: string, pattern: "^\\d+:\\d+$"}
max_tokens: 2000
temperature: 0.2
Extraction Flow¶
async def extract(template: ExtractionTemplate, transcript: str) -> ExtractedContent:
# 1. Build prompt from template
prompt = build_prompt(template, transcript)
# 2. Call LLM with appropriate settings
response = await llm_client.generate(
prompt=prompt,
max_tokens=template.max_tokens,
temperature=template.temperature,
)
# 3. Parse response with fallback strategies
try:
content = parse_response(response, template.expected_format)
except ParseError:
content = fallback_parse(response)
# 4. Validate against schema
warnings = validate_extraction(content, template)
# 5. Compute confidence score
confidence = compute_confidence(content, warnings)
return ExtractedContent(
template_name=template.name,
content=content,
warnings=warnings,
confidence=confidence,
)
Conclusion¶
Recommended Pattern: Hybrid Approach - Combines best aspects of multiple patterns - Balances quality, consistency, and cost - Production-ready reliability - Handles edge cases gracefully
Implementation Priority: 1. ✅ Template-based extraction (Unit 3) 2. ✅ Few-shot examples in templates 3. ✅ JSON mode preference for Claude 4. ✅ Schema validation with Pydantic 5. ✅ Multiple parsing strategies 6. ✅ Quality scoring and warnings 7. ✅ Aggressive caching
Success Metrics: - 90%+ format adherence - <5% extraction errors - <1% hallucinations - 30-day cache hit rate >50%