Structured Extraction Patterns for LLMs¶

Date: 2025-11-07 Author: Phase 3 Research Status: Complete Related: Phase 3 Plan

Overview¶

This document explores prompt engineering techniques and patterns for reliable structured content extraction from podcast transcripts using LLMs. We evaluate different approaches for consistency, quality, and ease of implementation.

Extraction Challenges¶

1. Format Consistency¶

LLMs may deviate from requested output format
JSON syntax errors (missing quotes, trailing commas)
Markdown formatting inconsistencies
Extra conversational text before/after desired output

2. Content Quality¶

Missing required fields
Hallucinated information not in transcript
Incomplete extractions
Paraphrasing instead of exact quotes

3. Scalability¶

Long transcripts exceed context windows
Multiple templates require multiple API calls
Cost accumulation with many extractions
Rate limiting on bulk processing

Pattern 1: Zero-Shot Extraction¶

Description¶

Direct instruction without examples, relying on LLM's general knowledge.

Example Prompt¶

Extract the main topics discussed in this podcast transcript.

Transcript:
{{ transcript }}

Provide your response as a JSON object with a "topics" array. Each topic should have "name" and "description" fields.

Pros¶

✅ Simple to implement ✅ No example data needed ✅ Works well for common tasks ✅ Shorter prompts (lower cost)

Cons¶

❌ Inconsistent format adherence ❌ May miss nuanced requirements ❌ Quality varies by task complexity ❌ Requires explicit format instructions

Quality Score: 6/10¶

Consistency: 70%¶

Best For: Simple, well-defined extractions¶

Pattern 2: Few-Shot Extraction¶

Description¶

Provide 1-3 examples of desired input/output to guide LLM behavior.

Example Prompt¶

Extract notable quotes from podcast transcripts with speaker and timestamp.

Example 1:
Input: "[00:05:30] John: I think the key to productivity is focus. You can't multitask effectively."
Output:
{
  "quotes": [
    {
      "text": "I think the key to productivity is focus. You can't multitask effectively.",
      "speaker": "John",
      "timestamp": "00:05:30"
    }
  ]
}

Example 2:
Input: "[00:12:45] Sarah: The research shows that deep work produces better results. [00:13:10] As Cal Newport says, 'Focus is the new IQ.'"
Output:
{
  "quotes": [
    {
      "text": "The research shows that deep work produces better results.",
      "speaker": "Sarah",
      "timestamp": "00:12:45"
    },
    {
      "text": "Focus is the new IQ.",
      "speaker": "Cal Newport (quoted by Sarah)",
      "timestamp": "00:13:10"
    }
  ]
}

Now extract quotes from this transcript:
{{ transcript }}

Pros¶

✅ Much more consistent ✅ Shows exact desired format ✅ Handles edge cases in examples ✅ Better quality output

Cons¶

❌ Longer prompts (higher cost) ❌ Need to create good examples ❌ Examples add ~200-500 tokens ❌ May over-fit to example patterns

Quality Score: 8.5/10¶

Consistency: 90%¶

Best For: Complex extractions, format-sensitive tasks¶

Pattern 3: JSON Mode / Structured Output¶

Description¶

Use LLM's native structured output mode (Claude, GPT-4, Gemini 1.5+).

Example (Claude)¶

response = client.messages.create(
    model="claude-sonnet-4-5-20250929",
    max_tokens=2000,
    messages=[{
        "role": "user",
        "content": f"""Extract tools mentioned in this tech podcast.

Transcript:
{transcript}

Respond with a JSON object matching this schema:
{{
  "tools": [
    {{
      "name": "string",
      "category": "string",
      "context": "string"
    }}
  ]
}}"""
    }],
    # Note: Claude doesn't have explicit JSON mode yet,
    # but follows JSON instructions very reliably
)

Example (GPT-4 with JSON Mode)¶

response = openai.chat.completions.create(
    model="gpt-4-turbo",
    response_format={"type": "json_object"},
    messages=[{
        "role": "user",
        "content": f"Extract tools from: {transcript}"
    }]
)

Pros¶

✅ Guaranteed valid JSON ✅ No parsing errors ✅ Most consistent format ✅ Clean separation of data/format

Cons¶

❌ Not all models support it ❌ May still miss required fields ❌ Limited to JSON (not markdown) ❌ Less flexible for narrative output

Quality Score: 9/10¶

Consistency: 98%¶

Best For: Structured data extraction (entities, lists)¶

Pattern 4: Chain-of-Thought Extraction¶

Description¶

Ask LLM to reason through the extraction step-by-step before producing output.

Example Prompt¶

Extract the key concepts from this podcast transcript.

Transcript:
{{ transcript }}

Think through this step-by-step:
1. First, identify the main topics discussed
2. For each topic, extract the core concept
3. Note any definitions or explanations provided
4. Identify relationships between concepts

Then provide your final answer as a JSON object with:
- concepts: array of {name, definition, context}

Pros¶

✅ Higher quality reasoning ✅ Better handling of complex content ✅ Fewer hallucinations ✅ More thorough extraction

Cons¶

❌ Longer outputs (higher cost) ❌ Slower response times ❌ Need to parse reasoning from answer ❌ Verbosity adds tokens

Quality Score: 8/10¶

Consistency: 80%¶

Best For: Complex analysis, ambiguous content¶

Pattern 5: Multi-Stage Extraction¶

Description¶

Break extraction into multiple steps with intermediate validation.

Example Flow¶

# Stage 1: Extract raw information
stage1_prompt = """
Identify all people mentioned in this podcast.
List them with approximate timestamps.

Transcript:
{transcript}
"""

people_raw = llm.extract(stage1_prompt)

# Stage 2: Enrich with context
stage2_prompt = """
For each person mentioned, extract:
- Full name
- Role/title (if mentioned)
- Why they were discussed
- Exact quote about them (if any)

People: {people_raw}
Transcript: {transcript}
"""

people_enriched = llm.extract(stage2_prompt)

# Stage 3: Validate and format
stage3_prompt = """
Review this extracted data for accuracy.
Remove any hallucinated information.
Format as JSON array.

Data: {people_enriched}
"""

final_output = llm.extract(stage3_prompt)

Pros¶

✅ Highest quality output ✅ Catch and fix errors mid-process ✅ More focused prompts ✅ Can parallelize stages

Cons¶

❌ Multiple API calls (expensive) ❌ Slower overall ❌ More complex orchestration ❌ Potential error propagation

Quality Score: 9.5/10¶

Consistency: 95%¶

Best For: Critical extractions, complex multi-field data¶

Pattern 6: Template-Based Extraction with Variables¶

Description¶

Use template variables to customize prompts dynamically.

Example Template¶

system_prompt: |
  You are a {{ role }} analyzing a {{ podcast_type }} podcast.
  Focus on extracting {{ focus_areas }}.

user_prompt_template: |
  {% if include_context %}
  Podcast: {{ metadata.podcast_name }}
  Episode: {{ metadata.episode_title }}
  {% endif %}

  {{ instruction }}

  Transcript:
  {{ transcript }}

  {% if output_format == "json" %}
  Respond with valid JSON matching this schema:
  {{ schema }}
  {% elif output_format == "markdown" %}
  Format your response as markdown with:
  {{ markdown_requirements }}
  {% endif %}

Instantiated Example¶

You are a technical expert analyzing a technology podcast.
Focus on extracting tools, frameworks, and best practices.

Podcast: The Changelog
Episode: Building Better Software

Extract all tools mentioned with context.

Transcript:
[...]

Respond with valid JSON matching this schema:
{
  "tools": [
    {"name": "string", "category": "string", "context": "string"}
  ]
}

Pros¶

✅ Reusable across episodes ✅ Customizable per use case ✅ Maintainable prompts ✅ Type-safe with validation

Cons¶

❌ Requires template engine (Jinja2) ❌ More complex setup ❌ Debugging harder ❌ Learning curve for template syntax

Quality Score: 8/10¶

Consistency: 85%¶

Best For: Productized extraction systems¶

Pattern 7: Hybrid Extraction (Recommended)¶

Description¶

Combine multiple patterns for optimal results: - Few-shot for guidance - JSON mode for structure - Templates for reusability - Chain-of-thought for quality

Example Implementation¶

# Template with few-shot examples
template = """
System: You are an expert podcast analyst. Extract information accurately.

Examples:
{examples}

Task: Extract {extraction_type} from this transcript.

Process:
1. Read the transcript carefully
2. Identify relevant {extraction_type}
3. For each item, note exact quotes and context
4. Verify information is actually in transcript (no hallucinations)

Transcript:
{transcript}

Respond with valid JSON:
{schema}
"""

# Use Claude with JSON-like prompting
response = client.messages.create(
    model="claude-sonnet-4-5-20250929",
    system=system_prompt,
    max_tokens=2000,
    temperature=0.2,  # Lower for consistency
    messages=[{
        "role": "user",
        "content": template.format(
            examples=few_shot_examples,
            extraction_type="quotes",
            transcript=transcript,
            schema=json_schema,
        )
    }]
)

# Parse and validate
try:
    data = parse_json_from_response(response.content)
    validate_schema(data, schema)
    return ExtractedContent(content=data, confidence=1.0)
except ValidationError as e:
    # Retry with more explicit instructions
    return retry_extraction(template, transcript, error=e)

Pros¶

✅ Best quality + consistency balance ✅ Leverages strengths of each pattern ✅ Production-ready reliability ✅ Handles edge cases gracefully

Cons¶

❌ More complex implementation ❌ Higher initial setup cost ❌ Requires good examples

Quality Score: 9/10¶

Consistency: 92%¶

Best For: Production systems (Inkwell)¶

Comparison Matrix¶

Pattern	Quality	Consistency	Cost	Complexity	Speed
Zero-Shot	6/10	70%	Low	Low	Fast
Few-Shot	8.5/10	90%	Medium	Low	Fast
JSON Mode	9/10	98%	Low	Low	Fast
Chain-of-Thought	8/10	80%	High	Medium	Slow
Multi-Stage	9.5/10	95%	Very High	High	Slow
Template-Based	8/10	85%	Medium	Medium	Fast
Hybrid	9/10	92%	Medium	Medium	Fast

Best Practices¶

1. Prompt Engineering¶

Be Specific

❌ "Extract the main ideas"
✅ "Extract 3-5 key concepts with definitions and examples"

Provide Format Examples

✅ "Format timestamps as MM:SS (e.g., 05:30, not 5:30 or 00:05:30)"

Set Constraints

✅ "Only extract information explicitly stated in the transcript.
    Do not infer or add information not present."

Request Thinking

✅ "Before extracting, briefly note what makes each quote significant."

2. Response Parsing¶

Extract JSON from Markdown

def parse_json_from_response(text: str) -> dict:
    """Handle LLM responses that wrap JSON in markdown"""
    # Check for markdown code block
    if "```json" in text:
        json_str = text.split("```json")[1].split("```")[0]
    elif "```" in text:
        json_str = text.split("```")[1].split("```")[0]
    else:
        json_str = text

    return json.loads(json_str.strip())

Fallback Parsing

def extract_with_fallback(response: str, schema: dict) -> dict:
    """Try multiple parsing strategies"""
    strategies = [
        parse_direct_json,
        parse_json_from_markdown,
        parse_yaml_format,
        parse_with_regex,
    ]

    for strategy in strategies:
        try:
            data = strategy(response)
            validate_schema(data, schema)
            return data
        except Exception:
            continue

    raise ExtractionError("Could not parse response")

3. Quality Validation¶

Required Field Checking

href="#__codelineno-15-1">def validate_extraction(data: dict, template: ExtractionTemplate) -> list[str]: class="w"> """Return list of validation warnings""" warnings = [] # Check required fields if template.output_schema: try: jsonschema.validate(data, template.output_schema) except ValidationError as e: warnings.append(f"Schema validation: {e.message}") # Check for hallucination indicators if "I don't have" in str(data) or "not mentioned" in str(data): warnings.append("Possible hallucination detected") # Check for reasonable data if isinstance(data, dict) and len(data) == 0: warnings.append("Empty extraction") return warnings

4. Cost Optimization¶

Aggressive Caching

# Cache extracted content for 30 days
cache_key = hash(episode_url + template_name + template_version)
if cached := cache.get(cache_key):
    return cached

Batch Similar Extractions

# Extract multiple simple fields in one call
prompt = """
Extract from this transcript:
1. Summary (2-3 sentences)
2. Main topics (3-5 items)
3. Notable quotes (top 3)

Respond as JSON with keys: summary, topics, quotes
"""

Use Appropriate Models

# Simple extractions -> cheaper model
if template.complexity == "simple":
    model = "gemini-2.0-flash"  # Cheap
else:
    model = "claude-sonnet-4-5"  # Quality

Recommended Implementation for Inkwell¶

Default Extraction Pattern¶

Use Hybrid Approach: 1. ✅ Template-based prompts (Jinja2) 2. ✅ Few-shot examples (1-2 per template) 3. ✅ JSON mode preference (when available) 4. ✅ Schema validation (Pydantic) 5. ✅ Fallback parsing strategies 6. ✅ Quality validation with warnings

Template Structure¶

name: quotes
version: "1.0"

system_prompt: |
  You are an expert at extracting notable quotes from podcast transcripts.
  Focus on memorable, insightful, or important statements.

few_shot_examples:
  - input: "[00:05:30] John: The key to success is persistence."
    output:
      quotes:
        - text: "The key to success is persistence."
          speaker: "John"
          timestamp: "00:05:30"

user_prompt_template: |
  Extract notable quotes from this transcript.

  Requirements:
  - Exact quotes (no paraphrasing)
  - Speaker name
  - Timestamp in MM:SS format
  - 5-10 quotes maximum

  Transcript:
  {{ transcript }}

  Respond with JSON matching this schema:
  {{ schema }}

expected_format: json
output_schema:
  type: object
  properties:
    quotes:
      type: array
      items:
        type: object
        required: [text, speaker, timestamp]
        properties:
          text: {type: string}
          speaker: {type: string}
          timestamp: {type: string, pattern: "^\\d+:\\d+$"}

max_tokens: 2000
temperature: 0.2

Extraction Flow¶

async def extract(template: ExtractionTemplate, transcript: str) -> ExtractedContent:
    # 1. Build prompt from template
    prompt = build_prompt(template, transcript)

    # 2. Call LLM with appropriate settings
    response = await llm_client.generate(
        prompt=prompt,
        max_tokens=template.max_tokens,
        temperature=template.temperature,
    )

    # 3. Parse response with fallback strategies
    try:
        content = parse_response(response, template.expected_format)
    except ParseError:
        content = fallback_parse(response)

    # 4. Validate against schema
    warnings = validate_extraction(content, template)

    # 5. Compute confidence score
    confidence = compute_confidence(content, warnings)

    return ExtractedContent(
        template_name=template.name,
        content=content,
        warnings=warnings,
        confidence=confidence,
    )

Conclusion¶

Recommended Pattern: Hybrid Approach - Combines best aspects of multiple patterns - Balances quality, consistency, and cost - Production-ready reliability - Handles edge cases gracefully

Implementation Priority: 1. ✅ Template-based extraction (Unit 3) 2. ✅ Few-shot examples in templates 3. ✅ JSON mode preference for Claude 4. ✅ Schema validation with Pydantic 5. ✅ Multiple parsing strategies 6. ✅ Quality scoring and warnings 7. ✅ Aggressive caching

Success Metrics: - 90%+ format adherence - <5% extraction errors - <1% hallucinations - 30-day cache hit rate >50%

Structured Extraction Patterns for LLMs¶

Overview¶

Extraction Challenges¶

1. Format Consistency¶

2. Content Quality¶

3. Scalability¶

Pattern 1: Zero-Shot Extraction¶

Description¶

Example Prompt¶

Pros¶

Cons¶

Quality Score: 6/10¶

Consistency: 70%¶

Best For: Simple, well-defined extractions¶

Pattern 2: Few-Shot Extraction¶

Description¶

Example Prompt¶

Pros¶

Cons¶

Quality Score: 8.5/10¶

Consistency: 90%¶

Best For: Complex extractions, format-sensitive tasks¶

Pattern 3: JSON Mode / Structured Output¶

Description¶

Example (Claude)¶

Example (GPT-4 with JSON Mode)¶

Pros¶

Cons¶

Quality Score: 9/10¶

Consistency: 98%¶

Best For: Structured data extraction (entities, lists)¶

Pattern 4: Chain-of-Thought Extraction¶

Description¶

Example Prompt¶

Pros¶

Cons¶

Quality Score: 8/10¶

Consistency: 80%¶

Best For: Complex analysis, ambiguous content¶

Pattern 5: Multi-Stage Extraction¶

Description¶

Example Flow¶

Pros¶

Cons¶

Quality Score: 9.5/10¶

Consistency: 95%¶

Best For: Critical extractions, complex multi-field data¶

Pattern 6: Template-Based Extraction with Variables¶

Description¶

Example Template¶

Instantiated Example¶

Pros¶

Cons¶

Quality Score: 8/10¶

Consistency: 85%¶

Best For: Productized extraction systems¶

Pattern 7: Hybrid Extraction (Recommended)¶

Description¶

Example Implementation¶

Pros¶

Cons¶

Quality Score: 9/10¶

Consistency: 92%¶

Best For: Production systems (Inkwell)¶

Comparison Matrix¶

Best Practices¶

1. Prompt Engineering¶

2. Response Parsing¶

3. Quality Validation¶

4. Cost Optimization¶

Recommended Implementation for Inkwell¶

Default Extraction Pattern¶

Template Structure¶

Extraction Flow¶

Conclusion¶

References¶