Skip to content

Structured Extraction Patterns for LLMs

Date: 2025-11-07 Author: Phase 3 Research Status: Complete Related: Phase 3 Plan

Overview

This document explores prompt engineering techniques and patterns for reliable structured content extraction from podcast transcripts using LLMs. We evaluate different approaches for consistency, quality, and ease of implementation.


Extraction Challenges

1. Format Consistency

  • LLMs may deviate from requested output format
  • JSON syntax errors (missing quotes, trailing commas)
  • Markdown formatting inconsistencies
  • Extra conversational text before/after desired output

2. Content Quality

  • Missing required fields
  • Hallucinated information not in transcript
  • Incomplete extractions
  • Paraphrasing instead of exact quotes

3. Scalability

  • Long transcripts exceed context windows
  • Multiple templates require multiple API calls
  • Cost accumulation with many extractions
  • Rate limiting on bulk processing

Pattern 1: Zero-Shot Extraction

Description

Direct instruction without examples, relying on LLM's general knowledge.

Example Prompt

Extract the main topics discussed in this podcast transcript.

Transcript:
{{ transcript }}

Provide your response as a JSON object with a "topics" array. Each topic should have "name" and "description" fields.

Pros

✅ Simple to implement ✅ No example data needed ✅ Works well for common tasks ✅ Shorter prompts (lower cost)

Cons

❌ Inconsistent format adherence ❌ May miss nuanced requirements ❌ Quality varies by task complexity ❌ Requires explicit format instructions

Quality Score: 6/10

Consistency: 70%

Best For: Simple, well-defined extractions


Pattern 2: Few-Shot Extraction

Description

Provide 1-3 examples of desired input/output to guide LLM behavior.

Example Prompt

Extract notable quotes from podcast transcripts with speaker and timestamp.

Example 1:
Input: "[00:05:30] John: I think the key to productivity is focus. You can't multitask effectively."
Output:
{
  "quotes": [
    {
      "text": "I think the key to productivity is focus. You can't multitask effectively.",
      "speaker": "John",
      "timestamp": "00:05:30"
    }
  ]
}

Example 2:
Input: "[00:12:45] Sarah: The research shows that deep work produces better results. [00:13:10] As Cal Newport says, 'Focus is the new IQ.'"
Output:
{
  "quotes": [
    {
      "text": "The research shows that deep work produces better results.",
      "speaker": "Sarah",
      "timestamp": "00:12:45"
    },
    {
      "text": "Focus is the new IQ.",
      "speaker": "Cal Newport (quoted by Sarah)",
      "timestamp": "00:13:10"
    }
  ]
}

Now extract quotes from this transcript:
{{ transcript }}

Pros

✅ Much more consistent ✅ Shows exact desired format ✅ Handles edge cases in examples ✅ Better quality output

Cons

❌ Longer prompts (higher cost) ❌ Need to create good examples ❌ Examples add ~200-500 tokens ❌ May over-fit to example patterns

Quality Score: 8.5/10

Consistency: 90%

Best For: Complex extractions, format-sensitive tasks


Pattern 3: JSON Mode / Structured Output

Description

Use LLM's native structured output mode (Claude, GPT-4, Gemini 1.5+).

Example (Claude)

response = client.messages.create(
    model="claude-sonnet-4-5-20250929",
    max_tokens=2000,
    messages=[{
        "role": "user",
        "content": f"""Extract tools mentioned in this tech podcast.

Transcript:
{transcript}

Respond with a JSON object matching this schema:
{{
  "tools": [
    {{
      "name": "string",
      "category": "string",
      "context": "string"
    }}
  ]
}}"""
    }],
    # Note: Claude doesn't have explicit JSON mode yet,
    # but follows JSON instructions very reliably
)

Example (GPT-4 with JSON Mode)

response = openai.chat.completions.create(
    model="gpt-4-turbo",
    response_format={"type": "json_object"},
    messages=[{
        "role": "user",
        "content": f"Extract tools from: {transcript}"
    }]
)

Pros

✅ Guaranteed valid JSON ✅ No parsing errors ✅ Most consistent format ✅ Clean separation of data/format

Cons

❌ Not all models support it ❌ May still miss required fields ❌ Limited to JSON (not markdown) ❌ Less flexible for narrative output

Quality Score: 9/10

Consistency: 98%

Best For: Structured data extraction (entities, lists)


Pattern 4: Chain-of-Thought Extraction

Description

Ask LLM to reason through the extraction step-by-step before producing output.

Example Prompt

Extract the key concepts from this podcast transcript.

Transcript:
{{ transcript }}

Think through this step-by-step:
1. First, identify the main topics discussed
2. For each topic, extract the core concept
3. Note any definitions or explanations provided
4. Identify relationships between concepts

Then provide your final answer as a JSON object with:
- concepts: array of {name, definition, context}

Pros

✅ Higher quality reasoning ✅ Better handling of complex content ✅ Fewer hallucinations ✅ More thorough extraction

Cons

❌ Longer outputs (higher cost) ❌ Slower response times ❌ Need to parse reasoning from answer ❌ Verbosity adds tokens

Quality Score: 8/10

Consistency: 80%

Best For: Complex analysis, ambiguous content


Pattern 5: Multi-Stage Extraction

Description

Break extraction into multiple steps with intermediate validation.

Example Flow

# Stage 1: Extract raw information
stage1_prompt = """
Identify all people mentioned in this podcast.
List them with approximate timestamps.

Transcript:
{transcript}
"""

people_raw = llm.extract(stage1_prompt)

# Stage 2: Enrich with context
stage2_prompt = """
For each person mentioned, extract:
- Full name
- Role/title (if mentioned)
- Why they were discussed
- Exact quote about them (if any)

People: {people_raw}
Transcript: {transcript}
"""

people_enriched = llm.extract(stage2_prompt)

# Stage 3: Validate and format
stage3_prompt = """
Review this extracted data for accuracy.
Remove any hallucinated information.
Format as JSON array.

Data: {people_enriched}
"""

final_output = llm.extract(stage3_prompt)

Pros

✅ Highest quality output ✅ Catch and fix errors mid-process ✅ More focused prompts ✅ Can parallelize stages

Cons

❌ Multiple API calls (expensive) ❌ Slower overall ❌ More complex orchestration ❌ Potential error propagation

Quality Score: 9.5/10

Consistency: 95%

Best For: Critical extractions, complex multi-field data


Pattern 6: Template-Based Extraction with Variables

Description

Use template variables to customize prompts dynamically.

Example Template

system_prompt: |
  You are a {{ role }} analyzing a {{ podcast_type }} podcast.
  Focus on extracting {{ focus_areas }}.

user_prompt_template: |
  {% if include_context %}
  Podcast: {{ metadata.podcast_name }}
  Episode: {{ metadata.episode_title }}
  {% endif %}

  {{ instruction }}

  Transcript:
  {{ transcript }}

  {% if output_format == "json" %}
  Respond with valid JSON matching this schema:
  {{ schema }}
  {% elif output_format == "markdown" %}
  Format your response as markdown with:
  {{ markdown_requirements }}
  {% endif %}

Instantiated Example

You are a technical expert analyzing a technology podcast.
Focus on extracting tools, frameworks, and best practices.

Podcast: The Changelog
Episode: Building Better Software

Extract all tools mentioned with context.

Transcript:
[...]

Respond with valid JSON matching this schema:
{
  "tools": [
    {"name": "string", "category": "string", "context": "string"}
  ]
}

Pros

✅ Reusable across episodes ✅ Customizable per use case ✅ Maintainable prompts ✅ Type-safe with validation

Cons

❌ Requires template engine (Jinja2) ❌ More complex setup ❌ Debugging harder ❌ Learning curve for template syntax

Quality Score: 8/10

Consistency: 85%

Best For: Productized extraction systems


Description

Combine multiple patterns for optimal results: - Few-shot for guidance - JSON mode for structure - Templates for reusability - Chain-of-thought for quality

Example Implementation

# Template with few-shot examples
template = """
System: You are an expert podcast analyst. Extract information accurately.

Examples:
{examples}

Task: Extract {extraction_type} from this transcript.

Process:
1. Read the transcript carefully
2. Identify relevant {extraction_type}
3. For each item, note exact quotes and context
4. Verify information is actually in transcript (no hallucinations)

Transcript:
{transcript}

Respond with valid JSON:
{schema}
"""

# Use Claude with JSON-like prompting
response = client.messages.create(
    model="claude-sonnet-4-5-20250929",
    system=system_prompt,
    max_tokens=2000,
    temperature=0.2,  # Lower for consistency
    messages=[{
        "role": "user",
        "content": template.format(
            examples=few_shot_examples,
            extraction_type="quotes",
            transcript=transcript,
            schema=json_schema,
        )
    }]
)

# Parse and validate
try:
    data = parse_json_from_response(response.content)
    validate_schema(data, schema)
    return ExtractedContent(content=data, confidence=1.0)
except ValidationError as e:
    # Retry with more explicit instructions
    return retry_extraction(template, transcript, error=e)

Pros

✅ Best quality + consistency balance ✅ Leverages strengths of each pattern ✅ Production-ready reliability ✅ Handles edge cases gracefully

Cons

❌ More complex implementation ❌ Higher initial setup cost ❌ Requires good examples

Quality Score: 9/10

Consistency: 92%

Best For: Production systems (Inkwell)


Comparison Matrix

Pattern Quality Consistency Cost Complexity Speed
Zero-Shot 6/10 70% Low Low Fast
Few-Shot 8.5/10 90% Medium Low Fast
JSON Mode 9/10 98% Low Low Fast
Chain-of-Thought 8/10 80% High Medium Slow
Multi-Stage 9.5/10 95% Very High High Slow
Template-Based 8/10 85% Medium Medium Fast
Hybrid 9/10 92% Medium Medium Fast

Best Practices

1. Prompt Engineering

Be Specific

❌ "Extract the main ideas"
✅ "Extract 3-5 key concepts with definitions and examples"

Provide Format Examples

✅ "Format timestamps as MM:SS (e.g., 05:30, not 5:30 or 00:05:30)"

Set Constraints

✅ "Only extract information explicitly stated in the transcript.
    Do not infer or add information not present."

Request Thinking

✅ "Before extracting, briefly note what makes each quote significant."

2. Response Parsing

Extract JSON from Markdown

def parse_json_from_response(text: str) -> dict:
    """Handle LLM responses that wrap JSON in markdown"""
    # Check for markdown code block
    if "```json" in text:
        json_str = text.split("```json")[1].split("```")[0]
    elif "```" in text:
        json_str = text.split("```")[1].split("```")[0]
    else:
        json_str = text

    return json.loads(json_str.strip())

Fallback Parsing

def extract_with_fallback(response: str, schema: dict) -> dict:
    """Try multiple parsing strategies"""
    strategies = [
        parse_direct_json,
        parse_json_from_markdown,
        parse_yaml_format,
        parse_with_regex,
    ]

    for strategy in strategies:
        try:
            data = strategy(response)
            validate_schema(data, schema)
            return data
        except Exception:
            continue

    raise ExtractionError("Could not parse response")

3. Quality Validation

Required Field Checking

def validate_extraction(data: dict, template: ExtractionTemplate) -> list[str]:
    """Return list of validation warnings"""
    warnings = []

    # Check required fields
    if template.output_schema:
        try:
            jsonschema.validate(data, template.output_schema)
        except ValidationError as e:
            warnings.append(f"Schema validation: {e.message}")

    # Check for hallucination indicators
    if "I don't have" in str(data) or "not mentioned" in str(data):
        warnings.append("Possible hallucination detected")

    # Check for reasonable data
    if isinstance(data, dict) and len(data) == 0:
        warnings.append("Empty extraction")

    return warnings

4. Cost Optimization

Aggressive Caching

# Cache extracted content for 30 days
cache_key = hash(episode_url + template_name + template_version)
if cached := cache.get(cache_key):
    return cached

Batch Similar Extractions

# Extract multiple simple fields in one call
prompt = """
Extract from this transcript:
1. Summary (2-3 sentences)
2. Main topics (3-5 items)
3. Notable quotes (top 3)

Respond as JSON with keys: summary, topics, quotes
"""

Use Appropriate Models

# Simple extractions -> cheaper model
if template.complexity == "simple":
    model = "gemini-2.0-flash"  # Cheap
else:
    model = "claude-sonnet-4-5"  # Quality


Default Extraction Pattern

Use Hybrid Approach: 1. ✅ Template-based prompts (Jinja2) 2. ✅ Few-shot examples (1-2 per template) 3. ✅ JSON mode preference (when available) 4. ✅ Schema validation (Pydantic) 5. ✅ Fallback parsing strategies 6. ✅ Quality validation with warnings

Template Structure

name: quotes
version: "1.0"

system_prompt: |
  You are an expert at extracting notable quotes from podcast transcripts.
  Focus on memorable, insightful, or important statements.

few_shot_examples:
  - input: "[00:05:30] John: The key to success is persistence."
    output:
      quotes:
        - text: "The key to success is persistence."
          speaker: "John"
          timestamp: "00:05:30"

user_prompt_template: |
  Extract notable quotes from this transcript.

  Requirements:
  - Exact quotes (no paraphrasing)
  - Speaker name
  - Timestamp in MM:SS format
  - 5-10 quotes maximum

  Transcript:
  {{ transcript }}

  Respond with JSON matching this schema:
  {{ schema }}

expected_format: json
output_schema:
  type: object
  properties:
    quotes:
      type: array
      items:
        type: object
        required: [text, speaker, timestamp]
        properties:
          text: {type: string}
          speaker: {type: string}
          timestamp: {type: string, pattern: "^\\d+:\\d+$"}

max_tokens: 2000
temperature: 0.2

Extraction Flow

async def extract(template: ExtractionTemplate, transcript: str) -> ExtractedContent:
    # 1. Build prompt from template
    prompt = build_prompt(template, transcript)

    # 2. Call LLM with appropriate settings
    response = await llm_client.generate(
        prompt=prompt,
        max_tokens=template.max_tokens,
        temperature=template.temperature,
    )

    # 3. Parse response with fallback strategies
    try:
        content = parse_response(response, template.expected_format)
    except ParseError:
        content = fallback_parse(response)

    # 4. Validate against schema
    warnings = validate_extraction(content, template)

    # 5. Compute confidence score
    confidence = compute_confidence(content, warnings)

    return ExtractedContent(
        template_name=template.name,
        content=content,
        warnings=warnings,
        confidence=confidence,
    )

Conclusion

Recommended Pattern: Hybrid Approach - Combines best aspects of multiple patterns - Balances quality, consistency, and cost - Production-ready reliability - Handles edge cases gracefully

Implementation Priority: 1. ✅ Template-based extraction (Unit 3) 2. ✅ Few-shot examples in templates 3. ✅ JSON mode preference for Claude 4. ✅ Schema validation with Pydantic 5. ✅ Multiple parsing strategies 6. ✅ Quality scoring and warnings 7. ✅ Aggressive caching

Success Metrics: - 90%+ format adherence - <5% extraction errors - <1% hallucinations - 30-day cache hit rate >50%


References