Rate Extraction System - Findings & Accuracy Improvements

Executive Summary

The rate extraction pipeline processed 959 rate card documents with the following results:

| Status       | Count | Percentage |
|--------------|-------|------------|
| Completed    | 620   | 64.7%      |
| Failed       | 256   | 26.7%      |
| Needs Review | 83    | 8.7%       |

Total rates extracted: 7,899


Pipeline Architecture

Document (PDF/PNG/JPG)
    → OCR (Mistral OCR / Vision)
    → Rate Extraction (Claude Sonnet)
    → Validation & Storage (DuckDB)

Components

  1. OCR Service (src/workers/ocr_service.py)
     • Primary: Mistral OCR API for PDFs
     • Fallback: Mistral Vision API for images
     • Final fallback: Local Ollama (llava model)

  2. Rate Extractor (src/workers/rate_extractor.py)
     • Model: Claude Sonnet 4.5
     • Structured JSON extraction with validation
     • Auto-calculates missing gross/net rates (15% commission)

  3. Document Processor (src/workers/document_processor.py)
     • Orchestrates the pipeline
     • Tracks status and logs extraction results
Confidence Distribution

| Confidence Band | Document Count |
|-----------------|----------------|
| 90-100%         | 323 (46%)      |
| 75-89%          | 263 (37%)      |
| 50-74%          | 76 (11%)       |
| Below 50%       | 41 (6%)        |

Observation: 83% of the 703 scored extractions have confidence ≥75%, indicating the model performs well when documents are readable.


Failure Analysis

Error Breakdown

| Error Type                          | Count | Percentage |
|-------------------------------------|-------|------------|
| Failed to parse extraction response | 252   | 98.4%      |
| OCR Error                           | 3     | 1.2%       |
| Other                               | 1     | 0.4%       |

Root Causes

  1. JSON Parsing Failures (252 cases)
     • Claude sometimes returns explanatory text instead of pure JSON
     • Malformed JSON due to special characters in OCR text
     • Model declines to extract when document is unclear

  2. OCR Failures (3 cases)
     • Corrupted or encrypted PDFs
     • Extremely low-resolution images
     • Unsupported file formats masquerading as PDF/PNG

  3. Empty/Short OCR (1 case)
     • Documents with minimal text content
     • Scanned images with poor contrast

Data Quality Issues

1. Duplicate Rates

Total rates: 7,899
Unique combinations: 4,953
Potential duplicates: 2,946 (37%)

Cause: Same rate card processed multiple times, or same rate appearing in multiple documents for the same station.

Recommendation: Implement deduplication logic:
  • Unique constraint on (station_id, ad_type, preemption, duration_seconds, gross_rate, effective_date)
  • Or use UPSERT to update existing rates

2. Overly Generic Values

Many rates have generic values that reduce usefulness:

| Field      | Generic Value | Count / Share |
|------------|---------------|---------------|
| ad_type    | "all"         | 2,645         |
| preemption | "all"         | 1,351         |
| slot_days  | "all"         | ~60% of rates |

Cause: When the model can't determine specific values, it defaults to "all".

Recommendations:
  • Improve the prompt to discourage "all" values
  • Add validation to reject rates where too many fields are "all"
  • Flag for human review when confidence < 0.8
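
The "too many fields are all" check could look like the following sketch (the field list and the threshold of one are assumptions, not the pipeline's current behavior):

```python
# Fields the model tends to default to the catch-all value "all".
GENERIC_FIELDS = ("ad_type", "preemption", "slot_days")

def too_generic(rate, max_all=1):
    """Flag a rate for rejection when more than max_all of the
    generic-prone fields came back as "all"."""
    all_count = sum(1 for f in GENERIC_FIELDS if rate.get(f) == "all")
    return all_count > max_all
```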

3. Missing Time Slots

Many rates lack slot_start_time and slot_end_time, reducing their value for scheduling.

Recommendation: Enhance the prompt with examples of common daypart patterns:
  • Morning Drive: 6am-10am
  • Midday: 10am-3pm
  • Afternoon Drive: 3pm-7pm
  • Evening: 7pm-12am
  • Overnight: 12am-6am
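
These dayparts can also double as defaults when a document names a slot but omits the times. A sketch (the mapping mirrors the list above; actual station schedules vary):

```python
# Common radio dayparts; "24:00" denotes end-of-day midnight.
DAYPARTS = {
    "Morning Drive":   ("06:00", "10:00"),
    "Midday":          ("10:00", "15:00"),
    "Afternoon Drive": ("15:00", "19:00"),
    "Evening":         ("19:00", "24:00"),
    "Overnight":       ("00:00", "06:00"),
}

def daypart_for(slot_name):
    """Map a named slot to default start/end times when the
    document omits them; returns None for unknown names."""
    return DAYPARTS.get(slot_name)
```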


Accuracy Improvement Recommendations

High Priority

1. Improve JSON Parsing Robustness

The current parsing attempts three methods but still fails on 252 documents.

Suggestions:
  • Constrain the output to JSON if the API supports it (response_format: { type: "json_object" } is an OpenAI-style parameter; with Claude, prefilling the assistant turn with "{" has a similar effect)
  • Use a more forgiving JSON parser (e.g., json5 or demjson3)
  • Implement multi-turn extraction: if the first attempt fails, send the error back to Claude to fix
  • Log the raw response for failed parses to identify patterns

```python
# Example: multi-turn correction after a failed JSON parse
if not parsed:
    correction_prompt = f"""
    Your previous response was not valid JSON. Here's what you returned:
    {content[:500]}

    Please return ONLY a valid JSON object with the rate data.
    """
    # Retry the extraction with correction_prompt appended to the conversation
```
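
Before spending another API call on a retry, a forgiving recovery pass often salvages the response by stripping markdown fences and surrounding prose. A standard-library-only sketch:

```python
import json
import re

def recover_json(text):
    """Best-effort JSON recovery from a model response that may wrap
    the object in prose or ```json fences. Returns None on failure."""
    # Drop markdown code fences if present.
    text = re.sub(r"```(?:json)?", "", text)
    # Keep only the outermost {...} span, discarding surrounding prose.
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        return json.loads(text[start:end + 1])
    except json.JSONDecodeError:
        return None
```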

2. Add Document Type Classification

Before extraction, classify the document:
  • Rate card (structured rates)
  • Agreement (contract terms)
  • Letter/Correspondence
  • Unsupported format

Benefits:
  • Skip non-rate-card documents immediately
  • Use document-type-specific prompts
  • Improve success rate by focusing on extractable documents
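
A cheap first pass need not call the model at all; keyword heuristics over the OCR text can filter obvious non-rate-cards (the keyword lists below are illustrative assumptions):

```python
# Keyword signals per document class; tune against real OCR output.
DOC_SIGNALS = {
    "rate_card": ("rate", "gross", "net", "daypart", "per spot"),
    "agreement": ("agreement", "hereby", "terms and conditions"),
    "letter":    ("dear", "sincerely", "regards"),
}

def classify_document(ocr_text):
    """Pick the class whose keywords appear most often; fall back
    to 'unsupported' when nothing matches."""
    text = ocr_text.lower()
    scores = {
        label: sum(text.count(kw) for kw in kws)
        for label, kws in DOC_SIGNALS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unsupported"
```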

3. Implement Confidence Thresholds

Current behavior extracts all rates regardless of confidence.

Recommendations:
  • Reject rates with confidence < 0.5
  • Flag for human review when 0.5 ≤ confidence < 0.75
  • Auto-approve only when confidence ≥ 0.75
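
The three thresholds above reduce to a small triage function (function and label names are illustrative):

```python
def triage(confidence, reject_below=0.5, review_below=0.75):
    """Route an extraction by confidence: reject, send to human
    review, or auto-approve (thresholds from the recommendation)."""
    if confidence < reject_below:
        return "reject"
    if confidence < review_below:
        return "needs_review"
    return "auto_approve"
```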

Medium Priority

4. Enhanced Prompt Engineering

The current prompt is comprehensive but could be improved:

Add Examples: Include 2-3 example rate card snippets with expected JSON output (few-shot learning).

Add Negative Examples: Show what NOT to extract:
  • "Do not create rates if you cannot find specific dollar amounts"
  • "Do not guess rates based on similar stations"

Add Station Context: Provide market information if available:
  • Nielsen DMA
  • Typical rates for market size
  • Previous years' rates for comparison

5. Validation Rules

Add post-extraction validation:

```python
def validate_rate(rate: dict) -> list[str]:
    """Post-extraction sanity checks; returns a list of warning strings."""
    issues = []

    # Rate sanity checks (rate.get avoids KeyError on missing fields)
    if rate.get('gross_rate') and rate['gross_rate'] > 500:
        issues.append("Unusually high rate (>$500)")
    if rate.get('gross_rate') and rate['gross_rate'] < 5:
        issues.append("Unusually low rate (<$5)")

    # Duration check (skipped when the field is missing)
    if rate.get('duration_seconds') and rate['duration_seconds'] not in (15, 30, 60, 90, 120):
        issues.append("Non-standard duration")

    # Time slot validation ("HH:MM" strings compare correctly)
    if rate.get('slot_start_time') and rate.get('slot_end_time'):
        if rate['slot_start_time'] >= rate['slot_end_time']:
            issues.append("Invalid time range")

    return issues
```

6. Deduplication Strategy

Implement during insertion (this requires a unique index on the conflict columns):

```sql
INSERT INTO rate (...)
ON CONFLICT (station_id, ad_type, preemption, duration_seconds, effective_date)
DO UPDATE SET
    gross_rate = EXCLUDED.gross_rate,
    updated_at = now()
WHERE rate.source_document_id != EXCLUDED.source_document_id;
```

Lower Priority

7. Human-in-the-Loop Review Interface

Build a simple review UI for:
  • Documents marked "needs_review"
  • Low-confidence extractions
  • Rates with validation warnings

8. Historical Rate Tracking

Track rate changes over time:
  • Store effective_date and expiration_date
  • Compare new rates to previous years
  • Flag significant rate changes for review
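
Flagging significant year-over-year changes is a small comparison (the 25% threshold is an assumption to be tuned):

```python
def rate_change_pct(old_rate, new_rate):
    """Percent change between rate-card years."""
    return (new_rate - old_rate) / old_rate * 100

def significant_change(old_rate, new_rate, threshold_pct=25.0):
    """Flag for review when a rate moves more than threshold_pct
    in either direction."""
    return abs(rate_change_pct(old_rate, new_rate)) > threshold_pct
```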

9. OCR Quality Scoring

Before sending to Claude, score OCR quality:
  • Character confidence (if available from OCR)
  • Text density (characters per page)
  • Presence of key terms (rate, political, candidate, etc.)

Skip extraction for documents with poor OCR quality.
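
A sketch of such a score, combining density and key terms (weights, the ~1,500 chars/page target, and the skip threshold are all assumptions; character-level confidence is omitted since not every OCR backend exposes it):

```python
def ocr_quality_score(text, page_count=1):
    """Crude OCR quality score in [0, 1] from text density and
    the presence of rate-card key terms."""
    KEY_TERMS = ("rate", "political", "candidate", "gross", "net")
    # Density: ratio of actual text to a ~1,500 chars/page target, capped at 1.
    density = min(len(text) / (page_count * 1500), 1.0)
    hits = sum(1 for t in KEY_TERMS if t in text.lower())
    return 0.6 * density + 0.4 * (hits / len(KEY_TERMS))

def should_extract(text, page_count=1, threshold=0.3):
    """Skip extraction when the OCR output looks too poor to parse."""
    return ocr_quality_score(text, page_count) >= threshold
```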


Metrics to Track

Extraction Quality

  • Success rate by document type
  • Confidence score distribution
  • Rates per document average

Data Quality

  • Duplicate rate percentage
  • Fields with "all" values
  • Missing time slot percentage

Production Monitoring

  • Extraction latency (OCR + Claude)
  • API error rates
  • Cost per document

Appendix: Sample Extracted Data

High-Quality Extraction (Confidence 0.95)

```json
{
  "station_callsign": "KAAA-AM",
  "rate_card_year": 2024,
  "rates": [
    {
      "ad_type": "candidate",
      "preemption": "non_preemptible",
      "slot_days": "m-f",
      "slot_start_time": "06:00",
      "slot_end_time": "10:00",
      "slot_name": "Morning Drive",
      "duration_seconds": 60,
      "gross_rate": 25.00,
      "net_rate": 21.25
    }
  ]
}
```

Low-Quality Extraction (Confidence 0.65)

```json
{
  "station_callsign": "WXXX-FM",
  "rates": [
    {
      "ad_type": "all",
      "preemption": "all",
      "slot_days": "all",
      "duration_seconds": 60,
      "gross_rate": 15.00
    }
  ]
}
```

Next Steps

  1. Immediate: Process the remaining 2,831 documents without the "rate" filename filter
  2. Short-term: Implement JSON parsing improvements and multi-turn correction
  3. Medium-term: Add document classification and confidence thresholds
  4. Long-term: Build human review interface for quality assurance

Generated: December 2024 · Documents Analyzed: 959 · Rates Extracted: 7,899