Rate Extraction System - Findings & Accuracy Improvements¶
Executive Summary¶
The rate extraction pipeline processed 959 rate card documents with the following results:
| Status | Count | Percentage |
|---|---|---|
| Completed | 620 | 64.7% |
| Failed | 256 | 26.7% |
| Needs Review | 83 | 8.7% |
Total rates extracted: 7,899
Pipeline Architecture¶
Document (PDF/PNG/JPG)
→ OCR (Mistral OCR / Vision)
→ Rate Extraction (Claude Sonnet)
→ Validation & Storage (DuckDB)
Components¶
- OCR Service (`src/workers/ocr_service.py`)
  - Primary: Mistral OCR API for PDFs
  - Fallback: Mistral Vision API for images
  - Final fallback: Local Ollama (llava model)
- Rate Extractor (`src/workers/rate_extractor.py`)
  - Model: Claude Sonnet 4.5
  - Structured JSON extraction with validation
  - Auto-calculates missing gross/net rates (15% commission; see the sketch after this list)
- Document Processor (`src/workers/document_processor.py`)
  - Orchestrates the pipeline
  - Tracks status and logs extraction results
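The gross/net back-fill is a fixed-percentage calculation. A minimal sketch, assuming the flat 15% agency commission stated above (the function name is illustrative):

```python
# Sketch of the gross/net back-fill, assuming a flat 15% agency commission.
COMMISSION = 0.15

def fill_missing_rates(gross_rate: float | None, net_rate: float | None) -> tuple[float | None, float | None]:
    """Derive whichever of gross/net is missing from the other."""
    if gross_rate is not None and net_rate is None:
        net_rate = round(gross_rate * (1 - COMMISSION), 2)   # e.g. 25.00 -> 21.25
    elif net_rate is not None and gross_rate is None:
        gross_rate = round(net_rate / (1 - COMMISSION), 2)   # e.g. 21.25 -> 25.00
    return gross_rate, net_rate
```

This matches the sample extraction in the appendix, where a $25.00 gross rate pairs with a $21.25 net rate.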
Confidence Distribution¶
| Confidence Band | Document Count |
|---|---|
| 90-100% | 323 (46%) |
| 75-89% | 263 (37%) |
| 50-74% | 76 (11%) |
| Below 50% | 41 (6%) |
Observation: 83% of successful extractions have confidence ≥75%, indicating the model performs well when documents are readable.
Failure Analysis¶
Error Breakdown¶
| Error Type | Count | Percentage |
|---|---|---|
| Failed to parse extraction response | 252 | 98.4% |
| OCR Error | 3 | 1.2% |
| Other | 1 | 0.4% |
Root Causes¶
- JSON Parsing Failures (252 cases)
  - Claude sometimes returns explanatory text instead of pure JSON
  - Malformed JSON due to special characters in OCR text
  - Model declines to extract when document is unclear
- OCR Failures (3 cases)
  - Corrupted or encrypted PDFs
  - Extremely low-resolution images
  - Unsupported file formats masquerading as PDF/PNG
- Empty/Short OCR (1 case)
  - Documents with minimal text content
  - Scanned images with poor contrast
Data Quality Issues¶
1. Duplicate Rates¶
Cause: Same rate card processed multiple times, or same rate appearing in multiple documents for the same station.
Recommendation: Implement deduplication logic:
- Unique constraint on (station_id, ad_type, preemption, duration_seconds, gross_rate, effective_date)
- Or use UPSERT to update existing rates
2. Overly Generic Values¶
Many rates have generic values that reduce usefulness:
| Field | Generic Value | Count |
|---|---|---|
| ad_type | "all" | 2,645 |
| preemption | "all" | 1,351 |
| slot_days | "all" | ~60% of rates |
Cause: When the model can't determine specific values, it defaults to "all".
Recommendation:
- Improve the prompt to discourage "all" values
- Add validation to reject rates where too many fields are "all" (see the sketch below)
- Flag for human review when confidence < 0.8
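A minimal validation sketch for the "too generic" check, assuming rates arrive as dicts using the field names shown in the appendix; the two-field threshold is an illustrative starting point:

```python
# Flag or reject rates that are too generic to be useful for scheduling.
# The thresholds here are illustrative starting points.
GENERIC_FIELDS = ("ad_type", "preemption", "slot_days")

def classify_generic_rate(rate: dict, confidence: float) -> str:
    generic = sum(1 for f in GENERIC_FIELDS if rate.get(f) == "all")
    if generic == len(GENERIC_FIELDS):
        return "reject"        # every descriptive field defaulted to "all"
    if generic >= 2 or confidence < 0.8:
        return "needs_review"  # flag for human review
    return "accept"
```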
3. Missing Time Slots¶
Many rates lack slot_start_time and slot_end_time, reducing their value for scheduling.
Recommendation: Enhance the prompt with examples of common daypart patterns (see the sketch after this list):
- Morning Drive: 6am-10am
- Midday: 10am-3pm
- Afternoon Drive: 3pm-7pm
- Evening: 7pm-12am
- Overnight: 12am-6am
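One way to carry these dayparts into code, as a sketch: the mapping simply restates the list above, and the back-fill helper is an assumption about how the pipeline might use it when only a slot name is extracted.

```python
# Common radio dayparts, usable as prompt examples or to backfill
# slot_start_time / slot_end_time when only a slot_name was extracted.
DAYPARTS = {
    "Morning Drive":   ("06:00", "10:00"),
    "Midday":          ("10:00", "15:00"),
    "Afternoon Drive": ("15:00", "19:00"),
    "Evening":         ("19:00", "00:00"),
    "Overnight":       ("00:00", "06:00"),
}

def backfill_slot_times(rate: dict) -> dict:
    """Fill missing slot times from a recognized slot_name, if possible."""
    if not rate.get("slot_start_time") and rate.get("slot_name") in DAYPARTS:
        rate["slot_start_time"], rate["slot_end_time"] = DAYPARTS[rate["slot_name"]]
    return rate
```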
Accuracy Improvement Recommendations¶
High Priority¶
1. Improve JSON Parsing Robustness¶
The current parsing attempts three methods but still fails on 252 documents.
Suggestions (a layered parsing sketch follows the correction example below):
- Add Claude's response_format: { type: "json_object" } parameter (if supported)
- Use a more forgiving JSON parser (e.g., json5 or demjson3)
- Implement multi-turn extraction: if first attempt fails, send error back to Claude to fix
- Log the raw response for failed parses to identify patterns
```python
# Example: Multi-turn correction
if not parsed:
    correction_prompt = f"""
Your previous response was not valid JSON. Here's what you returned:
{content[:500]}
Please return ONLY a valid JSON object with the rate data.
"""
    # Retry with correction prompt
```
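Before falling back to a correction turn, a more forgiving first pass often recovers the JSON on its own. A minimal sketch of that layered parse: strip markdown fences, try a strict parse, then fall back to the outermost brace-delimited span (the helper name is illustrative):

```python
import json
import re

def parse_rate_json(content: str) -> dict | None:
    """Best-effort extraction of a JSON object from a model response."""
    # Strip markdown code fences if the model wrapped its answer in them.
    cleaned = re.sub(r"^`{3}(?:json)?\s*|\s*`{3}$", "", content.strip())

    # Strict parse first.
    try:
        return json.loads(cleaned)
    except json.JSONDecodeError:
        pass

    # Fall back to the outermost brace-delimited span, which drops any
    # explanatory prose the model added before or after the JSON.
    match = re.search(r"\{.*\}", cleaned, flags=re.DOTALL)
    if match:
        try:
            return json.loads(match.group(0))
        except json.JSONDecodeError:
            pass

    return None  # caller can then trigger the multi-turn correction above
```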
2. Add Document Type Classification¶
Before extraction, classify the document (see the sketch below):
- Rate card (structured rates)
- Agreement (contract terms)
- Letter/Correspondence
- Unsupported format

Benefits:
- Skip non-rate-card documents immediately
- Use document-type-specific prompts
- Improve success rate by focusing on extractable documents
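A lightweight keyword-heuristic sketch of the classifier (a model-based classifier could expose the same interface); the keyword lists and length cutoff are illustrative assumptions:

```python
# Pre-filter: route obvious non-rate-card documents away from the more
# expensive Claude extraction step. Keyword lists are illustrative.
RATE_CARD_HINTS = ("rate card", "gross rate", "net rate", "per spot", "daypart")
AGREEMENT_HINTS = ("agreement", "terms and conditions", "contract")

def classify_document(ocr_text: str) -> str:
    text = ocr_text.lower()
    if len(text.strip()) < 100:
        return "unsupported"          # too little text to extract from
    if any(k in text for k in RATE_CARD_HINTS):
        return "rate_card"
    if any(k in text for k in AGREEMENT_HINTS):
        return "agreement"
    return "letter_or_correspondence"
```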
3. Implement Confidence Thresholds¶
Current behavior extracts all rates regardless of confidence.
Recommendations (see the routing sketch below):
- Reject rates with confidence < 0.5
- Flag for human review when 0.5 ≤ confidence < 0.75
- Auto-approve only when confidence ≥ 0.75
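A direct translation of these thresholds, as a sketch; the status strings are assumed to match the pipeline's existing states:

```python
# Map extraction confidence to a pipeline status. Thresholds follow the
# recommendation above; status strings are assumed to match existing ones.
def route_by_confidence(confidence: float) -> str:
    if confidence < 0.5:
        return "rejected"
    if confidence < 0.75:
        return "needs_review"
    return "approved"
```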
Medium Priority¶
4. Enhanced Prompt Engineering¶
Current prompt is comprehensive but could be improved:
Add Examples: Include 2-3 example rate card snippets with expected JSON output (few-shot learning); a sketch follows below.

Add Negative Examples: Show what NOT to extract:
- "Do not create rates if you cannot find specific dollar amounts"
- "Do not guess rates based on similar stations"

Add Station Context: Provide market information if available:
- Nielsen DMA
- Typical rates for market size
- Previous years' rates for comparison
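A sketch of how one few-shot example might be spliced into the extraction prompt; the rate card line and its expected output are illustrative, not taken from a real document:

```python
# Illustrative few-shot fragment to append to the extraction prompt.
FEW_SHOT_EXAMPLE = """
Example rate card line:
    Morning Drive  M-F 6A-10A   :60  Non-preemptible  $25.00

Expected JSON for that line:
{
  "ad_type": "candidate",
  "preemption": "non_preemptible",
  "slot_days": "m-f",
  "slot_start_time": "06:00",
  "slot_end_time": "10:00",
  "slot_name": "Morning Drive",
  "duration_seconds": 60,
  "gross_rate": 25.00
}
"""
```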
5. Validation Rules¶
Add post-extraction validation:
```python
def validate_rate(rate):
    issues = []

    # Rate sanity checks
    if rate['gross_rate'] and rate['gross_rate'] > 500:
        issues.append("Unusually high rate (>$500)")
    if rate['gross_rate'] and rate['gross_rate'] < 5:
        issues.append("Unusually low rate (<$5)")

    # Duration check
    if rate['duration_seconds'] not in [15, 30, 60, 90, 120]:
        issues.append("Non-standard duration")

    # Time slot validation (note: dayparts that wrap past midnight,
    # e.g. 7pm-12am, would need special handling here)
    if rate['slot_start_time'] and rate['slot_end_time']:
        if rate['slot_start_time'] >= rate['slot_end_time']:
            issues.append("Invalid time range")

    return issues
```
6. Deduplication Strategy¶
Implement during insertion:
```sql
INSERT INTO rate (...)
ON CONFLICT (station_id, ad_type, preemption, duration_seconds, effective_date)
DO UPDATE SET
    gross_rate = EXCLUDED.gross_rate,
    updated_at = now()
WHERE rate.source_document_id != EXCLUDED.source_document_id
```
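Note that the ON CONFLICT target only takes effect if a matching UNIQUE or PRIMARY KEY constraint exists on those columns, so this pairs with the unique-constraint recommendation in the deduplication section above.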
Lower Priority¶
7. Human-in-the-Loop Review Interface¶
Build a simple review UI for:
- Documents marked "needs_review"
- Low-confidence extractions
- Rates with validation warnings
8. Historical Rate Tracking¶
Track rate changes over time:
- Store effective_date and expiration_date
- Compare new rates to previous years (see the query sketch below)
- Flag significant rate changes for review
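A sketch of the year-over-year comparison; the database path, rate_card_year column, and 25% change threshold are assumptions for illustration, not the pipeline's actual schema:

```python
import duckdb

# Flag rates that moved more than 25% against the prior year's card for the
# same station and daypart. Path, columns, and threshold are illustrative.
con = duckdb.connect("rates.duckdb")
flagged = con.execute("""
    SELECT cur.station_id, cur.slot_name,
           prev.gross_rate AS prev_rate, cur.gross_rate AS new_rate
    FROM rate AS cur
    JOIN rate AS prev
      ON prev.station_id = cur.station_id
     AND prev.ad_type = cur.ad_type
     AND prev.duration_seconds = cur.duration_seconds
     AND prev.rate_card_year = cur.rate_card_year - 1
    WHERE abs(cur.gross_rate - prev.gross_rate) > 0.25 * prev.gross_rate
""").fetchall()
```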
9. OCR Quality Scoring¶
Before sending to Claude, score OCR quality (see the sketch below):
- Character confidence (if available from OCR)
- Text density (characters per page)
- Presence of key terms (rate, political, candidate, etc.)
Skip extraction for documents with poor OCR quality.
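A heuristic sketch of such a score, assuming only the OCR text and page count are available; the weights and thresholds are illustrative starting points:

```python
# Crude OCR-quality score in [0, 1] from text density and key-term hits.
KEY_TERMS = ("rate", "political", "candidate", "gross", "net", "spot")

def ocr_quality_score(ocr_text: str, page_count: int) -> float:
    chars_per_page = len(ocr_text) / max(page_count, 1)
    density_score = min(chars_per_page / 1000, 1.0)   # ~1000+ chars/page is healthy
    term_score = sum(t in ocr_text.lower() for t in KEY_TERMS) / len(KEY_TERMS)
    return 0.6 * density_score + 0.4 * term_score

def should_extract(ocr_text: str, page_count: int, threshold: float = 0.3) -> bool:
    return ocr_quality_score(ocr_text, page_count) >= threshold
```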
Metrics to Track¶
Extraction Quality¶
- Success rate by document type
- Confidence score distribution
- Rates per document average
Data Quality¶
- Duplicate rate percentage
- Fields with "all" values
- Missing time slot percentage
Production Monitoring¶
- Extraction latency (OCR + Claude)
- API error rates
- Cost per document
Appendix: Sample Extracted Data¶
High-Quality Extraction (Confidence 0.95)¶
```json
{
  "station_callsign": "KAAA-AM",
  "rate_card_year": 2024,
  "rates": [
    {
      "ad_type": "candidate",
      "preemption": "non_preemptible",
      "slot_days": "m-f",
      "slot_start_time": "06:00",
      "slot_end_time": "10:00",
      "slot_name": "Morning Drive",
      "duration_seconds": 60,
      "gross_rate": 25.00,
      "net_rate": 21.25
    }
  ]
}
```
Low-Quality Extraction (Confidence 0.65)¶
```json
{
  "station_callsign": "WXXX-FM",
  "rates": [
    {
      "ad_type": "all",
      "preemption": "all",
      "slot_days": "all",
      "duration_seconds": 60,
      "gross_rate": 15.00
    }
  ]
}
```
Next Steps¶
- Immediate: Process the remaining 2,831 documents without the "rate" filename filter
- Short-term: Implement JSON parsing improvements and multi-turn correction
- Medium-term: Add document classification and confidence thresholds
- Long-term: Build human review interface for quality assurance
Generated: December 2024
Documents Analyzed: 959
Rates Extracted: 7,899