Rate Extraction System - Findings & Accuracy Improvements¶
Executive Summary¶
The rate extraction pipeline processed 959 rate card documents with the following results:
| Status | Count | Percentage |
|---|---|---|
| Completed | 620 | 64.7% |
| Failed | 256 | 26.7% |
| Needs Review | 83 | 8.7% |
Total rates extracted: 7,899
Pipeline Architecture¶
Document (PDF/PNG/JPG)
→ OCR (Mistral OCR / Vision)
→ Rate Extraction (Claude Sonnet)
→ Validation & Storage (DuckDB)
Components¶
- OCR Service (`src/workers/ocr_service.py`)
  - Primary: Mistral OCR API for PDFs
  - Fallback: Mistral Vision API for images
  - Final fallback: Local Ollama (llava model)
- Rate Extractor (`src/workers/rate_extractor.py`)
  - Model: Claude Sonnet 4.5
  - Structured JSON extraction with validation
  - Auto-calculates missing gross/net rates (15% commission; see the sketch after this list)
- Document Processor (`src/workers/document_processor.py`)
  - Orchestrates the pipeline
  - Tracks status and logs extraction results
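The gross/net back-fill is a fixed-percentage calculation. A minimal sketch, assuming the flat 15% agency commission stated above (the function name is illustrative):

```python
# Sketch of the gross/net back-fill, assuming a flat 15% agency commission.
COMMISSION = 0.15

def fill_missing_rates(gross_rate: float | None, net_rate: float | None) -> tuple[float | None, float | None]:
    """Derive whichever of gross/net is missing from the other."""
    if gross_rate is not None and net_rate is None:
        net_rate = round(gross_rate * (1 - COMMISSION), 2)   # e.g. 25.00 -> 21.25
    elif net_rate is not None and gross_rate is None:
        gross_rate = round(net_rate / (1 - COMMISSION), 2)   # e.g. 21.25 -> 25.00
    return gross_rate, net_rate
```

This matches the sample extraction in the appendix, where a $25.00 gross rate pairs with a $21.25 net rate.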
Confidence Distribution¶
| Confidence Band | Document Count |
|---|---|
| 90-100% | 323 (46%) |
| 75-89% | 263 (37%) |
| 50-74% | 76 (11%) |
| Below 50% | 41 (6%) |
Observation: 83% of successful extractions have confidence ≥75%, indicating the model performs well when documents are readable.
Failure Analysis¶
Error Breakdown¶
| Error Type | Count | Percentage |
|---|---|---|
| Failed to parse extraction response | 252 | 98.4% |
| OCR Error | 3 | 1.2% |
| Other | 1 | 0.4% |
Root Causes¶
- JSON Parsing Failures (252 cases)
  - Claude sometimes returns explanatory text instead of pure JSON
  - Malformed JSON due to special characters in OCR text
  - Model declines to extract when document is unclear
- OCR Failures (3 cases)
  - Corrupted or encrypted PDFs
  - Extremely low-resolution images
  - Unsupported file formats masquerading as PDF/PNG
- Empty/Short OCR (1 case)
  - Documents with minimal text content
  - Scanned images with poor contrast
Data Quality Issues¶
1. Duplicate Rates¶
Cause: Same rate card processed multiple times, or same rate appearing in multiple documents for the same station.
Recommendation: Implement deduplication logic:
- Unique constraint on (station_id, ad_type, preemption, duration_seconds, gross_rate, effective_date)
- Or use UPSERT to update existing rates
2. Overly Generic Values¶
Many rates have generic values that reduce usefulness:
| Field | Generic Value | Count |
|---|---|---|
| ad_type | "all" | 2,645 |
| preemption | "all" | 1,351 |
| slot_days | "all" | ~60% of rates |
Cause: When the model can't determine specific values, it defaults to "all".
Recommendation:
- Improve the prompt to discourage "all" values
- Add validation to reject rates where too many fields are "all" (see the sketch below)
- Flag for human review when confidence < 0.8
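A minimal validation sketch for the "too generic" check, assuming rates arrive as dicts using the field names shown in the appendix; the two-field threshold is an illustrative starting point:

```python
# Flag or reject rates that are too generic to be useful for scheduling.
# The thresholds here are illustrative starting points.
GENERIC_FIELDS = ("ad_type", "preemption", "slot_days")

def classify_generic_rate(rate: dict, confidence: float) -> str:
    generic = sum(1 for f in GENERIC_FIELDS if rate.get(f) == "all")
    if generic == len(GENERIC_FIELDS):
        return "reject"        # every descriptive field defaulted to "all"
    if generic >= 2 or confidence < 0.8:
        return "needs_review"  # flag for human review
    return "accept"
```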
3. Missing Time Slots¶
Many rates lack slot_start_time and slot_end_time, reducing their value for scheduling.
Recommendation: Enhance the prompt with examples of common daypart patterns (see the sketch after this list):
- Morning Drive: 6am-10am
- Midday: 10am-3pm
- Afternoon Drive: 3pm-7pm
- Evening: 7pm-12am
- Overnight: 12am-6am
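One way to carry these dayparts into code, as a sketch: the mapping simply restates the list above, and the back-fill helper is an assumption about how the pipeline might use it when only a slot name is extracted.

```python
# Common radio dayparts, usable as prompt examples or to backfill
# slot_start_time / slot_end_time when only a slot_name was extracted.
DAYPARTS = {
    "Morning Drive":   ("06:00", "10:00"),
    "Midday":          ("10:00", "15:00"),
    "Afternoon Drive": ("15:00", "19:00"),
    "Evening":         ("19:00", "00:00"),
    "Overnight":       ("00:00", "06:00"),
}

def backfill_slot_times(rate: dict) -> dict:
    """Fill missing slot times from a recognized slot_name, if possible."""
    if not rate.get("slot_start_time") and rate.get("slot_name") in DAYPARTS:
        rate["slot_start_time"], rate["slot_end_time"] = DAYPARTS[rate["slot_name"]]
    return rate
```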
Accuracy Improvement Recommendations¶
High Priority¶
1. Improve JSON Parsing Robustness¶
The current parsing attempts three methods but still fails on 252 documents.
Suggestions (a layered parsing sketch follows the correction example below):
- Add Claude's response_format: { type: "json_object" } parameter (if supported)
- Use a more forgiving JSON parser (e.g., json5 or demjson3)
- Implement multi-turn extraction: if first attempt fails, send error back to Claude to fix
- Log the raw response for failed parses to identify patterns
```python
# Example: Multi-turn correction
if not parsed:
    correction_prompt = f"""
Your previous response was not valid JSON. Here's what you returned:
{content[:500]}
Please return ONLY a valid JSON object with the rate data.
"""
    # Retry with correction prompt
```
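Before falling back to a correction turn, a more forgiving first pass often recovers the JSON on its own. A minimal sketch of that layered parse: strip markdown fences, try a strict parse, then fall back to the outermost brace-delimited span (the helper name is illustrative):

```python
import json
import re

def parse_rate_json(content: str) -> dict | None:
    """Best-effort extraction of a JSON object from a model response."""
    # Strip markdown code fences if the model wrapped its answer in them.
    cleaned = re.sub(r"^`{3}(?:json)?\s*|\s*`{3}$", "", content.strip())

    # Strict parse first.
    try:
        return json.loads(cleaned)
    except json.JSONDecodeError:
        pass

    # Fall back to the outermost brace-delimited span, which drops any
    # explanatory prose the model added before or after the JSON.
    match = re.search(r"\{.*\}", cleaned, flags=re.DOTALL)
    if match:
        try:
            return json.loads(match.group(0))
        except json.JSONDecodeError:
            pass

    return None  # caller can then trigger the multi-turn correction above
```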
2. Add Document Type Classification¶
Before extraction, classify the document (see the sketch below):
- Rate card (structured rates)
- Agreement (contract terms)
- Letter/Correspondence
- Unsupported format

Benefits:
- Skip non-rate-card documents immediately
- Use document-type-specific prompts
- Improve success rate by focusing on extractable documents
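A lightweight keyword-heuristic sketch of the classifier (a model-based classifier could expose the same interface); the keyword lists and length cutoff are illustrative assumptions:

```python
# Pre-filter: route obvious non-rate-card documents away from the more
# expensive Claude extraction step. Keyword lists are illustrative.
RATE_CARD_HINTS = ("rate card", "gross rate", "net rate", "per spot", "daypart")
AGREEMENT_HINTS = ("agreement", "terms and conditions", "contract")

def classify_document(ocr_text: str) -> str:
    text = ocr_text.lower()
    if len(text.strip()) < 100:
        return "unsupported"          # too little text to extract from
    if any(k in text for k in RATE_CARD_HINTS):
        return "rate_card"
    if any(k in text for k in AGREEMENT_HINTS):
        return "agreement"
    return "letter_or_correspondence"
```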
3. Implement Confidence Thresholds¶
Current behavior extracts all rates regardless of confidence.
Recommendations (see the routing sketch below):
- Reject rates with confidence < 0.5
- Flag for human review when 0.5 ≤ confidence < 0.75
- Auto-approve only when confidence ≥ 0.75
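A direct translation of these thresholds, as a sketch; the status strings are assumed to match the pipeline's existing states:

```python
# Map extraction confidence to a pipeline status. Thresholds follow the
# recommendation above; status strings are assumed to match existing ones.
def route_by_confidence(confidence: float) -> str:
    if confidence < 0.5:
        return "rejected"
    if confidence < 0.75:
        return "needs_review"
    return "approved"
```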
Medium Priority¶
4. Enhanced Prompt Engineering¶
Current prompt is comprehensive but could be improved:
Add Examples: Include 2-3 example rate card snippets with expected JSON output (few-shot learning); a sketch follows below.

Add Negative Examples: Show what NOT to extract:
- "Do not create rates if you cannot find specific dollar amounts"
- "Do not guess rates based on similar stations"

Add Station Context: Provide market information if available:
- Nielsen DMA
- Typical rates for market size
- Previous years' rates for comparison
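A sketch of how one few-shot example might be spliced into the extraction prompt; the rate card line and its expected output are illustrative, not taken from a real document:

```python
# Illustrative few-shot fragment to append to the extraction prompt.
FEW_SHOT_EXAMPLE = """
Example rate card line:
    Morning Drive  M-F 6A-10A   :60  Non-preemptible  $25.00

Expected JSON for that line:
{
  "ad_type": "candidate",
  "preemption": "non_preemptible",
  "slot_days": "m-f",
  "slot_start_time": "06:00",
  "slot_end_time": "10:00",
  "slot_name": "Morning Drive",
  "duration_seconds": 60,
  "gross_rate": 25.00
}
"""
```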
5. Validation Rules¶
Add post-extraction validation:
```python
def validate_rate(rate):
    issues = []

    # Rate sanity checks
    if rate['gross_rate'] and rate['gross_rate'] > 500:
        issues.append("Unusually high rate (>$500)")
    if rate['gross_rate'] and rate['gross_rate'] < 5:
        issues.append("Unusually low rate (<$5)")

    # Duration check
    if rate['duration_seconds'] not in [15, 30, 60, 90, 120]:
        issues.append("Non-standard duration")

    # Time slot validation (note: dayparts that wrap past midnight,
    # e.g. 7pm-12am, would need special handling here)
    if rate['slot_start_time'] and rate['slot_end_time']:
        if rate['slot_start_time'] >= rate['slot_end_time']:
            issues.append("Invalid time range")

    return issues
```
6. Deduplication Strategy¶
Implement during insertion:
```sql
INSERT INTO rate (...)
ON CONFLICT (station_id, ad_type, preemption, duration_seconds, effective_date)
DO UPDATE SET
    gross_rate = EXCLUDED.gross_rate,
    updated_at = now()
WHERE rate.source_document_id != EXCLUDED.source_document_id
```
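Note that the ON CONFLICT target only takes effect if a matching UNIQUE or PRIMARY KEY constraint exists on those columns, so this pairs with the unique-constraint recommendation in the deduplication section above.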
Lower Priority¶
7. Human-in-the-Loop Review Interface¶
Build a simple review UI for:
- Documents marked "needs_review"
- Low-confidence extractions
- Rates with validation warnings
8. Historical Rate Tracking¶
Track rate changes over time:
- Store effective_date and expiration_date
- Compare new rates to previous years (see the query sketch below)
- Flag significant rate changes for review
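A sketch of the year-over-year comparison; the database path, rate_card_year column, and 25% change threshold are assumptions for illustration, not the pipeline's actual schema:

```python
import duckdb

# Flag rates that moved more than 25% against the prior year's card for the
# same station and daypart. Path, columns, and threshold are illustrative.
con = duckdb.connect("rates.duckdb")
flagged = con.execute("""
    SELECT cur.station_id, cur.slot_name,
           prev.gross_rate AS prev_rate, cur.gross_rate AS new_rate
    FROM rate AS cur
    JOIN rate AS prev
      ON prev.station_id = cur.station_id
     AND prev.ad_type = cur.ad_type
     AND prev.duration_seconds = cur.duration_seconds
     AND prev.rate_card_year = cur.rate_card_year - 1
    WHERE abs(cur.gross_rate - prev.gross_rate) > 0.25 * prev.gross_rate
""").fetchall()
```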
9. OCR Quality Scoring¶
Before sending to Claude, score OCR quality (see the sketch below):
- Character confidence (if available from OCR)
- Text density (characters per page)
- Presence of key terms (rate, political, candidate, etc.)
Skip extraction for documents with poor OCR quality.
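A heuristic sketch of such a score, assuming only the OCR text and page count are available; the weights and thresholds are illustrative starting points:

```python
# Crude OCR-quality score in [0, 1] from text density and key-term hits.
KEY_TERMS = ("rate", "political", "candidate", "gross", "net", "spot")

def ocr_quality_score(ocr_text: str, page_count: int) -> float:
    chars_per_page = len(ocr_text) / max(page_count, 1)
    density_score = min(chars_per_page / 1000, 1.0)   # ~1000+ chars/page is healthy
    term_score = sum(t in ocr_text.lower() for t in KEY_TERMS) / len(KEY_TERMS)
    return 0.6 * density_score + 0.4 * term_score

def should_extract(ocr_text: str, page_count: int, threshold: float = 0.3) -> bool:
    return ocr_quality_score(ocr_text, page_count) >= threshold
```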
Metrics to Track¶
Extraction Quality¶
- Success rate by document type
- Confidence score distribution
- Rates per document average
Data Quality¶
- Duplicate rate percentage
- Fields with "all" values
- Missing time slot percentage
Production Monitoring¶
- Extraction latency (OCR + Claude)
- API error rates
- Cost per document
Appendix: Sample Extracted Data¶
High-Quality Extraction (Confidence 0.95)¶
```json
{
  "station_callsign": "KAAA-AM",
  "rate_card_year": 2024,
  "rates": [
    {
      "ad_type": "candidate",
      "preemption": "non_preemptible",
      "slot_days": "m-f",
      "slot_start_time": "06:00",
      "slot_end_time": "10:00",
      "slot_name": "Morning Drive",
      "duration_seconds": 60,
      "gross_rate": 25.00,
      "net_rate": 21.25
    }
  ]
}
```
Low-Quality Extraction (Confidence 0.65)¶
```json
{
  "station_callsign": "WXXX-FM",
  "rates": [
    {
      "ad_type": "all",
      "preemption": "all",
      "slot_days": "all",
      "duration_seconds": 60,
      "gross_rate": 15.00
    }
  ]
}
```
Next Steps¶
- Immediate: Process the remaining 2,831 documents without the "rate" filename filter
- Short-term: Implement JSON parsing improvements and multi-turn correction
- Medium-term: Add document classification and confidence thresholds
- Long-term: Build human review interface for quality assurance
Generated: December 2024
Documents Analyzed: 959
Rates Extracted: 7,899