Next Steps - Document Processing & Rate Extraction

Last updated: 2026-01-06

Current Status

Document Processing Progress

Status         Count   Percentage
completed      1,349   26.5%
needs_review   2,092   41.1%
pending          862   17.0%
failed           782   15.4%
Total          5,085   100%

Rate Extraction Progress

  • Total rates extracted: 14,788
  • Batch 0 results: 96/100 docs succeeded, 1,858 rates extracted
  • Average rates per doc: ~19

What Was Completed This Session

1. Extended OCR Service (src/workers/ocr_service.py)

Added support for non-image formats:

  • .docx - python-docx library
  • .doc - antiword/catdoc/LibreOffice fallback
  • .xlsx - openpyxl library
  • .xls - xlrd library
  • .txt - direct file read
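The format support above amounts to dispatching on file extension. The following is a minimal sketch of that idea, not the actual code in ocr_service.py: the function names (`extract_text`, `read_docx`, and so on) and the `READERS` table are hypothetical, and the .doc fallback chain is left out because it shells out to external tools. The third-party libraries are imported lazily so a missing package only affects its own format.

```python
from pathlib import Path


def read_txt(path: Path) -> str:
    # Plain text needs no parsing; replace undecodable bytes rather than fail.
    return path.read_text(encoding="utf-8", errors="replace")


def read_docx(path: Path) -> str:
    import docx  # provided by the python-docx package

    return "\n".join(p.text for p in docx.Document(str(path)).paragraphs)


def read_xlsx(path: Path) -> str:
    import openpyxl

    wb = openpyxl.load_workbook(str(path), read_only=True, data_only=True)
    lines = []
    for ws in wb.worksheets:
        for row in ws.iter_rows(values_only=True):
            cells = [str(c) for c in row if c is not None]
            if cells:
                lines.append("\t".join(cells))
    wb.close()
    return "\n".join(lines)


def read_xls(path: Path) -> str:
    import xlrd

    book = xlrd.open_workbook(str(path))
    lines = []
    for sheet in book.sheets():
        for i in range(sheet.nrows):
            cells = [str(c) for c in sheet.row_values(i) if c != ""]
            if cells:
                lines.append("\t".join(cells))
    return "\n".join(lines)


# Hypothetical dispatch table: extension -> reader function.
READERS = {".txt": read_txt, ".docx": read_docx, ".xlsx": read_xlsx, ".xls": read_xls}


def extract_text(path: str) -> str:
    p = Path(path)
    reader = READERS.get(p.suffix.lower())
    if reader is None:
        raise ValueError(f"unsupported format: {p.suffix or '(none)'}")
    return reader(p)
```

Raising on unknown extensions keeps unsupported files visible as explicit failures instead of silently producing empty text.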

2. Marked Empty Files as Failed

332 zero-byte documents (failed Airtable downloads) were marked as failed, each with an error message.

3. Fixed Rate Extractor (src/workers/rate_extractor.py)

  • Increased CLAUDE_MAX_TOKENS from 4096 → 16384 (was truncating JSON)
  • Increased API timeout from 60s → 180s
  • Fixed JSON parsing regex (was non-greedy, breaking nested JSON)
  • Added balanced brace extraction for complex nested JSON
  • Fixed .env loading with explicit load_dotenv()
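To illustrate the balanced brace extraction mentioned above: a non-greedy pattern such as `\{.*?\}` stops at the first closing brace, which cuts a nested object short. A brace-counting scan avoids that. This is a sketch under assumptions, not the code in rate_extractor.py: the function name `extract_first_json_object` is hypothetical, and returning None for unbalanced input is one possible policy for truncated responses.

```python
import json


def extract_first_json_object(text: str):
    """Return the first parseable top-level JSON object in text, or None."""
    start = text.find("{")
    while start != -1:
        depth = 0
        in_string = False
        escaped = False
        for i in range(start, len(text)):
            ch = text[i]
            if in_string:
                # Braces inside string literals must not affect the depth.
                if escaped:
                    escaped = False
                elif ch == "\\":
                    escaped = True
                elif ch == '"':
                    in_string = False
            elif ch == '"':
                in_string = True
            elif ch == "{":
                depth += 1
            elif ch == "}":
                depth -= 1
                if depth == 0:
                    try:
                        return json.loads(text[start : i + 1])
                    except json.JSONDecodeError:
                        break  # balanced but not valid JSON; try the next "{"
        else:
            # Ran out of text before the braces balanced: likely truncated output.
            return None
        start = text.find("{", start + 1)
    return None
```

Treating unbalanced input as a failure, rather than falling back to an inner fragment, means a truncated response is reported instead of yielding a partial set of rates.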

4. Generated Processing Batches

9 batch files created in batches/:

  • batch_0000.json through batch_0007.json: 100 docs each
  • batch_0008.json: 10 docs
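The batch layout above (eight files of 100 plus one of 10) is what fixed-size chunking of 810 documents produces. A minimal sketch follows. It assumes each batch file is a JSON list of document IDs, which may not match what scripts/generate_batches.py actually writes, and `write_batches` is a hypothetical name.

```python
import json
from pathlib import Path


def write_batches(doc_ids, out_dir, batch_size=100):
    """Split doc_ids into files named batch_0000.json, batch_0001.json, ..."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for n, start in enumerate(range(0, len(doc_ids), batch_size)):
        path = out / f"batch_{n:04d}.json"
        # Assumed file format: a plain JSON list of document IDs.
        path.write_text(json.dumps(doc_ids[start : start + batch_size]))
        paths.append(path)
    return paths
```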

5. Started Batch Processing

  • Batch 0000: ✅ Completed
  • Batch 0001: 🔄 Running (check logs/batch_0001.log)
  • Batches 0002-0008: ⏳ Pending

Immediate Next Steps

1. Continue Batch Processing

DuckDB allows only one writer process at a time, so run the batches sequentially:

# Check if current batch finished
tail -20 logs/batch_0001.log

# When done, start next batch
source ~/.pyenv/versions/nominates/bin/activate
nohup python3 -u src/workers/document_processor.py --batch-file batches/batch_0002.json > logs/batch_0002.log 2>&1 &

# Continue for batches 0003-0008
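
Instead of starting each remaining batch by hand, a small driver can run them one after another, which also respects the single-writer constraint. The sketch below reuses the command line shown above. The `run_batches` name, its parameters, and the stop-on-first-failure policy are assumptions, and it should only be started once the currently running batch has finished.

```python
import subprocess
import sys
from pathlib import Path


def run_batches(batch_numbers, batch_dir="batches", log_dir="logs", run=subprocess.run):
    """Run batches one at a time; return the first failing batch number, or None."""
    Path(log_dir).mkdir(parents=True, exist_ok=True)
    for n in batch_numbers:
        batch = Path(batch_dir) / f"batch_{n:04d}.json"
        log = Path(log_dir) / f"batch_{n:04d}.log"
        cmd = [
            sys.executable, "-u", "src/workers/document_processor.py",
            "--batch-file", str(batch),
        ]
        with log.open("w") as fh:
            # Blocks until the batch exits, so only one writer touches the database.
            result = run(cmd, stdout=fh, stderr=subprocess.STDOUT)
        if result.returncode != 0:
            return n  # stop here so the failure can be inspected in the log
    return None


if __name__ == "__main__":
    failed = run_batches(range(2, 9))
    print("all batches done" if failed is None else f"batch {failed:04d} failed")
```

Run it from the project root with the same virtualenv activated, for example under nohup as with the single-batch command.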

2. Monitor Progress

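# Check document status (run between batches: a read-only connection may fail
# to open while a batch process holds the write lock)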
python3 -c "
import duckdb
conn = duckdb.connect('db/cbradio.db', read_only=True)
print(conn.execute('''
    SELECT extraction_status, COUNT(*)
    FROM document
    GROUP BY 1
    ORDER BY 2 DESC
''').fetchall())
conn.close()
"

# Check rate count
python3 -c "
import duckdb
conn = duckdb.connect('db/cbradio.db', read_only=True)
print('Total rates:', conn.execute('SELECT COUNT(*) FROM rate').fetchone()[0])
conn.close()
"

3. After All Batches Complete

  1. Analyze failed documents (782)
       • Query extraction errors
       • Identify patterns (unsupported formats, corrupted files, etc.)
       • Retry with different strategies if applicable

  2. Review needs_review documents (2,092)
       • These have confidence < 0.5 or no rates extracted
       • Many may be non-rate-card docs (agreements, emails, etc.)
       • Reclassify as appropriate

  3. Implement rate deduplication
       • ~37% potential duplicates identified
       • Add unique constraint on (station_id, ad_type, duration_seconds, effective_date, slot_name)
       • Run deduplication script
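
As a rough illustration of the deduplication step, the sketch below keeps the first row seen for each (station_id, ad_type, duration_seconds, effective_date, slot_name) key. It assumes rates are available as dictionaries. The `deduplicate` name, the keep-first policy, and the `price` field in the usage example are hypothetical, and the real script may prefer the most recent or highest-confidence row.

```python
DEDUP_KEY = ("station_id", "ad_type", "duration_seconds", "effective_date", "slot_name")


def deduplicate(rates):
    """Return rates with later rows that repeat an earlier key removed."""
    seen = set()
    unique = []
    for rate in rates:
        # Missing columns become None, so two rows lacking slot_name still match.
        key = tuple(rate.get(col) for col in DEDUP_KEY)
        if key not in seen:
            seen.add(key)
            unique.append(rate)
    return unique


rows = [
    {"station_id": 1, "ad_type": "spot", "duration_seconds": 30,
     "effective_date": "2025-01-01", "slot_name": "AM Drive", "price": 45.0},
    {"station_id": 1, "ad_type": "spot", "duration_seconds": 30,
     "effective_date": "2025-01-01", "slot_name": "AM Drive", "price": 45.0},
    {"station_id": 1, "ad_type": "spot", "duration_seconds": 60,
     "effective_date": "2025-01-01", "slot_name": "AM Drive", "price": 80.0},
]
print(len(deduplicate(rows)))  # 2
```

One difference worth noting: a SQL unique constraint usually treats NULLs as distinct from each other, whereas this sketch treats two missing values as equal, so rows with NULL key columns may need separate handling before the constraint is added.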

Phase 2: Rate Cards API (After Phase 1)

See docs/ROADMAP.md for full details:

  1. Create external rate card ingestion script for docs/rate-cards/amfm-rates/
  2. Implement rate card export endpoint in consumer format
  3. Add versioning logic (active vs scheduled vs archived)
  4. Tag in-network vs out-of-network stations

Phase 3: Proposals (After Phase 2)

  1. Create Proposal and ProposalLineItem data models
  2. Database schema migration
  3. CRUD API endpoints
  4. Proposal status workflow
  5. Historical proposal ingestion

Key Files Modified

File                            Changes
src/workers/ocr_service.py      Added Word/Excel/text extraction
src/workers/rate_extractor.py   Fixed max_tokens, timeout, JSON parsing
scripts/generate_batches.py     Updated to include all supported formats
.env                            CLAUDE_MAX_TOKENS=16384

Configuration

Current .env settings for extraction:

CLAUDE_MODEL=claude-sonnet-4-5-20250929
CLAUDE_MAX_TOKENS=16384
MISTRAL_OCR_MODEL=mistral-ocr-latest
MISTRAL_VISION_MODEL=pixtral-large-latest

Rate extractor timeout: 180 seconds (in code)
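
For reference, these settings could be read along the following lines once load_dotenv() has populated the environment. This is a sketch, not the actual code in rate_extractor.py: `load_extraction_settings` is a hypothetical helper, and the fallback values simply mirror the numbers listed above.

```python
import os


def load_extraction_settings(env=None):
    """Read extraction settings from the environment (or a dict, for testing)."""
    env = os.environ if env is None else env
    return {
        "model": env["CLAUDE_MODEL"],  # required: fail loudly if .env was not loaded
        # Environment values are strings, so convert before passing to the API client.
        "max_tokens": int(env.get("CLAUDE_MAX_TOKENS", "16384")),
        "timeout_seconds": 180,  # set in code rather than in .env
    }
```

Requiring CLAUDE_MODEL while defaulting the token limit makes a missing .env obvious without silently falling back to a wrong model.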

Known Issues

  1. DuckDB single-writer limitation - Can only run one batch at a time
  2. Empty files (332) - Airtable download failures, marked as failed
  3. Unsupported formats - ODT, GIF, ZIP, PPTX/PPT files skipped (153 docs)
  4. Some .doc files - Need antiword/catdoc installed for extraction