Next Steps - Document Processing & Rate Extraction

Last updated: 2026-01-06

Current Status

Document Processing Progress

Status	Count	Percentage
completed	1,349	26.5%
needs_review	2,092	41.1%
pending	862	17.0%
failed	782	15.4%
Total	5,085	100%

Rate Extraction Progress

Total rates extracted: 14,788
Batch 0 results: 96/100 docs succeeded, 1,858 rates extracted
Average rates per successful doc in batch 0: ~19 (1,858 / 96)

What Was Completed This Session

1. Extended OCR Service (`src/workers/ocr_service.py`)

Added support for non-image formats:
- .docx: python-docx library
- .doc: antiword/catdoc/LibreOffice fallback
- .xlsx: openpyxl library
- .xls: xlrd library
- .txt: direct file read

2. Marked Empty Files as Failed

332 zero-byte documents (failed Airtable downloads) were marked as failed with an error message.

3. Fixed Rate Extractor (`src/workers/rate_extractor.py`)

Increased CLAUDE_MAX_TOKENS from 4096 to 16384 (the lower limit was truncating the JSON output)
Increased API timeout from 60s to 180s
Fixed JSON parsing regex (the non-greedy pattern stopped at the first closing brace, breaking nested JSON)
Added balanced brace extraction for complex nested JSON
Fixed .env loading with an explicit load_dotenv() call
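
The balanced-brace step can be sketched as follows. This is a minimal illustration, not the code in `src/workers/rate_extractor.py`: the names `find_matching_brace` and `extract_first_json_object` are hypothetical, and it assumes the model reply contains a single top-level JSON object, possibly surrounded by prose.

```python
import json


def find_matching_brace(text, start):
    """Return the index of the '}' that closes text[start] == '{', or -1."""
    depth = 0
    in_string = False
    escaped = False
    for i in range(start, len(text)):
        ch = text[i]
        if in_string:
            # Inside a JSON string: braces are literal, only track escapes.
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
            continue
        if ch == '"':
            in_string = True
        elif ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return i
    return -1  # ran out of text before the object closed


def extract_first_json_object(text):
    """Parse the first balanced {...} span in text, or return None."""
    start = text.find("{")
    while start != -1:
        end = find_matching_brace(text, start)
        if end == -1:
            # Unbalanced braces: treated here as truncated output.
            return None
        try:
            return json.loads(text[start:end + 1])
        except json.JSONDecodeError:
            # Balanced but not valid JSON; try the next opening brace.
            start = text.find("{", start + 1)
    return None
```

Unlike a non-greedy regex, which stops at the first `}`, the scan tracks nesting depth and ignores braces inside string literals. Returning `None` for an unbalanced object keeps truncation visible instead of silently parsing a fragment.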

4. Generated Processing Batches

9 batch files created in batches/:
- batch_0000.json through batch_0007.json: 100 docs each
- batch_0008.json: 10 docs
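
The split above (8 x 100 + 10 = 810 documents) is what fixed-size chunking produces. A minimal sketch, assuming each batch is simply a list of document IDs; the actual file layout written by `scripts/generate_batches.py` is not shown in this document, and `plan_batches` is a hypothetical name.

```python
def plan_batches(doc_ids, batch_size=100):
    """Split doc_ids into consecutive chunks and name each batch file.

    Returns a list of (filename, ids) pairs; the last chunk may be short.
    """
    return [
        (f"batch_{n:04d}.json", doc_ids[start:start + batch_size])
        for n, start in enumerate(range(0, len(doc_ids), batch_size))
    ]
```

With 810 IDs and the default batch size this yields nine entries, the last holding the 10 leftover documents.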

5. Started Batch Processing

Batch 0000: ✅ Completed
Batch 0001: 🔄 Running (check logs/batch_0001.log)
Batches 0002-0008: ⏳ Pending

Immediate Next Steps

1. Continue Batch Processing

DuckDB supports only a single writer process at a time, so run the batches sequentially:

# Check if current batch finished
tail -20 logs/batch_0001.log

# When done, start next batch
source ~/.pyenv/versions/nominates/bin/activate
nohup python3 -u src/workers/document_processor.py --batch-file batches/batch_0002.json > logs/batch_0002.log 2>&1 &

# Continue for batches 0003-0008
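
Starting each remaining batch by hand can be replaced with a small driver that launches the next batch only after the previous one exits. The sketch below is one possible approach, not an existing script in the repository: `build_command`, `run_batch` and `run_batches_sequentially` are hypothetical names, and it assumes `document_processor.py` returns a non-zero exit code when a batch fails.

```python
import subprocess
import sys
from pathlib import Path


def build_command(batch_file):
    """Command line for one batch, mirroring the manual invocation above."""
    return [
        sys.executable, "-u",
        "src/workers/document_processor.py",
        "--batch-file", str(batch_file),
    ]


def run_batch(batch_file, log_dir="logs"):
    """Run one batch to completion, logging to <log_dir>/<batch name>.log."""
    log_path = Path(log_dir) / (Path(batch_file).stem + ".log")
    log_path.parent.mkdir(parents=True, exist_ok=True)
    with open(log_path, "w") as log:
        # subprocess.run blocks until the process exits, so only one
        # writer has the database open at any time.
        result = subprocess.run(
            build_command(batch_file), stdout=log, stderr=subprocess.STDOUT
        )
    return result.returncode


def run_batches_sequentially(batch_files, run_one=run_batch):
    """Run batches in order; stop at the first failure.

    Returns the list of batch files that completed successfully.
    """
    completed = []
    for batch_file in batch_files:
        if run_one(batch_file) != 0:
            break  # assumption: non-zero exit code means the batch failed
        completed.append(batch_file)
    return completed
```

Calling `run_batches_sequentially([f"batches/batch_{n:04d}.json" for n in range(2, 9)])` from the activated environment would cover batches 0002 through 0008. Stopping at the first failure is a deliberate choice here, so a broken batch can be inspected before later ones run.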

2. Monitor Progress

# Check document status
python3 -c "
import duckdb
conn = duckdb.connect('db/cbradio.db', read_only=True)
print(conn.execute('''
    SELECT extraction_status, COUNT(*)
    FROM document
    GROUP BY 1
    ORDER BY 2 DESC
''').fetchall())
conn.close()
"

# Check rate count
python3 -c "
import duckdb
conn = duckdb.connect('db/cbradio.db', read_only=True)
print('Total rates:', conn.execute('SELECT COUNT(*) FROM rate').fetchone()[0])
conn.close()
"
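
The rows returned by the status query can be turned into the percentage breakdown shown under Current Status. A small sketch: `status_breakdown` is a hypothetical helper, not part of the repository, and it takes the `(extraction_status, count)` tuples that `fetchall()` returns. One caveat worth checking: DuckDB's documented concurrency model allows either one read-write process or several read-only processes, so these read-only checks may fail with a lock error while a batch is still writing and may need to run between batches.

```python
def status_breakdown(rows):
    """Turn (status, count) rows into (status, count, percent) rows.

    Rows are sorted by count, largest first; percent is rounded to
    one decimal place. Returns an empty list when there are no documents.
    """
    total = sum(count for _, count in rows)
    if total == 0:
        return []
    ordered = sorted(rows, key=lambda row: row[1], reverse=True)
    return [
        (status, count, round(100 * count / total, 1))
        for status, count in ordered
    ]
```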

3. After All Batches Complete

- Analyze failed documents (782)
  - Query extraction errors
  - Identify patterns (unsupported formats, corrupted files, etc.)
  - Retry with different strategies if applicable
- Review needs_review documents (2,092)
  - These have confidence < 0.5 or no rates extracted
  - Many may be non-rate-card docs (agreements, emails, etc.)
  - Reclassify as appropriate
- Implement rate deduplication
  - ~37% potential duplicates identified
  - Add unique constraint on (station_id, ad_type, duration_seconds, effective_date, slot_name)
  - Run deduplication script
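
The deduplication step can be sketched in plain Python using the same five columns as the proposed unique constraint. This is an illustration under assumptions, not the planned script: `deduplicate_rates` is a hypothetical name, rates are assumed to be dicts keyed by column name, and keeping the first row seen is an arbitrary choice (keeping the highest-confidence row would be an equally plausible policy).

```python
DEDUP_KEY = (
    "station_id",
    "ad_type",
    "duration_seconds",
    "effective_date",
    "slot_name",
)


def deduplicate_rates(rates):
    """Split rates into (unique, duplicates) by the DEDUP_KEY columns.

    The first row seen for each key is kept; later rows with the same
    key are returned separately so they can be reviewed before deletion.
    """
    seen = set()
    unique = []
    duplicates = []
    for rate in rates:
        # Missing columns become None, so rows lacking e.g. slot_name
        # are treated as equal to each other on that column.
        key = tuple(rate.get(column) for column in DEDUP_KEY)
        if key in seen:
            duplicates.append(rate)
        else:
            seen.add(key)
            unique.append(rate)
    return unique, duplicates
```

Existing duplicates would need to be removed before the unique constraint is added, since creating the constraint on a table that already violates it is expected to fail. Note also that many SQL engines treat NULLs as distinct in unique constraints, so rows with a NULL `slot_name` may need separate handling; the sketch above collapses them.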

Phase 2: Rate Cards API (After Phase 1)

See docs/ROADMAP.md for full details:

Create external rate card ingestion script for docs/rate-cards/amfm-rates/
Implement rate card export endpoint in consumer format
Add versioning logic (active vs scheduled vs archived)
Tag in-network vs out-of-network stations

Phase 3: Proposals (After Phase 2)

Create Proposal and ProposalLineItem data models
Database schema migration
CRUD API endpoints
Proposal status workflow
Historical proposal ingestion

Key Files Modified

File	Changes
`src/workers/ocr_service.py`	Added Word/Excel/text extraction
`src/workers/rate_extractor.py`	Fixed max_tokens, timeout, JSON parsing
`scripts/generate_batches.py`	Updated to include all supported formats
`.env`	`CLAUDE_MAX_TOKENS=16384`

Configuration

Current .env settings for extraction:

CLAUDE_MODEL=claude-sonnet-4-5-20250929
CLAUDE_MAX_TOKENS=16384
MISTRAL_OCR_MODEL=mistral-ocr-latest
MISTRAL_VISION_MODEL=pixtral-large-latest

Rate extractor API timeout: 180 seconds (set in code in `src/workers/rate_extractor.py`)

Known Issues

DuckDB single-writer limitation - Can only run one batch at a time
Empty files (332) - Airtable download failures, marked as failed
Unsupported formats - ODT, GIF, ZIP, PPTX/PPT files skipped (153 docs)
Some .doc files - Need antiword/catdoc installed for extraction