Next Steps - Document Processing & Rate Extraction¶
Last updated: 2026-01-06
Current Status¶
Document Processing Progress¶
| Status | Count | Percentage |
|---|---|---|
| completed | 1,349 | 26.5% |
| needs_review | 2,092 | 41.1% |
| pending | 862 | 17.0% |
| failed | 782 | 15.4% |
| Total | 5,085 | 100% |
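The percentage column can be regenerated from the raw counts instead of being maintained by hand. A minimal sketch, assuming the input is a list of `(status, count)` pairs such as a `GROUP BY extraction_status` query returns; `summarize` is a hypothetical helper, not part of the codebase.

```python
def summarize(rows):
    """Turn (status, count) pairs into (status, count, percent) tuples.

    Percentages are rounded to one decimal place, so they may not sum
    to exactly 100.0.
    """
    total = sum(count for _, count in rows)
    if total == 0:
        return []
    return [(status, count, round(100 * count / total, 1)) for status, count in rows]


def format_table(rows):
    """Render the summary as a markdown table like the one above."""
    lines = ["| Status | Count | Percentage |", "|---|---|---|"]
    for status, count, pct in summarize(rows):
        lines.append(f"| {status} | {count:,} | {pct:.1f}% |")
    return "\n".join(lines)
```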
Rate Extraction Progress¶
- Total rates extracted: 14,788
- Batch 0 results: 96/100 docs succeeded, 1,858 rates extracted
- Average rates per doc: ~19
What Was Completed This Session¶
1. Extended OCR Service (src/workers/ocr_service.py)¶
Added support for non-image formats:
- .docx - python-docx library
- .doc - antiword/catdoc/LibreOffice fallback
- .xlsx - openpyxl library
- .xls - xlrd library
- .txt - direct file read
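The format support above amounts to dispatching on file extension. The sketch below illustrates that idea only; the handler names and the `EXTRACTORS` table are hypothetical, and the real structure of `ocr_service.py` may differ. Third-party imports are deferred so that a missing library only affects its own format.

```python
from pathlib import Path


def extract_txt(path: Path) -> str:
    # Plain text: read directly, replacing undecodable bytes instead of failing.
    return path.read_text(encoding="utf-8", errors="replace")


def extract_docx(path: Path) -> str:
    import docx  # python-docx, imported lazily

    return "\n".join(p.text for p in docx.Document(str(path)).paragraphs)


def extract_xlsx(path: Path) -> str:
    import openpyxl  # imported lazily

    wb = openpyxl.load_workbook(str(path), read_only=True, data_only=True)
    lines = []
    for ws in wb.worksheets:
        for row in ws.iter_rows(values_only=True):
            cells = [str(c) for c in row if c is not None]
            if cells:
                lines.append("\t".join(cells))
    return "\n".join(lines)


# Hypothetical extension -> handler table; .doc and .xls are omitted for brevity.
EXTRACTORS = {".txt": extract_txt, ".docx": extract_docx, ".xlsx": extract_xlsx}


def extract_text(path: Path) -> str:
    handler = EXTRACTORS.get(path.suffix.lower())
    if handler is None:
        raise ValueError(f"unsupported format: {path.suffix}")
    return handler(path)
```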
2. Marked Empty Files as Failed¶
332 zero-byte documents (failed Airtable downloads) were marked as failed, each with an error message.
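Zero-byte files can be found with a directory walk before any extraction is attempted. A minimal sketch, assuming the downloads live under a single root directory; `find_empty_files` is a hypothetical helper, and updating the status in the database is not shown.

```python
from pathlib import Path


def find_empty_files(root: Path) -> list[Path]:
    """Return all zero-byte regular files under root, sorted for stable output."""
    return sorted(p for p in root.rglob("*") if p.is_file() and p.stat().st_size == 0)
```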
3. Fixed Rate Extractor (src/workers/rate_extractor.py)¶
- Increased CLAUDE_MAX_TOKENS from 4096 → 16384 (was truncating JSON)
- Increased API timeout from 60s → 180s
- Fixed JSON parsing regex (was non-greedy, breaking nested JSON)
- Added balanced brace extraction for complex nested JSON
- Fixed .env loading with explicit load_dotenv()
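A non-greedy pattern such as `\{.*?\}` stops at the first closing brace, so any nested object is cut short. Balanced brace extraction avoids this by tracking nesting depth and ignoring braces inside string literals. The function below is an illustrative sketch of that technique, not the code in `rate_extractor.py`; the name `extract_first_json_object` is hypothetical.

```python
import json


def extract_first_json_object(text: str):
    """Return the first parseable {...} object found in text, or None."""
    start = text.find("{")
    while start != -1:
        depth = 0
        in_string = False
        escaped = False
        for i in range(start, len(text)):
            ch = text[i]
            if in_string:
                # Inside a string literal: only look for its closing quote.
                if escaped:
                    escaped = False
                elif ch == "\\":
                    escaped = True
                elif ch == '"':
                    in_string = False
            elif ch == '"':
                in_string = True
            elif ch == "{":
                depth += 1
            elif ch == "}":
                depth -= 1
                if depth == 0:
                    try:
                        return json.loads(text[start : i + 1])
                    except json.JSONDecodeError:
                        break  # balanced but not valid JSON; try the next "{"
        # No usable object from this position; move to the next opening brace.
        start = text.find("{", start + 1)
    return None
```

Note that if a response is truncated, the outer object never balances and this sketch falls through to the first complete inner object. A caller that needs the outer object only should treat that case as a failure.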
4. Generated Processing Batches¶
9 batch files created in batches/:
- batch_0000.json through batch_0007.json: 100 docs each
- batch_0008.json: 10 docs
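The batch layout above (eight files of 100 documents plus a remainder of 10) is what fixed-size chunking produces for 810 documents. A minimal sketch follows; the flat JSON list of document IDs is an assumed file layout, and `write_batches` is a hypothetical stand-in for whatever `scripts/generate_batches.py` does.

```python
import json
from pathlib import Path


def write_batches(doc_ids: list[str], out_dir: Path, batch_size: int = 100) -> list[Path]:
    """Split doc_ids into consecutive chunks and write one JSON file per chunk."""
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for n, start in enumerate(range(0, len(doc_ids), batch_size)):
        path = out_dir / f"batch_{n:04d}.json"
        # Assumed layout: a flat JSON list of document IDs.
        path.write_text(json.dumps(doc_ids[start : start + batch_size]))
        paths.append(path)
    return paths
```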
5. Started Batch Processing¶
- Batch 0000: ✅ Completed
- Batch 0001: 🔄 Running (check logs/batch_0001.log)
- Batches 0002-0008: ⏳ Pending
Immediate Next Steps¶
1. Continue Batch Processing¶
DuckDB allows only one writer process at a time, so run the batches sequentially:
# Check if current batch finished
tail -20 logs/batch_0001.log
# When done, start next batch
source ~/.pyenv/versions/nominates/bin/activate
nohup python3 -u src/workers/document_processor.py --batch-file batches/batch_0002.json > logs/batch_0002.log 2>&1 &
# Continue for batches 0003-0008
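The remaining batches could also be chained by a small wrapper instead of being started by hand. This is a sketch under two assumptions: the worker exits with a non-zero code on failure, and it is run from the repository root inside the activated environment. `build_command` and `run_batches` are hypothetical names.

```python
import subprocess
import sys
from pathlib import Path

PROCESSOR = "src/workers/document_processor.py"


def build_command(batch_file: Path) -> list[str]:
    # Mirrors the manual command: unbuffered Python, one batch file per run.
    return [sys.executable, "-u", PROCESSOR, "--batch-file", str(batch_file)]


def run_batches(batch_numbers, batch_dir=Path("batches"), log_dir=Path("logs")) -> None:
    for n in batch_numbers:
        batch_file = batch_dir / f"batch_{n:04d}.json"
        log_file = log_dir / f"batch_{n:04d}.log"
        with log_file.open("w") as log:
            # run() blocks until the worker exits, so only one writer is active.
            result = subprocess.run(
                build_command(batch_file), stdout=log, stderr=subprocess.STDOUT
            )
        if result.returncode != 0:
            # Stop instead of continuing past a failed batch.
            raise RuntimeError(
                f"{batch_file} exited with code {result.returncode}; see {log_file}"
            )
```

Calling `run_batches(range(2, 9))` would cover batches 0002 through 0008. The wrapper itself would still need to be started under `nohup` to survive a closed terminal.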
2. Monitor Progress¶
# Check document status
python3 -c "
import duckdb
conn = duckdb.connect('db/cbradio.db', read_only=True)
print(conn.execute('''
SELECT extraction_status, COUNT(*)
FROM document
GROUP BY 1
ORDER BY 2 DESC
''').fetchall())
conn.close()
"
# Check rate count
python3 -c "
import duckdb
conn = duckdb.connect('db/cbradio.db', read_only=True)
print('Total rates:', conn.execute('SELECT COUNT(*) FROM rate').fetchone()[0])
conn.close()
"
3. After All Batches Complete¶
- Analyze failed documents (782)
  - Query extraction errors
  - Identify patterns (unsupported formats, corrupted files, etc.)
  - Retry with different strategies if applicable
- Review needs_review documents (2,092)
  - These have confidence < 0.5 or no rates extracted
  - Many may be non-rate-card docs (agreements, emails, etc.)
  - Reclassify as appropriate
- Implement rate deduplication
  - ~37% potential duplicates identified
  - Add unique constraint on (station_id, ad_type, duration_seconds, effective_date, slot_name)
  - Run deduplication script
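The deduplication step can be sketched in plain Python over rows represented as dictionaries, using the five columns named for the unique constraint as the key. Keeping the first row seen is an assumption made for illustration; the real script might prefer the newest row or the one with the highest confidence. `deduplicate` is a hypothetical helper.

```python
DEDUP_KEY = ("station_id", "ad_type", "duration_seconds", "effective_date", "slot_name")


def deduplicate(rates):
    """Keep the first row seen for each DEDUP_KEY.

    Returns (unique_rows, dropped_count). Missing columns are treated as None,
    so two rows that both lack slot_name compare as equal here.
    """
    seen = set()
    unique = []
    for row in rates:
        key = tuple(row.get(col) for col in DEDUP_KEY)
        if key in seen:
            continue
        seen.add(key)
        unique.append(row)
    return unique, len(rates) - len(unique)
```

Existing duplicates have to be removed before the unique constraint is added, otherwise creating the constraint fails. SQL unique constraints also typically treat NULLs as distinct, so rows with a missing `slot_name` may need a sentinel value to be caught by the constraint.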
Phase 2: Rate Cards API (After Phase 1)¶
See docs/ROADMAP.md for full details:
- Create external rate card ingestion script for docs/rate-cards/amfm-rates/
- Implement rate card export endpoint in consumer format
- Add versioning logic (active vs scheduled vs archived)
- Tag in-network vs out-of-network stations
Phase 3: Proposals (After Phase 2)¶
- Create Proposal and ProposalLineItem data models
- Database schema migration
- CRUD API endpoints
- Proposal status workflow
- Historical proposal ingestion
Key Files Modified¶
| File | Changes |
|---|---|
| src/workers/ocr_service.py | Added Word/Excel/text extraction |
| src/workers/rate_extractor.py | Fixed max_tokens, timeout, JSON parsing |
| scripts/generate_batches.py | Updated to include all supported formats |
| .env | CLAUDE_MAX_TOKENS=16384 |
Configuration¶
Current .env settings for extraction:
CLAUDE_MODEL=claude-sonnet-4-5-20250929
CLAUDE_MAX_TOKENS=16384
MISTRAL_OCR_MODEL=mistral-ocr-latest
MISTRAL_VISION_MODEL=pixtral-large-latest
Rate extractor timeout: 180 seconds (in code)
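Since the timeout lives in code while the other settings live in .env, one option is to read all of them in one place. The sketch below assumes `load_dotenv()` has already populated the environment; `ExtractionConfig`, `load_config`, and the `RATE_EXTRACTOR_TIMEOUT` variable are hypothetical and do not exist in the current setup.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class ExtractionConfig:
    model: str
    max_tokens: int
    timeout_seconds: int


def load_config(env=os.environ) -> ExtractionConfig:
    return ExtractionConfig(
        # Required: raises KeyError if .env was not loaded.
        model=env["CLAUDE_MODEL"],
        max_tokens=int(env.get("CLAUDE_MAX_TOKENS", "16384")),
        # Hypothetical variable; falls back to the current in-code value of 180 s.
        timeout_seconds=int(env.get("RATE_EXTRACTOR_TIMEOUT", "180")),
    )
```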
Known Issues¶
- DuckDB single-writer limitation - Can only run one batch at a time
- Empty files (332) - Airtable download failures, marked as failed
- Unsupported formats - ODT, GIF, ZIP, PPTX/PPT files skipped (153 docs)
- Some .doc files - Need antiword/catdoc installed for extraction
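Whether .doc extraction can work on a given machine can be checked up front by looking for the external tools on PATH. A minimal sketch; `available_doc_tools` is a hypothetical helper, and `soffice` is the usual name of the LibreOffice command-line binary.

```python
import shutil

# Preference order follows the fallback chain: antiword, catdoc, LibreOffice.
DOC_TOOLS = ("antiword", "catdoc", "soffice")


def available_doc_tools(tools=DOC_TOOLS, which=shutil.which):
    """Return the subset of tools found on PATH, in preference order."""
    return [t for t in tools if which(t) is not None]
```

An empty result means .doc files will fail extraction until one of the tools is installed.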