ETL Pipeline (cbetl)
Contact data processing service for political campaign data extraction, normalization, and deduplication.
Overview
CBETL transforms raw contact data from various sources (CSV, PDF, Excel, images) into clean, normalized records with unique Person IDs (PIDs).
Pipeline Stages
flowchart LR
A[Convert] --> B[Analyze]
B --> C[Normalize]
C --> D[Hygiene]
D --> E[Entity Resolution]
| Stage | Purpose |
|---|---|
| Convert | File type detection, format conversion to JSONL |
| Analyze | Data structure assessment |
| Normalize | Address (libpostal) and name normalization |
| Hygiene | OCR noise filtering, data validation |
| Entity Resolution | Deduplication, PID generation, contact merging |
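
As a rough illustration of how stages like these can be chained, the sketch below passes records through a sequence of generator functions. It is not the cbetl implementation: the function names (`convert_csv`, `normalize`, `hygiene`, `run_pipeline`, `to_jsonl`), the field names, and the simplified rules inside each stage are all assumptions made for this example.

```python
import csv
import io
import json
from typing import Callable, Dict, Iterable, Iterator, List

Record = Dict[str, str]
Stage = Callable[[Iterable[Record]], Iterator[Record]]


def convert_csv(text: str) -> Iterator[Record]:
    """Stand-in for Convert: parse CSV text into one dict per row."""
    yield from csv.DictReader(io.StringIO(text))


def normalize(records: Iterable[Record]) -> Iterator[Record]:
    """Stand-in for Normalize: trim whitespace and lowercase every value."""
    for rec in records:
        # Skip the None key that DictReader uses for surplus columns.
        yield {k: (v or "").strip().lower() for k, v in rec.items() if k is not None}


def hygiene(records: Iterable[Record]) -> Iterator[Record]:
    """Stand-in for Hygiene: drop rows that have no last name."""
    for rec in records:
        if rec.get("last"):
            yield rec


def run_pipeline(source: Iterable[Record], stages: List[Stage]) -> List[Record]:
    """Feed the output of each stage into the next, in order."""
    records: Iterable[Record] = source
    for stage in stages:
        records = stage(records)
    return list(records)


def to_jsonl(records: Iterable[Record]) -> str:
    """Serialize records as JSONL: one JSON object per line."""
    return "\n".join(json.dumps(rec, sort_keys=True) for rec in records)
```

Because each stage is a generator, rows stream through one at a time, which keeps memory use flat for large input files. The real stages (libpostal normalization, OCR noise filtering, entity resolution) would replace these placeholders.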
Key Features
- Person ID Generation: SHA-256 hash of the normalized name and address fields
- Address Normalization: libpostal-based international address parsing
- Name Variations: Handles nickname equivalences (Billy = Bill = William)
- OCR Fallback: Mogger system with capability-based routing
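
One simple way to treat name variations as equivalent is to map every known nickname to a canonical form before comparing. The sketch below assumes a small hand-written lookup table. The `NICKNAMES` mapping and both function names are hypothetical and are not taken from the cbetl codebase, which may resolve variations differently.

```python
# Hypothetical nickname table; a real one would be much larger.
NICKNAMES = {
    "bill": "william",
    "billy": "william",
    "will": "william",
    "bob": "robert",
    "bobby": "robert",
}


def canonical_first_name(name: str) -> str:
    """Return the canonical form of a first name, or the cleaned name itself."""
    key = name.strip().lower()
    return NICKNAMES.get(key, key)


def same_first_name(a: str, b: str) -> bool:
    """True when two first names share the same canonical form."""
    return canonical_first_name(a) == canonical_first_name(b)
```

With this table, `same_first_name("Billy", "William")` is true, so both spellings would feed the same normalized first name into later stages.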
PID Algorithm
PID = SHA256({normalized_last}|{normalized_first}|{normalized_street}|{normalized_city}|{normalized_state}|{normalized_zip})
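
The formula above can be sketched with Python's standard `hashlib` module. Several details here are assumptions the formula does not specify: the input is UTF-8 encoded, the PID is the lowercase hex digest, and the dictionary keys (`last`, `first`, `street`, `city`, `state`, `zip`) are illustrative names for the already-normalized fields.

```python
import hashlib
from typing import Dict

# Field order follows the formula: last, first, street, city, state, zip.
PID_FIELDS = ("last", "first", "street", "city", "state", "zip")


def compute_pid(normalized: Dict[str, str]) -> str:
    """Join the normalized fields with '|' and hash the result with SHA-256."""
    key = "|".join(normalized[field] for field in PID_FIELDS)
    # Assumption: UTF-8 bytes in, hex string out.
    return hashlib.sha256(key.encode("utf-8")).hexdigest()


record = {
    "last": "smith",
    "first": "william",
    "street": "123 main st",
    "city": "austin",
    "state": "tx",
    "zip": "78701",
}
pid = compute_pid(record)  # 64-character hex string
```

Because the hash is deterministic, two records that normalize to identical field values receive the same PID, which is what lets entity resolution merge them. Any difference that survives normalization, even a single character, produces an unrelated PID.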
Dependencies
- libpostal: International address parsing (git submodule)
- Unstructured: OCR server for document extraction
- phonenumbers, nameparser, usaddress: Data parsing
Developer Resources
- CLAUDE.md - AI-assisted development guide
- GitHub Repository