ETL Pipeline (cbetl)

A contact data processing service that extracts, normalizes, and deduplicates political campaign data.

Overview

CBETL transforms raw contact data from various sources (CSV, PDF, Excel, images) into clean, normalized records with unique Person IDs (PIDs).

Pipeline Stages

flowchart LR
    A[Convert] --> B[Analyze]
    B --> C[Normalize]
    C --> D[Hygiene]
    D --> E[Entity Resolution]

Stage              Purpose
Convert            File type detection, format conversion to JSONL
Analyze            Data structure assessment
Normalize          Address (libpostal) and name normalization
Hygiene            OCR noise filtering, data validation
Entity Resolution  Deduplication, PID generation, contact merging
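
To illustrate how the stages chain together, here is a minimal sketch in which each stage is a function from a list of records to a list of records. All names (`normalize`, `hygiene`, `resolve`, `run_pipeline`) and the record fields are hypothetical, not cbetl's actual API. The stage bodies are simple stand-ins: whitespace and case folding replaces libpostal, and exact-match deduplication replaces real entity resolution. Convert and Analyze are omitted because they operate on files rather than parsed records.

```python
from typing import Callable

Record = dict[str, str]
Stage = Callable[[list[Record]], list[Record]]


def normalize(records: list[Record]) -> list[Record]:
    # Stand-in for libpostal/name normalization: collapse whitespace, uppercase.
    return [{k: " ".join(v.split()).upper() for k, v in r.items()} for r in records]


def hygiene(records: list[Record]) -> list[Record]:
    # Stand-in validation: drop records that have no name.
    return [r for r in records if r.get("name")]


def resolve(records: list[Record]) -> list[Record]:
    # Stand-in entity resolution: keep the first record per (name, address).
    seen: set[tuple[str, str]] = set()
    unique: list[Record] = []
    for r in records:
        key = (r.get("name", ""), r.get("address", ""))
        if key not in seen:
            seen.add(key)
            unique.append(r)
    return unique


def run_pipeline(records: list[Record], stages: list[Stage]) -> list[Record]:
    # Feed the output of each stage into the next, in order.
    for stage in stages:
        records = stage(records)
    return records


raw = [
    {"name": "bill smith", "address": "1 main st"},
    {"name": " Bill  Smith", "address": "1 Main St"},
    {"name": "", "address": "unknown"},
]
clean = run_pipeline(raw, [normalize, hygiene, resolve])
```

Keeping every stage on the same signature means a stage can be tested or swapped out on its own, which is one plausible reason to structure a pipeline this way.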

Key Features

  • Person ID Generation: SHA256 hash of normalized name + address
  • Address Normalization: libpostal-based international address parsing
  • Name Variations: Treats nicknames and formal names as equivalent (Billy = Bill = William)
  • OCR Fallback: Mogger system with capability-based routing
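
As a rough illustration of the name-variation feature, the sketch below maps nicknames to one canonical form before comparison. The `NICKNAMES` table and `canonical_first_name` function are hypothetical and only cover the Billy/Bill/William example from the list above; how cbetl actually stores or resolves equivalences is not specified here.

```python
# Hypothetical lookup table: nickname -> canonical first name (lowercase).
NICKNAMES = {
    "billy": "william",
    "bill": "william",
}


def canonical_first_name(name: str) -> str:
    """Return a canonical lowercase form so equivalent names compare equal."""
    key = name.strip().lower()
    # Names without a known equivalence map to themselves.
    return NICKNAMES.get(key, key)


same_person = (
    canonical_first_name("Billy")
    == canonical_first_name("Bill")
    == canonical_first_name("William")
)
```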

PID Algorithm

PID = SHA256({normalized_last}|{normalized_first}|{normalized_street}|{normalized_city}|{normalized_state}|{normalized_zip})
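
The formula above can be sketched with Python's standard `hashlib`. The function names, the UTF-8 encoding, the hex output, and the `_norm` helper are assumptions made for illustration. In particular, `_norm` only trims, collapses whitespace, and uppercases; it stands in for the libpostal and name normalization that the pipeline performs before hashing.

```python
import hashlib


def _norm(value: str) -> str:
    # Assumed stand-in for the real normalization step.
    return " ".join(value.split()).upper()


def compute_pid(last: str, first: str, street: str,
                city: str, state: str, zip_code: str) -> str:
    """Join the normalized fields with '|' and return the SHA-256 hex digest."""
    fields = [last, first, street, city, state, zip_code]
    key = "|".join(_norm(f) for f in fields)
    return hashlib.sha256(key.encode("utf-8")).hexdigest()


pid_a = compute_pid("Smith", "William", "1 Main St", "Austin", "TX", "78701")
pid_b = compute_pid(" smith ", "WILLIAM", "1  Main St", "austin", "tx", "78701")
pid_c = compute_pid("Smith", "William", "1 Main St", "Austin", "TX", "78702")
```

Here `pid_a` and `pid_b` are equal because both inputs normalize to the same key, while `pid_c` differs because the ZIP code changed. The hash only deduplicates records whose normalized fields match exactly, so the quality of the upstream normalization determines how well PIDs line up.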

Dependencies

  • libpostal: International address parsing (git submodule)
  • Unstructured: OCR server for document extraction
  • phonenumbers, nameparser, usaddress: Data parsing

Developer Resources