Contact Data Implementation Plan¶

Created: 2026-01-01 Status: Draft for Review

Core Insight¶

There are two client scenarios:

Scenario	Starting Point	Header Source
With i360	8.2M voter records	i360 IS the list
Without i360	Empty	User lists become the header

In both cases, we need a canonical person identity that: - Survives deduplication and merging - Is human-readable (for support/debugging) - Works with or without i360 - Can be deterministically derived from source data

The PID (Person ID) Concept¶

Using the serial.py approach, we generate a 6-character human-readable ID:

# Examples:
# FL SVID 2712345 → "K7WH4R"
# Name+Address hash → "9BPNXQ"

Characteristics: - Base32-like alphabet (no confusable 0/O/1/I/L) - Deterministic: same input → same output - Short enough for humans to read/speak - Collision-resistant for practical purposes

Architecture Decision: Person as Universal Header¶

┌──────────────────────────────────────────────────────────────────────┐
│                        PERSON TABLE                                   │
│              (Universal header - works with or without i360)          │
├──────────────────────────────────────────────────────────────────────┤
│                                                                       │
│   pid         VARCHAR(6)   Human-readable ID (K7WH4R)                │
│   svid        BIGINT       i360 State Voter ID (nullable)            │
│   first_name  VARCHAR      Canonical name                            │
│   last_name   VARCHAR                                                 │
│   address1    VARCHAR      Canonical address                         │
│   city        VARCHAR                                                 │
│   state       VARCHAR                                                 │
│   zip         VARCHAR                                                 │
│   ...                                                                 │
│   source_type VARCHAR      'i360', 'import', 'manual'                │
│   match_score FLOAT        Confidence in identity (0-1)              │
│   match_type  VARCHAR      How identity was established              │
│                                                                       │
└──────────────────────────────────────────────────────────────────────┘
                                    │
                                    │ pid (FK)
                                    ▼
┌──────────────────────────────────────────────────────────────────────┐
│                     SOURCE_RECORD TABLE                               │
│              (Audit trail of all source data)                         │
├──────────────────────────────────────────────────────────────────────┤
│   id          VARCHAR      UUID                                       │
│   pid         VARCHAR(6)   Links to person                           │
│   source_type VARCHAR      'i360', 'csv', 'api'                      │
│   source_name VARCHAR      'i360_voters', 'donors_2024.csv'          │
│   source_id   VARCHAR      SVID, row hash, or external ID            │
│   raw_data    JSON         Original source record                    │
│   imported_at TIMESTAMP                                              │
│   imported_by VARCHAR                                                │
└──────────────────────────────────────────────────────────────────────┘

PID Generation Strategy¶

For i360 Records (SVID available)¶

def generate_pid_from_svid(state: str, svid: int) -> str:
    """
    Generate deterministic PID from state + SVID.

    FL-2712345 → "K7WH4R"
    """
    input_str = f"{state.upper()}-{svid}"
    return get_serial(input_str, length=6)

Why this works: - SVID is unique within state - State prefix handles multi-state campaigns - Same voter always gets same PID - Human-readable for debugging

For Non-i360 Records (No SVID)¶

def generate_pid_from_fields(
    first_name: str,
    last_name: str,
    address1: str,
    zip_code: str
) -> str:
    """
    Generate deterministic PID from canonical fields.

    Best available fields → consistent hash → PID
    """
    # Normalize inputs
    canonical = normalize_for_hash(first_name, last_name, address1, zip_code)
    return get_serial(canonical, length=6)

def normalize_for_hash(first_name, last_name, address1, zip_code) -> str:
    """Normalize fields for consistent hashing."""
    return "|".join([
        first_name.strip().upper(),
        last_name.strip().upper(),
        normalize_address(address1),  # "123 Main St" → "123 MAIN ST"
        zip_code[:5] if zip_code else ""
    ])

Collision handling: - 6-char base32 = ~1B combinations - For 100K records, collision probability is ~0.005% - If collision detected, append suffix: "K7WH4R" → "K7WH4R-2"

Two-Scenario Import Flow¶

Scenario A: Client WITH i360 Data¶

┌─────────────────────────────────────────────────────────────────────┐
│                    CLIENT WITH i360                                  │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  Step 1: Bootstrap from i360                                        │
│  ┌────────────────────────────────────────────────────────────┐     │
│  │ For each i360 voter in district:                           │     │
│  │   pid = generate_pid_from_svid(state, svid)                │     │
│  │   INSERT INTO person (pid, svid, first_name, ...)          │     │
│  │   source_type = 'i360', match_score = 1.0                  │     │
│  └────────────────────────────────────────────────────────────┘     │
│                                                                      │
│  Step 2: Import user list (donors, volunteers, etc.)                │
│  ┌────────────────────────────────────────────────────────────┐     │
│  │ For each row:                                               │     │
│  │   IF row has SVID:                                          │     │
│  │     person = lookup_by_svid(svid)                          │     │
│  │     IF found: enrich person with new data                  │     │
│  │     ELSE: create new (non-voter contact)                   │     │
│  │   ELSE:                                                     │     │
│  │     candidates = simple_match(name, address)               │     │
│  │     IF single match with score >= threshold:               │     │
│  │       enrich person with new data                          │     │
│  │     ELSE:                                                   │     │
│  │       create new person (will be unlinked to i360)         │     │
│  └────────────────────────────────────────────────────────────┘     │
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘

Scenario B: Client WITHOUT i360 Data¶

┌─────────────────────────────────────────────────────────────────────┐
│                   CLIENT WITHOUT i360                                │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  Step 1: Import first list (becomes the "header")                   │
│  ┌────────────────────────────────────────────────────────────┐     │
│  │ For each row:                                               │     │
│  │   pid = generate_pid_from_fields(name, address, zip)       │     │
│  │   INSERT INTO person (pid, first_name, ...)                │     │
│  │   IF row has SVID-like field: store in svid column         │     │
│  │   source_type = 'import', match_score = 1.0 (by definition)│     │
│  └────────────────────────────────────────────────────────────┘     │
│                                                                      │
│  Step 2: Import subsequent lists                                    │
│  ┌────────────────────────────────────────────────────────────┐     │
│  │ For each row:                                               │     │
│  │   candidates = simple_match(name, address)                 │     │
│  │   IF single match with score >= threshold:                 │     │
│  │     enrich person with new data                            │     │
│  │   ELSE IF multiple matches:                                │     │
│  │     flag for manual review (or pick best)                  │     │
│  │   ELSE:                                                     │     │
│  │     create new person                                       │     │
│  └────────────────────────────────────────────────────────────┘     │
│                                                                      │
│  Step 3: (Future) Retroactive i360 matching                         │
│  ┌────────────────────────────────────────────────────────────┐     │
│  │ When i360 data becomes available:                          │     │
│  │   For each person without svid:                            │     │
│  │     candidates = match_to_i360(name, address)              │     │
│  │     IF confident match: set svid, recalc pid               │     │
│  └────────────────────────────────────────────────────────────┘     │
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘

Simple Matching for MVP¶

Match Strategy (Priority Order)¶

Priority	Strategy	Score	Description
1	SVID exact	1.0	If both records have SVID, compare directly
2	Name + Address + ZIP	0.95	first + last + address1 + zip5
3	Name + City	0.70	first + last + city (current behavior)
4	Name + Phone	0.85	first + last + normalized phone
5	Name + Email	0.80	first + last + email domain match

MVP Match Function¶

def simple_match(
    new_record: dict,
    threshold: float = 0.75  # Configurable, default 75%
) -> tuple[Person | None, float, str]:
    """
    Attempt to match a new record to existing persons.

    Returns: (matched_person, score, match_type)
    """
    # Priority 1: SVID match
    if new_record.get('svid'):
        person = lookup_by_svid(new_record['svid'])
        if person:
            return (person, 1.0, 'svid_exact')

    # Priority 2: Name + Address + ZIP
    if all([new_record.get(f) for f in ('first_name', 'last_name', 'address1', 'zip')]):
        person = lookup_by_name_address_zip(
            new_record['first_name'],
            new_record['last_name'],
            new_record['address1'],
            new_record['zip'][:5]
        )
        if person:
            return (person, 0.95, 'name_address_zip')

    # Priority 3: Name + City
    if all([new_record.get(f) for f in ('first_name', 'last_name', 'city')]):
        persons = lookup_by_name_city(
            new_record['first_name'],
            new_record['last_name'],
            new_record['city']
        )
        if len(persons) == 1:
            return (persons[0], 0.70, 'name_city')

    # No confident match
    return (None, 0.0, 'none')

Match Threshold Configuration¶

# Default: 75% (between name+city 70% and name+address 95%)
# This means name+city matches are accepted by default
MATCH_THRESHOLD_DEFAULT = 0.75

# Can be configured per-tenant in integration_setting
# integration = 'matching', key = 'threshold'
def get_match_threshold() -> float:
    custom = get_integration_setting('matching', 'threshold')
    return float(custom) if custom else MATCH_THRESHOLD_DEFAULT

Schema Changes¶

Migration 1: Add PID and SVID columns¶

-- Add new columns
ALTER TABLE person ADD COLUMN pid VARCHAR(10);  -- 6 chars + possible suffix
ALTER TABLE person ADD COLUMN svid BIGINT;
ALTER TABLE person ADD COLUMN source_type VARCHAR DEFAULT 'legacy';
ALTER TABLE person ADD COLUMN match_score FLOAT DEFAULT 1.0;
ALTER TABLE person ADD COLUMN match_type VARCHAR DEFAULT 'legacy';

-- Create indexes
CREATE UNIQUE INDEX idx_person_pid ON person(pid);
CREATE INDEX idx_person_svid ON person(svid);
CREATE INDEX idx_person_name_city ON person(first_name, last_name, city);
CREATE INDEX idx_person_name_address ON person(first_name, last_name, address1, zip);

Migration 2: Populate PIDs for existing records¶

def migrate_existing_pids():
    """Generate PIDs for all existing person records."""

    # Get all persons with their SVID custom field value
    persons = execute_query("""
        SELECT p.id, p.first_name, p.last_name, p.address1, p.zip,
               pcf.value as svid_value
        FROM person p
        LEFT JOIN person_custom_field pcf ON pcf.person_id = p.id
        LEFT JOIN custom_field cf ON pcf.custom_field_id = cf.id AND cf.name = 'State Voter ID'
    """)

    for person in persons:
        if person['svid_value']:
            # Has SVID - use it
            svid = int(person['svid_value'])
            pid = generate_pid_from_svid('FL', svid)
            execute_query(
                "UPDATE person SET pid = ?, svid = ?, source_type = 'i360' WHERE id = ?",
                (pid, svid, person['id'])
            )
        else:
            # No SVID - generate from fields
            pid = generate_pid_from_fields(
                person['first_name'],
                person['last_name'],
                person['address1'],
                person['zip']
            )
            execute_query(
                "UPDATE person SET pid = ?, source_type = 'import' WHERE id = ?",
                (pid, person['id'])
            )

Migration 3: Source Record Audit Table¶

CREATE TABLE IF NOT EXISTS source_record (
    id VARCHAR PRIMARY KEY,
    pid VARCHAR(10) NOT NULL,
    source_type VARCHAR NOT NULL,      -- 'i360', 'csv', 'api', 'manual'
    source_name VARCHAR,               -- 'i360_voters', 'donors_2024.csv'
    source_id VARCHAR,                 -- SVID, row hash, external ID
    raw_data JSON,                     -- Original source record
    imported_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    imported_by VARCHAR,
    FOREIGN KEY (pid) REFERENCES person(pid)
);

CREATE INDEX idx_source_record_pid ON source_record(pid);
CREATE INDEX idx_source_record_source ON source_record(source_type, source_name);

Implementation Phases¶

Phase 1: Schema + PID Generator (This Week)¶

Files to create: - src/api/services/pid_generator.py - PID generation functions - scripts/migrate_add_pid_svid.py - Migration script

Files to modify: - scripts/create_schema.py - Add new columns and indexes

Deliverables: - [ ] PID generator with tests - [ ] Schema migration script - [ ] Run migration on testsite

Phase 2: Import Flow Update¶

Files to modify: - src/api/routes/list_loader.py - Add matching + PID assignment - src/api/routes/i360.py - Add PID assignment on i360 import

Deliverables: - [ ] List loader assigns PIDs - [ ] i360 import assigns PIDs - [ ] Simple matching implementation

Phase 3: Source Record Audit Trail¶

Files to create: - src/api/models/source_record.py - Pydantic models - src/api/routes/source_records.py - API for viewing lineage

Deliverables: - [ ] Source record table populated on import - [ ] API to view person's source records - [ ] UI to show "where did this data come from"

Phase 4: cbmodels Integration¶

Files to modify: - cbmodels/src/cbmodels/api/segment.py - Use PID/SVID directly

Deliverables: - [ ] Segment analysis uses svid column directly - [ ] Remove custom_field lookup overhead

Configuration Options¶

# integration_setting table entries for matching configuration

# Match threshold (default 0.75)
# integration = 'matching', key = 'threshold', value = '0.75'

# Auto-match enabled (default true)
# integration = 'matching', key = 'auto_match', value = 'true'

# State for SVID generation (default 'FL')
# integration = 'matching', key = 'state_code', value = 'FL'

Testing Strategy¶

Unit Tests¶

PID generation determinism (same input → same output)
PID collision detection
Match score calculation
Threshold comparison

Integration Tests¶

Import with SVID → correct PID assigned
Import without SVID → correct PID generated
Duplicate detection with matching
Source record audit trail

Migration Tests¶

Existing records get PIDs
SVID values migrated from custom_field
No data loss

Open Items (Post-MVP)¶

Fuzzy matching - Soundex, Levenshtein, nickname dictionary
Manual match UI - User can verify/correct suggested matches
Match queue - Low-confidence matches queued for review
Multi-state PIDs - State prefix in PID for cross-state campaigns
PID stability - What happens if canonical fields change?
Merge/unmerge - Combine duplicates, split incorrect matches

Success Criteria¶

Every person record has a PID
PIDs are deterministic (same source → same PID)
PIDs are human-readable (can speak over phone)
i360 records have SVID populated
Imports match existing records at configurable threshold
Source lineage is preserved for audit