Contact Data Implementation Plan¶
Created: 2026-01-01 Status: Draft for Review
Core Insight¶
There are two client scenarios:
| Scenario | Starting Point | Header Source |
|---|---|---|
| With i360 | 8.2M voter records | i360 IS the list |
| Without i360 | Empty | User lists become the header |
In both cases, we need a canonical person identity that: - Survives deduplication and merging - Is human-readable (for support/debugging) - Works with or without i360 - Can be deterministically derived from source data
The PID (Person ID) Concept¶
Using the serial.py approach, we generate a 6-character human-readable ID:
Characteristics: - Base32-like alphabet (no confusable 0/O/1/I/L) - Deterministic: same input → same output - Short enough for humans to read/speak - Collision-resistant for practical purposes
Architecture Decision: Person as Universal Header¶
┌──────────────────────────────────────────────────────────────────────┐
│ PERSON TABLE │
│ (Universal header - works with or without i360) │
├──────────────────────────────────────────────────────────────────────┤
│ │
│ pid VARCHAR(6) Human-readable ID (K7WH4R) │
│ svid BIGINT i360 State Voter ID (nullable) │
│ first_name VARCHAR Canonical name │
│ last_name VARCHAR │
│ address1 VARCHAR Canonical address │
│ city VARCHAR │
│ state VARCHAR │
│ zip VARCHAR │
│ ... │
│ source_type VARCHAR 'i360', 'import', 'manual' │
│ match_score FLOAT Confidence in identity (0-1) │
│ match_type VARCHAR How identity was established │
│ │
└──────────────────────────────────────────────────────────────────────┘
│
│ pid (FK)
▼
┌──────────────────────────────────────────────────────────────────────┐
│ SOURCE_RECORD TABLE │
│ (Audit trail of all source data) │
├──────────────────────────────────────────────────────────────────────┤
│ id VARCHAR UUID │
│ pid VARCHAR(6) Links to person │
│ source_type VARCHAR 'i360', 'csv', 'api' │
│ source_name VARCHAR 'i360_voters', 'donors_2024.csv' │
│ source_id VARCHAR SVID, row hash, or external ID │
│ raw_data JSON Original source record │
│ imported_at TIMESTAMP │
│ imported_by VARCHAR │
└──────────────────────────────────────────────────────────────────────┘
PID Generation Strategy¶
For i360 Records (SVID available)¶
def generate_pid_from_svid(state: str, svid: int) -> str:
"""
Generate deterministic PID from state + SVID.
FL-2712345 → "K7WH4R"
"""
input_str = f"{state.upper()}-{svid}"
return get_serial(input_str, length=6)
Why this works: - SVID is unique within state - State prefix handles multi-state campaigns - Same voter always gets same PID - Human-readable for debugging
For Non-i360 Records (No SVID)¶
def generate_pid_from_fields(
first_name: str,
last_name: str,
address1: str,
zip_code: str
) -> str:
"""
Generate deterministic PID from canonical fields.
Best available fields → consistent hash → PID
"""
# Normalize inputs
canonical = normalize_for_hash(first_name, last_name, address1, zip_code)
return get_serial(canonical, length=6)
def normalize_for_hash(first_name, last_name, address1, zip_code) -> str:
"""Normalize fields for consistent hashing."""
return "|".join([
first_name.strip().upper(),
last_name.strip().upper(),
normalize_address(address1), # "123 Main St" → "123 MAIN ST"
zip_code[:5] if zip_code else ""
])
Collision handling: - 6-char base32 = ~1B combinations - For 100K records, collision probability is ~0.005% - If collision detected, append suffix: "K7WH4R" → "K7WH4R-2"
Two-Scenario Import Flow¶
Scenario A: Client WITH i360 Data¶
┌─────────────────────────────────────────────────────────────────────┐
│ CLIENT WITH i360 │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ Step 1: Bootstrap from i360 │
│ ┌────────────────────────────────────────────────────────────┐ │
│ │ For each i360 voter in district: │ │
│ │ pid = generate_pid_from_svid(state, svid) │ │
│ │ INSERT INTO person (pid, svid, first_name, ...) │ │
│ │ source_type = 'i360', match_score = 1.0 │ │
│ └────────────────────────────────────────────────────────────┘ │
│ │
│ Step 2: Import user list (donors, volunteers, etc.) │
│ ┌────────────────────────────────────────────────────────────┐ │
│ │ For each row: │ │
│ │ IF row has SVID: │ │
│ │ person = lookup_by_svid(svid) │ │
│ │ IF found: enrich person with new data │ │
│ │ ELSE: create new (non-voter contact) │ │
│ │ ELSE: │ │
│ │ candidates = simple_match(name, address) │ │
│ │ IF single match with score >= threshold: │ │
│ │ enrich person with new data │ │
│ │ ELSE: │ │
│ │ create new person (will be unlinked to i360) │ │
│ └────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────┘
Scenario B: Client WITHOUT i360 Data¶
┌─────────────────────────────────────────────────────────────────────┐
│ CLIENT WITHOUT i360 │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ Step 1: Import first list (becomes the "header") │
│ ┌────────────────────────────────────────────────────────────┐ │
│ │ For each row: │ │
│ │ pid = generate_pid_from_fields(name, address, zip) │ │
│ │ INSERT INTO person (pid, first_name, ...) │ │
│ │ IF row has SVID-like field: store in svid column │ │
│ │ source_type = 'import', match_score = 1.0 (by definition)│ │
│ └────────────────────────────────────────────────────────────┘ │
│ │
│ Step 2: Import subsequent lists │
│ ┌────────────────────────────────────────────────────────────┐ │
│ │ For each row: │ │
│ │ candidates = simple_match(name, address) │ │
│ │ IF single match with score >= threshold: │ │
│ │ enrich person with new data │ │
│ │ ELSE IF multiple matches: │ │
│ │ flag for manual review (or pick best) │ │
│ │ ELSE: │ │
│ │ create new person │ │
│ └────────────────────────────────────────────────────────────┘ │
│ │
│ Step 3: (Future) Retroactive i360 matching │
│ ┌────────────────────────────────────────────────────────────┐ │
│ │ When i360 data becomes available: │ │
│ │ For each person without svid: │ │
│ │ candidates = match_to_i360(name, address) │ │
│ │ IF confident match: set svid, recalc pid │ │
│ └────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────┘
Simple Matching for MVP¶
Match Strategy (Priority Order)¶
| Priority | Strategy | Score | Description |
|---|---|---|---|
| 1 | SVID exact | 1.0 | If both records have SVID, compare directly |
| 2 | Name + Address + ZIP | 0.95 | first + last + address1 + zip5 |
| 3 | Name + City | 0.70 | first + last + city (current behavior) |
| 4 | Name + Phone | 0.85 | first + last + normalized phone |
| 5 | Name + Email | 0.80 | first + last + email domain match |
MVP Match Function¶
def simple_match(
new_record: dict,
threshold: float = 0.75 # Configurable, default 75%
) -> tuple[Person | None, float, str]:
"""
Attempt to match a new record to existing persons.
Returns: (matched_person, score, match_type)
"""
# Priority 1: SVID match
if new_record.get('svid'):
person = lookup_by_svid(new_record['svid'])
if person:
return (person, 1.0, 'svid_exact')
# Priority 2: Name + Address + ZIP
if all([new_record.get(f) for f in ('first_name', 'last_name', 'address1', 'zip')]):
person = lookup_by_name_address_zip(
new_record['first_name'],
new_record['last_name'],
new_record['address1'],
new_record['zip'][:5]
)
if person:
return (person, 0.95, 'name_address_zip')
# Priority 3: Name + City
if all([new_record.get(f) for f in ('first_name', 'last_name', 'city')]):
persons = lookup_by_name_city(
new_record['first_name'],
new_record['last_name'],
new_record['city']
)
if len(persons) == 1:
return (persons[0], 0.70, 'name_city')
# No confident match
return (None, 0.0, 'none')
Match Threshold Configuration¶
# Default: 75% (between name+city 70% and name+address 95%)
# This means name+city matches are accepted by default
MATCH_THRESHOLD_DEFAULT = 0.75
# Can be configured per-tenant in integration_setting
# integration = 'matching', key = 'threshold'
def get_match_threshold() -> float:
custom = get_integration_setting('matching', 'threshold')
return float(custom) if custom else MATCH_THRESHOLD_DEFAULT
Schema Changes¶
Migration 1: Add PID and SVID columns¶
-- Add new columns
ALTER TABLE person ADD COLUMN pid VARCHAR(10); -- 6 chars + possible suffix
ALTER TABLE person ADD COLUMN svid BIGINT;
ALTER TABLE person ADD COLUMN source_type VARCHAR DEFAULT 'legacy';
ALTER TABLE person ADD COLUMN match_score FLOAT DEFAULT 1.0;
ALTER TABLE person ADD COLUMN match_type VARCHAR DEFAULT 'legacy';
-- Create indexes
CREATE UNIQUE INDEX idx_person_pid ON person(pid);
CREATE INDEX idx_person_svid ON person(svid);
CREATE INDEX idx_person_name_city ON person(first_name, last_name, city);
CREATE INDEX idx_person_name_address ON person(first_name, last_name, address1, zip);
Migration 2: Populate PIDs for existing records¶
def migrate_existing_pids():
"""Generate PIDs for all existing person records."""
# Get all persons with their SVID custom field value
persons = execute_query("""
SELECT p.id, p.first_name, p.last_name, p.address1, p.zip,
pcf.value as svid_value
FROM person p
LEFT JOIN person_custom_field pcf ON pcf.person_id = p.id
LEFT JOIN custom_field cf ON pcf.custom_field_id = cf.id AND cf.name = 'State Voter ID'
""")
for person in persons:
if person['svid_value']:
# Has SVID - use it
svid = int(person['svid_value'])
pid = generate_pid_from_svid('FL', svid)
execute_query(
"UPDATE person SET pid = ?, svid = ?, source_type = 'i360' WHERE id = ?",
(pid, svid, person['id'])
)
else:
# No SVID - generate from fields
pid = generate_pid_from_fields(
person['first_name'],
person['last_name'],
person['address1'],
person['zip']
)
execute_query(
"UPDATE person SET pid = ?, source_type = 'import' WHERE id = ?",
(pid, person['id'])
)
Migration 3: Source Record Audit Table¶
CREATE TABLE IF NOT EXISTS source_record (
id VARCHAR PRIMARY KEY,
pid VARCHAR(10) NOT NULL,
source_type VARCHAR NOT NULL, -- 'i360', 'csv', 'api', 'manual'
source_name VARCHAR, -- 'i360_voters', 'donors_2024.csv'
source_id VARCHAR, -- SVID, row hash, external ID
raw_data JSON, -- Original source record
imported_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
imported_by VARCHAR,
FOREIGN KEY (pid) REFERENCES person(pid)
);
CREATE INDEX idx_source_record_pid ON source_record(pid);
CREATE INDEX idx_source_record_source ON source_record(source_type, source_name);
Implementation Phases¶
Phase 1: Schema + PID Generator (This Week)¶
Files to create:
- src/api/services/pid_generator.py - PID generation functions
- scripts/migrate_add_pid_svid.py - Migration script
Files to modify:
- scripts/create_schema.py - Add new columns and indexes
Deliverables: - [ ] PID generator with tests - [ ] Schema migration script - [ ] Run migration on testsite
Phase 2: Import Flow Update¶
Files to modify:
- src/api/routes/list_loader.py - Add matching + PID assignment
- src/api/routes/i360.py - Add PID assignment on i360 import
Deliverables: - [ ] List loader assigns PIDs - [ ] i360 import assigns PIDs - [ ] Simple matching implementation
Phase 3: Source Record Audit Trail¶
Files to create:
- src/api/models/source_record.py - Pydantic models
- src/api/routes/source_records.py - API for viewing lineage
Deliverables: - [ ] Source record table populated on import - [ ] API to view person's source records - [ ] UI to show "where did this data come from"
Phase 4: cbmodels Integration¶
Files to modify:
- cbmodels/src/cbmodels/api/segment.py - Use PID/SVID directly
Deliverables: - [ ] Segment analysis uses svid column directly - [ ] Remove custom_field lookup overhead
Configuration Options¶
# integration_setting table entries for matching configuration
# Match threshold (default 0.75)
# integration = 'matching', key = 'threshold', value = '0.75'
# Auto-match enabled (default true)
# integration = 'matching', key = 'auto_match', value = 'true'
# State for SVID generation (default 'FL')
# integration = 'matching', key = 'state_code', value = 'FL'
Testing Strategy¶
Unit Tests¶
- PID generation determinism (same input → same output)
- PID collision detection
- Match score calculation
- Threshold comparison
Integration Tests¶
- Import with SVID → correct PID assigned
- Import without SVID → correct PID generated
- Duplicate detection with matching
- Source record audit trail
Migration Tests¶
- Existing records get PIDs
- SVID values migrated from custom_field
- No data loss
Open Items (Post-MVP)¶
- Fuzzy matching - Soundex, Levenshtein, nickname dictionary
- Manual match UI - User can verify/correct suggested matches
- Match queue - Low-confidence matches queued for review
- Multi-state PIDs - State prefix in PID for cross-state campaigns
- PID stability - What happens if canonical fields change?
- Merge/unmerge - Combine duplicates, split incorrect matches
Success Criteria¶
- Every person record has a PID
- PIDs are deterministic (same source → same PID)
- PIDs are human-readable (can speak over phone)
- i360 records have SVID populated
- Imports match existing records at configurable threshold
- Source lineage is preserved for audit