
List Loader Redesign - Phase 1: File Ingestion

Date: 2025-12-26
Status: Planning
Scope: File upload → Parsing → Clean IR output


Executive Summary

Consolidate List Loader functionality from cbtenant into cbapp with:

  1. Centralized file storage on files.nominate.ai (cbfiles/MinIO)
  2. Support for CSV, TSV, and Excel files
  3. AI-assisted field mapping (existing Claude integration)
  4. Clean Intermediate Representation (IR) output for downstream processing

Phase 1 Focus: Get files from user → file server → parsed → IR
Phase 2 (Future): Universal schema, enrichment, and database ingestion


Current Architecture

What Exists Today

| Component          | Location                                   | Status                       |
|--------------------|--------------------------------------------|------------------------------|
| cbtenant List Loader | cbtenant/api/routes/list_loader.py       | To be deprecated             |
| cbapp List Loader  | cbapp/src/api/routes/list_loader.py        | Functional, CSV only         |
| AI Agent           | cbapp/src/api/agents/list_loader.py        | Works with Claude            |
| cbfiles            | files.nominate.ai                          | Ready, unused by List Loader |
| Import Template UI | cbapp/src/app/templates/imports/index.html | Mockup only                  |

Current Data Flow (cbapp)

User uploads CSV → /tmp/cbapp-uploads → analyze → AI mapping → import to DB

Problems:

  1. Files stored in /tmp/ (ephemeral, no audit trail)
  2. CSV only (no Excel, TSV)
  3. No worksheet selection for multi-sheet Excel
  4. Files deleted after import (no reprocessing possible)
  5. No standardized file naming or organization


Proposed Architecture

New Data Flow

┌─────────────────────────────────────────────────────────────────────────┐
│                           USER UPLOAD                                    │
│  Browser → cbapp /api/list-loader/upload                                │
└─────────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────────┐
│                         FILE SERVER (cbfiles)                           │
│  POST files.nominate.ai/buckets/{tenant}/files/imports/{job_id}/{file}  │
│                                                                          │
│  Bucket: tenant-specific (e.g., "testsite")                             │
│  Path:   imports/{job_id}/original.{csv|xlsx|tsv}                       │
└─────────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────────┐
│                           PARSE & ANALYZE                               │
│  1. Detect file type (CSV, TSV, Excel)                                  │
│  2. For Excel: list worksheets, let user choose                         │
│  3. Parse selected sheet/file                                           │
│  4. Extract headers + sample rows                                       │
│  5. Store parsed data as IR in file server                              │
└─────────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────────┐
│                         AI FIELD MAPPING                                │
│  1. Heuristic mapping (fast, no API call)                               │
│  2. AI mapping (Claude, if requested)                                   │
│  3. User can adjust mappings manually                                   │
│  4. Save final mapping in job metadata                                  │
└─────────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────────┐
│                    INTERMEDIATE REPRESENTATION (IR)                     │
│  Clean, normalized JSON ready for schema application                    │
│  Stored: imports/{job_id}/ir.json                                       │
│                                                                          │
│  {                                                                       │
│    "job_id": "imp_abc123",                                              │
│    "source_file": "volunteers.xlsx",                                    │
│    "worksheet": "Sheet1",                                               │
│    "row_count": 150,                                                    │
│    "mappings": [...],                                                   │
│    "records": [                                                         │
│      {"first_name": "John", "last_name": "Doe", "email": "j@x.com"},   │
│      ...                                                                │
│    ]                                                                    │
│  }                                                                      │
└─────────────────────────────────────────────────────────────────────────┘
                    ┌───────────────────────────┐
                    │      PHASE 2 (FUTURE)     │
                    │  Schema Application       │
                    │  Enrichment               │
                    │  Database Ingestion       │
                    └───────────────────────────┘

File Server Integration

Bucket Structure

{tenant-bucket}/
├── imports/
│   ├── {job_id}/
│   │   ├── original.csv          # Original uploaded file
│   │   ├── original.xlsx         # Or Excel
│   │   ├── metadata.json         # Job metadata
│   │   ├── ir.json               # Intermediate representation
│   │   └── result.json           # Import result (after Phase 2)
│   └── ...
├── exports/
│   └── ...
└── documents/
    └── ...

File Naming Convention

| File           | Purpose                                 |
|----------------|-----------------------------------------|
| original.{ext} | Unmodified uploaded file                |
| metadata.json  | Job info, mappings, status, timestamps  |
| ir.json        | Parsed & mapped data (clean records)    |
| result.json    | Import results (counts, errors)         |

cbfiles API Usage

Upload original file:

POST /buckets/{tenant}/files/imports/{job_id}/original.csv
Authorization: Bearer {token}
Content-Type: text/csv

[file content]

Download for parsing:

GET /buckets/{tenant}/files/imports/{job_id}/original.csv
Authorization: Bearer {token}

Store IR:

POST /buckets/{tenant}/files/imports/{job_id}/ir.json
Authorization: Bearer {token}
Content-Type: application/json

{"job_id": "...", "records": [...]}

Authentication Flow

cbfiles uses JWT tokens. Options:

  1. Proxy through cbapp (recommended for Phase 1)
     • cbapp receives upload, forwards to cbfiles with service token
     • User never directly talks to cbfiles

  2. Signed URLs (for large files)
     • cbapp generates presigned upload URL
     • Browser uploads directly to cbfiles
     • cbapp notified on completion
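
A rough sketch of option 1 as a cbapp route (assuming a FastAPI-style router, which the route layout suggests; new_job_id and the module-level cbfiles client are hypothetical):

from pathlib import Path
from fastapi import APIRouter, UploadFile

router = APIRouter()

@router.post("/api/list-loader/upload")
async def upload_file(file: UploadFile):
    job_id = new_job_id()  # hypothetical helper, e.g. "imp_" + random hex
    content = await file.read()
    ext = Path(file.filename).suffix.lower()
    # Forward to cbfiles with the service token; the browser never talks to cbfiles
    cbfiles.upload(f"imports/{job_id}/original{ext}", content,
                   file.content_type or "application/octet-stream")
    return {"job_id": job_id, "status": "uploaded", "filename": file.filename}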

File Parsing

Supported Formats (Phase 1)

| Format | Extension   | Library                        |
|--------|-------------|--------------------------------|
| CSV    | .csv        | Python csv (built-in)          |
| TSV    | .tsv, .txt  | Python csv with delimiter='\t' |
| Excel  | .xlsx, .xls | openpyxl (xlsx) / xlrd (xls)   |

Parsing Logic

from pathlib import Path

def parse_file(file_path: str, worksheet: str | None = None) -> ParseResult:
    """
    Parse uploaded file into headers + rows.

    Args:
        file_path: Path to file (local or cbfiles URL)
        worksheet: For Excel, which sheet to parse

    Returns:
        ParseResult with headers, rows, and metadata
    """
    ext = Path(file_path).suffix.lower()

    if ext in ('.csv', '.tsv', '.txt'):
        # TSV and .txt are tab-delimited (see format table above)
        delimiter = '\t' if ext in ('.tsv', '.txt') else ','
        return parse_csv(file_path, delimiter=delimiter)
    elif ext in ('.xlsx', '.xls'):
        return parse_excel(file_path, worksheet)
    else:
        raise UnsupportedFileType(ext)
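
For completeness, a sketch of the two helpers parse_file dispatches to (ParseResult's exact shape is an assumption; the real implementation may differ):

import csv
from dataclasses import dataclass, field

@dataclass
class ParseResult:
    headers: list[str]
    rows: list[list[str]]
    metadata: dict = field(default_factory=dict)

def parse_csv(file_path: str, delimiter: str = ',') -> ParseResult:
    # utf-8-sig strips a BOM if Excel exported the CSV
    with open(file_path, newline='', encoding='utf-8-sig') as f:
        reader = csv.reader(f, delimiter=delimiter)
        headers = next(reader)
        rows = [r for r in reader if any(cell.strip() for cell in r)]
    return ParseResult(headers, rows, {"file_type": "csv"})

def parse_excel(file_path: str, worksheet: str | None = None) -> ParseResult:
    from openpyxl import load_workbook
    wb = load_workbook(file_path, read_only=True, data_only=True)
    ws = wb[worksheet] if worksheet else wb.active
    row_iter = ws.iter_rows(values_only=True)
    headers = ["" if c is None else str(c) for c in next(row_iter)]
    rows = [["" if c is None else str(c) for c in r] for r in row_iter]
    return ParseResult(headers, rows,
                       {"file_type": "excel", "worksheet": ws.title})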

Excel Worksheet Selection

For Excel files with multiple sheets:

  1. API returns sheet list:

    {
      "job_id": "imp_abc123",
      "file_type": "excel",
      "worksheets": ["Contacts", "Events", "Summary"],
      "status": "awaiting_worksheet_selection"
    }
    

  2. User selects sheet:

    POST /api/list-loader/{job_id}/select-worksheet
    {"worksheet": "Contacts"}
    

  3. Parsing continues with selected sheet
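
The sheet list behind step 1 is cheap to produce with openpyxl's read-only mode; a sketch (note max_row is an upper bound that can include trailing blank rows):

from openpyxl import load_workbook

def list_worksheets(file_path: str) -> dict[str, int]:
    wb = load_workbook(file_path, read_only=True)
    # e.g. {"Contacts": 151, "Events": 26} -- counts include the header row
    return {ws.title: ws.max_row or 0 for ws in wb.worksheets}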


AI Field Mapping

Existing Implementation (Keep)

The current ListLoaderAgent works well:

  • heuristic_mapping() - Fast, rule-based
  • ai_mapping() - Claude API, context-aware

Improvements

  1. Cache AI mappings - Same headers = reuse mapping
  2. Learning from corrections - Track when users change AI suggestions
  3. Better prompting - Include more sample data for ambiguous columns
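
Improvement 1 can start as a cache keyed on the normalized header set; a sketch (in-memory here, and ai_mapping's exact signature is an assumption):

import hashlib
import json

_mapping_cache: dict[str, list[dict]] = {}  # swap for Redis/DB to persist

def mapping_cache_key(headers: list[str]) -> str:
    # Order- and case-insensitive: "Email" vs "email" hit the same entry
    normalized = sorted(h.strip().lower() for h in headers)
    return hashlib.sha256(json.dumps(normalized).encode()).hexdigest()

def cached_ai_mapping(agent, headers: list[str], sample_rows: list[list[str]]):
    key = mapping_cache_key(headers)
    if key not in _mapping_cache:
        _mapping_cache[key] = agent.ai_mapping(headers, sample_rows)
    return _mapping_cache[key]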

Mapping Workflow

1. Upload file
2. Parse headers + 5 sample rows
3. Run heuristic_mapping() immediately (instant)
4. User can click "AI Suggest" to run ai_mapping() (async)
5. User adjusts mappings manually if needed
6. Save final mappings to metadata.json
7. Apply mappings to all rows → IR

Intermediate Representation (IR)

Structure

{
  "version": "1.0",
  "job_id": "imp_abc123def456",
  "created_at": "2025-12-26T14:30:00Z",
  "source": {
    "filename": "volunteers_2025.xlsx",
    "worksheet": "Active",
    "file_type": "excel",
    "encoding": "utf-8",
    "row_count": 150,
    "column_count": 12
  },
  "mappings": [
    {"source": "First Name", "target": "first_name", "confidence": 1.0},
    {"source": "Last Name", "target": "last_name", "confidence": 1.0},
    {"source": "Email Address", "target": "email", "confidence": 0.95},
    {"source": "Party", "target": "custom:party", "confidence": 0.8},
    {"source": "Internal ID", "target": "_skip", "confidence": 1.0}
  ],
  "records": [
    {
      "_row": 1,
      "first_name": "John",
      "last_name": "Doe",
      "email": "john.doe@example.com",
      "custom:party": "Republican"
    },
    {
      "_row": 2,
      "first_name": "Jane",
      "last_name": "Smith",
      "email": "jane@example.com",
      "custom:party": "Democrat"
    }
  ],
  "validation": {
    "valid_count": 148,
    "error_count": 2,
    "errors": [
      {"row": 45, "field": "email", "value": "notanemail", "error": "Invalid email format"},
      {"row": 89, "field": "first_name", "value": "", "error": "Required field empty"}
    ]
  }
}

IR Design Principles

  1. Self-contained - All info needed to understand the data
  2. Auditable - Track source file, mappings, and decisions
  3. Re-processable - Can regenerate IR from original file + mappings
  4. Schema-agnostic - IR doesn't know about Person schema (that's Phase 2)

Special Target Values

| Target                       | Meaning                                          |
|------------------------------|--------------------------------------------------|
| first_name, last_name, etc.  | Standard Person fields                           |
| custom:fieldname             | Custom field (will become person_custom_field)   |
| _skip                        | Column ignored                                   |
| _meta:source_row             | Original row number (auto-added)                 |
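
With these conventions, step 7 of the mapping workflow (apply mappings to all rows) reduces to a small transform; a sketch:

def apply_mappings(headers: list[str], rows: list[list[str]],
                   mappings: list[dict]) -> list[dict]:
    # source header -> target field, dropping _skip columns entirely
    target_by_source = {m["source"]: m["target"]
                        for m in mappings if m["target"] != "_skip"}
    records = []
    for row_num, row in enumerate(rows, start=1):
        record = {"_row": row_num}  # original row number (_meta:source_row)
        for header, value in zip(headers, row):
            target = target_by_source.get(header)
            if target:
                # custom:* targets pass through unchanged; Phase 2 turns
                # them into person_custom_field entries
                record[target] = value.strip()
        records.append(record)
    return records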

API Endpoints (Revised)

Upload Flow

# 1. Upload file
POST /api/list-loader/upload
Content-Type: multipart/form-data
file: [binary]

Response:
{
  "job_id": "imp_abc123",
  "status": "uploaded",
  "filename": "contacts.xlsx",
  "file_url": "https://files.nominate.ai/buckets/testsite/files/imports/imp_abc123/original.xlsx"
}

# 2. Get worksheets (Excel only)
GET /api/list-loader/{job_id}/worksheets

Response:
{
  "job_id": "imp_abc123",
  "worksheets": ["Sheet1", "Contacts", "Events"],
  "row_counts": {"Sheet1": 0, "Contacts": 150, "Events": 25}
}

# 3. Select worksheet and parse
POST /api/list-loader/{job_id}/parse
{"worksheet": "Contacts"}  # optional for CSV

Response:
{
  "job_id": "imp_abc123",
  "status": "parsed",
  "headers": ["First Name", "Last Name", "Email", ...],
  "sample_rows": [...],
  "total_rows": 150,
  "mappings": [...]  # heuristic mappings auto-generated
}

# 4. Request AI mapping (optional)
POST /api/list-loader/{job_id}/ai-suggest

Response:
{
  "status": "analyzing",
  "message": "AI is analyzing your file..."
}

# 5. Get/update mappings
GET /api/list-loader/{job_id}/mappings
PUT /api/list-loader/{job_id}/mappings
{"mappings": [...]}

# 6. Generate IR
POST /api/list-loader/{job_id}/generate-ir

Response:
{
  "job_id": "imp_abc123",
  "status": "ir_ready",
  "ir_url": "https://files.nominate.ai/.../ir.json",
  "valid_count": 148,
  "error_count": 2
}

Implementation Tasks

Phase 1A: File Server Integration

  • Add cbfiles client to cbapp (src/api/services/cbfiles_client.py)
  • Configure cbfiles auth (service token or user JWT passthrough)
  • Update upload endpoint to store in cbfiles
  • Implement file retrieval for parsing

Phase 1B: Multi-format Parsing

  • Add openpyxl to dependencies
  • Create unified FileParser class
  • Implement CSV/TSV parsing (existing, refactor)
  • Implement Excel parsing with worksheet listing
  • Add worksheet selection endpoint

Phase 1C: IR Generation

  • Define IR JSON schema
  • Implement IR generation from parsed data + mappings
  • Store IR in cbfiles
  • Add validation during IR generation
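
A starting point for that last item, covering the two error types shown in the example IR (which fields count as required is an assumption to confirm against the schema):

import re

EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
REQUIRED_FIELDS = ("first_name", "last_name")  # assumption; adjust per schema

def validate_record(record: dict) -> list[dict]:
    errors = []
    for field in REQUIRED_FIELDS:
        if not str(record.get(field, "")).strip():
            errors.append({"row": record["_row"], "field": field,
                           "value": record.get(field, ""),
                           "error": "Required field empty"})
    email = str(record.get("email", ""))
    if email and not EMAIL_RE.match(email):
        errors.append({"row": record["_row"], "field": "email",
                       "value": email, "error": "Invalid email format"})
    return errors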

Phase 1D: Frontend Updates

  • Update /imports template (replace mockup)
  • Add file upload with drag-and-drop
  • Add worksheet selection UI (for Excel)
  • Add field mapping table with AI suggest button
  • Add preview of mapped data
  • Show validation errors

Dependencies

Python Packages to Add

# pyproject.toml
dependencies = [
    # ... existing ...
    "openpyxl>=3.1",      # Excel xlsx support
    "httpx>=0.24",        # Already in dev, move to main for cbfiles client
]

Environment Variables

# cbfiles integration
CBFILES_URL=https://files.nominate.ai
CBFILES_SERVICE_TOKEN=...  # or use user JWT passthrough
CBFILES_BUCKET=testsite    # tenant-specific bucket

Migration Path

  1. Keep existing /list-loader routes - Don't break current functionality
  2. Add new routes under /imports - New implementation
  3. Feature flag - USE_NEW_LIST_LOADER=true to switch
  4. Once stable - Deprecate old routes, update /imports template

Open Questions for User

  1. Tenant bucket naming - Use tenant subdomain (e.g., testsite) or tenant ID?
  2. File retention - Keep original files forever, or TTL (30 days)?
  3. Max file size - 10MB? 50MB? 100MB?
  4. Concurrent imports - Allow multiple active jobs per user?

Next Steps

After this plan is approved:

  1. Implement Phase 1A (cbfiles integration)
  2. Test upload/download flow
  3. Implement Phase 1B (parsing)
  4. Implement Phase 1C (IR generation)
  5. Update frontend (Phase 1D)

User will then provide universal schema spec for Phase 2.