
List Loader Redesign - Phase 1: File Ingestion

Date: 2025-12-26
Status: Planning
Scope: File upload → Parsing → Clean IR output


Executive Summary

Consolidate List Loader functionality from cbtenant into cbapp with:

  1. Centralized file storage on files.nominate.ai (cbfiles/MinIO)
  2. Support for CSV, TSV, and Excel files
  3. AI-assisted field mapping (existing Claude integration)
  4. Clean Intermediate Representation (IR) output for downstream processing

Phase 1 Focus: Get files from user → file server → parsed → IR
Phase 2 (Future): Universal schema, enrichment, and database ingestion


Current Architecture

What Exists Today

| Component          | Location                                   | Status                       |
|--------------------|--------------------------------------------|------------------------------|
| cbtenant List Loader | cbtenant/api/routes/list_loader.py       | To be deprecated             |
| cbapp List Loader  | cbapp/src/api/routes/list_loader.py        | Functional, CSV only         |
| AI Agent           | cbapp/src/api/agents/list_loader.py        | Works with Claude            |
| cbfiles            | files.nominate.ai                          | Ready, unused by List Loader |
| Import Template UI | cbapp/src/app/templates/imports/index.html | Mockup only                  |

Current Data Flow (cbapp)

User uploads CSV → /tmp/cbapp-uploads → analyze → AI mapping → import to DB

Problems:

  1. Files stored in /tmp/ (ephemeral, no audit trail)
  2. CSV only (no Excel, TSV)
  3. No worksheet selection for multi-sheet Excel
  4. Files deleted after import (no reprocessing possible)
  5. No standardized file naming or organization


Proposed Architecture

New Data Flow

┌─────────────────────────────────────────────────────────────────────────┐
│                           USER UPLOAD                                    │
│  Browser → cbapp /api/list-loader/upload                                │
└─────────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────────┐
│                         FILE SERVER (cbfiles)                           │
│  POST files.nominate.ai/buckets/{tenant}/files/imports/{job_id}/{file}  │
│                                                                          │
│  Bucket: tenant-specific (e.g., "testsite")                             │
│  Path:   imports/{job_id}/original.{csv|xlsx|tsv}                       │
└─────────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────────┐
│                           PARSE & ANALYZE                               │
│  1. Detect file type (CSV, TSV, Excel)                                  │
│  2. For Excel: list worksheets, let user choose                         │
│  3. Parse selected sheet/file                                           │
│  4. Extract headers + sample rows                                       │
│  5. Store parsed data as IR in file server                              │
└─────────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────────┐
│                         AI FIELD MAPPING                                │
│  1. Heuristic mapping (fast, no API call)                               │
│  2. AI mapping (Claude, if requested)                                   │
│  3. User can adjust mappings manually                                   │
│  4. Save final mapping in job metadata                                  │
└─────────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────────┐
│                    INTERMEDIATE REPRESENTATION (IR)                     │
│  Clean, normalized JSON ready for schema application                    │
│  Stored: imports/{job_id}/ir.json                                       │
│                                                                          │
│  {                                                                       │
│    "job_id": "imp_abc123",                                              │
│    "source_file": "volunteers.xlsx",                                    │
│    "worksheet": "Sheet1",                                               │
│    "row_count": 150,                                                    │
│    "mappings": [...],                                                   │
│    "records": [                                                         │
│      {"first_name": "John", "last_name": "Doe", "email": "j@x.com"},   │
│      ...                                                                │
│    ]                                                                    │
│  }                                                                      │
└─────────────────────────────────────────────────────────────────────────┘
                    ┌───────────────────────────┐
                    │      PHASE 2 (FUTURE)     │
                    │  Schema Application       │
                    │  Enrichment               │
                    │  Database Ingestion       │
                    └───────────────────────────┘

File Server Integration

Bucket Structure

{tenant-bucket}/
├── imports/
│   ├── {job_id}/
│   │   ├── original.csv          # Original uploaded file
│   │   ├── original.xlsx         # Or Excel
│   │   ├── metadata.json         # Job metadata
│   │   ├── ir.json               # Intermediate representation
│   │   └── result.json           # Import result (after Phase 2)
│   └── ...
├── exports/
│   └── ...
└── documents/
    └── ...

File Naming Convention

| File           | Purpose                                 |
|----------------|-----------------------------------------|
| original.{ext} | Unmodified uploaded file                |
| metadata.json  | Job info, mappings, status, timestamps  |
| ir.json        | Parsed & mapped data (clean records)    |
| result.json    | Import results (counts, errors)         |

cbfiles API Usage

Upload original file:

POST /buckets/{tenant}/files/imports/{job_id}/original.csv
Authorization: Bearer {token}
Content-Type: text/csv

[file content]

Download for parsing:

GET /buckets/{tenant}/files/imports/{job_id}/original.csv
Authorization: Bearer {token}

Store IR:

POST /buckets/{tenant}/files/imports/{job_id}/ir.json
Authorization: Bearer {token}
Content-Type: application/json

{"job_id": "...", "records": [...]}

Authentication Flow

cbfiles uses JWT tokens. Options:

  1. Proxy through cbapp (recommended for Phase 1)
     • cbapp receives upload, forwards to cbfiles with service token
     • User never directly talks to cbfiles

  2. Signed URLs (for large files)
     • cbapp generates presigned upload URL
     • Browser uploads directly to cbfiles
     • cbapp notified on completion
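
A rough sketch of option 1 as a cbapp route (assuming a FastAPI-style router, which the route layout suggests; new_job_id and the module-level cbfiles client are hypothetical):

from pathlib import Path
from fastapi import APIRouter, UploadFile

router = APIRouter()

@router.post("/api/list-loader/upload")
async def upload_file(file: UploadFile):
    job_id = new_job_id()  # hypothetical helper, e.g. "imp_" + random hex
    content = await file.read()
    ext = Path(file.filename).suffix.lower()
    # Forward to cbfiles with the service token; the browser never talks to cbfiles
    cbfiles.upload(f"imports/{job_id}/original{ext}", content,
                   file.content_type or "application/octet-stream")
    return {"job_id": job_id, "status": "uploaded", "filename": file.filename}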

File Parsing

Supported Formats (Phase 1)

| Format | Extension   | Library                        |
|--------|-------------|--------------------------------|
| CSV    | .csv        | Python csv (built-in)          |
| TSV    | .tsv, .txt  | Python csv with delimiter='\t' |
| Excel  | .xlsx, .xls | openpyxl (xlsx) / xlrd (xls)   |

Parsing Logic

from pathlib import Path

def parse_file(file_path: str, worksheet: str | None = None) -> ParseResult:
    """
    Parse uploaded file into headers + rows.

    Args:
        file_path: Path to file (local or cbfiles URL)
        worksheet: For Excel, which sheet to parse

    Returns:
        ParseResult with headers, rows, and metadata
    """
    ext = Path(file_path).suffix.lower()

    if ext in ('.csv', '.tsv', '.txt'):
        # TSV and .txt are tab-delimited (see format table above)
        delimiter = '\t' if ext in ('.tsv', '.txt') else ','
        return parse_csv(file_path, delimiter=delimiter)
    elif ext in ('.xlsx', '.xls'):
        return parse_excel(file_path, worksheet)
    else:
        raise UnsupportedFileType(ext)
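
For completeness, a sketch of the two helpers parse_file dispatches to (ParseResult's exact shape is an assumption; the real implementation may differ):

import csv
from dataclasses import dataclass, field

@dataclass
class ParseResult:
    headers: list[str]
    rows: list[list[str]]
    metadata: dict = field(default_factory=dict)

def parse_csv(file_path: str, delimiter: str = ',') -> ParseResult:
    # utf-8-sig strips a BOM if Excel exported the CSV
    with open(file_path, newline='', encoding='utf-8-sig') as f:
        reader = csv.reader(f, delimiter=delimiter)
        headers = next(reader)
        rows = [r for r in reader if any(cell.strip() for cell in r)]
    return ParseResult(headers, rows, {"file_type": "csv"})

def parse_excel(file_path: str, worksheet: str | None = None) -> ParseResult:
    from openpyxl import load_workbook
    wb = load_workbook(file_path, read_only=True, data_only=True)
    ws = wb[worksheet] if worksheet else wb.active
    row_iter = ws.iter_rows(values_only=True)
    headers = ["" if c is None else str(c) for c in next(row_iter)]
    rows = [["" if c is None else str(c) for c in r] for r in row_iter]
    return ParseResult(headers, rows,
                       {"file_type": "excel", "worksheet": ws.title})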

Excel Worksheet Selection

For Excel files with multiple sheets:

  1. API returns sheet list:

    {
      "job_id": "imp_abc123",
      "file_type": "excel",
      "worksheets": ["Contacts", "Events", "Summary"],
      "status": "awaiting_worksheet_selection"
    }
    

  2. User selects sheet:

    POST /api/list-loader/{job_id}/select-worksheet
    {"worksheet": "Contacts"}
    

  3. Parsing continues with selected sheet
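
The sheet list behind step 1 is cheap to produce with openpyxl's read-only mode; a sketch (note max_row is an upper bound that can include trailing blank rows):

from openpyxl import load_workbook

def list_worksheets(file_path: str) -> dict[str, int]:
    wb = load_workbook(file_path, read_only=True)
    # e.g. {"Contacts": 151, "Events": 26} -- counts include the header row
    return {ws.title: ws.max_row or 0 for ws in wb.worksheets}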


AI Field Mapping

Existing Implementation (Keep)

The current ListLoaderAgent works well:

  • heuristic_mapping() - Fast, rule-based
  • ai_mapping() - Claude API, context-aware

Improvements

  1. Cache AI mappings - Same headers = reuse mapping
  2. Learning from corrections - Track when users change AI suggestions
  3. Better prompting - Include more sample data for ambiguous columns
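
Improvement 1 can start as a cache keyed on the normalized header set; a sketch (in-memory here, and ai_mapping's exact signature is an assumption):

import hashlib
import json

_mapping_cache: dict[str, list[dict]] = {}  # swap for Redis/DB to persist

def mapping_cache_key(headers: list[str]) -> str:
    # Order- and case-insensitive: "Email" vs "email" hit the same entry
    normalized = sorted(h.strip().lower() for h in headers)
    return hashlib.sha256(json.dumps(normalized).encode()).hexdigest()

def cached_ai_mapping(agent, headers: list[str], sample_rows: list[list[str]]):
    key = mapping_cache_key(headers)
    if key not in _mapping_cache:
        _mapping_cache[key] = agent.ai_mapping(headers, sample_rows)
    return _mapping_cache[key]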

Mapping Workflow

1. Upload file
2. Parse headers + 5 sample rows
3. Run heuristic_mapping() immediately (instant)
4. User can click "AI Suggest" to run ai_mapping() (async)
5. User adjusts mappings manually if needed
6. Save final mappings to metadata.json
7. Apply mappings to all rows → IR

Intermediate Representation (IR)

Structure

{
  "version": "1.0",
  "job_id": "imp_abc123def456",
  "created_at": "2025-12-26T14:30:00Z",
  "source": {
    "filename": "volunteers_2025.xlsx",
    "worksheet": "Active",
    "file_type": "excel",
    "encoding": "utf-8",
    "row_count": 150,
    "column_count": 12
  },
  "mappings": [
    {"source": "First Name", "target": "first_name", "confidence": 1.0},
    {"source": "Last Name", "target": "last_name", "confidence": 1.0},
    {"source": "Email Address", "target": "email", "confidence": 0.95},
    {"source": "Party", "target": "custom:party", "confidence": 0.8},
    {"source": "Internal ID", "target": "_skip", "confidence": 1.0}
  ],
  "records": [
    {
      "_row": 1,
      "first_name": "John",
      "last_name": "Doe",
      "email": "john.doe@example.com",
      "custom:party": "Republican"
    },
    {
      "_row": 2,
      "first_name": "Jane",
      "last_name": "Smith",
      "email": "jane@example.com",
      "custom:party": "Democrat"
    }
  ],
  "validation": {
    "valid_count": 148,
    "error_count": 2,
    "errors": [
      {"row": 45, "field": "email", "value": "notanemail", "error": "Invalid email format"},
      {"row": 89, "field": "first_name", "value": "", "error": "Required field empty"}
    ]
  }
}

IR Design Principles

  1. Self-contained - All info needed to understand the data
  2. Auditable - Track source file, mappings, and decisions
  3. Re-processable - Can regenerate IR from original file + mappings
  4. Schema-agnostic - IR doesn't know about Person schema (that's Phase 2)

Special Target Values

| Target                       | Meaning                                          |
|------------------------------|--------------------------------------------------|
| first_name, last_name, etc.  | Standard Person fields                           |
| custom:fieldname             | Custom field (will become person_custom_field)   |
| _skip                        | Column ignored                                   |
| _meta:source_row             | Original row number (auto-added)                 |
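
With these conventions, step 7 of the mapping workflow (apply mappings to all rows) reduces to a small transform; a sketch:

def apply_mappings(headers: list[str], rows: list[list[str]],
                   mappings: list[dict]) -> list[dict]:
    # source header -> target field, dropping _skip columns entirely
    target_by_source = {m["source"]: m["target"]
                        for m in mappings if m["target"] != "_skip"}
    records = []
    for row_num, row in enumerate(rows, start=1):
        record = {"_row": row_num}  # original row number (_meta:source_row)
        for header, value in zip(headers, row):
            target = target_by_source.get(header)
            if target:
                # custom:* targets pass through unchanged; Phase 2 turns
                # them into person_custom_field entries
                record[target] = value.strip()
        records.append(record)
    return records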

API Endpoints (Revised)

Upload Flow

# 1. Upload file
POST /api/list-loader/upload
Content-Type: multipart/form-data
file: [binary]

Response:
{
  "job_id": "imp_abc123",
  "status": "uploaded",
  "filename": "contacts.xlsx",
  "file_url": "https://files.nominate.ai/buckets/testsite/files/imports/imp_abc123/original.xlsx"
}

# 2. Get worksheets (Excel only)
GET /api/list-loader/{job_id}/worksheets

Response:
{
  "job_id": "imp_abc123",
  "worksheets": ["Sheet1", "Contacts", "Events"],
  "row_counts": {"Sheet1": 0, "Contacts": 150, "Events": 25}
}

# 3. Select worksheet and parse
POST /api/list-loader/{job_id}/parse
{"worksheet": "Contacts"}  # optional for CSV

Response:
{
  "job_id": "imp_abc123",
  "status": "parsed",
  "headers": ["First Name", "Last Name", "Email", ...],
  "sample_rows": [...],
  "total_rows": 150,
  "mappings": [...]  # heuristic mappings auto-generated
}

# 4. Request AI mapping (optional)
POST /api/list-loader/{job_id}/ai-suggest

Response:
{
  "status": "analyzing",
  "message": "AI is analyzing your file..."
}

# 5. Get/update mappings
GET /api/list-loader/{job_id}/mappings
PUT /api/list-loader/{job_id}/mappings
{"mappings": [...]}

# 6. Generate IR
POST /api/list-loader/{job_id}/generate-ir

Response:
{
  "job_id": "imp_abc123",
  "status": "ir_ready",
  "ir_url": "https://files.nominate.ai/.../ir.json",
  "valid_count": 148,
  "error_count": 2
}

Implementation Tasks

Phase 1A: File Server Integration

  • Add cbfiles client to cbapp (src/api/services/cbfiles_client.py)
  • Configure cbfiles auth (service token or user JWT passthrough)
  • Update upload endpoint to store in cbfiles
  • Implement file retrieval for parsing

Phase 1B: Multi-format Parsing

  • Add openpyxl to dependencies
  • Create unified FileParser class
  • Implement CSV/TSV parsing (existing, refactor)
  • Implement Excel parsing with worksheet listing
  • Add worksheet selection endpoint

Phase 1C: IR Generation

  • Define IR JSON schema
  • Implement IR generation from parsed data + mappings
  • Store IR in cbfiles
  • Add validation during IR generation
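
A starting point for that last item, covering the two error types shown in the example IR (which fields count as required is an assumption to confirm against the schema):

import re

EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
REQUIRED_FIELDS = ("first_name", "last_name")  # assumption; adjust per schema

def validate_record(record: dict) -> list[dict]:
    errors = []
    for field in REQUIRED_FIELDS:
        if not str(record.get(field, "")).strip():
            errors.append({"row": record["_row"], "field": field,
                           "value": record.get(field, ""),
                           "error": "Required field empty"})
    email = str(record.get("email", ""))
    if email and not EMAIL_RE.match(email):
        errors.append({"row": record["_row"], "field": "email",
                       "value": email, "error": "Invalid email format"})
    return errors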

Phase 1D: Frontend Updates

  • Update /imports template (replace mockup)
  • Add file upload with drag-and-drop
  • Add worksheet selection UI (for Excel)
  • Add field mapping table with AI suggest button
  • Add preview of mapped data
  • Show validation errors

Dependencies

Python Packages to Add

# pyproject.toml
dependencies = [
    # ... existing ...
    "openpyxl>=3.1",      # Excel xlsx support
    "httpx>=0.24",        # Already in dev, move to main for cbfiles client
]

Environment Variables

# cbfiles integration
CBFILES_URL=https://files.nominate.ai
CBFILES_SERVICE_TOKEN=...  # or use user JWT passthrough
CBFILES_BUCKET=testsite    # tenant-specific bucket

Migration Path

  1. Keep existing /list-loader routes - Don't break current functionality
  2. Add new routes under /imports - New implementation
  3. Feature flag - USE_NEW_LIST_LOADER=true to switch
  4. Once stable - Deprecate old routes, update /imports template

Open Questions for User

  1. Tenant bucket naming - Use tenant subdomain (e.g., testsite) or tenant ID?
  2. File retention - Keep original files forever, or TTL (30 days)?
  3. Max file size - 10MB? 50MB? 100MB?
  4. Concurrent imports - Allow multiple active jobs per user?

Next Steps

After this plan is approved:

  1. Implement Phase 1A (cbfiles integration)
  2. Test upload/download flow
  3. Implement Phase 1B (parsing)
  4. Implement Phase 1C (IR generation)
  5. Update frontend (Phase 1D)

User will then provide universal schema spec for Phase 2.