List Loader Redesign - Phase 1: File Ingestion¶
Date: 2025-12-26
Status: Planning
Scope: File upload → Parsing → Clean IR output
Executive Summary¶
Consolidate List Loader functionality from cbtenant into cbapp with:
1. Centralized file storage on files.nominate.ai (cbfiles/MinIO)
2. Support for CSV, TSV, and Excel files
3. AI-assisted field mapping (existing Claude integration)
4. Clean Intermediate Representation (IR) output for downstream processing
Phase 1 Focus: Get files from user → file server → parsed → IR
Phase 2 (Future): Universal schema, enrichment, and database ingestion
Current Architecture¶
What Exists Today¶
| Component | Location | Status |
|---|---|---|
| cbtenant List Loader | `cbtenant/api/routes/list_loader.py` | To be deprecated |
| cbapp List Loader | `cbapp/src/api/routes/list_loader.py` | Functional, CSV only |
| AI Agent | `cbapp/src/api/agents/list_loader.py` | Works with Claude |
| cbfiles | files.nominate.ai | Ready, unused by List Loader |
| Import Template UI | `cbapp/src/app/templates/imports/index.html` | Mockup only |
Current Data Flow (cbapp)¶
Problems:
1. Files stored in /tmp/ (ephemeral, no audit trail)
2. CSV only (no Excel, TSV)
3. No worksheet selection for multi-sheet Excel
4. Files deleted after import (no reprocessing possible)
5. No standardized file naming or organization
Proposed Architecture¶
New Data Flow¶
┌─────────────────────────────────────────────────────────────────────────┐
│ USER UPLOAD │
│ Browser → cbapp /api/list-loader/upload │
└─────────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────┐
│ FILE SERVER (cbfiles) │
│ POST files.nominate.ai/buckets/{tenant}/files/imports/{job_id}/{file} │
│ │
│ Bucket: tenant-specific (e.g., "testsite") │
│ Path: imports/{job_id}/original.{csv|xlsx|tsv} │
└─────────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────┐
│ PARSE & ANALYZE │
│ 1. Detect file type (CSV, TSV, Excel) │
│ 2. For Excel: list worksheets, let user choose │
│ 3. Parse selected sheet/file │
│ 4. Extract headers + sample rows │
│ 5. Store parsed data as IR in file server │
└─────────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────┐
│ AI FIELD MAPPING │
│ 1. Heuristic mapping (fast, no API call) │
│ 2. AI mapping (Claude, if requested) │
│ 3. User can adjust mappings manually │
│ 4. Save final mapping in job metadata │
└─────────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────┐
│ INTERMEDIATE REPRESENTATION (IR) │
│ Clean, normalized JSON ready for schema application │
│ Stored: imports/{job_id}/ir.json │
│ │
│ { │
│ "job_id": "imp_abc123", │
│ "source_file": "volunteers.xlsx", │
│ "worksheet": "Sheet1", │
│ "row_count": 150, │
│ "mappings": [...], │
│ "records": [ │
│ {"first_name": "John", "last_name": "Doe", "email": "j@x.com"}, │
│ ... │
│ ] │
│ } │
└─────────────────────────────────────────────────────────────────────────┘
│
▼
┌───────────────────────────┐
│ PHASE 2 (FUTURE) │
│ Schema Application │
│ Enrichment │
│ Database Ingestion │
└───────────────────────────┘
File Server Integration¶
Bucket Structure¶
{tenant-bucket}/
├── imports/
│ ├── {job_id}/
│ │ ├── original.csv # Original uploaded file
│ │ ├── original.xlsx # Or Excel
│ │ ├── metadata.json # Job metadata
│ │ ├── ir.json # Intermediate representation
│ │ └── result.json # Import result (after Phase 2)
│ └── ...
├── exports/
│ └── ...
└── documents/
└── ...
File Naming Convention¶
| File | Purpose |
|---|---|
| `original.{ext}` | Unmodified uploaded file |
| `metadata.json` | Job info, mappings, status, timestamps |
| `ir.json` | Parsed & mapped data (clean records) |
| `result.json` | Import results (counts, errors) |
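The contents of `metadata.json` are not pinned down above; one plausible shape, with an illustrative (not final) field set covering the job info, mappings, status, and timestamps it is meant to hold:

```json
{
  "job_id": "imp_abc123",
  "tenant": "testsite",
  "status": "parsed",
  "original_filename": "volunteers_2025.xlsx",
  "worksheet": "Active",
  "mappings": [],
  "created_at": "2025-12-26T14:30:00Z",
  "updated_at": "2025-12-26T14:32:10Z"
}
```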
cbfiles API Usage¶
Upload original file:
POST /buckets/{tenant}/files/imports/{job_id}/original.csv
Authorization: Bearer {token}
Content-Type: text/csv
[file content]
Download for parsing:
GET /buckets/{tenant}/files/imports/{job_id}/original.csv
Authorization: Bearer {token}
Store IR:
POST /buckets/{tenant}/files/imports/{job_id}/ir.json
Authorization: Bearer {token}
Content-Type: application/json
{"job_id": "...", "records": [...]}
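The calls above can be sketched as request-construction helpers. This uses only the standard library so the shape is self-contained (the real cbapp client would use `httpx`); the function names and module layout are illustrative, not the final `cbfiles_client.py` interface:

```python
import json
import urllib.request

CBFILES_URL = "https://files.nominate.ai"  # from the CBFILES_URL env var

def _request(method, token, tenant, path, data=None, content_type=None):
    # Build (but do not send) the HTTP request for one cbfiles call.
    url = f"{CBFILES_URL}/buckets/{tenant}/files/{path}"
    headers = {"Authorization": f"Bearer {token}"}
    if content_type:
        headers["Content-Type"] = content_type
    return urllib.request.Request(url, data=data, headers=headers, method=method)

def upload_original(token, tenant, job_id, filename, payload, content_type):
    # POST /buckets/{tenant}/files/imports/{job_id}/original.{ext}
    return _request("POST", token, tenant, f"imports/{job_id}/{filename}",
                    payload, content_type)

def download_file(token, tenant, job_id, filename):
    # GET the stored original back for parsing.
    return _request("GET", token, tenant, f"imports/{job_id}/{filename}")

def store_ir(token, tenant, job_id, ir):
    # POST imports/{job_id}/ir.json with the generated IR.
    body = json.dumps(ir).encode("utf-8")
    return _request("POST", token, tenant, f"imports/{job_id}/ir.json",
                    body, "application/json")
```

Sending is then a single `urllib.request.urlopen(req)` (or the `httpx` equivalent) on the returned request object.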
Authentication Flow¶
cbfiles uses JWT tokens. Options:
- Proxy through cbapp (recommended for Phase 1)
- cbapp receives upload, forwards to cbfiles with service token
-
User never directly talks to cbfiles
-
Signed URLs (for large files)
- cbapp generates presigned upload URL
- Browser uploads directly to cbfiles
- cbapp notified on completion
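If the signed-URL option is chosen, the presigning step could look like the following HMAC-based sketch. The signing scheme, query-parameter names, and shared-secret setup are all assumptions here; cbfiles' actual presigning mechanism may differ:

```python
import hashlib
import hmac
import time
from urllib.parse import urlencode

def presign_upload_url(base_url, tenant, path, secret, expires_in=900):
    # Time-limited URL: sign "{resource}:{expiry}" with a shared secret so
    # cbfiles can verify the grant without a round-trip to cbapp.
    expires = int(time.time()) + expires_in
    resource = f"/buckets/{tenant}/files/{path}"
    sig = hmac.new(secret.encode(), f"{resource}:{expires}".encode(),
                   hashlib.sha256).hexdigest()
    return f"{base_url}{resource}?{urlencode({'expires': expires, 'signature': sig})}"
```

The browser would then PUT/POST the file to this URL directly, and cbfiles would recompute the HMAC server-side before accepting the upload.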
File Parsing¶
Supported Formats (Phase 1)¶
| Format | Extension | Library |
|---|---|---|
| CSV | `.csv` | Python `csv` (built-in) |
| TSV | `.tsv`, `.txt` | Python `csv` with `delimiter='\t'` |
| Excel | `.xlsx`, `.xls` | `openpyxl` (xlsx) / `xlrd` (xls) |
Parsing Logic¶
from pathlib import Path

def parse_file(file_path: str, worksheet: str | None = None) -> ParseResult:
    """
    Parse an uploaded file into headers + rows.

    Args:
        file_path: Path to the file (local or cbfiles URL)
        worksheet: For Excel, which sheet to parse

    Returns:
        ParseResult with headers, rows, and metadata
    """
    ext = Path(file_path).suffix.lower()
    if ext in ('.csv', '.tsv', '.txt'):
        return parse_csv(file_path)
    elif ext in ('.xlsx', '.xls'):
        return parse_excel(file_path, worksheet)
    else:
        raise UnsupportedFileType(ext)
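The CSV/TSV branch can be sketched as follows. `ParseResult` here is a minimal stand-in for whatever result type the real `FileParser` defines, and the delimiter sniffing lets `.tsv`/`.txt` share the same code path:

```python
import csv
from dataclasses import dataclass, field

@dataclass
class ParseResult:
    headers: list
    rows: list
    metadata: dict = field(default_factory=dict)

def parse_csv(file_path: str) -> ParseResult:
    # Sniff the delimiter so comma- and tab-separated files both work;
    # utf-8-sig transparently strips a BOM from Excel-exported CSVs.
    with open(file_path, newline="", encoding="utf-8-sig") as f:
        sample = f.read(4096)
        f.seek(0)
        try:
            dialect = csv.Sniffer().sniff(sample, delimiters=",\t;")
        except csv.Error:
            dialect = csv.excel  # fall back to plain comma-separated
        reader = csv.reader(f, dialect)
        headers = next(reader, [])
        # Drop fully blank rows so they don't become empty records.
        rows = [row for row in reader if any(cell.strip() for cell in row)]
    return ParseResult(headers, rows, {"row_count": len(rows)})
```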
Excel Worksheet Selection¶
For Excel files with multiple sheets:
1. The API returns the sheet list
2. The user selects a sheet
3. Parsing continues with the selected sheet
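In cbapp the sheet list would come from `openpyxl` (`load_workbook(path, read_only=True).sheetnames`). As a dependency-free illustration of where that data lives, an `.xlsx` file is a zip archive whose sheet names are declared in `xl/workbook.xml`:

```python
import zipfile
import xml.etree.ElementTree as ET

# Sheet elements live in the SpreadsheetML main namespace.
SHEET_TAG = "{http://schemas.openxmlformats.org/spreadsheetml/2006/main}sheet"

def list_worksheets(xlsx_path: str) -> list:
    # Read the workbook manifest straight out of the zip container and
    # collect the declared sheet names, in order.
    with zipfile.ZipFile(xlsx_path) as zf:
        root = ET.fromstring(zf.read("xl/workbook.xml"))
    return [sheet.get("name") for sheet in root.iter(SHEET_TAG)]
```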
AI Field Mapping¶
Existing Implementation (Keep)¶
The current ListLoaderAgent works well:
- `heuristic_mapping()` - Fast, rule-based
- `ai_mapping()` - Claude API, context-aware
Improvements¶
- Cache AI mappings - Same headers = reuse mapping
- Learning from corrections - Track when users change AI suggestions
- Better prompting - Include more sample data for ambiguous columns
Mapping Workflow¶
1. Upload file
2. Parse headers + 5 sample rows
3. Run heuristic_mapping() immediately (instant)
4. User can click "AI Suggest" to run ai_mapping() (async)
5. User adjusts mappings manually if needed
6. Save final mappings to metadata.json
7. Apply mappings to all rows → IR
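The heuristic step in the workflow above can be sketched like this. The synonym table and fallback rules are illustrative; the real `heuristic_mapping()` in `ListLoaderAgent` has its own rule set:

```python
import re

# Illustrative header-synonym table (not the agent's actual rules).
TARGET_SYNONYMS = {
    "first_name": {"first name", "firstname", "first", "fname", "given name"},
    "last_name": {"last name", "lastname", "last", "lname", "surname"},
    "email": {"email", "e-mail", "email address", "e-mail address"},
    "phone": {"phone", "phone number", "mobile", "cell"},
}

def _normalize(header: str) -> str:
    # "First_Name " -> "first name"
    return re.sub(r"[\s_]+", " ", header.strip().lower())

def heuristic_mapping(headers):
    mappings = []
    for header in headers:
        norm = _normalize(header)
        for target, synonyms in TARGET_SYNONYMS.items():
            if norm in synonyms or norm == target.replace("_", " "):
                mappings.append({"source": header, "target": target,
                                 "confidence": 1.0})
                break
        else:
            # Unknown columns become low-confidence custom fields,
            # which the user (or the AI pass) can then adjust.
            mappings.append({"source": header,
                             "target": f"custom:{norm.replace(' ', '_')}",
                             "confidence": 0.5})
    return mappings
```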
Intermediate Representation (IR)¶
Structure¶
{
"version": "1.0",
"job_id": "imp_abc123def456",
"created_at": "2025-12-26T14:30:00Z",
"source": {
"filename": "volunteers_2025.xlsx",
"worksheet": "Active",
"file_type": "excel",
"encoding": "utf-8",
"row_count": 150,
"column_count": 12
},
"mappings": [
{"source": "First Name", "target": "first_name", "confidence": 1.0},
{"source": "Last Name", "target": "last_name", "confidence": 1.0},
{"source": "Email Address", "target": "email", "confidence": 0.95},
{"source": "Party", "target": "custom:party", "confidence": 0.8},
{"source": "Internal ID", "target": "_skip", "confidence": 1.0}
],
"records": [
{
"_row": 1,
"first_name": "John",
"last_name": "Doe",
"email": "john.doe@example.com",
"custom:party": "Republican"
},
{
"_row": 2,
"first_name": "Jane",
"last_name": "Smith",
"email": "jane@example.com",
"custom:party": "Democrat"
}
],
"validation": {
"valid_count": 148,
"error_count": 2,
"errors": [
{"row": 45, "field": "email", "value": "notanemail", "error": "Invalid email format"},
{"row": 89, "field": "first_name", "value": "", "error": "Required field empty"}
]
}
}
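Building the `validation` block of the IR could look like the following sketch. The rule set (which fields are required, what counts as a valid email) is an assumption for illustration, not a settled Phase 1 spec:

```python
import re

# Deliberately loose email check: one "@", a dot in the domain part.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
REQUIRED_FIELDS = ("first_name", "last_name")  # assumed, not final

def validate_records(records):
    """Produce the IR 'validation' block for a list of mapped records."""
    errors = []
    for record in records:
        row = record.get("_row")
        for f in REQUIRED_FIELDS:
            if not str(record.get(f, "")).strip():
                errors.append({"row": row, "field": f,
                               "value": record.get(f, ""),
                               "error": "Required field empty"})
        email = record.get("email", "")
        if email and not EMAIL_RE.match(email):
            errors.append({"row": row, "field": "email", "value": email,
                           "error": "Invalid email format"})
    bad_rows = {e["row"] for e in errors}
    return {"valid_count": len(records) - len(bad_rows),
            "error_count": len(errors),
            "errors": errors}
```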
IR Design Principles¶
- Self-contained - All info needed to understand the data
- Auditable - Track source file, mappings, and decisions
- Re-processable - Can regenerate IR from original file + mappings
- Schema-agnostic - IR doesn't know about Person schema (that's Phase 2)
Special Target Values¶
| Target | Meaning |
|---|---|
| `first_name`, `last_name`, etc. | Standard Person fields |
| `custom:fieldname` | Custom field (will become `person_custom_field`) |
| `_skip` | Column ignored |
| `_meta:source_row` | Original row number (auto-added) |
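Applying a mapping set to parsed rows under these rules can be sketched as follows (a minimal illustration; the actual implementation belongs to the IR generation step):

```python
def apply_mappings(headers, rows, mappings):
    """Turn parsed rows into IR records using the special target values."""
    targets = {m["source"]: m["target"] for m in mappings}
    records = []
    for i, row in enumerate(rows, start=1):
        # The source row number is auto-added (the _meta:source_row rule);
        # the IR examples above store it under "_row".
        record = {"_row": i}
        for header, value in zip(headers, row):
            target = targets.get(header, "_skip")
            if target == "_skip":
                continue  # explicitly ignored, or unmapped, columns
            record[target] = value
        records.append(record)
    return records
```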
API Endpoints (Revised)¶
Upload Flow¶
# 1. Upload file
POST /api/list-loader/upload
Content-Type: multipart/form-data
file: [binary]
Response:
{
"job_id": "imp_abc123",
"status": "uploaded",
"filename": "contacts.xlsx",
"file_url": "https://files.nominate.ai/buckets/testsite/files/imports/imp_abc123/original.xlsx"
}
# 2. Get worksheets (Excel only)
GET /api/list-loader/{job_id}/worksheets
Response:
{
"job_id": "imp_abc123",
"worksheets": ["Sheet1", "Contacts", "Events"],
"row_counts": {"Sheet1": 0, "Contacts": 150, "Events": 25}
}
# 3. Select worksheet and parse
POST /api/list-loader/{job_id}/parse
{"worksheet": "Contacts"} # optional for CSV
Response:
{
"job_id": "imp_abc123",
"status": "parsed",
"headers": ["First Name", "Last Name", "Email", ...],
"sample_rows": [...],
"total_rows": 150,
"mappings": [...] # heuristic mappings auto-generated
}
# 4. Request AI mapping (optional)
POST /api/list-loader/{job_id}/ai-suggest
Response:
{
"status": "analyzing",
"message": "AI is analyzing your file..."
}
# 5. Get/update mappings
GET /api/list-loader/{job_id}/mappings
PUT /api/list-loader/{job_id}/mappings
{"mappings": [...]}
# 6. Generate IR
POST /api/list-loader/{job_id}/generate-ir
Response:
{
"job_id": "imp_abc123",
"status": "ir_ready",
"ir_url": "https://files.nominate.ai/.../ir.json",
"valid_count": 148,
"error_count": 2
}
Implementation Tasks¶
Phase 1A: File Server Integration¶
- Add cbfiles client to cbapp (`src/api/services/cbfiles_client.py`)
- Configure cbfiles auth (service token or user JWT passthrough)
- Update upload endpoint to store in cbfiles
- Implement file retrieval for parsing
Phase 1B: Multi-format Parsing¶
- Add `openpyxl` to dependencies
- Create unified `FileParser` class
- Implement CSV/TSV parsing (existing, refactor)
- Implement Excel parsing with worksheet listing
- Add worksheet selection endpoint
Phase 1C: IR Generation¶
- Define IR JSON schema
- Implement IR generation from parsed data + mappings
- Store IR in cbfiles
- Add validation during IR generation
Phase 1D: Frontend Updates¶
- Update `/imports` template (replace mockup)
- Add file upload with drag-and-drop
- Add worksheet selection UI (for Excel)
- Add field mapping table with AI suggest button
- Add preview of mapped data
- Show validation errors
Dependencies¶
Python Packages to Add¶
# pyproject.toml
dependencies = [
# ... existing ...
"openpyxl>=3.1", # Excel xlsx support
"httpx>=0.24", # Already in dev, move to main for cbfiles client
]
Environment Variables¶
# cbfiles integration
CBFILES_URL=https://files.nominate.ai
CBFILES_SERVICE_TOKEN=... # or use user JWT passthrough
CBFILES_BUCKET=testsite # tenant-specific bucket
Migration Path¶
- Keep existing `/list-loader` routes - don't break current functionality
- Add new routes under `/imports` - new implementation
- Feature flag - `USE_NEW_LIST_LOADER=true` to switch
- Once stable - deprecate old routes, update `/imports` template
Open Questions for User¶
- Tenant bucket naming - Use tenant subdomain (e.g., `testsite`) or tenant ID?
- File retention - Keep original files forever, or TTL (30 days)?
- Max file size - 10MB? 50MB? 100MB?
- Concurrent imports - Allow multiple active jobs per user?
Next Steps¶
After this plan is approved:
1. Implement Phase 1A (cbfiles integration)
2. Test upload/download flow
3. Implement Phase 1B (parsing)
4. Implement Phase 1C (IR generation)
5. Update frontend (Phase 1D)
User will then provide universal schema spec for Phase 2.