cbmodels CLI Reference¶
Complete command reference for the cbmodels data analysis and ETL toolkit.
Quick Start¶
# Activate environment
source ~/.pyenv/versions/nominates/bin/activate
# Test Snowflake connection
cbmodels extract test
# Run autonomous pipeline (extract + consolidate)
cbmodels pipeline run CONSERVATIVECONNECTOR.PIPELINE.MY_TABLE
# Check pipeline status
cbmodels pipeline status CONSERVATIVECONNECTOR.PIPELINE.MY_TABLE
# Load parquet into DuckDB
cbmodels ingest load data/cc/sources/my_table_20251226.parquet --db analysis.db
Pipeline Commands (Autonomous ETL)¶
The pipeline automates: Extract (chunked) → Consolidate → Load
cbmodels pipeline run¶
Run the full autonomous ETL pipeline.
cbmodels pipeline run TABLE [OPTIONS]
# Examples:
# Extract and consolidate only (no DuckDB loading)
cbmodels pipeline run CONSERVATIVECONNECTOR.KPI.CAMPAIGNS
# Full pipeline with DuckDB loading
cbmodels pipeline run CONSERVATIVECONNECTOR.KPI.CAMPAIGNS --db analysis.db -t campaigns
# Large table with custom chunk size
cbmodels pipeline run CONSERVATIVECONNECTOR.PIPELINE.BIG_TABLE --chunk-size 500000
# Start fresh (ignore previous progress)
cbmodels pipeline run CONSERVATIVECONNECTOR.KPI.MY_TABLE --no-resume
| Option | Description | Default |
|---|---|---|
| TABLE | Fully qualified Snowflake table name | Required |
| --db | Target DuckDB database (skip if not provided) | None |
| -t, --table | Target table name in DuckDB | Source table name |
| --chunk-size | Rows per extraction chunk | 1,000,000 |
| --work-dir | Directory for temp chunk files | ./data/cc/work |
| --output-dir | Directory for consolidated parquet | ./data/cc/sources |
| --no-resume | Start fresh, ignore previous state | False |
| --env | Path to .env file with credentials | .env |
Output: {output-dir}/{table_name}_{YYYYMMDD}.parquet
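For example, extracting CONSERVATIVECONNECTOR.KPI.CAMPAIGNS on 2025-12-26 with the default output directory yields ./data/cc/sources/campaigns_20251226.parquet.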
cbmodels pipeline status¶
Check the current state of a pipeline.
cbmodels pipeline status TABLE
# Example:
cbmodels pipeline status CONSERVATIVECONNECTOR.KPI.EMAIL_CLICKERS
Shows: status, total rows, extracted rows, chunks, consolidated flag, loaded rows.
cbmodels pipeline list¶
List all pipeline states.
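cbmodels pipeline list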
cbmodels pipeline cleanup¶
Remove work files (chunks) after successful extraction.
cbmodels pipeline cleanup TABLE [--keep-output]
# Example: Clean up chunks but keep the consolidated parquet
cbmodels pipeline cleanup CONSERVATIVECONNECTOR.KPI.MY_TABLE --keep-output
cbmodels pipeline preview¶
Dry-run: show row count and time estimates without extracting.
cbmodels pipeline preview TABLE
# Example:
cbmodels pipeline preview CONSERVATIVECONNECTOR.KPI.CAMPAIGNS
Output shows: rows, columns, estimated chunks, time, and file size.
cbmodels pipeline sources¶
List all extracted parquet files with sizes and dates.
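cbmodels pipeline sources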
Extract Commands (Manual Extraction)¶
For manual/ad-hoc extraction without the full pipeline.
cbmodels extract test¶
Test Snowflake connection.
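cbmodels extract test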
cbmodels extract tables¶
List available tables in a Snowflake schema.
cbmodels extract tables [-s SCHEMA]
# Examples:
cbmodels extract tables
cbmodels extract tables -s PIPELINE
cbmodels extract tables -s KPI
cbmodels extract snowflake¶
Manual extraction (single query, no chunking).
# Extract a table with row limit
cbmodels extract snowflake -t PIPELINE.MY_TABLE -l 1000 -o ./data/extracts
# Run a custom query
cbmodels extract snowflake -q "SELECT * FROM table WHERE date > '2024-01-01'" -n my_extract
| Option | Description |
|---|---|
| -t, --table | Table to extract |
| -q, --query | Custom SQL query |
| -o, --output | Output directory |
| -n, --name | Output file name |
| -f, --format | Format: parquet, json, csv |
| -l, --limit | Row limit |
Ingest Commands (DuckDB Loading)¶
Load parquet files into DuckDB.
cbmodels ingest load¶
Quick load a parquet file directly into DuckDB.
cbmodels ingest load PARQUET_FILE [OPTIONS]
# Examples:
cbmodels ingest load data/cc/sources/campaigns_20251226.parquet --db analysis.db -t campaigns
# Load with table drop
cbmodels ingest load data.parquet --db mydb.db --drop
| Option | Description | Default |
|---|---|---|
| --db | Target DuckDB database | analysis.db |
| -t, --table | Target table name | File stem |
| --drop/--no-drop | Drop existing table | True |
cbmodels ingest analyze¶
Analyze a parquet file and generate a mapping spec (for complex transforms).
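A hypothetical invocation (the analyze options aren't listed here; this assumes the parquet path is positional, as with ingest load):
cbmodels ingest analyze data/cc/sources/campaigns_20251226.parquet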
cbmodels ingest run¶
Run ingestion with a mapping spec (for complex ETL with transforms).
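A hypothetical invocation (the run options aren't listed here; the spec filename and the --db flag are assumptions):
cbmodels ingest run campaigns_mapping.json --db analysis.db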
Analysis Commands (Model Building)¶
Build and query data models from DuckDB databases.
cbmodels build¶
Build a data model with correlations, outliers, and patterns.
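# Example: build a model from a DuckDB database
cbmodels build analysis.db -o model.json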
cbmodels info¶
Display model summary.
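Assuming it takes the model file like the other analysis commands:
cbmodels info model.json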
cbmodels tables¶
List tables in the model.
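# Example:
cbmodels tables model.json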
cbmodels schema¶
Display database schema.
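Likewise, assuming the model file is the only argument:
cbmodels schema model.json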
cbmodels correlations¶
Show correlations between numeric columns.
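# Example:
cbmodels correlations model.json --min 0.5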
cbmodels outliers¶
Show detected outliers.
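Again assuming the model-file argument:
cbmodels outliers model.json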
cbmodels patterns¶
Show detected patterns (missing data, imbalance, skew).
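# Example:
cbmodels patterns model.json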
cbmodels stats¶
Show detailed statistics for a table/column.
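A hypothetical invocation; the table selector flag is an assumption (by analogy with -t elsewhere in this CLI):
cbmodels stats model.json -t campaigns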
Environment Setup¶
Required Environment Variables¶
Create a .env file in the project root:
SNOWFLAKE_USER=your_username
SNOWFLAKE_PASSWORD=your_password
SNOWFLAKE_ACCOUNT=your_account
SNOWFLAKE_WAREHOUSE=your_warehouse
SNOWFLAKE_DATABASE=CONSERVATIVECONNECTOR
SNOWFLAKE_SCHEMA=PIPELINE
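To use credentials stored somewhere other than the project root, pass the --env option (the path shown is illustrative):
cbmodels pipeline run CONSERVATIVECONNECTOR.KPI.MY_TABLE --env /path/to/other.env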
Directory Structure¶
data/cc/
├── work/ # Temporary chunk files (auto-cleaned)
│ └── {table_name}/
│ ├── chunk_0000.parquet
│ ├── chunk_0001.parquet
│ └── ...
├── sources/ # Consolidated parquet files
│ ├── campaigns_20251226.parquet
│ ├── email_clickers_20251226.parquet
│ └── list_link_phone_email_map_20251226.parquet
└── databases/ # DuckDB databases
└── analysis.db
Typical Workflows¶
1. Extract a New Table¶
# Run pipeline (autonomous)
cbmodels pipeline run CONSERVATIVECONNECTOR.PIPELINE.NEW_TABLE
# Monitor progress
cbmodels pipeline status CONSERVATIVECONNECTOR.PIPELINE.NEW_TABLE
# Clean up work files when done
cbmodels pipeline cleanup CONSERVATIVECONNECTOR.PIPELINE.NEW_TABLE
2. Load Multiple Tables into DuckDB¶
# Load each parquet file
cbmodels ingest load data/cc/sources/campaigns_20251226.parquet --db analysis.db -t campaigns
cbmodels ingest load data/cc/sources/email_clickers_20251226.parquet --db analysis.db -t email_clickers
cbmodels ingest load data/cc/sources/donations_20251226.parquet --db analysis.db -t donations
3. Build Analysis Model¶
# Build model from DuckDB
cbmodels build analysis.db -o model.json
# Explore results
cbmodels tables model.json
cbmodels correlations model.json --min 0.5
cbmodels patterns model.json
4. Resume Failed Pipeline¶
# Pipeline auto-resumes from last checkpoint
cbmodels pipeline run CONSERVATIVECONNECTOR.PIPELINE.BIG_TABLE
# Check where it left off
cbmodels pipeline status CONSERVATIVECONNECTOR.PIPELINE.BIG_TABLE
Troubleshooting¶
Connection Timeouts¶
For very large tables, the pipeline automatically:
- Retries failed chunks (up to 5 times)
- Reconnects after network errors
- Uses 10-minute network/socket timeouts
If timeouts persist, reduce --chunk-size and re-run; the pipeline resumes from the last completed chunk.
Schema Mismatches¶
The consolidation phase automatically normalizes schemas:
- Date columns → String (for consistency across chunks)
- Handles type variations between extraction batches
Disk Space¶
Work files are stored in ./data/cc/work/. Clean up after successful extraction:
cbmodels pipeline cleanup TABLE