Lessons Learned: Snowflake ETL Pipeline

Findings from extracting 220M+ rows from Snowflake (Dec 2025).


Key Findings

1. Chunk Size Matters

| Chunk Size | Result |
| --- | --- |
| 2M rows | Frequent timeouts on large tables |
| 1M rows | Stable, good throughput (~1 chunk/min) |
| 500K rows | Slower but more resilient for unstable networks |
| 100K rows | Too slow for 100M+ row tables |

Recommendation: Default to 1M rows. Drop to 500K if seeing frequent timeouts.
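A minimal sketch of chunked extraction at the 1M-row default, assuming LIMIT/OFFSET paging (the pipeline's actual paging mechanism is not documented here); the target table, output paths, and extract_chunks helper are illustrative:

```python
# Sketch only: LIMIT/OFFSET paging with a 1M-row chunk size.
# The table name and work-file paths are examples, not the pipeline's own.
import snowflake.connector

CHUNK_SIZE = 1_000_000  # 1M rows: stable, roughly 1 chunk/min in practice
TABLE = "CONSERVATIVECONNECTOR.PIPELINE.EMAIL_CLICKERS"  # example target

def extract_chunks(conn, table=TABLE, chunk_size=CHUNK_SIZE):
    """Write each chunk to a numbered parquet file; return the chunk count."""
    cur = conn.cursor()
    offset, chunk_num = 0, 0
    while True:
        # A stable ORDER BY keeps OFFSET paging deterministic across queries.
        cur.execute(
            f"SELECT * FROM {table} ORDER BY 1 LIMIT {chunk_size} OFFSET {offset}"
        )
        df = cur.fetch_pandas_all()
        if df.empty:
            break
        df.to_parquet(f"data/cc/work/chunk_{chunk_num:05d}.parquet")
        offset += chunk_size
        chunk_num += 1
    return chunk_num
```

Whatever paging scheme is used, it needs a deterministic ordering so consecutive chunks do not overlap or skip rows.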

2. Network Resilience is Critical

The Snowflake connector fetches results from S3. This can fail due to:

- DNS resolution failures ("Temporary failure in name resolution")
- S3 read timeouts
- Connection resets

Solution implemented (sketched below):

- Auto-retry with backoff: 5 retries, waiting 10s × retry_num between attempts
- Reconnect to Snowflake after each failure
- Long network/socket timeouts (600s)
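A sketch of that retry/reconnect loop; make_connection and fetch_chunk are placeholder names for the pipeline's own connection and chunk-fetch logic, and the long timeout is shown only via the connector's documented network_timeout option (how the socket timeout is raised is not shown here):

```python
# Sketch of the retry/reconnect pattern: 5 attempts, 10s * attempt between them,
# with a fresh Snowflake connection after every failure.
import time
import snowflake.connector
from snowflake.connector.errors import OperationalError

MAX_RETRIES = 5

def make_connection():
    # Placeholder credentials; the pipeline reads these from its environment.
    return snowflake.connector.connect(
        account="...", user="...", password="...",
        warehouse="...", database="CONSERVATIVECONNECTOR",
        network_timeout=600,  # long network timeout (seconds)
    )

def fetch_with_retry(fetch_chunk):
    conn = make_connection()
    for attempt in range(1, MAX_RETRIES + 1):
        try:
            return fetch_chunk(conn)
        except (OperationalError, OSError):
            if attempt == MAX_RETRIES:
                raise
            time.sleep(10 * attempt)  # 10s, 20s, 30s, ... between attempts
            try:
                conn.close()
            except Exception:
                pass
            conn = make_connection()  # reconnect after each failure
```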

3. Schema Normalization Required

Different extraction batches can have schema variations:

- First chunk: TIMESTAMP as date32[day]
- Later chunks: TIMESTAMP as string

Solution: the consolidation phase casts all date columns to strings for consistency.
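A minimal sketch of that cast, assuming chunks are held as PyArrow tables during consolidation; normalize_dates is an illustrative helper, not the pipeline's actual function:

```python
# Cast every date/timestamp column to string so chunk schemas agree
# before they are written into the consolidated parquet file.
import pyarrow as pa
import pyarrow.compute as pc

def normalize_dates(table: pa.Table) -> pa.Table:
    for i, field in enumerate(table.schema):
        if pa.types.is_temporal(field.type):  # date32, timestamp, etc.
            table = table.set_column(
                i, field.name, pc.cast(table.column(i), pa.string())
            )
    return table
```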

4. Resumability is Essential

For 100M+ row tables, extraction takes hours. Interruptions happen.

Solution implemented (sketched below):

- State file tracks: status, extracted_rows, chunks_extracted
- Pipeline auto-resumes from last successful chunk
- Work files preserved until cleanup
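A minimal sketch of the state file; the field names mirror the list above, but the on-disk format and path are assumptions, not the pipeline's exact layout:

```python
# Persist progress after each successful chunk so a restart can resume.
import json
from pathlib import Path

STATE_PATH = Path("data/cc/work/EMAIL_CLICKERS.state.json")  # example path

def save_state(status: str, extracted_rows: int, chunks_extracted: int) -> None:
    STATE_PATH.write_text(json.dumps({
        "status": status,
        "extracted_rows": extracted_rows,
        "chunks_extracted": chunks_extracted,
    }))

def load_state() -> dict:
    if STATE_PATH.exists():
        return json.loads(STATE_PATH.read_text())
    return {"status": "new", "extracted_rows": 0, "chunks_extracted": 0}

# On restart, skip chunks that already succeeded.
start_chunk = load_state()["chunks_extracted"]
```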

5. Memory Management

| Workload | Memory Usage |
| --- | --- |
| 1M rows/chunk (extraction) | ~1.5 GB peak |
| 182M rows total (consolidation) | 3 GB |

Note: PyArrow streaming writer keeps memory bounded during consolidation.
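A sketch of that bounded-memory consolidation, assuming the streaming writer is pyarrow.parquet.ParquetWriter; paths are illustrative:

```python
# Append one work chunk at a time instead of concatenating everything
# in memory. Assumes all chunks share a schema (see finding 3).
import glob
import pyarrow.parquet as pq

chunk_files = sorted(glob.glob("data/cc/work/chunk_*.parquet"))
writer = None
for path in chunk_files:
    table = pq.read_table(path)
    if writer is None:
        writer = pq.ParquetWriter("data/cc/sources/consolidated.parquet",
                                  table.schema)
    writer.write_table(table)  # only one chunk resident at a time
if writer is not None:
    writer.close()
```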


Performance Benchmarks

| Table | Rows | Extract Time | Consolidate | Total |
| --- | --- | --- | --- | --- |
| CAMPAIGNS | 73K | <1 min | <1 min | ~1 min |
| EMAIL_CLICKERS | 41M | ~45 min | ~2 min | ~47 min |
| DONATIONS | 3.2M | ~4 min | <1 min | ~5 min |
| PHONE_EMAIL_MAP | 182M | ~3 hrs | ~5 min | ~3.1 hrs |

Throughput: ~1M rows/minute (varies with network conditions).


Priority 1: Add to Pipeline

  • --parallel: Extract multiple tables concurrently
  • --dry-run: Show row count and estimated time without extracting
  • --notify: Slack/email notification on completion

Priority 2: Operational

  • pipeline logs TABLE: Show extraction logs for debugging
  • pipeline retry TABLE: Retry failed chunks only
  • Scheduled runs: Cron-friendly wrapper with lock files

Priority 3: Nice to Have

  • Incremental extraction: Only new rows since last run (requires timestamp column)
  • Compression options: Choose snappy/gzip/zstd for parquet
  • S3 upload: Direct upload to S3 after consolidation

Current Data Inventory

Extracted 2025-12-26:

| File | Rows | Size | Description |
| --- | --- | --- | --- |
| list_link_phone_email_map_20251226.parquet | 182M | 9.94 GB | Contact → Phone/Email mapping |
| email_clickers_20251226.parquet | 41M | 778 MB | Email engagement (clicks) by client |
| campaigns_20251226.parquet | 73K | 5.9 MB | Campaign → Client lookup |
| all_manual_donations_emails_clean_merged_20251226.parquet | 3.2M | 83 MB | Donation records |

Total: 227M rows, 10.8 GB


Environment Notes

Snowflake Connection

SNOWFLAKE_DATABASE=CONSERVATIVECONNECTOR
SNOWFLAKE_WAREHOUSE=<your_warehouse>

Available schemas:

- PIPELINE: Processed/merged data
- KPI: Analytics/metrics tables
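A sketch of opening a connection from those variables; it assumes SNOWFLAKE_ACCOUNT, SNOWFLAKE_USER, and SNOWFLAKE_PASSWORD are also set in the environment:

```python
# Connect using the environment variables above; account/user/password
# variables are assumptions about the rest of the environment.
import os
import snowflake.connector

conn = snowflake.connector.connect(
    account=os.environ["SNOWFLAKE_ACCOUNT"],
    user=os.environ["SNOWFLAKE_USER"],
    password=os.environ["SNOWFLAKE_PASSWORD"],
    warehouse=os.environ["SNOWFLAKE_WAREHOUSE"],
    database=os.environ["SNOWFLAKE_DATABASE"],  # CONSERVATIVECONNECTOR
    schema="PIPELINE",                          # or KPI
)
```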

File Locations

data/cc/sources/     # Consolidated parquet files (keep these)
data/cc/work/        # Temp chunks (safe to delete after success)

Gotchas

  1. Table names are effectively case-sensitive in Snowflake (quoted identifiers preserve case; unquoted ones fold to uppercase). Use uppercase: CONSERVATIVECONNECTOR.PIPELINE.TABLE_NAME

  2. First chunk may differ from later chunks in schema. The pipeline handles this, but manual extractions may need schema alignment.

  3. Work directory grows fast for large tables. 182M rows = 10GB of chunks before consolidation.

  4. DuckDB loading is fast: loading 182M rows takes ~5 minutes with chunked inserts (see the sketch below).
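A sketch of chunked inserts from a consolidated parquet file, reading PyArrow row batches and registering them with DuckDB; the database path, table name, and batch size are illustrative:

```python
# Load the consolidated parquet file into DuckDB in ~1M-row batches so only
# one batch is materialized in Python memory at a time.
import duckdb
import pyarrow as pa
import pyarrow.parquet as pq

con = duckdb.connect("data/cc/cc.duckdb")  # example database file
pf = pq.ParquetFile(
    "data/cc/sources/list_link_phone_email_map_20251226.parquet"
)

first = True
for batch in pf.iter_batches(batch_size=1_000_000):
    tbl = pa.Table.from_batches([batch])
    con.register("batch_tbl", tbl)  # expose the batch to SQL
    if first:
        con.execute("CREATE TABLE phone_email_map AS SELECT * FROM batch_tbl")
        first = False
    else:
        con.execute("INSERT INTO phone_email_map SELECT * FROM batch_tbl")
    con.unregister("batch_tbl")
```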