# Lessons Learned: Snowflake ETL Pipeline
Findings from extracting 220M+ rows from Snowflake (Dec 2025).
## Key Findings

### 1. Chunk Size Matters
| Chunk Size | Result |
|---|---|
| 2M rows | Frequent timeouts on large tables |
| 1M rows | Stable, good throughput (~1 chunk/min) |
| 500K rows | Slower but more resilient for unstable networks |
| 100K rows | Too slow for 100M+ row tables |
Recommendation: Default to 1M rows; drop to 500K if you see frequent timeouts.
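For reference, a minimal sketch of a 1M-row chunked extraction loop using the Snowflake Python connector and PyArrow. The connection parameters, the `TABLE_NAME` / `ORDER BY ID` key, and OFFSET-based paging are illustrative assumptions, not the pipeline's exact implementation.

```python
import pyarrow.parquet as pq
import snowflake.connector

CHUNK_ROWS = 1_000_000  # recommended default; drop to 500_000 on flaky networks

# Connection parameters are placeholders, not the pipeline's real config.
conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="...",
    warehouse="MY_WH", database="CONSERVATIVECONNECTOR", schema="PIPELINE",
)
cur = conn.cursor()

offset = 0
chunk_idx = 0
while True:
    # ORDER BY a stable key (ID is an illustrative column) so OFFSET paging is deterministic.
    cur.execute(
        f"SELECT * FROM TABLE_NAME ORDER BY ID LIMIT {CHUNK_ROWS} OFFSET {offset}"
    )
    chunk = cur.fetch_arrow_all()  # PyArrow table holding this chunk
    if chunk is None or chunk.num_rows == 0:
        break
    pq.write_table(chunk, f"data/cc/work/chunk_{chunk_idx:05d}.parquet")
    offset += chunk.num_rows
    chunk_idx += 1

cur.close()
conn.close()
```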
### 2. Network Resilience is Critical
Snowflake connector fetches results from S3. This can fail due to:

- DNS resolution failures ("Temporary failure in name resolution")
- S3 read timeouts
- Connection resets
Solution implemented:

- Auto-retry with escalating backoff (5 retries, delay of 10s × retry_num)
- Reconnect to Snowflake after each failure
- Long network/socket timeouts (600s)
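A sketch of this retry-and-reconnect pattern, assuming the Snowflake Python connector; the credentials and the `fetch_with_retry` helper are illustrative, not the pipeline's actual code.

```python
import time
import snowflake.connector

MAX_RETRIES = 5
NETWORK_TIMEOUT = 600  # seconds for network/socket operations

def connect():
    # Placeholder credentials; network_timeout and socket_timeout are real
    # connector options that cap how long a hung fetch can block.
    return snowflake.connector.connect(
        account="my_account", user="my_user", password="...",
        network_timeout=NETWORK_TIMEOUT, socket_timeout=NETWORK_TIMEOUT,
    )

def fetch_with_retry(sql, fetch):
    """Run `sql` and fetch results, reconnecting after each failure.

    `fetch` is a callable taking a cursor, e.g. lambda cur: cur.fetch_arrow_all().
    """
    conn = connect()
    for attempt in range(1, MAX_RETRIES + 1):
        try:
            cur = conn.cursor()
            cur.execute(sql)
            return fetch(cur)
        except (snowflake.connector.errors.OperationalError, OSError):
            # Covers S3 read timeouts, DNS failures, and connection resets.
            if attempt == MAX_RETRIES:
                raise
            time.sleep(10 * attempt)  # 10s, 20s, 30s, ...
            try:
                conn.close()
            except Exception:
                pass
            conn = connect()          # fresh session after each failure
```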
### 3. Schema Normalization Required
Different extraction batches can have schema variations:
- First chunk: TIMESTAMP as date32[day]
- Later chunks: TIMESTAMP as string
Solution: Consolidation phase casts all date columns to strings for consistency.
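A sketch of what that cast can look like with PyArrow; the temporal-column detection below is an assumption about how the consolidation step identifies date columns.

```python
import pyarrow as pa

def normalize_dates_to_string(table: pa.Table) -> pa.Table:
    """Cast date/timestamp columns to string so every chunk shares one schema."""
    for i, field in enumerate(table.schema):
        if pa.types.is_date(field.type) or pa.types.is_timestamp(field.type):
            casted = table.column(i).cast(pa.string())
            table = table.set_column(i, pa.field(field.name, pa.string()), casted)
    return table
```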
### 4. Resumability is Essential
For 100M+ row tables, extraction takes hours. Interruptions happen.
Solution implemented:

- State file tracks: status, extracted_rows, chunks_extracted
- Pipeline auto-resumes from last successful chunk
- Work files preserved until cleanup
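A sketch of the state-file bookkeeping, assuming a JSON state file with the fields listed above; the path, `run_extraction`, and the `extract_chunk` callable are placeholders.

```python
import json
from pathlib import Path

STATE_PATH = Path("data/cc/work/TABLE_NAME.state.json")  # illustrative path
CHUNK_ROWS = 1_000_000

def load_state() -> dict:
    """Return saved progress, or a fresh state if extraction hasn't started."""
    if STATE_PATH.exists():
        return json.loads(STATE_PATH.read_text())
    return {"status": "pending", "extracted_rows": 0, "chunks_extracted": 0}

def save_state(state: dict) -> None:
    STATE_PATH.write_text(json.dumps(state, indent=2))

def run_extraction(total_chunks: int, extract_chunk) -> None:
    """Resume from the last successful chunk; extract_chunk(i) does the real work."""
    state = load_state()
    for chunk_idx in range(state["chunks_extracted"], total_chunks):
        extract_chunk(chunk_idx)
        state["chunks_extracted"] = chunk_idx + 1
        state["extracted_rows"] += CHUNK_ROWS
        save_state(state)  # persist progress after every chunk
    state["status"] = "extracted"
    save_state(state)
```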
### 5. Memory Management
| Workload | Memory Usage |
|---|---|
| 1M rows/chunk | ~1.5 GB peak |
| 182M total | 3 GB during consolidation |
Note: PyArrow streaming writer keeps memory bounded during consolidation.
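A sketch of bounded-memory consolidation with `pyarrow.parquet.ParquetWriter`, assuming chunk files that already share a schema; the paths and glob pattern are illustrative.

```python
from pathlib import Path
import pyarrow.parquet as pq

def consolidate(work_dir: str, out_path: str) -> None:
    """Merge chunk files into one parquet file, holding one chunk in memory at a time."""
    chunk_paths = sorted(Path(work_dir).glob("chunk_*.parquet"))
    writer = None
    for path in chunk_paths:
        chunk = pq.read_table(path)  # only the current chunk is resident
        if writer is None:
            # Assumes chunks already share a schema (see schema normalization above).
            writer = pq.ParquetWriter(out_path, chunk.schema)
        writer.write_table(chunk)
    if writer is not None:
        writer.close()

# e.g. consolidate("data/cc/work", "data/cc/sources/table_20251226.parquet")
```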
## Performance Benchmarks
| Table | Rows | Extract Time | Consolidate | Total |
|---|---|---|---|---|
| CAMPAIGNS | 73K | <1 min | <1 min | ~1 min |
| EMAIL_CLICKERS | 41M | ~45 min | ~2 min | ~47 min |
| DONATIONS | 3.2M | ~4 min | <1 min | ~5 min |
| PHONE_EMAIL_MAP | 182M | ~3 hrs | ~5 min | ~3.1 hrs |
Throughput: ~1M rows/minute (varies with network conditions).
## Recommended Improvements

### Priority 1: Add to Pipeline
- `--parallel`: Extract multiple tables concurrently
- `--dry-run`: Show row count and estimated time without extracting
- `--notify`: Slack/email notification on completion
### Priority 2: Operational
- `pipeline logs TABLE`: Show extraction logs for debugging
- `pipeline retry TABLE`: Retry failed chunks only
- Scheduled runs: Cron-friendly wrapper with lock files (see the sketch below)
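A possible sketch of the cron-friendly lock-file wrapper suggested above (not part of the current pipeline); the lock path and the `pipeline extract TABLE` invocation are placeholders.

```python
import fcntl
import subprocess
import sys

LOCK_PATH = "/tmp/snowflake_etl.lock"  # illustrative lock location

def main() -> int:
    """Run the pipeline unless another run already holds the lock (cron-safe)."""
    with open(LOCK_PATH, "w") as lock:
        try:
            fcntl.flock(lock, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            print("Another extraction is already running; exiting.")
            return 0
        # "pipeline extract TABLE" is a placeholder for the real CLI invocation.
        return subprocess.run(["pipeline", "extract", "TABLE"]).returncode

if __name__ == "__main__":
    sys.exit(main())
```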
### Priority 3: Nice to Have
- Incremental extraction: Only new rows since last run (requires timestamp column)
- Compression options: Choose snappy/gzip/zstd for parquet
- S3 upload: Direct upload to S3 after consolidation
## Current Data Inventory
Extracted 2025-12-26:
| File | Rows | Size | Description |
|---|---|---|---|
| list_link_phone_email_map_20251226.parquet | 182M | 9.94 GB | Contact → Phone/Email mapping |
| email_clickers_20251226.parquet | 41M | 778 MB | Email engagement (clicks) by client |
| campaigns_20251226.parquet | 73K | 5.9 MB | Campaign → Client lookup |
| all_manual_donations_emails_clean_merged_20251226.parquet | 3.2M | 83 MB | Donation records |
Total: 227M rows, 10.8 GB
## Environment Notes

### Snowflake Connection
Available schemas:
- PIPELINE - Processed/merged data
- KPI - Analytics/metrics tables
### File Locations
```
data/cc/sources/   # Consolidated parquet files (keep these)
data/cc/work/      # Temp chunks (safe to delete after success)
```
## Gotchas
- Table names are case-sensitive in Snowflake (quoted identifiers must match the stored uppercase names). Use uppercase, fully qualified names: `CONSERVATIVECONNECTOR.PIPELINE.TABLE_NAME`
- The first chunk may differ from later chunks in schema. The pipeline handles this, but manual extractions may need schema alignment.
- The work directory grows fast for large tables: 182M rows produces ~10 GB of chunks before consolidation.
- DuckDB loading is fast: loading 182M rows takes ~5 minutes with chunked inserts (see the sketch below).
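For reference, a sketch of chunked loading into DuckDB via PyArrow record batches; the database file, target table name, and batch size are illustrative assumptions.

```python
import duckdb
import pyarrow as pa
import pyarrow.parquet as pq

con = duckdb.connect("analytics.duckdb")  # illustrative database file
pf = pq.ParquetFile("data/cc/sources/list_link_phone_email_map_20251226.parquet")

first = True
for batch in pf.iter_batches(batch_size=1_000_000):  # one ~1M-row batch at a time
    chunk = pa.Table.from_batches([batch])
    con.register("chunk", chunk)  # expose the Arrow chunk to DuckDB
    if first:
        con.execute("CREATE TABLE phone_email_map AS SELECT * FROM chunk")
        first = False
    else:
        con.execute("INSERT INTO phone_email_map SELECT * FROM chunk")
    con.unregister("chunk")

con.close()
```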