Congressional District Data Platform
Data Architecture Analysis & Platform Proposal
December 2025
Executive Summary
This analysis explores the feasibility of building an interactive congressional district data mining platform for Campaign Brain. After examining the U.S. Census Bureau's data offerings for the 119th Congress, I've identified a manageable, well-structured dataset that fits well with DuckDB's spatial extension and the GeoParquet format.
Key Finding: A single year of data for all 441 congressional districts, including geometries and the full set of demographic tables, would total approximately 200-250MB (about 1.2GB with five years of history), making a single-database approach viable without per-district partitioning.
Data Coverage & Sources
Geographic Boundaries
The Census Bureau provides two types of boundary files for the 119th Congress (January 2025 – January 2027):
| File Type | Size (Compressed) | Coverage | Use Case |
|---|---|---|---|
| TIGER/Line Shapefiles | ~150 MB total | Per-state files | Detailed boundaries |
| Cartographic Boundary | 7 MB (national) | Single national file | Visualization (1:500k) |
Source URLs:
- TIGER/Line: https://www2.census.gov/geo/tiger/TIGER2024/CD/
- Cartographic: https://www2.census.gov/geo/tiger/GENZ2024/shp/cb_2024_us_cd119_500k.zip
Demographic & Economic Data
The Census Bureau's American Community Survey (ACS) provides comprehensive data via API:
| Data Profile | Variables | Coverage |
|---|---|---|
| DP02 – Social | 616 | Education, marital status, language, ancestry, veterans |
| DP03 – Economic | 274 | Employment, commuting, income, poverty, health insurance |
| DP04 – Housing | 286 | Occupancy, structure, value, rent, utilities |
| DP05 – Demographics | 188 | Age, sex, race, Hispanic origin, voting-age population |
| Detailed Tables | 36,722 | Granular breakdowns of all topics |
| Subject Tables | 18,645 | Topic-specific summaries with percentages |
Total: 56,731 variables available per congressional district (1,364 of them in the four Data Profiles)
Data Size Estimation
| Component | Estimated Size |
|---|---|
| Geometry (GeoParquet, 441 districts) | ~10 MB |
| Data Profiles (1,364 vars × 441 districts) | ~5 MB |
| Full ACS Tables (56,731 vars × 441 districts) | ~190 MB |
| 5 Years Historical Data | ~950 MB |
| Total (Core + 5 Years) | ~1.2 GB (manageable) |
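The table's figures can be roughly reproduced with back-of-the-envelope arithmetic. The sketch below assumes one uncompressed 8-byte number per variable per district, which is my assumption rather than a measured value; storing margins of error alongside estimates would roughly double the result, and Parquet compression would shrink it.

```python
DISTRICTS = 441
BYTES_PER_VALUE = 8   # assumption: one 64-bit number per estimate, uncompressed
MIB = 1024 ** 2


def estimate_mib(variables: int, districts: int = DISTRICTS, years: int = 1) -> float:
    """Raw size in MiB of a dense variables x districts x years table."""
    return variables * districts * years * BYTES_PER_VALUE / MIB


profiles = estimate_mib(1_364)            # the four Data Profiles
full_acs = estimate_mib(56_731)           # every variable in the table above
history = estimate_mib(56_731, years=5)   # five annual vintages

for label, size in [("Data Profiles", profiles),
                    ("Full ACS tables", full_acs),
                    ("5 years of history", history)]:
    print(f"{label}: {size:.1f} MiB")
```

Under that assumption the three rows land close to the ~5 MB, ~190 MB, and ~950 MB quoted in the table.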
Architecture Recommendation
Single DuckDB Database (Recommended)
Given the manageable data size, I recommend a single DuckDB database rather than per-district partitioning:
- Simplicity: One database file, easy to deploy and backup
- Performance: DuckDB handles ~1GB datasets easily; spatial queries are fast with bbox filtering
- Joins: Cross-district analysis becomes trivial (e.g., "show all districts where median income > $75k")
- DuckDB Spatial: Reads Shapefiles and GeoPackages directly, exports GeoParquet with bbox for fast filtering
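As a rough illustration of the points above, here is a minimal sketch of loading the boundary file and running a cross-district query. It assumes the third-party `duckdb` Python package, an unzipped copy of the cartographic shapefile, the shapefile attribute names (`GEOID`, `STATEFP`, `CD119FP`, and so on), a geometry column named `geom` from `ST_Read`, and a `demographics` table with a `dp03_0062e` (median household income) column. All of these should be checked against the actual files before relying on them.

```python
def load_sql(shapefile: str) -> str:
    """SQL that copies the boundary file into a `districts` table."""
    return f"""
        CREATE OR REPLACE TABLE districts AS
        SELECT GEOID AS geoid, STATEFP AS state_fips, CD119FP AS district_num,
               NAMELSAD AS namelsad, ALAND AS aland, AWATER AS awater,
               geom AS geometry
        FROM ST_Read('{shapefile}')
    """


# Cross-district filter, e.g. "median income > $75k" for a given year.
HIGH_INCOME_SQL = """
    SELECT d.namelsad, m.dp03_0062e AS median_household_income
    FROM districts d
    JOIN demographics m USING (geoid)
    WHERE m.year = ? AND m.dp03_0062e > ?
    ORDER BY median_household_income DESC
"""


def build_database(db_path: str, shapefile: str):
    """Create the database file, load boundaries, and export them as Parquet."""
    import duckdb  # third-party: pip install duckdb

    con = duckdb.connect(db_path)
    con.install_extension("spatial")
    con.load_extension("spatial")
    con.execute(load_sql(shapefile))
    # With the spatial extension loaded, recent DuckDB versions are expected
    # to add GeoParquet metadata to geometry columns on export.
    con.execute("COPY districts TO 'districts.parquet' (FORMAT PARQUET)")
    return con
```

Once the `demographics` table is populated, something like `con.execute(HIGH_INCOME_SQL, [2023, 75_000]).fetchall()` would return the matching districts.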
Proposed Schema
-- Core tables
districts (geoid, geometry, state_fips, district_num, namelsad, aland, awater)
demographics (geoid, year, dp02_*, dp03_*, dp04_*, dp05_* columns)
detailed_tables (geoid, year, table_id, variable_id, estimate, moe)
-- Enrichment tables (for Campaign Brain)
district_news (geoid, date, headline, source_url, sentiment)
rep_info (geoid, congress_num, rep_name, party, committees)
campaign_events (geoid, event_date, event_type, description)
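The outline above could be turned into DuckDB DDL along these lines. The column types, primary keys, and the list type for `committees` are my assumptions, and the wide `dp02_*` to `dp05_*` columns are left out because they would be generated from the API's variable list.

```python
# Table name -> column definitions. Types and keys are assumptions.
SCHEMA = {
    "districts": (
        "geoid VARCHAR PRIMARY KEY, geometry GEOMETRY, state_fips VARCHAR, "
        "district_num VARCHAR, namelsad VARCHAR, aland BIGINT, awater BIGINT"
    ),
    # One row per district per year; profile columns are added separately.
    "demographics": "geoid VARCHAR, year INTEGER, PRIMARY KEY (geoid, year)",
    # Long format: one row per district, year, and variable.
    "detailed_tables": (
        "geoid VARCHAR, year INTEGER, table_id VARCHAR, variable_id VARCHAR, "
        "estimate DOUBLE, moe DOUBLE, PRIMARY KEY (geoid, year, variable_id)"
    ),
    "district_news": (
        'geoid VARCHAR, "date" DATE, headline VARCHAR, '
        "source_url VARCHAR, sentiment DOUBLE"
    ),
    "rep_info": (
        "geoid VARCHAR, congress_num INTEGER, rep_name VARCHAR, "
        "party VARCHAR, committees VARCHAR[]"
    ),
    "campaign_events": (
        "geoid VARCHAR, event_date DATE, event_type VARCHAR, description VARCHAR"
    ),
}


def create_statements() -> list[str]:
    """Return one CREATE TABLE statement per table in SCHEMA."""
    return [
        f"CREATE TABLE IF NOT EXISTS {name} ({columns});"
        for name, columns in SCHEMA.items()
    ]


def create_tables(con) -> None:
    """Run the DDL on an open DuckDB connection with `spatial` loaded."""
    for statement in create_statements():
        con.execute(statement)
```

Keeping `detailed_tables` in long format avoids a table with tens of thousands of columns, at the cost of a pivot when several variables are needed side by side.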
Your Tile Server Idea
The tile server would be valuable for interactive visualization! Here's how it fits:
- Vector Tiles: Generate MVT tiles from the cartographic boundaries (7MB source → fast tiles)
- Dynamic Styling: Color districts by any metric (turnout, income, party lean)
- Tippecanoe: Convert GeoJSON/FlatGeobuf → PMTiles for serverless hosting or MBTiles for tile server
- DuckDB + Tiles: Query DuckDB for data and leave rendering to the tile server, which keeps a clean separation of concerns
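To make the dynamic styling idea concrete, here is a small sketch that turns a list of metric values (as they might come back from a DuckDB query) into a MapLibre `step` expression for a fill layer. The function name, property name, and colors are hypothetical, and it assumes each tile feature carries the metric as a property.

```python
import statistics


def choropleth_paint(values, prop, colors):
    """Build MapLibre paint properties that color features by quantile.

    values: metric values for all districts (used only to find break points)
    prop:   name of the feature property holding the metric
    colors: one color per class, lowest class first
    """
    if len(colors) < 2:
        raise ValueError("need at least two colors")
    # n classes need n - 1 cut points.
    cuts = statistics.quantiles(values, n=len(colors))
    # ["step", input, base_output, stop_1, output_1, stop_2, output_2, ...]
    expr = ["step", ["get", prop], colors[0]]
    for cut, color in zip(cuts, colors[1:]):
        expr.extend([cut, color])
    return {"fill-color": expr, "fill-opacity": 0.7}
```

MapLibre expects the stops in strictly ascending order, so heavily tied data may need de-duplicated cut points. The returned dictionary can be serialized to JSON and used as the `paint` block of a fill layer.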
Implementation Approach
Phase 1: Data Pipeline
- Download: Fetch national cartographic boundary file (7MB) + TIGER/Line if detailed boundaries needed
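- Census API: Batch-fetch ACS Data Profiles for all districts the ACS covers (a free API key is needed beyond the keyless limit of 500 queries per day; the ACS does not survey Guam, American Samoa, the Northern Mariana Islands, or the U.S. Virgin Islands, so those four delegate districts will have boundaries but no ACS rows)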
- Transform: Convert to GeoParquet + normalize demographic tables
- Load: Insert into DuckDB with spatial extension
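The batch-fetch step could be sketched as below. As far as I know the Census API accepts at most 50 variables per request, so the variable list has to be split into chunks. The helper name, the 49-plus-`NAME` chunking, and the `state:*` wildcard are my assumptions, not something taken from the Census documentation verbatim.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.census.gov/data/2023/acs/acs1/profile"
MAX_VARS_PER_REQUEST = 50  # assumption: documented per-request variable limit


def batch_urls(variables, api_key=None, batch_size=MAX_VARS_PER_REQUEST - 1):
    """Return one request URL per chunk of variables.

    Each request also asks for NAME, so chunks hold batch_size variables
    to stay within the per-request limit.
    """
    urls = []
    for start in range(0, len(variables), batch_size):
        chunk = variables[start:start + batch_size]
        params = {
            "get": ",".join(["NAME", *chunk]),
            "for": "congressional district:*",
            "in": "state:*",  # every state in a single call
        }
        if api_key:
            params["key"] = api_key
        urls.append(f"{BASE_URL}?{urlencode(params)}")
    return urls


# The two variables from the sample request fit into a single call.
urls = batch_urls(["DP05_0001E", "DP05_0018E"])
```

At 49 variables per call, the 1,364 Data Profile variables need 28 requests, which stays well inside the daily limit even without a key.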
Phase 2: Visualization
- Generate vector tiles with Tippecanoe or similar
- Deploy tile server (or use PMTiles for serverless)
- Build FastHTML frontend with MapLibre GL JS
- Connect click events to DuckDB queries for district details
Phase 3: Enrichment (Campaign Brain Value-Add)
- News Aggregation: RSS feeds + Claude summarization for district-level political news
- Representative Data: Congress.gov API for current rep info, voting records, committees
- Election History: MIT Election Lab data for historical results
- Custom Metrics: Combine Census data into campaign-relevant scores (persuadability, turnout potential)
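One simple way to build a custom metric is a weighted sum of standardized inputs. The sketch below is only an illustration of that idea: the function names are hypothetical, and the choice of inputs and weights for something like persuadability would need real modeling work.

```python
from statistics import mean, pstdev


def zscores(values):
    """Standardize values to mean 0 and standard deviation 1."""
    mu, sigma = mean(values), pstdev(values)
    return [0.0 if sigma == 0 else (v - mu) / sigma for v in values]


def composite_score(metrics, weights):
    """Weighted sum of z-scored metrics, one score per district.

    metrics: metric name -> list of values, all in the same district order
    weights: metric name -> weight (hypothetical, chosen by the analyst)
    """
    standardized = {name: zscores(metrics[name]) for name in weights}
    n_districts = len(next(iter(standardized.values())))
    return [
        sum(weights[name] * standardized[name][i] for name in weights)
        for i in range(n_districts)
    ]
```

Standardizing first keeps a dollar-denominated input such as income from swamping a percentage-denominated one such as turnout.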
Technical Details from Analysis
Boundary File Structure
The national cartographic boundary file contains:
- 441 districts (435 voting + 6 non-voting delegates)
- 56 state/territory FIPS codes represented
- 633,459 total vertices across all geometries
- Average 1,436 vertices per district
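Vertex counts like these can be checked with a few lines of code. The sketch below counts coordinate positions in a GeoJSON-style `coordinates` array; whether the figures above count each ring's closing vertex is not something I can confirm, so totals may differ slightly depending on that convention.

```python
def count_vertices(coords) -> int:
    """Count positions in a nested GeoJSON-style coordinates array."""
    if coords and isinstance(coords[0], (int, float)):
        return 1  # reached a single [x, y] position
    return sum(count_vertices(part) for part in coords)


# A unit square as a GeoJSON Polygon: one ring, closed by repeating the start.
square = [[[0, 0], [1, 0], [1, 1], [0, 1], [0, 0]]]
multi = [square, square]  # MultiPolygon made of two such polygons

print(count_vertices(square), count_vertices(multi))

# Consistency check on the figures above.
average = 633_459 / 441
print(round(average))
```

The same function applied to each district's geometry would give the per-district distribution, which is useful for deciding how much simplification the tiles need.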
Top States by District Count
| State | Districts |
|---|---|
| California | 52 |
| Texas | 38 |
| Florida | 28 |
| New York | 26 |
| Illinois | 17 |
| Pennsylvania | 17 |
| Ohio | 15 |
| Georgia | 14 |
| North Carolina | 14 |
| Michigan | 13 |
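Census API Sample
```python
import requests

# Get demographic data for all California congressional districts
params = {
    'get': 'NAME,DP05_0001E,DP05_0018E',  # Name, total population, median age
    'for': 'congressional district:*',
    'in': 'state:06',  # 06 is California's FIPS code
}
resp = requests.get(
    'https://api.census.gov/data/2023/acs/acs1/profile',
    params=params,
    timeout=30,
)
resp.raise_for_status()  # fail loudly on HTTP errors

# The JSON payload is a list of rows; the first row holds the column names
header, *rows = resp.json()
# Returns data for all 52 CA districts
```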
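The API returns a JSON array of arrays whose first row is the header, with the geography codes in trailing `state` and `congressional district` columns. A small helper, sketched here with a hypothetical name and placeholder values rather than real Census figures, can turn that into records keyed by the same four-character GEOID used in the boundary file.

```python
def rows_to_records(payload):
    """Convert a Census API payload (header row + data rows) into dicts."""
    header, *rows = payload
    records = []
    for row in rows:
        record = dict(zip(header, row))
        # GEOID for a congressional district = state FIPS + district number.
        record["geoid"] = record["state"] + record["congressional district"]
        records.append(record)
    return records


# Placeholder payload with the same shape as an API response (values invented).
sample = [
    ["NAME", "DP05_0001E", "DP05_0018E", "state", "congressional district"],
    ["Placeholder District A", "0", "0.0", "06", "01"],
    ["Placeholder District B", "0", "0.0", "06", "12"],
]
records = rows_to_records(sample)
```

All values arrive as strings, so numeric columns need casting before they are loaded into DuckDB.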
Summary
The congressional district data platform is highly feasible:
- ✅ Data is available: Census provides comprehensive, well-documented data via free API
- ✅ Size is manageable: ~1-2GB total, easily handled by DuckDB
- ✅ No per-district DBs needed: Single database with spatial indexing is more practical
- ✅ Tile server adds value: For interactive maps, your tile server idea is the right approach
- ✅ Growth path: Start with Census data, add news/election/rep data incrementally
Next Steps: I can help build out any of these components: the data pipeline, DuckDB schema, tile generation workflow, or FastHTML visualization. Let me know which piece you'd like to tackle first!