Congressional District Data Platform

Data Architecture Analysis & Platform Proposal

December 2025


Executive Summary

This analysis explores the feasibility of building an interactive congressional district data mining platform for Campaign Brain. After examining the U.S. Census Bureau's data offerings for the 119th Congress, I've identified a manageable, well-structured dataset that would work excellently with DuckDB's spatial extension and GeoParquet format.

Key Finding: The core dataset for all 441 congressional districts (geometries plus a full year of comprehensive demographic data) totals approximately 200-250 MB, making a single-database approach viable without per-district partitioning.


Data Coverage & Sources

Geographic Boundaries

The Census Bureau provides two types of boundary files for the 119th Congress (January 2025 – January 2027):

File Type              Size (Compressed)   Coverage              Use Case
TIGER/Line Shapefiles  ~150 MB total       Per-state files       Detailed boundaries
Cartographic Boundary  ~7 MB (national)    Single national file  Visualization (1:500k)

Source URLs:

  • TIGER/Line: https://www2.census.gov/geo/tiger/TIGER2024/CD/
  • Cartographic: https://www2.census.gov/geo/tiger/GENZ2024/shp/cb_2024_us_cd119_500k.zip
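Fetching the national cartographic file needs nothing beyond the standard library; a minimal sketch, where the URL is the cartographic file listed above and the `data/` directory name is an arbitrary choice:

```python
import os
import zipfile
from urllib.request import urlretrieve

CB_URL = "https://www2.census.gov/geo/tiger/GENZ2024/shp/cb_2024_us_cd119_500k.zip"

def local_name(url: str) -> str:
    """Derive a local filename from a download URL."""
    return url.rsplit("/", 1)[-1]

def download_and_extract(url: str, dest_dir: str = "data") -> str:
    """Download a Census zip archive and extract its shapefile components."""
    os.makedirs(dest_dir, exist_ok=True)
    zip_path = os.path.join(dest_dir, local_name(url))
    if not os.path.exists(zip_path):  # skip re-download on repeat runs
        urlretrieve(url, zip_path)
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(dest_dir)
    return dest_dir

# download_and_extract(CB_URL)  # fetches the ~7 MB national file
```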

Demographic & Economic Data

The Census Bureau's American Community Survey (ACS) provides comprehensive data via API:

Data Profile         Variables  Coverage
DP02 – Social           616     Education, marital status, language, ancestry, veterans
DP03 – Economic         274     Employment, commuting, income, poverty, health insurance
DP04 – Housing          286     Occupancy, structure, value, rent, utilities
DP05 – Demographics     188     Age, sex, race, Hispanic origin, voting-age population
Detailed Tables      36,722     Granular breakdowns of all topics
Subject Tables       18,645     Topic-specific summaries with percentages

Total: ~56,731 variables available per congressional district


Data Size Estimation

Component                                      Estimated Size
Geometry (GeoParquet, 441 districts)           ~10 MB
Data Profiles (1,364 vars × 441 districts)     ~5 MB
Full ACS Tables (56,731 vars × 441 districts)  ~190 MB
5 Years Historical Data                        ~950 MB
Total (Core + 5 Years)                         ~1.2 GB (manageable)
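A quick back-of-envelope check of the largest line item, assuming roughly 8 uncompressed bytes per stored value (a 64-bit number):

```python
# Sanity-check the "Full ACS Tables" estimate: variables × districts × bytes.
VARIABLES = 56_731        # detailed + subject + profile variables
DISTRICTS = 441           # 435 voting seats + 6 non-voting members
BYTES_PER_VALUE = 8       # assumption: one 64-bit value, pre-compression

total_bytes = VARIABLES * DISTRICTS * BYTES_PER_VALUE
total_mb = total_bytes / 1_000_000
print(f"~{total_mb:.0f} MB per year")  # prints "~200 MB per year"
```

Parquet's columnar compression typically shrinks this further, so ~190 MB on disk is a reasonable (even conservative) figure.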

Architecture Recommendation

Given the manageable data size, I recommend a single DuckDB database rather than per-district partitioning:

  • Simplicity: One database file, easy to deploy and back up
  • Performance: DuckDB handles ~1GB datasets easily; spatial queries are fast with bbox filtering
  • Joins: Cross-district analysis becomes trivial (e.g., "show all districts where median income > $75k")
  • DuckDB Spatial: Reads Shapefiles and GeoPackages directly, exports GeoParquet with bbox for fast filtering

Proposed Schema

-- Core tables
districts        (geoid, geometry, state_fips, district_num, namelsad, aland, awater)
demographics     (geoid, year, dp02_*, dp03_*, dp04_*, dp05_* columns)
detailed_tables  (geoid, year, table_id, variable_id, estimate, moe)

-- Enrichment tables (for Campaign Brain)
district_news    (geoid, date, headline, source_url, sentiment)
rep_info         (geoid, congress_num, rep_name, party, committees)
campaign_events  (geoid, event_date, event_type, description)

Your Tile Server Idea

The tile server would be valuable for interactive visualization! Here's how it fits:

  • Vector Tiles: Generate MVT tiles from the cartographic boundaries (7MB source → fast tiles)
  • Dynamic Styling: Color districts by any metric (turnout, income, party lean)
  • Tippecanoe: Convert GeoJSON/FlatGeobuf → PMTiles for serverless hosting or MBTiles for tile server
  • DuckDB + Tiles: Query DuckDB for data, tile server for rendering—clean separation of concerns
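The Tippecanoe step could be driven from Python as a subprocess. The flags below are standard Tippecanoe options; the input/output file names and layer name are placeholders:

```python
import subprocess

def tippecanoe_cmd(src_geojson: str, out_pmtiles: str, layer: str = "districts") -> list[str]:
    """Build a tippecanoe invocation for district boundaries.

    -zg guesses an appropriate max zoom from the data;
    --drop-densest-as-needed keeps low-zoom tiles under the size limit.
    """
    return [
        "tippecanoe",
        "-o", out_pmtiles,
        "-l", layer,
        "-zg",
        "--drop-densest-as-needed",
        src_geojson,
    ]

cmd = tippecanoe_cmd("cd119.geojson", "cd119.pmtiles")
# subprocess.run(cmd, check=True)  # requires tippecanoe on PATH
```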

Implementation Approach

Phase 1: Data Pipeline

  1. Download: Fetch national cartographic boundary file (7MB) + TIGER/Line if detailed boundaries needed
  2. Census API: Batch-fetch ACS Data Profiles for all 441 districts (free API key recommended; keyless access is rate-limited)
  3. Transform: Convert to GeoParquet + normalize demographic tables
  4. Load: Insert into DuckDB with spatial extension
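Step 2 might look like the following sketch, assuming the `requests` package. The helpers only wrap parameter construction and response parsing; the variable ID in the usage comment (DP05_0001E, total population) comes from the Data Profile tables above:

```python
import requests

ACS_PROFILE_URL = "https://api.census.gov/data/2023/acs/acs1/profile"

def profile_request(variables, state_fips=None, api_key=None):
    """Build query params for an ACS Data Profile request over congressional districts."""
    params = {
        "get": ",".join(["NAME"] + list(variables)),
        "for": "congressional district:*",
    }
    if state_fips:
        params["in"] = f"state:{state_fips}"  # omit to fetch all states at once
    if api_key:
        params["key"] = api_key
    return params

def fetch_profiles(variables, state_fips=None, api_key=None):
    """Fetch profile rows as dicts keyed by column name."""
    resp = requests.get(ACS_PROFILE_URL,
                        params=profile_request(variables, state_fips, api_key))
    resp.raise_for_status()
    header, *rows = resp.json()  # first row of the response is the column names
    return [dict(zip(header, row)) for row in rows]

# fetch_profiles(["DP05_0001E"])  # one row per congressional district
```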

Phase 2: Visualization

  • Generate vector tiles with Tippecanoe or similar
  • Deploy tile server (or use PMTiles for serverless)
  • Build FastHTML frontend with MapLibre GL JS
  • Connect click events to DuckDB queries for district details

Phase 3: Enrichment (Campaign Brain Value-Add)

  • News Aggregation: RSS feeds + Claude summarization for district-level political news
  • Representative Data: Congress.gov API for current rep info, voting records, committees
  • Election History: MIT Election Lab data for historical results
  • Custom Metrics: Combine Census data into campaign-relevant scores (persuadability, turnout potential)
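As a purely hypothetical illustration of such a custom metric, the function below combines three census-derived inputs with invented weights; it is a placeholder for a real model, not a validated score:

```python
def turnout_potential(pct_over_65: float, pct_bachelors: float,
                      median_income: float) -> float:
    """Toy 'turnout potential' score (0-100) from census-derived inputs.

    The inputs and weights are illustrative only, loosely reflecting that
    older, more educated, higher-income districts tend to vote at higher rates.
    """
    income_component = min(median_income / 100_000, 1.0)  # cap at $100k
    score = (0.40 * (pct_over_65 / 100)
             + 0.35 * (pct_bachelors / 100)
             + 0.25 * income_component)
    return round(100 * score, 1)

turnout_potential(pct_over_65=18.0, pct_bachelors=35.0, median_income=72_000)
```

In practice the weights would be fit against the MIT Election Lab historical results mentioned above rather than chosen by hand.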

Technical Details from Analysis

Boundary File Structure

The national cartographic boundary file contains:

  • 441 districts (435 voting seats + 6 non-voting members: 5 delegates and the Puerto Rico resident commissioner)
  • 56 state/territory FIPS codes represented
  • 633,459 total vertices across all geometries
  • Average 1,436 vertices per district

Top States by District Count

State           Districts
California          52
Texas               38
Florida             28
New York            26
Illinois            17
Pennsylvania        17
Ohio                15
Georgia             14
North Carolina      14
Michigan            13

Census API Sample

import requests

# Get demographic data for all California congressional districts
params = {
    'get': 'NAME,DP05_0001E,DP05_0018E',  # Name, Total Pop, Median Age
    'for': 'congressional district:*',
    'in': 'state:06',
}
resp = requests.get(
    'https://api.census.gov/data/2023/acs/acs1/profile',
    params=params,
)
resp.raise_for_status()
header, *rows = resp.json()  # first row is the column names
# rows holds one entry per CA district (52 under the current apportionment)

Summary

The congressional district data platform is highly feasible:

  • Data is available: Census provides comprehensive, well-documented data via free API
  • Size is manageable: ~1-2GB total, easily handled by DuckDB
  • No per-district DBs needed: Single database with spatial indexing is more practical
  • Tile server adds value: For interactive maps, your tile server idea is the right approach
  • Growth path: Start with Census data, add news/election/rep data incrementally

Next Steps: I can help build out any of these components—the data pipeline, DuckDB schema, tile generation workflow, or FastHTML visualization. Let me know which piece you'd like to tackle first!