Rate Ingestion
OVERVIEW
I need you to put on your Commercial Radio hat for a moment :) Then Ultrathink on a plan to assemble a working backend MVP: process raw data into a database and make it accessible via API (read-only endpoints for now).
BUSINESS CASE
Our organization represents mainly political candidates and purchases radio ad-buys on their behalf. We have a network of stations around the United States that we work with.
It's a mainly cyclical business, and there are certain constraints, variables, and modifiers related to rates.
We depend on accurate, up-to-date rates from our network of stations. Rates are somewhat of a moving target. Managing rates with our networks and clients is what makes the business.
There are a myriad of variables that impact the rates from our stations. Knowing (as best we can) what these rates are becomes critical when we create proposals for our clients. We can lose money if our information is old, stale, or incorrect.
DATA
Here is a general overview of "What Makes a Rate"
STATION ID: WALK-AM
RATE EFFECTIVE DATE: 2025-01-01
TYPE OF AD:
DURATION: <30 SECONDS|60 SECONDS|...>
CAMPAIGN CYCLE OVERRIDE: LOWEST UNIT RATE (LUR): 90 DAYS BEFORE PRIMARY ELECTION; 60 DAYS BEFORE GENERAL ELECTION
This is the data structure that needs to be kept current.
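As a starting point, here's a minimal Pydantic sketch of that structure. The field names, the optional rate amount, and the override enum values are assumptions read off the sketch above, not a final spec:

```python
# A minimal sketch, assuming the fields in the sketch above.
# Field names, enum values, and the rate amount are placeholders.
from datetime import date
from decimal import Decimal
from enum import Enum

from pydantic import BaseModel, Field


class CampaignCycleOverride(str, Enum):
    """LUR windows from the sketch above; values are assumptions."""
    LUR_PRE_PRIMARY = "LUR_90_DAYS_BEFORE_PRIMARY"
    LUR_PRE_GENERAL = "LUR_60_DAYS_BEFORE_GENERAL"


class RateCard(BaseModel):
    station_id: str = Field(..., examples=["WALK-AM"])
    effective_date: date
    ad_type: str | None = None          # sketch leaves this open
    duration_seconds: int = Field(..., examples=[30, 60])
    campaign_cycle_override: CampaignCycleOverride | None = None
    rate: Decimal | None = None         # assumed: the dollar amount itself
```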
GOALS
Standardize on the data-structure sketch above, which evolved from real-life data; we want to solidify it into a proper object:
- Prototype a schema (we'll use Pydantic for spec and impl) that captures most of the logical relationships we need, with room for expansion.
- Some data-processing workers: Python 3.13.3, with Mistral OCR and Anthropic access via API / Python libs.
- API: FastAPI backed by SQLite and/or DuckDB (a minimal read-only sketch follows this list).
- As for validation, not sure at this point which combinations are allowed and which are not. A good foundation with some allowance for non-breaking modification would be 10x what we have now.
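A minimal read-only sketch of that API, assuming a SQLite file named rates.db with a rates table matching the schema above (the path, table, and column names are all placeholders):

```python
# Read-only FastAPI sketch backed by SQLite. The rates.db path, the
# rates table, and its column names are assumptions for illustration.
import sqlite3

from fastapi import FastAPI, HTTPException

app = FastAPI(title="Rate Ingestion API")
DB_PATH = "rates.db"  # assumed; kept separate from the main platform DB


def query(sql: str, params: tuple = ()) -> list[dict]:
    conn = sqlite3.connect(DB_PATH)
    conn.row_factory = sqlite3.Row
    try:
        return [dict(row) for row in conn.execute(sql, params).fetchall()]
    finally:
        conn.close()


@app.get("/stations")
def list_stations():
    """Distinct station call letters we hold rates for."""
    return [row["station_id"] for row in query("SELECT DISTINCT station_id FROM rates")]


@app.get("/stations/{station_id}/rates")
def rates_for_station(station_id: str, effective_date: str | None = None):
    """All rate rows for a station, optionally filtered to one effective date."""
    sql = "SELECT * FROM rates WHERE station_id = ?"
    params: tuple = (station_id,)
    if effective_date:
        sql += " AND effective_date = ?"
        params += (effective_date,)
    rows = query(sql + " ORDER BY effective_date DESC", params)
    if not rows:
        raise HTTPException(status_code=404, detail="no rates for station")
    return rows
```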
WORKFLOW
Extract raw data -> standardized form
The source data for the rate info is a miasma of PDFs, emails, napkins, photocopies, images, photos of napkins :) Basically organic, non-standard, natural language. It's an old-school business.
We've been able to channel most of the rate information through email, scans, and images. That is to say, we have it all in some type of accessible electronic format.
So we need something that can ingest email, interpret meaning, extract the information, and land it in the correct rate-card location. Several versions may exist, so we need to do our best to capture precedence / dates so they can self-organize into the correct STATION-EFFECTIVE DATE bucket.
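One possible shape for that step, reusing the RateCard sketch from the DATA section and assuming OCR/text extraction has already happened upstream. The prompt wording, the model name, and the last-write-wins precedence rule are all assumptions to be tuned:

```python
# Sketch of an extraction worker: raw text in, validated RateCard out,
# then self-organized into (station_id, effective_date) buckets.
# Assumes the RateCard model sketched in the DATA section is importable.
import json

from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment (see .env)

# Hypothetical prompt; iterate on this against the ground truth.
EXTRACT_PROMPT = """Extract radio ad rate information from the text below.
Respond with only a JSON object with keys: station_id, effective_date
(YYYY-MM-DD), ad_type, duration_seconds, campaign_cycle_override, rate.
Use null for anything not present.

{raw_text}"""


def extract_rate(raw_text: str) -> RateCard:
    """Raw email/OCR text in, validated RateCard out."""
    msg = client.messages.create(
        model="claude-sonnet-4-20250514",  # placeholder model name
        max_tokens=1024,
        messages=[{"role": "user", "content": EXTRACT_PROMPT.format(raw_text=raw_text)}],
    )
    return RateCard.model_validate(json.loads(msg.content[0].text))


def bucket_rates(cards: list[RateCard]) -> dict[tuple[str, str], RateCard]:
    """Self-organize into STATION-EFFECTIVE DATE buckets.
    Last card seen wins; a stand-in for real precedence rules."""
    buckets: dict[tuple[str, str], RateCard] = {}
    for card in cards:
        buckets[(card.station_id, card.effective_date.isoformat())] = card
    return buckets
```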
Perhaps rules can be built into the object? Or represented in some way?
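One way rules can live in the object is as Pydantic validators. The allowed spot lengths and the date sanity check below are made-up examples of the kind of rule that could hang off the model:

```python
# Sketch: business rules expressed on the object itself via Pydantic
# validators. The allowed durations and the date check are assumptions.
# Assumes the RateCard model from the DATA section is importable.
from datetime import date

from pydantic import field_validator, model_validator

ALLOWED_DURATIONS = {15, 30, 60}  # assumption; extend as real data dictates


class ValidatedRateCard(RateCard):
    @field_validator("duration_seconds")
    @classmethod
    def duration_is_standard(cls, v: int) -> int:
        if v not in ALLOWED_DURATIONS:
            raise ValueError(f"unexpected spot length: {v}s")
        return v

    @model_validator(mode="after")
    def effective_date_not_ancient(self) -> "ValidatedRateCard":
        if self.effective_date < date(2000, 1, 1):
            raise ValueError("effective date looks wrong; check extraction")
        return self
```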
THE GOOD NEWS
- Raw Airtable CSV files containing the current dataset and relationships in ./airtable/source/
- All of the attachment data is available and organized by station call letters in ./airtable/processed/
- The attachment data contains most of the rate information we'd need to "seed" the system initially.
- We have API keys for the AI services and Google account setup already. Look at the .env and .env.example
- The platform is already running FastAPI (backed by SQLite, I believe, although I'm not 100% sure; probably better to keep this data separate anyway).
- Most of the rates we have are current and in the Airtable data already. The purpose of this exercise is to develop the correct heuristics or prompts to gather and prepare new rate data as it becomes available.
The laser focus is on ingesting the data correctly. We'll improve ingestion by starting naively, processing, comparing against the ground truth, creating and adjusting heuristics (or AI calls), and iterating until we hit a good accuracy level.
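A sketch of the comparison step in that loop, assuming the Airtable ground truth has been parsed into the same RateCard shape and keyed by (station, effective date); the field list is an assumption:

```python
# Sketch of the accuracy check: compare extracted cards against the
# Airtable ground truth field by field. Keying and field list match
# the earlier sketches and are assumptions.
FIELDS = ["ad_type", "duration_seconds", "campaign_cycle_override", "rate"]


def field_accuracy(extracted: dict[tuple[str, str], RateCard],
                   truth: dict[tuple[str, str], RateCard]) -> dict[str, float]:
    hits = {f: 0 for f in FIELDS}
    total = 0
    for key, truth_card in truth.items():
        total += 1
        got = extracted.get(key)
        if got is None:
            continue  # missing bucket counts against every field
        for f in FIELDS:
            if getattr(got, f) == getattr(truth_card, f):
                hits[f] += 1
    return {f: hits[f] / total for f in FIELDS} if total else {}
```

Each iteration: adjust the prompt or heuristics, re-run the ingest, and watch these per-field numbers climb toward an acceptable accuracy level.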