cbintel Architecture
System architecture and component relationships for the cbintel intelligence toolkit.
Overview
cbintel is a modular intelligence gathering and knowledge synthesis platform that combines:
- Web Crawling: AI-powered iterative web discovery
- Historical Archives: Internet Archive and Common Crawl retrieval
- Vector Search: Semantic similarity search with embeddings
- Browser Automation: Screenshots, PDFs, and DOM extraction
- VPN Cluster: Geographic proxy routing via OpenWRT workers
- Jobs API: Async job submission with progress tracking and result storage
System Architecture
┌─────────────────────────────────────────────────────────────────────────────┐
│                              cbintel CLI Tools                              │
├─────────────┬─────────────┬─────────────┬─────────────┬─────────────────────┤
│ cbintel-    │ cbintel-    │ cbintel-    │ cbintel-    │ cbintel-            │
│ crawl       │ lazarus     │ vectl       │ screenshots │ cluster             │
└──────┬──────┴──────┬──────┴──────┬──────┴──────┬──────┴──────────┬──────────┘
       │             │             │             │                 │
       ▼             ▼             ▼             ▼                 ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                            cbintel Sub-Services                             │
├─────────────┬─────────────┬─────────────┬─────────────┬─────────────────────┤
│ crawl       │ lazarus     │ vectl       │ screenshots │ cluster             │
│             │             │             │             │                     │
│ - Pipeline  │ - CDX API   │ - Embeddings│ - Capture   │ - Device Registry   │
│ - Batches   │ - Discovery │ - Storage   │ - PDF Gen   │ - VPN Banks         │
│ - Evaluate  │ - Archives  │ - Search    │ - DOM       │ - Workers           │
│ - Synthesize│ - Temporal  │ - Clustering│ - Links     │ - HAProxy           │
└──────┬──────┴──────┬──────┴──────┬──────┴──────┬──────┴──────────┬──────────┘
       │             │             │             │                 │
       ▼             ▼             ▼             ▼                 ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                               Core Libraries                                │
├───────────────────┬───────────────────┬─────────────────────────────────────┤
│ cbintel.ai        │ cbintel.net       │ cbintel.io                          │
│                   │                   │                                     │
│ - Anthropic API   │ - HTTP Client     │ - HTML Processing                   │
│ - Ollama Client   │ - URL Cleaning    │ - Markdown Conversion               │
│ - CBAI Unified    │ - Web Search      │ - File Storage                      │
│ - Embeddings      │ - Proxy Support   │ - Session Management                │
└─────────┬─────────┴─────────┬─────────┴───────────────┬─────────────────────┘
          │                   │                         │
          ▼                   ▼                         ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                            External Dependencies                            │
├─────────────┬─────────────┬─────────────┬─────────────┬─────────────────────┤
│ Anthropic   │ Ollama      │ Playwright  │ cdx_toolkit │ OpenWRT/LuCI        │
│ Claude API  │ Local LLM   │ Browsers    │ Web Archive │ RPC Interface       │
└─────────────┴─────────────┴─────────────┴─────────────┴─────────────────────┘
Component Details
1. cbintel.crawl - AI-Powered Web Crawling
Iterative web discovery with AI-driven evaluation and synthesis.
User Query
    │
    ▼
┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│  Discover   │────▶│  Retrieve   │────▶│   Process   │
│  (Search)   │     │   (Fetch)   │     │(Parse/Clean)│
└─────────────┘     └─────────────┘     └──────┬──────┘
                                               │
    ┌──────────────────────────────────────────┘
    │
    ▼
┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│  Evaluate   │────▶│   Decide    │────▶│ Synthesize  │
│ (AI Score)  │     │ (Continue?) │     │  (Report)   │
└─────────────┘     └──────┬──────┘     └─────────────┘
                           │
                           │ More batches needed
                           ▼
                    ┌─────────────┐
                    │ Child Batch │
                    │ (New URLs)  │
                    └─────────────┘
Key Features:
- Multi-model AI support (Anthropic Claude, Ollama local models)
- Iterative batch processing with child URL discovery
- Quality-based content evaluation
- Automatic synthesis and report generation
2. cbintel.lazarus - Historical Web Archives
Retrieve and analyze historical web content from Internet Archive and Common Crawl.
Domain/URL
    │
    ▼
┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│  Discovery  │────▶│   CDX API   │────▶│  Retrieve   │
│    (gau)    │     │    Query    │     │   Content   │
└─────────────┘     └─────────────┘     └──────┬──────┘
                                               │
    ┌──────────────────────────────────────────┘
    │
    ▼
┌─────────────┐     ┌─────────────┐
│  Temporal   │────▶│  Timeline   │
│  Analysis   │     │   Report    │
└─────────────┘     └─────────────┘
Components:
- CDXClient: Query Internet Archive/Common Crawl CDX APIs
- URLDiscovery: Discover URLs via gau (wayback, commoncrawl, etc.)
- ArchiveClient: High-level orchestration of discovery + retrieval
- TemporalAnalyzer: Time-series content change analysis
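To make the CDX query step concrete, here is a minimal sketch of building a query and parsing the response. It targets the public Wayback Machine CDX endpoint and its JSON output, where the first row holds the field names. The function names are hypothetical and are not taken from cbintel's CDXClient, which may work differently.

```python
import json
from urllib.parse import urlencode

# Public Wayback Machine CDX endpoint; cbintel's CDXClient may use another one.
CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_cdx_url(url: str, limit: int = 50) -> str:
    """Build a CDX query URL that asks for JSON output."""
    params = {"url": url, "output": "json", "limit": str(limit)}
    return f"{CDX_ENDPOINT}?{urlencode(params)}"


def parse_cdx_json(payload: str) -> list[dict[str, str]]:
    """Turn CDX JSON (first row = field names) into a list of dicts."""
    rows = json.loads(payload)
    if not rows:
        return []
    header, *records = rows
    return [dict(zip(header, record)) for record in records]


def snapshot_url(record: dict[str, str]) -> str:
    """Return the Wayback replay URL for one parsed CDX record."""
    return f"https://web.archive.org/web/{record['timestamp']}/{record['original']}"
```

Fetching the URL from `build_cdx_url` with any HTTP client and passing the body to `parse_cdx_json` yields one dict per snapshot.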
3. cbintel.vectl - Vector Embeddings & Search
Semantic similarity search using text embeddings and vector storage.
Documents
    │
    ▼
┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│    Chunk    │────▶│    Embed    │────▶│    Store    │
│    Text     │     │  (Ollama)   │     │   (vectl)   │
└─────────────┘     └─────────────┘     └──────┬──────┘
                                               │
  Query                                        │
    │                                          │
    ▼                                          ▼
┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│    Embed    │────▶│   Search    │────▶│   Results   │
│    Query    │     │  (K-means)  │     │  (Ranked)   │
└─────────────┘     └─────────────┘     └─────────────┘
Components:
- EmbeddingService: Generate 768D vectors via Ollama (nomic-embed-text)
- VectorStore: K-means clustered storage (vectl C++ or NumPy fallback)
- SemanticSearch: Text-to-text similarity search
- ChunkingService: Split documents into overlapping chunks
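Overlapping chunking can be sketched as a sliding window over words. The function name and the overlap of 64 words are assumptions for illustration; ChunkingService may split on tokens or sentences instead.

```python
def chunk_words(text: str, size: int = 512, overlap: int = 64) -> list[str]:
    """Split text into word chunks of `size`, each sharing `overlap` words
    with the previous chunk."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    words = text.split()
    step = size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the window already reached the end of the text
    return chunks
```

The overlap keeps a sentence that straddles a chunk boundary fully visible in at least one chunk, at the cost of embedding some words twice.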
4. cbintel.screenshots - Browser Automation
Screenshot capture, PDF generation, and DOM extraction using Playwright.
                       URL
                        │
                        ▼
                 ┌─────────────┐
                 │ Playwright  │
                 │   Browser   │
                 └──────┬──────┘
                        │
       ┌────────────────┼────────────────┐
       ▼                ▼                ▼
┌─────────────┐  ┌─────────────┐  ┌─────────────┐
│ Screenshot  │  │     PDF     │  │     DOM     │
│   Capture   │  │ Generation  │  │ Extraction  │
└─────────────┘  └─────────────┘  └─────────────┘
Components:
- ScreenshotService: Full-page and element screenshots
- PDFService: PDF generation with configurable format/margins
- DOMService: Element extraction with bounding boxes
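The three outputs can be produced from a single page load with Playwright's synchronous Python API. The sketch below is an assumption about how such a capture might look, not the actual ScreenshotService, PDFService, or DOMService code. `output_stem`, `capture`, the `captures` directory, and the A4/1cm settings are hypothetical.

```python
from pathlib import Path
from urllib.parse import urlparse


def output_stem(url: str) -> str:
    """Derive a filesystem-friendly name from a URL (hypothetical helper)."""
    parsed = urlparse(url)
    raw = f"{parsed.netloc}{parsed.path}".strip("/")
    return "".join(c if c.isalnum() else "_" for c in raw) or "page"


def capture(url: str, out_dir: str = "captures", selector: str = "a") -> dict:
    """Save a full-page screenshot and a PDF, and collect bounding boxes
    for every element matching `selector`."""
    # Imported lazily so output_stem() works without Playwright installed.
    from playwright.sync_api import sync_playwright

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    stem = output_stem(url)
    with sync_playwright() as p:
        # Headless is the default; page.pdf() requires headless Chromium.
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url)
        page.screenshot(path=str(out / f"{stem}.png"), full_page=True)
        page.pdf(
            path=str(out / f"{stem}.pdf"),
            format="A4",
            margin={"top": "1cm", "bottom": "1cm", "left": "1cm", "right": "1cm"},
        )
        # bounding_box() returns None for elements that are not visible.
        boxes = [el.bounding_box() for el in page.query_selector_all(selector)]
        browser.close()
    return {"stem": stem, "boxes": [b for b in boxes if b is not None]}
```

Running `capture("https://example.com/")` requires the browsers installed by `playwright install`.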
5. cbintel.cluster - VPN Cluster Management
Geographic VPN routing via 16 OpenWRT workers with HAProxy load balancing.
┌─────────────────────────────────────────────────────────────────┐
│                           Host Server                           │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │             FastAPI Cluster API (port 9002)             │    │
│  │  /api/v1/banks  /api/v1/workers  /api/v1/devices        │    │
│  └──────────────────────────┬──────────────────────────────┘    │
└─────────────────────────────┼───────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────────┐
│                    Master Router (17.0.0.1)                     │
│  ┌─────────────┐  ┌─────────────────────────────────────┐       │
│  │   HAProxy   │  │              LuCI RPC               │       │
│  │  8890-8999  │  │      Device Control Interface       │       │
│  └──────┬──────┘  └─────────────────────────────────────┘       │
└─────────┼───────────────────────────────────────────────────────┘
          │
          │ Load Balance
          ▼
┌─────────────────────────────────────────────────────────────────┐
│                      OpenWRT Workers (16x)                      │
│  ┌─────────┐  ┌─────────┐  ┌─────────┐       ┌─────────┐        │
│  │Worker 1 │  │Worker 2 │  │Worker 3 │  ...  │Worker 16│        │
│  │17.0.0.10│  │17.0.0.11│  │17.0.0.12│       │17.0.0.25│        │
│  │         │  │         │  │         │       │         │        │
│  │ OpenVPN │  │ OpenVPN │  │ OpenVPN │       │ OpenVPN │        │
│  │TinyProxy│  │TinyProxy│  │TinyProxy│       │TinyProxy│        │
│  └────┬────┘  └────┬────┘  └────┬────┘       └────┬────┘        │
└───────┼────────────┼────────────┼─────────────────┼─────────────┘
        │            │            │                 │
        ▼            ▼            ▼                 ▼
┌─────────────────────────────────────────────────────────────────┐
│                      ProtonVPN Exit Nodes                       │
│             (~12,900 profiles across 127 countries)             │
└─────────────────────────────────────────────────────────────────┘
Components:
- DeviceRegistry: Comprehensive device tracking with WireGuard info
- DeviceService: Ping, speedtest, execute, reboot operations
- BankService: Geographic VPN pool management
- WorkerService: VPN and proxy control per worker
- StateManager: Persistent JSON-based state
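Two pieces of this design are easy to illustrate: the worker addressing shown in the diagram and JSON-backed state. The sketch below assumes workers are numbered sequentially from 17.0.0.10, which matches the four addresses in the diagram but is otherwise an inference. `JsonState` is a hypothetical stand-in for StateManager that writes atomically, so a crash mid-write cannot leave a truncated file.

```python
import json
import os
import tempfile
from pathlib import Path


def worker_ip(index: int, base: str = "17.0.0", first_host: int = 10,
              count: int = 16) -> str:
    """Map a 1-based worker number to its address (assumed sequential)."""
    if not 1 <= index <= count:
        raise ValueError(f"worker index must be in 1..{count}")
    return f"{base}.{first_host + index - 1}"


class JsonState:
    """Load and save a dict as JSON, replacing the file atomically."""

    def __init__(self, path):
        self.path = Path(path)

    def load(self) -> dict:
        if not self.path.exists():
            return {}
        return json.loads(self.path.read_text(encoding="utf-8"))

    def save(self, state: dict) -> None:
        self.path.parent.mkdir(parents=True, exist_ok=True)
        # Write to a temp file in the same directory, then rename over the target.
        fd, tmp = tempfile.mkstemp(dir=str(self.path.parent), suffix=".tmp")
        try:
            with os.fdopen(fd, "w", encoding="utf-8") as fh:
                json.dump(state, fh, indent=2)
            os.replace(tmp, self.path)
        except BaseException:
            if os.path.exists(tmp):
                os.unlink(tmp)
            raise
```

`os.replace` is atomic when source and target are on the same filesystem, which is why the temporary file is created next to the target rather than in the system temp directory.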
6. cbintel.jobs - Async Job Processing API
Unified job submission and processing for all cbintel modules.
Client Request
      │
      ▼
┌─────────────────────────────────────────────────────────────────┐
│                  FastAPI Jobs API (port 9003)                   │
│  POST /api/v1/jobs/{crawl,lazarus,vectl,screenshots}            │
│  GET /api/v1/jobs/{job_id} (poll status)                        │
│  DELETE /api/v1/jobs/{job_id} (cancel)                          │
└──────────────────────────┬──────────────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────────────┐
│                   Job Queue (Redis/In-Memory)                   │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐                       │
│  │ Worker 1 │  │ Worker 2 │  │ Worker 3 │                       │
│  └────┬─────┘  └────┬─────┘  └────┬─────┘                       │
└───────┼─────────────┼─────────────┼─────────────────────────────┘
        │             │             │
        ▼             ▼             ▼
┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ CrawlWorker │ │LazarusWorker│ │ VectlWorker │ │ Screenshot  │
│             │ │             │ │             │ │   Worker    │
└──────┬──────┘ └──────┬──────┘ └──────┬──────┘ └──────┬──────┘
       │               │               │               │
       ▼               ▼               ▼               ▼
┌─────────────────────────────────────────────────────────────────┐
│               files.nominate.ai (Result Storage)                │
│               cbintel-jobs bucket for job outputs               │
└─────────────────────────────────────────────────────────────────┘
Components:
- JobQueue: Redis-backed task queue with in-memory fallback
- BaseWorker: Abstract worker with progress callbacks
- CrawlWorker: Wraps CrawlPipeline for research queries
- LazarusWorker: Wraps ArchiveClient for historical retrieval
- VectlWorker: Wraps EmbeddingService for vector operations
- ScreenshotWorker: Wraps ScreenshotService for captures
- FilesClient: Uploads results to files.nominate.ai
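The relationship between the queue, the abstract worker, and the progress callback can be sketched with an in-memory queue. Every name here apart from BaseWorker is hypothetical, the status strings are assumed rather than documented, and the real JobQueue is Redis-backed.

```python
import queue
import uuid
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class Job:
    kind: str
    params: dict
    job_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    status: str = "pending"  # assumed states: pending, running, completed, failed
    progress: float = 0.0
    result: Any = None


class BaseWorker(ABC):
    kind: str = ""

    @abstractmethod
    def run(self, job: Job, report: Callable[[float], None]) -> Any:
        """Do the work, calling report(fraction) as progress is made."""


class EchoWorker(BaseWorker):
    """Toy worker standing in for CrawlWorker, LazarusWorker, and so on."""
    kind = "echo"

    def run(self, job: Job, report: Callable[[float], None]) -> Any:
        report(0.5)
        return {"echo": job.params}


class InMemoryJobQueue:
    def __init__(self, workers: list):
        self._queue: queue.Queue = queue.Queue()
        self._jobs: dict = {}
        self._workers = {w.kind: w for w in workers}

    def submit(self, kind: str, params: dict) -> str:
        if kind not in self._workers:
            raise ValueError(f"no worker registered for {kind!r}")
        job = Job(kind, params)
        self._jobs[job.job_id] = job
        self._queue.put(job.job_id)
        return job.job_id

    def get(self, job_id: str) -> Job:
        return self._jobs[job_id]

    def process_next(self) -> None:
        """Run the oldest queued job to completion or failure."""
        job = self._jobs[self._queue.get_nowait()]
        job.status = "running"

        def report(fraction: float) -> None:
            job.progress = fraction

        try:
            job.result = self._workers[job.kind].run(job, report)
            job.progress, job.status = 1.0, "completed"
        except Exception as exc:
            job.result, job.status = str(exc), "failed"
```

A client polling `get(job_id)` sees the same status and progress fields that a `GET /api/v1/jobs/{job_id}` call would plausibly return.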
Job Lifecycle:
Data Flow
Typical Crawl Pipeline
1. User submits query via cbintel-crawl CLI
   │
   ▼
2. Search engine discovers initial URLs
   │
   ▼
3. URLs fetched, HTML parsed to markdown
   │
   ▼
4. AI evaluates content relevance (0-10 score)
   │
   ▼
5. High-scoring pages analyzed for child URLs
   │
   ▼
6. Child batch created with discovered URLs
   │
   ▼
7. Repeat steps 3-6 until depth limit or satisfaction
   │
   ▼
8. Final synthesis generates report
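The loop in steps 2 through 7 can be sketched as a breadth-first crawl where only high-scoring pages contribute child URLs. This is an illustration under assumptions, not the CrawlPipeline code: the four callables are injected placeholders, and the threshold of 7, the depth limit of 2, and the `enough` stopping rule are invented values.

```python
from typing import Callable, Iterable


def crawl(
    query: str,
    search: Callable[[str], Iterable[str]],
    fetch: Callable[[str], str],
    score: Callable[[str, str], float],
    extract_links: Callable[[str], Iterable[str]],
    max_depth: int = 2,
    min_score: float = 7.0,
    enough: int = 20,
) -> list:
    """Return (url, score) pairs for pages worth synthesizing, best first."""
    batch = list(search(query))          # step 2: initial URLs
    seen: set = set()
    kept: list = []
    for _depth in range(max_depth + 1):  # step 7: depth limit
        child_batch: list = []
        for url in batch:
            if url in seen:
                continue
            seen.add(url)
            text = fetch(url)            # step 3: fetch and parse
            relevance = score(query, text)  # step 4: 0-10 score
            if relevance >= min_score:
                kept.append((url, relevance))
                # steps 5-6: only good pages feed the child batch
                child_batch.extend(u for u in extract_links(text) if u not in seen)
        if not child_batch or len(kept) >= enough:
            break                        # step 7: nothing left, or satisfied
        batch = child_batch
    return sorted(kept, key=lambda pair: pair[1], reverse=True)
```

The returned list is what a synthesis step (step 8) would consume. Links on low-scoring pages are never followed, which is what keeps the crawl from drifting off topic.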
Archive Research Pipeline
1. User queries domain via cbintel-lazarus
   │
   ▼
2. gau discovers historical URLs from wayback
   │
   ▼
3. CDX API queried for snapshots of each URL
   │
   ▼
4. Content retrieved from Internet Archive
   │
   ▼
5. Temporal analysis detects content changes
   │
   ▼
6. Timeline report generated
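One simple way to detect content changes in step 5 is to compare the content digest that CDX records carry for each snapshot. Whether TemporalAnalyzer works this way is an assumption; the function names below are hypothetical, and the records are plain dicts with the standard CDX `timestamp`, `original`, and `digest` fields.

```python
from datetime import datetime


def detect_changes(snapshots: list) -> list:
    """Return snapshots whose digest differs from the previous snapshot.

    The earliest snapshot always counts as a change (first appearance).
    """
    ordered = sorted(snapshots, key=lambda s: s["timestamp"])
    changes = []
    previous = None
    for snap in ordered:
        if snap["digest"] != previous:
            changes.append(snap)
            previous = snap["digest"]
    return changes


def format_timeline(changes: list) -> list:
    """Render one line per change; CDX timestamps are YYYYMMDDhhmmss."""
    lines = []
    for snap in changes:
        when = datetime.strptime(snap["timestamp"], "%Y%m%d%H%M%S")
        lines.append(f"{when:%Y-%m-%d}  {snap['original']}  digest={snap['digest']}")
    return lines
```

Digest comparison only says that a page changed, not how. A fuller analysis would diff the retrieved content between the flagged snapshots.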
Semantic Search Pipeline
1. Documents indexed via cbintel-vectl index
   │
   ▼
2. Text chunked into 512-word segments
   │
   ▼
3. Ollama generates embeddings (768D vectors)
   │
   ▼
4. Vectors stored with K-means clustering
   │
   ▼
5. User queries via cbintel-vectl search
   │
   ▼
6. Query embedded and matched to clusters
   │
   ▼
7. Cosine similarity ranks results
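Steps 6 and 7 amount to a two-stage lookup: pick the nearest cluster centroids, then rank only the vectors in those clusters by cosine similarity. The sketch below shows that idea in pure Python with toy 2D vectors. The data layout, the `n_probe` parameter, and the function names are assumptions and do not describe the vectl storage format.

```python
import math


def cosine(a: list, b: list) -> float:
    """Cosine similarity of two equal-length vectors (0.0 for zero vectors)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query: list, centroids: list, clusters: dict,
           top_k: int = 3, n_probe: int = 1) -> list:
    """Rank documents from the n_probe clusters nearest to the query.

    clusters maps a centroid index to a list of (doc_id, vector) pairs.
    """
    nearest = sorted(
        range(len(centroids)),
        key=lambda i: cosine(query, centroids[i]),
        reverse=True,
    )[:n_probe]
    candidates = [item for i in nearest for item in clusters.get(i, [])]
    scored = [(doc_id, cosine(query, vec)) for doc_id, vec in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]
```

Probing a single cluster is fastest but can miss a close vector that was assigned to a neighbouring cluster; raising `n_probe` trades speed for recall.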
Configuration
Environment Variables
# AI/LLM
ANTHROPIC_API_KEY=sk-ant-...
OLLAMA_BASE_URL=http://127.0.0.1:11434
OLLAMA_EMBED_MODEL=nomic-embed-text
# Cluster API
OPENWRT_USERNAME=root
OPENWRT_PASSWORD=<password>
MASTER_IP=17.0.0.1
CLUSTER_API_PORT=9002
# Storage
BANK_STATE_FILE=/var/lib/vpn-banks/bank-state.json
DEVICE_REGISTRY_FILE=/var/lib/vpn-banks/device-registry.json
# VPN Profiles
PROFILES_BASE=/path/to/profiles/intl-ovpn
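A service would typically read these variables once at startup. The loader below is a hedged sketch: the `ClusterConfig` class and `load_cluster_config` function are hypothetical, and using the example values above as defaults is an assumption, not documented cbintel behaviour.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class ClusterConfig:
    master_ip: str
    api_port: int
    username: str
    password: str
    bank_state_file: str
    device_registry_file: str


def load_cluster_config(env: Optional[Mapping[str, str]] = None) -> ClusterConfig:
    """Build the cluster settings from environment variables."""
    env = os.environ if env is None else env
    password = env.get("OPENWRT_PASSWORD")
    if not password:
        # Fail fast rather than start with an unusable router login.
        raise RuntimeError("OPENWRT_PASSWORD must be set")
    return ClusterConfig(
        master_ip=env.get("MASTER_IP", "17.0.0.1"),
        api_port=int(env.get("CLUSTER_API_PORT", "9002")),
        username=env.get("OPENWRT_USERNAME", "root"),
        password=password,
        bank_state_file=env.get(
            "BANK_STATE_FILE", "/var/lib/vpn-banks/bank-state.json"),
        device_registry_file=env.get(
            "DEVICE_REGISTRY_FILE", "/var/lib/vpn-banks/device-registry.json"),
    )
```

Passing a plain dict as `env` makes the loader easy to exercise in tests without touching the real environment.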
Deployment
Development
# Clone and install
git clone <repo>
cd cbintel
pip install -e ".[dev]"
# Install Playwright browsers
playwright install
# Run tests
pytest
Production
# Install package
pip install .
# Start services manually
cbintel-cluster # VPN Cluster API on port 9002
cbintel-jobs # Jobs API on port 9003
# Or via systemd (recommended)
sudo systemctl enable cbcluster cbjobs
sudo systemctl start cbcluster cbjobs
Systemd Services
| Service | Description | Port | URL |
|---|---|---|---|
| cbcluster.service | VPN Cluster Management API | 32203 | https://intel.nominate.ai |
| cbjobs.service | Async Job Processing API | 9003 | https://jobs.nominate.ai |
Service files location: /etc/systemd/system/
Nginx configs location: /etc/nginx/sites-nominate/
# Check service status
sudo systemctl status cbcluster cbjobs
# View logs
sudo journalctl -u cbjobs -f
sudo journalctl -u cbcluster -f
# Restart after code changes
sudo systemctl restart cbjobs
File Organization
cbintel/
├── src/cbintel/
│   ├── ai/              # AI client wrappers
│   ├── net/             # Network operations
│   ├── io/              # File/process I/O
│   ├── crawl/           # Crawl pipeline
│   ├── lazarus/         # Historical archives
│   ├── vectl/           # Vector search
│   ├── screenshots/     # Browser automation
│   ├── cluster/         # VPN cluster API
│   └── jobs/            # Async job queue with workers
├── docs/                # Documentation
├── extern/              # External project symlinks
└── tests/               # Test suite
Security Considerations
- API Keys: Store in environment variables, never in code
- VPN Profiles: Contain credentials, restrict file permissions
- Cluster API: Currently no authentication (TODO for production)
- Command Execution: Device execute endpoint runs as root
- Proxy Traffic: All cluster traffic routes through VPN tunnels
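The file-permission point for VPN profiles can be checked mechanically. This is a small sketch for POSIX systems with hypothetical function names: it lists profile files that group or other users can access and then restricts them to the owner.

```python
import os
import stat
from pathlib import Path


def loose_permissions(root: str) -> list:
    """List files under root that grant any access to group or others."""
    mask = stat.S_IRWXG | stat.S_IRWXO
    return [
        path for path in Path(root).rglob("*")
        if path.is_file() and path.stat().st_mode & mask
    ]


def tighten(paths: list) -> None:
    """Restrict each file to owner read/write (mode 0600)."""
    for path in paths:
        os.chmod(path, stat.S_IRUSR | stat.S_IWUSR)
```

Running `tighten(loose_permissions(os.environ["PROFILES_BASE"]))` would apply this to the profile directory named in the configuration.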