CRAWWWL Analysis - Dependency Map and Consolidation Guide

Analysis of extern/crawwwl for consolidation into cbintel.

Overview

Crawwwl is an AI-powered web crawling and knowledge synthesis system that:

  1. Expands search queries using an LLM
  2. Searches across 24+ search engines
  3. Fetches content and converts it to markdown
  4. Analyzes relevance using vector similarity
  5. Synthesizes knowledge using an LLM
  6. Recursively expands with child batches

Architecture

flowchart TD
    subgraph Input
        Q[Search Query]
        PT[Prompt Type]
    end

    subgraph QueryExpansion["Query Expansion (LLM)"]
        GEQ[generate_enhanced_queries]
        OLLAMA1[Ollama: mistral-small]
        PROMPTS[prompts/*.txt]
    end

    subgraph SearchPhase["Search Phase"]
        SE[search-engines.txt]
        GSR[generate_search_results]
        LYNX[Lynx]
        CURL[curl-impersonate]
        GUZL[Guzl/Playwright]
    end

    subgraph LinkProcessing["Link Processing"]
        EL[extract_links]
        LC[linkclean.sh]
        LINKS[links.txt]
    end

    subgraph ContentFetch["Content Fetching"]
        PUE[process_urls_with_engines]
        SOUP[soup.py]
        H2M[html2markdown]
        PAGES[pages/*.txt]
    end

    subgraph Analysis["Analysis & Scoring"]
        PCU[process_crawl_urls]
        FASTCOMP[fastcomp]
        STATS[stats]
        JSON[stats/*.json]
    end

    subgraph PostProcess["Post-Processing (Python)"]
        EVAL[evaluate_results.py]
        EVALS[evaluate_results_simple.py]
        EVALE[evaluate_results_enhanced.py]
        KI[knowledge_integrator.py]
        EXT[extend_results.py]
    end

    subgraph Output
        REPORT[Analysis Report]
        RESPONSE[Integrated Response]
        CHILD[Child Batches]
    end

    Q --> GEQ
    PT --> GEQ
    PROMPTS --> GEQ
    GEQ --> OLLAMA1
    OLLAMA1 --> GSR
    SE --> GSR
    GSR --> LYNX & CURL & GUZL
    LYNX & CURL & GUZL --> EL
    EL --> LC --> LINKS
    LINKS --> PUE
    PUE --> LYNX & CURL & GUZL
    LYNX & CURL & GUZL --> SOUP --> H2M --> PAGES
    PAGES --> PCU
    PCU --> FASTCOMP & STATS --> JSON
    JSON --> EVAL & EVALS & EVALE
    EVALE --> KI --> RESPONSE
    KI --> EXT --> CHILD
    CHILD -.->|recursive| Q

Directory Structure

extern/crawwwl/
├── crawl.sh                    # Main entry point (817 lines)
├── crawl-single.sh             # Single URL processing
├── crawl-orchestrator.sh       # Multi-batch orchestration
├── auto-recovery.sh            # Crash recovery daemon
├── soup.py                     # HTML cleaning (BeautifulSoup)
├── linkclean.sh                # URL filtering/cleaning
├── evaluate_results.py         # Basic evaluation
├── evaluate_results_simple.py  # Stats-only evaluation
├── evaluate_results_enhanced.py # AI-enhanced evaluation
├── knowledge_integrator.py     # Knowledge synthesis
├── extend_results.py           # Child batch generation
├── src/crawwwl/
│   ├── core/
│   │   ├── config.py           # CrawwwlConfig dataclass
│   │   ├── pipeline.py         # Pipeline abstraction
│   │   ├── chunker.py          # Text chunking
│   │   └── semantic_recovery.py
│   ├── discovery/
│   │   ├── autonomous.py       # Autonomous discovery
│   │   └── semantic_seed_filter.py
│   └── optimization/
│       └── parameter_optimizer.py
├── prompts/                    # LLM prompt templates
│   ├── investigative.txt
│   ├── question_answering.txt
│   ├── technical_research.txt
│   ├── competitive_analysis.txt
│   └── trend_analysis.txt
├── search-engines.txt          # 24 search engine URLs
├── .env / .env.example         # Configuration
├── repos/                      # External dependencies
│   ├── guzl/                   # Playwright CLI
│   ├── vectl/                  # Vector store (C++)
│   ├── html-to-markdown/       # Go HTML converter
│   ├── curl-impersonate/       # Fingerprint curl
│   ├── stats/                  # Statistics tool
│   ├── top-user-agents/        # User agent DB
│   └── ...
└── working/                    # Runtime data
    └── {batch_id}/
        ├── html/               # Raw HTML
        ├── pages/              # Converted markdown
        ├── links/              # Extracted URLs
        ├── stats/              # Similarity scores
        └── analysis/           # Reports

Dependency Map

External Binaries (repos/)

Binary         Source                                Purpose                        Required
guzl           repos/guzl/dist/guzl-linux            Playwright browser automation  Yes
fastcomp       repos/vectl/build/fastcomp            Vector similarity scoring      Yes
stats          repos/stats/stats                     Statistical analysis           Yes
html2markdown  repos/html-to-markdown/html2markdown  HTML to Markdown conversion    Yes
curl_*         repos/curl-impersonate/chrome/        Browser-fingerprint curl       Optional

System Commands

Command  Purpose                             Required
lynx     Text-mode browser, link extraction  Yes
curl     HTTP client (fallback)              Yes
jq       JSON processing                     Yes
md5sum   URL hashing                         Yes
ollama   Local LLM inference                 Yes

Python Dependencies

# Core
beautifulsoup4      # soup.py - HTML parsing
readability-lxml    # soup.py - content extraction
ollama              # LLM client
python-dotenv       # Configuration

# Implicit
requests            # HTTP (optional)
pathlib             # File paths (stdlib)
dataclasses         # Config (stdlib)

Data Dependencies

File                              Purpose
search-engines.txt                24 search engine URL templates
prompts/*.txt                     LLM prompt templates (5 types)
repos/top-user-agents/src/*.json  User agent strings
.env                              Configuration (Ollama, thresholds)

Processing Pipeline Detail

Phase 1: Initialization (initialize_globals)

flowchart LR
    A[Query] --> B[validate_binaries]
    B --> C[Select curl variant]
    C --> D[Get user agent]
    D --> E[Setup directories]

  • Validates all required binaries exist
  • Selects curl-impersonate variant (chrome/safari/ff/edge)
  • Matches user agent to browser type
  • Creates batch directory structure
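
When porting to Python, this preflight check translates directly. A minimal sketch, assuming the binary paths from the dependency map above; the function name mirrors validate_binaries but is otherwise illustrative, not taken from crawl.sh:

import shutil
from pathlib import Path

# Binary paths taken from the External Binaries table above
REQUIRED_BINARIES = {
    "guzl": Path("repos/guzl/dist/guzl-linux"),
    "fastcomp": Path("repos/vectl/build/fastcomp"),
    "stats": Path("repos/stats/stats"),
    "html2markdown": Path("repos/html-to-markdown/html2markdown"),
}
REQUIRED_COMMANDS = ["lynx", "curl", "jq", "md5sum", "ollama"]

def validate_binaries(root: Path) -> list[str]:
    """Return the names of missing dependencies (empty list means all present)."""
    missing = [name for name, rel in REQUIRED_BINARIES.items()
               if not (root / rel).is_file()]
    missing += [cmd for cmd in REQUIRED_COMMANDS if shutil.which(cmd) is None]
    return missing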

Phase 2: Query Enhancement

generate_enhanced_queries() {
    # Substitute the query into the prompt template, run it through
    # Ollama, then extract the resulting JSON array of queries
    sed "s|{{QUERY}}|${query}|g" "${ROOT}/prompts/${PROMPT_TYPE}.txt" \
        | ollama run mistral-small3.2:24b \
        | jq -rc '.[]'
}
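
A hypothetical Python equivalent, using the ollama client from the Python dependencies list; the JSON-array response format is implied by the jq '.[]' extraction above:

import json
from pathlib import Path

import ollama  # Python client listed under Python Dependencies

def generate_enhanced_queries(query: str, prompt_type: str, root: Path) -> list[str]:
    # Load the prompt template and substitute the query placeholder
    template = (root / "prompts" / f"{prompt_type}.txt").read_text()
    prompt = template.replace("{{QUERY}}", query)
    # The templates instruct the model to reply with a JSON array of queries
    reply = ollama.generate(model="mistral-small3.2:24b", prompt=prompt)
    return json.loads(reply["response"])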

Prompt Types:

  • investigative - Journalist perspective, 5 critical questions
  • question_answering - Direct Q&A format
  • technical_research - Technical deep-dive
  • competitive_analysis - Market/competitor focus
  • trend_analysis - Trend identification

Phase 3: Search & Link Extraction

flowchart TD
    Q[Enhanced Query] --> SE[24 Search Engines]
    SE --> |Parallel| F1[Lynx -listonly]
    SE --> |Parallel| F2[Curl + Lynx parse]
    SE --> |Parallel| F3[Guzl + Lynx parse]
    F1 & F2 & F3 --> LC[linkclean.sh]
    LC --> |Filter| LINKS[links.txt]

Search Engines (search-engines.txt):

  • Google, DuckDuckGo, Bing, Yahoo, Yandex
  • Brave, Ecosia, Mojeek, Qwant, Startpage
  • Metacrawler, Dogpile, Carrot2, You.com
  • And 10+ more specialized engines
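
As a sketch of how the engine templates are consumed, here is a hypothetical Python helper. The exact placeholder token in search-engines.txt is not shown in this document, so {{QUERY}} below is an assumption:

from pathlib import Path
from urllib.parse import quote_plus

def build_search_urls(query: str, engines_file: Path) -> list[str]:
    """Expand one query against every engine URL template."""
    urls = []
    for line in engines_file.read_text().splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            # Placeholder token is assumed; check search-engines.txt for the real one
            urls.append(line.replace("{{QUERY}}", quote_plus(query)))
    return urls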

Link Cleaning (linkclean.sh):

  • Removes search engine domains
  • Strips tracking parameters (utm_*, fbclid, etc.)
  • Filters query parameters
  • Deduplicates URLs
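
A minimal Python sketch of the same cleaning rules, useful when porting linkclean.sh; the tracking-parameter list here is illustrative, and the real script additionally drops search-engine domains:

from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse

TRACKING_PARAMS = {"fbclid", "gclid", "msclkid"}

def clean_url(url: str) -> str:
    """Strip tracking parameters from a single URL."""
    parts = urlparse(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query)
            if not k.startswith("utm_") and k not in TRACKING_PARAMS]
    return urlunparse(parts._replace(query=urlencode(kept)))

def clean_links(urls: list[str]) -> list[str]:
    """Clean and deduplicate a batch of extracted links."""
    return sorted({clean_url(u) for u in urls})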

Phase 4: Content Processing

flowchart LR
    URL --> FETCH[Fetch Engine]
    FETCH --> HTML[Raw HTML]
    HTML --> SOUP[soup.py]
    SOUP --> CLEAN[Cleaned HTML]
    CLEAN --> H2M[html2markdown]
    H2M --> MD[Markdown]
    MD --> FASTCOMP[fastcomp]
    FASTCOMP --> SCORE[Similarity JSON]

soup.py Processing:

  • Removes: scripts, styles, nav, header, footer, aside, img, links
  • Clears all HTML attributes
  • Removes empty elements
  • Uses BeautifulSoup + readability
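
The core of that cleaning step can be approximated in a few lines of BeautifulSoup; this sketch omits the readability-lxml pass and empty-element pruning that soup.py also performs:

import sys
from bs4 import BeautifulSoup

# Element types stripped outright, per the list above
STRIP = ["script", "style", "nav", "header", "footer", "aside", "img", "a"]

def clean_html(html: str) -> str:
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(STRIP):           # remove boilerplate elements entirely
        tag.decompose()
    for tag in soup.find_all(True):   # clear attributes on everything that remains
        tag.attrs = {}
    return str(soup)

if __name__ == "__main__":
    sys.stdout.write(clean_html(sys.stdin.read()))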

Scoring Pipeline:

# Fetch, clean, and convert a page, then fan the markdown out to
# two scoring branches and the pages/ archive in a single pass
fetch_with_curl "${url}" \
    | ${SOUP} \
    | ${HTML2MARKDOWN} \
    | tee \
        >(${FASTCOMP} | ${STATS} --json > "stats/${md5}.json") \
        >(${FASTCOMP} --json > "stats/${md5}-fastcomp.json") \
        > "pages/${md5}.txt"

Phase 5: Evaluation & Integration

flowchart TD
    STATS[stats/*.json] --> ES[evaluate_results_simple.py]
    ES --> |inheritance| EE[evaluate_results_enhanced.py]
    EE --> |Ollama| INSIGHTS[AI Insights]
    INSIGHTS --> KI[knowledge_integrator.py]
    KI --> |Ollama| RESPONSE[Integrated Response]
    RESPONSE --> EXT[extend_results.py]
    EXT --> |Ollama| GAPS[Gap Analysis]
    GAPS --> CHILD[Child Batch Queries]

Evaluation Chain:

  1. evaluate_results_simple.py - Load stats, calculate aggregates
  2. evaluate_results_enhanced.py - Add AI summaries, insights
  3. knowledge_integrator.py - Synthesize coherent response
  4. extend_results.py - Identify gaps, generate child queries
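
The inheritance relationship between the two evaluators (noted in the flowchart above) could look like this in a consolidated src/crawl/evaluate package; the stats field names here are assumptions, not the actual JSON schema:

import json
from pathlib import Path
from statistics import mean

class SimpleEvaluator:
    """Stats-only pass: load stats/*.json and compute aggregates."""

    def __init__(self, stats_dir: Path):
        self.records = [json.loads(p.read_text()) for p in stats_dir.glob("*.json")]

    def aggregates(self) -> dict:
        scores = [r["mean"] for r in self.records if "mean" in r]
        return {
            "pages": len(self.records),
            "mean_similarity": mean(scores) if scores else 0.0,
        }

class EnhancedEvaluator(SimpleEvaluator):
    """Adds LLM-generated insights on top of the simple aggregates."""

    def insights(self, model: str = "qwen3:8b") -> str:
        import ollama
        summary = json.dumps(self.aggregates())
        reply = ollama.generate(model=model,
                                prompt=f"Summarize these crawl statistics: {summary}")
        return reply["response"]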

Configuration

.env Variables

# Ollama
OLLAMA_BASE_URL=http://127.0.0.1:11434
OLLAMA_EMBEDDING_MODEL=nomic-embed-text
OLLAMA_CHAT_MODEL=qwen3:8b
OLLAMA_PREMIUM_MODEL=mistral-small3.2:24b

# Vector Store
VECTOR_DIM=768
QUALITY_THRESHOLD=0.4

# Processing
MAX_PARALLEL_PROCESSES=10
PROCESSING_TIMEOUT=300
CHILD_BATCH_TIMEOUT=1800

CrawwwlConfig (Python)

Located in src/crawwwl/core/config.py:

  • Dataclass with all configuration
  • Loads from .env file
  • Path resolution
  • Logging setup
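
A minimal sketch of what a consolidated config dataclass could look like, assuming the .env variables listed above; the field names are guesses, not the actual CrawwwlConfig schema:

import os
from dataclasses import dataclass
from pathlib import Path

from dotenv import load_dotenv  # python-dotenv, listed under Python Dependencies

@dataclass
class CrawwwlConfig:
    root: Path = Path(".")
    ollama_base_url: str = "http://127.0.0.1:11434"
    chat_model: str = "qwen3:8b"
    premium_model: str = "mistral-small3.2:24b"
    quality_threshold: float = 0.4
    max_parallel: int = 10

    @classmethod
    def from_env(cls, env_file: str = ".env") -> "CrawwwlConfig":
        load_dotenv(env_file)
        return cls(
            ollama_base_url=os.getenv("OLLAMA_BASE_URL", cls.ollama_base_url),
            chat_model=os.getenv("OLLAMA_CHAT_MODEL", cls.chat_model),
            premium_model=os.getenv("OLLAMA_PREMIUM_MODEL", cls.premium_model),
            quality_threshold=float(os.getenv("QUALITY_THRESHOLD", "0.4")),
            max_parallel=int(os.getenv("MAX_PARALLEL_PROCESSES", "10")),
        )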

Consolidation Plan

Components to Port

Priority  Component                Target Location      Notes
1         crawl.sh                 scripts/crawl/       Refactor paths
1         soup.py                  src/crawl/           Python module
1         linkclean.sh             scripts/crawl/       Shell utility
2         evaluate_*.py            src/crawl/evaluate/  Python package
2         knowledge_integrator.py  src/crawl/           Python module
2         extend_results.py        src/crawl/           Python module
3         src/crawwwl/             src/crawl/           Core library
3         prompts/                 data/prompts/        Static data
3         search-engines.txt       data/                Static data

External Repos to Reference

Repo              Status              Action
vectl             Already in extern   Link/import
guzl              Already in extern   Link/import
html-to-markdown  External Go binary  Keep as binary
curl-impersonate  External            Keep as binary
stats             Simple tool         Keep or rewrite
top-user-agents   Data only           Copy JSON files

Refactoring Tasks

  1. Path Abstraction
       • Replace hardcoded /home/bisenbek/projects/crawwwl
       • Use relative paths from config

  2. Configuration Consolidation
       • Merge with cbintel config system
       • Single .env location

  3. Python Package Structure

     src/crawl/
     ├── __init__.py
     ├── config.py
     ├── pipeline.py
     ├── soup.py
     ├── evaluate/
     │   ├── __init__.py
     │   ├── simple.py
     │   └── enhanced.py
     ├── integrate.py
     └── extend.py

  4. Shell Script Cleanup
       • Remove debugging artifacts
       • Parameterize ROOT path
       • Add proper error handling

Key Insights

Strengths

  • Multi-engine search for comprehensive coverage
  • LLM-enhanced query expansion
  • Vector similarity scoring for relevance
  • Recursive child batch expansion
  • Multiple prompt types for different use cases

Weaknesses

  • Hardcoded paths throughout
  • Complex shell pipeline (fragile)
  • Mixed Python 2/3 patterns
  • No tests
  • Heavy Ollama dependency

Recommendations

  1. Port Python components first (cleaner)
  2. Refactor shell scripts to use config
  3. Add an abstraction layer for LLM access (not just Ollama); see the sketch after this list
  4. Implement proper error recovery
  5. Add integration tests
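
For recommendation 3, a thin protocol would keep the pipeline backend-agnostic; a sketch, with illustrative class and function names:

from typing import Protocol

class LLMBackend(Protocol):
    """The minimal interface the pipeline actually needs."""
    def complete(self, prompt: str) -> str: ...

class OllamaBackend:
    def __init__(self, model: str = "qwen3:8b"):
        self.model = model

    def complete(self, prompt: str) -> str:
        import ollama
        return ollama.generate(model=self.model, prompt=prompt)["response"]

# Callers depend only on the protocol, so another backend (a hosted API,
# llama.cpp, or a stub for tests) can be swapped in without touching
# pipeline code.
def synthesize(llm: LLMBackend, context: str) -> str:
    return llm.complete(f"Synthesize the key findings from:\n{context}")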

Quick Reference

Run Crawl

cd extern/crawwwl
./crawl.sh "search query" --prompt investigative

Output Structure

working/{batch_id}/
├── links/links.txt          # Cleaned URLs
├── pages/*.txt              # Markdown content
├── stats/*.json             # Similarity scores
└── analysis/
    ├── evaluation_report_*.json
    ├── enhanced_evaluation_report_*.json
    └── integrated_response_*.json

Key Functions (crawl.sh)

Function                      Purpose
main()                        Orchestrates entire pipeline
initialize_globals()          Setup environment
generate_enhanced_queries()   LLM query expansion
generate_search_results()     Search engine crawling
extract_links()               Link extraction + cleaning
process_urls_with_engines()   Content fetching
process_crawl_urls()          Main processing loop