CRAWWWL Analysis - Dependency Map and Consolidation Guide

Analysis of extern/crawwwl for consolidation into cbintel.

Overview

Crawwwl is an AI-powered web crawling and knowledge synthesis system that:

  1. Expands search queries using an LLM
  2. Searches across 24+ search engines
  3. Fetches content and converts it to markdown
  4. Analyzes relevance using vector similarity
  5. Synthesizes knowledge using an LLM
  6. Recursively expands with child batches

Architecture

flowchart TD
    subgraph Input
        Q[Search Query]
        PT[Prompt Type]
    end

    subgraph QueryExpansion["Query Expansion (LLM)"]
        GEQ[generate_enhanced_queries]
        OLLAMA1[Ollama: mistral-small]
        PROMPTS[prompts/*.txt]
    end

    subgraph SearchPhase["Search Phase"]
        SE[search-engines.txt]
        GSR[generate_search_results]
        LYNX[Lynx]
        CURL[curl-impersonate]
        GUZL[Guzl/Playwright]
    end

    subgraph LinkProcessing["Link Processing"]
        EL[extract_links]
        LC[linkclean.sh]
        LINKS[links.txt]
    end

    subgraph ContentFetch["Content Fetching"]
        PUE[process_urls_with_engines]
        SOUP[soup.py]
        H2M[html2markdown]
        PAGES[pages/*.txt]
    end

    subgraph Analysis["Analysis & Scoring"]
        PCU[process_crawl_urls]
        FASTCOMP[fastcomp]
        STATS[stats]
        JSON[stats/*.json]
    end

    subgraph PostProcess["Post-Processing (Python)"]
        EVAL[evaluate_results.py]
        EVALS[evaluate_results_simple.py]
        EVALE[evaluate_results_enhanced.py]
        KI[knowledge_integrator.py]
        EXT[extend_results.py]
    end

    subgraph Output
        REPORT[Analysis Report]
        RESPONSE[Integrated Response]
        CHILD[Child Batches]
    end

    Q --> GEQ
    PT --> GEQ
    PROMPTS --> GEQ
    GEQ --> OLLAMA1
    OLLAMA1 --> GSR
    SE --> GSR
    GSR --> LYNX & CURL & GUZL
    LYNX & CURL & GUZL --> EL
    EL --> LC --> LINKS
    LINKS --> PUE
    PUE --> LYNX & CURL & GUZL
    LYNX & CURL & GUZL --> SOUP --> H2M --> PAGES
    PAGES --> PCU
    PCU --> FASTCOMP & STATS --> JSON
    JSON --> EVAL & EVALS & EVALE
    EVALE --> KI --> RESPONSE
    KI --> EXT --> CHILD
    CHILD -.->|recursive| Q

Directory Structure

extern/crawwwl/
├── crawl.sh                    # Main entry point (817 lines)
├── crawl-single.sh             # Single URL processing
├── crawl-orchestrator.sh       # Multi-batch orchestration
├── auto-recovery.sh            # Crash recovery daemon
├── soup.py                     # HTML cleaning (BeautifulSoup)
├── linkclean.sh                # URL filtering/cleaning
├── evaluate_results.py         # Basic evaluation
├── evaluate_results_simple.py  # Stats-only evaluation
├── evaluate_results_enhanced.py # AI-enhanced evaluation
├── knowledge_integrator.py     # Knowledge synthesis
├── extend_results.py           # Child batch generation
├── src/crawwwl/
│   ├── core/
│   │   ├── config.py           # CrawwwlConfig dataclass
│   │   ├── pipeline.py         # Pipeline abstraction
│   │   ├── chunker.py          # Text chunking
│   │   └── semantic_recovery.py
│   ├── discovery/
│   │   ├── autonomous.py       # Autonomous discovery
│   │   └── semantic_seed_filter.py
│   └── optimization/
│       └── parameter_optimizer.py
├── prompts/                    # LLM prompt templates
│   ├── investigative.txt
│   ├── question_answering.txt
│   ├── technical_research.txt
│   ├── competitive_analysis.txt
│   └── trend_analysis.txt
├── search-engines.txt          # 24 search engine URLs
├── .env / .env.example         # Configuration
├── repos/                      # External dependencies
│   ├── guzl/                   # Playwright CLI
│   ├── vectl/                  # Vector store (C++)
│   ├── html-to-markdown/       # Go HTML converter
│   ├── curl-impersonate/       # Fingerprint curl
│   ├── stats/                  # Statistics tool
│   ├── top-user-agents/        # User agent DB
│   └── ...
└── working/                    # Runtime data
    └── {batch_id}/
        ├── html/               # Raw HTML
        ├── pages/              # Converted markdown
        ├── links/              # Extracted URLs
        ├── stats/              # Similarity scores
        └── analysis/           # Reports

Dependency Map

External Binaries (repos/)

Binary         Source                                Purpose                        Required
guzl           repos/guzl/dist/guzl-linux            Playwright browser automation  Yes
fastcomp       repos/vectl/build/fastcomp            Vector similarity scoring      Yes
stats          repos/stats/stats                     Statistical analysis           Yes
html2markdown  repos/html-to-markdown/html2markdown  HTML to Markdown conversion    Yes
curl_*         repos/curl-impersonate/chrome/        Browser-fingerprint curl       Optional

System Commands

Command  Purpose                             Required
lynx     Text-mode browser, link extraction  Yes
curl     HTTP client (fallback)              Yes
jq       JSON processing                     Yes
md5sum   URL hashing                         Yes
ollama   Local LLM inference                 Yes

Python Dependencies

# Core
beautifulsoup4      # soup.py - HTML parsing
readability-lxml    # soup.py - content extraction
ollama              # LLM client
python-dotenv       # Configuration

# Implicit
requests            # HTTP (optional)
pathlib             # File paths (stdlib)
dataclasses         # Config (stdlib)

Data Dependencies

File                              Purpose
search-engines.txt                24 search engine URL templates
prompts/*.txt                     LLM prompt templates (5 types)
repos/top-user-agents/src/*.json  User agent strings
.env                              Configuration (Ollama, thresholds)

Processing Pipeline Detail

Phase 1: Initialization (initialize_globals)

flowchart LR
    A[Query] --> B[validate_binaries]
    B --> C[Select curl variant]
    C --> D[Get user agent]
    D --> E[Setup directories]

  • Validates all required binaries exist
  • Selects curl-impersonate variant (chrome/safari/ff/edge)
  • Matches user agent to browser type
  • Creates batch directory structure
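
When porting to Python, this preflight check translates directly. A minimal sketch, assuming the binary paths from the dependency map above; the function name mirrors validate_binaries but is otherwise illustrative, not taken from crawl.sh:

import shutil
from pathlib import Path

# Binary paths taken from the External Binaries table above
REQUIRED_BINARIES = {
    "guzl": Path("repos/guzl/dist/guzl-linux"),
    "fastcomp": Path("repos/vectl/build/fastcomp"),
    "stats": Path("repos/stats/stats"),
    "html2markdown": Path("repos/html-to-markdown/html2markdown"),
}
REQUIRED_COMMANDS = ["lynx", "curl", "jq", "md5sum", "ollama"]

def validate_binaries(root: Path) -> list[str]:
    """Return the names of missing dependencies (empty list means all present)."""
    missing = [name for name, rel in REQUIRED_BINARIES.items()
               if not (root / rel).is_file()]
    missing += [cmd for cmd in REQUIRED_COMMANDS if shutil.which(cmd) is None]
    return missing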

Phase 2: Query Enhancement

generate_enhanced_queries() {
    # Substitute the query into the prompt template, run it through
    # Ollama, then extract the resulting JSON array of queries
    sed "s|{{QUERY}}|${query}|g" "${ROOT}/prompts/${PROMPT_TYPE}.txt" \
        | ollama run mistral-small3.2:24b \
        | jq -rc '.[]'
}
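
A hypothetical Python equivalent, using the ollama client from the Python dependencies list; the JSON-array response format is implied by the jq '.[]' extraction above:

import json
from pathlib import Path

import ollama  # Python client listed under Python Dependencies

def generate_enhanced_queries(query: str, prompt_type: str, root: Path) -> list[str]:
    # Load the prompt template and substitute the query placeholder
    template = (root / "prompts" / f"{prompt_type}.txt").read_text()
    prompt = template.replace("{{QUERY}}", query)
    # The templates instruct the model to reply with a JSON array of queries
    reply = ollama.generate(model="mistral-small3.2:24b", prompt=prompt)
    return json.loads(reply["response"])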

Prompt Types:

  • investigative - Journalist perspective, 5 critical questions
  • question_answering - Direct Q&A format
  • technical_research - Technical deep-dive
  • competitive_analysis - Market/competitor focus
  • trend_analysis - Trend identification

Phase 3: Search & Link Extraction

flowchart TD
    Q[Enhanced Query] --> SE[24 Search Engines]
    SE --> |Parallel| F1[Lynx -listonly]
    SE --> |Parallel| F2[Curl + Lynx parse]
    SE --> |Parallel| F3[Guzl + Lynx parse]
    F1 & F2 & F3 --> LC[linkclean.sh]
    LC --> |Filter| LINKS[links.txt]

Search Engines (search-engines.txt):

  • Google, DuckDuckGo, Bing, Yahoo, Yandex
  • Brave, Ecosia, Mojeek, Qwant, Startpage
  • Metacrawler, Dogpile, Carrot2, You.com
  • And 10+ more specialized engines
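
As a sketch of how the engine templates are consumed, here is a hypothetical Python helper. The exact placeholder token in search-engines.txt is not shown in this document, so {{QUERY}} below is an assumption:

from pathlib import Path
from urllib.parse import quote_plus

def build_search_urls(query: str, engines_file: Path) -> list[str]:
    """Expand one query against every engine URL template."""
    urls = []
    for line in engines_file.read_text().splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            # Placeholder token is assumed; check search-engines.txt for the real one
            urls.append(line.replace("{{QUERY}}", quote_plus(query)))
    return urls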

Link Cleaning (linkclean.sh):

  • Removes search engine domains
  • Strips tracking parameters (utm_*, fbclid, etc.)
  • Filters query parameters
  • Deduplicates URLs
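
A minimal Python sketch of the same cleaning rules, useful when porting linkclean.sh; the tracking-parameter list here is illustrative, and the real script additionally drops search-engine domains:

from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse

TRACKING_PARAMS = {"fbclid", "gclid", "msclkid"}

def clean_url(url: str) -> str:
    """Strip tracking parameters from a single URL."""
    parts = urlparse(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query)
            if not k.startswith("utm_") and k not in TRACKING_PARAMS]
    return urlunparse(parts._replace(query=urlencode(kept)))

def clean_links(urls: list[str]) -> list[str]:
    """Clean and deduplicate a batch of extracted links."""
    return sorted({clean_url(u) for u in urls})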

Phase 4: Content Processing

flowchart LR
    URL --> FETCH[Fetch Engine]
    FETCH --> HTML[Raw HTML]
    HTML --> SOUP[soup.py]
    SOUP --> CLEAN[Cleaned HTML]
    CLEAN --> H2M[html2markdown]
    H2M --> MD[Markdown]
    MD --> FASTCOMP[fastcomp]
    FASTCOMP --> SCORE[Similarity JSON]

soup.py Processing:

  • Removes: scripts, styles, nav, header, footer, aside, img, links
  • Clears all HTML attributes
  • Removes empty elements
  • Uses BeautifulSoup + readability
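
The core of that cleaning step can be approximated in a few lines of BeautifulSoup; this sketch omits the readability-lxml pass and empty-element pruning that soup.py also performs:

import sys
from bs4 import BeautifulSoup

# Element types stripped outright, per the list above
STRIP = ["script", "style", "nav", "header", "footer", "aside", "img", "a"]

def clean_html(html: str) -> str:
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(STRIP):           # remove boilerplate elements entirely
        tag.decompose()
    for tag in soup.find_all(True):   # clear attributes on everything that remains
        tag.attrs = {}
    return str(soup)

if __name__ == "__main__":
    sys.stdout.write(clean_html(sys.stdin.read()))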

Scoring Pipeline:

# Fetch, clean, and convert a page, then fan the markdown out to
# two scoring branches and the pages/ archive in a single pass
fetch_with_curl "${url}" \
    | ${SOUP} \
    | ${HTML2MARKDOWN} \
    | tee \
        >(${FASTCOMP} | ${STATS} --json > "stats/${md5}.json") \
        >(${FASTCOMP} --json > "stats/${md5}-fastcomp.json") \
        > "pages/${md5}.txt"

Phase 5: Evaluation & Integration

flowchart TD
    STATS[stats/*.json] --> ES[evaluate_results_simple.py]
    ES --> |inheritance| EE[evaluate_results_enhanced.py]
    EE --> |Ollama| INSIGHTS[AI Insights]
    INSIGHTS --> KI[knowledge_integrator.py]
    KI --> |Ollama| RESPONSE[Integrated Response]
    RESPONSE --> EXT[extend_results.py]
    EXT --> |Ollama| GAPS[Gap Analysis]
    GAPS --> CHILD[Child Batch Queries]

Evaluation Chain:

  1. evaluate_results_simple.py - Load stats, calculate aggregates
  2. evaluate_results_enhanced.py - Add AI summaries, insights
  3. knowledge_integrator.py - Synthesize coherent response
  4. extend_results.py - Identify gaps, generate child queries
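
The inheritance relationship between the two evaluators (noted in the flowchart above) could look like this in a consolidated src/crawl/evaluate package; the stats field names here are assumptions, not the actual JSON schema:

import json
from pathlib import Path
from statistics import mean

class SimpleEvaluator:
    """Stats-only pass: load stats/*.json and compute aggregates."""

    def __init__(self, stats_dir: Path):
        self.records = [json.loads(p.read_text()) for p in stats_dir.glob("*.json")]

    def aggregates(self) -> dict:
        scores = [r["mean"] for r in self.records if "mean" in r]
        return {
            "pages": len(self.records),
            "mean_similarity": mean(scores) if scores else 0.0,
        }

class EnhancedEvaluator(SimpleEvaluator):
    """Adds LLM-generated insights on top of the simple aggregates."""

    def insights(self, model: str = "qwen3:8b") -> str:
        import ollama
        summary = json.dumps(self.aggregates())
        reply = ollama.generate(model=model,
                                prompt=f"Summarize these crawl statistics: {summary}")
        return reply["response"]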

Configuration

.env Variables

# Ollama
OLLAMA_BASE_URL=http://127.0.0.1:11434
OLLAMA_EMBEDDING_MODEL=nomic-embed-text
OLLAMA_CHAT_MODEL=qwen3:8b
OLLAMA_PREMIUM_MODEL=mistral-small3.2:24b

# Vector Store
VECTOR_DIM=768
QUALITY_THRESHOLD=0.4

# Processing
MAX_PARALLEL_PROCESSES=10
PROCESSING_TIMEOUT=300
CHILD_BATCH_TIMEOUT=1800

CrawwwlConfig (Python)

Located in src/crawwwl/core/config.py:

  • Dataclass with all configuration
  • Loads from .env file
  • Path resolution
  • Logging setup
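
A minimal sketch of what a consolidated config dataclass could look like, assuming the .env variables listed above; the field names are guesses, not the actual CrawwwlConfig schema:

import os
from dataclasses import dataclass
from pathlib import Path

from dotenv import load_dotenv  # python-dotenv, listed under Python Dependencies

@dataclass
class CrawwwlConfig:
    root: Path = Path(".")
    ollama_base_url: str = "http://127.0.0.1:11434"
    chat_model: str = "qwen3:8b"
    premium_model: str = "mistral-small3.2:24b"
    quality_threshold: float = 0.4
    max_parallel: int = 10

    @classmethod
    def from_env(cls, env_file: str = ".env") -> "CrawwwlConfig":
        load_dotenv(env_file)
        return cls(
            ollama_base_url=os.getenv("OLLAMA_BASE_URL", cls.ollama_base_url),
            chat_model=os.getenv("OLLAMA_CHAT_MODEL", cls.chat_model),
            premium_model=os.getenv("OLLAMA_PREMIUM_MODEL", cls.premium_model),
            quality_threshold=float(os.getenv("QUALITY_THRESHOLD", "0.4")),
            max_parallel=int(os.getenv("MAX_PARALLEL_PROCESSES", "10")),
        )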

Consolidation Plan

Components to Port

Priority  Component                Target Location      Notes
1         crawl.sh                 scripts/crawl/       Refactor paths
1         soup.py                  src/crawl/           Python module
1         linkclean.sh             scripts/crawl/       Shell utility
2         evaluate_*.py            src/crawl/evaluate/  Python package
2         knowledge_integrator.py  src/crawl/           Python module
2         extend_results.py        src/crawl/           Python module
3         src/crawwwl/             src/crawl/           Core library
3         prompts/                 data/prompts/        Static data
3         search-engines.txt       data/                Static data

External Repos to Reference

Repo              Status              Action
vectl             Already in extern   Link/import
guzl              Already in extern   Link/import
html-to-markdown  External Go binary  Keep as binary
curl-impersonate  External            Keep as binary
stats             Simple tool         Keep or rewrite
top-user-agents   Data only           Copy JSON files

Refactoring Tasks

  1. Path Abstraction
       • Replace hardcoded /home/bisenbek/projects/crawwwl
       • Use relative paths from config

  2. Configuration Consolidation
       • Merge with cbintel config system
       • Single .env location

  3. Python Package Structure

     src/crawl/
     ├── __init__.py
     ├── config.py
     ├── pipeline.py
     ├── soup.py
     ├── evaluate/
     │   ├── __init__.py
     │   ├── simple.py
     │   └── enhanced.py
     ├── integrate.py
     └── extend.py

  4. Shell Script Cleanup
       • Remove debugging artifacts
       • Parameterize ROOT path
       • Add proper error handling

Key Insights

Strengths

  • Multi-engine search for comprehensive coverage
  • LLM-enhanced query expansion
  • Vector similarity scoring for relevance
  • Recursive child batch expansion
  • Multiple prompt types for different use cases

Weaknesses

  • Hardcoded paths throughout
  • Complex shell pipeline (fragile)
  • Mixed Python 2/3 patterns
  • No tests
  • Heavy Ollama dependency

Recommendations

  1. Port Python components first (cleaner)
  2. Refactor shell scripts to use config
  3. Add an abstraction layer for LLM access (not just Ollama); see the sketch after this list
  4. Implement proper error recovery
  5. Add integration tests
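
For recommendation 3, a thin protocol would keep the pipeline backend-agnostic; a sketch, with illustrative class and function names:

from typing import Protocol

class LLMBackend(Protocol):
    """The minimal interface the pipeline actually needs."""
    def complete(self, prompt: str) -> str: ...

class OllamaBackend:
    def __init__(self, model: str = "qwen3:8b"):
        self.model = model

    def complete(self, prompt: str) -> str:
        import ollama
        return ollama.generate(model=self.model, prompt=prompt)["response"]

# Callers depend only on the protocol, so another backend (a hosted API,
# llama.cpp, or a stub for tests) can be swapped in without touching
# pipeline code.
def synthesize(llm: LLMBackend, context: str) -> str:
    return llm.complete(f"Synthesize the key findings from:\n{context}")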

Quick Reference

Run Crawl

cd extern/crawwwl
./crawl.sh "search query" --prompt investigative

Output Structure

working/{batch_id}/
├── links/links.txt          # Cleaned URLs
├── pages/*.txt              # Markdown content
├── stats/*.json             # Similarity scores
└── analysis/
    ├── evaluation_report_*.json
    ├── enhanced_evaluation_report_*.json
    └── integrated_response_*.json

Key Functions (crawl.sh)

Function                      Purpose
main()                        Orchestrates entire pipeline
initialize_globals()          Setup environment
generate_enhanced_queries()   LLM query expansion
generate_search_results()     Search engine crawling
extract_links()               Link extraction + cleaning
process_urls_with_engines()   Content fetching
process_crawl_urls()          Main processing loop