# CRAWWWL Analysis - Dependency Map and Consolidation Guide

Analysis of `extern/crawwwl` for consolidation into `cbintel`.
## Overview

Crawwwl is an AI-powered web crawling and knowledge synthesis system that:

1. Expands search queries using an LLM
2. Searches across 24+ search engines
3. Fetches and converts content to markdown
4. Analyzes relevance using vector similarity
5. Synthesizes knowledge using an LLM
6. Recursively expands with child batches
## Architecture

```mermaid
flowchart TD
    subgraph Input
        Q[Search Query]
        PT[Prompt Type]
    end
    subgraph QueryExpansion["Query Expansion (LLM)"]
        GEQ[generate_enhanced_queries]
        OLLAMA1[Ollama: mistral-small]
        PROMPTS[prompts/*.txt]
    end
    subgraph SearchPhase["Search Phase"]
        SE[search-engines.txt]
        GSR[generate_search_results]
        LYNX[Lynx]
        CURL[curl-impersonate]
        GUZL[Guzl/Playwright]
    end
    subgraph LinkProcessing["Link Processing"]
        EL[extract_links]
        LC[linkclean.sh]
        LINKS[links.txt]
    end
    subgraph ContentFetch["Content Fetching"]
        PUE[process_urls_with_engines]
        SOUP[soup.py]
        H2M[html2markdown]
        PAGES[pages/*.txt]
    end
    subgraph Analysis["Analysis & Scoring"]
        PCU[process_crawl_urls]
        FASTCOMP[fastcomp]
        STATS[stats]
        JSON[stats/*.json]
    end
    subgraph PostProcess["Post-Processing (Python)"]
        EVAL[evaluate_results.py]
        EVALS[evaluate_results_simple.py]
        EVALE[evaluate_results_enhanced.py]
        KI[knowledge_integrator.py]
        EXT[extend_results.py]
    end
    subgraph Output
        REPORT[Analysis Report]
        RESPONSE[Integrated Response]
        CHILD[Child Batches]
    end

    Q --> GEQ
    PT --> GEQ
    PROMPTS --> GEQ
    GEQ --> OLLAMA1
    OLLAMA1 --> GSR
    SE --> GSR
    GSR --> LYNX & CURL & GUZL
    LYNX & CURL & GUZL --> EL
    EL --> LC --> LINKS
    LINKS --> PUE
    PUE --> LYNX & CURL & GUZL
    LYNX & CURL & GUZL --> SOUP --> H2M --> PAGES
    PAGES --> PCU
    PCU --> FASTCOMP & STATS --> JSON
    JSON --> EVAL & EVALS & EVALE
    EVALE --> KI --> RESPONSE
    KI --> EXT --> CHILD
    CHILD -.->|recursive| Q
```
## Directory Structure

```text
extern/crawwwl/
├── crawl.sh                      # Main entry point (817 lines)
├── crawl-single.sh               # Single URL processing
├── crawl-orchestrator.sh         # Multi-batch orchestration
├── auto-recovery.sh              # Crash recovery daemon
│
├── soup.py                       # HTML cleaning (BeautifulSoup)
├── linkclean.sh                  # URL filtering/cleaning
├── evaluate_results.py           # Basic evaluation
├── evaluate_results_simple.py    # Stats-only evaluation
├── evaluate_results_enhanced.py  # AI-enhanced evaluation
├── knowledge_integrator.py       # Knowledge synthesis
├── extend_results.py             # Child batch generation
│
├── src/crawwwl/
│   ├── core/
│   │   ├── config.py             # CrawwwlConfig dataclass
│   │   ├── pipeline.py           # Pipeline abstraction
│   │   ├── chunker.py            # Text chunking
│   │   └── semantic_recovery.py
│   ├── discovery/
│   │   ├── autonomous.py         # Autonomous discovery
│   │   └── semantic_seed_filter.py
│   └── optimization/
│       └── parameter_optimizer.py
│
├── prompts/                      # LLM prompt templates
│   ├── investigative.txt
│   ├── question_answering.txt
│   ├── technical_research.txt
│   ├── competitive_analysis.txt
│   └── trend_analysis.txt
│
├── search-engines.txt            # 24 search engine URLs
├── .env / .env.example           # Configuration
│
├── repos/                        # External dependencies
│   ├── guzl/                     # Playwright CLI
│   ├── vectl/                    # Vector store (C++)
│   ├── html-to-markdown/         # Go HTML converter
│   ├── curl-impersonate/         # Fingerprint curl
│   ├── stats/                    # Statistics tool
│   ├── top-user-agents/          # User agent DB
│   └── ...
│
└── working/                      # Runtime data
    └── {batch_id}/
        ├── html/                 # Raw HTML
        ├── pages/                # Converted markdown
        ├── links/                # Extracted URLs
        ├── stats/                # Similarity scores
        └── analysis/             # Reports
```
## Dependency Map

### External Binaries (`repos/`)

| Binary | Source | Purpose | Required |
|---|---|---|---|
| `guzl` | `repos/guzl/dist/guzl-linux` | Playwright browser automation | Yes |
| `fastcomp` | `repos/vectl/build/fastcomp` | Vector similarity scoring | Yes |
| `stats` | `repos/stats/stats` | Statistical analysis | Yes |
| `html2markdown` | `repos/html-to-markdown/html2markdown` | HTML to Markdown | Yes |
| `curl_*` | `repos/curl-impersonate/chrome/` | Browser-fingerprint curl | Optional |
### System Commands

| Command | Purpose | Required |
|---|---|---|
| `lynx` | Text-mode browser, link extraction | Yes |
| `curl` | HTTP client (fallback) | Yes |
| `jq` | JSON processing | Yes |
| `md5sum` | URL hashing | Yes |
| `ollama` | Local LLM inference | Yes |
### Python Dependencies

```text
# Core
beautifulsoup4     # soup.py - HTML parsing
readability-lxml   # soup.py - content extraction
ollama             # LLM client
python-dotenv      # Configuration

# Implicit
requests           # HTTP (optional)
pathlib            # File paths (stdlib)
dataclasses        # Config (stdlib)
```
### Data Dependencies

| File | Purpose |
|---|---|
| `search-engines.txt` | 24 search engine URL templates |
| `prompts/*.txt` | LLM prompt templates (5 types) |
| `repos/top-user-agents/src/*.json` | User agent strings |
| `.env` | Configuration (Ollama, thresholds) |
## Processing Pipeline Detail

### Phase 1: Initialization (`initialize_globals`)

```mermaid
flowchart LR
    A[Query] --> B[validate_binaries]
    B --> C[Select curl variant]
    C --> D[Get user agent]
    D --> E[Setup directories]
```

- Validates all required binaries exist
- Selects curl-impersonate variant (chrome/safari/ff/edge)
- Matches user agent to browser type
- Creates batch directory structure
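
The shell implementation of these checks lives in `initialize_globals()` inside `crawl.sh`. As a rough, hypothetical Python sketch of the same fail-fast validation (binary paths taken from the dependency tables above; the helper itself is invented for illustration):

```python
"""Hypothetical fail-fast check mirroring what initialize_globals() does."""
import shutil
import sys
from pathlib import Path

ROOT = Path(__file__).resolve().parent  # stand-in for the crawwwl root

# Paths from the dependency tables above.
REQUIRED_BINARIES = {
    "guzl": ROOT / "repos/guzl/dist/guzl-linux",
    "fastcomp": ROOT / "repos/vectl/build/fastcomp",
    "stats": ROOT / "repos/stats/stats",
    "html2markdown": ROOT / "repos/html-to-markdown/html2markdown",
}
REQUIRED_COMMANDS = ["lynx", "curl", "jq", "md5sum", "ollama"]


def validate_binaries() -> None:
    """Exit early if any required binary or system command is missing."""
    missing = [str(path) for path in REQUIRED_BINARIES.values() if not path.is_file()]
    missing += [cmd for cmd in REQUIRED_COMMANDS if shutil.which(cmd) is None]
    if missing:
        sys.exit(f"Missing dependencies: {', '.join(missing)}")
```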
### Phase 2: Query Enhancement

```bash
generate_enhanced_queries() {
    # Load the prompt template, substitute the query, run it through Ollama,
    # then extract the resulting JSON array of enhanced queries
    sed "s|{{QUERY}}|${query}|g" "${ROOT}/prompts/${PROMPT_TYPE}.txt" \
        | ollama run mistral-small3.2:24b \
        | jq -rc '.[]'
}
```

Prompt Types:

- `investigative` - Journalist perspective, 5 critical questions
- `question_answering` - Direct Q&A format
- `technical_research` - Technical deep-dive
- `competitive_analysis` - Market/competitor focus
- `trend_analysis` - Trend identification
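
For the consolidation, the same expansion step could be expressed with the `ollama` Python client already listed in the dependencies. This is a speculative sketch, not the existing implementation; it assumes the `{{QUERY}}` placeholder convention above and that the prompt elicits a bare JSON array:

```python
"""Speculative Python version of the query-expansion step, not the real code."""
import json
from pathlib import Path

import ollama


def generate_enhanced_queries(query: str, prompt_type: str, root: Path) -> list[str]:
    # Load the prompt template and substitute the user query.
    template = (root / "prompts" / f"{prompt_type}.txt").read_text()
    prompt = template.replace("{{QUERY}}", query)

    # Ask the premium model; the template is expected to elicit a JSON array,
    # so a real implementation may need to strip extra text around it.
    response = ollama.generate(model="mistral-small3.2:24b", prompt=prompt)
    return json.loads(response["response"])
```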
### Phase 3: Search & Link Collection

```mermaid
flowchart TD
    Q[Enhanced Query] --> SE[24 Search Engines]
    SE --> |Parallel| F1[Lynx -listonly]
    SE --> |Parallel| F2[Curl + Lynx parse]
    SE --> |Parallel| F3[Guzl + Lynx parse]
    F1 & F2 & F3 --> LC[linkclean.sh]
    LC --> |Filter| LINKS[links.txt]
```

Search Engines (`search-engines.txt`):

- Google, DuckDuckGo, Bing, Yahoo, Yandex
- Brave, Ecosia, Mojeek, Qwant, Startpage
- Metacrawler, Dogpile, Carrot2, You.com
- And 10+ more specialized engines

Link Cleaning (`linkclean.sh`):

- Removes search engine domains
- Strips tracking parameters (utm_*, fbclid, etc.)
- Filters query parameters
- Deduplicates URLs
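
The actual filter rules live in `linkclean.sh`; the following Python sketch only illustrates the described behavior (domain filtering, tracking-parameter stripping, deduplication) with sample values, not the script's real rule set:

```python
"""Illustrative rendering of the cleaning rules; linkclean.sh is the source of truth."""
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Sample values only; the real lists come from linkclean.sh / search-engines.txt.
SEARCH_ENGINE_DOMAINS = {"google.com", "bing.com", "duckduckgo.com"}
TRACKING_PREFIXES = ("utm_",)
TRACKING_PARAMS = {"fbclid", "gclid"}


def clean_links(urls: list[str]) -> list[str]:
    seen: set[str] = set()
    cleaned: list[str] = []
    for url in urls:
        parts = urlsplit(url)
        host = parts.hostname or ""
        # Drop links pointing back at the search engines themselves.
        if any(host == d or host.endswith("." + d) for d in SEARCH_ENGINE_DOMAINS):
            continue
        # Strip tracking parameters (utm_*, fbclid, ...) and drop fragments.
        query = [
            (k, v)
            for k, v in parse_qsl(parts.query)
            if not k.startswith(TRACKING_PREFIXES) and k not in TRACKING_PARAMS
        ]
        normalized = urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(query), ""))
        # Deduplicate while preserving order.
        if normalized not in seen:
            seen.add(normalized)
            cleaned.append(normalized)
    return cleaned
```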
### Phase 4: Content Processing

```mermaid
flowchart LR
    URL --> FETCH[Fetch Engine]
    FETCH --> HTML[Raw HTML]
    HTML --> SOUP[soup.py]
    SOUP --> CLEAN[Cleaned HTML]
    CLEAN --> H2M[html2markdown]
    H2M --> MD[Markdown]
    MD --> FASTCOMP[fastcomp]
    FASTCOMP --> SCORE[Similarity JSON]
```

`soup.py` Processing:

- Removes: scripts, styles, nav, header, footer, aside, img, links
- Clears all HTML attributes
- Removes empty elements
- Uses BeautifulSoup + readability
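
A condensed sketch of that cleaning pass, assuming only `beautifulsoup4` (the real `soup.py` also uses `readability-lxml` and its exact rules may differ):

```python
"""Condensed sketch of the cleaning pass; the real soup.py also applies readability."""
import sys

from bs4 import BeautifulSoup

STRIP_TAGS = ["script", "style", "nav", "header", "footer", "aside", "img", "a"]


def clean_html(html: str) -> str:
    soup = BeautifulSoup(html, "html.parser")
    # Remove boilerplate elements (and, per the list above, images and links).
    for tag in soup.find_all(STRIP_TAGS):
        tag.decompose()
    # Clear every remaining attribute.
    for tag in soup.find_all(True):
        tag.attrs = {}
    # Drop empty elements, walking deepest-first so parents are checked last.
    for tag in reversed(soup.find_all(True)):
        if not tag.get_text(strip=True):
            tag.decompose()
    return str(soup)


if __name__ == "__main__":
    sys.stdout.write(clean_html(sys.stdin.read()))
```

Like `soup.py`, this reads HTML on stdin and writes cleaned HTML to stdout, so it would slot into the same pipeline position.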
Scoring Pipeline:

```bash
fetch_with_curl "${url}" \
    | ${SOUP} \
    | ${HTML2MARKDOWN} \
    | tee \
        >(${FASTCOMP} | ${STATS} --json > "stats/${md5}.json") \
        >(${FASTCOMP} --json > "stats/${md5}-fastcomp.json") \
        >(cat > "pages/${md5}.txt")
```
### Phase 5: Evaluation & Integration

```mermaid
flowchart TD
    STATS[stats/*.json] --> ES[evaluate_results_simple.py]
    ES --> |inheritance| EE[evaluate_results_enhanced.py]
    EE --> |Ollama| INSIGHTS[AI Insights]
    INSIGHTS --> KI[knowledge_integrator.py]
    KI --> |Ollama| RESPONSE[Integrated Response]
    RESPONSE --> EXT[extend_results.py]
    EXT --> |Ollama| GAPS[Gap Analysis]
    GAPS --> CHILD[Child Batch Queries]
```
Evaluation Chain:

1. `evaluate_results_simple.py` - Load stats, calculate aggregates
2. `evaluate_results_enhanced.py` - Add AI summaries, insights
3. `knowledge_integrator.py` - Synthesize coherent response
4. `extend_results.py` - Identify gaps, generate child queries
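
As an illustration of step 1, a minimal aggregation pass might look like the sketch below. The `"mean"` key is an assumption about what the `stats --json` output contains; the real `evaluate_results_simple.py` reads whatever schema the stats tool actually emits:

```python
"""Sketch of the aggregation in step 1; the stats JSON schema here is assumed."""
import json
import os
from pathlib import Path
from statistics import mean

QUALITY_THRESHOLD = float(os.getenv("QUALITY_THRESHOLD", "0.4"))


def evaluate_batch(batch_dir: Path) -> dict:
    scores: dict[str, float] = {}
    for stats_file in sorted((batch_dir / "stats").glob("*.json")):
        data = json.loads(stats_file.read_text())
        scores[stats_file.stem] = float(data.get("mean", 0.0))  # hypothetical key

    relevant = {page: s for page, s in scores.items() if s >= QUALITY_THRESHOLD}
    return {
        "pages_scored": len(scores),
        "pages_above_threshold": len(relevant),
        "mean_score": mean(scores.values()) if scores else 0.0,
        "relevant_pages": sorted(relevant, key=relevant.get, reverse=True),
    }
```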
## Configuration

### .env Variables

```bash
# Ollama
OLLAMA_BASE_URL=http://127.0.0.1:11434
OLLAMA_EMBEDDING_MODEL=nomic-embed-text
OLLAMA_CHAT_MODEL=qwen3:8b
OLLAMA_PREMIUM_MODEL=mistral-small3.2:24b

# Vector Store
VECTOR_DIM=768
QUALITY_THRESHOLD=0.4

# Processing
MAX_PARALLEL_PROCESSES=10
PROCESSING_TIMEOUT=300
CHILD_BATCH_TIMEOUT=1800
```
### CrawwwlConfig (Python)

Located in `src/crawwwl/core/config.py`:

- Dataclass with all configuration
- Loads from `.env` file
- Path resolution
- Logging setup
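
A trimmed, illustrative version of such a dataclass (field names mirror the `.env` variables shown above; the real `CrawwwlConfig` has more fields plus path resolution and logging setup):

```python
"""Trimmed illustration of a .env-backed config dataclass; not the real class."""
import os
from dataclasses import dataclass
from pathlib import Path

from dotenv import load_dotenv


@dataclass
class CrawwwlConfig:
    ollama_base_url: str = "http://127.0.0.1:11434"
    chat_model: str = "qwen3:8b"
    premium_model: str = "mistral-small3.2:24b"
    quality_threshold: float = 0.4
    max_parallel_processes: int = 10
    working_dir: Path = Path("working")

    @classmethod
    def from_env(cls, env_file: str = ".env") -> "CrawwwlConfig":
        load_dotenv(env_file)
        return cls(
            ollama_base_url=os.getenv("OLLAMA_BASE_URL", cls.ollama_base_url),
            chat_model=os.getenv("OLLAMA_CHAT_MODEL", cls.chat_model),
            premium_model=os.getenv("OLLAMA_PREMIUM_MODEL", cls.premium_model),
            quality_threshold=float(os.getenv("QUALITY_THRESHOLD", cls.quality_threshold)),
            max_parallel_processes=int(
                os.getenv("MAX_PARALLEL_PROCESSES", cls.max_parallel_processes)
            ),
        )
```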
## Consolidation Plan

### Components to Port

| Priority | Component | Target Location | Notes |
|---|---|---|---|
| 1 | `crawl.sh` | `scripts/crawl/` | Refactor paths |
| 1 | `soup.py` | `src/crawl/` | Python module |
| 1 | `linkclean.sh` | `scripts/crawl/` | Shell utility |
| 2 | `evaluate_*.py` | `src/crawl/evaluate/` | Python package |
| 2 | `knowledge_integrator.py` | `src/crawl/` | Python module |
| 2 | `extend_results.py` | `src/crawl/` | Python module |
| 3 | `src/crawwwl/` | `src/crawl/` | Core library |
| 3 | `prompts/` | `data/prompts/` | Static data |
| 3 | `search-engines.txt` | `data/` | Static data |
### External Repos to Reference

| Repo | Status | Action |
|---|---|---|
| `vectl` | Already in extern | Link/import |
| `guzl` | Already in extern | Link/import |
| `html-to-markdown` | External Go binary | Keep as binary |
| `curl-impersonate` | External | Keep as binary |
| `stats` | Simple tool | Keep or rewrite |
| `top-user-agents` | Data only | Copy JSON files |
### Refactoring Tasks

- **Path Abstraction**
    - Replace hardcoded `/home/bisenbek/projects/crawwwl`
    - Use relative paths from config
- **Configuration Consolidation**
    - Merge with cbintel config system
    - Single `.env` location
- **Python Package Structure**
- **Shell Script Cleanup**
    - Remove debugging artifacts
    - Parameterize ROOT path
    - Add proper error handling
## Key Insights

### Strengths

- Multi-engine search for comprehensive coverage
- LLM-enhanced query expansion
- Vector similarity scoring for relevance
- Recursive child batch expansion
- Multiple prompt types for different use cases
### Weaknesses

- Hardcoded paths throughout
- Complex shell pipeline (fragile)
- Mixed Python 2/3 patterns
- No tests
- Heavy Ollama dependency
### Recommendations

- Port Python components first (cleaner)
- Refactor shell scripts to use config
- Add abstraction layer for LLM (not just Ollama)
- Implement proper error recovery
- Add integration tests
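
For the LLM-abstraction recommendation above, one possible shape is a small protocol that pipeline code depends on, with Ollama as just one backend. The names here are illustrative, not existing crawwwl or cbintel APIs:

```python
"""Illustrative LLM abstraction; names are not existing crawwwl/cbintel APIs."""
from typing import Protocol

import ollama


class LLMClient(Protocol):
    def complete(self, prompt: str) -> str: ...


class OllamaClient:
    def __init__(self, model: str = "qwen3:8b") -> None:
        self.model = model

    def complete(self, prompt: str) -> str:
        return ollama.generate(model=self.model, prompt=prompt)["response"]


def expand_query(llm: LLMClient, query: str) -> str:
    # Callers depend only on the protocol, so Ollama is swappable.
    return llm.complete(f"Expand this search query: {query}")
```

Swapping in another backend then only means providing another class with a `complete()` method.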
## Quick Reference

### Run Crawl

### Output Structure

```text
working/{batch_id}/
├── links/links.txt      # Cleaned URLs
├── pages/*.txt          # Markdown content
├── stats/*.json         # Similarity scores
└── analysis/
    ├── evaluation_report_*.json
    ├── enhanced_evaluation_report_*.json
    └── integrated_response_*.json
```
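
A small helper for gathering one batch's outputs, following the layout above (file-name patterns as documented; the helper itself is hypothetical):

```python
"""Hypothetical helper for collecting one batch's outputs per the layout above."""
from pathlib import Path


def batch_outputs(working: Path, batch_id: str) -> dict:
    batch = working / batch_id
    return {
        "links": batch / "links" / "links.txt",
        "pages": sorted((batch / "pages").glob("*.txt")),
        "stats": sorted((batch / "stats").glob("*.json")),
        "reports": sorted((batch / "analysis").glob("*evaluation_report_*.json")),
        "responses": sorted((batch / "analysis").glob("integrated_response_*.json")),
    }
```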
### Key Functions (crawl.sh)

| Function | Purpose |
|---|---|
| `main()` | Orchestrates entire pipeline |
| `initialize_globals()` | Sets up environment |
| `generate_enhanced_queries()` | LLM query expansion |
| `generate_search_results()` | Search engine crawling |
| `extract_links()` | Link extraction + cleaning |
| `process_urls_with_engines()` | Content fetching |
| `process_crawl_urls()` | Main processing loop |