# Lazarus - Web Archive Processing Pipeline

Source: extern/lazarus/README.md · Last updated: 2024-11
A high-performance data processing pipeline for discovering, retrieving, indexing, and analyzing web content from archives and live sources.
## Overview
Lazarus discovers URLs from web archives, retrieves historical content, processes text with AI models, and stores embeddings in a vector database for semantic search. It's designed for large-scale content analysis, entity extraction, and building knowledge bases from web data.
**Key Features:**

- Smart URL sampling with deduplication to avoid processing duplicate pages
- Content preservation with automatic HTML and metadata saving
- Configurable sampling strategies for diverse content discovery
- High-performance vector search for semantic similarity
## Components
- gau: URL discovery from web archives (Wayback Machine, Common Crawl, etc.)
- cdx_toolkit: Content retrieval from Internet Archive and Common Crawl
- Ollama: Local LLM for embeddings, entity extraction, and text processing
- vectl: High-performance C++ vector storage with clustering support
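For reference, Ollama serves embeddings over a local HTTP API (port 11434 by default). A minimal Python sketch of requesting an embedding from `nomic-embed-text` might look like the following; the `build_request` and `embed` helpers are illustrative, not part of Lazarus:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/embeddings"

def build_request(text, model="nomic-embed-text"):
    """Construct the JSON payload Ollama's embeddings endpoint expects."""
    return {"model": model, "prompt": text}

def embed(text, model="nomic-embed-text"):
    """POST text to a running Ollama server and return its embedding vector."""
    payload = json.dumps(build_request(text, model)).encode()
    req = urllib.request.Request(
        OLLAMA_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["embedding"]
```

The returned vector is what gets indexed in vectl for similarity search.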
## Setup

```shell
# Activate the Python environment
source ~/.pyenv/versions/nominates/bin/activate

# Install dependencies
pip install -r requirements.txt
pip install -e .

# Ensure Ollama is running and has the required models
ollama pull nomic-embed-text   # for embeddings
ollama pull llama3.2           # for text processing
```
## Quick Start Workflow

```shell
# 1. Process a domain to build your knowledge base
lazarus process example.com --save-content --content-limit 1000

# 2. View progress during or after processing
lazarus peek example.com

# 3. Search the indexed content semantically
lazarus search "your topic of interest"

# 4. Locate specific entities with vector IDs
lazarus locate organizations "Company Name"

# 5. Archive your work when done
lazarus archive --name "project_analysis"
```
## Main Commands

| Command | Description |
|---|---|
| `lazarus process <domain>` | Process a domain through the complete pipeline |
| `lazarus search <query>` | Semantic search through the vector store |
| `lazarus locate <type> <name>` | Find entity locations with vector IDs |
| `lazarus peek <domain>` | View progress reports in the browser |
| `lazarus archive` | Save work to timestamped folders |
| `lazarus clear` | Delete all generated data |
| `lazarus stats` | Show vector store statistics |
| `lazarus batch <domains...>` | Process multiple domains |
## Sampling Strategies

| Strategy | Description | Use Case |
|---|---|---|
| `diverse` | Deduplicates URLs by path first, then samples randomly | Best for getting unique content |
| `random` | Pure random sampling from all discovered URLs | Good for statistical sampling |
| `stratified` | Samples proportionally from different URL depth levels | Ideal for hierarchical sites |
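The `diverse` and `stratified` strategies can be sketched in plain Python. This is an illustrative sketch of the sampling logic as described above, not Lazarus's actual implementation; the function names are hypothetical:

```python
import random
from urllib.parse import urlparse

def diverse_sample(urls, n, seed=None):
    """Keep one URL per unique path, then sample up to n randomly."""
    rng = random.Random(seed)
    by_path = {}
    for url in urls:
        by_path.setdefault(urlparse(url).path, url)  # first URL per path wins
    unique = list(by_path.values())
    return rng.sample(unique, min(n, len(unique)))

def stratified_sample(urls, n, seed=None):
    """Sample proportionally from each URL depth level."""
    rng = random.Random(seed)
    strata = {}
    for url in urls:
        depth = urlparse(url).path.strip("/").count("/")
        strata.setdefault(depth, []).append(url)
    picked = []
    for depth, bucket in sorted(strata.items()):
        # Allocate slots proportionally to stratum size, at least one each
        k = max(1, round(n * len(bucket) / len(urls)))
        picked.extend(rng.sample(bucket, min(k, len(bucket))))
    return picked[:n]
```

With `diverse`, query-string variants of the same page (`/a` vs. `/a?page=2`) collapse to a single candidate before sampling, which is what avoids reprocessing duplicate pages.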
## Pipeline Architecture

1. **URL Discovery**: uses gau to find URLs in web archives
2. **Content Retrieval**: downloads content via cdx_toolkit
3. **Text Processing**: chunks text and generates embeddings with Ollama
4. **Vector Storage**: indexes embeddings in vectl for similarity search
## Performance
- Processes ~100-200 pages per hour (depending on Ollama model)
- Vector search on 1M+ documents in <100ms
- Smart sampling reduces processing time by 60-80% for large sites