
Lazarus - Web Archive Processing Pipeline

Source: extern/lazarus/README.md Last updated: 2024-11

A high-performance data processing pipeline for discovering, retrieving, indexing, and analyzing web content from archives and live sources.

Overview

Lazarus discovers URLs from web archives, retrieves historical content, processes text with AI models, and stores embeddings in a vector database for semantic search. It's designed for large-scale content analysis, entity extraction, and building knowledge bases from web data.

Key Features:

  • Smart URL sampling with deduplication to avoid processing duplicate pages
  • Content preservation with automatic HTML and metadata saving
  • Configurable sampling strategies for diverse content discovery
  • High-performance vector search for semantic similarity

Components

  • gau: URL discovery from web archives (Wayback Machine, Common Crawl, etc.)
  • cdx_toolkit: Content retrieval from Internet Archive and Common Crawl
  • Ollama: Local LLM for embeddings, entity extraction, and text processing (see the example request after this list)
  • vectl: High-performance C++ vector storage with clustering support
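
Ollama's local HTTP API backs the embedding and text-processing stages. As a minimal sketch (illustrative, not Lazarus code), the request below fetches an embedding from a running Ollama instance, assuming the default port 11434 and the nomic-embed-text model pulled in Setup:

# Minimal sketch: embedding a piece of text via Ollama's HTTP API.
# Assumes Ollama is listening on its default port with nomic-embed-text pulled.
import requests

resp = requests.post(
    "http://localhost:11434/api/embeddings",
    json={"model": "nomic-embed-text", "prompt": "web archive processing"},
    timeout=60,
)
resp.raise_for_status()
embedding = resp.json()["embedding"]  # a list of floats
print(f"got a {len(embedding)}-dimensional embedding")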

Setup

# Activate the Python environment
source ~/.pyenv/versions/nominates/bin/activate

# Install dependencies
pip install -r requirements.txt
pip install -e .

# Ensure Ollama is running and has required models
ollama pull nomic-embed-text  # For embeddings
ollama pull llama3.2          # For text processing
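
Before running the pipeline, it can help to confirm that Ollama is reachable and the required models are present. The snippet below is an illustrative check against Ollama's /api/tags endpoint, not part of Lazarus itself:

# Optional sanity check: is Ollama up, and are the required models pulled?
import requests

tags = requests.get("http://localhost:11434/api/tags", timeout=5).json()
names = [m["name"] for m in tags["models"]]
for required in ("nomic-embed-text", "llama3.2"):
    # Ollama reports tagged names like "llama3.2:latest", so match on prefix
    assert any(n.startswith(required) for n in names), f"missing model: {required}"
print("Ollama is running with the required models")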

Quick Start Workflow

# 1. Process a domain to build your knowledge base
lazarus process example.com --save-content --content-limit 1000

# 2. View progress during or after processing
lazarus peek example.com

# 3. Search the indexed content semantically
lazarus search "your topic of interest"

# 4. Locate specific entities with vector IDs
lazarus locate organizations "Company Name"

# 5. Archive your work when done
lazarus archive --name "project_analysis"

Main Commands

Command                        Description
lazarus process <domain>       Process domain through complete pipeline
lazarus search <query>         Semantic search through vector store
lazarus locate <type> <name>   Find entity locations with vector IDs
lazarus peek <domain>          View progress reports in browser
lazarus archive                Save work to timestamped folders
lazarus clear                  Delete all generated data
lazarus stats                  Show vector store statistics
lazarus batch <domains...>     Process multiple domains
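
For several domains, lazarus batch is the built-in route; the commands also compose from scripts. A sketch using only the commands and flags documented above (the domain list is hypothetical):

# Illustrative only: driving the documented CLI from Python.
import subprocess

domains = ["example.com", "example.org"]  # hypothetical input list
for domain in domains:
    subprocess.run(["lazarus", "process", domain, "--save-content"], check=True)
subprocess.run(["lazarus", "stats"], check=True)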

Sampling Strategies

Strategy     Description                                              Use Case
diverse      Deduplicates URLs by path first, then samples randomly   Best for getting unique content
random       Pure random sampling from all discovered URLs            Good for statistical sampling
stratified   Samples proportionally from different URL depth levels   Ideal for hierarchical sites
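
The sampler's internals aren't published here, but the diverse strategy's described behavior (deduplicate by path, then sample randomly) can be sketched as follows; the actual implementation may normalize or seed differently:

# Sketch of the described "diverse" strategy: keep one URL per unique path,
# then draw a random sample. Illustrative; Lazarus's sampler may differ.
import random
from urllib.parse import urlparse

def diverse_sample(urls, k):
    by_path = {}
    for url in urls:
        by_path.setdefault(urlparse(url).path, url)  # first URL wins per path
    unique = list(by_path.values())
    return random.sample(unique, min(k, len(unique)))

urls = [
    "https://example.com/about?utm_source=feed",
    "https://example.com/about",           # duplicate path, dropped
    "https://example.com/blog/post-1",
]
print(diverse_sample(urls, 2))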

Pipeline Architecture

  1. URL Discovery - Uses gau to find URLs from web archives
  2. Content Retrieval - Downloads content via cdx_toolkit
  3. Text Processing - Chunks text and generates embeddings with Ollama (a chunking sketch follows this list)
  4. Vector Storage - Indexes embeddings in vectl for similarity search
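
Chunk size and overlap aren't specified in this README, so the sketch below stands in for stage 3's chunking step with a common fixed-window-plus-overlap approach; the parameters are assumptions, not Lazarus's values:

# Illustrative chunker for stage 3: fixed-size windows with overlap so that
# sentences spanning a boundary appear intact in at least one chunk.
def chunk_text(text, size=1000, overlap=200):
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

chunks = chunk_text("lorem ipsum " * 300)  # stand-in for a retrieved page's text
print(len(chunks), "chunks to embed")

Each chunk is then embedded (as in the Ollama example under Components) and indexed in vectl for similarity search.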

Performance

  • Processes ~100-200 pages per hour (depending on Ollama model)
  • Vector search on 1M+ documents in <100ms
  • Smart sampling reduces processing time by 60-80% for large sites