
Lazarus - Web Archive Processing Pipeline

Source: extern/lazarus/README.md Last updated: 2024-11

A high-performance data processing pipeline for discovering, retrieving, indexing, and analyzing web content from archives and live sources.

Overview

Lazarus discovers URLs from web archives, retrieves historical content, processes text with AI models, and stores embeddings in a vector database for semantic search. It's designed for large-scale content analysis, entity extraction, and building knowledge bases from web data.

Key Features:

  • Smart URL sampling with deduplication to avoid processing duplicate pages
  • Content preservation with automatic HTML and metadata saving
  • Configurable sampling strategies for diverse content discovery
  • High-performance vector search for semantic similarity

Components

  • gau: URL discovery from web archives (Wayback Machine, Common Crawl, etc.)
  • cdx_toolkit: Content retrieval from Internet Archive and Common Crawl
  • Ollama: Local LLM for embeddings, entity extraction, and text processing (see the example request after this list)
  • vectl: High-performance C++ vector storage with clustering support
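
Ollama's local HTTP API backs the embedding and text-processing stages. As a minimal sketch (illustrative, not Lazarus code), the request below fetches an embedding from a running Ollama instance, assuming the default port 11434 and the nomic-embed-text model pulled in Setup:

# Minimal sketch: embedding a piece of text via Ollama's HTTP API.
# Assumes Ollama is listening on its default port with nomic-embed-text pulled.
import requests

resp = requests.post(
    "http://localhost:11434/api/embeddings",
    json={"model": "nomic-embed-text", "prompt": "web archive processing"},
    timeout=60,
)
resp.raise_for_status()
embedding = resp.json()["embedding"]  # a list of floats
print(f"got a {len(embedding)}-dimensional embedding")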

Setup

# Activate the Python environment
source ~/.pyenv/versions/nominates/bin/activate

# Install dependencies
pip install -r requirements.txt
pip install -e .

# Ensure Ollama is running and has required models
ollama pull nomic-embed-text  # For embeddings
ollama pull llama3.2          # For text processing
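
Before running the pipeline, it can help to confirm that Ollama is reachable and the required models are present. The snippet below is an illustrative check against Ollama's /api/tags endpoint, not part of Lazarus itself:

# Optional sanity check: is Ollama up, and are the required models pulled?
import requests

tags = requests.get("http://localhost:11434/api/tags", timeout=5).json()
names = [m["name"] for m in tags["models"]]
for required in ("nomic-embed-text", "llama3.2"):
    # Ollama reports tagged names like "llama3.2:latest", so match on prefix
    assert any(n.startswith(required) for n in names), f"missing model: {required}"
print("Ollama is running with the required models")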

Quick Start Workflow

# 1. Process a domain to build your knowledge base
lazarus process example.com --save-content --content-limit 1000

# 2. View progress during or after processing
lazarus peek example.com

# 3. Search the indexed content semantically
lazarus search "your topic of interest"

# 4. Locate specific entities with vector IDs
lazarus locate organizations "Company Name"

# 5. Archive your work when done
lazarus archive --name "project_analysis"

Main Commands

Command                        Description
lazarus process <domain>       Process domain through complete pipeline
lazarus search <query>         Semantic search through vector store
lazarus locate <type> <name>   Find entity locations with vector IDs
lazarus peek <domain>          View progress reports in browser
lazarus archive                Save work to timestamped folders
lazarus clear                  Delete all generated data
lazarus stats                  Show vector store statistics
lazarus batch <domains...>     Process multiple domains
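
For several domains, lazarus batch is the built-in route; the commands also compose from scripts. A sketch using only the commands and flags documented above (the domain list is hypothetical):

# Illustrative only: driving the documented CLI from Python.
import subprocess

domains = ["example.com", "example.org"]  # hypothetical input list
for domain in domains:
    subprocess.run(["lazarus", "process", domain, "--save-content"], check=True)
subprocess.run(["lazarus", "stats"], check=True)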

Sampling Strategies

Strategy     Description                                              Use Case
diverse      Deduplicates URLs by path first, then samples randomly   Best for getting unique content
random       Pure random sampling from all discovered URLs            Good for statistical sampling
stratified   Samples proportionally from different URL depth levels   Ideal for hierarchical sites
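
The sampler's internals aren't published here, but the diverse strategy's described behavior (deduplicate by path, then sample randomly) can be sketched as follows; the actual implementation may normalize or seed differently:

# Sketch of the described "diverse" strategy: keep one URL per unique path,
# then draw a random sample. Illustrative; Lazarus's sampler may differ.
import random
from urllib.parse import urlparse

def diverse_sample(urls, k):
    by_path = {}
    for url in urls:
        by_path.setdefault(urlparse(url).path, url)  # first URL wins per path
    unique = list(by_path.values())
    return random.sample(unique, min(k, len(unique)))

urls = [
    "https://example.com/about?utm_source=feed",
    "https://example.com/about",           # duplicate path, dropped
    "https://example.com/blog/post-1",
]
print(diverse_sample(urls, 2))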

Pipeline Architecture

  1. URL Discovery - Uses gau to find URLs from web archives
  2. Content Retrieval - Downloads content via cdx_toolkit
  3. Text Processing - Chunks text and generates embeddings with Ollama (a chunking sketch follows this list)
  4. Vector Storage - Indexes embeddings in vectl for similarity search
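
Chunk size and overlap aren't specified in this README, so the sketch below stands in for stage 3's chunking step with a common fixed-window-plus-overlap approach; the parameters are assumptions, not Lazarus's values:

# Illustrative chunker for stage 3: fixed-size windows with overlap so that
# sentences spanning a boundary appear intact in at least one chunk.
def chunk_text(text, size=1000, overlap=200):
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

chunks = chunk_text("lorem ipsum " * 300)  # stand-in for a retrieved page's text
print(len(chunks), "chunks to embed")

Each chunk is then embedded (as in the Ollama example under Components) and indexed in vectl for similarity search.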

Performance

  • Processes ~100-200 pages per hour (depending on Ollama model)
  • Vector search on 1M+ documents in <100ms
  • Smart sampling reduces processing time by 60-80% for large sites