Imported from bare git on Zurich

Go to file

Johan Jongsma fb3d5a46b5 Replace OCR with AI vision, SQLite for storage - Remove Tesseract/OCR dependencies - Use Claude vision API for document analysis - Single AI pass: extract text + classify + summarize - SQLite database for documents and embeddings - Embeddings storage ready (generation placeholder) - Full-text search via SQLite - Updated systemd service to use venv - Support .env file for API key		2026-02-01 17:24:05 +00:00
.gitignore	Replace OCR with AI vision, SQLite for storage	2026-02-01 17:24:05 +00:00
README.md	Replace OCR with AI vision, SQLite for storage	2026-02-01 17:24:05 +00:00
processor.py	Replace OCR with AI vision, SQLite for storage	2026-02-01 17:24:05 +00:00
search.py	Replace OCR with AI vision, SQLite for storage	2026-02-01 17:24:05 +00:00

README.md

Document Processor

AI-powered document management system using Claude vision for extraction and SQLite for storage/search.

Features

AI Vision Analysis: Uses Claude to read documents, extract text, classify, and summarize
No OCR dependencies: Just drop files in inbox, AI handles the rest
SQLite Storage: Full-text search via SQLite, embeddings ready (placeholder)
Auto-categorization: Taxes, bills, medical, insurance, legal, financial, etc.
Expense Tracking: Auto-exports bills/receipts to CSV

Setup

cd ~/dev/doc-processor

# Create/activate venv
python3 -m venv venv
source venv/bin/activate

# Install dependencies
pip install anthropic

# Configure API key (one of these methods):
# Option 1: Environment variable
export ANTHROPIC_API_KEY=sk-ant-...

# Option 2: .env file
echo 'ANTHROPIC_API_KEY=sk-ant-...' > .env

Usage

# Activate venv first
source ~/dev/doc-processor/venv/bin/activate

# Process all documents in inbox
python processor.py

# Watch inbox continuously
python processor.py --watch

# Process single file
python processor.py --file /path/to/document.pdf

# Search documents
python search.py "query"
python search.py -c medical              # By category
python search.py -t receipt              # By type
python search.py -s abc123               # Show full document
python search.py --stats                 # Statistics
python search.py -l                      # List all

Directory Structure

~/documents/
├── inbox/           # Drop files here (SMB share for scanner)
├── store/           # Original files (hash-named)
├── records/         # Markdown records by category
│   ├── taxes/
│   ├── bills/
│   ├── medical/
│   └── ...
├── index/
│   ├── master.json  # JSON index
│   └── embeddings.db  # SQLite (documents + embeddings)
└── exports/
    └── expenses.csv # Auto-exported expenses

Supported Formats

PDF (converted to image for vision)
Images: PNG, JPG, JPEG, GIF, WebP, TIFF, BMP

Systemd Service

# Install service
systemctl --user daemon-reload
systemctl --user enable doc-processor
systemctl --user start doc-processor

# Check status
systemctl --user status doc-processor
journalctl --user -u doc-processor -f

Requirements

Python 3.10+
anthropic Python package
pdftoppm (poppler-utils) for PDF conversion
Anthropic API key

API Key

The processor looks for the API key in this order:

ANTHROPIC_API_KEY environment variable
~/dev/doc-processor/.env file

Embeddings

The embedding storage is ready but the generation is a placeholder. Options:

OpenAI text-embedding-3-small (cheap, good)
Voyage AI (optimized for documents)
Local sentence-transformers

Currently uses SQLite full-text search which works well for most use cases.