Imported from bare git on Zurich
Go to file
Johan Jongsma fb3d5a46b5 Replace OCR with AI vision, SQLite for storage
- Remove Tesseract/OCR dependencies
- Use Claude vision API for document analysis
- Single AI pass: extract text + classify + summarize
- SQLite database for documents and embeddings
- Embeddings storage ready (generation placeholder)
- Full-text search via SQLite
- Updated systemd service to use venv
- Support .env file for API key
2026-02-01 17:24:05 +00:00
.gitignore Replace OCR with AI vision, SQLite for storage 2026-02-01 17:24:05 +00:00
README.md Replace OCR with AI vision, SQLite for storage 2026-02-01 17:24:05 +00:00
processor.py Replace OCR with AI vision, SQLite for storage 2026-02-01 17:24:05 +00:00
search.py Replace OCR with AI vision, SQLite for storage 2026-02-01 17:24:05 +00:00

README.md

Document Processor

AI-powered document management system using Claude vision for extraction and SQLite for storage/search.

Features

  • AI Vision Analysis: Uses Claude to read documents, extract text, classify, and summarize
  • No OCR dependencies: Just drop files in inbox, AI handles the rest
  • SQLite Storage: Full-text search via SQLite, embeddings ready (placeholder)
  • Auto-categorization: Taxes, bills, medical, insurance, legal, financial, etc.
  • Expense Tracking: Auto-exports bills/receipts to CSV

Setup

cd ~/dev/doc-processor

# Create/activate venv
python3 -m venv venv
source venv/bin/activate

# Install dependencies
pip install anthropic

# Configure API key (one of these methods):
# Option 1: Environment variable
export ANTHROPIC_API_KEY=sk-ant-...

# Option 2: .env file
echo 'ANTHROPIC_API_KEY=sk-ant-...' > .env

Usage

# Activate venv first
source ~/dev/doc-processor/venv/bin/activate

# Process all documents in inbox
python processor.py

# Watch inbox continuously
python processor.py --watch

# Process single file
python processor.py --file /path/to/document.pdf

# Search documents
python search.py "query"
python search.py -c medical              # By category
python search.py -t receipt              # By type
python search.py -s abc123               # Show full document
python search.py --stats                 # Statistics
python search.py -l                      # List all

Directory Structure

~/documents/
├── inbox/           # Drop files here (SMB share for scanner)
├── store/           # Original files (hash-named)
├── records/         # Markdown records by category
│   ├── taxes/
│   ├── bills/
│   ├── medical/
│   └── ...
├── index/
│   ├── master.json  # JSON index
│   └── embeddings.db  # SQLite (documents + embeddings)
└── exports/
    └── expenses.csv # Auto-exported expenses

Supported Formats

  • PDF (converted to image for vision)
  • Images: PNG, JPG, JPEG, GIF, WebP, TIFF, BMP

Categories

  • taxes, bills, medical, insurance, legal
  • financial, expenses, vehicles, home
  • personal, contacts, uncategorized

Systemd Service

# Install service
systemctl --user daemon-reload
systemctl --user enable doc-processor
systemctl --user start doc-processor

# Check status
systemctl --user status doc-processor
journalctl --user -u doc-processor -f

Requirements

  • Python 3.10+
  • anthropic Python package
  • pdftoppm (poppler-utils) for PDF conversion
  • Anthropic API key

API Key

The processor looks for the API key in this order:

  1. ANTHROPIC_API_KEY environment variable
  2. ~/dev/doc-processor/.env file

Embeddings

The embedding storage is ready but the generation is a placeholder. Options:

  • OpenAI text-embedding-3-small (cheap, good)
  • Voyage AI (optimized for documents)
  • Local sentence-transformers

Currently uses SQLite full-text search which works well for most use cases.