doc-processor/README.md

# Document Processor

AI-powered document management system using Claude vision for extraction and SQLite for storage/search.

## Features

- **AI Vision Analysis**: Uses Claude to read documents, extract text, classify, and summarize
- **No OCR dependencies**: Just drop files in inbox, AI handles the rest
- **SQLite Storage**: Full-text search via SQLite, embeddings ready (placeholder)
- **Auto-categorization**: Taxes, bills, medical, insurance, legal, financial, etc.
- **Expense Tracking**: Auto-exports bills/receipts to CSV

## Setup

```bash
cd ~/dev/doc-processor

# Create/activate venv
python3 -m venv venv
source venv/bin/activate

# Install dependencies
pip install anthropic

# Configure API key (one of these methods):
# Option 1: Environment variable
export ANTHROPIC_API_KEY=sk-ant-...

# Option 2: .env file
echo 'ANTHROPIC_API_KEY=sk-ant-...' > .env
```

## Usage

```bash
# Activate venv first
source ~/dev/doc-processor/venv/bin/activate

# Process all documents in inbox
python processor.py

# Watch inbox continuously
python processor.py --watch

# Process single file
python processor.py --file /path/to/document.pdf

# Search documents
python search.py "query"
python search.py -c medical              # By category
python search.py -t receipt              # By type
python search.py -s abc123               # Show full document
python search.py --stats                 # Statistics
python search.py -l                      # List all
```

## Directory Structure

```
~/documents/
├── inbox/           # Drop files here (SMB share for scanner)
├── store/           # Original files (hash-named)
├── records/         # Markdown records by category
│   ├── taxes/
│   ├── bills/
│   ├── medical/
│   └── ...
├── index/
│   ├── master.json  # JSON index
│   └── embeddings.db  # SQLite (documents + embeddings)
└── exports/
    └── expenses.csv # Auto-exported expenses
```

## Supported Formats

- PDF (converted to image for vision)
- Images: PNG, JPG, JPEG, GIF, WebP, TIFF, BMP

## Categories

- taxes, bills, medical, insurance, legal
- financial, expenses, vehicles, home
- personal, contacts, uncategorized

## Systemd Service

```bash
# Install service
systemctl --user daemon-reload
systemctl --user enable doc-processor
systemctl --user start doc-processor

# Check status
systemctl --user status doc-processor
journalctl --user -u doc-processor -f
```

## Requirements

- Python 3.10+
- `anthropic` Python package
- `pdftoppm` (poppler-utils) for PDF conversion
- Anthropic API key

## API Key

The processor looks for the API key in this order:
1. `ANTHROPIC_API_KEY` environment variable
2. `~/dev/doc-processor/.env` file

## Embeddings

The embedding storage is ready but the generation is a placeholder. Options:
- OpenAI text-embedding-3-small (cheap, good)
- Voyage AI (optimized for documents)
- Local sentence-transformers

Currently uses SQLite full-text search which works well for most use cases.