120 lines
3.0 KiB
Markdown
120 lines
3.0 KiB
Markdown
# Document Processor
|
|
|
|
AI-powered document management system using Claude vision for extraction and SQLite for storage/search.
|
|
|
|
## Features
|
|
|
|
- **AI Vision Analysis**: Uses Claude to read documents, extract text, classify, and summarize
|
|
- **No OCR dependencies**: Just drop files in inbox, AI handles the rest
|
|
- **SQLite Storage**: Full-text search via SQLite, embeddings ready (placeholder)
|
|
- **Auto-categorization**: Taxes, bills, medical, insurance, legal, financial, etc.
|
|
- **Expense Tracking**: Auto-exports bills/receipts to CSV
|
|
|
|
## Setup
|
|
|
|
```bash
|
|
cd ~/dev/doc-processor
|
|
|
|
# Create/activate venv
|
|
python3 -m venv venv
|
|
source venv/bin/activate
|
|
|
|
# Install dependencies
|
|
pip install anthropic
|
|
|
|
# Configure API key (one of these methods):
|
|
# Option 1: Environment variable
|
|
export ANTHROPIC_API_KEY=sk-ant-...
|
|
|
|
# Option 2: .env file
|
|
echo 'ANTHROPIC_API_KEY=sk-ant-...' > .env
|
|
```
|
|
|
|
## Usage
|
|
|
|
```bash
|
|
# Activate venv first
|
|
source ~/dev/doc-processor/venv/bin/activate
|
|
|
|
# Process all documents in inbox
|
|
python processor.py
|
|
|
|
# Watch inbox continuously
|
|
python processor.py --watch
|
|
|
|
# Process single file
|
|
python processor.py --file /path/to/document.pdf
|
|
|
|
# Search documents
|
|
python search.py "query"
|
|
python search.py -c medical # By category
|
|
python search.py -t receipt # By type
|
|
python search.py -s abc123 # Show full document
|
|
python search.py --stats # Statistics
|
|
python search.py -l # List all
|
|
```
|
|
|
|
## Directory Structure
|
|
|
|
```
|
|
~/documents/
|
|
├── inbox/ # Drop files here (SMB share for scanner)
|
|
├── store/ # Original files (hash-named)
|
|
├── records/ # Markdown records by category
|
|
│ ├── taxes/
|
|
│ ├── bills/
|
|
│ ├── medical/
|
|
│ └── ...
|
|
├── index/
|
|
│ ├── master.json # JSON index
|
|
│ └── embeddings.db # SQLite (documents + embeddings)
|
|
└── exports/
|
|
└── expenses.csv # Auto-exported expenses
|
|
```
|
|
|
|
## Supported Formats
|
|
|
|
- PDF (converted to image for vision)
|
|
- Images: PNG, JPG, JPEG, GIF, WebP, TIFF, BMP
|
|
|
|
## Categories
|
|
|
|
- taxes, bills, medical, insurance, legal
|
|
- financial, expenses, vehicles, home
|
|
- personal, contacts, uncategorized
|
|
|
|
## Systemd Service
|
|
|
|
```bash
|
|
# Install service
|
|
systemctl --user daemon-reload
|
|
systemctl --user enable doc-processor
|
|
systemctl --user start doc-processor
|
|
|
|
# Check status
|
|
systemctl --user status doc-processor
|
|
journalctl --user -u doc-processor -f
|
|
```
|
|
|
|
## Requirements
|
|
|
|
- Python 3.10+
|
|
- `anthropic` Python package
|
|
- `pdftoppm` (poppler-utils) for PDF conversion
|
|
- Anthropic API key
|
|
|
|
## API Key
|
|
|
|
The processor looks for the API key in this order:
|
|
1. `ANTHROPIC_API_KEY` environment variable
|
|
2. `~/dev/doc-processor/.env` file
|
|
|
|
## Embeddings
|
|
|
|
The embedding storage is ready but the generation is a placeholder. Options:
|
|
- OpenAI text-embedding-3-small (cheap, good)
|
|
- Voyage AI (optimized for documents)
|
|
- Local sentence-transformers
|
|
|
|
Currently uses SQLite full-text search which works well for most use cases.
|