doc-processor/README.md

120 lines
3.0 KiB
Markdown

# Document Processor
AI-powered document management system using K2.5 (via Fireworks) for extraction and SQLite for storage/search.
## Features
- **AI Vision Analysis**: Uses K2.5 (Kimi via Fireworks) to read documents, extract text, classify, and summarize
- **No OCR dependencies**: Just drop files in inbox, AI handles the rest
- **SQLite Storage**: Full-text search via SQLite, embeddings ready (placeholder)
- **Auto-categorization**: Taxes, bills, medical, insurance, legal, financial, etc.
- **Expense Tracking**: Auto-exports bills/receipts to CSV
## Setup
```bash
cd ~/dev/doc-processor
# Create/activate venv
python3 -m venv venv
source venv/bin/activate
# Install dependencies
pip install openai
# Configure API key (one of these methods):
# Option 1: Environment variable
export FIREWORKS_API_KEY=...
# Option 2: .env file
echo 'FIREWORKS_API_KEY=...' > .env
```
## Usage
```bash
# Activate venv first
source ~/dev/doc-processor/venv/bin/activate
# Process all documents in inbox
python processor.py
# Watch inbox continuously
python processor.py --watch
# Process single file
python processor.py --file /path/to/document.pdf
# Search documents
python search.py "query"
python search.py -c medical # By category
python search.py -t receipt # By type
python search.py -s abc123 # Show full document
python search.py --stats # Statistics
python search.py -l # List all
```
## Directory Structure
```
~/documents/
├── inbox/ # Drop files here (SMB share for scanner)
├── store/ # Original files (hash-named)
├── records/ # Markdown records by category
│ ├── taxes/
│ ├── bills/
│ ├── medical/
│ └── ...
├── index/
│ ├── master.json # JSON index
│ └── embeddings.db # SQLite (documents + embeddings)
└── exports/
└── expenses.csv # Auto-exported expenses
```
## Supported Formats
- PDF (converted to image for vision)
- Images: PNG, JPG, JPEG, GIF, WebP, TIFF, BMP
## Categories
- taxes, bills, medical, insurance, legal
- financial, expenses, vehicles, home
- personal, contacts, uncategorized
## Systemd Service
```bash
# Install service
systemctl --user daemon-reload
systemctl --user enable doc-processor
systemctl --user start doc-processor
# Check status
systemctl --user status doc-processor
journalctl --user -u doc-processor -f
```
## Requirements
- Python 3.10+
- `openai` Python package (for Fireworks API)
- `pdftoppm` (poppler-utils) for PDF conversion
- Fireworks API key
## API Key
The processor looks for the API key in this order:
1. `FIREWORKS_API_KEY` environment variable
2. `~/dev/doc-processor/.env` file
## Embeddings
The embedding storage is ready but the generation is a placeholder. Options:
- OpenAI text-embedding-3-small (cheap, good)
- Voyage AI (optimized for documents)
- Local sentence-transformers
Currently uses SQLite full-text search which works well for most use cases.