# Document Processor
A Go service that watches `~/documents/inbox/` for PDFs and images, uses Kimi K2.5 (via the Fireworks API) for OCR and classification, then stores and indexes the results.
## Features
- **File watcher**: Monitors inbox for new documents
- **OCR + Classification**: Kimi K2.5 extracts text and categorizes documents
- **Storage**: PDFs stored in `~/documents/store/`
- **Records**: Markdown records in `~/documents/records/{category}/`
- **Index**: JSON index at `~/documents/index/master.json`
- **Expense export**: Auto-exports expenses to `~/documents/exports/expenses.csv`
- **HTTP API**: REST endpoints for manual ingestion and search
## Setup
1. Set your Fireworks API key:
```bash
export FIREWORKS_API_KEY=your_key_here
```
2. Run the service:
```bash
./docproc
```
3. Or install it as a systemd service:
```bash
sudo cp docproc.service /etc/systemd/system/
# Edit /etc/systemd/system/docproc.service to add your API key
sudo systemctl daemon-reload
sudo systemctl enable --now docproc
```
## API Endpoints
| Endpoint | Method | Description |
|----------|--------|-------------|
| `/health` | GET | Health check |
| `/ingest` | POST | Upload and process a document (multipart form, field: `file`) |
| `/search?q=query` | GET | Search documents by content |
| `/docs` | GET | List all documents |
| `/doc/{id}` | GET | Get single document by ID |
## Directory Structure
```
~/documents/
├── inbox/            # Drop files here for processing
├── store/            # Processed PDFs stored by hash
├── records/          # Markdown records by category
│   ├── tax/
│   ├── expense/
│   ├── medical/
│   └── ...
├── index/
│   └── master.json   # Document index
└── exports/
    └── expenses.csv  # Expense export
```
## Categories
Documents are classified into:
- tax
- expense
- bill
- invoice
- medical
- receipt
- bank
- insurance
- legal
- correspondence
- other
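Because the category also names the directory under `records/`, it can help to map whatever label the model returns onto this fixed list. The following is a minimal sketch: the helper `normalizeCategory` is hypothetical, and falling back to `other` for unknown labels is an assumption, not documented behavior.

```go
package main

import (
	"fmt"
	"strings"
)

// knownCategories mirrors the category list in this README.
var knownCategories = map[string]bool{
	"tax": true, "expense": true, "bill": true, "invoice": true,
	"medical": true, "receipt": true, "bank": true, "insurance": true,
	"legal": true, "correspondence": true, "other": true,
}

// normalizeCategory is a hypothetical helper. It trims and lowercases a
// model-provided label and falls back to "other" when the label is unknown.
func normalizeCategory(label string) string {
	c := strings.ToLower(strings.TrimSpace(label))
	if knownCategories[c] {
		return c
	}
	return "other"
}

func main() {
	for _, label := range []string{"Invoice", " medical ", "spam"} {
		fmt.Printf("%q -> %s\n", label, normalizeCategory(label))
	}
}
```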
## Usage
Drop a PDF or image into `~/documents/inbox/` and the service will:
1. OCR and classify it
2. Store the original in `store/`
3. Create a markdown record in `records/{category}/`
4. Update the master index
5. Export it to `exports/expenses.csv` if it is an expense
6. Delete the original from the inbox
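Alternatively, upload a file directly with a POST to `/ingest` (the examples here assume the service listens on `localhost:9900`):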
```bash
curl -X POST http://localhost:9900/ingest -F "file=@receipt.pdf"
```
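The same upload can be made from Go. The sketch below builds the multipart request with the standard library. The form field `file`, the `/ingest` path, and port 9900 come from this README; the helper `newIngestRequest` is hypothetical, and the response format is not documented here, so the sketch stops short of decoding it.

```go
package main

import (
	"bytes"
	"fmt"
	"mime/multipart"
	"net/http"
)

// newIngestRequest is a hypothetical helper that builds a multipart POST
// for /ingest with the document in the "file" form field.
func newIngestRequest(baseURL, filename string, content []byte) (*http.Request, error) {
	var body bytes.Buffer
	w := multipart.NewWriter(&body)

	part, err := w.CreateFormFile("file", filename)
	if err != nil {
		return nil, err
	}
	if _, err := part.Write(content); err != nil {
		return nil, err
	}
	// Close writes the trailing boundary; the body is incomplete without it.
	if err := w.Close(); err != nil {
		return nil, err
	}

	req, err := http.NewRequest(http.MethodPost, baseURL+"/ingest", &body)
	if err != nil {
		return nil, err
	}
	// The Content-Type header must carry the boundary chosen by the writer.
	req.Header.Set("Content-Type", w.FormDataContentType())
	return req, nil
}

func main() {
	req, err := newIngestRequest("http://localhost:9900", "receipt.pdf", []byte("%PDF-1.4 sample"))
	if err != nil {
		fmt.Println("build request:", err)
		return
	}
	fmt.Println(req.Method, req.URL)
	// With the service running, send it with: resp, err := http.DefaultClient.Do(req)
}
```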
Search documents:
```bash
curl "http://localhost:9900/search?q=amazon"
```
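Queries that contain spaces or characters such as `&` need to be URL-encoded before they go into `q`. Here is a small Go sketch of building the search URL. The helper `searchURL` is hypothetical; the `/search` path and `q` parameter come from the endpoint table.

```go
package main

import (
	"fmt"
	"net/url"
)

// searchURL is a hypothetical helper that builds a /search URL with the
// query safely encoded into the "q" parameter.
func searchURL(baseURL, query string) string {
	v := url.Values{}
	v.Set("q", query)
	return baseURL + "/search?" + v.Encode()
}

func main() {
	fmt.Println(searchURL("http://localhost:9900", "amazon"))
	// Prints: http://localhost:9900/search?q=amazon
	fmt.Println(searchURL("http://localhost:9900", "amazon prime & video"))
	// Prints: http://localhost:9900/search?q=amazon+prime+%26+video
}
```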