# Document Processor

A Go service that watches `~/documents/inbox/` for PDFs and images, runs OCR and classification with Kimi K2.5 (via the Fireworks API), then stores and indexes the results.

## Features

- **File watcher**: Monitors inbox for new documents
- **OCR + Classification**: Kimi K2.5 extracts text and categorizes documents
- **Storage**: PDFs stored in `~/documents/store/`
- **Records**: Markdown records in `~/documents/records/{category}/`
- **Index**: JSON index at `~/documents/index/master.json`
- **Expense export**: Auto-exports expenses to `~/documents/exports/expenses.csv`
- **HTTP API**: REST endpoints for manual ingestion and search

## Setup

1. Set your Fireworks API key:
   ```bash
   export FIREWORKS_API_KEY=your_key_here
   ```

2. Run the service:
   ```bash
   ./docproc
   ```

3. Or, instead of running it by hand, install it as a systemd service:
   ```bash
   sudo cp docproc.service /etc/systemd/system/
   # Edit /etc/systemd/system/docproc.service to add your API key
   sudo systemctl daemon-reload
   sudo systemctl enable --now docproc
   ```

## API Endpoints

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/health` | GET | Health check |
| `/ingest` | POST | Upload and process a document (multipart form, field: `file`) |
| `/search?q=query` | GET | Search documents by content |
| `/docs` | GET | List all documents |
| `/doc/{id}` | GET | Get single document by ID |
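
For callers that prefer Go over `curl`, the upload to `/ingest` can be sketched as below. This is a minimal client sketch, not part of the service: the helper name `newIngestRequest` is hypothetical, the base URL `http://localhost:9900` is taken from the usage examples in this README, and the response body is printed as-is because its format is not documented here.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"log"
	"mime/multipart"
	"net/http"
	"os"
	"path/filepath"
)

// newIngestRequest (hypothetical helper) builds a multipart POST for /ingest.
// The form field name "file" comes from the endpoint table above.
func newIngestRequest(baseURL, filename string, content []byte) (*http.Request, error) {
	var body bytes.Buffer
	w := multipart.NewWriter(&body)

	part, err := w.CreateFormFile("file", filepath.Base(filename))
	if err != nil {
		return nil, err
	}
	if _, err := part.Write(content); err != nil {
		return nil, err
	}
	// Close writes the trailing boundary; without it the body is malformed.
	if err := w.Close(); err != nil {
		return nil, err
	}

	req, err := http.NewRequest(http.MethodPost, baseURL+"/ingest", &body)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", w.FormDataContentType())
	return req, nil
}

func main() {
	if len(os.Args) != 2 {
		log.Fatal("usage: ingest <file>")
	}
	content, err := os.ReadFile(os.Args[1])
	if err != nil {
		log.Fatal(err)
	}
	// Assumption: the service listens on localhost:9900, as in the curl examples.
	req, err := newIngestRequest("http://localhost:9900", os.Args[1], content)
	if err != nil {
		log.Fatal(err)
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		log.Fatal(err)
	}
	reply, err := io.ReadAll(resp.Body)
	resp.Body.Close()
	if err != nil {
		log.Fatal(err)
	}
	// The response schema is not documented here, so print it unparsed.
	fmt.Printf("%s\n%s\n", resp.Status, reply)
}
```

The multipart writer sets the boundary in the `Content-Type` header, which is why the header is taken from `FormDataContentType()` rather than hard-coded.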

## Directory Structure

```
~/documents/
├── inbox/      # Drop files here for processing
├── store/      # Processed PDFs stored by hash
├── records/    # Markdown records by category
│   ├── tax/
│   ├── expense/
│   ├── medical/
│   └── ...
├── index/
│   └── master.json  # Document index
└── exports/
    └── expenses.csv  # Expense export
```
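
To illustrate what "stored by hash" in `store/` can look like, here is a small sketch of a content-addressed path. The README does not say which hash or naming scheme the service uses, so SHA-256, the hex encoding, the kept file extension, and the helper name `storePath` are all assumptions made for this example.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"path/filepath"
)

// storePath (hypothetical helper) returns a content-addressed path under storeDir.
// Assumption: SHA-256 hex digest plus the original extension. The actual
// scheme used by the service may differ.
func storePath(storeDir, originalName string, content []byte) string {
	sum := sha256.Sum256(content)
	name := hex.EncodeToString(sum[:]) + filepath.Ext(originalName)
	return filepath.Join(storeDir, name)
}

func main() {
	// Two files with identical bytes map to the same path, whatever their names.
	fmt.Println(storePath("store", "receipt.pdf", []byte("example bytes")))
	fmt.Println(storePath("store", "copy-of-receipt.pdf", []byte("example bytes")))
}
```

A scheme like this means a document dropped into the inbox twice occupies one file in the store, and a stored file can be checked for corruption by rehashing it.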

## Categories

Each document is classified into one of the following categories:
- tax
- expense
- bill
- invoice
- medical
- receipt
- bank
- insurance
- legal
- correspondence
- other
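
Because the category comes from a model response and also names a directory under `records/`, it helps to validate it before use. The sketch below shows one way to do that in Go. The type and function names are hypothetical, and falling back to `other` for unrecognized labels is an assumption, not documented behavior of the service.

```go
package main

import (
	"fmt"
	"strings"
)

// Category is one of the labels listed in this README.
type Category string

const CategoryOther Category = "other"

// knownCategories mirrors the category list above.
var knownCategories = map[Category]bool{
	"tax": true, "expense": true, "bill": true, "invoice": true,
	"medical": true, "receipt": true, "bank": true, "insurance": true,
	"legal": true, "correspondence": true, "other": true,
}

// normalizeCategory (hypothetical helper) trims and lowercases a raw label
// and maps anything unrecognized to "other". The fallback is an assumption.
func normalizeCategory(raw string) Category {
	c := Category(strings.ToLower(strings.TrimSpace(raw)))
	if knownCategories[c] {
		return c
	}
	return CategoryOther
}

func main() {
	for _, raw := range []string{"Invoice", " medical ", "recipe"} {
		fmt.Printf("%q -> %s\n", raw, normalizeCategory(raw))
	}
}
```

Restricting the value to a fixed set also keeps an unexpected label from creating stray directories under `records/`.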

## Usage

Drop a PDF or image into `~/documents/inbox/` and the service will:
1. OCR and classify it
2. Store the original in `store/`
3. Create a markdown record in `records/{category}/`
4. Update the master index
5. Export it to `exports/expenses.csv` if it is classified as an expense
6. Delete the file from the inbox

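Alternatively, upload a file directly with a POST to `/ingest` (the examples below assume the service is listening on `localhost:9900`):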
```bash
curl -X POST http://localhost:9900/ingest -F "file=@receipt.pdf"
```

Search documents:
```bash
curl "http://localhost:9900/search?q=amazon"
```
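
The same search can be issued from Go. The sketch below builds the query string with `net/url` so that terms containing spaces or `&` are escaped correctly. The helper name `searchURL` is hypothetical, and the response is copied to stdout unparsed because its schema is not described in this README.

```go
package main

import (
	"io"
	"log"
	"net/http"
	"net/url"
	"os"
)

// searchURL (hypothetical helper) builds the /search URL with an escaped q parameter.
func searchURL(baseURL, query string) string {
	v := url.Values{}
	v.Set("q", query)
	return baseURL + "/search?" + v.Encode()
}

func main() {
	// Assumption: the service listens on localhost:9900, as in the curl examples.
	resp, err := http.Get(searchURL("http://localhost:9900", "amazon prime"))
	if err != nil {
		log.Fatal(err)
	}
	_, err = io.Copy(os.Stdout, resp.Body)
	resp.Body.Close()
	if err != nil {
		log.Fatal(err)
	}
}
```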