# Document Processor

Go service that watches `~/documents/inbox/` for PDFs and images, uses Kimi K2.5 (via Fireworks API) for OCR and classification, then stores and indexes them.

## Features

- **File watcher**: Monitors inbox for new documents
- **OCR + Classification**: Kimi K2.5 extracts text and categorizes documents
- **Storage**: PDFs stored in `~/documents/store/`
- **Records**: Markdown records in `~/documents/records/{category}/`
- **Index**: JSON index at `~/documents/index/master.json`
- **Expense export**: Auto-exports expenses to `~/documents/exports/expenses.csv`
- **HTTP API**: REST endpoints for manual ingestion and search

## Setup

1. Set your Fireworks API key:
```bash
export FIREWORKS_API_KEY=your_key_here
```

2. Run the service:
```bash
./docproc
```

3. Or install as systemd service:
```bash
sudo cp docproc.service /etc/systemd/system/
# Edit /etc/systemd/system/docproc.service to add your API key
sudo systemctl daemon-reload
sudo systemctl enable --now docproc
```

## API Endpoints

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/health` | GET | Health check |
| `/ingest` | POST | Upload and process a document (multipart form, field: `file`) |
| `/search?q=query` | GET | Search documents by content |
| `/docs` | GET | List all documents |
| `/doc/{id}` | GET | Get single document by ID |

## Directory Structure

```
~/documents/
├── inbox/              # Drop files here for processing
├── store/              # Processed PDFs stored by hash
├── records/            # Markdown records by category
│   ├── tax/
│   ├── expense/
│   ├── medical/
│   └── ...
├── index/
│   └── master.json     # Document index
└── exports/
    └── expenses.csv    # Expense export
```

## Categories

Documents are classified into:
- tax
- expense
- bill
- invoice
- medical
- receipt
- bank
- insurance
- legal
- correspondence
- other
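
Because the category comes from a model response, a consumer may want to guard against labels outside this list. The sketch below shows one possible normalization step. `normalizeCategory` and the fallback to `other` are hypothetical and are not taken from the service's code.

```go
package main

import (
	"fmt"
	"strings"
)

// knownCategories mirrors the category list in this README.
var knownCategories = map[string]bool{
	"tax": true, "expense": true, "bill": true, "invoice": true,
	"medical": true, "receipt": true, "bank": true, "insurance": true,
	"legal": true, "correspondence": true, "other": true,
}

// normalizeCategory trims and lowercases a raw label, then maps anything
// unrecognized to "other". The fallback choice is an assumption.
func normalizeCategory(raw string) string {
	c := strings.ToLower(strings.TrimSpace(raw))
	if knownCategories[c] {
		return c
	}
	return "other"
}

func main() {
	for _, raw := range []string{"Invoice", " medical\n", "warranty"} {
		fmt.Printf("%q -> %s\n", raw, normalizeCategory(raw))
	}
}
```

Normalizing before building a `records/{category}/` path keeps unexpected labels from creating stray directories.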

## Usage

Drop a PDF or image into `~/documents/inbox/` and the service will:
1. OCR and classify it
2. Store the original in `store/`
3. Create a markdown record in `records/{category}/`
4. Update the master index
5. Export to CSV if it's an expense
6. Delete from inbox

Or POST to `/ingest`:
```bash
curl -X POST http://localhost:9900/ingest -F "file=@receipt.pdf"
```

Search documents:
```bash
curl "http://localhost:9900/search?q=amazon"
```
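
Queries containing spaces or characters such as `&` need escaping before they go into the `q` parameter. A minimal Go sketch of building the search URL follows. `searchURL` is a hypothetical helper, and the base URL is assumed from the examples above.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// searchURL builds the /search URL with the query escaped for use
// in a query string. The helper itself is illustrative.
func searchURL(baseURL, query string) string {
	v := url.Values{}
	v.Set("q", query)
	return strings.TrimRight(baseURL, "/") + "/search?" + v.Encode()
}

func main() {
	fmt.Println(searchURL("http://localhost:9900", "amazon prime & video"))
	// Fetch with http.Get(...) once the service is running.
}
```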