Document Processor
Go service that watches ~/documents/inbox/ for PDFs and images, uses Kimi K2.5 (via Fireworks API) for OCR and classification, then stores and indexes them.
Features
- File watcher: Monitors inbox for new documents
- OCR + Classification: Kimi K2.5 extracts text and categorizes documents
- Storage: PDFs stored in ~/documents/store/
- Records: Markdown records in ~/documents/records/{category}/
- Index: JSON index at ~/documents/index/master.json
- Expense export: Auto-exports expenses to ~/documents/exports/expenses.csv
- HTTP API: REST endpoints for manual ingestion and search
Setup
- Set your Fireworks API key:
    export FIREWORKS_API_KEY=your_key_here
- Run the service:
    ./docproc
- Or install as systemd service:
    sudo cp docproc.service /etc/systemd/system/
    # Edit /etc/systemd/system/docproc.service to add your API key
    sudo systemctl daemon-reload
    sudo systemctl enable --now docproc
API Endpoints
| Endpoint | Method | Description |
|---|---|---|
| /health | GET | Health check |
| /ingest | POST | Upload and process a document (multipart form, field: file) |
| /search?q=query | GET | Search documents by content |
| /docs | GET | List all documents |
| /doc/{id} | GET | Get single document by ID |
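As an illustration of the /ingest row, the following Go sketch builds the multipart upload a client would send. The field name `file` comes from the table and the port from the curl example in Usage; the helper name `newIngestRequest` is hypothetical and not part of this repository.

```go
package main

import (
	"bytes"
	"fmt"
	"mime/multipart"
	"net/http"
)

// newIngestRequest (hypothetical helper) builds a multipart POST for /ingest.
// The form field name "file" is taken from the endpoint table; baseURL is
// supplied by the caller.
func newIngestRequest(baseURL, filename string, content []byte) (*http.Request, error) {
	var body bytes.Buffer
	w := multipart.NewWriter(&body)

	part, err := w.CreateFormFile("file", filename)
	if err != nil {
		return nil, err
	}
	if _, err := part.Write(content); err != nil {
		return nil, err
	}
	// Close writes the trailing boundary; without it the body is incomplete.
	if err := w.Close(); err != nil {
		return nil, err
	}

	req, err := http.NewRequest(http.MethodPost, baseURL+"/ingest", &body)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", w.FormDataContentType())
	return req, nil
}

func main() {
	req, err := newIngestRequest("http://localhost:9900", "receipt.pdf", []byte("%PDF-1.4"))
	if err != nil {
		fmt.Println("build request:", err)
		return
	}
	// The request is only built here, not sent; pass it to an http.Client to upload.
	fmt.Println(req.Method, req.URL.String())
}
```

Building the request separately from sending it keeps the sketch usable without a running service.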
Directory Structure
~/documents/
├── inbox/              # Drop files here for processing
├── store/              # Processed PDFs stored by hash
├── records/            # Markdown records by category
│   ├── tax/
│   ├── expense/
│   ├── medical/
│   └── ...
├── index/
│   └── master.json     # Document index
└── exports/
    └── expenses.csv    # Expense export
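The tree says processed PDFs are "stored by hash" but does not name the hash or the file naming scheme. A minimal sketch of one way this could work, assuming SHA-256 and a `<hash><ext>` file name (both are assumptions, and `storePath` is a hypothetical helper):

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"path/filepath"
)

// storePath (hypothetical) returns a content-addressed path inside storeDir.
// SHA-256 and the "<hash><ext>" naming are assumptions for illustration;
// the service may use a different scheme.
func storePath(storeDir string, content []byte, ext string) string {
	sum := sha256.Sum256(content)
	return filepath.Join(storeDir, hex.EncodeToString(sum[:])+ext)
}

func main() {
	p := storePath("documents/store", []byte("example document bytes"), ".pdf")
	fmt.Println(p)
}
```

With content addressing, ingesting the same file twice maps to the same path, so duplicates are easy to detect.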
Categories
Documents are classified into:
- tax
- expense
- bill
- invoice
- medical
- receipt
- bank
- insurance
- legal
- correspondence
- other
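A classifier's output may not always match one of these labels exactly. The sketch below shows one way to normalize a raw label against the list above, falling back to `other`; the fallback behavior and the name `normalizeCategory` are assumptions, not taken from the service's code.

```go
package main

import (
	"fmt"
	"strings"
)

// categories mirrors the list in this README.
var categories = map[string]bool{
	"tax": true, "expense": true, "bill": true, "invoice": true,
	"medical": true, "receipt": true, "bank": true, "insurance": true,
	"legal": true, "correspondence": true, "other": true,
}

// normalizeCategory (hypothetical) lowercases and trims a raw label and
// returns "other" when the label is not in the known set.
func normalizeCategory(raw string) string {
	c := strings.ToLower(strings.TrimSpace(raw))
	if categories[c] {
		return c
	}
	return "other"
}

func main() {
	for _, raw := range []string{" Tax ", "Invoice", "newsletter"} {
		fmt.Printf("%q -> %s\n", raw, normalizeCategory(raw))
	}
}
```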
Usage
Drop a PDF or image into ~/documents/inbox/ and the service will:
- OCR and classify it
- Store the original in store/
- Create a markdown record in records/{category}/
- Update the master index
- Export to CSV if it's an expense
- Delete from inbox
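Before any of these steps run, the watcher has to decide whether a file in the inbox is a PDF or an image. A minimal sketch of such a check by file extension; the exact extension list and the name `isSupported` are assumptions, since the README does not say which image formats are accepted:

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// isSupported (hypothetical) reports whether a file name looks like a PDF
// or a common image format. The extension list is an assumption.
func isSupported(name string) bool {
	switch strings.ToLower(filepath.Ext(name)) {
	case ".pdf", ".png", ".jpg", ".jpeg":
		return true
	}
	return false
}

func main() {
	for _, name := range []string{"scan.PDF", "photo.jpeg", "notes.txt"} {
		fmt.Printf("%s supported: %v\n", name, isSupported(name))
	}
}
```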
Or POST to /ingest:
curl -X POST http://localhost:9900/ingest -F "file=@receipt.pdf"
Search documents:
curl "http://localhost:9900/search?q=amazon"
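Queries containing spaces or special characters need to be URL-encoded. A small Go sketch of building the search URL from the curl example above; `searchURL` is a hypothetical helper, and the response format is not documented here, so the sketch stops at constructing the URL.

```go
package main

import (
	"fmt"
	"net/url"
)

// searchURL (hypothetical) builds the /search URL with an escaped query.
func searchURL(baseURL, query string) string {
	return baseURL + "/search?q=" + url.QueryEscape(query)
}

func main() {
	fmt.Println(searchURL("http://localhost:9900", "amazon prime"))
	// prints "http://localhost:9900/search?q=amazon+prime"
}
```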