Document Processor

A Go service that watches ~/documents/inbox/ for PDFs and images, uses Kimi K2.5 (via the Fireworks API) for OCR and classification, then stores and indexes the results.

Features

  • File watcher: Monitors inbox for new documents
  • OCR + Classification: Kimi K2.5 extracts text and categorizes documents
  • Storage: PDFs stored in ~/documents/store/
  • Records: Markdown records in ~/documents/records/{category}/
  • Index: JSON index at ~/documents/index/master.json
  • Expense export: Auto-exports expenses to ~/documents/exports/expenses.csv
  • HTTP API: REST endpoints for manual ingestion and search

Setup

  1. Set your Fireworks API key:

    export FIREWORKS_API_KEY=your_key_here
    
  2. Run the service:

    ./docproc
    
  3. Or install it as a systemd service:

    sudo cp docproc.service /etc/systemd/system/
    # Edit /etc/systemd/system/docproc.service to add your API key
    sudo systemctl daemon-reload
    sudo systemctl enable --now docproc
    

API Endpoints

    Endpoint          Method  Description
    /health           GET     Health check
    /ingest           POST    Upload and process a document (multipart form, field: file)
    /search?q=query   GET     Search documents by content
    /docs             GET     List all documents
    /doc/{id}         GET     Get single document by ID
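
For callers written in Go, the /search endpoint can be wrapped in a small client. The following is a minimal sketch, not part of docproc itself: the helper names searchURL and search are hypothetical, and because this README does not describe the response format, the body is returned as raw bytes. The point it illustrates is that the q parameter should be URL-escaped before it is sent.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
)

// searchURL builds the /search URL for a query, escaping the q parameter.
// baseURL is assumed to have no trailing slash, e.g. "http://localhost:9900".
func searchURL(baseURL, query string) string {
	v := url.Values{}
	v.Set("q", query)
	return baseURL + "/search?" + v.Encode()
}

// search performs the GET request and returns the raw response body.
// The response schema is not documented here, so no decoding is attempted.
func search(baseURL, query string) ([]byte, error) {
	resp, err := http.Get(searchURL(baseURL, query))
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("search failed: %s", resp.Status)
	}
	return io.ReadAll(resp.Body)
}
```

Calling search("http://localhost:9900", "amazon prime") would request /search?q=amazon+prime.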

Directory Structure

~/documents/
├── inbox/      # Drop files here for processing
├── store/      # Processed PDFs stored by hash
├── records/    # Markdown records by category
│   ├── tax/
│   ├── expense/
│   ├── medical/
│   └── ...
├── index/
│   └── master.json  # Document index
└── exports/
    └── expenses.csv  # Expense export
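
Storing files by hash means identical uploads map to the same path. The following is a minimal sketch of how such a path could be derived. It assumes SHA-256 over the file contents, a hex-encoded file name, and a .pdf extension; this README does not state which hash or naming scheme docproc actually uses, and storePath is a hypothetical name.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"path/filepath"
)

// storePath returns a content-addressed path inside storeDir.
// Assumption: SHA-256 and a ".pdf" suffix; the real service may differ.
func storePath(storeDir string, content []byte) string {
	sum := sha256.Sum256(content)
	return filepath.Join(storeDir, hex.EncodeToString(sum[:])+".pdf")
}
```

With this scheme, dropping the same document into the inbox twice yields the same store path, so duplicates can be detected by checking whether the file already exists.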

Categories

Documents are classified into:

  • tax
  • expense
  • bill
  • invoice
  • medical
  • receipt
  • bank
  • insurance
  • legal
  • correspondence
  • other
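
Since the category comes from a model, it may help to validate it before using it as a directory name under records/. The following is a minimal sketch, assuming that any unrecognized label falls back to other; that fallback and the name normalizeCategory are assumptions, not documented docproc behavior.

```go
package main

import "strings"

// categories lists the labels from this README.
var categories = map[string]bool{
	"tax": true, "expense": true, "bill": true, "invoice": true,
	"medical": true, "receipt": true, "bank": true, "insurance": true,
	"legal": true, "correspondence": true, "other": true,
}

// normalizeCategory trims and lowercases a raw label and maps anything
// outside the known set to "other" (an assumed fallback).
func normalizeCategory(raw string) string {
	c := strings.ToLower(strings.TrimSpace(raw))
	if categories[c] {
		return c
	}
	return "other"
}
```

This keeps a stray label such as "Tax " or "../etc" from creating unexpected directories.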

Usage

Drop a PDF or image into ~/documents/inbox/ and the service will:

  1. OCR and classify it
  2. Store the original in store/
  3. Create a markdown record in records/{category}/
  4. Update the master index
  5. Export to CSV if it's an expense
  6. Delete from inbox

Or POST to /ingest:

    curl -X POST http://localhost:9900/ingest -F "file=@receipt.pdf"
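
The same upload can be made from Go. The following is a minimal sketch of building the multipart request with the file form field described above. The helper name newIngestRequest is hypothetical, and the base URL is taken from the curl example.

```go
package main

import (
	"bytes"
	"mime/multipart"
	"net/http"
)

// newIngestRequest builds a POST /ingest request that carries content
// as a multipart form file in the "file" field.
func newIngestRequest(baseURL, filename string, content []byte) (*http.Request, error) {
	var body bytes.Buffer
	w := multipart.NewWriter(&body)
	part, err := w.CreateFormFile("file", filename)
	if err != nil {
		return nil, err
	}
	if _, err := part.Write(content); err != nil {
		return nil, err
	}
	// Close writes the trailing boundary; the body is invalid without it.
	if err := w.Close(); err != nil {
		return nil, err
	}
	req, err := http.NewRequest(http.MethodPost, baseURL+"/ingest", &body)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", w.FormDataContentType())
	return req, nil
}
```

The returned request can be sent with http.DefaultClient.Do(req). The response body should be closed after reading.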

Search documents:

    curl "http://localhost:9900/search?q=amazon"