inou/doc-processor/CLAUDE.md

doc-processor

Standalone tool to process uploaded medical documents (PDFs) into structured entries.

Current State

  • main.go - basic doc processor (decrypts PDF, sends to Gemini, outputs JSON)
  • restore/restore.go - tool that restored 53 orphaned files for Anastasiia
  • restore/test_summary.go - tested summary extraction prompt with Gemini

Next steps:

  1. Integrate summary prompt into main.go output schema
  2. Test on more document types (labs, imaging reports)
  3. Handle re-upload flow or direct entry creation

What This Does

Encrypted PDF → Decrypt → Extract text (vision API) → Categorize/Summarize → JSON output

The output will be used to create entries in the inou database. Once integration is in place, the original file is deleted after processing (inou is a data provider for LLMs, not a backup service). For now, nothing is deleted (see Don't).

Context: inou Ecosystem

inou is a medical imaging platform at ~/dev/inou/. Key components:

  • lib/ - shared library (crypto, db, signal, errors, files)
  • portal/ - web frontend
  • viewer/ - DICOM viewer
  • api/ - API server
  • mcp-client/ - MCP integration for Claude

Production: /tank/inou/
Master key: /tank/inou/master.key

Already Built (in lib/)

// lib/files.go
lib.DecryptFile(srcPath string) ([]byte, error)  // decrypts file, alerts on error
lib.EncryptFile(content []byte, destPath string) error

// lib/errors.go  
lib.SendErrorForAnalysis(context string, err error, details map[string]interface{})
// Saves incident to /tank/inou/errors/{id}.json and alerts via Signal

// lib/signal.go
lib.SendSignal(message string)  // sends Signal message

// lib/crypto.go
lib.CryptoInit(keyPath string) error
lib.CryptoDecryptBytes(ciphertext []byte) ([]byte, error)

What To Build

Input

doc-processor <encrypted-file-path>

Output (JSON to stdout)

{
  "title": "MRI Brain Protocol - Fondation Lenval",
  "type": "radiology_report",
  "document_date": "2025-04-14",
  "summary": "MRI protocol describing brain imaging findings, sequences used, and radiologist assessment",
  "tags": ["mri", "brain", "fondation lenval", "dr. dupont"],
  "text": "Full extracted text...",
  "structured_data": null
}
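
A minimal sketch of how the output schema above could map to a Go struct. The names ProcessedDoc and EncodeDoc are assumptions for illustration (ProcessedDoc is only referenced, not defined, elsewhere in this file), and using json.RawMessage for structured_data is one possible choice, not a settled design.

```go
package main

import "encoding/json"

// ProcessedDoc mirrors the JSON output schema. Field names on the Go side
// are assumptions; the JSON keys come from the schema in this document.
type ProcessedDoc struct {
	Title          string          `json:"title"`
	Type           string          `json:"type"`
	DocumentDate   string          `json:"document_date"` // YYYY-MM-DD
	Summary        string          `json:"summary"`
	Tags           []string        `json:"tags"`
	Text           string          `json:"text"`
	StructuredData json.RawMessage `json:"structured_data"` // nil encodes as null
}

// EncodeDoc returns the document as compact JSON. A nil tag list is
// normalized to [] so consumers never see "tags": null.
func EncodeDoc(doc ProcessedDoc) ([]byte, error) {
	if doc.Tags == nil {
		doc.Tags = []string{}
	}
	return json.Marshal(doc)
}
```

Keeping structured_data as raw JSON leaves the lab-value layout open until it is actually defined.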

Processing Flow

  1. lib.CryptoInit("/tank/inou/master.key")
  2. lib.DecryptFile(path) → PDF bytes
  3. PDF to images (each page)
  4. Send to vision API (Gemini Flash for now, make provider switchable)
  5. Vision API prompt asks for: title, type, document_date, summary, tags, full text
  6. For lab reports: also extract structured values in structured_data
  7. Output JSON
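
The flow above can be sketched as a small pipeline with each step injected as a function. This is a rough outline under assumptions, not the code in main.go: the Pipeline type and its field names are hypothetical, Decrypt stands in for lib.CryptoInit plus lib.DecryptFile, and the PDF-to-image and vision steps are left as seams because no library has been chosen for them here.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ProcessedDoc is trimmed to two fields to keep the sketch short.
type ProcessedDoc struct {
	Title string `json:"title"`
	Type  string `json:"type"`
}

// Pipeline holds the three external steps as functions so each can be
// swapped or stubbed. All names here are hypothetical.
type Pipeline struct {
	Decrypt   func(path string) ([]byte, error)           // steps 1-2: key init + decrypt
	Rasterize func(pdf []byte) ([][]byte, error)          // step 3: one image per page
	Extract   func(pages [][]byte) (*ProcessedDoc, error) // steps 4-6: vision API
}

// Run executes the steps in order and returns the JSON for step 7.
// Each error is wrapped with the step name so alerts say where it failed.
func (p Pipeline) Run(path string) ([]byte, error) {
	pdf, err := p.Decrypt(path)
	if err != nil {
		return nil, fmt.Errorf("decrypt %s: %w", path, err)
	}
	pages, err := p.Rasterize(pdf)
	if err != nil {
		return nil, fmt.Errorf("rasterize: %w", err)
	}
	if len(pages) == 0 {
		return nil, fmt.Errorf("rasterize: no pages in %s", path)
	}
	doc, err := p.Extract(pages)
	if err != nil {
		return nil, fmt.Errorf("extract: %w", err)
	}
	return json.Marshal(doc)
}
```

Injecting the steps keeps the flow testable without the master key or a live API call.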

Document Types

  • consultation - doctor visit notes
  • radiology_report - MRI/CT/X-ray reports
  • lab_report - blood work, biochemistry
  • ultrasound - ultrasound protocols
  • other - anything else
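
Since the vision API returns the type as free text, it may help to normalize it against the list above. A small sketch, assuming a hypothetical NormalizeType helper and assuming that unknown values should fall back to other:

```go
package main

import "strings"

// documentTypes is the set of types listed in this document.
var documentTypes = map[string]bool{
	"consultation":     true,
	"radiology_report": true,
	"lab_report":       true,
	"ultrasound":       true,
	"other":            true,
}

// NormalizeType lowercases and trims the model's answer and maps anything
// outside the known set to "other". The fallback rule is an assumption.
func NormalizeType(raw string) string {
	t := strings.ToLower(strings.TrimSpace(raw))
	if documentTypes[t] {
		return t
	}
	return "other"
}
```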

Summary Guidelines

Summary describes WHAT information is in the document, not the findings:

Bad: "Dr. Smith thinks her leg is broken and plans to fix next month"
Good: "Consultation with Dr. Smith about possible leg fracture"

This lets LLMs decide if they need to read the full text.

Tags

Extract searchable terms: doctor names, body parts, conditions, institutions, dates.

Document Date

Extract the date FROM the document content (not upload date). Fallback: PDF metadata → file timestamp.
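
The fallback order above (content, then PDF metadata, then file timestamp) could look like this. DateCandidates and ResolveDocumentDate are hypothetical names, and how the PDF metadata date gets read is left out on purpose; the sketch only shows the priority logic.

```go
package main

import (
	"strings"
	"time"
)

const dateLayout = "2006-01-02" // YYYY-MM-DD

// DateCandidates collects the possible sources for document_date.
type DateCandidates struct {
	Content     string    // date the vision API read from the document
	PDFMetadata time.Time // from PDF metadata, zero if absent
	FileModTime time.Time // file timestamp, zero if unknown
}

// parseDate parses a YYYY-MM-DD string and reports whether it was valid.
func parseDate(s string) (time.Time, bool) {
	t, err := time.Parse(dateLayout, strings.TrimSpace(s))
	return t, err == nil
}

// ResolveDocumentDate returns the first usable date and which source it
// came from, so a weak fallback can be flagged later.
func ResolveDocumentDate(c DateCandidates) (date, source string) {
	if t, ok := parseDate(c.Content); ok {
		return t.Format(dateLayout), "content"
	}
	if !c.PDFMetadata.IsZero() {
		return c.PDFMetadata.Format(dateLayout), "pdf_metadata"
	}
	if !c.FileModTime.IsZero() {
		return c.FileModTime.Format(dateLayout), "file_timestamp"
	}
	return "", "none"
}
```

Returning the source alongside the date is a design choice of this sketch: a date taken from the file timestamp is much less trustworthy than one read from the page.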

Test Files

RESTORED: 53 files recovered to /tank/inou/anastasiia-restored/

  • documents/ - 14 files (consultations, ultrasound protocols, MRI reports)
  • labs/ - 39 files (blood work, biochemistry)

Original filenames recovered by matching file sizes to Signal upload notifications. Restore tool: restore/restore.go

Vision API

Use Gemini Flash initially. Structure for easy provider switching:

type VisionProvider interface {
    ExtractDocument(images [][]byte) (*ProcessedDoc, error)
}

API keys: check how other inou components handle this (likely env vars or config file).
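
One way to keep the provider switchable is a small name-to-constructor registry around the interface above. This is a sketch with a stub in place of the real Gemini call: stubProvider, providers, and NewProvider are hypothetical names, and no actual API client is shown.

```go
package main

import "fmt"

// ProcessedDoc is trimmed to two fields to keep the sketch short.
type ProcessedDoc struct {
	Title string
	Type  string
}

// VisionProvider is the interface from this document.
type VisionProvider interface {
	ExtractDocument(images [][]byte) (*ProcessedDoc, error)
}

// stubProvider stands in for a real client; it makes no network calls.
type stubProvider struct{ name string }

func (s stubProvider) ExtractDocument(images [][]byte) (*ProcessedDoc, error) {
	if len(images) == 0 {
		return nil, fmt.Errorf("%s: no page images", s.name)
	}
	title := fmt.Sprintf("stub result from %s (%d pages)", s.name, len(images))
	return &ProcessedDoc{Title: title, Type: "other"}, nil
}

// providers maps a provider name to its constructor. Adding a provider
// means adding one entry here.
var providers = map[string]func() VisionProvider{
	"gemini": func() VisionProvider { return stubProvider{name: "gemini"} },
}

// NewProvider looks up a provider by name.
func NewProvider(name string) (VisionProvider, error) {
	factory, ok := providers[name]
	if !ok {
		return nil, fmt.Errorf("unknown vision provider %q", name)
	}
	return factory(), nil
}
```

With this shape, hardcoding Gemini for now is just NewProvider("gemini"), and a CLI flag can be added later without touching the callers.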

Entries Schema (for context)

Processed documents become entries:

category:  "document"
type:      "radiology_report" (from output)
value:     "MRI Brain Protocol..." (title)
summary:   "MRI protocol describing..." (NEW FIELD - needs migration)
tags:      "mri,brain,fondation lenval"
timestamp: 1744588800 (document_date 2025-04-14 as epoch, midnight UTC)
data:      {"text": "...", "original_filename": "..."}
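
A sketch of the mapping from processor output to an entry, for the later integration phase. The Entry struct and ToEntry are hypothetical (the real entries type lives in lib/ and may differ), and interpreting document_date as midnight UTC is an assumption.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
	"time"
)

// ProcessedDoc holds the output fields needed for the mapping.
type ProcessedDoc struct {
	Title, Type, DocumentDate, Summary, Text string
	Tags                                     []string
}

// Entry is a hypothetical stand-in for a row in the entries table.
type Entry struct {
	Category, Type, Value, Summary, Tags, Data string
	Timestamp                                  int64
}

// entryData is the JSON stored in the data column.
type entryData struct {
	Text             string `json:"text"`
	OriginalFilename string `json:"original_filename"`
}

// ToEntry converts processor output into entry fields: tags are joined
// with commas and document_date becomes a Unix timestamp (UTC midnight).
func ToEntry(doc ProcessedDoc, originalFilename string) (Entry, error) {
	t, err := time.Parse("2006-01-02", doc.DocumentDate)
	if err != nil {
		return Entry{}, fmt.Errorf("document_date %q: %w", doc.DocumentDate, err)
	}
	data, err := json.Marshal(entryData{Text: doc.Text, OriginalFilename: originalFilename})
	if err != nil {
		return Entry{}, fmt.Errorf("encode data: %w", err)
	}
	return Entry{
		Category:  "document",
		Type:      doc.Type,
		Value:     doc.Title,
		Summary:   doc.Summary,
		Tags:      strings.Join(doc.Tags, ","),
		Timestamp: t.Unix(),
		Data:      string(data),
	}, nil
}
```

Under the UTC assumption, a document_date of 2025-04-14 maps to timestamp 1744588800.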

Note: summary field needs to be added to entries table:

ALTER TABLE entries ADD COLUMN summary TEXT;

Error Handling

Use lib.SendErrorForAnalysis() for failures - it saves the incident details and sends a Signal alert.

Summary Design (for LLM triage)

The summary field helps LLMs decide whether to fetch full document content. Key insight: describe WHAT information is in the document, not the findings.

Tested prompt (see restore/test_summary.go):

{
  "document_type": "consultation | lab_report | imaging_report | ...",
  "specialty": "neurology | cardiology | ...",
  "date": "YYYY-MM-DD",
  "patient_age_at_doc": "9 months",
  "institution": "hospital name",
  "provider": "doctor name",
  "topics": ["prematurity", "hydrocephalus", "motor development"],
  "has_recommendations": true,
  "has_measurements": false,
  "has_diagnosis": true,
  "summary": "Neurology consultation for premature infant with hydrocephalus..."
}

Topics are searchable keywords. The has_* flags let an LLM skip irrelevant docs quickly without reading the full text.

Example triage:

  • Query "shunt history" → topics include "hydrocephalus" → fetch
  • Query "cardiac issues" → specialty is "neurology" → skip
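
The triage itself is meant to be done by the LLM, including the semantic step from "shunt" to "hydrocephalus". Still, the specialty and has_* fields also allow a cheap mechanical pre-filter before anything reaches a model. A sketch under that assumption; TriageSummary, Need, and CanSkip are hypothetical names, and topic matching is deliberately left out.

```go
package main

import (
	"encoding/json"
	"strings"
)

// TriageSummary mirrors the tested prompt's JSON output.
type TriageSummary struct {
	DocumentType       string   `json:"document_type"`
	Specialty          string   `json:"specialty"`
	Date               string   `json:"date"`
	PatientAgeAtDoc    string   `json:"patient_age_at_doc"`
	Institution        string   `json:"institution"`
	Provider           string   `json:"provider"`
	Topics             []string `json:"topics"`
	HasRecommendations bool     `json:"has_recommendations"`
	HasMeasurements    bool     `json:"has_measurements"`
	HasDiagnosis       bool     `json:"has_diagnosis"`
	Summary            string   `json:"summary"`
}

// ParseTriageSummary decodes one summary object returned by the model.
func ParseTriageSummary(raw string) (TriageSummary, error) {
	var s TriageSummary
	err := json.Unmarshal([]byte(raw), &s)
	return s, err
}

// Need describes what a query requires. Zero values mean "don't care".
type Need struct {
	Specialty                               string
	Measurements, Diagnosis, Recommendations bool
}

// CanSkip reports whether the document clearly cannot satisfy the need,
// based only on specialty and flags. It never inspects topics.
func CanSkip(s TriageSummary, n Need) bool {
	if n.Specialty != "" && !strings.EqualFold(s.Specialty, n.Specialty) {
		return true
	}
	if n.Measurements && !s.HasMeasurements {
		return true
	}
	if n.Diagnosis && !s.HasDiagnosis {
		return true
	}
	return n.Recommendations && !s.HasRecommendations
}
```

A strict specialty match will drop cross-specialty documents, so this filter is only safe where that loss is acceptable; otherwise leave the decision to the LLM.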

Don't

  • Don't integrate with portal/database yet - just output JSON
  • Don't delete files yet - that's for integration phase
  • Don't build CLI flags yet - just hardcode Gemini for now