# doc-processor
Standalone tool to process uploaded medical documents (PDFs) into structured entries.
## Current State
- `main.go` - basic doc processor (decrypts PDF, sends to Gemini, outputs JSON)
- `restore/restore.go` - tool that restored 53 orphaned files for Anastasiia
- `restore/test_summary.go` - tested summary extraction prompt with Gemini
**Next steps:**
1. Integrate summary prompt into main.go output schema
2. Test on more document types (labs, imaging reports)
3. Handle re-upload flow or direct entry creation
## What This Does
```
Encrypted PDF → Decrypt → Extract text (vision API) → Categorize/Summarize → JSON output
```
The output will be used to create entries in the inou database. Once the tool is integrated, the original file will be deleted after processing (inou is a data provider for LLMs, not a backup service). The standalone tool does not delete anything yet.
## Context: inou Ecosystem
inou is a medical imaging platform at `~/dev/inou/`. Key components:
- `lib/` - shared library (crypto, db, signal, errors, files)
- `portal/` - web frontend
- `viewer/` - DICOM viewer
- `api/` - API server
- `mcp-client/` - MCP integration for Claude
Production: `/tank/inou/`
Master key: `/tank/inou/master.key`
## Already Built (in lib/)
```go
// lib/files.go
lib.DecryptFile(srcPath string) ([]byte, error) // decrypts file, alerts on error
lib.EncryptFile(content []byte, destPath string) error
// lib/errors.go
lib.SendErrorForAnalysis(context string, err error, details map[string]interface{})
// Saves incident to /tank/inou/errors/{id}.json and alerts via Signal
// lib/signal.go
lib.SendSignal(message string) // sends Signal message
// lib/crypto.go
lib.CryptoInit(keyPath string) error
lib.CryptoDecryptBytes(ciphertext []byte) ([]byte, error)
```
## What To Build
### Input
```bash
doc-processor <encrypted-file-path>
```
### Output (JSON to stdout)
```json
{
  "title": "MRI Brain Protocol - Fondation Lenval",
  "type": "radiology_report",
  "document_date": "2025-04-14",
  "summary": "MRI protocol describing brain imaging findings, sequences used, and radiologist assessment",
  "tags": ["mri", "brain", "fondation lenval", "dr. dupont"],
  "text": "Full extracted text...",
  "structured_data": null
}
```
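A minimal sketch of a Go type for this output, assuming the field set shown in the example is complete. The name `ProcessedDoc` comes from the `VisionProvider` interface in this document; the field names, the helper functions `EncodeDoc` and `DecodeDoc`, and the choice of `json.RawMessage` for `structured_data` are assumptions, since the lab-value shape is not defined yet.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ProcessedDoc mirrors the JSON output documented above.
type ProcessedDoc struct {
	Title        string   `json:"title"`
	Type         string   `json:"type"`
	DocumentDate string   `json:"document_date"` // YYYY-MM-DD
	Summary      string   `json:"summary"`
	Tags         []string `json:"tags"`
	Text         string   `json:"text"`
	// StructuredData stays raw JSON because its shape is not defined yet
	// (assumption). A nil value is encoded as null.
	StructuredData json.RawMessage `json:"structured_data"`
}

// EncodeDoc renders the document as indented JSON for stdout.
func EncodeDoc(d ProcessedDoc) ([]byte, error) {
	return json.MarshalIndent(d, "", "  ")
}

// DecodeDoc parses JSON in the same shape, for example a vision API reply.
func DecodeDoc(data []byte) (ProcessedDoc, error) {
	var d ProcessedDoc
	err := json.Unmarshal(data, &d)
	return d, err
}

func main() {
	out, err := EncodeDoc(ProcessedDoc{
		Title:        "MRI Brain Protocol - Fondation Lenval",
		Type:         "radiology_report",
		DocumentDate: "2025-04-14",
		Tags:         []string{"mri", "brain"},
		Text:         "Full extracted text...",
	})
	if err != nil {
		fmt.Println("encode failed:", err)
		return
	}
	fmt.Println(string(out))
}
```

Keeping `structured_data` as raw JSON lets non-lab documents emit `null` without committing to a lab-value schema early.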
### Processing Flow
1. `lib.CryptoInit("/tank/inou/master.key")`
2. `lib.DecryptFile(path)` → PDF bytes
3. PDF to images (each page)
4. Send to vision API (Gemini Flash for now, make provider switchable)
5. Vision API prompt asks for: title, type, document_date, summary, tags, full text
6. For lab reports: also extract structured values in `structured_data`
7. Output JSON
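The flow above can be sketched as a chain of stages that stops at the first failure. This is only an outline under stated assumptions: the `Steps` struct, `Process`, and `stubSteps` are hypothetical names, and the stages are injected as functions so the sketch compiles without the inou `lib` package, a PDF renderer, or a vision API client.

```go
package main

import (
	"errors"
	"fmt"
)

// Steps holds one function per pipeline stage. All names here are
// hypothetical; in the real tool Init and Decrypt would wrap
// lib.CryptoInit and lib.DecryptFile.
type Steps struct {
	Init      func() error
	Decrypt   func(path string) ([]byte, error)
	Rasterize func(pdf []byte) ([][]byte, error)   // one image per page
	Extract   func(pages [][]byte) (string, error) // vision API, returns JSON
}

// Process runs the stages in order and stops at the first failure.
func Process(path string, s Steps) (string, error) {
	if err := s.Init(); err != nil {
		return "", fmt.Errorf("crypto init: %w", err)
	}
	pdf, err := s.Decrypt(path)
	if err != nil {
		return "", fmt.Errorf("decrypt %s: %w", path, err)
	}
	pages, err := s.Rasterize(pdf)
	if err != nil {
		return "", fmt.Errorf("rasterize: %w", err)
	}
	if len(pages) == 0 {
		return "", errors.New("rasterize: PDF has no pages")
	}
	out, err := s.Extract(pages)
	if err != nil {
		return "", fmt.Errorf("extract: %w", err)
	}
	return out, nil
}

// stubSteps returns canned stages so the sketch runs without dependencies.
func stubSteps(pageCount int) Steps {
	return Steps{
		Init:    func() error { return nil },
		Decrypt: func(string) ([]byte, error) { return []byte("%PDF-stub"), nil },
		Rasterize: func([]byte) ([][]byte, error) {
			return make([][]byte, pageCount), nil
		},
		Extract: func(pages [][]byte) (string, error) {
			return fmt.Sprintf(`{"title":"stub","pages":%d}`, len(pages)), nil
		},
	}
}

func main() {
	out, err := Process("/path/to/encrypted-file", stubSteps(2))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(out)
}
```

Injecting the stages also makes it straightforward to test the flow against the restored files without calling the vision API on every run.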
### Document Types
- `consultation` - doctor visit notes
- `radiology_report` - MRI/CT/X-ray reports
- `lab_report` - blood work, biochemistry
- `ultrasound` - ultrasound protocols
- `other` - anything else
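Because the type comes back from a model, it may not match the list exactly. A small sketch of one way to guard against that, assuming that any unrecognized value should fall back to `other`; the constant names and `NormalizeType` are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// Document types as listed above.
const (
	TypeConsultation    = "consultation"
	TypeRadiologyReport = "radiology_report"
	TypeLabReport       = "lab_report"
	TypeUltrasound      = "ultrasound"
	TypeOther           = "other"
)

var knownTypes = map[string]bool{
	TypeConsultation:    true,
	TypeRadiologyReport: true,
	TypeLabReport:       true,
	TypeUltrasound:      true,
	TypeOther:           true,
}

// NormalizeType trims and lowercases the model's answer and maps anything
// outside the known list to "other" (assumed fallback policy).
func NormalizeType(raw string) string {
	t := strings.ToLower(strings.TrimSpace(raw))
	if knownTypes[t] {
		return t
	}
	return TypeOther
}

func main() {
	for _, raw := range []string{"lab_report", " Radiology_Report ", "imaging_report"} {
		fmt.Printf("%q -> %q\n", raw, NormalizeType(raw))
	}
}
```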
### Summary Guidelines
Summary describes WHAT information is in the document, not the findings:
❌ "Dr. Smith thinks her leg is broken and plans to fix next month"
✅ "Consultation with Dr. Smith about possible leg fracture"
This lets LLMs decide if they need to read the full text.
### Tags
Extract searchable terms: doctor names, body parts, conditions, institutions, dates.
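Model-extracted tags tend to vary in case and spacing, and may repeat. A hedged sketch of a cleanup pass, assuming lowercase tags (as in the output example) and comma-joined storage (as in the entries schema); `NormalizeTags` and `JoinTags` are hypothetical helpers.

```go
package main

import (
	"fmt"
	"strings"
)

// NormalizeTags lowercases tags, collapses whitespace, and removes empties
// and duplicates while keeping first-seen order. Commas are replaced by
// spaces so a single tag cannot break the comma-joined form.
func NormalizeTags(raw []string) []string {
	seen := make(map[string]bool)
	out := make([]string, 0, len(raw))
	for _, t := range raw {
		t = strings.ReplaceAll(strings.ToLower(t), ",", " ")
		t = strings.Join(strings.Fields(t), " ")
		if t == "" || seen[t] {
			continue
		}
		seen[t] = true
		out = append(out, t)
	}
	return out
}

// JoinTags produces the comma-separated form used for storage.
func JoinTags(tags []string) string {
	return strings.Join(tags, ",")
}

func main() {
	tags := NormalizeTags([]string{"MRI", " brain ", "mri", "", "Fondation  Lenval"})
	fmt.Println(JoinTags(tags))
}
```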
### Document Date
Extract the date FROM the document content (not upload date). Fallback: PDF metadata → file timestamp.
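The fallback order can be written as a small resolver. This sketch assumes the content date arrives as `YYYY-MM-DD` (as in the output example) and that the PDF metadata date and file timestamp have already been read by the caller; `parseDay`, `ResolveDocumentDate`, and the source labels are hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

const dayLayout = "2006-01-02"

// parseDay parses YYYY-MM-DD and returns the zero time when s is empty
// or malformed.
func parseDay(s string) time.Time {
	t, err := time.Parse(dayLayout, s)
	if err != nil {
		return time.Time{}
	}
	return t
}

// ResolveDocumentDate returns the first usable date in priority order
// (document content, then PDF metadata, then file timestamp) together
// with a label naming the source that was used.
func ResolveDocumentDate(contentDate string, pdfMeta, fileMod time.Time) (date, source string) {
	if t := parseDay(contentDate); !t.IsZero() {
		return t.Format(dayLayout), "content"
	}
	if !pdfMeta.IsZero() {
		return pdfMeta.Format(dayLayout), "pdf_metadata"
	}
	return fileMod.Format(dayLayout), "file_timestamp"
}

func main() {
	date, source := ResolveDocumentDate("", time.Time{}, time.Now())
	fmt.Println(date, source)
}
```

Returning the source alongside the date makes it possible to flag entries whose date is only a file timestamp.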
## Test Files
**RESTORED:** 53 files recovered to `/tank/inou/anastasiia-restored/`
- `documents/` - 14 files (consultations, ultrasound protocols, MRI reports)
- `labs/` - 39 files (blood work, biochemistry)
Original filenames recovered by matching file sizes to Signal upload notifications.
Restore tool: `restore/restore.go`
## Vision API
Use Gemini Flash initially. Structure for easy provider switching:
```go
type VisionProvider interface {
	ExtractDocument(images [][]byte) (*ProcessedDoc, error)
}
```
API keys: check how other inou components handle this (likely env vars or config file).
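One possible way to keep the provider switchable while only Gemini is wired in is a constructor that selects by name. In this sketch `stubProvider` is a hypothetical stand-in for a real Gemini client (no real API call is made), `NewProvider` is an assumed helper, and `ProcessedDoc` is reduced to two fields for brevity.

```go
package main

import (
	"errors"
	"fmt"
)

// ProcessedDoc is trimmed to two fields for this sketch.
type ProcessedDoc struct {
	Title string
	Type  string
}

// VisionProvider is the interface from this document.
type VisionProvider interface {
	ExtractDocument(images [][]byte) (*ProcessedDoc, error)
}

// stubProvider is a hypothetical stand-in for a real Gemini client.
type stubProvider struct{ name string }

func (p stubProvider) ExtractDocument(images [][]byte) (*ProcessedDoc, error) {
	if len(images) == 0 {
		return nil, errors.New("no page images")
	}
	title := fmt.Sprintf("%s: %d page(s)", p.name, len(images))
	return &ProcessedDoc{Title: title, Type: "other"}, nil
}

// NewProvider selects a provider by name. Only "gemini" exists for now;
// other providers would be added as extra cases.
func NewProvider(name string) (VisionProvider, error) {
	switch name {
	case "gemini":
		return stubProvider{name: "gemini"}, nil
	default:
		return nil, fmt.Errorf("unknown vision provider %q", name)
	}
}

func main() {
	p, err := NewProvider("gemini")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	doc, err := p.ExtractDocument([][]byte{{0x01}, {0x02}})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(doc.Title)
}
```

The rest of the tool only sees `VisionProvider`, so the name can stay hardcoded to `"gemini"` until CLI flags are worth adding.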
## Entries Schema (for context)
Processed documents become entries:
```
category: "document"
type: "radiology_report" (from output)
value: "MRI Brain Protocol..." (title)
summary: "MRI protocol describing..." (NEW FIELD - needs migration)
tags: "mri,brain,fondation lenval"
timestamp: 1744588800 (document_date as epoch; 2025-04-14 00:00 UTC)
data: {"text": "...", "original_filename": "..."}
```
Note: `summary` field needs to be added to entries table:
```sql
ALTER TABLE entries ADD COLUMN summary TEXT;
```
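The mapping from processor output to an entry can be sketched as below. The `Doc` and `Entry` structs and `ToEntry` are hypothetical (the real entry type lives in the inou codebase), and the date is interpreted as midnight UTC, which is an assumption this document does not state.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
	"time"
)

// Doc carries the processor output fields needed here (hypothetical subset).
type Doc struct {
	Title, Type, DocumentDate, Summary, Text string
	Tags                                     []string
}

// Entry mirrors the fields listed above; the Go types are assumptions.
type Entry struct {
	Category  string
	Type      string
	Value     string
	Summary   string
	Tags      string
	Timestamp int64
	Data      string
}

// ToEntry maps a processed document onto the entry fields. The date is
// parsed as midnight UTC (assumption) and stored as epoch seconds.
func ToEntry(d Doc, originalFilename string) (Entry, error) {
	day, err := time.Parse("2006-01-02", d.DocumentDate)
	if err != nil {
		return Entry{}, fmt.Errorf("document_date %q: %w", d.DocumentDate, err)
	}
	data, err := json.Marshal(map[string]string{
		"text":              d.Text,
		"original_filename": originalFilename,
	})
	if err != nil {
		return Entry{}, fmt.Errorf("encode data: %w", err)
	}
	return Entry{
		Category:  "document",
		Type:      d.Type,
		Value:     d.Title,
		Summary:   d.Summary,
		Tags:      strings.Join(d.Tags, ","),
		Timestamp: day.Unix(),
		Data:      string(data),
	}, nil
}

func main() {
	e, err := ToEntry(Doc{
		Title:        "MRI Brain Protocol - Fondation Lenval",
		Type:         "radiology_report",
		DocumentDate: "2025-04-14",
		Tags:         []string{"mri", "brain", "fondation lenval"},
	}, "scan.pdf")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%+v\n", e)
}
```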
## Error Handling
Use `lib.SendErrorForAnalysis()` for failures - it logs details and sends Signal alert.
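A sketch of how failures could be routed through one place. The `Reporter` type copies the signature of `lib.SendErrorForAnalysis` as listed in this document so the example compiles without the inou `lib` package; `runStep`, `errEmptyResponse`, and the detail keys `file` and `step` are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// Reporter has the same signature as lib.SendErrorForAnalysis, so the real
// function can be passed in directly.
type Reporter func(context string, err error, details map[string]interface{})

// errEmptyResponse is a sample failure used by the demo below.
var errEmptyResponse = errors.New("empty response from vision API")

// runStep runs fn and, on failure, reports the error with the file path
// and step name before returning a wrapped error to the caller.
func runStep(report Reporter, step, path string, fn func() error) error {
	if err := fn(); err != nil {
		report("doc-processor: "+step, err, map[string]interface{}{
			"file": path,
			"step": step,
		})
		return fmt.Errorf("%s: %w", step, err)
	}
	return nil
}

func main() {
	// Print instead of alerting; the real tool would pass lib.SendErrorForAnalysis.
	logOnly := func(context string, err error, details map[string]interface{}) {
		fmt.Println("REPORT:", context, err, details)
	}
	err := runStep(logOnly, "vision", "/path/to/encrypted-file", func() error {
		return errEmptyResponse
	})
	fmt.Println("returned:", err)
}
```

Since `lib.DecryptFile` is documented as alerting on error by itself, wrapping the decrypt step this way would likely send two alerts; the wrapper is better suited to the rasterize and vision steps.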
## Summary Design (for LLM triage)
The summary field helps LLMs decide whether to fetch full document content. Key insight: describe WHAT information is in the document, not the findings.
**Tested prompt** (see `restore/test_summary.go`):
```json
{
  "document_type": "consultation | lab_report | imaging_report | ...",
  "specialty": "neurology | cardiology | ...",
  "date": "YYYY-MM-DD",
  "patient_age_at_doc": "9 months",
  "institution": "hospital name",
  "provider": "doctor name",
  "topics": ["prematurity", "hydrocephalus", "motor development"],
  "has_recommendations": true,
  "has_measurements": false,
  "has_diagnosis": true,
  "summary": "Neurology consultation for premature infant with hydrocephalus..."
}
```
**Topics** are searchable keywords. **Flags** (`has_recommendations`, `has_measurements`, `has_diagnosis`) let an LLM skip irrelevant docs quickly.
Example triage:
- Query "shunt history" → topics include "hydrocephalus" → fetch
- Query "cardiac issues" → specialty is "neurology" → skip
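In practice the LLM makes this decision itself from the summary metadata. Purely as an illustration of the two examples above, the same logic can be approximated in code, assuming the query has already been expanded into related topic terms (for instance "shunt" to "hydrocephalus"). `DocSummary` follows the tested JSON shape; `Query`, `ParseSummary`, and `ShouldFetch` are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// DocSummary follows the JSON shape of the tested prompt.
type DocSummary struct {
	DocumentType       string   `json:"document_type"`
	Specialty          string   `json:"specialty"`
	Date               string   `json:"date"`
	PatientAgeAtDoc    string   `json:"patient_age_at_doc"`
	Institution        string   `json:"institution"`
	Provider           string   `json:"provider"`
	Topics             []string `json:"topics"`
	HasRecommendations bool     `json:"has_recommendations"`
	HasMeasurements    bool     `json:"has_measurements"`
	HasDiagnosis       bool     `json:"has_diagnosis"`
	Summary            string   `json:"summary"`
}

// Query is a hypothetical, already-expanded form of a user question.
type Query struct {
	Specialty string   // empty when the question is not specialty-specific
	Topics    []string // related terms supplied by the caller
}

// ParseSummary decodes one summary object.
func ParseSummary(raw string) (DocSummary, error) {
	var d DocSummary
	err := json.Unmarshal([]byte(raw), &d)
	return d, err
}

// ShouldFetch returns true when any query topic matches a document topic,
// or when the specialties match; otherwise the document is skipped.
func ShouldFetch(d DocSummary, q Query) bool {
	for _, want := range q.Topics {
		for _, have := range d.Topics {
			if strings.EqualFold(want, have) {
				return true
			}
		}
	}
	return q.Specialty != "" && strings.EqualFold(q.Specialty, d.Specialty)
}

func main() {
	d, err := ParseSummary(`{"specialty":"neurology","topics":["prematurity","hydrocephalus"]}`)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(ShouldFetch(d, Query{Topics: []string{"shunt", "hydrocephalus"}}))
	fmt.Println(ShouldFetch(d, Query{Specialty: "cardiology", Topics: []string{"arrhythmia"}}))
}
```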
## Don't
- Don't integrate with portal/database yet - just output JSON
- Don't delete files yet - that's for integration phase
- Don't build CLI flags yet - just hardcode Gemini for now