# doc-processor

Standalone tool to process uploaded medical documents (PDFs) into structured entries.

## Current State

- `main.go` - basic doc processor (decrypts PDF, sends to Gemini, outputs JSON)
- `restore/restore.go` - tool that restored 53 orphaned files for Anastasiia
- `restore/test_summary.go` - tested summary extraction prompt with Gemini

**Next steps:**

1. Integrate summary prompt into `main.go` output schema
2. Test on more document types (labs, imaging reports)
3. Handle re-upload flow or direct entry creation

## What This Does

```
Encrypted PDF → Decrypt → Extract text (vision API) → Categorize/Summarize → JSON output
```

The output will be used to create entries in the inou database. After processing, the original file is deleted (inou is a data provider for LLMs, not a backup service).

## Context: inou Ecosystem

inou is a medical imaging platform at `~/dev/inou/`. Key components:

- `lib/` - shared library (crypto, db, signal, errors, files)
- `portal/` - web frontend
- `viewer/` - DICOM viewer
- `api/` - API server
- `mcp-client/` - MCP integration for Claude

Production: `/tank/inou/`

Master key: `/tank/inou/master.key`

## Already Built (in lib/)

```go
// lib/files.go
lib.DecryptFile(srcPath string) ([]byte, error) // decrypts file, alerts on error
lib.EncryptFile(content []byte, destPath string) error

// lib/errors.go
lib.SendErrorForAnalysis(context string, err error, details map[string]interface{})
// Saves incident to /tank/inou/errors/{id}.json and alerts via Signal

// lib/signal.go
lib.SendSignal(message string) // sends Signal message

// lib/crypto.go
lib.CryptoInit(keyPath string) error
lib.CryptoDecryptBytes(ciphertext []byte) ([]byte, error)
```

## What To Build

### Input

```bash
doc-processor
```

### Output (JSON to stdout)

```json
{
  "title": "MRI Brain Protocol - Fondation Lenval",
  "type": "radiology_report",
  "document_date": "2025-04-14",
  "summary": "MRI protocol describing brain imaging findings, sequences used, and radiologist assessment",
  "tags": ["mri", "brain", "fondation lenval", "dr. dupont"],
  "text": "Full extracted text...",
  "structured_data": null
}
```

### Processing Flow

1. `lib.CryptoInit("/tank/inou/master.key")`
2. `lib.DecryptFile(path)` → PDF bytes
3. PDF to images (each page)
4. Send to vision API (Gemini Flash for now, make provider switchable)
5. Vision API prompt asks for: title, type, document_date, summary, tags, full text
6. For lab reports: also extract structured values in `structured_data`
7. Output JSON

### Document Types

- `consultation` - doctor visit notes
- `radiology_report` - MRI/CT/X-ray reports
- `lab_report` - blood work, biochemistry
- `ultrasound` - ultrasound protocols
- `other` - anything else

### Summary Guidelines

Summary describes WHAT information is in the document, not the findings:

- ❌ "Dr. Smith thinks her leg is broken and plans to fix next month"
- ✅ "Consultation with Dr. Smith about possible leg fracture"

This lets LLMs decide if they need to read the full text.

### Tags

Extract searchable terms: doctor names, body parts, conditions, institutions, dates.

### Document Date

Extract the date FROM the document content (not upload date). Fallback: PDF metadata → file timestamp.

## Test Files

**RESTORED:** 53 files recovered to `/tank/inou/anastasiia-restored/`

- `documents/` - 14 files (consultations, ultrasound protocols, MRI reports)
- `labs/` - 39 files (blood work, biochemistry)

Original filenames recovered by matching file sizes to Signal upload notifications.

Restore tool: `restore/restore.go`

## Vision API

Use Gemini Flash initially. Structure for easy provider switching:

```go
type VisionProvider interface {
	ExtractDocument(images [][]byte) (*ProcessedDoc, error)
}
```

API keys: check how other inou components handle this (likely env vars or config file).

## Entries Schema (for context)

Processed documents become entries:

```
category:  "document"
type:      "radiology_report" (from output)
value:     "MRI Brain Protocol..." (title)
summary:   "MRI protocol describing..." (NEW FIELD - needs migration)
tags:      "mri,brain,fondation lenval"
timestamp: 1744588800 (document_date as epoch)
data:      {"text": "...", "original_filename": "..."}
```

Note: `summary` field needs to be added to entries table:

```sql
ALTER TABLE entries ADD COLUMN summary TEXT;
```

## Error Handling

Use `lib.SendErrorForAnalysis()` for failures - it logs details and sends Signal alert.

## Summary Design (for LLM triage)

The summary field helps LLMs decide whether to fetch full document content. Key insight: describe WHAT information is in the document, not the findings.

**Tested prompt** (see `restore/test_summary.go`):

```json
{
  "document_type": "consultation | lab_report | imaging_report | ...",
  "specialty": "neurology | cardiology | ...",
  "date": "YYYY-MM-DD",
  "patient_age_at_doc": "9 months",
  "institution": "hospital name",
  "provider": "doctor name",
  "topics": ["prematurity", "hydrocephalus", "motor development"],
  "has_recommendations": true,
  "has_measurements": false,
  "has_diagnosis": true,
  "summary": "Neurology consultation for premature infant with hydrocephalus..."
}
```

**Topics** are searchable keywords. **Flags** let LLM skip docs quickly.

Example triage:

- Query "shunt history" → topics include "hydrocephalus" → fetch
- Query "cardiac issues" → specialty is "neurology" → skip

## Don't

- Don't integrate with portal/database yet - just output JSON
- Don't delete files yet - that's for integration phase
- Don't build CLI flags yet - just hardcode Gemini for now
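
## Example Sketch (Go)

As a rough illustration of how the output schema, the `VisionProvider` interface, and the document-type and date rules above could fit together, here is a minimal sketch. It is not the existing `main.go`. The names `stubProvider`, `NormalizeType`, and `DocumentEpoch` are hypothetical and introduced only for illustration; the stub returns canned data instead of calling Gemini, decryption and PDF rendering are left out, and the epoch conversion assumes the document date means midnight UTC.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"time"
)

// ProcessedDoc mirrors the JSON object under "Output (JSON to stdout)".
type ProcessedDoc struct {
	Title          string      `json:"title"`
	Type           string      `json:"type"`
	DocumentDate   string      `json:"document_date"` // YYYY-MM-DD
	Summary        string      `json:"summary"`
	Tags           []string    `json:"tags"`
	Text           string      `json:"text"`
	StructuredData interface{} `json:"structured_data"` // nil encodes as null
}

// VisionProvider is the interface from the "Vision API" section.
type VisionProvider interface {
	ExtractDocument(images [][]byte) (*ProcessedDoc, error)
}

// stubProvider is a hypothetical stand-in for a real Gemini client.
// It ignores its input and returns canned data.
type stubProvider struct{}

func (stubProvider) ExtractDocument(images [][]byte) (*ProcessedDoc, error) {
	return &ProcessedDoc{
		Title:        "MRI Brain Protocol - Fondation Lenval",
		Type:         "radiology_report",
		DocumentDate: "2025-04-14",
		Summary:      "MRI protocol describing brain imaging findings, sequences used, and radiologist assessment",
		Tags:         []string{"mri", "brain", "fondation lenval", "dr. dupont"},
		Text:         "Full extracted text...",
	}, nil
}

// knownTypes lists the values from the "Document Types" section.
var knownTypes = map[string]bool{
	"consultation":     true,
	"radiology_report": true,
	"lab_report":       true,
	"ultrasound":       true,
	"other":            true,
}

// NormalizeType maps any type outside the documented list to "other".
func NormalizeType(t string) string {
	if knownTypes[t] {
		return t
	}
	return "other"
}

// DocumentEpoch converts a YYYY-MM-DD date to Unix seconds.
// Assumption: the date is interpreted as midnight UTC.
func DocumentEpoch(date string) (int64, error) {
	t, err := time.Parse("2006-01-02", date)
	if err != nil {
		return 0, fmt.Errorf("parse document_date %q: %w", date, err)
	}
	return t.Unix(), nil
}

func main() {
	var provider VisionProvider = stubProvider{}

	// A real run would pass one rendered image per PDF page here.
	doc, err := provider.ExtractDocument(nil)
	if err != nil {
		fmt.Fprintln(os.Stderr, "extract:", err)
		os.Exit(1)
	}
	doc.Type = NormalizeType(doc.Type)

	// Reject dates that could not later be stored as an epoch timestamp.
	if _, err := DocumentEpoch(doc.DocumentDate); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}

	enc := json.NewEncoder(os.Stdout)
	enc.SetIndent("", "  ")
	if err := enc.Encode(doc); err != nil {
		fmt.Fprintln(os.Stderr, "encode:", err)
		os.Exit(1)
	}
}
```

Because `main` only depends on the `VisionProvider` interface, a different backend could be swapped in by supplying another type with the same `ExtractDocument` method.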