inou/doc-processor/CLAUDE.md

# doc-processor

Standalone tool to process uploaded medical documents (PDFs) into structured entries.

## Current State

- `main.go` - basic doc processor (decrypts PDF, sends to Gemini, outputs JSON)
- `restore/restore.go` - tool that restored 53 orphaned files for Anastasiia
- `restore/test_summary.go` - tested summary extraction prompt with Gemini

**Next steps:**
1. Integrate summary prompt into main.go output schema
2. Test on more document types (labs, imaging reports)
3. Handle re-upload flow or direct entry creation

## What This Does

```
Encrypted PDF → Decrypt → Extract text (vision API) → Categorize/Summarize → JSON output
```

The output will be used to create entries in the inou database. Once integrated, the original file will be deleted after processing (inou is a data provider for LLMs, not a backup service); the standalone tool does not delete anything yet.

## Context: inou Ecosystem

inou is a medical imaging platform at `~/dev/inou/`. Key components:

- `lib/` - shared library (crypto, db, signal, errors, files)
- `portal/` - web frontend
- `viewer/` - DICOM viewer
- `api/` - API server
- `mcp-client/` - MCP integration for Claude

Production: `/tank/inou/`
Master key: `/tank/inou/master.key`

## Already Built (in lib/)

```go
// lib/files.go
lib.DecryptFile(srcPath string) ([]byte, error)  // decrypts file, alerts on error
lib.EncryptFile(content []byte, destPath string) error

// lib/errors.go
lib.SendErrorForAnalysis(context string, err error, details map[string]interface{})
// Saves incident to /tank/inou/errors/{id}.json and alerts via Signal

// lib/signal.go
lib.SendSignal(message string)  // sends Signal message

// lib/crypto.go
lib.CryptoInit(keyPath string) error
lib.CryptoDecryptBytes(ciphertext []byte) ([]byte, error)
```

## What To Build

### Input

```bash
doc-processor <encrypted-file-path>
```

### Output (JSON to stdout)

```json
{
  "title": "MRI Brain Protocol - Fondation Lenval",
  "type": "radiology_report",
  "document_date": "2025-04-14",
  "summary": "MRI protocol describing brain imaging findings, sequences used, and radiologist assessment",
  "tags": ["mri", "brain", "fondation lenval", "dr. dupont"],
  "text": "Full extracted text...",
  "structured_data": null
}
```
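A minimal sketch of a Go type that could produce this output. `ProcessedDoc` is the name used by the `VisionProvider` interface in this document; the field names, Go types, and the `encodeDoc` helper are assumptions derived from the example JSON, not the contents of `main.go`.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ProcessedDoc mirrors the JSON output shown above.
// Field names and types are assumptions based on the example.
type ProcessedDoc struct {
	Title        string   `json:"title"`
	Type         string   `json:"type"`
	DocumentDate string   `json:"document_date"` // YYYY-MM-DD
	Summary      string   `json:"summary"`
	Tags         []string `json:"tags"`
	Text         string   `json:"text"`
	// StructuredData stays nil (encoded as null) unless lab values were extracted.
	StructuredData json.RawMessage `json:"structured_data"`
}

// encodeDoc (hypothetical helper) renders a document as compact JSON for stdout.
func encodeDoc(doc ProcessedDoc) (string, error) {
	b, err := json.Marshal(doc)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	doc := ProcessedDoc{
		Title:        "MRI Brain Protocol - Fondation Lenval",
		Type:         "radiology_report",
		DocumentDate: "2025-04-14",
		Summary:      "MRI protocol describing brain imaging findings, sequences used, and radiologist assessment",
		Tags:         []string{"mri", "brain", "fondation lenval", "dr. dupont"},
		Text:         "Full extracted text...",
	}
	out, err := encodeDoc(doc)
	if err != nil {
		fmt.Println("encode failed:", err)
		return
	}
	fmt.Println(out)
}
```

Using `json.RawMessage` for `structured_data` keeps the lab-value shape open until it is decided.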

### Processing Flow

1. `lib.CryptoInit("/tank/inou/master.key")`
2. `lib.DecryptFile(path)` → PDF bytes
3. PDF to images (each page)
4. Send to vision API (Gemini Flash for now, make provider switchable)
5. Vision API prompt asks for: title, type, document_date, summary, tags, full text
6. For lab reports: also extract structured values in `structured_data`
7. Output JSON
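
The flow above can be sketched with each step injected as a function, so it can be exercised without real files or API calls. Everything here is an assumption for illustration: the `Pipeline` type, the `Rasterize` step, and `fakeVision` are hypothetical, `ProcessedDoc` is trimmed to two fields, and `lib.CryptoInit` is assumed to have run before `Process` is called.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// ProcessedDoc is trimmed to two fields to keep the sketch short.
type ProcessedDoc struct {
	Title string `json:"title"`
	Type  string `json:"type"`
}

// VisionProvider is the interface from the "Vision API" section.
type VisionProvider interface {
	ExtractDocument(images [][]byte) (*ProcessedDoc, error)
}

var errNoPages = errors.New("pdf produced no page images")

// Pipeline (hypothetical) holds the steps as injectable dependencies.
type Pipeline struct {
	Decrypt   func(path string) ([]byte, error)  // stand-in for lib.DecryptFile
	Rasterize func(pdf []byte) ([][]byte, error) // hypothetical PDF-to-images step
	Vision    VisionProvider
}

// Process runs decrypt, rasterize, and vision extraction, then returns JSON.
func (p Pipeline) Process(path string) ([]byte, error) {
	pdf, err := p.Decrypt(path)
	if err != nil {
		return nil, fmt.Errorf("decrypt %s: %w", path, err)
	}
	pages, err := p.Rasterize(pdf)
	if err != nil {
		return nil, fmt.Errorf("rasterize %s: %w", path, err)
	}
	if len(pages) == 0 {
		return nil, errNoPages
	}
	doc, err := p.Vision.ExtractDocument(pages)
	if err != nil {
		return nil, fmt.Errorf("vision: %w", err)
	}
	return json.Marshal(doc)
}

// fakeVision is a stub provider that makes no API calls.
type fakeVision struct{}

func (fakeVision) ExtractDocument(images [][]byte) (*ProcessedDoc, error) {
	return &ProcessedDoc{Title: fmt.Sprintf("stub (%d pages)", len(images)), Type: "other"}, nil
}

func main() {
	p := Pipeline{
		Decrypt:   func(string) ([]byte, error) { return []byte("%PDF-stub"), nil },
		Rasterize: func([]byte) ([][]byte, error) { return [][]byte{{0x1}, {0x2}}, nil },
		Vision:    fakeVision{},
	}
	out, err := p.Process("example.pdf.enc")
	if err != nil {
		fmt.Println("processing failed:", err)
		return
	}
	fmt.Println(string(out))
}
```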

### Document Types

- `consultation` - doctor visit notes
- `radiology_report` - MRI/CT/X-ray reports
- `lab_report` - blood work, biochemistry
- `ultrasound` - ultrasound protocols
- `other` - anything else
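
Since the vision API returns the type as free text, it may help to coerce it to this list. A minimal sketch, assuming unknown values should fall back to `other`; the constant names and `normalizeType` are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// Document types listed above. Constant names are illustrative.
const (
	TypeConsultation    = "consultation"
	TypeRadiologyReport = "radiology_report"
	TypeLabReport       = "lab_report"
	TypeUltrasound      = "ultrasound"
	TypeOther           = "other"
)

var knownTypes = map[string]bool{
	TypeConsultation:    true,
	TypeRadiologyReport: true,
	TypeLabReport:       true,
	TypeUltrasound:      true,
	TypeOther:           true,
}

// normalizeType lowercases and trims the model's answer and maps
// anything outside the known list to "other".
func normalizeType(raw string) string {
	t := strings.ToLower(strings.TrimSpace(raw))
	if knownTypes[t] {
		return t
	}
	return TypeOther
}

func main() {
	for _, raw := range []string{"lab_report", " Radiology_Report ", "invoice"} {
		fmt.Printf("%q -> %s\n", raw, normalizeType(raw))
	}
}
```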

### Summary Guidelines

Summary describes WHAT information is in the document, not the findings:

❌ "Dr. Smith thinks her leg is broken and plans to fix next month"
✅ "Consultation with Dr. Smith about possible leg fracture"

This lets LLMs decide if they need to read the full text.

### Tags

Extract searchable terms: doctor names, body parts, conditions, institutions, dates.
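
A small sketch of cleaning up tags before output, assuming tags should be lowercase (as in the example output), trimmed, and free of duplicates. `normalizeTags` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"strings"
)

// normalizeTags lowercases and trims each tag, drops empty ones,
// and removes duplicates while keeping first-seen order.
func normalizeTags(raw []string) []string {
	seen := make(map[string]bool)
	out := make([]string, 0, len(raw))
	for _, t := range raw {
		t = strings.ToLower(strings.TrimSpace(t))
		if t == "" || seen[t] {
			continue
		}
		seen[t] = true
		out = append(out, t)
	}
	return out
}

func main() {
	tags := normalizeTags([]string{"MRI", " brain ", "mri", "", "Fondation Lenval", "Dr. Dupont"})
	fmt.Println(strings.Join(tags, ","))
}
```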

### Document Date

Extract the date FROM the document content (not the upload date). If the content has no usable date, fall back to PDF metadata, then to the file timestamp.
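
The fallback order can be sketched as below. This assumes the vision API returns the content date as `YYYY-MM-DD` and that the PDF metadata date and file timestamp are already available as `time.Time` values; `resolveDocumentDate`, `mustDate`, and `noDate` are hypothetical names.

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

const dateLayout = "2006-01-02"

// noDate is the zero time, used to mean "not available".
var noDate time.Time

// mustDate parses a YYYY-MM-DD string and panics on bad input (demo helper).
func mustDate(s string) time.Time {
	t, err := time.Parse(dateLayout, s)
	if err != nil {
		panic(err)
	}
	return t
}

// resolveDocumentDate prefers the date found in the document content,
// then PDF metadata, then the file timestamp. It returns "" if none is usable.
func resolveDocumentDate(contentDate string, pdfMeta, fileMod time.Time) string {
	if t, err := time.Parse(dateLayout, strings.TrimSpace(contentDate)); err == nil {
		return t.Format(dateLayout)
	}
	if !pdfMeta.IsZero() {
		return pdfMeta.UTC().Format(dateLayout)
	}
	if !fileMod.IsZero() {
		return fileMod.UTC().Format(dateLayout)
	}
	return ""
}

func main() {
	meta, mod := mustDate("2025-04-20"), mustDate("2025-05-01")
	fmt.Println(resolveDocumentDate("2025-04-14", meta, mod)) // content date wins
	fmt.Println(resolveDocumentDate("", meta, mod))           // falls back to PDF metadata
	fmt.Println(resolveDocumentDate("", noDate, mod))         // falls back to file timestamp
}
```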

## Test Files

**RESTORED:** 53 files recovered to `/tank/inou/anastasiia-restored/`
- `documents/` - 14 files (consultations, ultrasound protocols, MRI reports)
- `labs/` - 39 files (blood work, biochemistry)

Original filenames recovered by matching file sizes to Signal upload notifications.
Restore tool: `restore/restore.go`

## Vision API

Use Gemini Flash initially. Structure for easy provider switching:

```go
type VisionProvider interface {
    ExtractDocument(images [][]byte) (*ProcessedDoc, error)
}
```

API keys: check how other inou components handle this (likely env vars or config file).

## Entries Schema (for context)

Processed documents become entries:

```
category:  "document"
type:      "radiology_report" (from output)
value:     "MRI Brain Protocol..." (title)
summary:   "MRI protocol describing..." (NEW FIELD - needs migration)
tags:      "mri,brain,fondation lenval"
timestamp: 1744588800 (document_date 2025-04-14 as epoch seconds, UTC)
data:      {"text": "...", "original_filename": "..."}
```

Note: `summary` field needs to be added to entries table:
```sql
ALTER TABLE entries ADD COLUMN summary TEXT;
```
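
A hedged sketch of the mapping from processor output to an entry, for the later integration phase. The `Entry` struct and `toEntry` are hypothetical and only mirror the field list above; interpreting `document_date` as midnight UTC is an assumption.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
	"time"
)

// ProcessedDoc holds the output fields needed for the mapping.
type ProcessedDoc struct {
	Title        string
	Type         string
	DocumentDate string // YYYY-MM-DD
	Summary      string
	Tags         []string
	Text         string
}

// Entry mirrors the entries fields listed above (hypothetical Go shape).
type Entry struct {
	Category  string
	Type      string
	Value     string
	Summary   string
	Tags      string // comma-separated
	Timestamp int64  // epoch seconds
	Data      string // JSON object
}

// toEntry converts a processed document into an entry.
// document_date is interpreted as midnight UTC (assumption).
func toEntry(doc ProcessedDoc, originalFilename string) (Entry, error) {
	t, err := time.Parse("2006-01-02", doc.DocumentDate)
	if err != nil {
		return Entry{}, fmt.Errorf("parse document_date %q: %w", doc.DocumentDate, err)
	}
	data, err := json.Marshal(map[string]string{
		"text":              doc.Text,
		"original_filename": originalFilename,
	})
	if err != nil {
		return Entry{}, err
	}
	return Entry{
		Category:  "document",
		Type:      doc.Type,
		Value:     doc.Title,
		Summary:   doc.Summary,
		Tags:      strings.Join(doc.Tags, ","),
		Timestamp: t.Unix(),
		Data:      string(data),
	}, nil
}

func main() {
	e, err := toEntry(ProcessedDoc{
		Title:        "MRI Brain Protocol - Fondation Lenval",
		Type:         "radiology_report",
		DocumentDate: "2025-04-14",
		Summary:      "MRI protocol describing brain imaging findings",
		Tags:         []string{"mri", "brain", "fondation lenval"},
		Text:         "Full extracted text...",
	}, "example.pdf")
	if err != nil {
		fmt.Println("mapping failed:", err)
		return
	}
	fmt.Printf("%+v\n", e)
}
```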

## Error Handling

Use `lib.SendErrorForAnalysis()` for failures - it logs the details and sends a Signal alert.
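
One way to apply this consistently is to wrap each step so every failure is reported once, with context. A sketch only: `reportFunc` copies the signature of `lib.SendErrorForAnalysis` from the listing in this document, while `runStep`, the context string format, and the detail keys are assumptions.

```go
package main

import (
	"errors"
	"fmt"
)

// reportFunc has the same signature as lib.SendErrorForAnalysis,
// so the real function can be passed in directly.
type reportFunc func(context string, err error, details map[string]interface{})

var errDemo = errors.New("demo failure")

// runStep (hypothetical) runs one pipeline step and reports any failure
// with the step name and file path before returning a wrapped error.
func runStep(report reportFunc, step, path string, fn func() error) error {
	if err := fn(); err != nil {
		report("doc-processor: "+step, err, map[string]interface{}{
			"step": step,
			"file": path,
		})
		return fmt.Errorf("%s: %w", step, err)
	}
	return nil
}

func main() {
	// Prints instead of alerting; the real tool would pass lib.SendErrorForAnalysis.
	report := func(context string, err error, details map[string]interface{}) {
		fmt.Printf("REPORT %s: %v %v\n", context, err, details)
	}
	err := runStep(report, "decrypt", "example.pdf.enc", func() error { return errDemo })
	fmt.Println("returned:", err)
}
```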

## Summary Design (for LLM triage)

The summary field helps LLMs decide whether to fetch full document content. Key insight: describe WHAT information is in the document, not the findings.

**Tested prompt** (see `restore/test_summary.go`):

```json
{
  "document_type": "consultation | lab_report | imaging_report | ...",
  "specialty": "neurology | cardiology | ...",
  "date": "YYYY-MM-DD",
  "patient_age_at_doc": "9 months",
  "institution": "hospital name",
  "provider": "doctor name",
  "topics": ["prematurity", "hydrocephalus", "motor development"],
  "has_recommendations": true,
  "has_measurements": false,
  "has_diagnosis": true,
  "summary": "Neurology consultation for premature infant with hydrocephalus..."
}
```

**Topics** are searchable keywords. **Flags** (`has_recommendations`, `has_measurements`, `has_diagnosis`) let an LLM skip docs quickly.

Example triage:
- Query "shunt history" → topics include "hydrocephalus" → fetch
- Query "cardiac issues" → specialty is "neurology" → skip
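
A sketch of how this metadata could be parsed and pre-filtered mechanically before an LLM looks at it. The struct fields follow the tested prompt's JSON; `Filter` and its `Keep` method are hypothetical. Exact matching like this cannot link "shunt" to "hydrocephalus"; that semantic step is assumed to stay with the LLM.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// SummaryMeta mirrors the JSON shape of the tested prompt.
type SummaryMeta struct {
	DocumentType       string   `json:"document_type"`
	Specialty          string   `json:"specialty"`
	Date               string   `json:"date"`
	PatientAgeAtDoc    string   `json:"patient_age_at_doc"`
	Institution        string   `json:"institution"`
	Provider           string   `json:"provider"`
	Topics             []string `json:"topics"`
	HasRecommendations bool     `json:"has_recommendations"`
	HasMeasurements    bool     `json:"has_measurements"`
	HasDiagnosis       bool     `json:"has_diagnosis"`
	Summary            string   `json:"summary"`
}

const sampleMeta = `{
  "document_type": "consultation",
  "specialty": "neurology",
  "date": "2025-04-14",
  "topics": ["prematurity", "hydrocephalus", "motor development"],
  "has_recommendations": true,
  "has_measurements": false,
  "has_diagnosis": true
}`

func parseSummaryMeta(raw string) (SummaryMeta, error) {
	var m SummaryMeta
	err := json.Unmarshal([]byte(raw), &m)
	return m, err
}

// Filter (hypothetical) is a cheap pre-filter; empty fields match anything.
type Filter struct {
	Specialty        string
	Topic            string
	NeedMeasurements bool
}

// Keep reports whether the document passes the filter.
func (f Filter) Keep(m SummaryMeta) bool {
	if f.Specialty != "" && !strings.EqualFold(f.Specialty, m.Specialty) {
		return false
	}
	if f.NeedMeasurements && !m.HasMeasurements {
		return false
	}
	if f.Topic == "" {
		return true
	}
	for _, t := range m.Topics {
		if strings.EqualFold(t, f.Topic) {
			return true
		}
	}
	return false
}

func main() {
	m, err := parseSummaryMeta(sampleMeta)
	if err != nil {
		fmt.Println("parse failed:", err)
		return
	}
	fmt.Println("hydrocephalus:", Filter{Topic: "hydrocephalus"}.Keep(m))
	fmt.Println("cardiology:", Filter{Specialty: "cardiology"}.Keep(m))
}
```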

## Don't

- Don't integrate with portal/database yet - just output JSON
- Don't delete files yet - that's for integration phase
- Don't build CLI flags yet - just hardcode Gemini for now