# doc-processor

Standalone tool to process uploaded medical documents (PDFs) into structured entries.

## Current State

- `main.go` - basic doc processor (decrypts PDF, sends to Gemini, outputs JSON)
- `restore/restore.go` - tool that restored 53 orphaned files for Anastasiia
- `restore/test_summary.go` - tested summary extraction prompt with Gemini

**Next steps:**

1. Integrate summary prompt into `main.go` output schema
2. Test on more document types (labs, imaging reports)
3. Handle re-upload flow or direct entry creation

## What This Does

```
Encrypted PDF → Decrypt → Extract text (vision API) → Categorize/Summarize → JSON output
```

The output will be used to create entries in the inou database. Once the tool is integrated, the original file is deleted after processing (inou is a data provider for LLMs, not a backup service); the standalone tool does not delete anything yet.

## Context: inou Ecosystem

inou is a medical imaging platform at `~/dev/inou/`. Key components:

- `lib/` - shared library (crypto, db, signal, errors, files)
- `portal/` - web frontend
- `viewer/` - DICOM viewer
- `api/` - API server
- `mcp-client/` - MCP integration for Claude

Paths:

- Production: `/tank/inou/`
- Master key: `/tank/inou/master.key`

## Already Built (in lib/)

```go
// lib/files.go
lib.DecryptFile(srcPath string) ([]byte, error) // decrypts file, alerts on error
lib.EncryptFile(content []byte, destPath string) error

// lib/errors.go
lib.SendErrorForAnalysis(context string, err error, details map[string]interface{})
// Saves incident to /tank/inou/errors/{id}.json and alerts via Signal

// lib/signal.go
lib.SendSignal(message string) // sends Signal message

// lib/crypto.go
lib.CryptoInit(keyPath string) error
lib.CryptoDecryptBytes(ciphertext []byte) ([]byte, error)
```

## What To Build

### Input

```bash
doc-processor <encrypted-file-path>
```

### Output (JSON to stdout)

```json
{
  "title": "MRI Brain Protocol - Fondation Lenval",
  "type": "radiology_report",
  "document_date": "2025-04-14",
  "summary": "MRI protocol describing brain imaging findings, sequences used, and radiologist assessment",
  "tags": ["mri", "brain", "fondation lenval", "dr. dupont"],
  "text": "Full extracted text...",
  "structured_data": null
}
```

### Processing Flow

1. `lib.CryptoInit("/tank/inou/master.key")`
2. `lib.DecryptFile(path)` → PDF bytes
3. PDF to images (each page)
4. Send to vision API (Gemini Flash for now, make provider switchable)
5. Vision API prompt asks for: title, type, document_date, summary, tags, full text
6. For lab reports: also extract structured values in `structured_data`
7. Output JSON

### Document Types

- `consultation` - doctor visit notes
- `radiology_report` - MRI/CT/X-ray reports
- `lab_report` - blood work, biochemistry
- `ultrasound` - ultrasound protocols
- `other` - anything else

### Summary Guidelines

Summary describes WHAT information is in the document, not the findings:

- ❌ "Dr. Smith thinks her leg is broken and plans to fix next month"
- ✅ "Consultation with Dr. Smith about possible leg fracture"

This lets LLMs decide if they need to read the full text.

### Tags

Extract searchable terms: doctor names, body parts, conditions, institutions, dates.

### Document Date

Extract the date FROM the document content (not the upload date). If the content has no usable date, fall back to PDF metadata, then to the file timestamp.

## Test Files

**RESTORED:** 53 files recovered to `/tank/inou/anastasiia-restored/`

- `documents/` - 14 files (consultations, ultrasound protocols, MRI reports)
- `labs/` - 39 files (blood work, biochemistry)

Original filenames recovered by matching file sizes to Signal upload notifications.

Restore tool: `restore/restore.go`

## Vision API

Use Gemini Flash initially. Structure for easy provider switching:

```go
type VisionProvider interface {
	ExtractDocument(images [][]byte) (*ProcessedDoc, error)
}
```

API keys: check how other inou components handle this (likely env vars or config file).

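A sketch of how provider switching might look around that interface. Everything except `VisionProvider` and the `ProcessedDoc` name is hypothetical: `stubProvider` makes no real API call, `newProvider` is an illustrative factory, and `ProcessedDoc` is trimmed to two fields to keep the example short.

```go
package main

import (
	"errors"
	"fmt"
)

// ProcessedDoc is trimmed to two fields for this sketch.
type ProcessedDoc struct {
	Title string
	Type  string
}

// VisionProvider is the interface from this README.
type VisionProvider interface {
	ExtractDocument(images [][]byte) (*ProcessedDoc, error)
}

// stubProvider is a hypothetical stand-in for a real Gemini client.
// It returns canned data instead of calling any API.
type stubProvider struct{ name string }

func (s stubProvider) ExtractDocument(images [][]byte) (*ProcessedDoc, error) {
	if len(images) == 0 {
		return nil, errors.New("no page images to process")
	}
	title := fmt.Sprintf("stub result from %s (%d pages)", s.name, len(images))
	return &ProcessedDoc{Title: title, Type: "other"}, nil
}

// newProvider maps a provider name to an implementation (hypothetical factory).
// Adding a provider means adding one case here.
func newProvider(name string) (VisionProvider, error) {
	switch name {
	case "gemini":
		return stubProvider{name: "gemini"}, nil
	default:
		return nil, fmt.Errorf("unknown vision provider %q", name)
	}
}

func main() {
	// Hardcoded for now; a flag or config value could replace the literal later.
	provider, err := newProvider("gemini")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	doc, err := provider.ExtractDocument([][]byte{[]byte("page-1"), []byte("page-2")})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(doc.Title)
}
```

Because callers only see `VisionProvider`, the hardcoded `"gemini"` string is the single place that changes when a second provider is added.
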
## Entries Schema (for context)

Processed documents become entries:

```
category: "document"
type: "radiology_report" (from output)
value: "MRI Brain Protocol..." (title)
summary: "MRI protocol describing..." (NEW FIELD - needs migration)
tags: "mri,brain,fondation lenval"
timestamp: 1744588800 (document_date 2025-04-14 as epoch, midnight UTC)
data: {"text": "...", "original_filename": "..."}
```

Note: `summary` field needs to be added to entries table:

```sql
ALTER TABLE entries ADD COLUMN summary TEXT;
```

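The mapping from tool output to entry fields might look like the sketch below. The `Entry` struct and `toEntry` function are hypothetical and cover only the scalar fields listed above; interpreting `document_date` as midnight UTC is an assumption, and the real entries table may differ.

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

// Entry holds the scalar fields from the schema listing above.
// Hypothetical struct; the real table may differ.
type Entry struct {
	Category  string
	Type      string
	Value     string // document title
	Summary   string
	Tags      string // comma-separated
	Timestamp int64  // document_date as Unix epoch seconds
}

// toEntry converts processor output into an Entry.
// document_date is parsed as midnight UTC, which is an assumption.
func toEntry(title, docType, summary, documentDate string, tags []string) (Entry, error) {
	t, err := time.Parse("2006-01-02", documentDate)
	if err != nil {
		return Entry{}, fmt.Errorf("bad document_date %q: %w", documentDate, err)
	}
	return Entry{
		Category:  "document",
		Type:      docType,
		Value:     title,
		Summary:   summary,
		Tags:      strings.Join(tags, ","),
		Timestamp: t.Unix(),
	}, nil
}

func main() {
	e, err := toEntry(
		"MRI Brain Protocol - Fondation Lenval",
		"radiology_report",
		"MRI protocol describing brain imaging findings",
		"2025-04-14",
		[]string{"mri", "brain", "fondation lenval"},
	)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(e.Tags)      // mri,brain,fondation lenval
	fmt.Println(e.Timestamp) // 1744588800
}
```
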
## Error Handling

Use `lib.SendErrorForAnalysis()` for failures - it logs details and sends Signal alert.

## Summary Design (for LLM triage)

The summary field helps LLMs decide whether to fetch full document content. Key insight: describe WHAT information is in the document, not the findings.

**Tested prompt** (see `restore/test_summary.go`):

```json
{
  "document_type": "consultation | lab_report | imaging_report | ...",
  "specialty": "neurology | cardiology | ...",
  "date": "YYYY-MM-DD",
  "patient_age_at_doc": "9 months",
  "institution": "hospital name",
  "provider": "doctor name",
  "topics": ["prematurity", "hydrocephalus", "motor development"],
  "has_recommendations": true,
  "has_measurements": false,
  "has_diagnosis": true,
  "summary": "Neurology consultation for premature infant with hydrocephalus..."
}
```

**Topics** are searchable keywords. **Flags** let LLM skip docs quickly.

Example triage:

- Query "shunt history" → topics include "hydrocephalus" → fetch
- Query "cardiac issues" → specialty is "neurology" → skip

## Don't

- Don't integrate with portal/database yet - just output JSON
- Don't delete files yet - that's for integration phase
- Don't build CLI flags yet - just hardcode Gemini for now