Document Management System

Automated document processing pipeline for scanning, OCR, classification, and indexing.

Architecture

~/documents/
├── inbox/          # Drop documents here (SMB share for scanner)
├── store/          # Original files stored by hash
├── records/        # Markdown records by category
│   ├── bills/
│   ├── taxes/
│   ├── medical/
│   ├── expenses/
│   └── ...
├── index/          # Search index
│   └── master.json
└── exports/        # CSV exports
    └── expenses.csv
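
The store/ directory holds originals under a name derived from their content, so the same scan dropped twice is stored only once. The following is a minimal sketch of that idea, not the actual code in processor.py. The hash algorithm (SHA-256), the `<digest><extension>` naming scheme, and the function name `store_by_hash` are all assumptions made for illustration.

```python
import hashlib
import shutil
from pathlib import Path


def store_by_hash(src: Path, store_dir: Path) -> Path:
    """Copy src into store_dir under a name derived from its SHA-256 digest.

    Hypothetical helper: the real processor may use another hash or layout.
    """
    digest = hashlib.sha256()
    with src.open("rb") as fh:
        # Read in 1 MiB chunks so large scans do not have to fit in memory.
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)

    store_dir.mkdir(parents=True, exist_ok=True)
    dest = store_dir / f"{digest.hexdigest()}{src.suffix.lower()}"
    if not dest.exists():
        # Identical content maps to the same name, so it is copied only once.
        shutil.copy2(src, dest)
    return dest
```

Naming files by digest makes the store idempotent: reprocessing the inbox cannot create duplicates, and a record can refer to its original by hash alone.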

How It Works

  1. Drop a document in ~/documents/inbox/ (via SMB, phone scan, or manually)
  2. Daemon processes it (runs every 60 seconds):
    • Extracts text (pdftotext or tesseract OCR)
    • Classifies document type and category
    • Extracts key fields (date, vendor, amount)
    • Stores original file by content hash
    • Creates markdown record
    • Updates searchable index
    • Exports expenses to CSV
  3. Search your documents anytime
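
The field-extraction step above can be sketched with a few regular expressions. This is an illustration under stated assumptions, not the processor's real logic: it assumes US-style MM/DD/YYYY dates, dollar amounts prefixed with `$`, the largest amount being the total, and the first non-blank line being the vendor. The name `extract_fields` is hypothetical.

```python
import re
from datetime import date

# Assumed formats: MM/DD/YYYY dates and "$1,234.56"-style amounts.
DATE_RE = re.compile(r"\b(\d{1,2})/(\d{1,2})/(\d{4})\b")
AMOUNT_RE = re.compile(r"\$\s*(\d[\d,]*(?:\.\d{2})?)")


def extract_fields(text: str) -> dict:
    """Return date, vendor, and amount guessed from extracted text."""
    fields = {"date": None, "vendor": None, "amount": None}

    match = DATE_RE.search(text)
    if match:
        month, day, year = (int(group) for group in match.groups())
        try:
            fields["date"] = date(year, month, day).isoformat()
        except ValueError:
            pass  # e.g. 13/45/2024 is not a real date; leave it unset

    amounts = [float(a.replace(",", "")) for a in AMOUNT_RE.findall(text)]
    if amounts:
        # Heuristic: treat the largest dollar figure as the total.
        fields["amount"] = max(amounts)

    for line in text.splitlines():
        if line.strip():
            # Heuristic: the first non-blank line is usually the letterhead.
            fields["vendor"] = line.strip()
            break

    return fields
```

Heuristics like these fail on unusual layouts, so fields that cannot be found are left as `None` rather than guessed.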

Commands

# Process inbox manually
python3 ~/dev/doc-processor/processor.py

# Process single file
python3 ~/dev/doc-processor/processor.py --file /path/to/doc.pdf

# Watch mode, run manually (the daemon already does this automatically)
python3 ~/dev/doc-processor/processor.py --watch --interval 30

# Search documents
python3 ~/dev/doc-processor/search.py "duke energy"
python3 ~/dev/doc-processor/search.py -c bills        # By category
python3 ~/dev/doc-processor/search.py -t receipt      # By type
python3 ~/dev/doc-processor/search.py --stats         # Statistics
python3 ~/dev/doc-processor/search.py -l              # List all
python3 ~/dev/doc-processor/search.py -s <doc_id>     # Show full record
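
For orientation, a query against index/master.json could look roughly like the sketch below. The index schema is not documented here, so everything about it is an assumption: the sketch treats the file as a JSON list of objects with hypothetical `id`, `category`, `type`, and `vendor` keys, and the function names are illustrative, not search.py's real API.

```python
import json
from pathlib import Path


def load_index(path: Path) -> list:
    """Load the index, assumed here to be a JSON list of record objects."""
    with path.open(encoding="utf-8") as fh:
        return json.load(fh)


def search(entries, query="", category=None, doc_type=None):
    """Return entries matching the optional filters and a free-text query."""
    needle = query.lower()
    hits = []
    for entry in entries:
        if category and entry.get("category") != category:
            continue
        if doc_type and entry.get("type") != doc_type:
            continue
        # Case-insensitive substring match across all field values.
        haystack = " ".join(str(value) for value in entry.values()).lower()
        if needle in haystack:
            hits.append(entry)
    return hits
```

Under these assumptions, `search(load_index(Path.home() / "documents/index/master.json"), "duke energy", category="bills")` would combine a text query with a category filter.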

Daemon

# Status
systemctl --user status doc-processor

# Restart
systemctl --user restart doc-processor

# Logs
journalctl --user -u doc-processor -f

Scanner Setup

  1. Get a scanner with SMB support (Brother ADS-1700W, Fujitsu ScanSnap, etc.)
  2. Configure scanner to save to SMB share: \\192.168.1.16\documents\inbox\
  3. Feed paper, press scan
  4. Documents are processed automatically within about 60 seconds (the daemon's polling interval)

Categories

Category        Documents
taxes           W-2, 1099, tax returns, IRS forms
bills           Utility bills, invoices
medical         Medical records, prescriptions
insurance       Policies, claims
legal           Contracts, agreements
financial       Bank statements, investments
expenses        Receipts, purchases
vehicles        Registration, maintenance
home            Mortgage, HOA, property
personal        General documents
contacts        Business cards
uncategorized   Unclassified
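
One simple way to map extracted text onto these categories is keyword scoring. The sketch below is an assumption about how such a classifier could work, not a description of the one in processor.py. The keyword lists are illustrative, cover only some categories, and `classify` is a hypothetical name.

```python
import re

# Illustrative keywords only; a real classifier would need a fuller list.
CATEGORY_KEYWORDS = {
    "taxes": ("w-2", "1099", "tax return", "irs"),
    "bills": ("utility", "invoice", "amount due"),
    "medical": ("prescription", "patient", "diagnosis"),
    "insurance": ("policy", "claim"),
    "legal": ("contract", "agreement"),
    "financial": ("bank statement", "investment"),
    "expenses": ("receipt", "purchase"),
}


def classify(text: str) -> str:
    """Return the category whose keywords occur most often in text."""
    lowered = text.lower()
    scores = {}
    for category, keywords in CATEGORY_KEYWORDS.items():
        # Word boundaries stop "irs" from matching inside "first".
        scores[category] = sum(
            len(re.findall(r"\b" + re.escape(keyword) + r"\b", lowered))
            for keyword in keywords
        )
    best = max(scores, key=scores.get)
    # With no keyword hits at all, fall back to the catch-all category.
    return best if scores[best] > 0 else "uncategorized"
```

Ties resolve to whichever category appears first in the dictionary, so more specific categories should be listed before general ones.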

SMB Share Setup

Already configured on the james server:

[documents]
   path = /home/johan/documents
   browsable = yes
   writable = yes
   valid users = scanner, johan

The scanner user can write to inbox/; processed files go to the other directories.