# Document Management System

Automated document processing pipeline for scanning, OCR, classification, and indexing.

## Architecture

```
~/documents/
├── inbox/           # Drop documents here (SMB share for scanner)
├── store/           # Original files stored by hash
├── records/         # Markdown records by category
│   ├── bills/
│   ├── taxes/
│   ├── medical/
│   ├── expenses/
│   └── ...
├── index/           # Search index
│   └── master.json
└── exports/         # CSV exports
    └── expenses.csv
```

## How It Works

1. **Drop a document** in `~/documents/inbox/` (via SMB, phone scan, or manually)
2. **Daemon processes it** (runs every 60 seconds):
   - Extracts text (pdftotext or tesseract OCR)
   - Classifies document type and category
   - Extracts key fields (date, vendor, amount)
   - Stores original file by content hash
   - Creates markdown record
   - Updates searchable index
   - Exports expenses to CSV
3. **Search** your documents anytime

## Commands

```bash
# Process inbox manually
python3 ~/dev/doc-processor/processor.py

# Process single file
python3 ~/dev/doc-processor/processor.py --file /path/to/doc.pdf

# Watch mode (manual, daemon does this automatically)
python3 ~/dev/doc-processor/processor.py --watch --interval 30

# Search documents
python3 ~/dev/doc-processor/search.py "duke energy"
python3 ~/dev/doc-processor/search.py -c bills      # By category
python3 ~/dev/doc-processor/search.py -t receipt    # By type
python3 ~/dev/doc-processor/search.py --stats       # Statistics
python3 ~/dev/doc-processor/search.py -l            # List all
python3 ~/dev/doc-processor/search.py -s            # Show full record
```

## Daemon

```bash
# Status
systemctl --user status doc-processor

# Restart
systemctl --user restart doc-processor

# Logs
journalctl --user -u doc-processor -f
```

## Scanner Setup

1. Get a scanner with SMB support (Brother ADS-1700W, Fujitsu ScanSnap, etc.)
2. Configure the scanner to save to the SMB share: `\\192.168.1.16\documents\inbox\`
3. Feed paper, press scan
4. Documents auto-process within 60 seconds

## Categories

| Category | Documents |
|----------|-----------|
| taxes | W-2, 1099, tax returns, IRS forms |
| bills | Utility bills, invoices |
| medical | Medical records, prescriptions |
| insurance | Policies, claims |
| legal | Contracts, agreements |
| financial | Bank statements, investments |
| expenses | Receipts, purchases |
| vehicles | Registration, maintenance |
| home | Mortgage, HOA, property |
| personal | General documents |
| contacts | Business cards |
| uncategorized | Unclassified |

## SMB Share Setup

Already configured on the `james` server:

```
[documents]
path = /home/johan/documents
browsable = yes
writable = yes
valid users = scanner, johan
```

The scanner user can write to the inbox; processed files go to other directories.
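
## Sketch: Storing by Content Hash

The "stores original file by content hash" step in How It Works can be illustrated with a minimal sketch. This is not the actual code in `processor.py`: the function names `file_sha256` and `store_by_hash`, the choice of SHA-256, and the `<hash><extension>` file naming are all assumptions made for illustration.

```python
import hashlib
import shutil
from pathlib import Path


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def store_by_hash(src: Path, store_dir: Path) -> Path:
    """Copy src into store_dir under a name derived from its content hash.

    Hypothetical layout: <store_dir>/<sha256><lowercased extension>.
    """
    store_dir.mkdir(parents=True, exist_ok=True)
    dest = store_dir / f"{file_sha256(src)}{src.suffix.lower()}"
    if not dest.exists():  # identical content is stored only once
        shutil.copy2(src, dest)
    return dest
```

Under these assumptions, calling `store_by_hash(Path.home() / "documents/inbox/scan.pdf", Path.home() / "documents/store")` would copy the scan into `store/` under its digest. Naming files by digest means that scanning the same document twice yields a single stored copy, and a record can refer to its original by hash regardless of the name the scanner gave it.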