Document Management System
Automated document processing pipeline for scanning, OCR, classification, and indexing.
Architecture
~/documents/
├── inbox/      # Drop documents here (SMB share for scanner)
├── store/      # Original files stored by hash
├── records/    # Markdown records by category
│   ├── bills/
│   ├── taxes/
│   ├── medical/
│   ├── expenses/
│   └── ...
├── index/      # Search index
│   └── master.json
└── exports/    # CSV exports
    └── expenses.csv
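The store/ directory keeps originals "by hash". A minimal sketch of what that could look like follows, assuming SHA-256 as the hash and a `<digest><extension>` file name. Neither detail is confirmed by this README, and `store_by_hash` is a hypothetical helper, not a function from processor.py.

```python
import hashlib
import shutil
from pathlib import Path


def store_by_hash(src: Path, store_dir: Path) -> Path:
    """Copy src into store_dir, named after the SHA-256 of its content.

    Assumption: SHA-256 and the "<digest><ext>" naming are illustrative;
    the real processor may use a different hash or layout.
    """
    digest = hashlib.sha256()
    with src.open("rb") as fh:
        # Read in 1 MiB chunks so large scans are not loaded into memory at once.
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)

    store_dir.mkdir(parents=True, exist_ok=True)
    dest = store_dir / f"{digest.hexdigest()}{src.suffix.lower()}"
    if not dest.exists():
        # Identical content maps to the same name, so duplicates are stored once.
        shutil.copy2(src, dest)
    return dest
```

Naming files by content means that scanning the same page twice does not create a second copy in the store.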
How It Works
- Drop a document in ~/documents/inbox/ (via SMB, phone scan, or manually)
- Daemon processes it (runs every 60 seconds):
  - Extracts text (pdftotext or tesseract OCR)
  - Classifies document type and category
  - Extracts key fields (date, vendor, amount)
  - Stores original file by content hash
  - Creates markdown record
  - Updates searchable index
  - Exports expenses to CSV
- Search your documents anytime
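As an illustration of the "extracts key fields" step, here is a rough sketch that pulls a date and an amount out of OCR text with regular expressions. The patterns, the US-style MM/DD/YYYY date format, and the "largest dollar figure is the total" heuristic are all assumptions made for this example. `extract_fields` is a hypothetical name, and vendor detection is left out because this README does not say how it is done.

```python
import re
from datetime import date
from decimal import Decimal

# Assumption: dates appear as MM/DD/YYYY and amounts carry a leading "$".
DATE_RE = re.compile(r"\b(\d{1,2})/(\d{1,2})/(\d{4})\b")
AMOUNT_RE = re.compile(r"\$\s*(\d[\d,]*(?:\.\d{2})?)")


def extract_fields(text: str) -> dict:
    """Return the first valid date (ISO format) and the largest dollar amount."""
    fields = {"date": None, "amount": None}

    for match in DATE_RE.finditer(text):
        month, day, year = map(int, match.groups())
        try:
            fields["date"] = date(year, month, day).isoformat()
            break
        except ValueError:
            continue  # e.g. 13/45/2024 is not a real date; try the next match

    amounts = [Decimal(a.replace(",", "")) for a in AMOUNT_RE.findall(text)]
    if amounts:
        # Heuristic: on a bill or receipt the total is usually the largest figure.
        fields["amount"] = max(amounts)
    return fields


if __name__ == "__main__":
    sample = "Statement date 03/15/2024\nPrevious balance $12.00\nTotal due $1,234.56"
    print(extract_fields(sample))
```

Amounts are kept as `Decimal` rather than `float` so that currency values survive a round trip to CSV without rounding noise.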
Commands
# Process inbox manually
python3 ~/dev/doc-processor/processor.py
# Process single file
python3 ~/dev/doc-processor/processor.py --file /path/to/doc.pdf
# Watch mode (manual; the daemon already does this automatically)
python3 ~/dev/doc-processor/processor.py --watch --interval 30
# Search documents
python3 ~/dev/doc-processor/search.py "duke energy"
python3 ~/dev/doc-processor/search.py -c bills # By category
python3 ~/dev/doc-processor/search.py -t receipt # By type
python3 ~/dev/doc-processor/search.py --stats # Statistics
python3 ~/dev/doc-processor/search.py -l # List all
python3 ~/dev/doc-processor/search.py -s <doc_id> # Show full record
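The search flags above filter by free text, category, and type. The sketch below shows how such filtering might work over the records in index/master.json. The index schema is not documented here, so the record shape (a JSON list of objects with `category` and `type` keys) and the names `load_index` and `search` are assumptions for illustration only.

```python
import json
from pathlib import Path
from typing import Optional


def load_index(path: Path) -> list:
    """Load the index; assumes it is a JSON list of record objects."""
    with path.open(encoding="utf-8") as fh:
        return json.load(fh)


def search(records: list, query: str = "",
           category: Optional[str] = None,
           doc_type: Optional[str] = None) -> list:
    """Return records matching the optional category/type and containing query."""
    needle = query.casefold()
    hits = []
    for rec in records:
        if category is not None and rec.get("category") != category:
            continue
        if doc_type is not None and rec.get("type") != doc_type:
            continue
        # Case-insensitive substring match across every field value.
        haystack = " ".join(str(value) for value in rec.values()).casefold()
        if needle in haystack:
            hits.append(rec)
    return hits
```

With this shape, `search(records, "duke energy", category="bills")` would correspond roughly to combining a text query with `-c bills`.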
Daemon
# Status
systemctl --user status doc-processor
# Restart
systemctl --user restart doc-processor
# Logs
journalctl --user -u doc-processor -f
Scanner Setup
- Get a scanner with SMB support (Brother ADS-1700W, Fujitsu ScanSnap, etc.)
- Configure scanner to save to SMB share: \\192.168.1.16\documents\inbox\
- Feed paper, press scan
- Documents auto-process within 60 seconds
Categories
| Category | Documents |
|---|---|
| taxes | W-2, 1099, tax returns, IRS forms |
| bills | Utility bills, invoices |
| medical | Medical records, prescriptions |
| insurance | Policies, claims |
| legal | Contracts, agreements |
| financial | Bank statements, investments |
| expenses | Receipts, purchases |
| vehicles | Registration, maintenance |
| home | Mortgage, HOA, property |
| personal | General documents |
| contacts | Business cards |
| uncategorized | Unclassified |
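One simple way to map extracted text onto these categories is keyword scoring. The sketch below is only an illustration: the keyword lists are derived from the table above, cover only some of the categories, and are not the classifier that processor.py actually uses. `CATEGORY_KEYWORDS` and `classify` are hypothetical names.

```python
import re

# Assumption: keywords taken from the "Documents" column above; a real
# classifier would likely need a richer list or a different technique.
CATEGORY_KEYWORDS = {
    "taxes": ("w-2", "1099", "tax return", "irs"),
    "bills": ("utility", "invoice"),
    "medical": ("prescription", "medical record"),
    "insurance": ("policy", "claim"),
    "legal": ("contract", "agreement"),
    "financial": ("bank statement", "investment"),
    "expenses": ("receipt", "purchase"),
    "vehicles": ("registration", "maintenance"),
    "home": ("mortgage", "hoa", "property"),
}


def classify(text: str) -> str:
    """Return the category with the most whole-word keyword hits."""
    lowered = text.casefold()
    scores = {}
    for category, keywords in CATEGORY_KEYWORDS.items():
        scores[category] = sum(
            # \b keeps "irs" from matching inside words such as "first".
            len(re.findall(rf"\b{re.escape(keyword)}\b", lowered))
            for keyword in keywords
        )
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "uncategorized"
```

Text with no keyword hits falls through to `uncategorized`, matching the last row of the table.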
SMB Share Setup
Already configured on the james server:
[documents]
path = /home/johan/documents
browsable = yes
writable = yes
valid users = scanner, johan
The scanner user writes to inbox/; processed files are moved to the other directories.