# Document Management System
Automated document processing pipeline for scanning, OCR, classification, and indexing.
## Architecture
```
~/documents/
├── inbox/              # Drop documents here (SMB share for scanner)
├── store/              # Original files stored by hash
├── records/            # Markdown records by category
│   ├── bills/
│   ├── taxes/
│   ├── medical/
│   ├── expenses/
│   └── ...
├── index/              # Search index
│   └── master.json
└── exports/            # CSV exports
    └── expenses.csv
```
## How It Works
1. **Drop a document** in `~/documents/inbox/` (via SMB, phone scan, or manually)
2. **Daemon processes it** (runs every 60 seconds):
   - Extracts text (pdftotext or tesseract OCR)
   - Classifies document type and category
   - Extracts key fields (date, vendor, amount)
   - Stores original file by content hash
   - Creates markdown record
   - Updates searchable index
   - Exports expenses to CSV
3. **Search** your documents anytime
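
The "stores original file by content hash" step can be sketched roughly as follows. This is a minimal illustration, not the actual code in `processor.py`: the hash algorithm (SHA-256), the file naming scheme, and the function names `content_hash` and `store_by_hash` are all assumptions.

```python
import hashlib
import shutil
from pathlib import Path


def content_hash(path: Path, chunk_size: int = 65536) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def store_by_hash(src: Path, store_dir: Path) -> Path:
    """Copy src into store_dir under a name derived from its content hash.

    Hypothetical helper: the naming scheme (<sha256><lowercased suffix>)
    is an assumption, not necessarily what the real pipeline uses.
    """
    store_dir.mkdir(parents=True, exist_ok=True)
    dest = store_dir / f"{content_hash(src)}{src.suffix.lower()}"
    # Identical content maps to the same name, so a re-scan is not stored twice.
    if not dest.exists():
        shutil.copy2(src, dest)
    return dest
```

Naming files by their hash means duplicate scans collapse into a single stored original, and the hash can double as a stable document identifier.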
## Commands
```bash
# Process inbox manually
python3 ~/dev/doc-processor/processor.py
# Process single file
python3 ~/dev/doc-processor/processor.py --file /path/to/doc.pdf
# Watch mode (run manually; the daemon already does this automatically)
python3 ~/dev/doc-processor/processor.py --watch --interval 30
# Search documents
python3 ~/dev/doc-processor/search.py "duke energy"
python3 ~/dev/doc-processor/search.py -c bills # By category
python3 ~/dev/doc-processor/search.py -t receipt # By type
python3 ~/dev/doc-processor/search.py --stats # Statistics
python3 ~/dev/doc-processor/search.py -l # List all
python3 ~/dev/doc-processor/search.py -s <doc_id> # Show full record
```
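
For readers curious what `--watch --interval 30` amounts to, a polling loop along these lines would do the job. This is a sketch under assumptions: `pending_files`, `watch`, the `handle` callback, and the `max_cycles` parameter are hypothetical names for illustration, and the real `processor.py` may be structured differently.

```python
import time
from pathlib import Path
from typing import Callable, List, Optional


def pending_files(inbox: Path) -> List[Path]:
    """Return regular, non-hidden files in the inbox, oldest first."""
    files = [p for p in inbox.iterdir()
             if p.is_file() and not p.name.startswith(".")]
    return sorted(files, key=lambda p: p.stat().st_mtime)


def watch(inbox: Path,
          handle: Callable[[Path], None],
          interval: float = 30.0,
          max_cycles: Optional[int] = None) -> int:
    """Poll the inbox every `interval` seconds and pass each file to `handle`.

    `handle` is expected to move or delete the file; otherwise the same
    file is picked up again on the next cycle. `max_cycles` exists only so
    the loop can be stopped in tests; None means run forever.
    """
    processed = 0
    cycles = 0
    while max_cycles is None or cycles < max_cycles:
        for path in pending_files(inbox):
            handle(path)
            processed += 1
        cycles += 1
        if max_cycles is None or cycles < max_cycles:
            time.sleep(interval)
    return processed
```

Polling is simple and works over SMB, where filesystem change notifications are often unreliable; the cost is a delay of up to one interval before a new scan is picked up.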
## Daemon
```bash
# Status
systemctl --user status doc-processor
# Restart
systemctl --user restart doc-processor
# Logs
journalctl --user -u doc-processor -f
```
## Scanner Setup
1. Get a scanner with SMB support (Brother ADS-1700W, Fujitsu ScanSnap, etc.)
2. Configure scanner to save to SMB share: `\\192.168.1.16\documents\inbox\`
3. Feed paper, press scan
4. Documents are processed automatically within 60 seconds
## Categories
| Category | Documents |
|----------|-----------|
| taxes | W-2, 1099, tax returns, IRS forms |
| bills | Utility bills, invoices |
| medical | Medical records, prescriptions |
| insurance | Policies, claims |
| legal | Contracts, agreements |
| financial | Bank statements, investments |
| expenses | Receipts, purchases |
| vehicles | Registration, maintenance |
| home | Mortgage, HOA, property |
| personal | General documents |
| contacts | Business cards |
| uncategorized | Unclassified |
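
As a rough illustration of how text could be mapped to the categories above, here is a naive keyword-count classifier. The keyword lists and the `classify` function are assumptions made up for this example, loosely based on the "Documents" column; the actual classification logic in `processor.py` is not shown here and may work differently.

```python
from typing import Dict, List

# Illustrative keywords only (an assumption, not the pipeline's real rules).
CATEGORY_KEYWORDS: Dict[str, List[str]] = {
    "taxes": ["w-2", "1099", "tax return", "irs"],
    "bills": ["utility", "invoice", "amount due"],
    "medical": ["prescription", "patient", "diagnosis"],
    "insurance": ["policy number", "claim"],
    "expenses": ["receipt", "subtotal"],
}


def classify(text: str) -> str:
    """Return the category with the most keyword hits, else 'uncategorized'.

    Matching is plain substring search on lowercased text, so short
    keywords can match inside longer words; a real classifier would
    want word boundaries or a stronger model.
    """
    lowered = text.lower()
    best, best_hits = "uncategorized", 0
    for category, keywords in CATEGORY_KEYWORDS.items():
        hits = sum(1 for kw in keywords if kw in lowered)
        if hits > best_hits:
            best, best_hits = category, hits
    return best
```

Anything that matches no keyword falls through to `uncategorized`, mirroring the last row of the table.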
## SMB Share Setup
Already configured on the `james` server:
```
[documents]
path = /home/johan/documents
browsable = yes
writable = yes
valid users = scanner, johan
```
The `scanner` user can write to `inbox/`; processed files go to the other directories.