# Document Management System

Automated document processing pipeline for scanning, OCR, classification, and indexing.

## Architecture

```
~/documents/
├── inbox/       # Drop documents here (SMB share for scanner)
├── store/       # Original files stored by hash
├── records/     # Markdown records by category
│   ├── bills/
│   ├── taxes/
│   ├── medical/
│   ├── expenses/
│   └── ...
├── index/       # Search index
│   └── master.json
└── exports/     # CSV exports
    └── expenses.csv
```
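
To illustrate what "stored by hash" in `store/` could look like, here is a minimal sketch. It assumes SHA-256 as the hash and a `<digest><extension>` naming scheme; neither is confirmed by this README, and `store_by_hash` is a hypothetical helper, not a function from `processor.py`.

```python
import hashlib
import shutil
from pathlib import Path


def store_by_hash(src: Path, store_dir: Path) -> Path:
    """Copy src into store_dir under a name derived from its content hash.

    Assumption: SHA-256 and a "<digest><ext>" file name. The real processor
    may use a different hash or directory layout.
    """
    digest = hashlib.sha256()
    with src.open("rb") as fh:
        # Read in 1 MiB chunks so large scans do not have to fit in memory.
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)

    store_dir.mkdir(parents=True, exist_ok=True)
    dest = store_dir / f"{digest.hexdigest()}{src.suffix.lower()}"
    if not dest.exists():
        # Identical content maps to the same name, so it is stored only once.
        shutil.copy2(src, dest)
    return dest
```

Naming files by their content hash means that scanning the same page twice yields one stored copy, and a record can refer to its original by a stable identifier.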

## How It Works

1. **Drop a document** in `~/documents/inbox/` (via SMB, phone scan, or manually)
2. **Daemon processes it** (runs every 60 seconds):
   - Extracts text (pdftotext or tesseract OCR)
   - Classifies document type and category
   - Extracts key fields (date, vendor, amount)
   - Stores original file by content hash
   - Creates markdown record
   - Updates searchable index
   - Exports expenses to CSV
3. **Search** your documents anytime
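
The field-extraction step above can be sketched roughly as follows. This is an illustration only: the regular expressions, the US-style `MM/DD/YYYY` date assumption, and the heuristics (first non-empty line as vendor, largest dollar figure as amount) are assumptions, and `extract_fields` is a hypothetical name rather than the processor's actual function.

```python
import re
from typing import Optional

# Assumption: dates appear as MM/DD/YYYY and amounts are prefixed with "$".
DATE_RE = re.compile(r"\b(\d{1,2})/(\d{1,2})/(\d{4})\b")
AMOUNT_RE = re.compile(r"\$\s*(\d[\d,]*(?:\.\d{2})?)")


def extract_fields(text: str) -> dict:
    """Pull a date, vendor, and amount out of extracted document text."""
    date: Optional[str] = None
    match = DATE_RE.search(text)
    if match:
        month, day, year = (int(g) for g in match.groups())
        date = f"{year:04d}-{month:02d}-{day:02d}"  # normalize to ISO 8601

    # Heuristic: the first non-empty line is often the letterhead.
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    vendor = lines[0] if lines else None

    # Heuristic: treat the largest dollar figure as the total.
    amounts = [float(a.replace(",", "")) for a in AMOUNT_RE.findall(text)]
    amount = max(amounts) if amounts else None

    return {"date": date, "vendor": vendor, "amount": amount}


sample = "Duke Energy\nStatement date: 03/05/2024\nAmount due: $1,142.17\n"
print(extract_fields(sample))
```

Heuristics like these tend to misfire on noisy OCR output, so a real pipeline would likely keep the raw text alongside the extracted fields for later correction.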

## Commands

```bash
# Process inbox manually
python3 ~/dev/doc-processor/processor.py

# Process single file
python3 ~/dev/doc-processor/processor.py --file /path/to/doc.pdf

# Watch mode (manual; the daemon does this automatically)
python3 ~/dev/doc-processor/processor.py --watch --interval 30

# Search documents
python3 ~/dev/doc-processor/search.py "duke energy"
python3 ~/dev/doc-processor/search.py -c bills        # By category
python3 ~/dev/doc-processor/search.py -t receipt      # By type
python3 ~/dev/doc-processor/search.py --stats         # Statistics
python3 ~/dev/doc-processor/search.py -l              # List all
python3 ~/dev/doc-processor/search.py -s <doc_id>     # Show full record
```
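
For readers curious what `--watch --interval` amounts to, a polling loop along these lines would behave similarly. This is a sketch under assumptions: `watch`, `handle`, and `max_cycles` are hypothetical names, and the real `processor.py` may be structured differently.

```python
import time
from pathlib import Path
from typing import Callable, Optional


def watch(inbox: Path, handle: Callable[[Path], None],
          interval: float = 60.0, max_cycles: Optional[int] = None) -> None:
    """Poll inbox every `interval` seconds and pass each file to `handle`.

    Assumption: `handle` moves or deletes the file once it is processed;
    otherwise the same file would be picked up again on the next cycle.
    `max_cycles` exists only so the loop can be stopped in tests.
    """
    cycles = 0
    while max_cycles is None or cycles < max_cycles:
        for path in sorted(inbox.iterdir()):
            # Skip directories and hidden or temporary files.
            if path.is_file() and not path.name.startswith("."):
                handle(path)
        cycles += 1
        if max_cycles is None or cycles < max_cycles:
            time.sleep(interval)
```

Polling is simple and works over SMB shares, where filesystem change notifications are not always reliable; the trade-off is a delay of up to one interval before a new scan is processed.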

## Daemon

```bash
# Status
systemctl --user status doc-processor

# Restart
systemctl --user restart doc-processor

# Logs
journalctl --user -u doc-processor -f
```

## Scanner Setup

1. Get a scanner with SMB support (Brother ADS-1700W, Fujitsu ScanSnap, etc.)
2. Configure scanner to save to SMB share: `\\192.168.1.16\documents\inbox\`
3. Feed paper, press scan
4. Documents auto-process within 60 seconds

## Categories

| Category | Documents |
|----------|-----------|
| taxes | W-2, 1099, tax returns, IRS forms |
| bills | Utility bills, invoices |
| medical | Medical records, prescriptions |
| insurance | Policies, claims |
| legal | Contracts, agreements |
| financial | Bank statements, investments |
| expenses | Receipts, purchases |
| vehicles | Registration, maintenance |
| home | Mortgage, HOA, property |
| personal | General documents |
| contacts | Business cards |
| uncategorized | Unclassified |
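
This README does not describe how documents are assigned to these categories. Purely as an illustration, a keyword-scoring approach might look like the sketch below; the keyword lists are invented for the example, cover only a few categories, and `classify` is a hypothetical function.

```python
# Assumption: illustrative keywords only; the real classifier may use
# different rules or a different technique entirely.
CATEGORY_KEYWORDS = {
    "taxes": ("w-2", "1099", "tax return", "irs"),
    "bills": ("utility", "invoice", "amount due"),
    "medical": ("prescription", "patient", "diagnosis"),
    "insurance": ("policy", "claim"),
    "expenses": ("receipt", "purchase"),
}


def classify(text: str) -> str:
    """Return the category whose keywords occur most often in the text."""
    lowered = text.lower()
    scores = {
        category: sum(lowered.count(keyword) for keyword in keywords)
        for category, keywords in CATEGORY_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    # Fall back to the catch-all category when nothing matches.
    return best if scores[best] > 0 else "uncategorized"


print(classify("Form 1099-INT from the IRS"))
print(classify("Grocery list"))
```

The `uncategorized` fallback mirrors the last row of the table: anything that matches no rule still gets a record and can be refiled by hand.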

## SMB Share Setup

Already configured on the `james` server:
```
[documents]
path = /home/johan/documents
browsable = yes
writable = yes
valid users = scanner, johan
```

The `scanner` user can write to the inbox; processed files go to other directories.