Replace OCR with AI vision, SQLite for storage

- Remove Tesseract/OCR dependencies
- Use Claude vision API for document analysis
- Single AI pass: extract text + classify + summarize
- SQLite database for documents and embeddings
- Embeddings storage ready (generation placeholder)
- Full-text search via SQLite
- Updated systemd service to use venv
- Support .env file for API key

commit fb3d5a46b5 (parent 9dac36681c)
```diff
@@ -0,0 +1,4 @@
+venv/
+.env
+__pycache__/
+*.pyc
```
README.md (188 lines changed)

````diff
@@ -1,105 +1,119 @@
-# Document Management System
+# Document Processor
 
-Automated document processing pipeline for scanning, OCR, classification, and indexing.
+AI-powered document management system using Claude vision for extraction and SQLite for storage/search.
 
-## Architecture
+## Features
 
+- **AI Vision Analysis**: Uses Claude to read documents, extract text, classify, and summarize
+- **No OCR dependencies**: Just drop files in inbox, AI handles the rest
+- **SQLite Storage**: Full-text search via SQLite, embeddings ready (placeholder)
+- **Auto-categorization**: Taxes, bills, medical, insurance, legal, financial, etc.
+- **Expense Tracking**: Auto-exports bills/receipts to CSV
+
+## Setup
+
+```bash
+cd ~/dev/doc-processor
+
+# Create/activate venv
+python3 -m venv venv
+source venv/bin/activate
+
+# Install dependencies
+pip install anthropic
+
+# Configure API key (one of these methods):
+# Option 1: Environment variable
+export ANTHROPIC_API_KEY=sk-ant-...
+
+# Option 2: .env file
+echo 'ANTHROPIC_API_KEY=sk-ant-...' > .env
+```
+
+## Usage
+
+```bash
+# Activate venv first
+source ~/dev/doc-processor/venv/bin/activate
+
+# Process all documents in inbox
+python processor.py
+
+# Watch inbox continuously
+python processor.py --watch
+
+# Process single file
+python processor.py --file /path/to/document.pdf
+
+# Search documents
+python search.py "query"
+python search.py -c medical   # By category
+python search.py -t receipt   # By type
+python search.py -s abc123    # Show full document
+python search.py --stats      # Statistics
+python search.py -l           # List all
+```
+
+## Directory Structure
+
 ```
 ~/documents/
-├── inbox/        # Drop documents here (SMB share for scanner)
+├── inbox/        # Drop files here (SMB share for scanner)
-├── store/        # Original files stored by hash
+├── store/        # Original files (hash-named)
 ├── records/      # Markdown records by category
-│   ├── bills/
 │   ├── taxes/
+│   ├── bills/
 │   ├── medical/
-│   ├── expenses/
 │   └── ...
-├── index/        # Search index
+├── index/
-│   └── master.json
+│   ├── master.json      # JSON index
-└── exports/      # CSV exports
+│   └── embeddings.db    # SQLite (documents + embeddings)
-    └── expenses.csv
+└── exports/
+    └── expenses.csv     # Auto-exported expenses
 ```
 
-## How It Works
+## Supported Formats
 
-1. **Drop a document** in `~/documents/inbox/` (via SMB, phone scan, or manually)
+- PDF (converted to image for vision)
-2. **Daemon processes it** (runs every 60 seconds):
+- Images: PNG, JPG, JPEG, GIF, WebP, TIFF, BMP
-   - Extracts text (pdftotext or tesseract OCR)
-   - Classifies document type and category
-   - Extracts key fields (date, vendor, amount)
-   - Stores original file by content hash
-   - Creates markdown record
-   - Updates searchable index
-   - Exports expenses to CSV
-3. **Search** your documents anytime
-
-## Commands
-
-```bash
-# Process inbox manually
-python3 ~/dev/doc-processor/processor.py
-
-# Process single file
-python3 ~/dev/doc-processor/processor.py --file /path/to/doc.pdf
-
-# Watch mode (manual, daemon does this automatically)
-python3 ~/dev/doc-processor/processor.py --watch --interval 30
-
-# Search documents
-python3 ~/dev/doc-processor/search.py "duke energy"
-python3 ~/dev/doc-processor/search.py -c bills      # By category
-python3 ~/dev/doc-processor/search.py -t receipt    # By type
-python3 ~/dev/doc-processor/search.py --stats       # Statistics
-python3 ~/dev/doc-processor/search.py -l            # List all
-python3 ~/dev/doc-processor/search.py -s <doc_id>   # Show full record
-```
-
-## Daemon
-
-```bash
-# Status
-systemctl --user status doc-processor
-
-# Restart
-systemctl --user restart doc-processor
-
-# Logs
-journalctl --user -u doc-processor -f
-```
-
-## Scanner Setup
-
-1. Get a scanner with SMB support (Brother ADS-1700W, Fujitsu ScanSnap, etc.)
-2. Configure scanner to save to SMB share: `\\192.168.1.16\documents\inbox\`
-3. Feed paper, press scan
-4. Documents auto-process within 60 seconds
 
 ## Categories
 
-| Category | Documents |
+- taxes, bills, medical, insurance, legal
-|----------|-----------|
+- financial, expenses, vehicles, home
-| taxes | W-2, 1099, tax returns, IRS forms |
+- personal, contacts, uncategorized
-| bills | Utility bills, invoices |
-| medical | Medical records, prescriptions |
-| insurance | Policies, claims |
-| legal | Contracts, agreements |
-| financial | Bank statements, investments |
-| expenses | Receipts, purchases |
-| vehicles | Registration, maintenance |
-| home | Mortgage, HOA, property |
-| personal | General documents |
-| contacts | Business cards |
-| uncategorized | Unclassified |
 
-## SMB Share Setup
+## Systemd Service
 
-Already configured on james server:
+```bash
-```
+# Install service
-[documents]
+systemctl --user daemon-reload
-path = /home/johan/documents
+systemctl --user enable doc-processor
-browsable = yes
+systemctl --user start doc-processor
-writable = yes
-valid users = scanner, johan
+
+# Check status
+systemctl --user status doc-processor
+journalctl --user -u doc-processor -f
 ```
 
-Scanner user can write to inbox, processed files go to other directories.
+## Requirements
+
+- Python 3.10+
+- `anthropic` Python package
+- `pdftoppm` (poppler-utils) for PDF conversion
+- Anthropic API key
+
+## API Key
+
+The processor looks for the API key in this order:
+
+1. `ANTHROPIC_API_KEY` environment variable
+2. `~/dev/doc-processor/.env` file
+
+## Embeddings
+
+The embedding storage is ready but the generation is a placeholder. Options:
+
+- OpenAI text-embedding-3-small (cheap, good)
+- Voyage AI (optimized for documents)
+- Local sentence-transformers
+
+Currently uses SQLite full-text search which works well for most use cases.
````
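The README's fallback to "SQLite full-text search" can be as simple as substring matching over the `documents` table defined in this commit; a minimal sketch (the `search_documents` helper is hypothetical, not code from the repo, and the schema is copied from the commit's `init_embeddings_db`):

```python
import sqlite3

def search_documents(conn, query):
    """Case-insensitive substring search over summary and full_text."""
    pattern = f"%{query}%"
    cur = conn.execute(
        "SELECT doc_id, category, summary FROM documents "
        "WHERE full_text LIKE ? OR summary LIKE ?",
        (pattern, pattern),
    )
    return cur.fetchall()

# In-memory demo using the documents schema from this commit
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE documents (
        doc_id TEXT PRIMARY KEY, filename TEXT, category TEXT,
        doc_type TEXT, date TEXT, vendor TEXT, amount TEXT,
        summary TEXT, full_text TEXT, processed_at TEXT
    )
""")
conn.execute(
    "INSERT INTO documents (doc_id, category, summary, full_text) "
    "VALUES ('abc123', 'bills', 'Electric bill', 'Duke Energy amount due $123.45')"
)
print(search_documents(conn, "duke"))
```

SQLite's `LIKE` is case-insensitive for ASCII, so "duke" matches "Duke Energy"; for larger corpora an FTS5 virtual table would scale better than `LIKE` scans.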
processor.py (501 lines changed)

````diff
@@ -1,22 +1,31 @@
 #!/usr/bin/env python3
 """
 Document Processor for ~/documents/inbox/
-Watches for new documents, OCRs them, classifies, and files them.
+Uses AI vision (Claude) for document analysis. Stores embeddings in SQLite.
 """
 
 import os
 import sys
 import json
 import hashlib
-import subprocess
 import shutil
 import sqlite3
 import csv
+import base64
+import struct
 from datetime import datetime
 from pathlib import Path
-from typing import Optional, Dict, Any
+from typing import Optional, Dict, Any, List
-import re
 import time
+import argparse
+
+# Try to import anthropic, fail gracefully with helpful message
+try:
+    import anthropic
+except ImportError:
+    print("ERROR: anthropic package not installed")
+    print("Run: cd ~/dev/doc-processor && source venv/bin/activate && pip install anthropic")
+    sys.exit(1)
 
 # Paths
 DOCUMENTS_ROOT = Path.home() / "documents"
@@ -25,6 +34,7 @@ STORE = DOCUMENTS_ROOT / "store"
 RECORDS = DOCUMENTS_ROOT / "records"
 INDEX = DOCUMENTS_ROOT / "index"
 EXPORTS = DOCUMENTS_ROOT / "exports"
+EMBEDDINGS_DB = INDEX / "embeddings.db"
 
 # Categories
 CATEGORIES = [
````
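The path constants above hang off `Path.home()`, and the per-category record folders are created idempotently with `mkdir(parents=True, exist_ok=True)`, so startup is safe to repeat. A self-contained sketch of that layout under a temporary root (the category list here is a subset, for illustration only):

```python
import tempfile
from pathlib import Path

# Mirror the layout from the commit, rooted in a temp dir for the demo
root = Path(tempfile.mkdtemp()) / "documents"
CATEGORIES = ["taxes", "bills", "medical", "uncategorized"]  # illustrative subset

for sub in ["inbox", "store", "index", "exports"]:
    (root / sub).mkdir(parents=True, exist_ok=True)
for cat in CATEGORIES:
    # Idempotent: running this again on every startup is harmless
    (root / "records" / cat).mkdir(parents=True, exist_ok=True)

print(sorted(p.name for p in root.iterdir()))
```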
````diff
@@ -40,149 +50,272 @@ for cat in CATEGORIES:
     (RECORDS / cat).mkdir(parents=True, exist_ok=True)
 
 
+def get_anthropic_client() -> anthropic.Anthropic:
+    """Get Anthropic client, checking for API key."""
+    api_key = os.environ.get("ANTHROPIC_API_KEY")
+    if not api_key:
+        # Try reading from config file
+        config_path = Path.home() / "dev/doc-processor/.env"
+        if config_path.exists():
+            for line in config_path.read_text().splitlines():
+                if line.startswith("ANTHROPIC_API_KEY="):
+                    api_key = line.split("=", 1)[1].strip().strip('"\'')
+                    break
+
+    if not api_key:
+        raise RuntimeError(
+            "ANTHROPIC_API_KEY not set. Either:\n"
+            "  1. Set ANTHROPIC_API_KEY environment variable\n"
+            "  2. Create ~/dev/doc-processor/.env with ANTHROPIC_API_KEY=sk-ant-..."
+        )
+
+    return anthropic.Anthropic(api_key=api_key)
+
+
+def init_embeddings_db():
+    """Initialize SQLite database for embeddings."""
+    conn = sqlite3.connect(EMBEDDINGS_DB)
+    conn.execute("""
+        CREATE TABLE IF NOT EXISTS embeddings (
+            doc_id TEXT PRIMARY KEY,
+            embedding BLOB,
+            text_hash TEXT,
+            created_at TEXT
+        )
+    """)
+    conn.execute("""
+        CREATE TABLE IF NOT EXISTS documents (
+            doc_id TEXT PRIMARY KEY,
+            filename TEXT,
+            category TEXT,
+            doc_type TEXT,
+            date TEXT,
+            vendor TEXT,
+            amount TEXT,
+            summary TEXT,
+            full_text TEXT,
+            processed_at TEXT
+        )
+    """)
+    conn.commit()
+    conn.close()
+
+
 def file_hash(filepath: Path) -> str:
     """SHA256 hash of file contents."""
     h = hashlib.sha256()
     with open(filepath, 'rb') as f:
         for chunk in iter(lambda: f.read(8192), b''):
             h.update(chunk)
-    return h.hexdigest()[:16]  # Short hash for filename
+    return h.hexdigest()[:16]
 
 
-def extract_text_pdf(filepath: Path) -> str:
-    """Extract text from PDF using pdftotext."""
-    try:
-        result = subprocess.run(
-            ['pdftotext', '-layout', str(filepath), '-'],
-            capture_output=True, text=True, timeout=30
-        )
-        text = result.stdout.strip()
-        if len(text) > 50:  # Got meaningful text
-            return text
-    except Exception as e:
-        print(f"pdftotext failed: {e}")
-
-    # Fallback to OCR
-    return ocr_document(filepath)
-
-
-def ocr_document(filepath: Path) -> str:
-    """OCR a document using tesseract."""
-    try:
-        # For PDFs, convert to images first
-        if filepath.suffix.lower() == '.pdf':
-            # Use pdftoppm to convert to images, then OCR
-            result = subprocess.run(
-                ['pdftoppm', '-png', '-r', '300', str(filepath), '/tmp/doc_page'],
-                capture_output=True, timeout=60
-            )
-            # OCR all pages
-            text_parts = []
-            for img in sorted(Path('/tmp').glob('doc_page-*.png')):
-                result = subprocess.run(
-                    ['tesseract', str(img), 'stdout'],
-                    capture_output=True, text=True, timeout=60
-                )
-                text_parts.append(result.stdout)
-                img.unlink()  # Clean up
-            return '\n'.join(text_parts).strip()
-        else:
-            # Direct image OCR
-            result = subprocess.run(
-                ['tesseract', str(filepath), 'stdout'],
-                capture_output=True, text=True, timeout=60
-            )
-            return result.stdout.strip()
-    except Exception as e:
-        print(f"OCR failed: {e}")
-        return ""
-
-
-def extract_text(filepath: Path) -> str:
-    """Extract text from document based on type."""
-    suffix = filepath.suffix.lower()
-
-    if suffix == '.pdf':
-        return extract_text_pdf(filepath)
-    elif suffix in ['.png', '.jpg', '.jpeg', '.tiff', '.tif', '.bmp']:
-        return ocr_document(filepath)
-    elif suffix in ['.txt', '.md']:
-        return filepath.read_text()
-    else:
-        return ""
-
-
-def classify_document(text: str, filename: str) -> Dict[str, Any]:
-    """
-    Classify document based on content.
-    Returns: {category, doc_type, date, vendor, amount, summary}
-    """
-    text_lower = text.lower()
-    result = {
-        "category": "uncategorized",
-        "doc_type": "unknown",
-        "date": None,
-        "vendor": None,
-        "amount": None,
-        "summary": None,
-    }
-
-    # Date extraction (various formats)
-    date_patterns = [
-        r'(\d{1,2}[/-]\d{1,2}[/-]\d{2,4})',
-        r'(\d{4}[/-]\d{1,2}[/-]\d{1,2})',
-        r'((?:jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)[a-z]* \d{1,2},? \d{4})',
-    ]
-    for pattern in date_patterns:
-        match = re.search(pattern, text_lower)
-        if match:
-            result["date"] = match.group(1)
-            break
-
-    # Amount extraction
-    amount_match = re.search(r'\$[\d,]+\.?\d*', text)
-    if amount_match:
-        result["amount"] = amount_match.group(0)
-
-    # Classification rules
-    if any(x in text_lower for x in ['w-2', 'w2', '1099', 'tax return', 'irs', '1040', 'schedule c', 'form 1098']):
-        result["category"] = "taxes"
-        result["doc_type"] = "tax_form"
-    elif any(x in text_lower for x in ['invoice', 'bill', 'amount due', 'payment due', 'account number', 'autopay']):
-        result["category"] = "bills"
-        result["doc_type"] = "bill"
-        # Try to extract vendor
-        vendors = ['duke energy', 'fpl', 'florida power', 'spectrum', 'at&t', 'verizon', 't-mobile', 'comcast', 'xfinity']
-        for v in vendors:
-            if v in text_lower:
-                result["vendor"] = v.title()
-                break
-    elif any(x in text_lower for x in ['patient', 'diagnosis', 'prescription', 'medical', 'physician', 'hospital', 'clinic', 'dr.', 'md']):
-        result["category"] = "medical"
-        result["doc_type"] = "medical_record"
-    elif any(x in text_lower for x in ['policy', 'coverage', 'premium', 'deductible', 'insurance', 'claim']):
-        result["category"] = "insurance"
-        result["doc_type"] = "insurance_doc"
-    elif any(x in text_lower for x in ['agreement', 'contract', 'terms', 'hereby', 'whereas', 'attorney', 'legal']):
-        result["category"] = "legal"
-        result["doc_type"] = "legal_doc"
-    elif any(x in text_lower for x in ['bank', 'statement', 'account', 'balance', 'deposit', 'withdrawal', 'investment', 'portfolio']):
-        result["category"] = "financial"
-        result["doc_type"] = "financial_statement"
-    elif any(x in text_lower for x in ['receipt', 'purchase', 'order', 'subtotal', 'total', 'qty', 'item']):
-        result["category"] = "expenses"
-        result["doc_type"] = "receipt"
-    elif any(x in text_lower for x in ['vin', 'vehicle', 'registration', 'dmv', 'license plate', 'odometer']):
-        result["category"] = "vehicles"
-        result["doc_type"] = "vehicle_doc"
-    elif any(x in text_lower for x in ['mortgage', 'deed', 'property', 'hoa', 'homeowner']):
-        result["category"] = "home"
-        result["doc_type"] = "property_doc"
-
-    # Generate summary (first 200 chars, cleaned)
-    clean_text = ' '.join(text.split())[:200]
-    result["summary"] = clean_text
-
-    return result
+def encode_image_base64(filepath: Path) -> tuple[str, str]:
+    """Encode image/PDF to base64 for API. Returns (base64_data, media_type)."""
+    suffix = filepath.suffix.lower()
+
+    if suffix == '.pdf':
+        # For PDFs, convert first page to PNG using pdftoppm
+        import subprocess
+        result = subprocess.run(
+            ['pdftoppm', '-png', '-f', '1', '-l', '1', '-r', '150', str(filepath), '-'],
+            capture_output=True, timeout=30
+        )
+        if result.returncode == 0:
+            return base64.standard_b64encode(result.stdout).decode('utf-8'), 'image/png'
+        else:
+            raise RuntimeError(f"Failed to convert PDF: {result.stderr.decode()}")
+
+    # Image files
+    media_types = {
+        '.png': 'image/png',
+        '.jpg': 'image/jpeg',
+        '.jpeg': 'image/jpeg',
+        '.gif': 'image/gif',
+        '.webp': 'image/webp',
+    }
+    media_type = media_types.get(suffix, 'image/png')
+
+    with open(filepath, 'rb') as f:
+        return base64.standard_b64encode(f.read()).decode('utf-8'), media_type
+
+
+def analyze_document_with_ai(filepath: Path, client: anthropic.Anthropic) -> Dict[str, Any]:
+    """
+    Use Claude vision to analyze document.
+    Returns: {category, doc_type, date, vendor, amount, summary, full_text}
+    """
+    print(f"  Analyzing with AI...")
+
+    try:
+        image_data, media_type = encode_image_base64(filepath)
+    except Exception as e:
+        print(f"  Failed to encode document: {e}")
+        return {
+            "category": "uncategorized",
+            "doc_type": "unknown",
+            "full_text": f"(Failed to process: {e})",
+            "summary": "Document could not be processed"
+        }
+
+    prompt = """Analyze this document image and extract:
+
+1. **Full Text**: Transcribe ALL visible text from the document, preserving structure where possible.
+
+2. **Classification**: Categorize into exactly ONE of:
+   - taxes (W-2, 1099, tax returns, IRS forms)
+   - bills (utilities, subscriptions, invoices)
+   - medical (health records, prescriptions, lab results)
+   - insurance (policies, claims, coverage docs)
+   - legal (contracts, agreements, legal notices)
+   - financial (bank statements, investment docs)
+   - expenses (receipts, purchase confirmations)
+   - vehicles (registration, maintenance, DMV)
+   - home (mortgage, HOA, property docs)
+   - personal (ID copies, certificates, misc)
+   - contacts (business cards, contact info)
+   - uncategorized (if none fit)
+
+3. **Document Type**: Specific type (e.g., "utility_bill", "receipt", "tax_form_w2", "insurance_policy")
+
+4. **Key Fields**:
+   - date: Document date (YYYY-MM-DD format if possible)
+   - vendor: Company/organization name
+   - amount: Dollar amount if present (e.g., "$123.45")
+
+5. **Summary**: 1-2 sentence description of what this document is.
+
+Respond in JSON format:
+{
+  "category": "...",
+  "doc_type": "...",
+  "date": "...",
+  "vendor": "...",
+  "amount": "...",
+  "summary": "...",
+  "full_text": "..."
+}"""
+
+    try:
+        response = client.messages.create(
+            model="claude-sonnet-4-20250514",
+            max_tokens=4096,
+            messages=[
+                {
+                    "role": "user",
+                    "content": [
+                        {
+                            "type": "image",
+                            "source": {
+                                "type": "base64",
+                                "media_type": media_type,
+                                "data": image_data,
+                            },
+                        },
+                        {
+                            "type": "text",
+                            "text": prompt
+                        }
+                    ],
+                }
+            ],
+        )
+
+        # Parse JSON from response
+        text = response.content[0].text
+
+        # Try to extract JSON from response (handle markdown code blocks)
+        if "```json" in text:
+            text = text.split("```json")[1].split("```")[0]
+        elif "```" in text:
+            text = text.split("```")[1].split("```")[0]
+
+        result = json.loads(text.strip())
+
+        # Validate category
+        if result.get("category") not in CATEGORIES:
+            result["category"] = "uncategorized"
+
+        return result
+
+    except json.JSONDecodeError as e:
+        print(f"  Failed to parse AI response as JSON: {e}")
+        print(f"  Raw response: {text[:500]}")
+        return {
+            "category": "uncategorized",
+            "doc_type": "unknown",
+            "full_text": text,
+            "summary": "AI response could not be parsed"
+        }
+    except Exception as e:
+        print(f"  AI analysis failed: {e}")
+        return {
+            "category": "uncategorized",
+            "doc_type": "unknown",
+            "full_text": f"(AI analysis failed: {e})",
+            "summary": "Document analysis failed"
+        }
+
+
+def generate_embedding(text: str, client: anthropic.Anthropic) -> Optional[List[float]]:
+    """
+    Generate text embedding.
+    Note: As of 2024, Anthropic doesn't have a public embedding API.
+    This is a placeholder - implement with OpenAI, Voyage, or local model.
+
+    For now, returns None and we'll use full-text search in SQLite.
+    """
+    # TODO: Implement with preferred embedding provider
+    # Options:
+    # 1. OpenAI text-embedding-3-small (cheap, good quality)
+    # 2. Voyage AI (good for documents)
+    # 3. Local sentence-transformers
+    return None
+
+
+def store_embedding(doc_id: str, embedding: Optional[List[float]], text: str):
+    """Store embedding in SQLite database."""
+    if embedding is None:
+        return
+
+    conn = sqlite3.connect(EMBEDDINGS_DB)
+
+    # Pack floats as binary blob
+    embedding_blob = struct.pack(f'{len(embedding)}f', *embedding)
+    text_hash = hashlib.sha256(text.encode()).hexdigest()[:16]
+
+    conn.execute("""
+        INSERT OR REPLACE INTO embeddings (doc_id, embedding, text_hash, created_at)
+        VALUES (?, ?, ?, ?)
+    """, (doc_id, embedding_blob, text_hash, datetime.now().isoformat()))
+
+    conn.commit()
+    conn.close()
+
+
+def store_document_metadata(doc_id: str, filename: str, classification: Dict, full_text: str):
+    """Store document metadata in SQLite for full-text search."""
+    conn = sqlite3.connect(EMBEDDINGS_DB)
+
+    conn.execute("""
+        INSERT OR REPLACE INTO documents
+        (doc_id, filename, category, doc_type, date, vendor, amount, summary, full_text, processed_at)
+        VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
+    """, (
+        doc_id,
+        filename,
+        classification.get("category", "uncategorized"),
+        classification.get("doc_type", "unknown"),
+        classification.get("date"),
+        classification.get("vendor"),
+        classification.get("amount"),
+        classification.get("summary"),
+        full_text[:50000],  # Limit text size
+        datetime.now().isoformat()
+    ))
+
+    conn.commit()
+    conn.close()
+
+
 def store_document(filepath: Path, hash_id: str) -> Path:
````
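The response parsing added in `analyze_document_with_ai` strips an optional markdown code fence before calling `json.loads`. That step can be exercised in isolation; `extract_json` below is a hypothetical helper mirroring that logic, not a function from the repo:

```python
import json

def extract_json(text):
    """Parse JSON from a model reply that may wrap it in a markdown fence."""
    if "```json" in text:
        # Keep only what sits between the ```json fence markers
        text = text.split("```json")[1].split("```")[0]
    elif "```" in text:
        text = text.split("```")[1].split("```")[0]
    return json.loads(text.strip())

reply = 'Here is the result:\n```json\n{"category": "bills", "amount": "$42.00"}\n```'
print(extract_json(reply))
```

Bare JSON with no fence passes straight through to `json.loads`, matching the processor's fallback behavior.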
````diff
@@ -194,14 +327,16 @@ def store_document(filepath: Path, hash_id: str) -> Path:
     return store_path
 
 
-def create_record(filepath: Path, hash_id: str, text: str, classification: Dict) -> Path:
+def create_record(filepath: Path, hash_id: str, classification: Dict) -> Path:
     """Create markdown record in appropriate category folder."""
-    cat = classification["category"]
+    cat = classification.get("category", "uncategorized")
     now = datetime.now()
 
     record_name = f"{now.strftime('%Y%m%d')}_{hash_id}.md"
     record_path = RECORDS / cat / record_name
 
+    full_text = classification.get("full_text", "")
+
     content = f"""# Document Record
 
 **ID:** {hash_id}
@@ -225,12 +360,12 @@ def create_record(filepath: Path, hash_id: str, classification: Dict)
 ## Full Text
 
 ```
-{text[:5000]}
+{full_text[:10000]}
 ```
 
 ## Files
 
-- **PDF:** [store/{hash_id}{filepath.suffix}](../../store/{hash_id}{filepath.suffix})
+- **Original:** [store/{hash_id}{filepath.suffix}](../../store/{hash_id}{filepath.suffix})
 """
 
     record_path.write_text(content)
@@ -245,15 +380,22 @@ def update_master_index(hash_id: str, filepath: Path, classification: Dict) -> N
         with open(index_path) as f:
             data = json.load(f)
     else:
-        data = {"version": "1.0", "created": datetime.now().strftime("%Y-%m-%d"), "documents": [], "stats": {"total": 0, "by_type": {}, "by_year": {}}}
+        data = {
+            "version": "2.0",
+            "created": datetime.now().strftime("%Y-%m-%d"),
+            "documents": [],
+            "stats": {"total": 0, "by_type": {}, "by_category": {}}
+        }
 
     doc_entry = {
         "id": hash_id,
         "filename": filepath.name,
-        "category": classification["category"],
+        "category": classification.get("category", "uncategorized"),
         "type": classification.get("doc_type", "unknown"),
         "date": classification.get("date"),
+        "vendor": classification.get("vendor"),
         "amount": classification.get("amount"),
+        "summary": classification.get("summary"),
         "processed": datetime.now().isoformat(),
     }
````
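`update_master_index` keeps its running counters with the `dict.get(key, 0) + 1` idiom, so a counter key is created the first time a type or category is seen. The pattern in isolation (sample values are illustrative):

```python
def bump(counters, key):
    """Increment a counter dict entry, creating it on first sight."""
    counters[key] = counters.get(key, 0) + 1

stats = {"by_type": {}, "by_category": {}}
for doc_type, category in [("receipt", "expenses"), ("bill", "bills"), ("receipt", "expenses")]:
    bump(stats["by_type"], doc_type)
    bump(stats["by_category"], category)

print(stats)
```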
@ -262,9 +404,11 @@ def update_master_index(hash_id: str, filepath: Path, classification: Dict) -> N
|
||||||
data["documents"].append(doc_entry)
|
data["documents"].append(doc_entry)
|
||||||
data["stats"]["total"] = len(data["documents"])
|
data["stats"]["total"] = len(data["documents"])
|
||||||
|
|
||||||
# Update type stats
|
# Update type/category stats
|
||||||
dtype = classification.get("doc_type", "unknown")
|
dtype = classification.get("doc_type", "unknown")
|
||||||
|
cat = classification.get("category", "uncategorized")
|
||||||
data["stats"]["by_type"][dtype] = data["stats"]["by_type"].get(dtype, 0) + 1
|
data["stats"]["by_type"][dtype] = data["stats"]["by_type"].get(dtype, 0) + 1
|
||||||
|
data["stats"]["by_category"][cat] = data["stats"]["by_category"].get(cat, 0) + 1
|
||||||
|
|
||||||
with open(index_path, 'w') as f:
|
with open(index_path, 'w') as f:
|
||||||
json.dump(data, f, indent=2)
|
json.dump(data, f, indent=2)
|
||||||
|
|
@ -272,7 +416,7 @@ def update_master_index(hash_id: str, filepath: Path, classification: Dict) -> N
|
||||||
|
|
||||||
def export_expense(hash_id: str, classification: Dict, filepath: Path) -> None:
|
def export_expense(hash_id: str, classification: Dict, filepath: Path) -> None:
|
||||||
"""Append to expenses.csv if it's an expense/receipt."""
|
"""Append to expenses.csv if it's an expense/receipt."""
|
||||||
if classification["category"] not in ["expenses", "bills"]:
|
if classification.get("category") not in ["expenses", "bills"]:
|
||||||
return
|
return
|
||||||
|
|
||||||
csv_path = EXPORTS / "expenses.csv"
|
csv_path = EXPORTS / "expenses.csv"
|
||||||
|
|
@@ -287,22 +431,22 @@ def export_expense(hash_id: str, classification: Dict, filepath: Path) -> None:
         classification.get("date", ""),
         classification.get("vendor", ""),
         classification.get("amount", ""),
-        classification["category"],
+        classification.get("category", ""),
         classification.get("doc_type", ""),
         hash_id,
         filepath.name,
     ])
 
 
-def process_document(filepath: Path) -> bool:
+def process_document(filepath: Path, client: anthropic.Anthropic) -> bool:
     """Process a single document through the full pipeline."""
     print(f"Processing: {filepath.name}")
 
-    # Skip hidden files and non-documents
+    # Skip hidden files
     if filepath.name.startswith('.'):
         return False
 
-    valid_extensions = {'.pdf', '.png', '.jpg', '.jpeg', '.tiff', '.tif', '.bmp', '.txt'}
+    valid_extensions = {'.pdf', '.png', '.jpg', '.jpeg', '.gif', '.webp', '.tiff', '.tif', '.bmp'}
     if filepath.suffix.lower() not in valid_extensions:
         print(f"  Skipping unsupported format: {filepath.suffix}")
         return False
 
@@ -318,87 +462,98 @@ def process_document(filepath: Path) -> bool:
         filepath.unlink()
         return True
 
-    # 3. Extract text (OCR if needed)
-    print("  Extracting text...")
-    text = extract_text(filepath)
-    if not text:
-        print("  Warning: No text extracted")
-        text = "(No text could be extracted)"
-    else:
-        print(f"  Extracted {len(text)} characters")
+    # 3. Analyze with AI (extracts text + classifies in one pass)
+    classification = analyze_document_with_ai(filepath, client)
+    full_text = classification.get("full_text", "")
+    print(f"  Category: {classification.get('category')}, Type: {classification.get('doc_type')}")
+    print(f"  Extracted {len(full_text)} characters")
 
-    # 4. Classify
-    print("  Classifying...")
-    classification = classify_document(text, filepath.name)
-    print(f"  Category: {classification['category']}, Type: {classification.get('doc_type')}")
-
-    # 5. Store PDF
+    # 4. Store original document
     print("  Storing document...")
     store_document(filepath, hash_id)
 
-    # 6. Create record
+    # 5. Create markdown record
     print("  Creating record...")
-    record_path = create_record(filepath, hash_id, text, classification)
+    record_path = create_record(filepath, hash_id, classification)
     print(f"  Record: {record_path}")
 
-    # 7. Update index
+    # 6. Update JSON index
     print("  Updating index...")
     update_master_index(hash_id, filepath, classification)
 
-    # 8. Export if expense
+    # 7. Store in SQLite (for search)
+    print("  Storing in SQLite...")
+    store_document_metadata(hash_id, filepath.name, classification, full_text)
+
+    # 8. Generate and store embedding (if implemented)
+    embedding = generate_embedding(full_text, client)
+    if embedding:
+        store_embedding(hash_id, embedding, full_text)
+
+    # 9. Export if expense
     export_expense(hash_id, classification, filepath)
 
-    # 9. Remove from inbox
+    # 10. Remove from inbox
     print("  Removing from inbox...")
     filepath.unlink()
 
-    print(f"  ✓ Done: {classification['category']}/{hash_id}")
+    print(f"  ✓ Done: {classification.get('category')}/{hash_id}")
     return True
 
 
-def process_inbox() -> int:
+def process_inbox(client: anthropic.Anthropic) -> int:
     """Process all documents in inbox. Returns count processed."""
     count = 0
-    for filepath in INBOX.iterdir():
+    for filepath in sorted(INBOX.iterdir()):
         if filepath.is_file() and not filepath.name.startswith('.'):
             try:
-                if process_document(filepath):
+                if process_document(filepath, client):
                     count += 1
             except Exception as e:
                 print(f"Error processing {filepath}: {e}")
+                import traceback
+                traceback.print_exc()
     return count
 
 
-def watch_inbox(interval: int = 30) -> None:
+def watch_inbox(client: anthropic.Anthropic, interval: int = 60) -> None:
     """Watch inbox continuously."""
     print(f"Watching {INBOX} (interval: {interval}s)")
     print("Press Ctrl+C to stop")
 
     while True:
-        count = process_inbox()
+        count = process_inbox(client)
         if count:
             print(f"Processed {count} document(s)")
         time.sleep(interval)
 
 
 def main():
-    import argparse
-    parser = argparse.ArgumentParser(description="Document processor")
+    parser = argparse.ArgumentParser(description="AI-powered document processor")
     parser.add_argument("--watch", action="store_true", help="Watch inbox continuously")
-    parser.add_argument("--interval", type=int, default=30, help="Watch interval in seconds")
+    parser.add_argument("--interval", type=int, default=60, help="Watch interval in seconds")
    parser.add_argument("--file", type=Path, help="Process single file")
     args = parser.parse_args()
 
+    # Initialize
+    init_embeddings_db()
+
+    try:
+        client = get_anthropic_client()
+    except RuntimeError as e:
+        print(f"ERROR: {e}")
+        sys.exit(1)
+
     if args.file:
         if args.file.exists():
-            process_document(args.file)
+            process_document(args.file, client)
         else:
             print(f"File not found: {args.file}")
             sys.exit(1)
     elif args.watch:
-        watch_inbox(args.interval)
+        watch_inbox(client, args.interval)
     else:
-        count = process_inbox()
+        count = process_inbox(client)
         print(f"Processed {count} document(s)")
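The pipeline above calls `generate_embedding(full_text, client)` and guards storage with `if embedding:`, while the commit message marks generation as a placeholder. A minimal sketch of what that stub could look like — the body is an assumption, only the call-site contract (text plus client in, optional vector out) comes from the diff:

```python
from typing import List, Optional


def generate_embedding(text: str, client) -> Optional[List[float]]:
    """Placeholder embedding generator.

    Returns None until an embedding backend is chosen. Because
    process_document checks `if embedding:` before calling
    store_embedding, a None return cleanly skips the storage step
    without breaking the rest of the pipeline.
    """
    return None  # swap in a real embeddings call here later
```

Returning `None` (rather than raising) keeps single-file and watch modes working end to end before the backend exists.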
249  search.py
@@ -1,11 +1,13 @@
 #!/usr/bin/env python3
 """
 Search documents in the document management system.
+Uses SQLite full-text search on document content.
 """
 
 import os
 import sys
 import json
+import sqlite3
 import argparse
 from pathlib import Path
 from datetime import datetime
@@ -13,108 +15,184 @@ from datetime import datetime
 DOCUMENTS_ROOT = Path.home() / "documents"
 INDEX = DOCUMENTS_ROOT / "index"
 RECORDS = DOCUMENTS_ROOT / "records"
+EMBEDDINGS_DB = INDEX / "embeddings.db"
 
 
-def load_index() -> dict:
-    """Load the master index."""
-    index_path = INDEX / "master.json"
-    if index_path.exists():
-        with open(index_path) as f:
-            return json.load(f)
-    return {"documents": []}
+def get_db() -> sqlite3.Connection:
+    """Get database connection."""
+    if not EMBEDDINGS_DB.exists():
+        print(f"Database not found: {EMBEDDINGS_DB}")
+        print("Run the processor first to create the database.")
+        sys.exit(1)
+    return sqlite3.connect(EMBEDDINGS_DB)
 
 
-def search_documents(query: str, category: str = None, doc_type: str = None) -> list:
-    """Search documents by query, optionally filtered by category/type."""
-    data = load_index()
-    results = []
-
-    query_lower = query.lower() if query else ""
-
-    for doc in data["documents"]:
-        # Apply filters
-        if category and doc.get("category") != category:
-            continue
-        if doc_type and doc.get("type") != doc_type:
-            continue
-
-        # If no query, return all matching filters
-        if not query:
-            results.append(doc)
-            continue
-
-        # Search in indexed fields
-        searchable = f"{doc.get('filename', '')} {doc.get('category', '')} {doc.get('type', '')} {doc.get('date', '')} {doc.get('amount', '')}".lower()
-        if query_lower in searchable:
-            results.append(doc)
-            continue
-
-        # Search in full text record
-        record_path = find_record(doc["id"], doc["category"])
-        if record_path and record_path.exists():
-            content = record_path.read_text().lower()
-            if query_lower in content:
-                results.append(doc)
+def search_documents(query: str, category: str = None, doc_type: str = None, limit: int = 20) -> list:
+    """
+    Search documents by query using SQLite full-text search.
+    Returns list of matching documents.
+    """
+    conn = get_db()
+    conn.row_factory = sqlite3.Row
+
+    # Build query
+    conditions = []
+    params = []
+
+    if query:
+        # Search in full_text, summary, vendor, filename
+        conditions.append("""(
+            full_text LIKE ? OR
+            summary LIKE ? OR
+            vendor LIKE ? OR
+            filename LIKE ?
+        )""")
+        like_query = f"%{query}%"
+        params.extend([like_query, like_query, like_query, like_query])
+
+    if category:
+        conditions.append("category = ?")
+        params.append(category)
+
+    if doc_type:
+        conditions.append("doc_type = ?")
+        params.append(doc_type)
+
+    where_clause = " AND ".join(conditions) if conditions else "1=1"
+
+    sql = f"""
+        SELECT doc_id, filename, category, doc_type, date, vendor, amount, summary, processed_at
+        FROM documents
+        WHERE {where_clause}
+        ORDER BY processed_at DESC
+        LIMIT ?
+    """
+    params.append(limit)
+
+    cursor = conn.execute(sql, params)
+    results = [dict(row) for row in cursor.fetchall()]
+    conn.close()
 
     return results
 
 
-def find_record(doc_id: str, category: str) -> Path:
-    """Find the record file for a document."""
-    cat_dir = RECORDS / category
-    if cat_dir.exists():
-        for f in cat_dir.iterdir():
-            if doc_id in f.name:
-                return f
-    return None
+def get_document(doc_id: str) -> dict:
+    """Get full document details by ID."""
+    conn = get_db()
+    conn.row_factory = sqlite3.Row
+
+    cursor = conn.execute("""
+        SELECT * FROM documents WHERE doc_id = ? OR doc_id LIKE ?
+    """, (doc_id, f"{doc_id}%"))
+
+    row = cursor.fetchone()
+    conn.close()
+
+    return dict(row) if row else None
+
+
+def list_categories() -> dict:
+    """List all categories with document counts."""
+    conn = get_db()
+    cursor = conn.execute("""
+        SELECT category, COUNT(*) as count
+        FROM documents
+        GROUP BY category
+        ORDER BY count DESC
+    """)
+    results = {row[0]: row[1] for row in cursor.fetchall()}
+    conn.close()
+    return results
+
+
+def list_types() -> dict:
+    """List all document types with counts."""
+    conn = get_db()
+    cursor = conn.execute("""
+        SELECT doc_type, COUNT(*) as count
+        FROM documents
+        GROUP BY doc_type
+        ORDER BY count DESC
+    """)
+    results = {row[0]: row[1] for row in cursor.fetchall()}
+    conn.close()
+    return results
+
+
+def show_stats() -> None:
+    """Show document statistics."""
+    conn = get_db()
+
+    # Total count
+    total = conn.execute("SELECT COUNT(*) FROM documents").fetchone()[0]
+
+    print("\n📊 Document Statistics")
+    print("=" * 40)
+    print(f"Total documents: {total}")
+
+    # By category
+    print("\nBy category:")
+    for cat, count in list_categories().items():
+        print(f"  {cat}: {count}")
+
+    # By type
+    print("\nBy type:")
+    for dtype, count in list_types().items():
+        print(f"  {dtype}: {count}")
+
+    conn.close()
 
 
 def show_document(doc_id: str) -> None:
     """Show full details of a document."""
-    data = load_index()
-
-    for doc in data["documents"]:
-        if doc["id"] == doc_id or doc_id in doc.get("filename", ""):
-            print(f"\n{'='*60}")
-            print(f"Document: {doc['filename']}")
-            print(f"ID: {doc['id']}")
-            print(f"Category: {doc['category']}")
-            print(f"Type: {doc.get('type', 'unknown')}")
-            print(f"Date: {doc.get('date', 'N/A')}")
-            print(f"Amount: {doc.get('amount', 'N/A')}")
-            print(f"Processed: {doc.get('processed', 'N/A')}")
-            print(f"{'='*60}")
-
-            # Show record content
-            record_path = find_record(doc["id"], doc["category"])
-            if record_path:
-                print(f"\nRecord: {record_path}")
-                print("-"*60)
-                print(record_path.read_text())
-            return
-
-    print(f"Document not found: {doc_id}")
+    doc = get_document(doc_id)
+
+    if not doc:
+        print(f"Document not found: {doc_id}")
+        return
+
+    print(f"\n{'=' * 60}")
+    print(f"Document: {doc['filename']}")
+    print(f"ID: {doc['doc_id']}")
+    print(f"Category: {doc['category']}")
+    print(f"Type: {doc['doc_type'] or 'unknown'}")
+    print(f"Date: {doc['date'] or 'N/A'}")
+    print(f"Vendor: {doc['vendor'] or 'N/A'}")
+    print(f"Amount: {doc['amount'] or 'N/A'}")
+    print(f"Processed: {doc['processed_at']}")
+    print(f"{'=' * 60}")
+
+    if doc['summary']:
+        print(f"\nSummary:\n{doc['summary']}")
+
+    if doc['full_text']:
+        print(f"\n--- Full Text (first 2000 chars) ---\n")
+        print(doc['full_text'][:2000])
+        if len(doc['full_text']) > 2000:
+            print(f"\n... [{len(doc['full_text']) - 2000} more characters]")
 
 
-def list_stats() -> None:
-    """Show document statistics."""
-    data = load_index()
-
-    print("\n📊 Document Statistics")
-    print("="*40)
-    print(f"Total documents: {data['stats']['total']}")
-
-    print("\nBy type:")
-    for dtype, count in sorted(data["stats"].get("by_type", {}).items()):
-        print(f"  {dtype}: {count}")
-
-    print("\nBy category:")
-    by_cat = {}
-    for doc in data["documents"]:
-        cat = doc.get("category", "unknown")
-        by_cat[cat] = by_cat.get(cat, 0) + 1
-    for cat, count in sorted(by_cat.items()):
-        print(f"  {cat}: {count}")
+def format_results(results: list) -> None:
+    """Format and print search results."""
+    if not results:
+        print("No documents found")
+        return
+
+    print(f"\nFound {len(results)} document(s):\n")
+
+    # Header
+    print(f"{'ID':<10} {'Category':<12} {'Type':<18} {'Date':<12} {'Amount':<10} {'Filename'}")
+    print("-" * 90)
+
+    for doc in results:
+        doc_id = doc['doc_id'][:8]
+        cat = (doc['category'] or '')[:12]
+        dtype = (doc['doc_type'] or 'unknown')[:18]
+        date = (doc['date'] or '')[:12]
+        amount = (doc['amount'] or '')[:10]
+        filename = doc['filename'][:30]
+
+        print(f"{doc_id:<10} {cat:<12} {dtype:<18} {date:<12} {amount:<10} {filename}")
@@ -125,10 +203,12 @@ def main():
     parser.add_argument("-s", "--show", help="Show full document by ID")
     parser.add_argument("--stats", action="store_true", help="Show statistics")
     parser.add_argument("-l", "--list", action="store_true", help="List all documents")
+    parser.add_argument("-n", "--limit", type=int, default=20, help="Max results (default: 20)")
+    parser.add_argument("--full-text", action="store_true", help="Show full text in results")
     args = parser.parse_args()
 
     if args.stats:
-        list_stats()
+        show_stats()
         return
 
     if args.show:
@@ -136,17 +216,8 @@ def main():
         return
 
     if args.list or args.query or args.category or args.type:
-        results = search_documents(args.query, args.category, args.type)
-
-        if not results:
-            print("No documents found")
-            return
-
-        print(f"\nFound {len(results)} document(s):\n")
-        for doc in results:
-            date = doc.get("date", "")[:10] if doc.get("date") else ""
-            amount = doc.get("amount", "")
-            print(f"  [{doc['id'][:8]}] {doc['category']:12} {doc.get('type', ''):15} {date:12} {amount:10} {doc['filename']}")
+        results = search_documents(args.query, args.category, args.type, args.limit)
+        format_results(results)
     else:
         parser.print_help()
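The rewritten `search_documents` builds a parameterized `LIKE` query rather than using an FTS index. The query shape can be exercised against a throwaway in-memory table — the schema subset and sample row below are illustrative, not from the commit:

```python
import sqlite3

# Minimal subset of the documents schema, in memory.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE documents (
    doc_id TEXT, filename TEXT, category TEXT, doc_type TEXT,
    date TEXT, vendor TEXT, amount TEXT, summary TEXT,
    full_text TEXT, processed_at TEXT)""")
conn.execute(
    "INSERT INTO documents VALUES (?,?,?,?,?,?,?,?,?,?)",
    ("abc12345", "power.pdf", "bills", "utility_bill",
     "2024-01-05", "City Power", "89.10", "January power bill",
     "Account 123 ... total due 89.10", "2024-01-06T10:00:00"))

# Same four-column OR pattern search_documents generates for a query term.
like = "%power%"
rows = conn.execute(
    """SELECT doc_id, vendor FROM documents
       WHERE (full_text LIKE ? OR summary LIKE ? OR vendor LIKE ? OR filename LIKE ?)
       ORDER BY processed_at DESC LIMIT ?""",
    (like, like, like, like, 20),
).fetchall()
print(rows)
```

Note that despite the docstring's "full-text search" wording, `LIKE` does substring matching (case-insensitive for ASCII in SQLite), and the leading `%` wildcard forces a table scan; migrating to an FTS5 virtual table would be the follow-up if the corpus grows.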