Replace OCR with AI vision, SQLite for storage

- Remove Tesseract/OCR dependencies
- Use Claude vision API for document analysis
- Single AI pass: extract text + classify + summarize
- SQLite database for documents and embeddings
- Embeddings storage ready (generation placeholder)
- Full-text search via SQLite
- Updated systemd service to use venv
- Support .env file for API key
Johan Jongsma 2026-02-01 17:24:05 +00:00
parent 9dac36681c
commit fb3d5a46b5
4 changed files with 598 additions and 354 deletions

4
.gitignore

@ -0,0 +1,4 @@
venv/
.env
__pycache__/
*.pyc

188
README.md

@ -1,105 +1,119 @@
# Document Management System # Document Processor
Automated document processing pipeline for scanning, OCR, classification, and indexing. AI-powered document management system using Claude vision for extraction and SQLite for storage/search.
## Architecture ## Features
- **AI Vision Analysis**: Uses Claude to read documents, extract text, classify, and summarize
- **No OCR dependencies**: Just drop files in inbox, AI handles the rest
- **SQLite Storage**: Full-text search via SQLite, embeddings ready (placeholder)
- **Auto-categorization**: Taxes, bills, medical, insurance, legal, financial, etc.
- **Expense Tracking**: Auto-exports bills/receipts to CSV
## Setup
```bash
cd ~/dev/doc-processor
# Create/activate venv
python3 -m venv venv
source venv/bin/activate
# Install dependencies
pip install anthropic
# Configure API key (one of these methods):
# Option 1: Environment variable
export ANTHROPIC_API_KEY=sk-ant-...
# Option 2: .env file
echo 'ANTHROPIC_API_KEY=sk-ant-...' > .env
```
## Usage
```bash
# Activate venv first
source ~/dev/doc-processor/venv/bin/activate
# Process all documents in inbox
python processor.py
# Watch inbox continuously
python processor.py --watch
# Process single file
python processor.py --file /path/to/document.pdf
# Search documents
python search.py "query"
python search.py -c medical # By category
python search.py -t receipt # By type
python search.py -s abc123 # Show full document
python search.py --stats # Statistics
python search.py -l # List all
```
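The search commands above map onto a single parameterized query in `search.py`. The following is a minimal sketch of that query-building step, run against an in-memory database; the helper name `build_search` and the sample rows are illustrative assumptions, and the column list is trimmed compared with the real `documents` table.

```python
import sqlite3

def build_search(query=None, category=None, doc_type=None, limit=20):
    """Return (sql, params) for a LIKE-based search over the documents table."""
    conditions, params = [], []
    if query:
        # Substring match across the text-bearing columns.
        conditions.append(
            "(full_text LIKE ? OR summary LIKE ? OR vendor LIKE ? OR filename LIKE ?)"
        )
        params.extend([f"%{query}%"] * 4)
    if category:
        conditions.append("category = ?")
        params.append(category)
    if doc_type:
        conditions.append("doc_type = ?")
        params.append(doc_type)
    where = " AND ".join(conditions) if conditions else "1=1"
    sql = (
        "SELECT doc_id, filename, category, doc_type, summary "
        f"FROM documents WHERE {where} ORDER BY processed_at DESC LIMIT ?"
    )
    params.append(limit)
    return sql, params

# Hypothetical sample data, only for demonstration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE documents (doc_id TEXT PRIMARY KEY, filename TEXT, category TEXT, "
    "doc_type TEXT, vendor TEXT, summary TEXT, full_text TEXT, processed_at TEXT)"
)
conn.executemany(
    "INSERT INTO documents VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
    [
        ("abc123", "scan1.pdf", "bills", "utility_bill", "Duke Energy",
         "Electric bill", "Amount due $120.00", "2026-01-30T10:00:00"),
        ("def456", "scan2.png", "medical", "lab_result", "Clinic",
         "Lab results", "Blood panel", "2026-01-31T10:00:00"),
    ],
)
sql, params = build_search("duke", category="bills")
rows = conn.execute(sql, params).fetchall()
print(rows)
conn.close()
```

Only the values are bound as parameters; the `WHERE` clause itself is assembled from fixed strings, so user input never reaches the SQL text directly.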
## Directory Structure
``` ```
~/documents/ ~/documents/
├── inbox/ # Drop documents here (SMB share for scanner) ├── inbox/ # Drop files here (SMB share for scanner)
├── store/ # Original files stored by hash ├── store/ # Original files (hash-named)
├── records/ # Markdown records by category ├── records/ # Markdown records by category
│ ├── bills/
│ ├── taxes/ │ ├── taxes/
│ ├── bills/
│ ├── medical/ │ ├── medical/
│ ├── expenses/
│ └── ... │ └── ...
├── index/ # Search index ├── index/
│ └── master.json │ ├── master.json # JSON index
└── exports/ # CSV exports │ └── embeddings.db # SQLite (documents + embeddings)
└── expenses.csv └── exports/
└── expenses.csv # Auto-exported expenses
``` ```
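Files in `store/` are named after a short content hash, so the same scan dropped twice resolves to the same ID. A minimal sketch of that naming scheme, assuming the 16-character SHA-256 prefix used by `file_hash` in `processor.py`; the function name `short_file_hash` and the temporary files are illustrative only.

```python
import hashlib
import tempfile
from pathlib import Path

def short_file_hash(path: Path, length: int = 16) -> str:
    """Return the first `length` hex characters of the file's SHA-256 digest."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in chunks so large scans are not loaded into memory at once.
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()[:length]

with tempfile.TemporaryDirectory() as tmp:
    a = Path(tmp) / "scan1.pdf"
    b = Path(tmp) / "copy_of_scan1.pdf"
    a.write_bytes(b"same bytes")
    b.write_bytes(b"same bytes")
    id_a, id_b = short_file_hash(a), short_file_hash(b)
    # The stored name keeps the original extension.
    store_name = f"{id_a}{a.suffix}"
    print(id_a == id_b, store_name)
```

Truncating to 16 hex characters keeps filenames short at the cost of a (very small) collision risk compared with the full digest.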
## How It Works ## Supported Formats
1. **Drop a document** in `~/documents/inbox/` (via SMB, phone scan, or manually) - PDF (converted to image for vision)
2. **Daemon processes it** (runs every 60 seconds): - Images: PNG, JPG, JPEG, GIF, WebP, TIFF, BMP
- Extracts text (pdftotext or tesseract OCR)
- Classifies document type and category
- Extracts key fields (date, vendor, amount)
- Stores original file by content hash
- Creates markdown record
- Updates searchable index
- Exports expenses to CSV
3. **Search** your documents anytime
## Commands
```bash
# Process inbox manually
python3 ~/dev/doc-processor/processor.py
# Process single file
python3 ~/dev/doc-processor/processor.py --file /path/to/doc.pdf
# Watch mode (manual, daemon does this automatically)
python3 ~/dev/doc-processor/processor.py --watch --interval 30
# Search documents
python3 ~/dev/doc-processor/search.py "duke energy"
python3 ~/dev/doc-processor/search.py -c bills # By category
python3 ~/dev/doc-processor/search.py -t receipt # By type
python3 ~/dev/doc-processor/search.py --stats # Statistics
python3 ~/dev/doc-processor/search.py -l # List all
python3 ~/dev/doc-processor/search.py -s <doc_id> # Show full record
```
## Daemon
```bash
# Status
systemctl --user status doc-processor
# Restart
systemctl --user restart doc-processor
# Logs
journalctl --user -u doc-processor -f
```
## Scanner Setup
1. Get a scanner with SMB support (Brother ADS-1700W, Fujitsu ScanSnap, etc.)
2. Configure scanner to save to SMB share: `\\192.168.1.16\documents\inbox\`
3. Feed paper, press scan
4. Documents auto-process within 60 seconds
## Categories ## Categories
| Category | Documents | - taxes, bills, medical, insurance, legal
|----------|-----------| - financial, expenses, vehicles, home
| taxes | W-2, 1099, tax returns, IRS forms | - personal, contacts, uncategorized
| bills | Utility bills, invoices |
| medical | Medical records, prescriptions |
| insurance | Policies, claims |
| legal | Contracts, agreements |
| financial | Bank statements, investments |
| expenses | Receipts, purchases |
| vehicles | Registration, maintenance |
| home | Mortgage, HOA, property |
| personal | General documents |
| contacts | Business cards |
| uncategorized | Unclassified |
## SMB Share Setup ## Systemd Service
Already configured on james server: ```bash
``` # Install service
[documents] systemctl --user daemon-reload
path = /home/johan/documents systemctl --user enable doc-processor
browsable = yes systemctl --user start doc-processor
writable = yes
valid users = scanner, johan # Check status
systemctl --user status doc-processor
journalctl --user -u doc-processor -f
``` ```
Scanner user can write to inbox, processed files go to other directories. ## Requirements
- Python 3.10+
- `anthropic` Python package
- `pdftoppm` (poppler-utils) for PDF conversion
- Anthropic API key
## API Key
The processor looks for the API key in this order:
1. `ANTHROPIC_API_KEY` environment variable
2. `~/dev/doc-processor/.env` file
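The lookup order can be sketched as a small function. This is a simplified illustration of the logic in `get_anthropic_client`, not the function itself: the name `resolve_api_key` and its parameters are assumptions, and the environment is passed in as a mapping so the example does not depend on the real process environment.

```python
import tempfile
from pathlib import Path
from typing import Mapping, Optional

def resolve_api_key(env: Mapping[str, str], env_file: Path) -> Optional[str]:
    """Return the API key from the environment, else from a .env file, else None."""
    key = env.get("ANTHROPIC_API_KEY")
    if key:
        return key
    if env_file.exists():
        for line in env_file.read_text().splitlines():
            if line.startswith("ANTHROPIC_API_KEY="):
                # Take everything after the first "=" and drop surrounding quotes.
                value = line.split("=", 1)[1].strip().strip("\"'")
                return value or None
    return None

with tempfile.TemporaryDirectory() as tmp:
    env_file = Path(tmp) / ".env"
    env_file.write_text('ANTHROPIC_API_KEY="sk-ant-example"\n')
    from_env = resolve_api_key({"ANTHROPIC_API_KEY": "sk-ant-from-env"}, env_file)
    from_file = resolve_api_key({}, env_file)
    missing = resolve_api_key({}, Path(tmp) / "absent.env")
    print(from_env, from_file, missing)
```

The environment variable wins when both are present, which makes it easy to override the `.env` value for a single run.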
## Embeddings
The embedding storage is ready but the generation is a placeholder. Options:
- OpenAI text-embedding-3-small (cheap, good)
- Voyage AI (optimized for documents)
- Local sentence-transformers
Search currently relies on SQLite `LIKE` substring matching over the stored text rather than on embeddings, which works well for most use cases.
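Once an embedding provider is chosen, vectors are stored as binary blobs in the `embeddings` table. A minimal round-trip sketch, assuming the same `struct.pack` float layout that `store_embedding` uses; `pack_embedding`, `unpack_embedding`, and the sample vector are illustrative names and data, and the table here has fewer columns than the real one.

```python
import sqlite3
import struct
from typing import List

def pack_embedding(vec: List[float]) -> bytes:
    """Pack floats as 32-bit values, 4 bytes each."""
    return struct.pack(f"{len(vec)}f", *vec)

def unpack_embedding(blob: bytes) -> List[float]:
    """Inverse of pack_embedding: recover the float list from the blob."""
    count = len(blob) // struct.calcsize("f")
    return list(struct.unpack(f"{count}f", blob))

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE embeddings (doc_id TEXT PRIMARY KEY, embedding BLOB)")

# Values chosen to be exactly representable as 32-bit floats.
vec = [0.5, -1.25, 2.0]
conn.execute(
    "INSERT OR REPLACE INTO embeddings (doc_id, embedding) VALUES (?, ?)",
    ("abc123", pack_embedding(vec)),
)
blob = conn.execute(
    "SELECT embedding FROM embeddings WHERE doc_id = ?", ("abc123",)
).fetchone()[0]
restored = unpack_embedding(blob)
conn.close()
print(len(blob), restored)
```

Because the values are stored as 32-bit floats, arbitrary Python floats lose some precision on the way back; that is usually acceptable for similarity search.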


@ -1,22 +1,31 @@
#!/usr/bin/env python3 #!/usr/bin/env python3
""" """
Document Processor for ~/documents/inbox/ Document Processor for ~/documents/inbox/
Watches for new documents, OCRs them, classifies, and files them. Uses AI vision (Claude) for document analysis. Stores embeddings in SQLite.
""" """
import os import os
import sys import sys
import json import json
import hashlib import hashlib
import subprocess
import shutil import shutil
import sqlite3 import sqlite3
import csv import csv
import base64
import struct
from datetime import datetime from datetime import datetime
from pathlib import Path from pathlib import Path
from typing import Optional, Dict, Any from typing import Optional, Dict, Any, List
import re
import time import time
import argparse
# Try to import anthropic, fail gracefully with helpful message
try:
import anthropic
except ImportError:
print("ERROR: anthropic package not installed")
print("Run: cd ~/dev/doc-processor && source venv/bin/activate && pip install anthropic")
sys.exit(1)
# Paths # Paths
DOCUMENTS_ROOT = Path.home() / "documents" DOCUMENTS_ROOT = Path.home() / "documents"
@ -25,6 +34,7 @@ STORE = DOCUMENTS_ROOT / "store"
RECORDS = DOCUMENTS_ROOT / "records" RECORDS = DOCUMENTS_ROOT / "records"
INDEX = DOCUMENTS_ROOT / "index" INDEX = DOCUMENTS_ROOT / "index"
EXPORTS = DOCUMENTS_ROOT / "exports" EXPORTS = DOCUMENTS_ROOT / "exports"
EMBEDDINGS_DB = INDEX / "embeddings.db"
# Categories # Categories
CATEGORIES = [ CATEGORIES = [
@ -40,149 +50,272 @@ for cat in CATEGORIES:
(RECORDS / cat).mkdir(parents=True, exist_ok=True) (RECORDS / cat).mkdir(parents=True, exist_ok=True)
def get_anthropic_client() -> anthropic.Anthropic:
"""Get Anthropic client, checking for API key."""
api_key = os.environ.get("ANTHROPIC_API_KEY")
if not api_key:
# Try reading from config file
config_path = Path.home() / "dev/doc-processor/.env"
if config_path.exists():
for line in config_path.read_text().splitlines():
if line.startswith("ANTHROPIC_API_KEY="):
api_key = line.split("=", 1)[1].strip().strip('"\'')
break
if not api_key:
raise RuntimeError(
"ANTHROPIC_API_KEY not set. Either:\n"
" 1. Set ANTHROPIC_API_KEY environment variable\n"
" 2. Create ~/dev/doc-processor/.env with ANTHROPIC_API_KEY=sk-ant-..."
)
return anthropic.Anthropic(api_key=api_key)
def init_embeddings_db():
"""Initialize SQLite database for embeddings."""
conn = sqlite3.connect(EMBEDDINGS_DB)
conn.execute("""
CREATE TABLE IF NOT EXISTS embeddings (
doc_id TEXT PRIMARY KEY,
embedding BLOB,
text_hash TEXT,
created_at TEXT
)
""")
conn.execute("""
CREATE TABLE IF NOT EXISTS documents (
doc_id TEXT PRIMARY KEY,
filename TEXT,
category TEXT,
doc_type TEXT,
date TEXT,
vendor TEXT,
amount TEXT,
summary TEXT,
full_text TEXT,
processed_at TEXT
)
""")
conn.commit()
conn.close()
def file_hash(filepath: Path) -> str: def file_hash(filepath: Path) -> str:
"""SHA256 hash of file contents.""" """SHA256 hash of file contents."""
h = hashlib.sha256() h = hashlib.sha256()
with open(filepath, 'rb') as f: with open(filepath, 'rb') as f:
for chunk in iter(lambda: f.read(8192), b''): for chunk in iter(lambda: f.read(8192), b''):
h.update(chunk) h.update(chunk)
return h.hexdigest()[:16] # Short hash for filename return h.hexdigest()[:16]
def extract_text_pdf(filepath: Path) -> str: def encode_image_base64(filepath: Path) -> tuple[str, str]:
"""Extract text from PDF using pdftotext.""" """Encode image/PDF to base64 for API. Returns (base64_data, media_type)."""
try:
result = subprocess.run(
['pdftotext', '-layout', str(filepath), '-'],
capture_output=True, text=True, timeout=30
)
text = result.stdout.strip()
if len(text) > 50: # Got meaningful text
return text
except Exception as e:
print(f"pdftotext failed: {e}")
# Fallback to OCR
return ocr_document(filepath)
def ocr_document(filepath: Path) -> str:
"""OCR a document using tesseract."""
try:
# For PDFs, convert to images first
if filepath.suffix.lower() == '.pdf':
# Use pdftoppm to convert to images, then OCR
result = subprocess.run(
['pdftoppm', '-png', '-r', '300', str(filepath), '/tmp/doc_page'],
capture_output=True, timeout=60
)
# OCR all pages
text_parts = []
for img in sorted(Path('/tmp').glob('doc_page-*.png')):
result = subprocess.run(
['tesseract', str(img), 'stdout'],
capture_output=True, text=True, timeout=60
)
text_parts.append(result.stdout)
img.unlink() # Clean up
return '\n'.join(text_parts).strip()
else:
# Direct image OCR
result = subprocess.run(
['tesseract', str(filepath), 'stdout'],
capture_output=True, text=True, timeout=60
)
return result.stdout.strip()
except Exception as e:
print(f"OCR failed: {e}")
return ""
def extract_text(filepath: Path) -> str:
"""Extract text from document based on type."""
suffix = filepath.suffix.lower() suffix = filepath.suffix.lower()
if suffix == '.pdf': if suffix == '.pdf':
return extract_text_pdf(filepath) # For PDFs, convert first page to PNG using pdftoppm
elif suffix in ['.png', '.jpg', '.jpeg', '.tiff', '.tif', '.bmp']: import subprocess
return ocr_document(filepath) result = subprocess.run(
elif suffix in ['.txt', '.md']: ['pdftoppm', '-png', '-f', '1', '-l', '1', '-r', '150', str(filepath), '-'],
return filepath.read_text() capture_output=True, timeout=30
else: )
return "" if result.returncode == 0:
return base64.standard_b64encode(result.stdout).decode('utf-8'), 'image/png'
else:
def classify_document(text: str, filename: str) -> Dict[str, Any]: raise RuntimeError(f"Failed to convert PDF: {result.stderr.decode()}")
"""
Classify document based on content. # Image files
Returns: {category, doc_type, date, vendor, amount, summary} media_types = {
""" '.png': 'image/png',
text_lower = text.lower() '.jpg': 'image/jpeg',
result = { '.jpeg': 'image/jpeg',
"category": "uncategorized", '.gif': 'image/gif',
"doc_type": "unknown", '.webp': 'image/webp',
"date": None,
"vendor": None,
"amount": None,
"summary": None,
} }
media_type = media_types.get(suffix, 'image/png')
# Date extraction (various formats) with open(filepath, 'rb') as f:
date_patterns = [ return base64.standard_b64encode(f.read()).decode('utf-8'), media_type
r'(\d{1,2}[/-]\d{1,2}[/-]\d{2,4})',
r'(\d{4}[/-]\d{1,2}[/-]\d{1,2})',
r'((?:jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)[a-z]* \d{1,2},? \d{4})', def analyze_document_with_ai(filepath: Path, client: anthropic.Anthropic) -> Dict[str, Any]:
] """
for pattern in date_patterns: Use Claude vision to analyze document.
match = re.search(pattern, text_lower) Returns: {category, doc_type, date, vendor, amount, summary, full_text}
if match: """
result["date"] = match.group(1) print(f" Analyzing with AI...")
break
# Amount extraction try:
amount_match = re.search(r'\$[\d,]+\.?\d*', text) image_data, media_type = encode_image_base64(filepath)
if amount_match: except Exception as e:
result["amount"] = amount_match.group(0) print(f" Failed to encode document: {e}")
return {
"category": "uncategorized",
"doc_type": "unknown",
"full_text": f"(Failed to process: {e})",
"summary": "Document could not be processed"
}
# Classification rules prompt = """Analyze this document image and extract:
if any(x in text_lower for x in ['w-2', 'w2', '1099', 'tax return', 'irs', '1040', 'schedule c', 'form 1098']):
result["category"] = "taxes" 1. **Full Text**: Transcribe ALL visible text from the document, preserving structure where possible.
result["doc_type"] = "tax_form"
elif any(x in text_lower for x in ['invoice', 'bill', 'amount due', 'payment due', 'account number', 'autopay']): 2. **Classification**: Categorize into exactly ONE of:
result["category"] = "bills" - taxes (W-2, 1099, tax returns, IRS forms)
result["doc_type"] = "bill" - bills (utilities, subscriptions, invoices)
# Try to extract vendor - medical (health records, prescriptions, lab results)
vendors = ['duke energy', 'fpl', 'florida power', 'spectrum', 'at&t', 'verizon', 't-mobile', 'comcast', 'xfinity'] - insurance (policies, claims, coverage docs)
for v in vendors: - legal (contracts, agreements, legal notices)
if v in text_lower: - financial (bank statements, investment docs)
result["vendor"] = v.title() - expenses (receipts, purchase confirmations)
break - vehicles (registration, maintenance, DMV)
elif any(x in text_lower for x in ['patient', 'diagnosis', 'prescription', 'medical', 'physician', 'hospital', 'clinic', 'dr.', 'md']): - home (mortgage, HOA, property docs)
result["category"] = "medical" - personal (ID copies, certificates, misc)
result["doc_type"] = "medical_record" - contacts (business cards, contact info)
elif any(x in text_lower for x in ['policy', 'coverage', 'premium', 'deductible', 'insurance', 'claim']): - uncategorized (if none fit)
result["category"] = "insurance"
result["doc_type"] = "insurance_doc" 3. **Document Type**: Specific type (e.g., "utility_bill", "receipt", "tax_form_w2", "insurance_policy")
elif any(x in text_lower for x in ['agreement', 'contract', 'terms', 'hereby', 'whereas', 'attorney', 'legal']):
result["category"] = "legal" 4. **Key Fields**:
result["doc_type"] = "legal_doc" - date: Document date (YYYY-MM-DD format if possible)
elif any(x in text_lower for x in ['bank', 'statement', 'account', 'balance', 'deposit', 'withdrawal', 'investment', 'portfolio']): - vendor: Company/organization name
result["category"] = "financial" - amount: Dollar amount if present (e.g., "$123.45")
result["doc_type"] = "financial_statement"
elif any(x in text_lower for x in ['receipt', 'purchase', 'order', 'subtotal', 'total', 'qty', 'item']): 5. **Summary**: 1-2 sentence description of what this document is.
result["category"] = "expenses"
result["doc_type"] = "receipt" Respond in JSON format:
elif any(x in text_lower for x in ['vin', 'vehicle', 'registration', 'dmv', 'license plate', 'odometer']): {
result["category"] = "vehicles" "category": "...",
result["doc_type"] = "vehicle_doc" "doc_type": "...",
elif any(x in text_lower for x in ['mortgage', 'deed', 'property', 'hoa', 'homeowner']): "date": "...",
result["category"] = "home" "vendor": "...",
result["doc_type"] = "property_doc" "amount": "...",
"summary": "...",
"full_text": "..."
}"""
try:
response = client.messages.create(
model="claude-sonnet-4-20250514",
max_tokens=4096,
messages=[
{
"role": "user",
"content": [
{
"type": "image",
"source": {
"type": "base64",
"media_type": media_type,
"data": image_data,
},
},
{
"type": "text",
"text": prompt
}
],
}
],
)
# Parse JSON from response
text = response.content[0].text
# Try to extract JSON from response (handle markdown code blocks)
if "```json" in text:
text = text.split("```json")[1].split("```")[0]
elif "```" in text:
text = text.split("```")[1].split("```")[0]
result = json.loads(text.strip())
# Validate category
if result.get("category") not in CATEGORIES:
result["category"] = "uncategorized"
return result
except json.JSONDecodeError as e:
print(f" Failed to parse AI response as JSON: {e}")
print(f" Raw response: {text[:500]}")
return {
"category": "uncategorized",
"doc_type": "unknown",
"full_text": text,
"summary": "AI response could not be parsed"
}
except Exception as e:
print(f" AI analysis failed: {e}")
return {
"category": "uncategorized",
"doc_type": "unknown",
"full_text": f"(AI analysis failed: {e})",
"summary": "Document analysis failed"
}
def generate_embedding(text: str, client: anthropic.Anthropic) -> Optional[List[float]]:
"""
Placeholder for text embedding generation.
Note: As of 2024, Anthropic doesn't have a public embedding API,
so this must be implemented with OpenAI, Voyage, or a local model.
# Generate summary (first 200 chars, cleaned) For now, returns None and we'll use full-text search in SQLite.
clean_text = ' '.join(text.split())[:200] """
result["summary"] = clean_text # TODO: Implement with preferred embedding provider
# Options:
# 1. OpenAI text-embedding-3-small (cheap, good quality)
# 2. Voyage AI (good for documents)
# 3. Local sentence-transformers
return None
def store_embedding(doc_id: str, embedding: Optional[List[float]], text: str):
"""Store embedding in SQLite database."""
if embedding is None:
return
return result conn = sqlite3.connect(EMBEDDINGS_DB)
# Pack floats as binary blob
embedding_blob = struct.pack(f'{len(embedding)}f', *embedding)
text_hash = hashlib.sha256(text.encode()).hexdigest()[:16]
conn.execute("""
INSERT OR REPLACE INTO embeddings (doc_id, embedding, text_hash, created_at)
VALUES (?, ?, ?, ?)
""", (doc_id, embedding_blob, text_hash, datetime.now().isoformat()))
conn.commit()
conn.close()
def store_document_metadata(doc_id: str, filename: str, classification: Dict, full_text: str):
"""Store document metadata in SQLite for full-text search."""
conn = sqlite3.connect(EMBEDDINGS_DB)
conn.execute("""
INSERT OR REPLACE INTO documents
(doc_id, filename, category, doc_type, date, vendor, amount, summary, full_text, processed_at)
VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
""", (
doc_id,
filename,
classification.get("category", "uncategorized"),
classification.get("doc_type", "unknown"),
classification.get("date"),
classification.get("vendor"),
classification.get("amount"),
classification.get("summary"),
full_text[:50000], # Limit text size
datetime.now().isoformat()
))
conn.commit()
conn.close()
def store_document(filepath: Path, hash_id: str) -> Path: def store_document(filepath: Path, hash_id: str) -> Path:
@ -194,14 +327,16 @@ def store_document(filepath: Path, hash_id: str) -> Path:
return store_path return store_path
def create_record(filepath: Path, hash_id: str, text: str, classification: Dict) -> Path: def create_record(filepath: Path, hash_id: str, classification: Dict) -> Path:
"""Create markdown record in appropriate category folder.""" """Create markdown record in appropriate category folder."""
cat = classification["category"] cat = classification.get("category", "uncategorized")
now = datetime.now() now = datetime.now()
record_name = f"{now.strftime('%Y%m%d')}_{hash_id}.md" record_name = f"{now.strftime('%Y%m%d')}_{hash_id}.md"
record_path = RECORDS / cat / record_name record_path = RECORDS / cat / record_name
full_text = classification.get("full_text", "")
content = f"""# Document Record content = f"""# Document Record
**ID:** {hash_id} **ID:** {hash_id}
@ -225,12 +360,12 @@ def create_record(filepath: Path, hash_id: str, text: str, classification: Dict)
## Full Text ## Full Text
``` ```
{text[:5000]} {full_text[:10000]}
``` ```
## Files ## Files
- **PDF:** [store/{hash_id}{filepath.suffix}](../../store/{hash_id}{filepath.suffix}) - **Original:** [store/{hash_id}{filepath.suffix}](../../store/{hash_id}{filepath.suffix})
""" """
record_path.write_text(content) record_path.write_text(content)
@ -245,15 +380,22 @@ def update_master_index(hash_id: str, filepath: Path, classification: Dict) -> N
with open(index_path) as f: with open(index_path) as f:
data = json.load(f) data = json.load(f)
else: else:
data = {"version": "1.0", "created": datetime.now().strftime("%Y-%m-%d"), "documents": [], "stats": {"total": 0, "by_type": {}, "by_year": {}}} data = {
"version": "2.0",
"created": datetime.now().strftime("%Y-%m-%d"),
"documents": [],
"stats": {"total": 0, "by_type": {}, "by_category": {}}
}
doc_entry = { doc_entry = {
"id": hash_id, "id": hash_id,
"filename": filepath.name, "filename": filepath.name,
"category": classification["category"], "category": classification.get("category", "uncategorized"),
"type": classification.get("doc_type", "unknown"), "type": classification.get("doc_type", "unknown"),
"date": classification.get("date"), "date": classification.get("date"),
"vendor": classification.get("vendor"),
"amount": classification.get("amount"), "amount": classification.get("amount"),
"summary": classification.get("summary"),
"processed": datetime.now().isoformat(), "processed": datetime.now().isoformat(),
} }
@ -262,9 +404,11 @@ def update_master_index(hash_id: str, filepath: Path, classification: Dict) -> N
data["documents"].append(doc_entry) data["documents"].append(doc_entry)
data["stats"]["total"] = len(data["documents"]) data["stats"]["total"] = len(data["documents"])
# Update type stats # Update type/category stats
dtype = classification.get("doc_type", "unknown") dtype = classification.get("doc_type", "unknown")
cat = classification.get("category", "uncategorized")
data["stats"]["by_type"][dtype] = data["stats"]["by_type"].get(dtype, 0) + 1 data["stats"]["by_type"][dtype] = data["stats"]["by_type"].get(dtype, 0) + 1
data["stats"]["by_category"][cat] = data["stats"]["by_category"].get(cat, 0) + 1
with open(index_path, 'w') as f: with open(index_path, 'w') as f:
json.dump(data, f, indent=2) json.dump(data, f, indent=2)
@ -272,7 +416,7 @@ def update_master_index(hash_id: str, filepath: Path, classification: Dict) -> N
def export_expense(hash_id: str, classification: Dict, filepath: Path) -> None: def export_expense(hash_id: str, classification: Dict, filepath: Path) -> None:
"""Append to expenses.csv if it's an expense/receipt.""" """Append to expenses.csv if it's an expense/receipt."""
if classification["category"] not in ["expenses", "bills"]: if classification.get("category") not in ["expenses", "bills"]:
return return
csv_path = EXPORTS / "expenses.csv" csv_path = EXPORTS / "expenses.csv"
@ -287,22 +431,22 @@ def export_expense(hash_id: str, classification: Dict, filepath: Path) -> None:
classification.get("date", ""), classification.get("date", ""),
classification.get("vendor", ""), classification.get("vendor", ""),
classification.get("amount", ""), classification.get("amount", ""),
classification["category"], classification.get("category", ""),
classification.get("doc_type", ""), classification.get("doc_type", ""),
hash_id, hash_id,
filepath.name, filepath.name,
]) ])
def process_document(filepath: Path) -> bool: def process_document(filepath: Path, client: anthropic.Anthropic) -> bool:
"""Process a single document through the full pipeline.""" """Process a single document through the full pipeline."""
print(f"Processing: {filepath.name}") print(f"Processing: {filepath.name}")
# Skip hidden files and non-documents # Skip hidden files
if filepath.name.startswith('.'): if filepath.name.startswith('.'):
return False return False
valid_extensions = {'.pdf', '.png', '.jpg', '.jpeg', '.tiff', '.tif', '.bmp', '.txt'} valid_extensions = {'.pdf', '.png', '.jpg', '.jpeg', '.gif', '.webp', '.tiff', '.tif', '.bmp'}
if filepath.suffix.lower() not in valid_extensions: if filepath.suffix.lower() not in valid_extensions:
print(f" Skipping unsupported format: {filepath.suffix}") print(f" Skipping unsupported format: {filepath.suffix}")
return False return False
@ -318,87 +462,98 @@ def process_document(filepath: Path) -> bool:
filepath.unlink() filepath.unlink()
return True return True
# 3. Extract text (OCR if needed) # 3. Analyze with AI (extracts text + classifies in one pass)
print(" Extracting text...") classification = analyze_document_with_ai(filepath, client)
text = extract_text(filepath) full_text = classification.get("full_text", "")
if not text: print(f" Category: {classification.get('category')}, Type: {classification.get('doc_type')}")
print(" Warning: No text extracted") print(f" Extracted {len(full_text)} characters")
text = "(No text could be extracted)"
else:
print(f" Extracted {len(text)} characters")
# 4. Classify # 4. Store original document
print(" Classifying...")
classification = classify_document(text, filepath.name)
print(f" Category: {classification['category']}, Type: {classification.get('doc_type')}")
# 5. Store PDF
print(" Storing document...") print(" Storing document...")
store_document(filepath, hash_id) store_document(filepath, hash_id)
# 6. Create record # 5. Create markdown record
print(" Creating record...") print(" Creating record...")
record_path = create_record(filepath, hash_id, text, classification) record_path = create_record(filepath, hash_id, classification)
print(f" Record: {record_path}") print(f" Record: {record_path}")
# 7. Update index # 6. Update JSON index
print(" Updating index...") print(" Updating index...")
update_master_index(hash_id, filepath, classification) update_master_index(hash_id, filepath, classification)
# 8. Export if expense # 7. Store in SQLite (for search)
print(" Storing in SQLite...")
store_document_metadata(hash_id, filepath.name, classification, full_text)
# 8. Generate and store embedding (if implemented)
embedding = generate_embedding(full_text, client)
if embedding:
store_embedding(hash_id, embedding, full_text)
# 9. Export if expense
export_expense(hash_id, classification, filepath) export_expense(hash_id, classification, filepath)
# 9. Remove from inbox # 10. Remove from inbox
print(" Removing from inbox...") print(" Removing from inbox...")
filepath.unlink() filepath.unlink()
print(f" ✓ Done: {classification['category']}/{hash_id}") print(f" ✓ Done: {classification.get('category')}/{hash_id}")
return True return True
def process_inbox() -> int: def process_inbox(client: anthropic.Anthropic) -> int:
"""Process all documents in inbox. Returns count processed.""" """Process all documents in inbox. Returns count processed."""
count = 0 count = 0
for filepath in INBOX.iterdir(): for filepath in sorted(INBOX.iterdir()):
if filepath.is_file() and not filepath.name.startswith('.'): if filepath.is_file() and not filepath.name.startswith('.'):
try: try:
if process_document(filepath): if process_document(filepath, client):
count += 1 count += 1
except Exception as e: except Exception as e:
print(f"Error processing {filepath}: {e}") print(f"Error processing {filepath}: {e}")
import traceback
traceback.print_exc()
return count return count
def watch_inbox(interval: int = 30) -> None: def watch_inbox(client: anthropic.Anthropic, interval: int = 60) -> None:
"""Watch inbox continuously.""" """Watch inbox continuously."""
print(f"Watching {INBOX} (interval: {interval}s)") print(f"Watching {INBOX} (interval: {interval}s)")
print("Press Ctrl+C to stop") print("Press Ctrl+C to stop")
while True: while True:
count = process_inbox() count = process_inbox(client)
if count: if count:
print(f"Processed {count} document(s)") print(f"Processed {count} document(s)")
time.sleep(interval) time.sleep(interval)
def main(): def main():
import argparse parser = argparse.ArgumentParser(description="AI-powered document processor")
parser = argparse.ArgumentParser(description="Document processor")
parser.add_argument("--watch", action="store_true", help="Watch inbox continuously") parser.add_argument("--watch", action="store_true", help="Watch inbox continuously")
parser.add_argument("--interval", type=int, default=30, help="Watch interval in seconds") parser.add_argument("--interval", type=int, default=60, help="Watch interval in seconds")
parser.add_argument("--file", type=Path, help="Process single file") parser.add_argument("--file", type=Path, help="Process single file")
args = parser.parse_args() args = parser.parse_args()
# Initialize
init_embeddings_db()
try:
client = get_anthropic_client()
except RuntimeError as e:
print(f"ERROR: {e}")
sys.exit(1)
if args.file: if args.file:
if args.file.exists(): if args.file.exists():
process_document(args.file) process_document(args.file, client)
else: else:
print(f"File not found: {args.file}") print(f"File not found: {args.file}")
sys.exit(1) sys.exit(1)
elif args.watch: elif args.watch:
watch_inbox(args.interval) watch_inbox(client, args.interval)
else: else:
count = process_inbox() count = process_inbox(client)
print(f"Processed {count} document(s)") print(f"Processed {count} document(s)")
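The response handling in `analyze_document_with_ai` (strip an optional Markdown code fence, parse JSON, validate the category, fall back on failure) can be isolated into a helper that is testable without calling the API. The sketch below is one possible shape, not the commit's code: `parse_model_json`, `FENCE`, and `ALLOWED_CATEGORIES` are hypothetical names, and the category set is copied from the README list.

```python
import json
import re
from typing import Any, Dict

# Build the fence marker programmatically so this example can sit inside a fenced block.
FENCE = "`" * 3
FENCE_RE = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL)

# Assumed to mirror the categories listed in the README.
ALLOWED_CATEGORIES = {
    "taxes", "bills", "medical", "insurance", "legal", "financial",
    "expenses", "vehicles", "home", "personal", "contacts", "uncategorized",
}

def parse_model_json(text: str) -> Dict[str, Any]:
    """Extract a JSON object from a model reply, tolerating a Markdown code fence."""
    match = FENCE_RE.search(text)
    payload = match.group(1) if match else text
    try:
        result = json.loads(payload.strip())
    except json.JSONDecodeError:
        result = None
    if not isinstance(result, dict):
        # Keep the raw reply so nothing is lost when parsing fails.
        return {
            "category": "uncategorized",
            "doc_type": "unknown",
            "full_text": text,
            "summary": "AI response could not be parsed",
        }
    if result.get("category") not in ALLOWED_CATEGORIES:
        result["category"] = "uncategorized"
    return result

fenced = FENCE + 'json\n{"category": "bills", "doc_type": "utility_bill"}\n' + FENCE
print(parse_model_json(fenced))
print(parse_model_json('{"category": "nonsense"}'))
print(parse_model_json("not json at all"))
```

One limitation to keep in mind: if the transcribed `full_text` itself contains a code fence, the lazy match stops at the first closing fence and parsing falls through to the fallback record.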

257
search.py

@ -1,11 +1,13 @@
#!/usr/bin/env python3 #!/usr/bin/env python3
""" """
Search documents in the document management system. Search documents in the document management system.
Uses SQLite full-text search on document content.
""" """
import os import os
import sys import sys
import json import json
import sqlite3
import argparse import argparse
from pathlib import Path from pathlib import Path
from datetime import datetime from datetime import datetime
@ -13,108 +15,184 @@ from datetime import datetime
DOCUMENTS_ROOT = Path.home() / "documents" DOCUMENTS_ROOT = Path.home() / "documents"
INDEX = DOCUMENTS_ROOT / "index" INDEX = DOCUMENTS_ROOT / "index"
RECORDS = DOCUMENTS_ROOT / "records" RECORDS = DOCUMENTS_ROOT / "records"
EMBEDDINGS_DB = INDEX / "embeddings.db"
def load_index() -> dict: def get_db() -> sqlite3.Connection:
"""Load the master index.""" """Get database connection."""
index_path = INDEX / "master.json" if not EMBEDDINGS_DB.exists():
if index_path.exists(): print(f"Database not found: {EMBEDDINGS_DB}")
with open(index_path) as f: print("Run the processor first to create the database.")
return json.load(f) sys.exit(1)
return {"documents": []} return sqlite3.connect(EMBEDDINGS_DB)
def search_documents(query: str, category: str = None, doc_type: str = None) -> list: def search_documents(query: str, category: str = None, doc_type: str = None, limit: int = 20) -> list:
"""Search documents by query, optionally filtered by category/type.""" """
data = load_index() Search documents by query using SQLite full-text search.
results = [] Returns list of matching documents.
"""
conn = get_db()
conn.row_factory = sqlite3.Row
query_lower = query.lower() if query else "" # Build query
conditions = []
params = []
for doc in data["documents"]: if query:
# Apply filters # Search in full_text, summary, vendor, filename
if category and doc.get("category") != category: conditions.append("""(
continue full_text LIKE ? OR
if doc_type and doc.get("type") != doc_type: summary LIKE ? OR
continue vendor LIKE ? OR
filename LIKE ?
# If no query, return all matching filters )""")
if not query: like_query = f"%{query}%"
results.append(doc) params.extend([like_query, like_query, like_query, like_query])
continue
if category:
# Search in indexed fields conditions.append("category = ?")
searchable = f"{doc.get('filename', '')} {doc.get('category', '')} {doc.get('type', '')} {doc.get('date', '')} {doc.get('amount', '')}".lower() params.append(category)
if query_lower in searchable:
results.append(doc) if doc_type:
continue conditions.append("doc_type = ?")
params.append(doc_type)
# Search in full text record
record_path = find_record(doc["id"], doc["category"]) where_clause = " AND ".join(conditions) if conditions else "1=1"
if record_path and record_path.exists():
content = record_path.read_text().lower() sql = f"""
if query_lower in content: SELECT doc_id, filename, category, doc_type, date, vendor, amount, summary, processed_at
results.append(doc) FROM documents
WHERE {where_clause}
ORDER BY processed_at DESC
LIMIT ?
"""
params.append(limit)
cursor = conn.execute(sql, params)
results = [dict(row) for row in cursor.fetchall()]
conn.close()
return results return results
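Review note: the new `search_documents` places the raw query inside a `LIKE` pattern, so a `%` or `_` typed by the user acts as a wildcard rather than a literal character. The following is a minimal sketch of literal matching with an `ESCAPE` clause. It assumes a simplified `documents` table, and `escape_like` and `search` are hypothetical helpers that are not part of this commit.

```python
import sqlite3


def escape_like(term: str, escape: str = "\\") -> str:
    """Escape LIKE wildcards so the term is matched literally."""
    return (
        term.replace(escape, escape + escape)  # escape the escape character first
        .replace("%", escape + "%")
        .replace("_", escape + "_")
    )


# Hypothetical, simplified schema used only for this illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (doc_id TEXT, filename TEXT, summary TEXT)")
conn.executemany(
    "INSERT INTO documents VALUES (?, ?, ?)",
    [
        ("a1", "tax_2025.pdf", "100% deductible"),
        ("b2", "tax-2025.pdf", "1000 deductible"),
    ],
)


def search(term: str) -> list:
    """Return doc_ids whose summary or filename contains the literal term."""
    pattern = f"%{escape_like(term)}%"
    rows = conn.execute(
        """
        SELECT doc_id FROM documents
        WHERE summary LIKE ? ESCAPE '\\' OR filename LIKE ? ESCAPE '\\'
        ORDER BY doc_id
        """,
        (pattern, pattern),
    )
    return [row[0] for row in rows]


print(search("100%"))
print(search("tax_"))
```

Without the escaping step, both rows would match either query, because `%` and `_` would be treated as wildcards.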
def find_record(doc_id: str, category: str) -> Path: def get_document(doc_id: str) -> dict:
"""Find the record file for a document.""" """Get full document details by ID."""
cat_dir = RECORDS / category conn = get_db()
if cat_dir.exists(): conn.row_factory = sqlite3.Row
for f in cat_dir.iterdir():
if doc_id in f.name: cursor = conn.execute("""
return f SELECT * FROM documents WHERE doc_id = ? OR doc_id LIKE ?
return None """, (doc_id, f"{doc_id}%"))
row = cursor.fetchone()
conn.close()
return dict(row) if row else None
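Review note: `get_document` accepts an ID prefix, but `fetchone()` returns whichever row comes first when several IDs share that prefix. One possible approach, sketched under the assumption that `doc_id` is a text column, is to prefer an exact match and accept a prefix only when it is unambiguous. `resolve_doc_id` is a hypothetical helper, not part of this commit.

```python
import sqlite3


def resolve_doc_id(conn: sqlite3.Connection, doc_id: str):
    """Return the full doc_id for an exact or unambiguous prefix match, else None."""
    row = conn.execute(
        "SELECT doc_id FROM documents WHERE doc_id = ?", (doc_id,)
    ).fetchone()
    if row:
        return row[0]

    # substr() compares the prefix literally, so % and _ are not wildcards here.
    matches = conn.execute(
        """
        SELECT doc_id FROM documents
        WHERE substr(doc_id, 1, ?) = ?
        ORDER BY doc_id
        LIMIT 2
        """,
        (len(doc_id), doc_id),
    ).fetchall()
    return matches[0][0] if len(matches) == 1 else None


# Hypothetical data used only for this illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (doc_id TEXT)")
conn.executemany(
    "INSERT INTO documents VALUES (?)",
    [("abc12345",), ("abc99999",), ("def00001",)],
)

print(resolve_doc_id(conn, "def"))  # unique prefix
print(resolve_doc_id(conn, "abc"))  # ambiguous prefix
```

Returning `None` for an ambiguous prefix lets the caller report the problem instead of silently showing the wrong document.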
def list_categories() -> dict:
"""List all categories with document counts."""
conn = get_db()
cursor = conn.execute("""
SELECT category, COUNT(*) as count
FROM documents
GROUP BY category
ORDER BY count DESC
""")
results = {row[0]: row[1] for row in cursor.fetchall()}
conn.close()
return results
def list_types() -> dict:
"""List all document types with counts."""
conn = get_db()
cursor = conn.execute("""
SELECT doc_type, COUNT(*) as count
FROM documents
GROUP BY doc_type
ORDER BY count DESC
""")
results = {row[0]: row[1] for row in cursor.fetchall()}
conn.close()
return results
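Review note: `list_categories` and `list_types` differ only in the grouped column, and each closes its connection manually, so a failing query would leave the connection open. A minimal sketch of a shared helper that uses `contextlib.closing` follows. `count_by`, `ALLOWED_COLUMNS`, and `make_demo_db` are hypothetical names, and the demo schema is an assumption.

```python
import os
import sqlite3
import tempfile
from contextlib import closing

# Column names cannot be bound as SQL parameters, so restrict them explicitly.
ALLOWED_COLUMNS = {"category", "doc_type"}


def count_by(db_path, column: str) -> dict:
    """Return {value: count} for one whitelisted column of the documents table."""
    if column not in ALLOWED_COLUMNS:
        raise ValueError(f"Unsupported column: {column}")
    sql = f"""
        SELECT {column}, COUNT(*) AS count
        FROM documents
        GROUP BY {column}
        ORDER BY count DESC, {column}
    """
    # closing() guarantees conn.close() even if the query raises.
    with closing(sqlite3.connect(db_path)) as conn:
        return {value: count for value, count in conn.execute(sql)}


def make_demo_db() -> str:
    """Create a throwaway database file with a few sample rows."""
    fd, path = tempfile.mkstemp(suffix=".db")
    os.close(fd)
    with closing(sqlite3.connect(path)) as conn:
        conn.execute(
            "CREATE TABLE documents (doc_id TEXT, category TEXT, doc_type TEXT)"
        )
        conn.executemany(
            "INSERT INTO documents VALUES (?, ?, ?)",
            [
                ("a", "bills", "invoice"),
                ("b", "bills", "receipt"),
                ("c", "medical", "invoice"),
            ],
        )
        conn.commit()
    return path


demo_path = make_demo_db()
print(count_by(demo_path, "category"))
os.remove(demo_path)
```

The whitelist keeps the f-string interpolation safe, since only known column names ever reach the SQL text.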
def show_stats() -> None:
"""Show document statistics."""
conn = get_db()
# Total count
total = conn.execute("SELECT COUNT(*) FROM documents").fetchone()[0]
print("\n📊 Document Statistics")
print("=" * 40)
print(f"Total documents: {total}")
# By category
print("\nBy category:")
for cat, count in list_categories().items():
print(f" {cat}: {count}")
# By type
print("\nBy type:")
for dtype, count in list_types().items():
print(f" {dtype}: {count}")
conn.close()
def show_document(doc_id: str) -> None: def show_document(doc_id: str) -> None:
"""Show full details of a document.""" """Show full details of a document."""
data = load_index() doc = get_document(doc_id)
for doc in data["documents"]: if not doc:
if doc["id"] == doc_id or doc_id in doc.get("filename", ""): print(f"Document not found: {doc_id}")
print(f"\n{'='*60}") return
print(f"Document: {doc['filename']}")
print(f"ID: {doc['id']}")
print(f"Category: {doc['category']}")
print(f"Type: {doc.get('type', 'unknown')}")
print(f"Date: {doc.get('date', 'N/A')}")
print(f"Amount: {doc.get('amount', 'N/A')}")
print(f"Processed: {doc.get('processed', 'N/A')}")
print(f"{'='*60}")
# Show record content
record_path = find_record(doc["id"], doc["category"])
if record_path:
print(f"\nRecord: {record_path}")
print("-"*60)
print(record_path.read_text())
return
print(f"Document not found: {doc_id}") print(f"\n{'=' * 60}")
print(f"Document: {doc['filename']}")
print(f"ID: {doc['doc_id']}")
print(f"Category: {doc['category']}")
print(f"Type: {doc['doc_type'] or 'unknown'}")
print(f"Date: {doc['date'] or 'N/A'}")
print(f"Vendor: {doc['vendor'] or 'N/A'}")
print(f"Amount: {doc['amount'] or 'N/A'}")
print(f"Processed: {doc['processed_at']}")
print(f"{'=' * 60}")
if doc['summary']:
print(f"\nSummary:\n{doc['summary']}")
if doc['full_text']:
print("\n--- Full Text (first 2000 chars) ---\n")
print(doc['full_text'][:2000])
if len(doc['full_text']) > 2000:
print(f"\n... [{len(doc['full_text']) - 2000} more characters]")
def list_stats() -> None: def format_results(results: list) -> None:
"""Show document statistics.""" """Format and print search results."""
data = load_index() if not results:
print("No documents found")
return
print("\n📊 Document Statistics") print(f"\nFound {len(results)} document(s):\n")
print("="*40)
print(f"Total documents: {data['stats']['total']}")
print("\nBy type:") # Header
for dtype, count in sorted(data["stats"].get("by_type", {}).items()): print(f"{'ID':<10} {'Category':<12} {'Type':<18} {'Date':<12} {'Amount':<10} {'Filename'}")
print(f" {dtype}: {count}") print("-" * 90)
print("\nBy category:") for doc in results:
by_cat = {} doc_id = doc['doc_id'][:8]
for doc in data["documents"]: cat = (doc['category'] or '')[:12]
cat = doc.get("category", "unknown") dtype = (doc['doc_type'] or 'unknown')[:18]
by_cat[cat] = by_cat.get(cat, 0) + 1 date = (doc['date'] or '')[:12]
for cat, count in sorted(by_cat.items()): amount = str(doc['amount'] if doc['amount'] is not None else '')[:10]
print(f" {cat}: {count}") filename = doc['filename'][:30]
print(f"{doc_id:<10} {cat:<12} {dtype:<18} {date:<12} {amount:<10} {filename}")
def main(): def main():
@@ -125,10 +203,12 @@ def main():
parser.add_argument("-s", "--show", help="Show full document by ID") parser.add_argument("-s", "--show", help="Show full document by ID")
parser.add_argument("--stats", action="store_true", help="Show statistics") parser.add_argument("--stats", action="store_true", help="Show statistics")
parser.add_argument("-l", "--list", action="store_true", help="List all documents") parser.add_argument("-l", "--list", action="store_true", help="List all documents")
parser.add_argument("-n", "--limit", type=int, default=20, help="Max results (default: 20)")
parser.add_argument("--full-text", action="store_true", help="Show full text in results")
args = parser.parse_args() args = parser.parse_args()
if args.stats: if args.stats:
list_stats() show_stats()
return return
if args.show: if args.show:
@@ -136,17 +216,8 @@ def main():
return return
if args.list or args.query or args.category or args.type: if args.list or args.query or args.category or args.type:
results = search_documents(args.query, args.category, args.type) results = search_documents(args.query, args.category, args.type, args.limit)
format_results(results)
if not results:
print("No documents found")
return
print(f"\nFound {len(results)} document(s):\n")
for doc in results:
date = doc.get("date", "")[:10] if doc.get("date") else ""
amount = doc.get("amount", "")
print(f" [{doc['id'][:8]}] {doc['category']:12} {doc.get('type', ''):15} {date:12} {amount:10} {doc['filename']}")
else: else:
parser.print_help() parser.print_help()