Replace OCR with AI vision, SQLite for storage

- Remove Tesseract/OCR dependencies
- Use Claude vision API for document analysis
- Single AI pass: extract text + classify + summarize
- SQLite database for documents and embeddings
- Embeddings storage ready (generation placeholder)
- Full-text search via SQLite
- Updated systemd service to use venv
- Support .env file for API key
Johan Jongsma 2026-02-01 17:24:05 +00:00
parent 9dac36681c
commit fb3d5a46b5
4 changed files with 598 additions and 354 deletions

4
.gitignore (new file)

@ -0,0 +1,4 @@
venv/
.env
__pycache__/
*.pyc

188
README.md

@ -1,105 +1,119 @@
# Document Management System
# Document Processor
Automated document processing pipeline for scanning, OCR, classification, and indexing.
AI-powered document management system using Claude vision for extraction and SQLite for storage/search.
## Architecture
## Features
- **AI Vision Analysis**: Uses Claude to read documents, extract text, classify, and summarize
- **No OCR dependencies**: Just drop files in inbox, AI handles the rest
- **SQLite Storage**: Full-text search via SQLite, embeddings ready (placeholder)
- **Auto-categorization**: Taxes, bills, medical, insurance, legal, financial, etc.
- **Expense Tracking**: Auto-exports bills/receipts to CSV
## Setup
```bash
cd ~/dev/doc-processor
# Create/activate venv
python3 -m venv venv
source venv/bin/activate
# Install dependencies
pip install anthropic
# Configure API key (one of these methods):
# Option 1: Environment variable
export ANTHROPIC_API_KEY=sk-ant-...
# Option 2: .env file
echo 'ANTHROPIC_API_KEY=sk-ant-...' > .env
```
## Usage
```bash
# Activate venv first
source ~/dev/doc-processor/venv/bin/activate
# Process all documents in inbox
python processor.py
# Watch inbox continuously
python processor.py --watch
# Process single file
python processor.py --file /path/to/document.pdf
# Search documents
python search.py "query"
python search.py -c medical # By category
python search.py -t receipt # By type
python search.py -s abc123 # Show full document
python search.py --stats # Statistics
python search.py -l # List all
```
## Directory Structure
```
~/documents/
├── inbox/ # Drop documents here (SMB share for scanner)
├── store/ # Original files stored by hash
├── records/ # Markdown records by category
│ ├── bills/
├── inbox/ # Drop files here (SMB share for scanner)
├── store/ # Original files (hash-named)
├── records/ # Markdown records by category
│ ├── taxes/
│ ├── bills/
│ ├── medical/
│ ├── expenses/
│ └── ...
├── index/ # Search index
│ └── master.json
└── exports/ # CSV exports
└── expenses.csv
├── index/
│ ├── master.json # JSON index
│ └── embeddings.db # SQLite (documents + embeddings)
└── exports/
└── expenses.csv # Auto-exported expenses
```
## How It Works
## Supported Formats
1. **Drop a document** in `~/documents/inbox/` (via SMB, phone scan, or manually)
2. **Daemon processes it** (runs every 60 seconds):
- Extracts text (pdftotext or tesseract OCR)
- Classifies document type and category
- Extracts key fields (date, vendor, amount)
- Stores original file by content hash
- Creates markdown record
- Updates searchable index
- Exports expenses to CSV
3. **Search** your documents anytime
## Commands
```bash
# Process inbox manually
python3 ~/dev/doc-processor/processor.py
# Process single file
python3 ~/dev/doc-processor/processor.py --file /path/to/doc.pdf
# Watch mode (manual, daemon does this automatically)
python3 ~/dev/doc-processor/processor.py --watch --interval 30
# Search documents
python3 ~/dev/doc-processor/search.py "duke energy"
python3 ~/dev/doc-processor/search.py -c bills # By category
python3 ~/dev/doc-processor/search.py -t receipt # By type
python3 ~/dev/doc-processor/search.py --stats # Statistics
python3 ~/dev/doc-processor/search.py -l # List all
python3 ~/dev/doc-processor/search.py -s <doc_id> # Show full record
```
## Daemon
```bash
# Status
systemctl --user status doc-processor
# Restart
systemctl --user restart doc-processor
# Logs
journalctl --user -u doc-processor -f
```
## Scanner Setup
1. Get a scanner with SMB support (Brother ADS-1700W, Fujitsu ScanSnap, etc.)
2. Configure scanner to save to SMB share: `\\192.168.1.16\documents\inbox\`
3. Feed paper, press scan
4. Documents auto-process within 60 seconds
- PDF (converted to image for vision)
- Images: PNG, JPG, JPEG, GIF, WebP, TIFF, BMP
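The format handling above can be sketched as follows; the `DIRECT_MEDIA_TYPES` mapping mirrors the processor's `media_types` table, and non-native formats are assumed to be converted to PNG first:

```python
from pathlib import Path

# Formats the vision API accepts directly; PDF, TIFF, and BMP are
# converted to PNG (via pdftoppm for PDFs) before upload.
DIRECT_MEDIA_TYPES = {
    '.png': 'image/png',
    '.jpg': 'image/jpeg',
    '.jpeg': 'image/jpeg',
    '.gif': 'image/gif',
    '.webp': 'image/webp',
}

def media_type_for(path: Path) -> str:
    """Media type to declare in the API request, assuming non-native
    formats have already been converted to PNG."""
    return DIRECT_MEDIA_TYPES.get(path.suffix.lower(), 'image/png')

print(media_type_for(Path('scan.jpeg')))  # image/jpeg
print(media_type_for(Path('taxes.pdf')))  # image/png (after conversion)
```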
## Categories
| Category | Documents |
|----------|-----------|
| taxes | W-2, 1099, tax returns, IRS forms |
| bills | Utility bills, invoices |
| medical | Medical records, prescriptions |
| insurance | Policies, claims |
| legal | Contracts, agreements |
| financial | Bank statements, investments |
| expenses | Receipts, purchases |
| vehicles | Registration, maintenance |
| home | Mortgage, HOA, property |
| personal | General documents |
| contacts | Business cards |
| uncategorized | Unclassified |
- taxes, bills, medical, insurance, legal
- financial, expenses, vehicles, home
- personal, contacts, uncategorized
## SMB Share Setup
## Systemd Service
Already configured on the james server:
```
[documents]
path = /home/johan/documents
browsable = yes
writable = yes
valid users = scanner, johan
```
```bash
# Install service
systemctl --user daemon-reload
systemctl --user enable doc-processor
systemctl --user start doc-processor
# Check status
systemctl --user status doc-processor
journalctl --user -u doc-processor -f
```
The scanner user can write to the inbox; processed files go to the other directories.
## Requirements
- Python 3.10+
- `anthropic` Python package
- `pdftoppm` (poppler-utils) for PDF conversion
- Anthropic API key
## API Key
The processor looks for the API key in this order:
1. `ANTHROPIC_API_KEY` environment variable
2. `~/dev/doc-processor/.env` file
## Embeddings
The embedding storage is ready but the generation is a placeholder. Options:
- OpenAI text-embedding-3-small (cheap, good)
- Voyage AI (optimized for documents)
- Local sentence-transformers
Currently uses SQLite full-text search which works well for most use cases.
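Whichever provider is chosen, the storage side is already in place: vectors are packed as float32 blobs with `struct`, as in the processor's `store_embedding`. A round-trip sketch (the three-element `vec` is a stand-in for a real embedding vector):

```python
import sqlite3
import struct

# Vectors are stored as packed float32 blobs; values round-trip
# at float32 precision, not full Python float precision.
def pack(vec: list[float]) -> bytes:
    return struct.pack(f'{len(vec)}f', *vec)

def unpack(blob: bytes) -> list[float]:
    return list(struct.unpack(f'{len(blob) // 4}f', blob))

conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE embeddings (doc_id TEXT PRIMARY KEY, embedding BLOB)")
vec = [0.1, -0.5, 0.25]  # stand-in for a real embedding vector
conn.execute("INSERT INTO embeddings VALUES (?, ?)", ("abc123", pack(vec)))
blob = conn.execute(
    "SELECT embedding FROM embeddings WHERE doc_id = ?", ("abc123",)).fetchone()[0]
restored = unpack(blob)
```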

processor.py

@ -1,22 +1,31 @@
#!/usr/bin/env python3
"""
Document Processor for ~/documents/inbox/
Watches for new documents, OCRs them, classifies, and files them.
Uses AI vision (Claude) for document analysis. Stores embeddings in SQLite.
"""
import os
import sys
import json
import hashlib
import subprocess
import shutil
import sqlite3
import csv
import base64
import struct
from datetime import datetime
from pathlib import Path
from typing import Optional, Dict, Any
import re
from typing import Optional, Dict, Any, List
import time
import argparse
# Try to import anthropic, fail gracefully with helpful message
try:
import anthropic
except ImportError:
print("ERROR: anthropic package not installed")
print("Run: cd ~/dev/doc-processor && source venv/bin/activate && pip install anthropic")
sys.exit(1)
# Paths
DOCUMENTS_ROOT = Path.home() / "documents"
@ -25,6 +34,7 @@ STORE = DOCUMENTS_ROOT / "store"
RECORDS = DOCUMENTS_ROOT / "records"
INDEX = DOCUMENTS_ROOT / "index"
EXPORTS = DOCUMENTS_ROOT / "exports"
EMBEDDINGS_DB = INDEX / "embeddings.db"
# Categories
CATEGORIES = [
@ -40,149 +50,272 @@ for cat in CATEGORIES:
(RECORDS / cat).mkdir(parents=True, exist_ok=True)
def get_anthropic_client() -> anthropic.Anthropic:
"""Get Anthropic client, checking for API key."""
api_key = os.environ.get("ANTHROPIC_API_KEY")
if not api_key:
# Try reading from config file
config_path = Path.home() / "dev/doc-processor/.env"
if config_path.exists():
for line in config_path.read_text().splitlines():
if line.startswith("ANTHROPIC_API_KEY="):
api_key = line.split("=", 1)[1].strip().strip('"\'')
break
if not api_key:
raise RuntimeError(
"ANTHROPIC_API_KEY not set. Either:\n"
" 1. Set ANTHROPIC_API_KEY environment variable\n"
" 2. Create ~/dev/doc-processor/.env with ANTHROPIC_API_KEY=sk-ant-..."
)
return anthropic.Anthropic(api_key=api_key)
def init_embeddings_db():
"""Initialize SQLite database for embeddings."""
conn = sqlite3.connect(EMBEDDINGS_DB)
conn.execute("""
CREATE TABLE IF NOT EXISTS embeddings (
doc_id TEXT PRIMARY KEY,
embedding BLOB,
text_hash TEXT,
created_at TEXT
)
""")
conn.execute("""
CREATE TABLE IF NOT EXISTS documents (
doc_id TEXT PRIMARY KEY,
filename TEXT,
category TEXT,
doc_type TEXT,
date TEXT,
vendor TEXT,
amount TEXT,
summary TEXT,
full_text TEXT,
processed_at TEXT
)
""")
conn.commit()
conn.close()
def file_hash(filepath: Path) -> str:
"""SHA256 hash of file contents."""
h = hashlib.sha256()
with open(filepath, 'rb') as f:
for chunk in iter(lambda: f.read(8192), b''):
h.update(chunk)
return h.hexdigest()[:16] # Short hash for filename
return h.hexdigest()[:16]
def extract_text_pdf(filepath: Path) -> str:
"""Extract text from PDF using pdftotext."""
try:
result = subprocess.run(
['pdftotext', '-layout', str(filepath), '-'],
capture_output=True, text=True, timeout=30
)
text = result.stdout.strip()
if len(text) > 50: # Got meaningful text
return text
except Exception as e:
print(f"pdftotext failed: {e}")
# Fallback to OCR
return ocr_document(filepath)
def ocr_document(filepath: Path) -> str:
"""OCR a document using tesseract."""
try:
# For PDFs, convert to images first
if filepath.suffix.lower() == '.pdf':
# Use pdftoppm to convert to images, then OCR
result = subprocess.run(
['pdftoppm', '-png', '-r', '300', str(filepath), '/tmp/doc_page'],
capture_output=True, timeout=60
)
# OCR all pages
text_parts = []
for img in sorted(Path('/tmp').glob('doc_page-*.png')):
result = subprocess.run(
['tesseract', str(img), 'stdout'],
capture_output=True, text=True, timeout=60
)
text_parts.append(result.stdout)
img.unlink() # Clean up
return '\n'.join(text_parts).strip()
else:
# Direct image OCR
result = subprocess.run(
['tesseract', str(filepath), 'stdout'],
capture_output=True, text=True, timeout=60
)
return result.stdout.strip()
except Exception as e:
print(f"OCR failed: {e}")
return ""
def extract_text(filepath: Path) -> str:
"""Extract text from document based on type."""
def encode_image_base64(filepath: Path) -> tuple[str, str]:
"""Encode image/PDF to base64 for API. Returns (base64_data, media_type)."""
suffix = filepath.suffix.lower()
if suffix == '.pdf':
return extract_text_pdf(filepath)
elif suffix in ['.png', '.jpg', '.jpeg', '.tiff', '.tif', '.bmp']:
return ocr_document(filepath)
elif suffix in ['.txt', '.md']:
return filepath.read_text()
else:
return ""
def classify_document(text: str, filename: str) -> Dict[str, Any]:
"""
Classify document based on content.
Returns: {category, doc_type, date, vendor, amount, summary}
"""
text_lower = text.lower()
result = {
"category": "uncategorized",
"doc_type": "unknown",
"date": None,
"vendor": None,
"amount": None,
"summary": None,
# For PDFs, convert first page to PNG using pdftoppm
result = subprocess.run(
['pdftoppm', '-png', '-f', '1', '-l', '1', '-r', '150', str(filepath), '-'],
capture_output=True, timeout=30
)
if result.returncode == 0:
return base64.standard_b64encode(result.stdout).decode('utf-8'), 'image/png'
else:
raise RuntimeError(f"Failed to convert PDF: {result.stderr.decode()}")
# Image files
media_types = {
'.png': 'image/png',
'.jpg': 'image/jpeg',
'.jpeg': 'image/jpeg',
'.gif': 'image/gif',
'.webp': 'image/webp',
}
media_type = media_types.get(suffix, 'image/png')
# Date extraction (various formats)
date_patterns = [
r'(\d{1,2}[/-]\d{1,2}[/-]\d{2,4})',
r'(\d{4}[/-]\d{1,2}[/-]\d{1,2})',
r'((?:jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)[a-z]* \d{1,2},? \d{4})',
]
for pattern in date_patterns:
match = re.search(pattern, text_lower)
if match:
result["date"] = match.group(1)
break
with open(filepath, 'rb') as f:
return base64.standard_b64encode(f.read()).decode('utf-8'), media_type
def analyze_document_with_ai(filepath: Path, client: anthropic.Anthropic) -> Dict[str, Any]:
"""
Use Claude vision to analyze document.
Returns: {category, doc_type, date, vendor, amount, summary, full_text}
"""
print(f" Analyzing with AI...")
# Amount extraction
amount_match = re.search(r'\$[\d,]+\.?\d*', text)
if amount_match:
result["amount"] = amount_match.group(0)
try:
image_data, media_type = encode_image_base64(filepath)
except Exception as e:
print(f" Failed to encode document: {e}")
return {
"category": "uncategorized",
"doc_type": "unknown",
"full_text": f"(Failed to process: {e})",
"summary": "Document could not be processed"
}
# Classification rules
if any(x in text_lower for x in ['w-2', 'w2', '1099', 'tax return', 'irs', '1040', 'schedule c', 'form 1098']):
result["category"] = "taxes"
result["doc_type"] = "tax_form"
elif any(x in text_lower for x in ['invoice', 'bill', 'amount due', 'payment due', 'account number', 'autopay']):
result["category"] = "bills"
result["doc_type"] = "bill"
# Try to extract vendor
vendors = ['duke energy', 'fpl', 'florida power', 'spectrum', 'at&t', 'verizon', 't-mobile', 'comcast', 'xfinity']
for v in vendors:
if v in text_lower:
result["vendor"] = v.title()
break
elif any(x in text_lower for x in ['patient', 'diagnosis', 'prescription', 'medical', 'physician', 'hospital', 'clinic', 'dr.', 'md']):
result["category"] = "medical"
result["doc_type"] = "medical_record"
elif any(x in text_lower for x in ['policy', 'coverage', 'premium', 'deductible', 'insurance', 'claim']):
result["category"] = "insurance"
result["doc_type"] = "insurance_doc"
elif any(x in text_lower for x in ['agreement', 'contract', 'terms', 'hereby', 'whereas', 'attorney', 'legal']):
result["category"] = "legal"
result["doc_type"] = "legal_doc"
elif any(x in text_lower for x in ['bank', 'statement', 'account', 'balance', 'deposit', 'withdrawal', 'investment', 'portfolio']):
result["category"] = "financial"
result["doc_type"] = "financial_statement"
elif any(x in text_lower for x in ['receipt', 'purchase', 'order', 'subtotal', 'total', 'qty', 'item']):
result["category"] = "expenses"
result["doc_type"] = "receipt"
elif any(x in text_lower for x in ['vin', 'vehicle', 'registration', 'dmv', 'license plate', 'odometer']):
result["category"] = "vehicles"
result["doc_type"] = "vehicle_doc"
elif any(x in text_lower for x in ['mortgage', 'deed', 'property', 'hoa', 'homeowner']):
result["category"] = "home"
result["doc_type"] = "property_doc"
prompt = """Analyze this document image and extract:
1. **Full Text**: Transcribe ALL visible text from the document, preserving structure where possible.
2. **Classification**: Categorize into exactly ONE of:
- taxes (W-2, 1099, tax returns, IRS forms)
- bills (utilities, subscriptions, invoices)
- medical (health records, prescriptions, lab results)
- insurance (policies, claims, coverage docs)
- legal (contracts, agreements, legal notices)
- financial (bank statements, investment docs)
- expenses (receipts, purchase confirmations)
- vehicles (registration, maintenance, DMV)
- home (mortgage, HOA, property docs)
- personal (ID copies, certificates, misc)
- contacts (business cards, contact info)
- uncategorized (if none fit)
3. **Document Type**: Specific type (e.g., "utility_bill", "receipt", "tax_form_w2", "insurance_policy")
4. **Key Fields**:
- date: Document date (YYYY-MM-DD format if possible)
- vendor: Company/organization name
- amount: Dollar amount if present (e.g., "$123.45")
5. **Summary**: 1-2 sentence description of what this document is.
Respond in JSON format:
{
"category": "...",
"doc_type": "...",
"date": "...",
"vendor": "...",
"amount": "...",
"summary": "...",
"full_text": "..."
}"""
try:
response = client.messages.create(
model="claude-sonnet-4-20250514",
max_tokens=4096,
messages=[
{
"role": "user",
"content": [
{
"type": "image",
"source": {
"type": "base64",
"media_type": media_type,
"data": image_data,
},
},
{
"type": "text",
"text": prompt
}
],
}
],
)
# Parse JSON from response
text = response.content[0].text
# Try to extract JSON from response (handle markdown code blocks)
if "```json" in text:
text = text.split("```json")[1].split("```")[0]
elif "```" in text:
text = text.split("```")[1].split("```")[0]
result = json.loads(text.strip())
# Validate category
if result.get("category") not in CATEGORIES:
result["category"] = "uncategorized"
return result
except json.JSONDecodeError as e:
print(f" Failed to parse AI response as JSON: {e}")
print(f" Raw response: {text[:500]}")
return {
"category": "uncategorized",
"doc_type": "unknown",
"full_text": text,
"summary": "AI response could not be parsed"
}
except Exception as e:
print(f" AI analysis failed: {e}")
return {
"category": "uncategorized",
"doc_type": "unknown",
"full_text": f"(AI analysis failed: {e})",
"summary": "Document analysis failed"
}
def generate_embedding(text: str, client: anthropic.Anthropic) -> Optional[List[float]]:
"""
Generate text embedding using Anthropic's embedding endpoint.
Note: As of 2024, Anthropic doesn't have a public embedding API.
This is a placeholder - implement with OpenAI, Voyage, or local model.
For now, returns None and we'll use full-text search in SQLite.
"""
# Generate summary (first 200 chars, cleaned)
clean_text = ' '.join(text.split())[:200]
result["summary"] = clean_text
# TODO: Implement with preferred embedding provider
# Options:
# 1. OpenAI text-embedding-3-small (cheap, good quality)
# 2. Voyage AI (good for documents)
# 3. Local sentence-transformers
return None
def store_embedding(doc_id: str, embedding: Optional[List[float]], text: str):
"""Store embedding in SQLite database."""
if embedding is None:
return
return result
conn = sqlite3.connect(EMBEDDINGS_DB)
# Pack floats as binary blob
embedding_blob = struct.pack(f'{len(embedding)}f', *embedding)
text_hash = hashlib.sha256(text.encode()).hexdigest()[:16]
conn.execute("""
INSERT OR REPLACE INTO embeddings (doc_id, embedding, text_hash, created_at)
VALUES (?, ?, ?, ?)
""", (doc_id, embedding_blob, text_hash, datetime.now().isoformat()))
conn.commit()
conn.close()
def store_document_metadata(doc_id: str, filename: str, classification: Dict, full_text: str):
"""Store document metadata in SQLite for full-text search."""
conn = sqlite3.connect(EMBEDDINGS_DB)
conn.execute("""
INSERT OR REPLACE INTO documents
(doc_id, filename, category, doc_type, date, vendor, amount, summary, full_text, processed_at)
VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
""", (
doc_id,
filename,
classification.get("category", "uncategorized"),
classification.get("doc_type", "unknown"),
classification.get("date"),
classification.get("vendor"),
classification.get("amount"),
classification.get("summary"),
full_text[:50000], # Limit text size
datetime.now().isoformat()
))
conn.commit()
conn.close()
def store_document(filepath: Path, hash_id: str) -> Path:
@ -194,14 +327,16 @@ def store_document(filepath: Path, hash_id: str) -> Path:
return store_path
def create_record(filepath: Path, hash_id: str, text: str, classification: Dict) -> Path:
def create_record(filepath: Path, hash_id: str, classification: Dict) -> Path:
"""Create markdown record in appropriate category folder."""
cat = classification["category"]
cat = classification.get("category", "uncategorized")
now = datetime.now()
record_name = f"{now.strftime('%Y%m%d')}_{hash_id}.md"
record_path = RECORDS / cat / record_name
full_text = classification.get("full_text", "")
content = f"""# Document Record
**ID:** {hash_id}
@ -225,12 +360,12 @@ def create_record(filepath: Path, hash_id: str, text: str, classification: Dict)
## Full Text
```
{text[:5000]}
{full_text[:10000]}
```
## Files
- **PDF:** [store/{hash_id}{filepath.suffix}](../../store/{hash_id}{filepath.suffix})
- **Original:** [store/{hash_id}{filepath.suffix}](../../store/{hash_id}{filepath.suffix})
"""
record_path.write_text(content)
@ -245,15 +380,22 @@ def update_master_index(hash_id: str, filepath: Path, classification: Dict) -> N
with open(index_path) as f:
data = json.load(f)
else:
data = {"version": "1.0", "created": datetime.now().strftime("%Y-%m-%d"), "documents": [], "stats": {"total": 0, "by_type": {}, "by_year": {}}}
data = {
"version": "2.0",
"created": datetime.now().strftime("%Y-%m-%d"),
"documents": [],
"stats": {"total": 0, "by_type": {}, "by_category": {}}
}
doc_entry = {
"id": hash_id,
"filename": filepath.name,
"category": classification["category"],
"category": classification.get("category", "uncategorized"),
"type": classification.get("doc_type", "unknown"),
"date": classification.get("date"),
"vendor": classification.get("vendor"),
"amount": classification.get("amount"),
"summary": classification.get("summary"),
"processed": datetime.now().isoformat(),
}
@ -262,9 +404,11 @@ def update_master_index(hash_id: str, filepath: Path, classification: Dict) -> N
data["documents"].append(doc_entry)
data["stats"]["total"] = len(data["documents"])
# Update type stats
# Update type/category stats
dtype = classification.get("doc_type", "unknown")
cat = classification.get("category", "uncategorized")
data["stats"]["by_type"][dtype] = data["stats"]["by_type"].get(dtype, 0) + 1
data["stats"]["by_category"][cat] = data["stats"]["by_category"].get(cat, 0) + 1
with open(index_path, 'w') as f:
json.dump(data, f, indent=2)
@ -272,7 +416,7 @@ def update_master_index(hash_id: str, filepath: Path, classification: Dict) -> N
def export_expense(hash_id: str, classification: Dict, filepath: Path) -> None:
"""Append to expenses.csv if it's an expense/receipt."""
if classification["category"] not in ["expenses", "bills"]:
if classification.get("category") not in ["expenses", "bills"]:
return
csv_path = EXPORTS / "expenses.csv"
@ -287,22 +431,22 @@ def export_expense(hash_id: str, classification: Dict, filepath: Path) -> None:
classification.get("date", ""),
classification.get("vendor", ""),
classification.get("amount", ""),
classification["category"],
classification.get("category", ""),
classification.get("doc_type", ""),
hash_id,
filepath.name,
])
def process_document(filepath: Path) -> bool:
def process_document(filepath: Path, client: anthropic.Anthropic) -> bool:
"""Process a single document through the full pipeline."""
print(f"Processing: {filepath.name}")
# Skip hidden files and non-documents
# Skip hidden files
if filepath.name.startswith('.'):
return False
valid_extensions = {'.pdf', '.png', '.jpg', '.jpeg', '.tiff', '.tif', '.bmp', '.txt'}
valid_extensions = {'.pdf', '.png', '.jpg', '.jpeg', '.gif', '.webp', '.tiff', '.tif', '.bmp'}
if filepath.suffix.lower() not in valid_extensions:
print(f" Skipping unsupported format: {filepath.suffix}")
return False
@ -318,87 +462,98 @@ def process_document(filepath: Path) -> bool:
filepath.unlink()
return True
# 3. Extract text (OCR if needed)
print(" Extracting text...")
text = extract_text(filepath)
if not text:
print(" Warning: No text extracted")
text = "(No text could be extracted)"
else:
print(f" Extracted {len(text)} characters")
# 3. Analyze with AI (extracts text + classifies in one pass)
classification = analyze_document_with_ai(filepath, client)
full_text = classification.get("full_text", "")
print(f" Category: {classification.get('category')}, Type: {classification.get('doc_type')}")
print(f" Extracted {len(full_text)} characters")
# 4. Classify
print(" Classifying...")
classification = classify_document(text, filepath.name)
print(f" Category: {classification['category']}, Type: {classification.get('doc_type')}")
# 5. Store PDF
# 4. Store original document
print(" Storing document...")
store_document(filepath, hash_id)
# 6. Create record
# 5. Create markdown record
print(" Creating record...")
record_path = create_record(filepath, hash_id, text, classification)
record_path = create_record(filepath, hash_id, classification)
print(f" Record: {record_path}")
# 7. Update index
# 6. Update JSON index
print(" Updating index...")
update_master_index(hash_id, filepath, classification)
# 8. Export if expense
# 7. Store in SQLite (for search)
print(" Storing in SQLite...")
store_document_metadata(hash_id, filepath.name, classification, full_text)
# 8. Generate and store embedding (if implemented)
embedding = generate_embedding(full_text, client)
if embedding:
store_embedding(hash_id, embedding, full_text)
# 9. Export if expense
export_expense(hash_id, classification, filepath)
# 9. Remove from inbox
# 10. Remove from inbox
print(" Removing from inbox...")
filepath.unlink()
print(f" ✓ Done: {classification['category']}/{hash_id}")
print(f" ✓ Done: {classification.get('category')}/{hash_id}")
return True
def process_inbox() -> int:
def process_inbox(client: anthropic.Anthropic) -> int:
"""Process all documents in inbox. Returns count processed."""
count = 0
for filepath in INBOX.iterdir():
for filepath in sorted(INBOX.iterdir()):
if filepath.is_file() and not filepath.name.startswith('.'):
try:
if process_document(filepath):
if process_document(filepath, client):
count += 1
except Exception as e:
print(f"Error processing {filepath}: {e}")
import traceback
traceback.print_exc()
return count
def watch_inbox(interval: int = 30) -> None:
def watch_inbox(client: anthropic.Anthropic, interval: int = 60) -> None:
"""Watch inbox continuously."""
print(f"Watching {INBOX} (interval: {interval}s)")
print("Press Ctrl+C to stop")
while True:
count = process_inbox()
count = process_inbox(client)
if count:
print(f"Processed {count} document(s)")
time.sleep(interval)
def main():
import argparse
parser = argparse.ArgumentParser(description="Document processor")
parser = argparse.ArgumentParser(description="AI-powered document processor")
parser.add_argument("--watch", action="store_true", help="Watch inbox continuously")
parser.add_argument("--interval", type=int, default=30, help="Watch interval in seconds")
parser.add_argument("--interval", type=int, default=60, help="Watch interval in seconds")
parser.add_argument("--file", type=Path, help="Process single file")
args = parser.parse_args()
# Initialize
init_embeddings_db()
try:
client = get_anthropic_client()
except RuntimeError as e:
print(f"ERROR: {e}")
sys.exit(1)
if args.file:
if args.file.exists():
process_document(args.file)
process_document(args.file, client)
else:
print(f"File not found: {args.file}")
sys.exit(1)
elif args.watch:
watch_inbox(args.interval)
watch_inbox(client, args.interval)
else:
count = process_inbox()
count = process_inbox(client)
print(f"Processed {count} document(s)")

257
search.py

@ -1,11 +1,13 @@
#!/usr/bin/env python3
"""
Search documents in the document management system.
Uses SQLite full-text search on document content.
"""
import os
import sys
import json
import sqlite3
import argparse
from pathlib import Path
from datetime import datetime
@ -13,108 +15,184 @@ from datetime import datetime
DOCUMENTS_ROOT = Path.home() / "documents"
INDEX = DOCUMENTS_ROOT / "index"
RECORDS = DOCUMENTS_ROOT / "records"
EMBEDDINGS_DB = INDEX / "embeddings.db"
def load_index() -> dict:
"""Load the master index."""
index_path = INDEX / "master.json"
if index_path.exists():
with open(index_path) as f:
return json.load(f)
return {"documents": []}
def get_db() -> sqlite3.Connection:
"""Get database connection."""
if not EMBEDDINGS_DB.exists():
print(f"Database not found: {EMBEDDINGS_DB}")
print("Run the processor first to create the database.")
sys.exit(1)
return sqlite3.connect(EMBEDDINGS_DB)
def search_documents(query: str, category: str = None, doc_type: str = None) -> list:
"""Search documents by query, optionally filtered by category/type."""
data = load_index()
results = []
def search_documents(query: str, category: str = None, doc_type: str = None, limit: int = 20) -> list:
"""
Search documents by query using SQLite full-text search.
Returns list of matching documents.
"""
conn = get_db()
conn.row_factory = sqlite3.Row
query_lower = query.lower() if query else ""
# Build query
conditions = []
params = []
for doc in data["documents"]:
# Apply filters
if category and doc.get("category") != category:
continue
if doc_type and doc.get("type") != doc_type:
continue
# If no query, return all matching filters
if not query:
results.append(doc)
continue
# Search in indexed fields
searchable = f"{doc.get('filename', '')} {doc.get('category', '')} {doc.get('type', '')} {doc.get('date', '')} {doc.get('amount', '')}".lower()
if query_lower in searchable:
results.append(doc)
continue
# Search in full text record
record_path = find_record(doc["id"], doc["category"])
if record_path and record_path.exists():
content = record_path.read_text().lower()
if query_lower in content:
results.append(doc)
if query:
# Search in full_text, summary, vendor, filename
conditions.append("""(
full_text LIKE ? OR
summary LIKE ? OR
vendor LIKE ? OR
filename LIKE ?
)""")
like_query = f"%{query}%"
params.extend([like_query, like_query, like_query, like_query])
if category:
conditions.append("category = ?")
params.append(category)
if doc_type:
conditions.append("doc_type = ?")
params.append(doc_type)
where_clause = " AND ".join(conditions) if conditions else "1=1"
sql = f"""
SELECT doc_id, filename, category, doc_type, date, vendor, amount, summary, processed_at
FROM documents
WHERE {where_clause}
ORDER BY processed_at DESC
LIMIT ?
"""
params.append(limit)
cursor = conn.execute(sql, params)
results = [dict(row) for row in cursor.fetchall()]
conn.close()
return results
def find_record(doc_id: str, category: str) -> Path:
"""Find the record file for a document."""
cat_dir = RECORDS / category
if cat_dir.exists():
for f in cat_dir.iterdir():
if doc_id in f.name:
return f
return None
def get_document(doc_id: str) -> dict:
"""Get full document details by ID."""
conn = get_db()
conn.row_factory = sqlite3.Row
cursor = conn.execute("""
SELECT * FROM documents WHERE doc_id = ? OR doc_id LIKE ?
""", (doc_id, f"{doc_id}%"))
row = cursor.fetchone()
conn.close()
return dict(row) if row else None
def list_categories() -> dict:
"""List all categories with document counts."""
conn = get_db()
cursor = conn.execute("""
SELECT category, COUNT(*) as count
FROM documents
GROUP BY category
ORDER BY count DESC
""")
results = {row[0]: row[1] for row in cursor.fetchall()}
conn.close()
return results
def list_types() -> dict:
"""List all document types with counts."""
conn = get_db()
cursor = conn.execute("""
SELECT doc_type, COUNT(*) as count
FROM documents
GROUP BY doc_type
ORDER BY count DESC
""")
results = {row[0]: row[1] for row in cursor.fetchall()}
conn.close()
return results
def show_stats() -> None:
"""Show document statistics."""
conn = get_db()
# Total count
total = conn.execute("SELECT COUNT(*) FROM documents").fetchone()[0]
print("\n📊 Document Statistics")
print("=" * 40)
print(f"Total documents: {total}")
# By category
print("\nBy category:")
for cat, count in list_categories().items():
print(f" {cat}: {count}")
# By type
print("\nBy type:")
for dtype, count in list_types().items():
print(f" {dtype}: {count}")
conn.close()
def show_document(doc_id: str) -> None:
"""Show full details of a document."""
data = load_index()
doc = get_document(doc_id)
for doc in data["documents"]:
if doc["id"] == doc_id or doc_id in doc.get("filename", ""):
print(f"\n{'='*60}")
print(f"Document: {doc['filename']}")
print(f"ID: {doc['id']}")
print(f"Category: {doc['category']}")
print(f"Type: {doc.get('type', 'unknown')}")
print(f"Date: {doc.get('date', 'N/A')}")
print(f"Amount: {doc.get('amount', 'N/A')}")
print(f"Processed: {doc.get('processed', 'N/A')}")
print(f"{'='*60}")
# Show record content
record_path = find_record(doc["id"], doc["category"])
if record_path:
print(f"\nRecord: {record_path}")
print("-"*60)
print(record_path.read_text())
return
if not doc:
print(f"Document not found: {doc_id}")
return
print(f"Document not found: {doc_id}")
print(f"\n{'=' * 60}")
print(f"Document: {doc['filename']}")
print(f"ID: {doc['doc_id']}")
print(f"Category: {doc['category']}")
print(f"Type: {doc['doc_type'] or 'unknown'}")
print(f"Date: {doc['date'] or 'N/A'}")
print(f"Vendor: {doc['vendor'] or 'N/A'}")
print(f"Amount: {doc['amount'] or 'N/A'}")
print(f"Processed: {doc['processed_at']}")
print(f"{'=' * 60}")
if doc['summary']:
print(f"\nSummary:\n{doc['summary']}")
if doc['full_text']:
print(f"\n--- Full Text (first 2000 chars) ---\n")
print(doc['full_text'][:2000])
if len(doc['full_text']) > 2000:
print(f"\n... [{len(doc['full_text']) - 2000} more characters]")
def list_stats() -> None:
"""Show document statistics."""
data = load_index()
def format_results(results: list) -> None:
"""Format and print search results."""
if not results:
print("No documents found")
return
print("\n📊 Document Statistics")
print("="*40)
print(f"Total documents: {data['stats']['total']}")
print(f"\nFound {len(results)} document(s):\n")
print("\nBy type:")
for dtype, count in sorted(data["stats"].get("by_type", {}).items()):
print(f" {dtype}: {count}")
# Header
print(f"{'ID':<10} {'Category':<12} {'Type':<18} {'Date':<12} {'Amount':<10} {'Filename'}")
print("-" * 90)
print("\nBy category:")
by_cat = {}
for doc in data["documents"]:
cat = doc.get("category", "unknown")
by_cat[cat] = by_cat.get(cat, 0) + 1
for cat, count in sorted(by_cat.items()):
print(f" {cat}: {count}")
for doc in results:
doc_id = doc['doc_id'][:8]
cat = (doc['category'] or '')[:12]
dtype = (doc['doc_type'] or 'unknown')[:18]
date = (doc['date'] or '')[:12]
amount = (doc['amount'] or '')[:10]
filename = doc['filename'][:30]
print(f"{doc_id:<10} {cat:<12} {dtype:<18} {date:<12} {amount:<10} {filename}")
def main():
@ -125,10 +203,12 @@ def main():
parser.add_argument("-s", "--show", help="Show full document by ID")
parser.add_argument("--stats", action="store_true", help="Show statistics")
parser.add_argument("-l", "--list", action="store_true", help="List all documents")
parser.add_argument("-n", "--limit", type=int, default=20, help="Max results (default: 20)")
parser.add_argument("--full-text", action="store_true", help="Show full text in results")
args = parser.parse_args()
if args.stats:
list_stats()
show_stats()
return
if args.show:
@ -136,17 +216,8 @@ def main():
return
if args.list or args.query or args.category or args.type:
results = search_documents(args.query, args.category, args.type)
if not results:
print("No documents found")
return
print(f"\nFound {len(results)} document(s):\n")
for doc in results:
date = doc.get("date", "")[:10] if doc.get("date") else ""
amount = doc.get("amount", "")
print(f" [{doc['id'][:8]}] {doc['category']:12} {doc.get('type', ''):15} {date:12} {amount:10} {doc['filename']}")
results = search_documents(args.query, args.category, args.type, args.limit)
format_results(results)
else:
parser.print_help()