Replace OCR with AI vision, SQLite for storage

- Remove Tesseract/OCR dependencies
- Use Claude vision API for document analysis
- Single AI pass: extract text + classify + summarize
- SQLite database for documents and embeddings
- Embeddings storage ready (generation placeholder)
- Full-text search via SQLite
- Updated systemd service to use venv
- Support .env file for API key

commit fb3d5a46b5 (parent 9dac36681c)
```diff
@@ -0,0 +1,4 @@
+venv/
+.env
+__pycache__/
+*.pyc
```
README.md (188 lines changed)

````diff
@@ -1,105 +1,119 @@
-# Document Management System
+# Document Processor
 
-Automated document processing pipeline for scanning, OCR, classification, and indexing.
+AI-powered document management system using Claude vision for extraction and SQLite for storage/search.
 
-## Architecture
+## Features
 
+- **AI Vision Analysis**: Uses Claude to read documents, extract text, classify, and summarize
+- **No OCR dependencies**: Just drop files in inbox, AI handles the rest
+- **SQLite Storage**: Full-text search via SQLite, embeddings ready (placeholder)
+- **Auto-categorization**: Taxes, bills, medical, insurance, legal, financial, etc.
+- **Expense Tracking**: Auto-exports bills/receipts to CSV
+
+## Setup
+
+```bash
+cd ~/dev/doc-processor
+
+# Create/activate venv
+python3 -m venv venv
+source venv/bin/activate
+
+# Install dependencies
+pip install anthropic
+
+# Configure API key (one of these methods):
+# Option 1: Environment variable
+export ANTHROPIC_API_KEY=sk-ant-...
+
+# Option 2: .env file
+echo 'ANTHROPIC_API_KEY=sk-ant-...' > .env
+```
+
+## Usage
+
+```bash
+# Activate venv first
+source ~/dev/doc-processor/venv/bin/activate
+
+# Process all documents in inbox
+python processor.py
+
+# Watch inbox continuously
+python processor.py --watch
+
+# Process single file
+python processor.py --file /path/to/document.pdf
+
+# Search documents
+python search.py "query"
+python search.py -c medical   # By category
+python search.py -t receipt   # By type
+python search.py -s abc123    # Show full document
+python search.py --stats      # Statistics
+python search.py -l           # List all
+```
+
+## Directory Structure
+
 ```
 ~/documents/
-├── inbox/        # Drop documents here (SMB share for scanner)
+├── inbox/        # Drop files here (SMB share for scanner)
-├── store/        # Original files stored by hash
+├── store/        # Original files (hash-named)
 ├── records/      # Markdown records by category
-│   ├── bills/
 │   ├── taxes/
+│   ├── bills/
 │   ├── medical/
-│   ├── expenses/
 │   └── ...
-├── index/        # Search index
+├── index/
-│   └── master.json
+│   ├── master.json      # JSON index
-└── exports/      # CSV exports
+│   └── embeddings.db    # SQLite (documents + embeddings)
-    └── expenses.csv
+└── exports/
+    └── expenses.csv     # Auto-exported expenses
 ```
 
-## How It Works
+## Supported Formats
 
-1. **Drop a document** in `~/documents/inbox/` (via SMB, phone scan, or manually)
+- PDF (converted to image for vision)
-2. **Daemon processes it** (runs every 60 seconds):
+- Images: PNG, JPG, JPEG, GIF, WebP, TIFF, BMP
-   - Extracts text (pdftotext or tesseract OCR)
-   - Classifies document type and category
-   - Extracts key fields (date, vendor, amount)
-   - Stores original file by content hash
-   - Creates markdown record
-   - Updates searchable index
-   - Exports expenses to CSV
-3. **Search** your documents anytime
-
-## Commands
-
-```bash
-# Process inbox manually
-python3 ~/dev/doc-processor/processor.py
-
-# Process single file
-python3 ~/dev/doc-processor/processor.py --file /path/to/doc.pdf
-
-# Watch mode (manual, daemon does this automatically)
-python3 ~/dev/doc-processor/processor.py --watch --interval 30
-
-# Search documents
-python3 ~/dev/doc-processor/search.py "duke energy"
-python3 ~/dev/doc-processor/search.py -c bills      # By category
-python3 ~/dev/doc-processor/search.py -t receipt    # By type
-python3 ~/dev/doc-processor/search.py --stats       # Statistics
-python3 ~/dev/doc-processor/search.py -l            # List all
-python3 ~/dev/doc-processor/search.py -s <doc_id>   # Show full record
-```
-
-## Daemon
-
-```bash
-# Status
-systemctl --user status doc-processor
-
-# Restart
-systemctl --user restart doc-processor
-
-# Logs
-journalctl --user -u doc-processor -f
-```
-
-## Scanner Setup
-
-1. Get a scanner with SMB support (Brother ADS-1700W, Fujitsu ScanSnap, etc.)
-2. Configure scanner to save to SMB share: `\\192.168.1.16\documents\inbox\`
-3. Feed paper, press scan
-4. Documents auto-process within 60 seconds
 
 ## Categories
 
-| Category | Documents |
+- taxes, bills, medical, insurance, legal
-|----------|-----------|
+- financial, expenses, vehicles, home
-| taxes | W-2, 1099, tax returns, IRS forms |
+- personal, contacts, uncategorized
-| bills | Utility bills, invoices |
-| medical | Medical records, prescriptions |
-| insurance | Policies, claims |
-| legal | Contracts, agreements |
-| financial | Bank statements, investments |
-| expenses | Receipts, purchases |
-| vehicles | Registration, maintenance |
-| home | Mortgage, HOA, property |
-| personal | General documents |
-| contacts | Business cards |
-| uncategorized | Unclassified |
 
-## SMB Share Setup
+## Systemd Service
 
-Already configured on james server:
+```bash
-```
+# Install service
-[documents]
+systemctl --user daemon-reload
-path = /home/johan/documents
+systemctl --user enable doc-processor
-browsable = yes
+systemctl --user start doc-processor
-writable = yes
-valid users = scanner, johan
+
+# Check status
+systemctl --user status doc-processor
+journalctl --user -u doc-processor -f
 ```
 
-Scanner user can write to inbox, processed files go to other directories.
+## Requirements
+
+- Python 3.10+
+- `anthropic` Python package
+- `pdftoppm` (poppler-utils) for PDF conversion
+- Anthropic API key
+
+## API Key
+
+The processor looks for the API key in this order:
+
+1. `ANTHROPIC_API_KEY` environment variable
+2. `~/dev/doc-processor/.env` file
+
+## Embeddings
+
+The embedding storage is ready but the generation is a placeholder. Options:
+
+- OpenAI text-embedding-3-small (cheap, good)
+- Voyage AI (optimized for documents)
+- Local sentence-transformers
+
+Currently uses SQLite full-text search which works well for most use cases.
````
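The README's fallback to "SQLite full-text search" can be as simple as substring matching over the `documents` table defined in this commit; a minimal sketch (the `search_documents` helper is hypothetical, not code from the repo, and the schema is copied from the commit's `init_embeddings_db`):

```python
import sqlite3

def search_documents(conn, query):
    """Case-insensitive substring search over summary and full_text."""
    pattern = f"%{query}%"
    cur = conn.execute(
        "SELECT doc_id, category, summary FROM documents "
        "WHERE full_text LIKE ? OR summary LIKE ?",
        (pattern, pattern),
    )
    return cur.fetchall()

# In-memory demo using the documents schema from this commit
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE documents (
        doc_id TEXT PRIMARY KEY, filename TEXT, category TEXT,
        doc_type TEXT, date TEXT, vendor TEXT, amount TEXT,
        summary TEXT, full_text TEXT, processed_at TEXT
    )
""")
conn.execute(
    "INSERT INTO documents (doc_id, category, summary, full_text) "
    "VALUES ('abc123', 'bills', 'Electric bill', 'Duke Energy amount due $123.45')"
)
print(search_documents(conn, "duke"))
```

SQLite's `LIKE` is case-insensitive for ASCII, so "duke" matches "Duke Energy"; for larger corpora an FTS5 virtual table would scale better than `LIKE` scans.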
processor.py (501 lines changed)

````diff
@@ -1,22 +1,31 @@
 #!/usr/bin/env python3
 """
 Document Processor for ~/documents/inbox/
-Watches for new documents, OCRs them, classifies, and files them.
+Uses AI vision (Claude) for document analysis. Stores embeddings in SQLite.
 """
 
 import os
 import sys
 import json
 import hashlib
-import subprocess
 import shutil
 import sqlite3
 import csv
+import base64
+import struct
 from datetime import datetime
 from pathlib import Path
-from typing import Optional, Dict, Any
+from typing import Optional, Dict, Any, List
-import re
 import time
+import argparse
+
+# Try to import anthropic, fail gracefully with helpful message
+try:
+    import anthropic
+except ImportError:
+    print("ERROR: anthropic package not installed")
+    print("Run: cd ~/dev/doc-processor && source venv/bin/activate && pip install anthropic")
+    sys.exit(1)
 
 # Paths
 DOCUMENTS_ROOT = Path.home() / "documents"
@@ -25,6 +34,7 @@ STORE = DOCUMENTS_ROOT / "store"
 RECORDS = DOCUMENTS_ROOT / "records"
 INDEX = DOCUMENTS_ROOT / "index"
 EXPORTS = DOCUMENTS_ROOT / "exports"
+EMBEDDINGS_DB = INDEX / "embeddings.db"
 
 # Categories
 CATEGORIES = [
````
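The path constants above hang off `Path.home()`, and the per-category record folders are created idempotently with `mkdir(parents=True, exist_ok=True)`, so startup is safe to repeat. A self-contained sketch of that layout under a temporary root (the category list here is a subset, for illustration only):

```python
import tempfile
from pathlib import Path

# Mirror the layout from the commit, rooted in a temp dir for the demo
root = Path(tempfile.mkdtemp()) / "documents"
CATEGORIES = ["taxes", "bills", "medical", "uncategorized"]  # illustrative subset

for sub in ["inbox", "store", "index", "exports"]:
    (root / sub).mkdir(parents=True, exist_ok=True)
for cat in CATEGORIES:
    # Idempotent: running this again on every startup is harmless
    (root / "records" / cat).mkdir(parents=True, exist_ok=True)

print(sorted(p.name for p in root.iterdir()))
```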
````diff
@@ -40,149 +50,272 @@ for cat in CATEGORIES:
     (RECORDS / cat).mkdir(parents=True, exist_ok=True)
 
 
+def get_anthropic_client() -> anthropic.Anthropic:
+    """Get Anthropic client, checking for API key."""
+    api_key = os.environ.get("ANTHROPIC_API_KEY")
+    if not api_key:
+        # Try reading from config file
+        config_path = Path.home() / "dev/doc-processor/.env"
+        if config_path.exists():
+            for line in config_path.read_text().splitlines():
+                if line.startswith("ANTHROPIC_API_KEY="):
+                    api_key = line.split("=", 1)[1].strip().strip('"\'')
+                    break
+
+    if not api_key:
+        raise RuntimeError(
+            "ANTHROPIC_API_KEY not set. Either:\n"
+            "  1. Set ANTHROPIC_API_KEY environment variable\n"
+            "  2. Create ~/dev/doc-processor/.env with ANTHROPIC_API_KEY=sk-ant-..."
+        )
+
+    return anthropic.Anthropic(api_key=api_key)
+
+
+def init_embeddings_db():
+    """Initialize SQLite database for embeddings."""
+    conn = sqlite3.connect(EMBEDDINGS_DB)
+    conn.execute("""
+        CREATE TABLE IF NOT EXISTS embeddings (
+            doc_id TEXT PRIMARY KEY,
+            embedding BLOB,
+            text_hash TEXT,
+            created_at TEXT
+        )
+    """)
+    conn.execute("""
+        CREATE TABLE IF NOT EXISTS documents (
+            doc_id TEXT PRIMARY KEY,
+            filename TEXT,
+            category TEXT,
+            doc_type TEXT,
+            date TEXT,
+            vendor TEXT,
+            amount TEXT,
+            summary TEXT,
+            full_text TEXT,
+            processed_at TEXT
+        )
+    """)
+    conn.commit()
+    conn.close()
+
+
 def file_hash(filepath: Path) -> str:
     """SHA256 hash of file contents."""
     h = hashlib.sha256()
     with open(filepath, 'rb') as f:
         for chunk in iter(lambda: f.read(8192), b''):
             h.update(chunk)
-    return h.hexdigest()[:16]  # Short hash for filename
+    return h.hexdigest()[:16]
 
 
-def extract_text_pdf(filepath: Path) -> str:
-    """Extract text from PDF using pdftotext."""
-    try:
-        result = subprocess.run(
-            ['pdftotext', '-layout', str(filepath), '-'],
-            capture_output=True, text=True, timeout=30
-        )
-        text = result.stdout.strip()
-        if len(text) > 50:  # Got meaningful text
-            return text
-    except Exception as e:
-        print(f"pdftotext failed: {e}")
-
-    # Fallback to OCR
-    return ocr_document(filepath)
-
-
-def ocr_document(filepath: Path) -> str:
-    """OCR a document using tesseract."""
-    try:
-        # For PDFs, convert to images first
-        if filepath.suffix.lower() == '.pdf':
-            # Use pdftoppm to convert to images, then OCR
-            result = subprocess.run(
-                ['pdftoppm', '-png', '-r', '300', str(filepath), '/tmp/doc_page'],
-                capture_output=True, timeout=60
-            )
-            # OCR all pages
-            text_parts = []
-            for img in sorted(Path('/tmp').glob('doc_page-*.png')):
-                result = subprocess.run(
-                    ['tesseract', str(img), 'stdout'],
-                    capture_output=True, text=True, timeout=60
-                )
-                text_parts.append(result.stdout)
-                img.unlink()  # Clean up
-            return '\n'.join(text_parts).strip()
-        else:
-            # Direct image OCR
-            result = subprocess.run(
-                ['tesseract', str(filepath), 'stdout'],
-                capture_output=True, text=True, timeout=60
-            )
-            return result.stdout.strip()
-    except Exception as e:
-        print(f"OCR failed: {e}")
-        return ""
-
-
-def extract_text(filepath: Path) -> str:
-    """Extract text from document based on type."""
-    suffix = filepath.suffix.lower()
-
-    if suffix == '.pdf':
-        return extract_text_pdf(filepath)
-    elif suffix in ['.png', '.jpg', '.jpeg', '.tiff', '.tif', '.bmp']:
-        return ocr_document(filepath)
-    elif suffix in ['.txt', '.md']:
-        return filepath.read_text()
-    else:
-        return ""
-
-
-def classify_document(text: str, filename: str) -> Dict[str, Any]:
-    """
-    Classify document based on content.
-    Returns: {category, doc_type, date, vendor, amount, summary}
-    """
-    text_lower = text.lower()
-    result = {
-        "category": "uncategorized",
-        "doc_type": "unknown",
-        "date": None,
-        "vendor": None,
-        "amount": None,
-        "summary": None,
-    }
-
-    # Date extraction (various formats)
-    date_patterns = [
-        r'(\d{1,2}[/-]\d{1,2}[/-]\d{2,4})',
-        r'(\d{4}[/-]\d{1,2}[/-]\d{1,2})',
-        r'((?:jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)[a-z]* \d{1,2},? \d{4})',
-    ]
-    for pattern in date_patterns:
-        match = re.search(pattern, text_lower)
-        if match:
-            result["date"] = match.group(1)
-            break
-
-    # Amount extraction
-    amount_match = re.search(r'\$[\d,]+\.?\d*', text)
-    if amount_match:
-        result["amount"] = amount_match.group(0)
-
-    # Classification rules
-    if any(x in text_lower for x in ['w-2', 'w2', '1099', 'tax return', 'irs', '1040', 'schedule c', 'form 1098']):
-        result["category"] = "taxes"
-        result["doc_type"] = "tax_form"
-    elif any(x in text_lower for x in ['invoice', 'bill', 'amount due', 'payment due', 'account number', 'autopay']):
-        result["category"] = "bills"
-        result["doc_type"] = "bill"
-        # Try to extract vendor
-        vendors = ['duke energy', 'fpl', 'florida power', 'spectrum', 'at&t', 'verizon', 't-mobile', 'comcast', 'xfinity']
-        for v in vendors:
-            if v in text_lower:
-                result["vendor"] = v.title()
-                break
-    elif any(x in text_lower for x in ['patient', 'diagnosis', 'prescription', 'medical', 'physician', 'hospital', 'clinic', 'dr.', 'md']):
-        result["category"] = "medical"
-        result["doc_type"] = "medical_record"
-    elif any(x in text_lower for x in ['policy', 'coverage', 'premium', 'deductible', 'insurance', 'claim']):
-        result["category"] = "insurance"
-        result["doc_type"] = "insurance_doc"
-    elif any(x in text_lower for x in ['agreement', 'contract', 'terms', 'hereby', 'whereas', 'attorney', 'legal']):
-        result["category"] = "legal"
-        result["doc_type"] = "legal_doc"
-    elif any(x in text_lower for x in ['bank', 'statement', 'account', 'balance', 'deposit', 'withdrawal', 'investment', 'portfolio']):
-        result["category"] = "financial"
-        result["doc_type"] = "financial_statement"
-    elif any(x in text_lower for x in ['receipt', 'purchase', 'order', 'subtotal', 'total', 'qty', 'item']):
-        result["category"] = "expenses"
-        result["doc_type"] = "receipt"
-    elif any(x in text_lower for x in ['vin', 'vehicle', 'registration', 'dmv', 'license plate', 'odometer']):
-        result["category"] = "vehicles"
-        result["doc_type"] = "vehicle_doc"
-    elif any(x in text_lower for x in ['mortgage', 'deed', 'property', 'hoa', 'homeowner']):
-        result["category"] = "home"
-        result["doc_type"] = "property_doc"
-
-    # Generate summary (first 200 chars, cleaned)
-    clean_text = ' '.join(text.split())[:200]
-    result["summary"] = clean_text
-
-    return result
+def encode_image_base64(filepath: Path) -> tuple[str, str]:
+    """Encode image/PDF to base64 for API. Returns (base64_data, media_type)."""
+    suffix = filepath.suffix.lower()
+
+    if suffix == '.pdf':
+        # For PDFs, convert first page to PNG using pdftoppm
+        import subprocess
+        result = subprocess.run(
+            ['pdftoppm', '-png', '-f', '1', '-l', '1', '-r', '150', str(filepath), '-'],
+            capture_output=True, timeout=30
+        )
+        if result.returncode == 0:
+            return base64.standard_b64encode(result.stdout).decode('utf-8'), 'image/png'
+        else:
+            raise RuntimeError(f"Failed to convert PDF: {result.stderr.decode()}")
+
+    # Image files
+    media_types = {
+        '.png': 'image/png',
+        '.jpg': 'image/jpeg',
+        '.jpeg': 'image/jpeg',
+        '.gif': 'image/gif',
+        '.webp': 'image/webp',
+    }
+    media_type = media_types.get(suffix, 'image/png')
+
+    with open(filepath, 'rb') as f:
+        return base64.standard_b64encode(f.read()).decode('utf-8'), media_type
+
+
+def analyze_document_with_ai(filepath: Path, client: anthropic.Anthropic) -> Dict[str, Any]:
+    """
+    Use Claude vision to analyze document.
+    Returns: {category, doc_type, date, vendor, amount, summary, full_text}
+    """
+    print(f"  Analyzing with AI...")
+
+    try:
+        image_data, media_type = encode_image_base64(filepath)
+    except Exception as e:
+        print(f"  Failed to encode document: {e}")
+        return {
+            "category": "uncategorized",
+            "doc_type": "unknown",
+            "full_text": f"(Failed to process: {e})",
+            "summary": "Document could not be processed"
+        }
+
+    prompt = """Analyze this document image and extract:
+
+1. **Full Text**: Transcribe ALL visible text from the document, preserving structure where possible.
+
+2. **Classification**: Categorize into exactly ONE of:
+   - taxes (W-2, 1099, tax returns, IRS forms)
+   - bills (utilities, subscriptions, invoices)
+   - medical (health records, prescriptions, lab results)
+   - insurance (policies, claims, coverage docs)
+   - legal (contracts, agreements, legal notices)
+   - financial (bank statements, investment docs)
+   - expenses (receipts, purchase confirmations)
+   - vehicles (registration, maintenance, DMV)
+   - home (mortgage, HOA, property docs)
+   - personal (ID copies, certificates, misc)
+   - contacts (business cards, contact info)
+   - uncategorized (if none fit)
+
+3. **Document Type**: Specific type (e.g., "utility_bill", "receipt", "tax_form_w2", "insurance_policy")
+
+4. **Key Fields**:
+   - date: Document date (YYYY-MM-DD format if possible)
+   - vendor: Company/organization name
+   - amount: Dollar amount if present (e.g., "$123.45")
+
+5. **Summary**: 1-2 sentence description of what this document is.
+
+Respond in JSON format:
+{
+  "category": "...",
+  "doc_type": "...",
+  "date": "...",
+  "vendor": "...",
+  "amount": "...",
+  "summary": "...",
+  "full_text": "..."
+}"""
+
+    try:
+        response = client.messages.create(
+            model="claude-sonnet-4-20250514",
+            max_tokens=4096,
+            messages=[
+                {
+                    "role": "user",
+                    "content": [
+                        {
+                            "type": "image",
+                            "source": {
+                                "type": "base64",
+                                "media_type": media_type,
+                                "data": image_data,
+                            },
+                        },
+                        {
+                            "type": "text",
+                            "text": prompt
+                        }
+                    ],
+                }
+            ],
+        )
+
+        # Parse JSON from response
+        text = response.content[0].text
+
+        # Try to extract JSON from response (handle markdown code blocks)
+        if "```json" in text:
+            text = text.split("```json")[1].split("```")[0]
+        elif "```" in text:
+            text = text.split("```")[1].split("```")[0]
+
+        result = json.loads(text.strip())
+
+        # Validate category
+        if result.get("category") not in CATEGORIES:
+            result["category"] = "uncategorized"
+
+        return result
+
+    except json.JSONDecodeError as e:
+        print(f"  Failed to parse AI response as JSON: {e}")
+        print(f"  Raw response: {text[:500]}")
+        return {
+            "category": "uncategorized",
+            "doc_type": "unknown",
+            "full_text": text,
+            "summary": "AI response could not be parsed"
+        }
+    except Exception as e:
+        print(f"  AI analysis failed: {e}")
+        return {
+            "category": "uncategorized",
+            "doc_type": "unknown",
+            "full_text": f"(AI analysis failed: {e})",
+            "summary": "Document analysis failed"
+        }
+
+
+def generate_embedding(text: str, client: anthropic.Anthropic) -> Optional[List[float]]:
+    """
+    Generate text embedding.
+    Note: As of 2024, Anthropic doesn't have a public embedding API.
+    This is a placeholder - implement with OpenAI, Voyage, or local model.
+
+    For now, returns None and we'll use full-text search in SQLite.
+    """
+    # TODO: Implement with preferred embedding provider
+    # Options:
+    # 1. OpenAI text-embedding-3-small (cheap, good quality)
+    # 2. Voyage AI (good for documents)
+    # 3. Local sentence-transformers
+    return None
+
+
+def store_embedding(doc_id: str, embedding: Optional[List[float]], text: str):
+    """Store embedding in SQLite database."""
+    if embedding is None:
+        return
+
+    conn = sqlite3.connect(EMBEDDINGS_DB)
+
+    # Pack floats as binary blob
+    embedding_blob = struct.pack(f'{len(embedding)}f', *embedding)
+    text_hash = hashlib.sha256(text.encode()).hexdigest()[:16]
+
+    conn.execute("""
+        INSERT OR REPLACE INTO embeddings (doc_id, embedding, text_hash, created_at)
+        VALUES (?, ?, ?, ?)
+    """, (doc_id, embedding_blob, text_hash, datetime.now().isoformat()))
+
+    conn.commit()
+    conn.close()
+
+
+def store_document_metadata(doc_id: str, filename: str, classification: Dict, full_text: str):
+    """Store document metadata in SQLite for full-text search."""
+    conn = sqlite3.connect(EMBEDDINGS_DB)
+
+    conn.execute("""
+        INSERT OR REPLACE INTO documents
+        (doc_id, filename, category, doc_type, date, vendor, amount, summary, full_text, processed_at)
+        VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
+    """, (
+        doc_id,
+        filename,
+        classification.get("category", "uncategorized"),
+        classification.get("doc_type", "unknown"),
+        classification.get("date"),
+        classification.get("vendor"),
+        classification.get("amount"),
+        classification.get("summary"),
+        full_text[:50000],  # Limit text size
+        datetime.now().isoformat()
+    ))
+
+    conn.commit()
+    conn.close()
+
+
 def store_document(filepath: Path, hash_id: str) -> Path:
````
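The response parsing added in `analyze_document_with_ai` strips an optional markdown code fence before calling `json.loads`. That step can be exercised in isolation; `extract_json` below is a hypothetical helper mirroring that logic, not a function from the repo:

```python
import json

def extract_json(text):
    """Parse JSON from a model reply that may wrap it in a markdown fence."""
    if "```json" in text:
        # Keep only what sits between the ```json fence markers
        text = text.split("```json")[1].split("```")[0]
    elif "```" in text:
        text = text.split("```")[1].split("```")[0]
    return json.loads(text.strip())

reply = 'Here is the result:\n```json\n{"category": "bills", "amount": "$42.00"}\n```'
print(extract_json(reply))
```

Bare JSON with no fence passes straight through to `json.loads`, matching the processor's fallback behavior.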
````diff
@@ -194,14 +327,16 @@ def store_document(filepath: Path, hash_id: str) -> Path:
     return store_path
 
 
-def create_record(filepath: Path, hash_id: str, text: str, classification: Dict) -> Path:
+def create_record(filepath: Path, hash_id: str, classification: Dict) -> Path:
     """Create markdown record in appropriate category folder."""
-    cat = classification["category"]
+    cat = classification.get("category", "uncategorized")
     now = datetime.now()
 
     record_name = f"{now.strftime('%Y%m%d')}_{hash_id}.md"
     record_path = RECORDS / cat / record_name
 
+    full_text = classification.get("full_text", "")
+
     content = f"""# Document Record
 
 **ID:** {hash_id}
@@ -225,12 +360,12 @@ def create_record(filepath: Path, hash_id: str, classification: Dict)
 ## Full Text
 
 ```
-{text[:5000]}
+{full_text[:10000]}
 ```
 
 ## Files
 
-- **PDF:** [store/{hash_id}{filepath.suffix}](../../store/{hash_id}{filepath.suffix})
+- **Original:** [store/{hash_id}{filepath.suffix}](../../store/{hash_id}{filepath.suffix})
 """
 
     record_path.write_text(content)
@@ -245,15 +380,22 @@ def update_master_index(hash_id: str, filepath: Path, classification: Dict) -> N
         with open(index_path) as f:
             data = json.load(f)
     else:
-        data = {"version": "1.0", "created": datetime.now().strftime("%Y-%m-%d"), "documents": [], "stats": {"total": 0, "by_type": {}, "by_year": {}}}
+        data = {
+            "version": "2.0",
+            "created": datetime.now().strftime("%Y-%m-%d"),
+            "documents": [],
+            "stats": {"total": 0, "by_type": {}, "by_category": {}}
+        }
 
     doc_entry = {
         "id": hash_id,
         "filename": filepath.name,
-        "category": classification["category"],
+        "category": classification.get("category", "uncategorized"),
         "type": classification.get("doc_type", "unknown"),
         "date": classification.get("date"),
+        "vendor": classification.get("vendor"),
         "amount": classification.get("amount"),
+        "summary": classification.get("summary"),
         "processed": datetime.now().isoformat(),
     }
````
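`update_master_index` keeps its running counters with the `dict.get(key, 0) + 1` idiom, so a counter key is created the first time a type or category is seen. The pattern in isolation (sample values are illustrative):

```python
def bump(counters, key):
    """Increment a counter dict entry, creating it on first sight."""
    counters[key] = counters.get(key, 0) + 1

stats = {"by_type": {}, "by_category": {}}
for doc_type, category in [("receipt", "expenses"), ("bill", "bills"), ("receipt", "expenses")]:
    bump(stats["by_type"], doc_type)
    bump(stats["by_category"], category)

print(stats)
```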
@ -262,9 +404,11 @@ def update_master_index(hash_id: str, filepath: Path, classification: Dict) -> N
|
||||||
data["documents"].append(doc_entry)
|
data["documents"].append(doc_entry)
|
||||||
data["stats"]["total"] = len(data["documents"])
|
data["stats"]["total"] = len(data["documents"])
|
||||||
|
|
||||||
# Update type stats
|
# Update type/category stats
|
||||||
dtype = classification.get("doc_type", "unknown")
|
dtype = classification.get("doc_type", "unknown")
|
||||||
|
cat = classification.get("category", "uncategorized")
|
||||||
data["stats"]["by_type"][dtype] = data["stats"]["by_type"].get(dtype, 0) + 1
|
data["stats"]["by_type"][dtype] = data["stats"]["by_type"].get(dtype, 0) + 1
|
||||||
|
data["stats"]["by_category"][cat] = data["stats"]["by_category"].get(cat, 0) + 1
|
||||||
|
|
||||||
with open(index_path, 'w') as f:
|
with open(index_path, 'w') as f:
|
||||||
json.dump(data, f, indent=2)
|
json.dump(data, f, indent=2)
|
||||||
|
|
@ -272,7 +416,7 @@ def update_master_index(hash_id: str, filepath: Path, classification: Dict) -> N
|
||||||
|
|
||||||
def export_expense(hash_id: str, classification: Dict, filepath: Path) -> None:
|
def export_expense(hash_id: str, classification: Dict, filepath: Path) -> None:
|
||||||
"""Append to expenses.csv if it's an expense/receipt."""
|
"""Append to expenses.csv if it's an expense/receipt."""
|
||||||
if classification["category"] not in ["expenses", "bills"]:
|
if classification.get("category") not in ["expenses", "bills"]:
|
||||||
return
|
return
|
||||||
|
|
||||||
csv_path = EXPORTS / "expenses.csv"
|
csv_path = EXPORTS / "expenses.csv"
|
||||||
|
|
@@ -287,22 +431,22 @@ def export_expense(hash_id: str, classification: Dict, filepath: Path) -> None:
         classification.get("date", ""),
         classification.get("vendor", ""),
         classification.get("amount", ""),
-        classification["category"],
+        classification.get("category", ""),
         classification.get("doc_type", ""),
         hash_id,
         filepath.name,
     ])
 
 
-def process_document(filepath: Path) -> bool:
+def process_document(filepath: Path, client: anthropic.Anthropic) -> bool:
     """Process a single document through the full pipeline."""
     print(f"Processing: {filepath.name}")
 
-    # Skip hidden files and non-documents
+    # Skip hidden files
     if filepath.name.startswith('.'):
         return False
 
-    valid_extensions = {'.pdf', '.png', '.jpg', '.jpeg', '.tiff', '.tif', '.bmp', '.txt'}
+    valid_extensions = {'.pdf', '.png', '.jpg', '.jpeg', '.gif', '.webp', '.tiff', '.tif', '.bmp'}
     if filepath.suffix.lower() not in valid_extensions:
         print(f"  Skipping unsupported format: {filepath.suffix}")
         return False
 
@@ -318,87 +462,98 @@ def process_document(filepath: Path) -> bool:
         filepath.unlink()
         return True
 
-    # 3. Extract text (OCR if needed)
-    print("  Extracting text...")
-    text = extract_text(filepath)
-    if not text:
-        print("  Warning: No text extracted")
-        text = "(No text could be extracted)"
-    else:
-        print(f"  Extracted {len(text)} characters")
+    # 3. Analyze with AI (extracts text + classifies in one pass)
+    classification = analyze_document_with_ai(filepath, client)
+    full_text = classification.get("full_text", "")
+    print(f"  Category: {classification.get('category')}, Type: {classification.get('doc_type')}")
+    print(f"  Extracted {len(full_text)} characters")
 
-    # 4. Classify
-    print("  Classifying...")
-    classification = classify_document(text, filepath.name)
-    print(f"  Category: {classification['category']}, Type: {classification.get('doc_type')}")
-
-    # 5. Store PDF
+    # 4. Store original document
     print("  Storing document...")
     store_document(filepath, hash_id)
 
-    # 6. Create record
+    # 5. Create markdown record
     print("  Creating record...")
-    record_path = create_record(filepath, hash_id, text, classification)
+    record_path = create_record(filepath, hash_id, classification)
     print(f"  Record: {record_path}")
 
-    # 7. Update index
+    # 6. Update JSON index
     print("  Updating index...")
     update_master_index(hash_id, filepath, classification)
 
-    # 8. Export if expense
+    # 7. Store in SQLite (for search)
+    print("  Storing in SQLite...")
+    store_document_metadata(hash_id, filepath.name, classification, full_text)
+
+    # 8. Generate and store embedding (if implemented)
+    embedding = generate_embedding(full_text, client)
+    if embedding:
+        store_embedding(hash_id, embedding, full_text)
+
+    # 9. Export if expense
     export_expense(hash_id, classification, filepath)
 
-    # 9. Remove from inbox
+    # 10. Remove from inbox
     print("  Removing from inbox...")
     filepath.unlink()
 
-    print(f"  ✓ Done: {classification['category']}/{hash_id}")
+    print(f"  ✓ Done: {classification.get('category')}/{hash_id}")
     return True
 
 
-def process_inbox() -> int:
+def process_inbox(client: anthropic.Anthropic) -> int:
     """Process all documents in inbox. Returns count processed."""
     count = 0
-    for filepath in INBOX.iterdir():
+    for filepath in sorted(INBOX.iterdir()):
         if filepath.is_file() and not filepath.name.startswith('.'):
             try:
-                if process_document(filepath):
+                if process_document(filepath, client):
                     count += 1
             except Exception as e:
                 print(f"Error processing {filepath}: {e}")
+                import traceback
+                traceback.print_exc()
     return count
 
 
-def watch_inbox(interval: int = 30) -> None:
+def watch_inbox(client: anthropic.Anthropic, interval: int = 60) -> None:
     """Watch inbox continuously."""
     print(f"Watching {INBOX} (interval: {interval}s)")
     print("Press Ctrl+C to stop")
 
     while True:
-        count = process_inbox()
+        count = process_inbox(client)
         if count:
             print(f"Processed {count} document(s)")
         time.sleep(interval)
 
 
 def main():
-    import argparse
-    parser = argparse.ArgumentParser(description="Document processor")
+    parser = argparse.ArgumentParser(description="AI-powered document processor")
     parser.add_argument("--watch", action="store_true", help="Watch inbox continuously")
-    parser.add_argument("--interval", type=int, default=30, help="Watch interval in seconds")
+    parser.add_argument("--interval", type=int, default=60, help="Watch interval in seconds")
    parser.add_argument("--file", type=Path, help="Process single file")
     args = parser.parse_args()
 
+    # Initialize
+    init_embeddings_db()
+
+    try:
+        client = get_anthropic_client()
+    except RuntimeError as e:
+        print(f"ERROR: {e}")
+        sys.exit(1)
+
     if args.file:
         if args.file.exists():
-            process_document(args.file)
+            process_document(args.file, client)
         else:
             print(f"File not found: {args.file}")
             sys.exit(1)
     elif args.watch:
-        watch_inbox(args.interval)
+        watch_inbox(client, args.interval)
     else:
-        count = process_inbox()
+        count = process_inbox(client)
         print(f"Processed {count} document(s)")
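The pipeline above calls `generate_embedding(full_text, client)` and guards storage with `if embedding:`, while the commit message marks generation as a placeholder. A minimal sketch of what that stub could look like — the body is an assumption, only the call-site contract (text plus client in, optional vector out) comes from the diff:

```python
from typing import List, Optional


def generate_embedding(text: str, client) -> Optional[List[float]]:
    """Placeholder embedding generator.

    Returns None until an embedding backend is chosen. Because
    process_document checks `if embedding:` before calling
    store_embedding, a None return cleanly skips the storage step
    without breaking the rest of the pipeline.
    """
    return None  # swap in a real embeddings call here later
```

Returning `None` (rather than raising) keeps single-file and watch modes working end to end before the backend exists.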
249  search.py
@@ -1,11 +1,13 @@
 #!/usr/bin/env python3
 """
 Search documents in the document management system.
+Uses SQLite full-text search on document content.
 """
 
 import os
 import sys
 import json
+import sqlite3
 import argparse
 from pathlib import Path
 from datetime import datetime
@@ -13,108 +15,184 @@ from datetime import datetime
 DOCUMENTS_ROOT = Path.home() / "documents"
 INDEX = DOCUMENTS_ROOT / "index"
 RECORDS = DOCUMENTS_ROOT / "records"
+EMBEDDINGS_DB = INDEX / "embeddings.db"
 
 
-def load_index() -> dict:
-    """Load the master index."""
-    index_path = INDEX / "master.json"
-    if index_path.exists():
-        with open(index_path) as f:
-            return json.load(f)
-    return {"documents": []}
+def get_db() -> sqlite3.Connection:
+    """Get database connection."""
+    if not EMBEDDINGS_DB.exists():
+        print(f"Database not found: {EMBEDDINGS_DB}")
+        print("Run the processor first to create the database.")
+        sys.exit(1)
+    return sqlite3.connect(EMBEDDINGS_DB)
 
 
-def search_documents(query: str, category: str = None, doc_type: str = None) -> list:
-    """Search documents by query, optionally filtered by category/type."""
-    data = load_index()
-    results = []
-
-    query_lower = query.lower() if query else ""
-
-    for doc in data["documents"]:
-        # Apply filters
-        if category and doc.get("category") != category:
-            continue
-        if doc_type and doc.get("type") != doc_type:
-            continue
-
-        # If no query, return all matching filters
-        if not query:
-            results.append(doc)
-            continue
-
-        # Search in indexed fields
-        searchable = f"{doc.get('filename', '')} {doc.get('category', '')} {doc.get('type', '')} {doc.get('date', '')} {doc.get('amount', '')}".lower()
-        if query_lower in searchable:
-            results.append(doc)
-            continue
-
-        # Search in full text record
-        record_path = find_record(doc["id"], doc["category"])
-        if record_path and record_path.exists():
-            content = record_path.read_text().lower()
-            if query_lower in content:
-                results.append(doc)
+def search_documents(query: str, category: str = None, doc_type: str = None, limit: int = 20) -> list:
+    """
+    Search documents by query using SQLite full-text search.
+    Returns list of matching documents.
+    """
+    conn = get_db()
+    conn.row_factory = sqlite3.Row
+
+    # Build query
+    conditions = []
+    params = []
+
+    if query:
+        # Search in full_text, summary, vendor, filename
+        conditions.append("""(
+            full_text LIKE ? OR
+            summary LIKE ? OR
+            vendor LIKE ? OR
+            filename LIKE ?
+        )""")
+        like_query = f"%{query}%"
+        params.extend([like_query, like_query, like_query, like_query])
+
+    if category:
+        conditions.append("category = ?")
+        params.append(category)
+
+    if doc_type:
+        conditions.append("doc_type = ?")
+        params.append(doc_type)
+
+    where_clause = " AND ".join(conditions) if conditions else "1=1"
+
+    sql = f"""
+        SELECT doc_id, filename, category, doc_type, date, vendor, amount, summary, processed_at
+        FROM documents
+        WHERE {where_clause}
+        ORDER BY processed_at DESC
+        LIMIT ?
+    """
+    params.append(limit)
+
+    cursor = conn.execute(sql, params)
+    results = [dict(row) for row in cursor.fetchall()]
+    conn.close()
 
     return results
 
 
-def find_record(doc_id: str, category: str) -> Path:
-    """Find the record file for a document."""
-    cat_dir = RECORDS / category
-    if cat_dir.exists():
-        for f in cat_dir.iterdir():
-            if doc_id in f.name:
-                return f
-    return None
+def get_document(doc_id: str) -> dict:
+    """Get full document details by ID."""
+    conn = get_db()
+    conn.row_factory = sqlite3.Row
+
+    cursor = conn.execute("""
+        SELECT * FROM documents WHERE doc_id = ? OR doc_id LIKE ?
+    """, (doc_id, f"{doc_id}%"))
+
+    row = cursor.fetchone()
+    conn.close()
+
+    return dict(row) if row else None
+
+
+def list_categories() -> dict:
+    """List all categories with document counts."""
+    conn = get_db()
+    cursor = conn.execute("""
+        SELECT category, COUNT(*) as count
+        FROM documents
+        GROUP BY category
+        ORDER BY count DESC
+    """)
+    results = {row[0]: row[1] for row in cursor.fetchall()}
+    conn.close()
+    return results
+
+
+def list_types() -> dict:
+    """List all document types with counts."""
+    conn = get_db()
+    cursor = conn.execute("""
+        SELECT doc_type, COUNT(*) as count
+        FROM documents
+        GROUP BY doc_type
+        ORDER BY count DESC
+    """)
+    results = {row[0]: row[1] for row in cursor.fetchall()}
+    conn.close()
+    return results
+
+
+def show_stats() -> None:
+    """Show document statistics."""
+    conn = get_db()
+
+    # Total count
+    total = conn.execute("SELECT COUNT(*) FROM documents").fetchone()[0]
+
+    print("\n📊 Document Statistics")
+    print("=" * 40)
+    print(f"Total documents: {total}")
+
+    # By category
+    print("\nBy category:")
+    for cat, count in list_categories().items():
+        print(f"  {cat}: {count}")
+
+    # By type
+    print("\nBy type:")
+    for dtype, count in list_types().items():
+        print(f"  {dtype}: {count}")
+
+    conn.close()
 
 
 def show_document(doc_id: str) -> None:
     """Show full details of a document."""
-    data = load_index()
-
-    for doc in data["documents"]:
-        if doc["id"] == doc_id or doc_id in doc.get("filename", ""):
-            print(f"\n{'='*60}")
-            print(f"Document: {doc['filename']}")
-            print(f"ID: {doc['id']}")
-            print(f"Category: {doc['category']}")
-            print(f"Type: {doc.get('type', 'unknown')}")
-            print(f"Date: {doc.get('date', 'N/A')}")
-            print(f"Amount: {doc.get('amount', 'N/A')}")
-            print(f"Processed: {doc.get('processed', 'N/A')}")
-            print(f"{'='*60}")
-
-            # Show record content
-            record_path = find_record(doc["id"], doc["category"])
-            if record_path:
-                print(f"\nRecord: {record_path}")
-                print("-"*60)
-                print(record_path.read_text())
-            return
-
-    print(f"Document not found: {doc_id}")
+    doc = get_document(doc_id)
+
+    if not doc:
+        print(f"Document not found: {doc_id}")
+        return
+
+    print(f"\n{'=' * 60}")
+    print(f"Document: {doc['filename']}")
+    print(f"ID: {doc['doc_id']}")
+    print(f"Category: {doc['category']}")
+    print(f"Type: {doc['doc_type'] or 'unknown'}")
+    print(f"Date: {doc['date'] or 'N/A'}")
+    print(f"Vendor: {doc['vendor'] or 'N/A'}")
+    print(f"Amount: {doc['amount'] or 'N/A'}")
+    print(f"Processed: {doc['processed_at']}")
+    print(f"{'=' * 60}")
+
+    if doc['summary']:
+        print(f"\nSummary:\n{doc['summary']}")
+
+    if doc['full_text']:
+        print(f"\n--- Full Text (first 2000 chars) ---\n")
+        print(doc['full_text'][:2000])
+        if len(doc['full_text']) > 2000:
+            print(f"\n... [{len(doc['full_text']) - 2000} more characters]")
 
 
-def list_stats() -> None:
-    """Show document statistics."""
-    data = load_index()
-
-    print("\n📊 Document Statistics")
-    print("="*40)
-    print(f"Total documents: {data['stats']['total']}")
-
-    print("\nBy type:")
-    for dtype, count in sorted(data["stats"].get("by_type", {}).items()):
-        print(f"  {dtype}: {count}")
-
-    print("\nBy category:")
-    by_cat = {}
-    for doc in data["documents"]:
-        cat = doc.get("category", "unknown")
-        by_cat[cat] = by_cat.get(cat, 0) + 1
-    for cat, count in sorted(by_cat.items()):
-        print(f"  {cat}: {count}")
+def format_results(results: list) -> None:
+    """Format and print search results."""
+    if not results:
+        print("No documents found")
+        return
+
+    print(f"\nFound {len(results)} document(s):\n")
+
+    # Header
+    print(f"{'ID':<10} {'Category':<12} {'Type':<18} {'Date':<12} {'Amount':<10} {'Filename'}")
+    print("-" * 90)
+
+    for doc in results:
+        doc_id = doc['doc_id'][:8]
+        cat = (doc['category'] or '')[:12]
+        dtype = (doc['doc_type'] or 'unknown')[:18]
+        date = (doc['date'] or '')[:12]
+        amount = (doc['amount'] or '')[:10]
+        filename = doc['filename'][:30]
+
+        print(f"{doc_id:<10} {cat:<12} {dtype:<18} {date:<12} {amount:<10} {filename}")
@@ -125,10 +203,12 @@ def main():
     parser.add_argument("-s", "--show", help="Show full document by ID")
     parser.add_argument("--stats", action="store_true", help="Show statistics")
     parser.add_argument("-l", "--list", action="store_true", help="List all documents")
+    parser.add_argument("-n", "--limit", type=int, default=20, help="Max results (default: 20)")
+    parser.add_argument("--full-text", action="store_true", help="Show full text in results")
     args = parser.parse_args()
 
     if args.stats:
-        list_stats()
+        show_stats()
         return
 
     if args.show:
@@ -136,17 +216,8 @@ def main():
         return
 
     if args.list or args.query or args.category or args.type:
-        results = search_documents(args.query, args.category, args.type)
-
-        if not results:
-            print("No documents found")
-            return
-
-        print(f"\nFound {len(results)} document(s):\n")
-        for doc in results:
-            date = doc.get("date", "")[:10] if doc.get("date") else ""
-            amount = doc.get("amount", "")
-            print(f"  [{doc['id'][:8]}] {doc['category']:12} {doc.get('type', ''):15} {date:12} {amount:10} {doc['filename']}")
+        results = search_documents(args.query, args.category, args.type, args.limit)
+        format_results(results)
     else:
         parser.print_help()
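The rewritten `search_documents` builds a parameterized `LIKE` query rather than using an FTS index. The query shape can be exercised against a throwaway in-memory table — the schema subset and sample row below are illustrative, not from the commit:

```python
import sqlite3

# Minimal subset of the documents schema, in memory.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE documents (
    doc_id TEXT, filename TEXT, category TEXT, doc_type TEXT,
    date TEXT, vendor TEXT, amount TEXT, summary TEXT,
    full_text TEXT, processed_at TEXT)""")
conn.execute(
    "INSERT INTO documents VALUES (?,?,?,?,?,?,?,?,?,?)",
    ("abc12345", "power.pdf", "bills", "utility_bill",
     "2024-01-05", "City Power", "89.10", "January power bill",
     "Account 123 ... total due 89.10", "2024-01-06T10:00:00"))

# Same four-column OR pattern search_documents generates for a query term.
like = "%power%"
rows = conn.execute(
    """SELECT doc_id, vendor FROM documents
       WHERE (full_text LIKE ? OR summary LIKE ? OR vendor LIKE ? OR filename LIKE ?)
       ORDER BY processed_at DESC LIMIT ?""",
    (like, like, like, like, 20),
).fetchall()
print(rows)
```

Note that despite the docstring's "full-text search" wording, `LIKE` does substring matching (case-insensitive for ASCII in SQLite), and the leading `%` wildcard forces a table scan; migrating to an FTS5 virtual table would be the follow-up if the corpus grows.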