29 Commits

Author SHA1 Message Date
James 883f118d66 fix: pdftoppm output filename glob instead of hardcoded page-1.png
pdftoppm zero-pads the page number based on total page count:
- <10 pages: page-1.png
- <100 pages: page-01.png
- <1000 pages: page-001.png

The code hardcoded 'page-1.png' and 'page-N.png', which fails whenever
pdftoppm zero-pads the page number (documents with 10 or more pages). Use
filepath.Glob('page-*.png') to find the actual output regardless of padding width.

Fixed in both ConvertToImage() (first-page preview) and the multi-page
OCR loop in ProcessDocument().
2026-03-23 14:14:28 -04:00
James 9622ab9390 fix: format=md endpoint now returns full OCR text (full_text field)
SearchDocuments excludes full_text for performance. The MD endpoint
needs the actual OCR content, not just the summary.

Added SearchDocumentsWithFullText() and SearchDocumentsWithFullTextFallback()
that select full_text explicitly. apiSearchMDHandler now uses these,
so format=md returns the complete OCR/markdown text for each document.
2026-03-23 14:07:20 -04:00
James 405a6f697f feat: add GET /api/search?q=...&format=md for AI/LLM consumption
New endpoint returns all matching documents as concatenated plain-text
markdown, one section per document separated by ---.

Format:
  # Document: {title}
  ID: {id} | Category: {category} | Date: {date} | Vendor: {vendor}

  {full_text or summary}

  ---

Parameters:
  q      - search query (required)
  format - must be 'md' (required; distinguishes from HTML search)

Uses same FTS5 search as existing endpoints, limit raised to 200.
Falls back to LIKE search if FTS5 fails. Returns text/markdown content type.
POST /api/search (HTML partial) unchanged.
2026-03-23 13:58:47 -04:00
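
The output format listed in 405a6f697f can be sketched as a small renderer. The Document struct and renderMarkdown function are assumptions made for this example; only the section layout and the full_text-or-summary fallback come from the commit message.

```go
package main

import (
	"fmt"
	"strings"
)

// Document is a hypothetical struct holding the fields named in the format.
type Document struct {
	ID, Title, Category, Date, Vendor string
	FullText, Summary                 string
}

// renderMarkdown emits one section per document, each terminated by "---".
// It uses FullText when present and falls back to Summary otherwise.
func renderMarkdown(docs []Document) string {
	var b strings.Builder
	for _, d := range docs {
		body := d.FullText
		if strings.TrimSpace(body) == "" {
			body = d.Summary
		}
		fmt.Fprintf(&b, "# Document: %s\n", d.Title)
		fmt.Fprintf(&b, "ID: %s | Category: %s | Date: %s | Vendor: %s\n\n",
			d.ID, d.Category, d.Date, d.Vendor)
		fmt.Fprintf(&b, "%s\n\n---\n\n", body)
	}
	return b.String()
}

func main() {
	docs := []Document{
		{ID: "2026-02-10-invoice", Title: "Invoice", Category: "bills",
			Date: "2026-02-10", Vendor: "Acme", Summary: "Summary only."},
	}
	fmt.Print(renderMarkdown(docs))
}
```

A handler would write this string with a text/markdown content type, as the commit describes.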
James 63d4e5e5ca chore: auto-commit uncommitted changes 2026-02-28 06:01:28 -05:00
James 2c91d5649e chore: auto-commit uncommitted changes 2026-02-28 00:01:21 -05:00
James bbc029196a chore: auto-commit uncommitted changes 2026-02-25 18:01:27 -05:00
James 83373885d4 Add vocabulary hints for handwriting: Jongsma, Johan, Tatyana, St. Petersburg FL 2026-02-25 14:24:04 -05:00
James 1b4c82ab83 Improve title prompt: require specific, identifying titles with sender+topic+date 2026-02-25 14:21:43 -05:00
James 4970157690 Switch vision model to qwen3-vl-30b-a3b-instruct
Replaces kimi-k2p5 for all vision tasks. K2.5 was outputting chain-of-thought
reasoning instead of JSON for non-English docs, requiring a fallback path.
qwen3-vl works first try, no retry needed, preserves original language correctly.
2026-02-25 14:17:54 -05:00
James 193d88afef Add delete button to category list view 2026-02-25 14:09:05 -05:00
James d962c9839d Fix extraction: don't translate, fallback OCR+classify path for non-JSON responses
- Add 'DO NOT translate, preserve original language' to vision prompts
- Shorter/tighter JSON prompt to reduce K2.5 reasoning verbosity
- Fallback: when AnalyzeWithVision returns no JSON, do AnalyzePageOnly (plain text) then AnalyzeText (classify)
- Fallback to AnalyzePageOnly for single-page PDFs with empty/placeholder full_text
- Switch model back to kimi-k2p5 (only vision model on this Fireworks account)
- Build with CGO_ENABLED=1 -tags fts5 (required for SQLite FTS5)
2026-02-25 14:01:59 -05:00
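
The fallback path described in d962c9839d (try the JSON response first, then plain OCR followed by text classification) might be structured along these lines. This is a sketch under assumptions: Analysis, parseAnalysis, and analyze are hypothetical, and the ocr and classify callbacks merely stand in for the roles the commit assigns to AnalyzePageOnly and AnalyzeText, whose real signatures are not shown in the log.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"strings"
)

// Analysis is a hypothetical result type for a processed page.
type Analysis struct {
	Title    string `json:"title"`
	Category string `json:"category"`
	FullText string `json:"full_text"`
}

var errNoJSON = errors.New("no JSON object in model response")

// parseAnalysis extracts the outermost {...} span from a model response
// and decodes it, tolerating prose before or after the JSON.
func parseAnalysis(raw string) (Analysis, error) {
	start := strings.Index(raw, "{")
	end := strings.LastIndex(raw, "}")
	if start < 0 || end < start {
		return Analysis{}, errNoJSON
	}
	var a Analysis
	if err := json.Unmarshal([]byte(raw[start:end+1]), &a); err != nil {
		return Analysis{}, errNoJSON
	}
	return a, nil
}

// analyze uses the vision response when it contains JSON; otherwise it
// falls back to plain-text OCR and then classifies that text.
func analyze(raw string, ocr func() (string, error),
	classify func(text string) (Analysis, error)) (Analysis, error) {
	if a, err := parseAnalysis(raw); err == nil {
		return a, nil
	}
	text, err := ocr()
	if err != nil {
		return Analysis{}, err
	}
	a, err := classify(text)
	if err != nil {
		return Analysis{}, err
	}
	a.FullText = text // keep the OCR text in its original language
	return a, nil
}

func main() {
	ocr := func() (string, error) { return "Factuur nr. 42", nil }
	classify := func(text string) (Analysis, error) {
		return Analysis{Title: "Factuur", Category: "bills"}, nil
	}
	a, err := analyze("Let me think about this document...", ocr, classify)
	fmt.Println(a.Title, a.FullText, err)
}
```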
James 00d8f7c94a chore: auto-commit uncommitted changes 2026-02-15 12:00:36 -05:00
James 9d6ad09b53 Use proper content-type for downloads to avoid Chrome insecure download block 2026-02-12 17:53:51 -05:00
James a1d156bbd5 Fix download: serve file manually to avoid http.ServeFile header conflicts 2026-02-12 17:50:12 -05:00
James 99b39ee737 Add download attribute to download link to prevent inline viewing 2026-02-12 17:29:48 -05:00
James f59c12e25c Add download link with pretty filename from document title
- servePDF now supports ?download=1 query param
- Looks up document title and uses it as the Content-Disposition filename
- Download button on document page triggers actual download (not tab open)
- Added sanitizeFilename helper for safe Content-Disposition values
2026-02-12 17:27:12 -05:00
James 1b49dac87f Document page: two-row layout - details|summary+notes top, OCR|PDF bottom 2026-02-10 04:14:51 -05:00
James 6adfefff7a Change processed date format to 'Jan 02, 2006 3:04 PM EST' 2026-02-10 04:09:35 -05:00
James a52ab6e20d Document details: two-column layout (category/vendor/amount | date/processed/filename) 2026-02-10 04:06:59 -05:00
James 9c9bd5e881 Document details: inline category dropdown, formatted processed_at timestamp 2026-02-10 04:00:11 -05:00
James 4a0e9648ac Move category edit to document page (inline dropdown), revert dashboard to static badges 2026-02-10 03:55:27 -05:00
James dabd97e13c Dashboard: formatted timestamps (MM/DD/YYYY HH:MM TZ), inline category edit dropdown 2026-02-10 03:52:46 -05:00
James 3a6aa8cbda Dashboard: narrower categories column, show scan time in recent docs, add inou/sophia categories 2026-02-10 03:48:20 -05:00
James b3bb615075 Add migration script for hash-based IDs to date-slug format (31 docs migrated) 2026-02-10 00:34:44 -05:00
James 5445b294cb Share links now use .pdf extension and Content-Disposition header for Android compatibility 2026-02-10 00:33:13 -05:00
James a77a31f4c9 Fix share links: use external URL (docs.jongsma.me), fix copy button, add copy on existing shares 2026-02-09 12:09:16 -05:00
James 9f0bac5783 Add document sharing with expiring links
- Share table with random tokens and optional expiry (default 7 days)
- Public /s/{token} endpoint serves PDF directly
- Share/revoke UI on document page with copy-to-clipboard
- Caddy reverse proxy configured at docs.jongsma.me
2026-02-09 11:28:21 -05:00
James a73ae5c03e fix: delete document cleans up store files and embeddings 2026-02-08 03:55:42 -05:00
James 00d0b0a0d7 Initial commit 2026-02-04 13:37:26 -05:00