# Azure Files Backup: Requirements Spec

*Captured: 2025-01-28 | Domain: Personal | Priority: HIGH*

## Purpose

**POC to prove a point:** The right architecture can back up billions of files with minimal database overhead.

This is NOT a Kaseya project; it's Johan demonstrating his design philosophy.

## Target

- **Azure Files API** specifically
- NOT Azure Blob Storage
- NOT OneDrive/SharePoint

## Scale Requirements

- **Billions of files**
- 64-bit node IDs required
- DB must fit in RAM for fast queries (~50GB target)

## Database Design (~50 bytes/file)

| Field | Type | Size | Purpose |
|-------|------|------|---------|
| node_id | int64 | 8 bytes | Unique identifier (billions need 64-bit) |
| parent_id | int64 | 8 bytes | Tree structure link |
| name | varchar | ~20 bytes | Filename only, NOT full path |
| size | int64 | 8 bytes | File size in bytes |
| mtime | int64 | 8 bytes | Unix timestamp |
| hash | int64 | 8 bytes | xorhash (MSFT standard) |

**Total: ~50-60 bytes/file (40 bytes of fixed fields + ~20-byte name) → ~50-60GB for 1 billion files → fits in RAM**

### Key Constraints

- **Node tree only:** NO full path strings stored
- Paths reconstructed by walking parent_id to root
- Rename directory = update 1 row, not millions
- DB is index + analytics only

## Object Storage Design

Everything that doesn't fit in the ~50-byte row goes here:

- Full metadata (ACLs, extended attributes, permissions)
- File content (chunked, deduplicated)
- Version history
- FlatBuffer serialized

### Bundling

- **TAR format** (proven, standard)
- Only when it saves ops (not for just 2 files)
- Threshold TBD (likely <64KB or <1MB)

## Hash Strategy

- **xorhash:** MSFT standard, 64-bit, fast
- NOT sha256 (overkill for change detection)
- Used for: change detection, not cryptographic verification

## Architecture

```
~/dev/azure-backup/
├── core/      # library (tree, hash, storage interface, flatbuffer)
├── worker/    # K8s-scalable backup worker (100s of workers)
├── api/       # REST API for GUI
└── web/       # Go templates + htmx
```

### Worker Design

- Stateless K8s pods
- Horizontal scaling (add pods, auto-claim work)
- Job types: scan, backup, restore, verify
- Queue: Postgres SKIP LOCKED (works up to ~1000 workers)

### Multi-Tenant

- Isolated by tenant_id + share_id
- Each tenant+share gets separate node tree
- Object paths: `{tenant_id}/{share_id}/{node_id}`

## GUI Requirements

- **Web UI:** Go + htmx/templ
- **Multi-tenant view** (not single-tenant)

## Meta

- **Language:** Go (all the way, core library)
- **Repo:** `~/dev/azure-backup`
- **License:** Proprietary
- **Type:** Personal POC (prove a point)

## Open Questions (resolved)

- ✅ 64-bit node IDs (billions of files)
- ✅ xorhash not sha256
- ✅ TAR bundling
- ✅ Multi-tenant GUI
- ✅ Proprietary license

## Status

- ✅ Requirements captured
- ✅ Repo scaffolded
- ✅ ARCHITECTURE.md written
- ✅ FlatBuffer schema + Go code generated
- ✅ Azure SDK integration (real client implementation)
- ✅ Web UI (Go + htmx + Tailwind)
- ✅ 4,400+ lines of Go code
- 🔲 Azure free trial account (needs Johan)
- 🔲 Database integration (Postgres)
- 🔲 End-to-end test with real Azure Files
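
The node-tree rule from Key Constraints (no stored paths, reconstruct by walking parent_id, rename = 1 row) can be illustrated with a minimal in-memory sketch. The `Tree` type, its methods, and the convention that `ParentID == 0` means "directly under the share root" are assumptions made for illustration; the real index would live in the database.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// Node mirrors the columns in the Database Design table.
type Node struct {
	NodeID   int64
	ParentID int64  // 0 = directly under the share root (assumption of this sketch)
	Name     string // filename only, never a full path
	Size     int64
	Mtime    int64 // Unix timestamp
	Hash     int64
}

// Tree is a hypothetical in-memory stand-in for the node table.
type Tree struct {
	nodes map[int64]*Node
}

func NewTree() *Tree {
	return &Tree{nodes: make(map[int64]*Node)}
}

// Add inserts or replaces a node.
func (t *Tree) Add(n Node) {
	t.nodes[n.NodeID] = &n
}

// Path rebuilds the full path by walking parent_id links up to the root.
func (t *Tree) Path(id int64) (string, error) {
	var parts []string
	for steps := 0; id != 0; steps++ {
		if steps >= len(t.nodes) {
			return "", errors.New("cycle detected in parent_id chain")
		}
		n, ok := t.nodes[id]
		if !ok {
			return "", fmt.Errorf("node %d not found", id)
		}
		parts = append(parts, n.Name)
		id = n.ParentID
	}
	// Names were collected leaf-first; reverse them.
	for i, j := 0, len(parts)-1; i < j; i, j = i+1, j-1 {
		parts[i], parts[j] = parts[j], parts[i]
	}
	return "/" + strings.Join(parts, "/"), nil
}

// Rename changes one node's name. Descendants are untouched because they
// store only a parent_id, not a path.
func (t *Tree) Rename(id int64, name string) error {
	n, ok := t.nodes[id]
	if !ok {
		return fmt.Errorf("node %d not found", id)
	}
	n.Name = name
	return nil
}

func main() {
	t := NewTree()
	t.Add(Node{NodeID: 1, ParentID: 0, Name: "docs"})
	t.Add(Node{NodeID: 2, ParentID: 1, Name: "reports"})
	t.Add(Node{NodeID: 3, ParentID: 2, Name: "q1.pdf", Size: 1024})

	before, _ := t.Path(3)
	_ = t.Rename(1, "archive") // one row changes, every descendant path follows
	after, _ := t.Path(3)
	fmt.Println(before)
	fmt.Println(after)
}
```

Renaming node 1 touches a single entry, yet the path of node 3 changes from `/docs/reports/q1.pdf` to `/archive/reports/q1.pdf`, which is the property the spec relies on.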
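
The object path layout `{tenant_id}/{share_id}/{node_id}` from the Multi-Tenant section can be captured in a small helper. This is a sketch only: the spec does not say what type tenant_id and share_id are, so strings are assumed here, and `ObjectKey` is a hypothetical name.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// ObjectKey builds "{tenant_id}/{share_id}/{node_id}".
// String IDs are an assumption; node_id is int64 per the Database Design table.
func ObjectKey(tenantID, shareID string, nodeID int64) (string, error) {
	if tenantID == "" || shareID == "" {
		return "", errors.New("tenant_id and share_id must be non-empty")
	}
	// A slash inside an ID would let one tenant's key collide with another's prefix.
	if strings.Contains(tenantID, "/") || strings.Contains(shareID, "/") {
		return "", errors.New("tenant_id and share_id must not contain '/'")
	}
	return fmt.Sprintf("%s/%s/%d", tenantID, shareID, nodeID), nil
}

func main() {
	key, err := ObjectKey("tenant-a", "share-1", 42)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(key) // tenant-a/share-1/42
}
```

Keying objects by node_id rather than by path means a directory rename never moves any stored object, which matches the node-tree design.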