Azure Files Backup — Requirements Spec
Captured: 2025-01-28 | Domain: Personal | Priority: HIGH
Purpose
POC to prove a point: the right architecture can back up billions of files with minimal database overhead.
This is NOT a Kaseya project — it's Johan demonstrating his design philosophy.
Target
- Azure Files API specifically
- NOT Azure Blob Storage
- NOT OneDrive/SharePoint
Scale Requirements
- Billions of files
- 64-bit node IDs required
- DB must fit in RAM for fast queries (~50GB target)
Database Design (~50 bytes/file)
| Field | Type | Size | Purpose |
|---|---|---|---|
| node_id | int64 | 8 bytes | Unique identifier (billions need 64-bit) |
| parent_id | int64 | 8 bytes | Tree structure link |
| name | varchar | ~20 bytes | Filename only, NOT full path |
| size | int64 | 8 bytes | File size in bytes |
| mtime | int64 | 8 bytes | Unix timestamp |
| hash | int64 | 8 bytes | xorhash (MSFT standard) |
Total: ~50 bytes/file → ~50GB for 1 billion files → fits in RAM
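The record layout above can be sketched as a Go struct; field and type names here are illustrative, not the actual schema:

```go
package main

import "fmt"

// Node sketches the ~50-byte per-file record: five fixed 8-byte fields
// plus a short variable-length name.
type Node struct {
	NodeID   int64  // 8 bytes: unique identifier (billions need 64-bit)
	ParentID int64  // 8 bytes: tree structure link
	Size     int64  // 8 bytes: file size in bytes
	MTime    int64  // 8 bytes: Unix timestamp
	Hash     uint64 // 8 bytes: 64-bit change-detection hash
	Name     string // ~20 bytes average: filename only, NOT the full path
}

// RecordBytes estimates stored bytes per row: 5 fixed fields plus the name.
func RecordBytes(n Node) int {
	return 5*8 + len(n.Name)
}

func main() {
	n := Node{NodeID: 1, ParentID: 0, Name: "report.pdf"}
	fmt.Printf("%d bytes for %q\n", RecordBytes(n), n.Name)
}
```

At a ~20-byte average name this lands on the ~50 bytes/file budget, hence ~50 GB per billion files.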
Key Constraints
- Node tree only — NO full path strings stored
- Paths reconstructed by walking parent_id to root
- Rename directory = update 1 row, not millions
- DB is index + analytics only
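Path reconstruction by walking parent_id can be sketched as follows, with a hypothetical in-memory map standing in for the real index (IDs and names are made up):

```go
package main

import (
	"fmt"
	"strings"
)

// node holds the two fields the walk needs: parent link and name.
type node struct {
	parent int64
	name   string
}

// nodes is a stand-in for the in-RAM index; parent 0 means root.
var nodes = map[int64]node{
	1: {0, "share"},
	2: {1, "docs"},
	3: {2, "report.pdf"},
}

// Path rebuilds the full path by walking parent_id links to the root,
// so no path string is ever stored per file.
func Path(id int64) string {
	var parts []string
	for id != 0 {
		n := nodes[id]
		parts = append([]string{n.name}, parts...)
		id = n.parent
	}
	return "/" + strings.Join(parts, "/")
}

func main() {
	fmt.Println(Path(3)) // /share/docs/report.pdf
}
```

Renaming a directory changes only that node's name field; every path beneath it comes out correct on the next walk, which is why a rename is one row update instead of millions.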
Object Storage Design
Everything that doesn't fit in 50 bytes goes here:
- Full metadata (ACLs, extended attributes, permissions)
- File content (chunked, deduplicated)
- Version history
- FlatBuffer serialized
Bundling
- TAR format (proven, standard)
- Only when it saves ops (not for just 2 files)
- Threshold TBD (likely <64KB or <1MB)
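The bundling rule can be sketched as a predicate. The threshold stays a parameter since the spec leaves it TBD; the 64 KiB and three-file values in the usage are placeholders only:

```go
package main

import "fmt"

// ShouldBundle sketches the TAR-bundling decision: bundle only when every
// file is under the size threshold AND the batch is large enough to
// actually save operations. Both limits are caller-supplied placeholders.
func ShouldBundle(sizes []int64, threshold int64, minFiles int) bool {
	if len(sizes) < minFiles {
		return false // not worth a bundle for just a couple of files
	}
	for _, s := range sizes {
		if s >= threshold {
			return false // large files are stored individually
		}
	}
	return true
}

func main() {
	small := []int64{4 << 10, 12 << 10, 30 << 10}
	fmt.Println(ShouldBundle(small, 64<<10, 3))
}
```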
Hash Strategy
- xorhash — MSFT standard, 64-bit, fast
- NOT sha256 (overkill for change detection)
- Used for: change detection, not cryptographic verification
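The exact Microsoft xorhash algorithm is not reproduced here, but a minimal sketch of the technique (rotate-and-XOR over 8-byte words) shows why this class of 64-bit hash is so much cheaper than SHA-256 for change detection. `XorHash64` is a hypothetical name and an illustrative variant, not the MSFT specification:

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math/bits"
)

// XorHash64 is an illustrative XOR/rotate hash: one rotate and one XOR
// per 8-byte word, no cryptographic rounds. NOT the actual MSFT xorhash.
func XorHash64(data []byte) uint64 {
	var h uint64
	for len(data) >= 8 {
		h = bits.RotateLeft64(h, 1) ^ binary.LittleEndian.Uint64(data)
		data = data[8:]
	}
	for _, b := range data { // trailing bytes
		h = bits.RotateLeft64(h, 1) ^ uint64(b)
	}
	return h
}

func main() {
	fmt.Printf("%016x\n", XorHash64([]byte("hello.txt")))
}
```

Deterministic and fast is all change detection needs; collision resistance against an adversary is explicitly out of scope.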
Architecture
~/dev/azure-backup/
├── core/ — library (tree, hash, storage interface, flatbuffer)
├── worker/ — K8s-scalable backup worker (100s of workers)
├── api/ — REST API for GUI
└── web/ — Go templates + htmx
Worker Design
- Stateless K8s pods
- Horizontal scaling (add pods, auto-claim work)
- Job types: scan, backup, restore, verify
- Queue: Postgres SKIP LOCKED (works up to ~1000 workers)
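The SKIP LOCKED queue pattern can be sketched as the claim query a worker would run; table and column names are hypothetical:

```go
package main

import (
	"fmt"
	"strings"
)

// claimSQL sketches atomic work claiming: each stateless worker takes one
// pending job. FOR UPDATE SKIP LOCKED makes concurrent claimants skip rows
// another transaction has already locked instead of blocking on them.
const claimSQL = `
UPDATE jobs
SET    status = 'running', claimed_at = now()
WHERE  id = (
    SELECT id FROM jobs
    WHERE  status = 'pending'
    ORDER  BY created_at
    LIMIT  1
    FOR UPDATE SKIP LOCKED
)
RETURNING id, job_type;`

// hasSkipLocked reports whether the claim query uses the non-blocking clause.
func hasSkipLocked() bool {
	return strings.Contains(claimSQL, "SKIP LOCKED")
}

func main() {
	fmt.Println(claimSQL)
}
```

Because no worker ever waits on another's locked row, adding pods simply adds claimants, which is what keeps this viable into the hundreds of workers.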
Multi-Tenant
- Isolated by tenant_id + share_id
- Each tenant+share gets separate node tree
- Object paths:
{tenant_id}/{share_id}/{node_id}
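The object-key layout above can be sketched as a small helper; the function name is hypothetical:

```go
package main

import "fmt"

// ObjectPath builds the per-tenant object key {tenant_id}/{share_id}/{node_id},
// so tenant isolation is enforced by key prefix.
func ObjectPath(tenantID, shareID string, nodeID int64) string {
	return fmt.Sprintf("%s/%s/%d", tenantID, shareID, nodeID)
}

func main() {
	fmt.Println(ObjectPath("t1", "s1", 42))
}
```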
GUI Requirements
- Web UI: Go + htmx/templ
- Multi-tenant view (not single-tenant)
Meta
- Language: Go throughout, core library included
- Repo: ~/dev/azure-backup
- License: Proprietary
- Type: Personal POC (prove a point)
Open Questions (resolved)
- ✅ 64-bit node IDs (billions of files)
- ✅ xorhash not sha256
- ✅ TAR bundling
- ✅ Multi-tenant GUI
- ✅ Proprietary license
Status
- ✅ Requirements captured
- ✅ Repo scaffolded
- ✅ ARCHITECTURE.md written
- ✅ FlatBuffer schema + Go code generated
- ✅ Azure SDK integration (real client implementation)
- ✅ Web UI (Go + htmx + Tailwind)
- ✅ 4,400+ lines of Go code
- 🔲 Azure free trial account (needs Johan)
- 🔲 Database integration (Postgres)
- 🔲 End-to-end test with real Azure Files