# Azure Files Backup — Requirements Spec
*Captured: 2025-01-28 | Domain: Personal | Priority: HIGH*
## Purpose
**POC to prove a point:** the right architecture can back up billions of files with minimal database overhead.
This is NOT a Kaseya project — it's Johan demonstrating his design philosophy.
## Target
- **Azure Files API** specifically
- NOT Azure Blob Storage
- NOT OneDrive/SharePoint
## Scale Requirements
- **Billions of files**
- 64-bit node IDs required
- DB must fit in RAM for fast queries (~50GB target)
## Database Design (~50 bytes/file)
| Field | Type | Size | Purpose |
|-------|------|------|---------|
| node_id | int64 | 8 bytes | Unique identifier (billions need 64-bit) |
| parent_id | int64 | 8 bytes | Tree structure link |
| name | varchar | ~20 bytes | Filename only, NOT full path |
| size | int64 | 8 bytes | File size in bytes |
| mtime | int64 | 8 bytes | Unix timestamp |
| hash | int64 | 8 bytes | xorhash (MSFT standard) |
**Total: ~50 bytes/file → ~50GB for 1 billion files → fits in RAM**
### Key Constraints
- **Node tree only** — NO full path strings stored
- Paths reconstructed by walking parent_id to root
- Rename directory = update 1 row, not millions
- DB is index + analytics only
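The parent-walk constraint can be sketched in Go; the `Node` type and the map-based in-RAM index below are illustrative stand-ins, not the actual schema types:

```go
package main

import (
	"fmt"
	"strings"
)

// Node mirrors the ~50-byte index record: IDs and a basename, no full path.
type Node struct {
	ID     int64
	Parent int64 // 0 marks the share root
	Name   string
}

// Path rebuilds a full path by walking parent links up to the root.
// The index map stands in for the in-RAM database.
func Path(index map[int64]Node, id int64) string {
	var parts []string
	for id != 0 {
		n := index[id]
		parts = append(parts, n.Name)
		id = n.Parent
	}
	// parts was collected leaf-to-root; reverse it.
	for i, j := 0, len(parts)-1; i < j; i, j = i+1, j-1 {
		parts[i], parts[j] = parts[j], parts[i]
	}
	return strings.Join(parts, "/")
}

func main() {
	index := map[int64]Node{
		1: {ID: 1, Parent: 0, Name: "docs"},
		2: {ID: 2, Parent: 1, Name: "a.txt"},
	}
	fmt.Println(Path(index, 2)) // docs/a.txt
}
```

Renaming `docs` means updating one row's `Name`; every descendant's path changes implicitly, with no mass rewrite.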
## Object Storage Design
Everything that doesn't fit in 50 bytes goes here:
- Full metadata (ACLs, extended attributes, permissions)
- File content (chunked, deduplicated)
- Version history
- FlatBuffer serialized
### Bundling
- **TAR format** (proven, standard)
- Only when it saves ops (not for just 2 files)
- Threshold TBD (likely <64KB or <1MB)
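A minimal sketch of the bundling rule using the standard `archive/tar` package; the 1 MiB `bundleThreshold` is a placeholder for the TBD cutoff, and `shouldBundle`/`bundle` are hypothetical helpers:

```go
package main

import (
	"archive/tar"
	"bytes"
	"fmt"
)

// bundleThreshold is a placeholder; the spec leaves the real cutoff TBD
// (likely 64 KiB or 1 MiB).
const bundleThreshold = 1 << 20

// shouldBundle returns true only when bundling actually saves object-store
// ops: more than two files, all below the size threshold.
func shouldBundle(sizes []int64) bool {
	if len(sizes) <= 2 {
		return false
	}
	for _, s := range sizes {
		if s >= bundleThreshold {
			return false
		}
	}
	return true
}

// bundle packs the given files into one in-memory TAR archive.
func bundle(files map[string][]byte) ([]byte, error) {
	var buf bytes.Buffer
	tw := tar.NewWriter(&buf)
	for name, data := range files {
		hdr := &tar.Header{Name: name, Mode: 0o644, Size: int64(len(data))}
		if err := tw.WriteHeader(hdr); err != nil {
			return nil, err
		}
		if _, err := tw.Write(data); err != nil {
			return nil, err
		}
	}
	if err := tw.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

func main() {
	data, err := bundle(map[string][]byte{"a.txt": []byte("hi"), "b.txt": []byte("yo")})
	fmt.Println(len(data) > 0, err)
}
```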
## Hash Strategy
- **xorhash:** MSFT standard, 64-bit, fast
- NOT SHA-256 (overkill for mere change detection)
- Purpose: change detection only, not cryptographic verification
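For illustration only, a toy 64-bit XOR/rotate hash in Go. This is NOT Microsoft's actual xorhash algorithm (its exact definition is not given in this spec); it only shows the shape of a fast, non-cryptographic hash suited to change detection:

```go
package main

import (
	"fmt"
	"math/bits"
)

// toyXorHash64 is an illustrative 64-bit XOR/rotate hash, not the real
// MSFT xorhash. Cheap per byte, fine for "did this file change?" checks,
// useless for cryptographic verification.
func toyXorHash64(data []byte) uint64 {
	var h uint64
	for _, b := range data {
		h = bits.RotateLeft64(h, 7) ^ uint64(b)
	}
	return h
}

func main() {
	fmt.Println(toyXorHash64([]byte("hello")) == toyXorHash64([]byte("hello"))) // true
}
```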
## Architecture
```
~/dev/azure-backup/
├── core/ — library (tree, hash, storage interface, flatbuffer)
├── worker/ — K8s-scalable backup worker (100s of workers)
├── api/ — REST API for GUI
└── web/ — Go templates + htmx
```
### Worker Design
- Stateless K8s pods
- Horizontal scaling (add pods, auto-claim work)
- Job types: scan, backup, restore, verify
- Queue: Postgres SKIP LOCKED (works up to ~1000 workers)
### Multi-Tenant
- Isolated by tenant_id + share_id
- Each tenant+share gets separate node tree
- Object paths: `{tenant_id}/{share_id}/{node_id}`
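The key layout above is trivial to build; `ObjectKey` is a hypothetical helper name:

```go
package main

import "fmt"

// ObjectKey builds the object-storage key following the
// {tenant_id}/{share_id}/{node_id} layout, isolating tenants
// at the path level.
func ObjectKey(tenantID, shareID string, nodeID int64) string {
	return fmt.Sprintf("%s/%s/%d", tenantID, shareID, nodeID)
}

func main() {
	fmt.Println(ObjectKey("t1", "s1", 42)) // t1/s1/42
}
```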
## GUI Requirements
- **Web UI:** Go + htmx/templ
- **Multi-tenant view** (not single-tenant)
## Meta
- **Language:** Go throughout, including the core library
- **Repo:** `~/dev/azure-backup`
- **License:** Proprietary
- **Type:** Personal POC (prove a point)
## Open Questions (resolved)
- 64-bit node IDs (billions of files)
- xorhash not sha256
- TAR bundling
- Multi-tenant GUI
- Proprietary license
## Status
- ✅ Requirements captured
- ✅ Repo scaffolded
- ✅ ARCHITECTURE.md written
- ✅ FlatBuffer schema + Go code generated
- ✅ Azure SDK integration (real client implementation)
- ✅ Web UI (Go + htmx + Tailwind)
- ✅ 4,400+ lines of Go code
- 🔲 Azure free trial account (needs Johan)
- 🔲 Database integration (Postgres)
- 🔲 End-to-end test with real Azure Files