# Azure Files Backup: Requirements Spec

*Captured: 2025-01-28 | Domain: Personal | Priority: HIGH*

## Purpose
**POC to prove a point:** the right architecture can back up billions of files with minimal database overhead.

This is NOT a Kaseya project; it is Johan demonstrating his design philosophy.

## Target
- **Azure Files API** specifically
- NOT Azure Blob Storage
- NOT OneDrive/SharePoint

## Scale Requirements
- **Billions of files**
- 64-bit node IDs required
- DB must fit in RAM for fast queries (~50 GB target)

## Database Design (~50 bytes/file)

| Field | Type | Size | Purpose |
|-------|------|------|---------|
| node_id | int64 | 8 bytes | Unique identifier (billions need 64-bit) |
| parent_id | int64 | 8 bytes | Tree structure link |
| name | varchar | ~20 bytes | Filename only, NOT full path |
| size | int64 | 8 bytes | File size in bytes |
| mtime | int64 | 8 bytes | Unix timestamp |
| hash | int64 | 8 bytes | xorhash (MSFT standard) |

**Total: ~50 bytes/file → ~50 GB for 1 billion files → fits in RAM**

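The sizing can be checked with a few lines of arithmetic. The sketch below is illustrative only: the `Node` struct, `RowBytes`, and `TotalGB` are hypothetical names, and the 20-byte average filename is the estimate from the table. Note that the listed columns add up to roughly 60 bytes with a 20-byte name, so the ~50 bytes/file target presumably assumes shorter names or a tighter encoding.

```go
package main

import "fmt"

// Node mirrors the six columns in the table. The struct itself is
// hypothetical; only the field list and sizes come from the spec.
type Node struct {
	NodeID   int64  // 8 bytes
	ParentID int64  // 8 bytes
	Name     string // ~20 bytes of filename data on average (estimate)
	Size     int64  // 8 bytes
	MTime    int64  // 8 bytes, Unix timestamp
	Hash     int64  // 8 bytes
}

const (
	fixedBytes   = 5 * 8 // five int64 columns
	avgNameBytes = 20    // estimated average filename length
)

// RowBytes returns the estimated payload per file for a given average name length.
func RowBytes(avgName int) int { return fixedBytes + avgName }

// TotalGB returns the estimated index size in decimal gigabytes.
func TotalGB(files int64, bytesPerFile int) float64 {
	return float64(files) * float64(bytesPerFile) / 1e9
}

func main() {
	fmt.Println(RowBytes(avgNameBytes))                         // 60
	fmt.Println(TotalGB(1_000_000_000, 50))                     // 50
	fmt.Println(TotalGB(1_000_000_000, RowBytes(avgNameBytes))) // 60
}
```

Either figure is per billion files and excludes any database row or index overhead.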
### Key Constraints
- **Node tree only:** NO full path strings stored
- Paths reconstructed by walking parent_id to root
- Renaming a directory updates 1 row, not millions
- DB holds the index and analytics only

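A minimal sketch of the parent_id walk, using an in-memory map in place of the database. The `node` type, `PathOf`, and the convention that parent_id 0 marks the share root are assumptions for illustration, not part of the spec.

```go
package main

import (
	"fmt"
	"strings"
)

// node holds only what the tree walk needs: the parent link and the filename.
type node struct {
	parentID int64
	name     string
}

// rootID is an assumption: parent_id 0 marks the share root.
const rootID int64 = 0

// PathOf rebuilds the full path for id by walking parent links up to the root.
func PathOf(nodes map[int64]node, id int64) (string, error) {
	var parts []string
	for id != rootID {
		n, ok := nodes[id]
		if !ok {
			return "", fmt.Errorf("node %d not found", id)
		}
		parts = append(parts, n.name)
		// A path longer than the node count means the links form a cycle.
		if len(parts) > len(nodes) {
			return "", fmt.Errorf("cycle detected at node %d", id)
		}
		id = n.parentID
	}
	// Names were collected leaf-first; reverse them.
	for i, j := 0, len(parts)-1; i < j; i, j = i+1, j-1 {
		parts[i], parts[j] = parts[j], parts[i]
	}
	return "/" + strings.Join(parts, "/"), nil
}

func main() {
	nodes := map[int64]node{
		1: {parentID: 0, name: "projects"},
		2: {parentID: 1, name: "2025"},
		3: {parentID: 2, name: "report.docx"},
	}
	p, _ := PathOf(nodes, 3)
	fmt.Println(p) // /projects/2025/report.docx

	// Renaming a directory touches one row; descendants pick up the new path.
	nodes[1] = node{parentID: 0, name: "archive"}
	p, _ = PathOf(nodes, 3)
	fmt.Println(p) // /archive/2025/report.docx
}
```

The rename at the end is the point of the design: one row changes and every descendant path follows.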
## Object Storage Design
Everything that doesn't fit in 50 bytes goes here:
- Full metadata (ACLs, extended attributes, permissions)
- File content (chunked, deduplicated)
- Version history
- Serialized with FlatBuffers

### Bundling
- **TAR format** (proven, standard)
- Bundle only when it saves storage operations (not for just 2 files)
- Size threshold TBD (likely <64 KB or <1 MB)

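One way the bundling rule could look, sketched with the standard library's `archive/tar`. The 64 KB threshold is one of the two candidates above, and the minimum of 3 files is an assumption read from "not for just 2 files"; `smallFile`, `Bundle`, and both constants are hypothetical names.

```go
package main

import (
	"archive/tar"
	"bytes"
	"fmt"
)

const (
	bundleThreshold = 64 * 1024 // assumption: the lower candidate threshold
	minBundleFiles  = 3         // assumption: bundling 2 files saves too little
)

// smallFile is a hypothetical in-memory file used for illustration.
type smallFile struct {
	Name string
	Data []byte
}

// Bundle packs files below the threshold into one TAR archive and returns the
// archive plus the files that should still be uploaded individually.
func Bundle(files []smallFile) (archive []byte, loose []smallFile, err error) {
	var small []smallFile
	for _, f := range files {
		if len(f.Data) < bundleThreshold {
			small = append(small, f)
		} else {
			loose = append(loose, f)
		}
	}
	if len(small) < minBundleFiles {
		return nil, files, nil // not worth bundling; upload everything as-is
	}
	var buf bytes.Buffer
	tw := tar.NewWriter(&buf)
	for _, f := range small {
		hdr := &tar.Header{
			Typeflag: tar.TypeReg,
			Name:     f.Name,
			Mode:     0o600,
			Size:     int64(len(f.Data)),
		}
		if err := tw.WriteHeader(hdr); err != nil {
			return nil, nil, err
		}
		if _, err := tw.Write(f.Data); err != nil {
			return nil, nil, err
		}
	}
	if err := tw.Close(); err != nil {
		return nil, nil, err
	}
	return buf.Bytes(), loose, nil
}

func main() {
	files := []smallFile{
		{"a.txt", []byte("alpha")},
		{"b.txt", []byte("beta")},
		{"c.txt", []byte("gamma")},
		{"big.bin", make([]byte, 2*bundleThreshold)},
	}
	archive, loose, err := Bundle(files)
	if err != nil {
		panic(err)
	}
	fmt.Println(len(archive) > 0, len(loose)) // true 1
}
```

Three small files become one object write instead of three; the large file is left alone.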
## Hash Strategy
- **xorhash:** MSFT standard, 64-bit, fast
- NOT SHA-256 (overkill for change detection)
- Used for change detection, not cryptographic verification

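A sketch of how a 64-bit hash might slot into change detection. It uses FNV-1a from Go's standard library purely as a stand-in, NOT the xorhash named above. The `fileState` type, `Changed`, and the shortcut of trusting an unchanged size plus mtime are all assumptions.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// fileState is the subset of the index row used for change detection.
type fileState struct {
	Size  int64
	MTime int64
	Hash  uint64
}

// contentHash is a stand-in: FNV-1a 64-bit, NOT the xorhash from the spec.
func contentHash(data []byte) uint64 {
	h := fnv.New64a()
	h.Write(data) // hash.Hash writes never return an error
	return h.Sum64()
}

// Changed runs the cheap checks first and reads content only when needed.
// read is called at most once, and only if size matches but mtime differs.
func Changed(prev fileState, size, mtime int64, read func() []byte) bool {
	if prev.Size != size {
		return true // different size: changed, no read required
	}
	if prev.MTime == mtime {
		return false // assumption: same size and mtime means unchanged
	}
	return contentHash(read()) != prev.Hash
}

func main() {
	old := []byte("hello")
	prev := fileState{Size: 5, MTime: 100, Hash: contentHash(old)}

	fmt.Println(Changed(prev, 5, 100, func() []byte { return old }))             // false
	fmt.Println(Changed(prev, 5, 200, func() []byte { return old }))             // false
	fmt.Println(Changed(prev, 5, 200, func() []byte { return []byte("world") })) // true
	fmt.Println(Changed(prev, 6, 200, nil))                                      // true
}
```

A collision here only means a changed file is missed until its next modification, which is why a fast non-cryptographic hash is acceptable for this purpose.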
## Architecture

```
~/dev/azure-backup/
├── core/     # library (tree, hash, storage interface, flatbuffer)
├── worker/   # K8s-scalable backup worker (100s of workers)
├── api/      # REST API for GUI
└── web/      # Go templates + htmx
```

### Worker Design
- Stateless K8s pods
- Horizontal scaling (add pods, auto-claim work)
- Job types: scan, backup, restore, verify
- Queue: Postgres `FOR UPDATE SKIP LOCKED` (works up to ~1000 workers)

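A sketch of how a worker might claim a job with `SKIP LOCKED`, using only `database/sql`. The `jobs` table, its columns, and the `ClaimJob` function are hypothetical; the spec does not define a queue schema yet.

```go
package main

import (
	"context"
	"database/sql"
	"errors"
	"fmt"
)

// claimSQL atomically picks one pending job that no other worker has locked
// and marks it running. Table and column names are hypothetical.
const claimSQL = `
UPDATE jobs
SET status = 'running', claimed_by = $1, claimed_at = now()
WHERE id = (
    SELECT id FROM jobs
    WHERE status = 'pending'
    ORDER BY id
    LIMIT 1
    FOR UPDATE SKIP LOCKED
)
RETURNING id, job_type`

// ClaimJob returns ok=false when the queue has no claimable job.
func ClaimJob(ctx context.Context, db *sql.DB, workerID string) (id int64, jobType string, ok bool, err error) {
	err = db.QueryRowContext(ctx, claimSQL, workerID).Scan(&id, &jobType)
	if errors.Is(err, sql.ErrNoRows) {
		return 0, "", false, nil
	}
	if err != nil {
		return 0, "", false, err
	}
	return id, jobType, true, nil
}

func main() {
	// Calling ClaimJob needs a live Postgres connection and a registered
	// driver, so this example only prints the statement.
	fmt.Println(claimSQL)
}
```

Rows locked by another worker are skipped rather than waited on, so idle pods never block each other while claiming work.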
### Multi-Tenant
- Isolated by tenant_id + share_id
- Each tenant+share gets separate node tree
- Object paths: `{tenant_id}/{share_id}/{node_id}`

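The object path layout is simple enough to pin down in a helper. This sketch assumes tenant and share IDs are strings, which the spec does not state; `ObjectPath` is a hypothetical name.

```go
package main

import "fmt"

// ObjectPath builds the object storage key {tenant_id}/{share_id}/{node_id}.
// String tenant and share IDs are an assumption; node IDs are 64-bit per the spec.
func ObjectPath(tenantID, shareID string, nodeID int64) string {
	return fmt.Sprintf("%s/%s/%d", tenantID, shareID, nodeID)
}

func main() {
	fmt.Println(ObjectPath("acme", "share-01", 42)) // acme/share-01/42
}
```

Because the tenant and share come first, everything belonging to one tenant+share sits under a single key prefix.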
## GUI Requirements
- **Web UI:** Go + htmx/templ
- **Multi-tenant view** (not single-tenant)

## Meta
- **Language:** Go throughout, including the core library
- **Repo:** `~/dev/azure-backup`
- **License:** Proprietary
- **Type:** Personal POC (prove a point)

## Open Questions (resolved)
- ✅ 64-bit node IDs (billions of files)
- ✅ xorhash not sha256
- ✅ TAR bundling
- ✅ Multi-tenant GUI
- ✅ Proprietary license

## Status
- ✅ Requirements captured
- ✅ Repo scaffolded
- ✅ ARCHITECTURE.md written
- ✅ FlatBuffer schema + Go code generated
- ✅ Azure SDK integration (real client implementation)
- ✅ Web UI (Go + htmx + Tailwind)
- ✅ 4,400+ lines of Go code
- 🔲 Azure free trial account (needs Johan)
- 🔲 Database integration (Postgres)
- 🔲 End-to-end test with real Azure Files