# Clavitor — Global NOC Deployment Plan
**Owner:** James ⚡
**Target:** Live Friday March 6, 2026
**Budget:** ~$64–67/mo (20 AWS regions + Hostkey HQ)
**HQ:** Hans NOC Node — Hostkey Zürich (185.218.204.47, noc.clavitor.com)
---
## Overview
Deploy clavitor across 20 AWS regions (t4g.nano / ARM Graviton, ~$3/mo each), managed by an OpenClaw NOC agent running on the Hostkey HQ node (Hans, 185.218.204.47). Each AWS node runs NixOS + the clavitor Go binary. All management traffic flows over WireGuard. Monitoring via Uptime Kuma push heartbeats from each node.
**Platform decision:** AWS EC2 t4g.nano (ARM/Graviton2). One binary per region. No multi-tenant clustering — each node is fully independent.
**Deployment method:** TBD — likely Terraform or manual AWS Console for initial rollout. Not yet decided; tooling built to accommodate either approach.
---
## Region Selection (21 nodes total: 20 AWS + 1 Hostkey HQ)
| # | Name | City | AWS Region | Provider |
|---|------|------|------------|----------|
| HQ | zurich | Zürich, CH | — | Hostkey (Hans, 185.218.204.47) |
| 1 | virginia | N. Virginia, US | us-east-1 | AWS t4g.nano |
| 2 | ncalifornia | N. California, US | us-west-1 | AWS t4g.nano |
| 3 | montreal | Montreal, CA | ca-central-1 | AWS t4g.nano |
| 4 | mexicocity | Mexico City, MX | mx-central-1 | AWS t4g.nano |
| 5 | saopaulo | São Paulo, BR | sa-east-1 | AWS t4g.nano |
| 6 | london | London, UK | eu-west-2 | AWS t4g.nano |
| 7 | paris | Paris, FR | eu-west-3 | AWS t4g.nano |
| 8 | frankfurt | Frankfurt, DE | eu-central-1 | AWS t4g.nano |
| 9 | spain | Spain, ES | eu-south-2 | AWS t4g.nano |
| 10 | stockholm | Stockholm, SE | eu-north-1 | AWS t4g.nano |
| 11 | uae | UAE | me-central-1 | AWS t4g.nano |
| 12 | telaviv | Tel Aviv, IL | il-central-1 | AWS t4g.nano |
| 13 | capetown | Cape Town, ZA | af-south-1 | AWS t4g.nano |
| 14 | mumbai | Mumbai, IN | ap-south-1 | AWS t4g.nano |
| 15 | singapore | Singapore, SG | ap-southeast-1 | AWS t4g.nano |
| 16 | jakarta | Jakarta, ID | ap-southeast-3 | AWS t4g.nano |
| 17 | malaysia | Kuala Lumpur, MY | ap-southeast-5 | AWS t4g.nano |
| 18 | sydney | Sydney, AU | ap-southeast-2 | AWS t4g.nano |
| 19 | seoul | Seoul, KR | ap-northeast-2 | AWS t4g.nano |
| 20 | hongkong | Hong Kong | ap-east-1 | AWS t4g.nano |
*Johan-approved on 2026-03-02.*
---
## Milestones at a Glance
| # | Milestone | Owner | Deadline |
|---|-----------|-------|----------|
| M1 | Hans HQ ready (WireGuard hub + OC NOC + Kuma) | James | Mon Mar 2, EOD |
| M2 | NixOS config + deploy tooling in repo | James | Tue Mar 3, EOD |
| M3 | Pilot: 3 nodes live (Virginia + 2 others) | James | Wed Mar 4, noon |
| M4 | Go/No-Go review | Johan | Wed Mar 4, EOD |
| M5 | Full 20-region AWS fleet live | James | Thu Mar 5, EOD |
| M6 | DNS, TLS, health checks verified | James | Thu Mar 5, EOD |
| M7 | Go-live: clavitor.com routing to fleet | Johan + James | Fri Mar 6, noon |
---
## Day-by-Day Plan
---
### Sunday Mar 1 — Planning & Prerequisites
- [x] Read INFRASTRUCTURE.md, write this plan
- [ ] **Johan:** Set up AWS account + credentials (IAM user or root — access keys needed)
- [ ] **Johan:** Decide deployment method: Terraform vs manual AWS Console
- [ ] **Johan:** Approve plan → James starts Monday
---
### Monday Mar 2 — Hans HQ Setup (M1)
**M1.1 — WireGuard Hub (on Hans, 185.218.204.47)**
- Generate Hans hub keypair
- Configure wg0: `10.84.0.1/24`, UDP 51820
- UFW: allow 51820 inbound
- Save `hans.pub` to repo
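The hub side might look like the sketch below (all keys and the spoke entry are placeholders, not real values; actual keys come from `wg genkey` / `wg pubkey`):

```ini
# /etc/wireguard/wg0.conf on Hans (hub) -- illustrative only
[Interface]
Address = 10.84.0.1/24
ListenPort = 51820
PrivateKey = <hans-private-key>

# One [Peer] block per AWS node; pubkeys are collected in infra/wireguard/peers.conf
[Peer]
# virginia (example spoke; WG IP assignment is illustrative)
PublicKey = <virginia-public-key>
AllowedIPs = 10.84.0.11/32
```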
**M1.2 — OpenClaw NOC Agent**
- ✅ OpenClaw v2026.3.1 installed on Hans
- Model: Fireworks MiniMax M2.5 (no Anthropic tokens on Hans)
- Telegram/Discord routing configured for deploy commands
**M1.3 — Uptime Kuma fleet monitors**
- New ntfy topic: `clavitor-alerts`
- 20 push monitors in Kuma, one per AWS region
- SEV2: 2 missed pushes; SEV1: 5+ min down
- All monitors pending (nodes not yet live)
**M1.4 — SOC domain**
- `soc.clavitor.com` → 185.218.204.47 (Cloudflare DNS-only)
- Kuma accessible at soc.clavitor.com
**✅ M1 Done:** WireGuard hub up on Hans, NOC agent running, Kuma fleet monitors configured, SOC domain live.
---
### Tuesday Mar 3 — NixOS Config & Tooling (M2)
**M2.1 — Repo structure**
```
clavitor/infra/
  nixos/
    base.nix          # shared: WireGuard spoke, SSH, clavitor service, firewall
    nodes/
      virginia.nix    # per-node vars: wg_ip, hostname, kuma_token, aws_region
      frankfurt.nix
      ...             # (20 total)
  scripts/
    keygen.sh         # generate WireGuard keypair for a new node
    provision.sh      # provision AWS EC2 + full NixOS config push
    deploy.sh         # push binary + nixos-rebuild [node|all], rolling
    healthcheck.sh    # verify: WG ping, HTTPS 200, Kuma heartbeat received
  wireguard/
    hans.pub          # hub public key (Hans HQ)
    peers.conf        # all node pubkeys + WG IPs (no private keys ever)
```
> Provision approach (Terraform vs AWS Console) is TBD. Scripts above accommodate either — provision.sh takes an already-running EC2 IP and configures it from there.
**M2.2 — base.nix**
- WireGuard spoke (parameterized), pointing hub at 185.218.204.47
- SSH on WireGuard interface only — no public port 22
- clavitor systemd service
- Firewall: public 80+443 only
- Nix store: 2 generations max, weekly GC
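A sketch of what `base.nix` could cover, assuming a `node` attrset is passed in for the per-node vars; the option values are illustrative, not the repo's actual file:

```nix
{ config, pkgs, node, ... }:
{
  # WireGuard spoke; hub is Hans at 185.218.204.47
  networking.wireguard.interfaces.wg0 = {
    ips = [ node.wgIp ];
    privateKeyFile = "/etc/wireguard/private.key";
    peers = [{
      publicKey = builtins.readFile ../wireguard/hans.pub;
      endpoint = "185.218.204.47:51820";
      allowedIPs = [ "10.84.0.0/24" ];
      persistentKeepalive = 25;  # keep the NAT mapping alive from EC2
    }];
  };

  # SSH reachable only over the tunnel: no public port 22
  services.openssh.enable = true;
  services.openssh.openFirewall = false;
  networking.firewall.interfaces.wg0.allowedTCPPorts = [ 22 ];

  # clavitor systemd service
  systemd.services.clavitor = {
    wantedBy = [ "multi-user.target" ];
    serviceConfig = {
      ExecStart = "/opt/clavitor/clavitor";
      Restart = "always";
      EnvironmentFile = "/opt/clavitor/.env";  # provides KUMA_PUSH_URL etc.
    };
  };

  # Public firewall: HTTP/HTTPS only
  networking.firewall.allowedTCPPorts = [ 80 443 ];

  # Disk hygiene on a small instance: cap boot generations, GC weekly
  boot.loader.systemd-boot.configurationLimit = 2;
  nix.gc = { automatic = true; dates = "weekly"; };
}
```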
**M2.3 — 20 AWS node var files**
One `.nix` per node: wg_ip, hostname, aws_region, kuma_push_token, subdomain
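A per-node var file could be a plain attrset along these lines (all values illustrative; real WG IPs come from keygen.sh and tokens from the Kuma push monitors):

```nix
# infra/nixos/nodes/virginia.nix (illustrative values only)
{
  wgIp = "10.84.0.11/24";   # spoke address inside the hub's 10.84.0.0/24
  hostname = "virginia";
  awsRegion = "us-east-1";
  kumaPushToken = "CHANGE_ME";
  subdomain = "virginia.clavitor.com";
}
```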
**M2.4 — clavitor binary: telemetry push**
New background goroutine (30s interval):
- Reads: `runtime.MemStats`, `/proc/loadavg`, disk, DB size + integrity check
- POSTs JSON to `KUMA_PUSH_URL` env var
- Fields: ram_mb, disk_pct, cpu_pct, db_size_mb, db_integrity, active_sessions, req_1h, err_1h, cert_days_remaining, nix_gen, uptime_s
- Build: `CGO_ENABLED=1`, cross-compiled to `linux/arm64` (t4g.nano is Graviton2/ARM)
**M2.5 — provision.sh**
```
provision.sh <ip> <node-name>
```
1. SSH to fresh EC2 instance (Amazon Linux or NixOS AMI)
2. Run `nixos-infect` if needed → wait for reboot (~3 min)
3. Push base.nix + node vars + WireGuard private key
4. `nixos-rebuild switch`
5. Push clavitor binary (`linux/arm64`) + .env
6. Run healthcheck.sh → confirm WG up, HTTPS 200, Kuma green
**M2.6 — deploy.sh**
- Rolling: deploy one node → verify health → next
- Abort on first failure
**✅ M2 Done:** Any node provisionable in <20 min. Fleet-wide binary deploy in <10 min.
---
### Wednesday Mar 4 — Pilot: 3 Nodes (M3 + M4)
**M3.1 — Virginia as first AWS node (Wed AM)**
- Launch t4g.nano in us-east-1
- `provision.sh` → DNS → healthcheck → Kuma green
- `https://virginia.clavitor.com`
**M3.2 — Frankfurt (Wed AM)**
- t4g.nano in eu-central-1 (~100ms from Hans HQ)
- `provision.sh` → DNS → healthcheck → Kuma green
**M3.3 — Singapore (Wed AM)**
- t4g.nano in ap-southeast-1
- `provision.sh` → DNS → healthcheck → Kuma green
**M3.4 — Validation (Wed noon)**
- `deploy.sh all` rolling update across 3 nodes
- Kill clavitor on Frankfurt → Kuma alert fires to ntfy in <2 min → restart → green
- `nmap` each node: confirm port 22 not public
- TLS cert valid on all 3
**M4 — Go/No-Go (Wed EOD)**
- Johan reviews 3 pilot nodes
- Blockers fixed same day
- Green light full fleet Thursday
---
### Thursday Mar 5 — Full Fleet (M5 + M6)
**M5 — Provision remaining 17 AWS nodes**
| Batch | Regions | Time |
|-------|---------|------|
| 1 | N.California, Montreal, Mexico City, São Paulo | Thu 9 AM |
| 2 | London, Paris, Spain, Stockholm | Thu 11 AM |
| 3 | UAE, Tel Aviv, Cape Town, Mumbai | Thu 1 PM |
| 4 | Jakarta, Malaysia, Sydney, Seoul, Hong Kong | Thu 3 PM |
Each node: launch t4g.nano in region → `provision.sh` → DNS → healthcheck → Kuma green
**M6 — Fleet verification (Thu EOD)**
- Kuma: all 20 monitors green
- `deploy.sh all` rolling deploy across full fleet
- Latency check: all nodes reachable from Hans HQ WireGuard
- No public SSH on any node (nmap spot check)
- TLS valid on all 20
**✅ M5+M6 Done:** 20 AWS nodes live, all green, WireGuard mesh stable.
---
### Friday Mar 6 — Go Live (M7)
**M7.1 — Final review (Fri AM)**
- Johan spot-checks 3-4 random nodes
- Kuma dashboard review
- Any last fixes
**M7.2 — clavitor.com routing (Fri noon)**
- Primary: `clavitor.com` → Virginia (largest US East market)
- Optional: Cloudflare Load Balancer for GeoDNS ($5/mo; Johan decides)
**M7.3 — Go-live**
- Dashboard briefing: fleet live
- `https://soc.clavitor.com` status page
**🚀 LIVE: Friday March 6, 2026 noon ET**
---
## Prerequisites from Johan
| Item | Needed By | Status |
|------|-----------|--------|
| AWS account + credentials (access keys) | Mon Mar 2 AM | 🔴 Outstanding; blocks everything |
| AWS deployment method decision (Terraform vs manual) | Tue Mar 3 AM | 🟡 TBD |
| Plan approval | Sun Mar 1 | Approved |
> **No longer needed:** ~~Vultr API key~~ — Vultr removed from architecture entirely.
Everything else James handles autonomously once AWS credentials are available.
---
## Risk Register
| Risk | Mitigation |
|------|-----------|
| AWS account setup delay | Critical path: affects M3 and all downstream milestones |
| nixos-infect fails on AWS AMI | Fallback: use official NixOS AMI for arm64 directly |
| Let's Encrypt rate limit | 20 certs/week well under 50 limit; stagger if needed |
| clavitor CGO/SQLite on NixOS arm64 | Cross-compile with zig; fallback: modernc.org/sqlite (pure Go) |
| WireGuard NAT on EC2 | persistentKeepalive=25; AWS EC2 bare networking, no double-NAT |
| t4g.nano RAM (0.5GB) | clavitor binary is ~15MB + SQLite; should be fine at low volume |
---
## Cost Summary
| Component | Count | Unit | Monthly |
|-----------|-------|------|---------|
| Hans HQ (Hostkey, Zürich) | 1 | $3.90/mo | ~$4 |
| AWS EC2 t4g.nano | 20 | ~$3/mo | ~$60 |
| **Total** | **21** | | **~$64–67/mo** |
Budget ceiling: $100/mo → **~$33–36/mo reserve** for upgrades.
---
## Post-Launch (not blocking Friday)
- GeoDNS / Cloudflare Load Balancer for latency-based routing
- Automated weekly NixOS updates via NOC cron
- China mainland Phase 2 (requires ICP license + separate AWS China account)
- Terraform for reproducible fleet management (once initial rollout proven)
- clavitor-web multi-tenant backend with node assignment
---
*Written: 2026-03-01 · Updated: 2026-03-03 · James ⚡*