# Disaster Recovery Plan
*Last updated: 2026-02-04*
*Owner: James ⚡*
---
## Infrastructure Overview
| Component | Host | Purpose |
|---|---|---|
| **forge** | 192.168.1.16 | Primary server — OpenClaw gateway, all services |
| **Zurich VPS** | 82.22.36.202 | Git repos, Uptime Kuma, security scanning |
| **192.168.1.253** | LAN | Docker services (Immich, ClickHouse, Jellyfin, Signal, qBittorrent) |
| **192.168.1.252** | LAN | Home Assistant OS |
| **Caddy** | 192.168.0.2 | Reverse proxy (james.jongsma.me, inou.com) |
---
## Backup Strategy
### Tier 1: Git-backed (automated)
All source code is pushed to `git@zurich.inou.com:<repo>.git`. An hourly audit (`scripts/git-audit.sh`) checks the repos for anomalies.
**Repos (as of 2026-02-04):**
- inou, azure-backup, james-dashboard, mail-bridge, mail-agent
- inou-mobile, clawdnode-android, clawdnode-debug-server, clawdnode-gateway
- message-bridge, messaging-center, docman, docsys, docs
- moltmobile-android, moltmobile-gateway, screenshot-server
- docproc, clawd (workspace)
**Recovery:** `git clone git@zurich.inou.com:<repo>.git`
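
Recovery by `git clone` only restores what was actually pushed. The contents of `scripts/git-audit.sh` are not reproduced in this plan, so the following is only a sketch of one check such an audit could include: flagging working copies with uncommitted or unpushed work. The helper names `audit_repo` and `audit_all` and the `~/dev` layout are assumptions.

```shell
#!/usr/bin/env bash
# Sketch only: audit_repo and audit_all are hypothetical helpers,
# not the actual contents of scripts/git-audit.sh.
audit_repo() {
  local dir="$1"
  if [ ! -d "$dir/.git" ]; then
    echo "missing"
    return 0
  fi
  # Uncommitted changes, including untracked files.
  if [ -n "$(git -C "$dir" status --porcelain 2>/dev/null)" ]; then
    echo "dirty"
    return 0
  fi
  # Commits on local branches that no remote-tracking branch contains.
  if [ -n "$(git -C "$dir" log --branches --not --remotes --oneline 2>/dev/null)" ]; then
    echo "unpushed"
    return 0
  fi
  echo "clean"
}

# Assumed layout: every repo is checked out directly under ~/dev.
audit_all() {
  local base="${1:-$HOME/dev}" dir
  for dir in "$base"/*/; do
    [ -d "$dir" ] || continue
    printf '%s\t%s\n' "$(basename "$dir")" "$(audit_repo "${dir%/}")"
  done
}
```

Running `audit_all` prints one tab-separated line per directory. Anything other than `clean` means a fresh clone from Zurich would lose work.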
### Tier 2: Configuration (documented, manually recoverable)
These items can't be git-tracked but are documented here for recovery.
#### Signal CLI (bot number: +31634481877)
- **Data:** `~/.clawdbot/tools/signal-cli/` (linked device keys)
- **Recovery:** Re-link using QR code from primary device. Takes ~2 minutes.
- **Impact:** Bot is offline until re-linked. Message history lives on the primary device, not on Signal's servers, so the re-linked bot starts without prior history.
- **Note:** Signal CLI version and trust store rebuild automatically on first run.
#### WhatsApp (message-bridge, linked to +1 727 225 2475)
- **Data:** `~/.message-bridge/whatsapp.db` (session keys + media refs)
- **Recovery:** Delete `whatsapp.db`, restart message-bridge, scan new QR code from Johan's phone.
- **Impact:** Bot is offline until re-linked. Historical messages remain in WhatsApp and are not stored in our DB.
- **Note:** QR code available at `http://localhost:8030/qr?format=png` after restart.
#### Proton Mail Bridge
- **Data:** `~/.config/protonmail/bridge-v3/` (account link, encryption keys)
- **Recovery:**
1. `apt install protonmail-bridge` (or download from proton.me)
2. Set keychain: `echo '{"preferred_keychain": "pass"}' > ~/.config/protonmail/bridge-v3/prefs.json`
3. Run `protonmail-bridge --cli`, login with tj@jongsma.me credentials
4. Note new bridge password, update mail-bridge config
- **Impact:** Email processing offline until re-linked. ~10 min recovery.
- **Credentials:** Account password with Johan. Bridge password regenerated on setup.
### Tier 3: Data (critical, needs backup solution)
| Data | Path | Size | Backup Status |
|---|---|---|---|
| **Sophia's documents** | `~/sophia/` | 9.2 GB | ⚠️ **SINGLE COPY** — needs offsite backup |
| **Document store** | `~/documents/` | 7.9 MB | Records in git (docsys repo); PDFs local only |
| **GLM-OCR model** | `~/models/glm-ocr/` | 2.5 GB | Re-downloadable from HuggingFace |
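
One way to close the single-copy gap for `~/sophia/` is a scheduled rsync to an offsite host. This is a minimal sketch, not an agreed solution: the `backup` user and the `/srv/backup/sophia/` path on Zurich are hypothetical and do not exist in this plan yet.

```shell
#!/usr/bin/env bash
# Sketch only: the default destination is an assumption, not an existing
# account or path on the Zurich VPS.
offsite_backup() {
  local src="${1:-$HOME/sophia/}"
  local dest="${2:-backup@zurich.inou.com:/srv/backup/sophia/}"
  # -a preserves permissions and timestamps; --partial lets an interrupted
  # transfer resume. --delete is deliberately omitted so that an accidental
  # local deletion is not mirrored to the only other copy.
  set -- rsync -a --partial -e ssh "$src" "$dest"
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"      # default: print the command instead of running it
  else
    "$@"
  fi
}
```

By default the function only prints the command. Setting `DRY_RUN=0` performs the transfer.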
### Tier 4: Rebuildable (no backup needed)
| Component | Recovery |
|---|---|
| Python venvs (`ocr-env/`, `.venv/`) | `pip install -r requirements.txt` |
| Node modules | `npm install` |
| Flutter SDK | Re-download |
| Docker images on 253 | `docker compose pull` |
| OC session transcripts | Nice-to-have, not critical |
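
The table assumes each project carries the file its rebuild needs. As a rough sketch, a helper could pick the rebuild command from whichever marker file is present. `rebuild_cmd` is a hypothetical name, and the venv step is added on the assumption that the virtual environment itself was lost along with the server.

```shell
#!/usr/bin/env bash
# Sketch only: rebuild_cmd is a hypothetical helper that maps a project
# directory to the rebuild command from the Tier 4 table.
rebuild_cmd() {
  local dir="$1"
  if [ -f "$dir/requirements.txt" ]; then
    # Recreate the venv first; pip install alone assumes it still exists.
    echo "python3 -m venv .venv && .venv/bin/pip install -r requirements.txt"
  elif [ -f "$dir/package.json" ]; then
    echo "npm install"
  elif [ -f "$dir/docker-compose.yml" ] || [ -f "$dir/compose.yaml" ]; then
    echo "docker compose pull"
  else
    echo "unknown"
  fi
}
```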
---
## Service Recovery Procedures
### Full Server Loss (forge)
**Prerequisites:** SSH key authorized on Zurich VPS, new Ubuntu 24.04 server.
1. **OS Setup:**
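```
sudo apt update && sudo apt upgrade -y
sudo adduser johan
# Base toolchain (Ubuntu 24.04 package names); add docker.io if needed
sudo apt install -y git golang-go nodejs npm python3 python3-venv
```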
2. **SSH Keys:**
- Generate new: `ssh-keygen -t ed25519`
- Authorize on Zurich: `ssh root@zurich.inou.com` → add to `/home/git/.ssh/authorized_keys`
3. **Clone all repos:**
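```
mkdir -p ~/dev && cd ~/dev
# Every repo from the Tier 1 list except clawd, which lives in ~/clawd
for repo in inou azure-backup james-dashboard mail-bridge mail-agent \
    inou-mobile clawdnode-android clawdnode-debug-server clawdnode-gateway \
    message-bridge messaging-center docman docsys docs \
    moltmobile-android moltmobile-gateway screenshot-server docproc; do
  git clone "git@zurich.inou.com:$repo.git"
done
git clone git@zurich.inou.com:clawd.git ~/clawd
```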
4. **Install OpenClaw:**
```
npm install -g openclaw
openclaw init
# Restore gateway config from clawd/config-backups/ or memory
```
5. **Restore services** (see systemd units below)
6. **Re-link integrations:**
- Signal CLI: QR code link
- WhatsApp: QR code link
- Proton Bridge: CLI login
### Systemd Service Units
All services run as user units (`systemctl --user`).
| Service | Binary/Command | Port | Working Dir |
|---|---|---|---|
| `openclaw-gateway` | `node openclaw gateway` | 18789 | — |
| `signal-cli` | `signal-cli daemon --http 0.0.0.0:8080` | 8080 | — |
| `protonmail-bridge` | `protonmail-bridge --noninteractive` | 1143/1025 | — |
| `mail-bridge` | `message-center -config config.yaml` | 8025 | `~/dev/mail-bridge` |
| `message-bridge` | `message-bridge` | 8030 | `~/dev/message-bridge` |
| `james-dashboard` | `james-dashboard --dir .` | 9200 | `~/dev/james-dashboard` |
| `ocr-service` | `python server.py` | 8090 | `~/ocr-service` |
| `docsys` | `docsys` | — | `~/dev/docsys` |
**Unit files location:** `~/.config/systemd/user/`
**Environment files:**
- `~/.config/message-center.env` (mail-bridge credentials)
- OpenClaw gateway env vars in unit file (API keys, tokens)
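
Until the unit files are tracked in git, they have to be recreated by hand after a server loss. The sketch below writes a minimal user unit from the columns of the table above. The real unit files are not reproduced in this plan, so the helper name `write_unit`, the `Restart=on-failure` policy, and any `ExecStart` path passed in are assumptions.

```shell
#!/usr/bin/env bash
# Sketch only: write_unit is a hypothetical helper. The generated unit is a
# minimal guess, not a copy of the units that ran on forge.
write_unit() {
  local name="$1" exec_start="$2" workdir="$3"
  local unit_dir="${UNIT_DIR:-$HOME/.config/systemd/user}"
  mkdir -p "$unit_dir"
  {
    echo "[Unit]"
    echo "Description=$name"
    echo
    echo "[Service]"
    echo "ExecStart=$exec_start"
    # WorkingDirectory is optional; several services in the table have none.
    if [ -n "$workdir" ]; then
      echo "WorkingDirectory=$workdir"
    fi
    echo "Restart=on-failure"
    echo
    echo "[Install]"
    echo "WantedBy=default.target"
  } > "$unit_dir/$name.service"
}
```

For example, `write_unit message-bridge "%h/dev/message-bridge/message-bridge" "%h/dev/message-bridge"` followed by `systemctl --user daemon-reload` and `systemctl --user enable --now message-bridge`. The binary path is illustrative. Because these are user units, `loginctl enable-linger johan` is probably also needed so they start at boot without an interactive login.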
### Zurich VPS Loss
1. Provision new VPS
2. Install git, create `git` user with `git-shell`
3. Push all repos from forge (after a Zurich loss, the working copies on forge are the primary copies)
4. Reinstall Uptime Kuma, Caddy, nuclei
5. Update DNS if IP changes
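
Step 3 can be sketched as a generator that prints the commands for one working copy. It assumes repos live under `~/dev` with a remote named `origin`, and that bare repos are created as root and then handed to the `git` user, since `git-shell` does not let that user run `git init` over SSH. `repush_cmds` is a hypothetical helper that prints commands rather than running them, so the output can be reviewed first.

```shell
#!/usr/bin/env bash
# Sketch only: prints the commands that repopulate a rebuilt git host from a
# working copy on forge. Nothing is executed.
GIT_HOST="${GIT_HOST:-zurich.inou.com}"

repush_cmds() {
  local dir="$1" name
  name="$(basename "$dir")"
  # Create the bare repo as root, then give it to the git user.
  echo "ssh root@$GIT_HOST 'git init --bare /home/git/$name.git && chown -R git:git /home/git/$name.git'"
  # Re-point origin in case the hostname changed, then push branches and tags.
  echo "git -C $dir remote set-url origin git@$GIT_HOST:$name.git"
  echo "git -C $dir push --all origin"
  echo "git -C $dir push --tags origin"
}
```

Looping over `~/dev/*/` and `~/clawd` and piping the reviewed output to `sh` would perform the restore.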
---
## Monitoring
| Check | Frequency | Tool |
|---|---|---|
| Service health | Every heartbeat | `scripts/service-health.sh` |
| Git audit | Hourly (:30) | `scripts/git-audit.sh` via cron |
| Claude usage | Hourly (:00) | `scripts/claude-usage-check.sh` via cron |
| Nuclei security scan | Monthly | Cron from Zurich |
| Docker updates (253) | Weekly (Sunday) | Heartbeat task |
| HAOS updates | Weekly (Sunday) | Heartbeat task |
| Uptime Kuma | Continuous | https://zurich.inou.com:3001 |
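
The contents of `scripts/service-health.sh` are not shown in this plan. As an illustration of what a per-heartbeat check could look like, the sketch below probes the TCP ports from the service table. `port_open` and `check_ports` are hypothetical helpers, and the real script may work differently.

```shell
#!/usr/bin/env bash
# Sketch only: TCP reachability check, not the actual service-health.sh.
port_open() {
  local host="$1" port="$2"
  # bash's /dev/tcp pseudo-device attempts a TCP connect; timeout bounds it.
  timeout 2 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null
}

# Arguments are name:port pairs. Returns non-zero if any port is closed.
check_ports() {
  local entry name port failed=0
  for entry in "$@"; do
    name="${entry%%:*}"
    port="${entry##*:}"
    if port_open 127.0.0.1 "$port"; then
      echo "ok   $name ($port)"
    else
      echo "DOWN $name ($port)"
      failed=1
    fi
  done
  return "$failed"
}
```

A call such as `check_ports openclaw-gateway:18789 signal-cli:8080 mail-bridge:8025 message-bridge:8030 james-dashboard:9200 ocr-service:8090` covers the services with a listed port. An open port only shows that something is listening, not that the service is healthy.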
---
## Open Items
- [ ] **Sophia docs backup** — 9.2 GB, single copy. Needs offsite (Proton Drive, Zurich, or both)
- [ ] **Systemd unit backup** — Track in git (docs repo or clawd)
- [ ] **Automated config snapshots** — Gateway config, env files
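
The last two items could be covered by a dated archive of the unit files and env files. This is a minimal sketch, assuming `~/clawd/config-backups/` (mentioned in the OpenClaw install step) is the destination. `snapshot_config` is a hypothetical helper. Env files hold credentials, so the archive should be encrypted before it is committed or copied offsite.

```shell
#!/usr/bin/env bash
# Sketch only: snapshot_config is a hypothetical helper. The archive is NOT
# encrypted here; do not push it to git as-is if it contains env files.
snapshot_config() {
  local dest_dir="$1"
  shift
  local stamp archive
  stamp="$(date +%Y-%m-%d)"
  archive="$dest_dir/config-$stamp.tar.gz"
  mkdir -p "$dest_dir"
  # Remaining arguments are the files or directories to capture.
  tar -czf "$archive" "$@" && echo "$archive"
}
```

For example: `snapshot_config ~/clawd/config-backups ~/.config/systemd/user ~/.config/message-center.env`. The function prints the path of the archive it wrote.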