# Replication Design — Event-Driven Async (Commercial Only)

## Core Principle: Trigger on Change, Not Time

Polling every 30s is wasteful when vaults may go days without changes. Replication fires **immediately** when a write happens, then goes idle.

## Architecture

### On Primary (Calgary)

```
Client Request → Primary Handler
    ↓
[1] Apply to local DB
[2] Mark entry dirty (replication_dirty = 1)
[3] Signal replication worker (non-blocking channel)
[4] Return success to client (don't wait)
    ↓
Replication Worker (event-driven, wakes on signal)
    ↓
POST dirty entries to Backup /api/replication/apply
    ↓
Clear dirty flag on ACK
```

**No polling. No timer. The worker sleeps until woken.**

### Replication Worker

```go
type ReplicationWorker struct {
	db      *lib.DB
	config  *ReplicationConfig
	signal  chan struct{}  // Buffered channel (size 1)
	pending map[int64]bool // In-memory dedup
	mu      sync.Mutex
}

func (w *ReplicationWorker) Signal() {
	select {
	case w.signal <- struct{}{}:
	default:
		// Already signaled; the worker will pick up all dirty entries
	}
}

func (w *ReplicationWorker) Run(ctx context.Context) {
	for {
		select {
		case <-ctx.Done():
			return
		case <-w.signal:
			w.replicateBatch()
		}
	}
}

func (w *ReplicationWorker) replicateBatch() {
	// Get all dirty entries (could be 1 or many after a burst)
	entries, _ := lib.EntryListDirty(w.db, 100)
	if len(entries) == 0 {
		return
	}
	// POST to backup
	// Retry with backoff on failure
	// Mark replicated on success
}
```

### Signal Flow

```go
// In CreateEntry, UpdateEntry, DeleteEntry handlers:
func (h *Handlers) CreateEntry(...) {
	// ... create entry ...

	// Commercial only: mark dirty and signal the replicator
	if edition.Current.Name() == "commercial" {
		lib.EntryMarkDirty(h.db(r), entry.EntryID)
		edition.SignalReplication() // Non-blocking
	}

	// Return to client immediately
}
```

### On Backup (Zurich)

Same as before: read-only mode, applies replication pushes, rejects client writes.
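The coalescing behavior of the size-1 buffered channel is the load-bearing detail here: any number of `Signal()` calls made while the worker is busy collapse into a single pending wake-up. A minimal, self-contained sketch (names like `coalesce` and `notify` are illustrative, not part of the real codebase):

```go
package main

import "fmt"

// coalesce simulates n write signals against a size-1 buffered channel
// and returns how many worker wake-ups result.
func coalesce(n int) int {
	signal := make(chan struct{}, 1)
	notify := func() {
		select {
		case signal <- struct{}{}:
		default: // already signaled; the pending wake-up covers this write too
		}
	}
	for i := 0; i < n; i++ {
		notify()
	}
	// Drain: the worker receives at most one pending signal per wake-up.
	wakeups := 0
	for {
		select {
		case <-signal:
			wakeups++
		default:
			return wakeups
		}
	}
}

func main() {
	fmt.Println("wake-ups for 50 signals:", coalesce(50)) // prints 1
}
```

Because the send in `notify` is non-blocking, write handlers never stall on replication, which is what lets step [4] return to the client without waiting.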
## Efficiency Gains

| Metric | Polling (30s) | Event-Driven |
|--------|---------------|--------------|
| CPU wakeups/day | 2,880 | ~number of actual writes |
| Network requests/day | 2,880 | ~number of actual writes |
| Egress/day | High (always checking) | Low (only when data changes) |
| Latency | 0-30s | Immediate |

For a vault with 10 writes/day: **288x fewer wakeups.**

## Burst Handling

If 50 entries change in a burst (e.g., a batch import):

1. All 50 are marked dirty
2. The worker wakes once
3. It sends all 50 in a single batch
4. It goes back to sleep

Not 50 separate HTTP requests.

## Failure & Retry

```go
func replicateBatch() {
	entries, _ := lib.EntryListDirty(db, 100)

	var err error
	for attempt := 0; attempt < maxRetries; attempt++ {
		err = postToBackup(entries)
		if err == nil {
			// Success: clear dirty flags
			for _, e := range entries {
				lib.EntryMarkReplicated(db, e.EntryID)
			}
			return
		}
		// Failure: entries stay dirty, will be picked up on the next signal
		// Backoff: 1s, 5s, 25s, 125s...
		time.Sleep(time.Duration(math.Pow(5, float64(attempt))) * time.Second)
	}

	// Max retries exceeded: alert the operator
	edition.Current.AlertOperator(ctx, "replication_failed",
		"Backup unreachable after retries", map[string]any{
			"count":      len(entries),
			"last_error": err.Error(),
		})
}
```

No persistent queue needed - the dirty flags in SQLite are the queue.

## Code Changes Required

### 1. Signal Function (Commercial Only)

```go
// edition/replication.go
var replicationSignal chan struct{}

func SignalReplication() {
	if replicationSignal != nil {
		select {
		case replicationSignal <- struct{}{}:
		default:
		}
	}
}
```

### 2. Modified Handlers

All write handlers need:

```go
if edition.Current.Name() == "commercial" {
	lib.EntryMarkDirty(db, entryID)
	edition.SignalReplication()
}
```

### 3. Remove Polling

Delete the ticker from the replication worker; replace it with `<-signal` only.
## Resource Usage

| Resource | Polling | Event-Driven |
|----------|---------|--------------|
| Goroutine | Always running | Running but blocked on channel (idle) |
| Memory | Minimal | Minimal (just channel + map) |
| CPU | 2,880 wakeups/day | #writes wakeups/day |
| Network | 2,880 requests/day | #writes requests/day |
| SQLite queries | 2,880/day | #writes/day |

## Design Notes

**No persistent queue needed** - the `replication_dirty` column IS the queue. Worker crash? On restart, `EntryListDirty()` finds all pending work.

**No timer needed** - a Go channel with `select` is the most efficient wait mechanism.

**Batching is automatic** - multiple signals while the worker is busy? A channel of size 1 means the worker picks up ALL dirty entries on its next iteration, not one-by-one.

---

**This is the right design for low-resource, low-change vaults.**
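The crash-recovery claim can be sketched concretely: because the dirty flags live in the database rather than in the worker, a fresh worker sees all pending work on restart. In this toy model a map stands in for the `replication_dirty` column, and `fakeDB`, `markDirty`, and `listDirty` are illustrative stand-ins for the real schema and `lib` helpers:

```go
package main

import "fmt"

// fakeDB stands in for SQLite: the dirty set is durable state,
// independent of any worker's lifetime.
type fakeDB struct{ dirty map[int64]bool }

func (db *fakeDB) markDirty(id int64) { db.dirty[id] = true }

func (db *fakeDB) listDirty() []int64 {
	var ids []int64
	for id, d := range db.dirty {
		if d {
			ids = append(ids, id)
		}
	}
	return ids
}

func main() {
	db := &fakeDB{dirty: map[int64]bool{}}
	db.markDirty(1)
	db.markDirty(2)

	// "Crash": the worker and any in-memory state are gone; only db survives.
	// A fresh worker's first replicateBatch still finds both entries.
	fmt.Println("pending after restart:", len(db.listDirty()))
}
```

This is why losing a signal across a restart is harmless: the signal only schedules work, while the dirty flags define it.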