# Replication Design — Event-Driven Async (Commercial Only)
## Core Principle: Trigger on Change, Not Time
Polling every 30s is wasteful when a vault may go days without changes. Instead, replication fires immediately when a write happens, then goes idle.
## Architecture
### On Primary (Calgary)
```
Client Request → Primary Handler
        ↓
  [1] Apply to local DB
  [2] Mark entry dirty (replication_dirty = 1)
  [3] Signal replication worker (non-blocking channel)
  [4] Return success to client (don't wait)
        ↓
Replication Worker (event-driven, wakes on signal)
        ↓
  POST dirty entries to Backup /api/replication/apply
        ↓
  Clear dirty flag on ACK
```
No polling. No timer. The worker sleeps until woken.
### Replication Worker
```go
type ReplicationWorker struct {
	db      *lib.DB
	config  *ReplicationConfig
	signal  chan struct{}  // buffered channel (size 1)
	pending map[int64]bool // in-memory dedup
	mu      sync.Mutex
}

func (w *ReplicationWorker) Signal() {
	select {
	case w.signal <- struct{}{}:
	default:
		// Already signaled; the worker will pick up all dirty entries.
	}
}

func (w *ReplicationWorker) Run(ctx context.Context) {
	for {
		select {
		case <-ctx.Done():
			return
		case <-w.signal:
			w.replicateBatch()
		}
	}
}

func (w *ReplicationWorker) replicateBatch() {
	// Get all dirty entries (could be 1 or many after a burst).
	entries, _ := lib.EntryListDirty(w.db, 100)
	if len(entries) == 0 {
		return
	}
	// POST to backup.
	// Retry with backoff on failure.
	// Mark replicated on success.
}
```
### Signal Flow
```go
// In the CreateEntry, UpdateEntry, and DeleteEntry handlers:
func (h *Handlers) CreateEntry(...) {
	// ... create entry ...

	// Commercial only: mark dirty and signal the replicator.
	if edition.Current.Name() == "commercial" {
		lib.EntryMarkDirty(h.db(r), entry.EntryID)
		edition.SignalReplication() // non-blocking
	}

	// Return to the client immediately.
}
```
### On Backup (Zurich)
Same as before: the backup runs in read-only mode, applies replication pushes, and rejects client writes.
## Efficiency Gains
| Metric | Polling (30s) | Event-Driven |
|---|---|---|
| CPU wakeups/day | 2,880 | ~number of actual writes |
| Network requests/day | 2,880 | ~number of actual writes |
| Egress/day | High (always checking) | Low (only when data changes) |
| Latency | 0-30s | Immediate |
For a vault with 10 writes/day: 288x fewer wakeups.
## Burst Handling
If 50 entries change in a burst (e.g., batch import):
- All 50 marked dirty
- Worker wakes once
- Sends all 50 in single batch
- Goes back to sleep
One batched POST instead of 50 separate HTTP requests.
## Failure & Retry
```go
func (w *ReplicationWorker) replicateBatch() {
	entries, err := lib.EntryListDirty(w.db, 100)
	if err != nil || len(entries) == 0 {
		return
	}
	for attempt := 0; attempt < maxRetries; attempt++ {
		if err = postToBackup(entries); err == nil {
			// Success: clear the dirty flags.
			for _, e := range entries {
				lib.EntryMarkReplicated(w.db, e.EntryID)
			}
			return
		}
		// Failure: entries stay dirty and are picked up on the next signal.
		// Backoff: 1s, 5s, 25s, 125s...
		time.Sleep(time.Duration(math.Pow(5, float64(attempt))) * time.Second)
	}
	// Max retries exceeded: alert the operator.
	// context.Background(): replication runs outside any request context.
	edition.Current.AlertOperator(context.Background(), "replication_failed",
		"Backup unreachable after retries", map[string]any{
			"count":      len(entries),
			"last_error": err.Error(),
		})
}
```
No persistent queue needed - dirty flags in SQLite are the queue.
## Code Changes Required
### 1. Signal Function (Commercial Only)
```go
// edition/replication.go
var replicationSignal chan struct{}

func SignalReplication() {
	if replicationSignal != nil {
		select {
		case replicationSignal <- struct{}{}:
		default:
		}
	}
}
```
### 2. Modified Handlers
All write handlers need:
```go
if edition.Current.Name() == "commercial" {
	lib.EntryMarkDirty(db, entryID)
	edition.SignalReplication()
}
```
### 3. Remove Polling
Delete the ticker from the replication worker and block on `<-w.signal` alone.
## Resource Usage
| Resource | Polling (30s) | Event-Driven |
|---|---|---|
| Goroutine | Always running | Blocked on channel (idle) |
| Memory | Minimal | Minimal (just channel + map) |
| CPU | 2,880 wakeups/day | ~number of actual writes |
| Network | 2,880 requests/day | ~number of actual writes |
| SQLite queries | 2,880/day | ~number of actual writes |
## Design Notes
- **No persistent queue needed.** The `replication_dirty` column IS the queue.
- **Worker crash?** On restart, `EntryListDirty()` finds all pending work.
- **No timer needed.** A goroutine blocked on a channel receive in `select` consumes no CPU while it waits.
- **Batching is automatic.** Multiple signals while the worker is busy? The size-1 channel means the worker picks up ALL dirty entries on its next iteration, not one by one.
This is the right design for low-resource, low-change vaults.