clavitor/clavis/clavis-vault/SPEC-replication-async.md


Replication Design — Event-Driven Async (Commercial Only)

Core Principle: Trigger on Change, Not Time

Polling every 30s is wasteful when vaults may go days without changes. Replication fires immediately when a write happens, then goes idle.

Architecture

On Primary (Calgary)

Client Request → Primary Handler
                      ↓
              [1] Apply to local DB
              [2] Mark entry dirty (replication_dirty = 1)
              [3] Signal replication worker (non-blocking channel)
              [4] Return success to client (don't wait)
                      ↓
              Replication Worker (event-driven, wakes on signal)
                      ↓
              POST dirty entries to Backup /api/replication/apply
                      ↓
              Clear dirty flag on ACK

No polling. No timer. The worker sleeps until woken.

Replication Worker

type ReplicationWorker struct {
    db       *lib.DB
    config   *ReplicationConfig
    signal   chan struct{}     // Buffered channel (size 1)
    pending  map[int64]bool    // Dedup in-memory
    mu       sync.Mutex
}

func (w *ReplicationWorker) Signal() {
    select {
    case w.signal <- struct{}{}:
    default:
        // Already signaled, worker will pick up all dirty entries
    }
}

func (w *ReplicationWorker) Run(ctx context.Context) {
    for {
        select {
        case <-ctx.Done():
            return
        case <-w.signal:
            w.replicateBatch()
        }
    }
}

func (w *ReplicationWorker) replicateBatch() {
    // Get all dirty entries (could be 1 or many if burst)
    entries, err := lib.EntryListDirty(w.db, 100)
    if err != nil || len(entries) == 0 {
        return
    }
    
    // POST to backup
    // Retry with backoff on failure
    // Mark replicated on success
}

Signal Flow

// In CreateEntry, UpdateEntry, DeleteEntry handlers:
func (h *Handlers) CreateEntry(...) {
    // ... create entry ...
    
    // Commercial only: mark dirty and signal replicator
    if edition.Current.Name() == "commercial" {
        lib.EntryMarkDirty(h.db(r), entry.EntryID)
        edition.SignalReplication() // Non-blocking
    }
    
    // Return to client immediately
}

On Backup (Zurich)

Same as before: Read-only mode, applies replication pushes, rejects client writes.

Efficiency Gains

| Metric               | Polling (30s)          | Event-Driven                 |
|----------------------|------------------------|------------------------------|
| CPU wakeups/day      | 2,880                  | ~number of actual writes     |
| Network requests/day | 2,880                  | ~number of actual writes     |
| Egress/day           | High (always checking) | Low (only when data changes) |
| Latency              | 0-30s                  | Immediate                    |

For a vault with 10 writes/day: 288x fewer wakeups.

Burst Handling

If 50 entries change in a burst (e.g., batch import):

  1. All 50 marked dirty
  2. Worker wakes once
  3. Sends all 50 in single batch
  4. Goes back to sleep

No 50 separate HTTP requests.

Failure & Retry

func (w *ReplicationWorker) replicateBatch() {
    entries, err := lib.EntryListDirty(w.db, 100)
    if err != nil || len(entries) == 0 {
        return
    }
    
    for attempt := 0; attempt < maxRetries; attempt++ {
        err = postToBackup(entries)
        if err == nil {
            // Success: clear dirty flags
            for _, e := range entries {
                lib.EntryMarkReplicated(w.db, e.EntryID)
            }
            return
        }
        
        // Failure: entries stay dirty, will be picked up next signal
        // Backoff: 1s, 5s, 25s, 125s...
        time.Sleep(time.Duration(math.Pow(5, float64(attempt))) * time.Second)
    }
    
    // Max retries exceeded: alert operator
    edition.Current.AlertOperator(context.Background(), "replication_failed",
        "Backup unreachable after retries", map[string]any{
            "count":      len(entries),
            "last_error": err.Error(),
        })
}

No persistent queue needed - dirty flags in SQLite are the queue.

Code Changes Required

1. Signal Function (Commercial Only)

// edition/replication.go
var replicationSignal chan struct{}

func SignalReplication() {
    if replicationSignal != nil {
        select {
        case replicationSignal <- struct{}{}:
        default:
        }
    }
}

2. Modified Handlers

All write handlers need:

if edition.Current.Name() == "commercial" {
    lib.EntryMarkDirty(db, entryID)
    edition.SignalReplication()
}

3. Remove Polling

Delete the ticker from replication worker. Replace with <-signal only.

Resource Usage

| Resource       | Polling            | Event-Driven                           |
|----------------|--------------------|----------------------------------------|
| Goroutine      | Always running     | Running but blocked on channel (idle)  |
| Memory         | Minimal            | Minimal (just channel + map)           |
| CPU            | 2,880 wakeups/day  | #writes wakeups/day                    |
| Network        | 2,880 requests/day | #writes requests/day                   |
| SQLite queries | 2,880/day          | #writes/day                            |

Design Notes

No persistent queue needed - the replication_dirty column IS the queue. Worker crash? On restart, EntryListDirty() finds all pending work.

No timer needed - a goroutine blocked on a channel receive is parked by the Go runtime and consumes no CPU until a value arrives, making select-on-channel the cheapest wait mechanism available here.

Batching automatic - Multiple signals while worker is busy? Channel size 1 means worker picks up ALL dirty entries on next iteration, not one-by-one.


This is the right design for low-resource, low-change vaults.