clavitor/clavis/clavis-vault/SECURITY_REVIEW_REPLICATION.md

9.8 KiB

Security Review: Replication Function (Commercial Edition)

Scope: Event-driven async replication between Primary and Backup POPs Files Reviewed:

  • edition/config.go - Config loading and validation
  • edition/replication.go - Replication worker and client
  • edition/backup_mode.go - Backup mode detection and write rejection
  • lib/dbcore.go - Dirty marking functions

CRITICAL ISSUES (Fix Before Production)

1. MISSING: Inter-POP Authentication Implementation

Risk: Unauthorized POPs could push malicious data to backups Location: edition/replication.go:150 (stub)

Current State:

// TODO: POST to backup POP
// No authentication implemented

Required:

  • mTLS (client cert) OR
  • Shared secret in Authorization: Bearer <token> header
  • Token must be read from file (not hardcoded)

Fix:

req.Header.Set("Authorization", "Bearer " + loadTokenFromFile(w.config.Auth.TokenFile))
// OR use TLS client certificates

2. MISSING: Backup-Side Request Authentication

Risk: Anyone with network access could POST to /api/replication/apply Location: Not yet implemented (backup receive endpoint)

Required on Backup:

  • Verify Authorization header matches shared secret
  • Verify source POP ID is in authorized list
  • mTLS client cert validation (if using mTLS)

3. WEAK: Backup Mode Detection Uses Environment Variable

Risk: CLAVITOR_BACKUP_MODE=true can be set accidentally or maliciously Location: edition/backup_mode.go:23

Current:

if os.Getenv("CLAVITOR_BACKUP_MODE") == "true" {
    return true
}

Problem:

  • Env vars are visible to all processes (ps e)
  • No validation that this POP is actually a backup
  • Could cause split-brain if primary thinks it's backup

Fix:

  • Read from config file only (replication.yaml with role: backup)
  • Use control plane confirmation (call HQ to verify role)
  • Remove env var option entirely

4. MISSING: Replay Attack Protection

Risk: Attacker could replay old replication requests Location: edition/replication.go:173-177

Current Request:

type ReplicationRequest struct {
    SourcePOP string            `json:"source_pop"`
    Entries   []ReplicatedEntry `json:"entries"`
    Timestamp int64             `json:"timestamp"`
}

Missing:

  • No nonce/unique ID per request
  • No request signature
  • No max age validation (5-minute window?)

Fix:

type ReplicationRequest struct {
    SourcePOP string            `json:"source_pop"`
    Entries   []ReplicatedEntry `json:"entries"`
    Timestamp int64             `json:"timestamp"`
    Nonce     string            `json:"nonce"`      // Random 16 bytes
    Signature string            `json:"signature"`  // HMAC-SHA256 of body
}

5. HIGH: Shared Secret Token Validation Weak

Risk: Token file existence checked, but not token format/entropy Location: edition/config.go:58

Current:

if _, err := os.Stat(tokenFile); err != nil {
    return nil, fmt.Errorf("auth token file not found: %s", tokenFile)
}

Missing:

  • No token entropy validation (is it 32+ random bytes?)
  • No token rotation mechanism
  • No token expiration

Fix:

token, err := os.ReadFile(tokenFile)
if len(token) < 32 {
    return nil, fmt.Errorf("token must be at least 32 bytes")
}
// Check entropy (not all zeros, not repetitive)

HIGH SEVERITY ISSUES

6. MISSING: HTTPS Certificate Validation Disabled?

Risk: MITM attack between primary and backup Location: edition/replication.go:150 (stub)

Concern: If using self-signed certs or InsecureSkipVerify: true

Required:

  • Proper TLS cert validation
  • Pin backup POP's cert or use internal CA
  • Never InsecureSkipVerify: true

7. NO AUDIT LOGGING OF REPLICATION

Risk: Cannot detect unauthorized replication attempts Location: Missing throughout

Required:

  • Log every replication attempt (success and failure)
  • Include: source POP ID, entry count, timestamp, auth result
  • Alert on auth failures (possible attack)

8. CASCADE REPLICATION NOT PREVENTED

Risk: Backup could accidentally re-replicate to another POP, causing loops Location: edition/replication.go

Current: Worker only checks Role != "primary", but doesn't prevent config errors

Fix:

// In startReplication():
if cfg.Role == "backup" {
    log.Printf("Backup POP: replication worker disabled (receive-only)")
    return  // Don't start worker
}

9. INFORMATION DISCLOSURE IN ERRORS

Risk: Error messages leak internal paths and structure Location: edition/config.go:23, 28, 59

Current:

return nil, fmt.Errorf("cannot read replication config: %w", err)
return nil, fmt.Errorf("auth token file not found: %s", tokenFile)

Problem:

  • Exposes /etc/clavitor/replication.yaml path
  • Exposes /etc/clavitor/replication.key path
  • Could help attacker locate secrets

Fix:

return nil, fmt.Errorf("replication configuration error")
// Log full details internally only

MEDIUM SEVERITY ISSUES

10. NO RATE LIMITING ON REPLICATION ENDPOINT

Risk: DoS via flood of replication requests Location: Not implemented (backup receive endpoint)

Required:

  • Rate limit /api/replication/apply per source POP
  • Max 1 request/second per POP
  • Ban source POP after 10 failed auth attempts

11. BACKOFF CALCULATION FLOAT CONVERSION

Risk: Minor - could panic or behave unexpectedly Location: edition/replication.go:109

Current:

backoff := time.Duration(math.Pow(5, float64(attempt-1))) * time.Second

Issue: For attempt=5, result is 625 seconds = 10.4 minutes, but truncated

Fix: Use integer math:

delays := []time.Duration{1*time.Second, 5*time.Second, 25*time.Second, 125*time.Second, 625*time.Second}
if attempt > 0 && attempt <= len(delays) {
    backoff = delays[attempt-1]
}

12. WORKER DB CONNECTION SEPARATE FROM MAIN

Risk: Inconsistent state if main DB has different connection params Location: edition/replication.go:56-61

Current:

dbPath := dataDir + "/clavitor-*.db" // TODO: proper vault lookup
db, err := lib.OpenDB(dbPath)

Issue: Replication worker uses separate DB connection from main handlers. If WAL mode settings differ, could cause lock issues.

Fix: Pass existing DB connection from main, or ensure identical settings.

13. NO MAX DIRTY ENTRY LIMIT

Risk: Memory exhaustion if backup down for long time Location: edition/replication.go:142

Current:

entries, err := lib.EntryListDirty(w.db, w.config.Replication.BatchSize)

Issue: Only limited by batch size, but if replication fails repeatedly, entries accumulate in DB (not memory). Actually OK - dirty entries stay in DB, not memory.

Status: Acceptable (disk-bound, not memory-bound)


LOW SEVERITY / DEFENSE IN DEPTH

14. TIMING ATTACK ON TOKEN COMPARISON

Risk: Side-channel attack if comparing tokens naively Location: Not implemented yet (backup auth)

Fix: Use constant-time comparison:

if subtle.ConstantTimeCompare(receivedToken, expectedToken) != 1 {
    return authFailed
}

15. NO REQUEST SIZE LIMITS ON BACKUP ENDPOINT

Risk: Large request could cause OOM Location: Not implemented (backup receive)

Fix: http.MaxBytesReader(w, r.Body, maxReplicationBodySize)

16. LOG INJECTION VIA ENTRY DATA

Risk: If entry titles contain format strings or escape sequences Location: edition/replication.go:154

Current:

log.Printf("Replicating %d entries to %s", len(entries), w.config.BackupPOP.URL)

Issue: If somehow user-controlled data gets into log, could inject log formatters

Fix: Use structured logging, never printf with user data


SECURITY RECOMMENDATIONS

Immediate (Before Production)

  1. Implement mTLS between POPs - Most secure option, no shared secrets
  2. Add request signing with HMAC - Prevent replays
  3. Add replication audit logging - Security monitoring
  4. Remove env var backup mode - Config file only
  5. Add rate limiting - Prevent DoS

Short Term

  1. Token rotation mechanism - Hot-swap without restart
  2. Control plane integration - Central authority for POP roles
  3. Add security headers - X-Content-Type-Options, etc.

Design Improvements

  1. Consider using ed25519 signatures instead of shared secrets
  2. Add replication watermark/epoch - For detecting gaps

SECURE CONFIGURATION EXAMPLE

# /etc/clavitor/replication.yaml
pop_id: "calgary-01"
region: "north-america"
role: "primary"

backup_pop:
  id: "zurich-01"
  url: "https://zurich-01.clavitor.ai:8443"  # TLS 1.3 only
  # No auth_token_file here - use mTLS instead

auth:
  method: "mtls"  # or "bearer"
  # For mTLS:
  mtls_cert: "/etc/clavitor/replication.crt"
  mtls_key: "/etc/clavitor/replication.key"
  mtls_ca: "/etc/clavitor/ca.crt"  # Verify backup cert
  # For bearer token:
  # token_file: "/etc/clavitor/replication.key"
  # token_rotation_interval: 86400  # 24 hours

replication:
  batch_size: 100
  max_retries: 5
  request_timeout: 30  # seconds
  max_request_age: 300   # 5 minutes (anti-replay)

security:
  audit_log: "/var/log/clavitor/replication.log"
  rate_limit_per_minute: 60
  ban_after_failures: 10
  ban_duration: 3600  # 1 hour

SUMMARY

Risk Level Count Issues
CRITICAL 5 Auth not implemented, replay attacks, env var backup mode, weak token validation, MITM risk
HIGH 4 HTTPS validation, audit logging, cascade prevention, info disclosure
MEDIUM 4 Rate limiting, backoff math, DB connection, size limits
LOW 3 Timing attacks, request size, log injection

Bottom Line: The replication scaffolding is good, but the security-critical parts (auth, replay protection, audit) are TODO stubs. Do not deploy to production until CRITICAL and HIGH issues are resolved.