9.8 KiB
Security Review: Replication Function (Commercial Edition)
Scope: Event-driven async replication between Primary and Backup POPs Files Reviewed:
edition/config.go- Config loading and validationedition/replication.go- Replication worker and clientedition/backup_mode.go- Backup mode detection and write rejectionlib/dbcore.go- Dirty marking functions
CRITICAL ISSUES (Fix Before Production)
1. MISSING: Inter-POP Authentication Implementation
Risk: Unauthorized POPs could push malicious data to backups
Location: edition/replication.go:150 (stub)
Current State:
// TODO: POST to backup POP
// No authentication implemented
Required:
- mTLS (client cert) OR
- Shared secret in
Authorization: Bearer <token>header - Token must be read from file (not hardcoded)
Fix:
req.Header.Set("Authorization", "Bearer " + loadTokenFromFile(w.config.Auth.TokenFile))
// OR use TLS client certificates
2. MISSING: Backup-Side Request Authentication
Risk: Anyone with network access could POST to /api/replication/apply
Location: Not yet implemented (backup receive endpoint)
Required on Backup:
- Verify
Authorizationheader matches shared secret - Verify source POP ID is in authorized list
- mTLS client cert validation (if using mTLS)
3. WEAK: Backup Mode Detection Uses Environment Variable
Risk: CLAVITOR_BACKUP_MODE=true can be set accidentally or maliciously
Location: edition/backup_mode.go:23
Current:
if os.Getenv("CLAVITOR_BACKUP_MODE") == "true" {
return true
}
Problem:
- Env vars are visible to all processes (
ps e) - No validation that this POP is actually a backup
- Could cause split-brain if primary thinks it's backup
Fix:
- Read from config file only (
replication.yamlwithrole: backup) - Use control plane confirmation (call HQ to verify role)
- Remove env var option entirely
4. MISSING: Replay Attack Protection
Risk: Attacker could replay old replication requests
Location: edition/replication.go:173-177
Current Request:
type ReplicationRequest struct {
SourcePOP string `json:"source_pop"`
Entries []ReplicatedEntry `json:"entries"`
Timestamp int64 `json:"timestamp"`
}
Missing:
- No nonce/unique ID per request
- No request signature
- No max age validation (5-minute window?)
Fix:
type ReplicationRequest struct {
SourcePOP string `json:"source_pop"`
Entries []ReplicatedEntry `json:"entries"`
Timestamp int64 `json:"timestamp"`
Nonce string `json:"nonce"` // Random 16 bytes
Signature string `json:"signature"` // HMAC-SHA256 of body
}
5. HIGH: Shared Secret Token Validation Weak
Risk: Token file existence checked, but not token format/entropy
Location: edition/config.go:58
Current:
if _, err := os.Stat(tokenFile); err != nil {
return nil, fmt.Errorf("auth token file not found: %s", tokenFile)
}
Missing:
- No token entropy validation (is it 32+ random bytes?)
- No token rotation mechanism
- No token expiration
Fix:
token, err := os.ReadFile(tokenFile)
if len(token) < 32 {
return nil, fmt.Errorf("token must be at least 32 bytes")
}
// Check entropy (not all zeros, not repetitive)
HIGH SEVERITY ISSUES
6. MISSING: HTTPS Certificate Validation Disabled?
Risk: MITM attack between primary and backup
Location: edition/replication.go:150 (stub)
Concern: If using self-signed certs or InsecureSkipVerify: true
Required:
- Proper TLS cert validation
- Pin backup POP's cert or use internal CA
- Never
InsecureSkipVerify: true
7. NO AUDIT LOGGING OF REPLICATION
Risk: Cannot detect unauthorized replication attempts Location: Missing throughout
Required:
- Log every replication attempt (success and failure)
- Include: source POP ID, entry count, timestamp, auth result
- Alert on auth failures (possible attack)
8. CASCADE REPLICATION NOT PREVENTED
Risk: Backup could accidentally re-replicate to another POP, causing loops
Location: edition/replication.go
Current: Worker only checks Role != "primary", but doesn't prevent config errors
Fix:
// In startReplication():
if cfg.Role == "backup" {
log.Printf("Backup POP: replication worker disabled (receive-only)")
return // Don't start worker
}
9. INFORMATION DISCLOSURE IN ERRORS
Risk: Error messages leak internal paths and structure
Location: edition/config.go:23, 28, 59
Current:
return nil, fmt.Errorf("cannot read replication config: %w", err)
return nil, fmt.Errorf("auth token file not found: %s", tokenFile)
Problem:
- Exposes
/etc/clavitor/replication.yamlpath - Exposes
/etc/clavitor/replication.keypath - Could help attacker locate secrets
Fix:
return nil, fmt.Errorf("replication configuration error")
// Log full details internally only
MEDIUM SEVERITY ISSUES
10. NO RATE LIMITING ON REPLICATION ENDPOINT
Risk: DoS via flood of replication requests Location: Not implemented (backup receive endpoint)
Required:
- Rate limit
/api/replication/applyper source POP - Max 1 request/second per POP
- Ban source POP after 10 failed auth attempts
11. BACKOFF CALCULATION FLOAT CONVERSION
Risk: Minor - could panic or behave unexpectedly
Location: edition/replication.go:109
Current:
backoff := time.Duration(math.Pow(5, float64(attempt-1))) * time.Second
Issue: For attempt=5, result is 625 seconds = 10.4 minutes, but truncated
Fix: Use integer math:
delays := []time.Duration{1*time.Second, 5*time.Second, 25*time.Second, 125*time.Second, 625*time.Second}
if attempt > 0 && attempt <= len(delays) {
backoff = delays[attempt-1]
}
12. WORKER DB CONNECTION SEPARATE FROM MAIN
Risk: Inconsistent state if main DB has different connection params
Location: edition/replication.go:56-61
Current:
dbPath := dataDir + "/clavitor-*.db" // TODO: proper vault lookup
db, err := lib.OpenDB(dbPath)
Issue: Replication worker uses separate DB connection from main handlers. If WAL mode settings differ, could cause lock issues.
Fix: Pass existing DB connection from main, or ensure identical settings.
13. NO MAX DIRTY ENTRY LIMIT
Risk: Memory exhaustion if backup down for long time
Location: edition/replication.go:142
Current:
entries, err := lib.EntryListDirty(w.db, w.config.Replication.BatchSize)
Issue: Only limited by batch size, but if replication fails repeatedly, entries accumulate in DB (not memory). Actually OK - dirty entries stay in DB, not memory.
Status: Acceptable (disk-bound, not memory-bound)
LOW SEVERITY / DEFENSE IN DEPTH
14. TIMING ATTACK ON TOKEN COMPARISON
Risk: Side-channel attack if comparing tokens naively Location: Not implemented yet (backup auth)
Fix: Use constant-time comparison:
if subtle.ConstantTimeCompare(receivedToken, expectedToken) != 1 {
return authFailed
}
15. NO REQUEST SIZE LIMITS ON BACKUP ENDPOINT
Risk: Large request could cause OOM Location: Not implemented (backup receive)
Fix: http.MaxBytesReader(w, r.Body, maxReplicationBodySize)
16. LOG INJECTION VIA ENTRY DATA
Risk: If entry titles contain format strings or escape sequences
Location: edition/replication.go:154
Current:
log.Printf("Replicating %d entries to %s", len(entries), w.config.BackupPOP.URL)
Issue: If somehow user-controlled data gets into log, could inject log formatters
Fix: Use structured logging, never printf with user data
SECURITY RECOMMENDATIONS
Immediate (Before Production)
- Implement mTLS between POPs - Most secure option, no shared secrets
- Add request signing with HMAC - Prevent replays
- Add replication audit logging - Security monitoring
- Remove env var backup mode - Config file only
- Add rate limiting - Prevent DoS
Short Term
- Token rotation mechanism - Hot-swap without restart
- Control plane integration - Central authority for POP roles
- Add security headers -
X-Content-Type-Options, etc.
Design Improvements
- Consider using ed25519 signatures instead of shared secrets
- Add replication watermark/epoch - For detecting gaps
SECURE CONFIGURATION EXAMPLE
# /etc/clavitor/replication.yaml
pop_id: "calgary-01"
region: "north-america"
role: "primary"
backup_pop:
id: "zurich-01"
url: "https://zurich-01.clavitor.ai:8443" # TLS 1.3 only
# No auth_token_file here - use mTLS instead
auth:
method: "mtls" # or "bearer"
# For mTLS:
mtls_cert: "/etc/clavitor/replication.crt"
mtls_key: "/etc/clavitor/replication.key"
mtls_ca: "/etc/clavitor/ca.crt" # Verify backup cert
# For bearer token:
# token_file: "/etc/clavitor/replication.key"
# token_rotation_interval: 86400 # 24 hours
replication:
batch_size: 100
max_retries: 5
request_timeout: 30 # seconds
max_request_age: 300 # 5 minutes (anti-replay)
security:
audit_log: "/var/log/clavitor/replication.log"
rate_limit_per_minute: 60
ban_after_failures: 10
ban_duration: 3600 # 1 hour
SUMMARY
| Risk Level | Count | Issues |
|---|---|---|
| CRITICAL | 5 | Auth not implemented, replay attacks, env var backup mode, weak token validation, MITM risk |
| HIGH | 4 | HTTPS validation, audit logging, cascade prevention, info disclosure |
| MEDIUM | 4 | Rate limiting, backoff math, DB connection, size limits |
| LOW | 3 | Timing attacks, request size, log injection |
Bottom Line: The replication scaffolding is good, but the security-critical parts (auth, replay protection, audit) are TODO stubs. Do not deploy to production until CRITICAL and HIGH issues are resolved.