Security review: Replication functionality (Commercial Only)

Comprehensive security audit of event-driven replication.

CRITICAL issues (5):
1. Inter-POP authentication not implemented (stub TODO)
2. Backup-side request authentication missing
3. Backup mode uses env var (should be config-only)
4. No replay attack protection (need nonces + signatures)
5. Weak token validation (only checks existence, not entropy)

HIGH issues (4):
6. HTTPS cert validation concern
7. No audit logging of replication attempts
8. Cascade replication not prevented
9. Information disclosure in error messages

Status: NOT PRODUCTION READY - security TODO stubs present
This commit is contained in:
James 2026-04-02 01:02:36 -04:00
parent 16045d5185
commit fa7541bd4d
1 changed files with 344 additions and 0 deletions

View File

@ -0,0 +1,344 @@
# Security Review: Replication Function (Commercial Edition)
**Scope:** Event-driven async replication between Primary and Backup POPs
**Files Reviewed:**
- `edition/config.go` - Config loading and validation
- `edition/replication.go` - Replication worker and client
- `edition/backup_mode.go` - Backup mode detection and write rejection
- `lib/dbcore.go` - Dirty marking functions
---
## CRITICAL ISSUES (Fix Before Production)
### 1. MISSING: Inter-POP Authentication Implementation
**Risk:** Unauthorized POPs could push malicious data to backups
**Location:** `edition/replication.go:150` (stub)
**Current State:**
```go
// TODO: POST to backup POP
// No authentication implemented
```
**Required:**
- mTLS (client cert) OR
- Shared secret in `Authorization: Bearer <token>` header
- Token must be read from file (not hardcoded)
**Fix:**
```go
req.Header.Set("Authorization", "Bearer " + loadTokenFromFile(w.config.Auth.TokenFile))
// OR use TLS client certificates
```
### 2. MISSING: Backup-Side Request Authentication
**Risk:** Anyone with network access could POST to `/api/replication/apply`
**Location:** Not yet implemented (backup receive endpoint)
**Required on Backup:**
- Verify `Authorization` header matches shared secret
- Verify source POP ID is in authorized list
- mTLS client cert validation (if using mTLS)
### 3. WEAK: Backup Mode Detection Uses Environment Variable
**Risk:** `CLAVITOR_BACKUP_MODE=true` can be set accidentally or maliciously
**Location:** `edition/backup_mode.go:23`
**Current:**
```go
if os.Getenv("CLAVITOR_BACKUP_MODE") == "true" {
return true
}
```
**Problem:**
- Env vars are visible to all processes (`ps e`)
- No validation that this POP is actually a backup
- Could cause split-brain if primary thinks it's backup
**Fix:**
- Read from config file only (`replication.yaml` with `role: backup`)
- Use control plane confirmation (call HQ to verify role)
- Remove env var option entirely
### 4. MISSING: Replay Attack Protection
**Risk:** Attacker could replay old replication requests
**Location:** `edition/replication.go:173-177`
**Current Request:**
```go
type ReplicationRequest struct {
SourcePOP string `json:"source_pop"`
Entries []ReplicatedEntry `json:"entries"`
Timestamp int64 `json:"timestamp"`
}
```
**Missing:**
- No nonce/unique ID per request
- No request signature
- No max age validation (5-minute window?)
**Fix:**
```go
type ReplicationRequest struct {
SourcePOP string `json:"source_pop"`
Entries []ReplicatedEntry `json:"entries"`
Timestamp int64 `json:"timestamp"`
Nonce string `json:"nonce"` // Random 16 bytes
Signature string `json:"signature"` // HMAC-SHA256 of body
}
```
### 5. HIGH: Shared Secret Token Validation Weak
**Risk:** Token file existence checked, but not token format/entropy
**Location:** `edition/config.go:58`
**Current:**
```go
if _, err := os.Stat(tokenFile); err != nil {
return nil, fmt.Errorf("auth token file not found: %s", tokenFile)
}
```
**Missing:**
- No token entropy validation (is it 32+ random bytes?)
- No token rotation mechanism
- No token expiration
**Fix:**
```go
token, err := os.ReadFile(tokenFile)
if len(token) < 32 {
return nil, fmt.Errorf("token must be at least 32 bytes")
}
// Check entropy (not all zeros, not repetitive)
```
---
## HIGH SEVERITY ISSUES
### 6. MISSING: HTTPS Certificate Validation Disabled?
**Risk:** MITM attack between primary and backup
**Location:** `edition/replication.go:150` (stub)
**Concern:** If using self-signed certs or `InsecureSkipVerify: true`
**Required:**
- Proper TLS cert validation
- Pin backup POP's cert or use internal CA
- Never `InsecureSkipVerify: true`
### 7. NO AUDIT LOGGING OF REPLICATION
**Risk:** Cannot detect unauthorized replication attempts
**Location:** Missing throughout
**Required:**
- Log every replication attempt (success and failure)
- Include: source POP ID, entry count, timestamp, auth result
- Alert on auth failures (possible attack)
### 8. CASCADE REPLICATION NOT PREVENTED
**Risk:** Backup could accidentally re-replicate to another POP, causing loops
**Location:** `edition/replication.go`
**Current:** Worker only checks `Role != "primary"`, but doesn't prevent config errors
**Fix:**
```go
// In startReplication():
if cfg.Role == "backup" {
log.Printf("Backup POP: replication worker disabled (receive-only)")
return // Don't start worker
}
```
### 9. INFORMATION DISCLOSURE IN ERRORS
**Risk:** Error messages leak internal paths and structure
**Location:** `edition/config.go:23, 28, 59`
**Current:**
```go
return nil, fmt.Errorf("cannot read replication config: %w", err)
return nil, fmt.Errorf("auth token file not found: %s", tokenFile)
```
**Problem:**
- Exposes `/etc/clavitor/replication.yaml` path
- Exposes `/etc/clavitor/replication.key` path
- Could help attacker locate secrets
**Fix:**
```go
return nil, fmt.Errorf("replication configuration error")
// Log full details internally only
```
---
## MEDIUM SEVERITY ISSUES
### 10. NO RATE LIMITING ON REPLICATION ENDPOINT
**Risk:** DoS via flood of replication requests
**Location:** Not implemented (backup receive endpoint)
**Required:**
- Rate limit `/api/replication/apply` per source POP
- Max 1 request/second per POP
- Ban source POP after 10 failed auth attempts
### 11. BACKOFF CALCULATION FLOAT CONVERSION
**Risk:** Minor - could panic or behave unexpectedly
**Location:** `edition/replication.go:109`
**Current:**
```go
backoff := time.Duration(math.Pow(5, float64(attempt-1))) * time.Second
```
**Issue:** For attempt=5, result is 625 seconds = 10.4 minutes, but truncated
**Fix:** Use integer math:
```go
delays := []time.Duration{1*time.Second, 5*time.Second, 25*time.Second, 125*time.Second, 625*time.Second}
if attempt > 0 && attempt <= len(delays) {
backoff = delays[attempt-1]
}
```
### 12. WORKER DB CONNECTION SEPARATE FROM MAIN
**Risk:** Inconsistent state if main DB has different connection params
**Location:** `edition/replication.go:56-61`
**Current:**
```go
dbPath := dataDir + "/clavitor-*.db" // TODO: proper vault lookup
db, err := lib.OpenDB(dbPath)
```
**Issue:** Replication worker uses separate DB connection from main handlers. If WAL mode settings differ, could cause lock issues.
**Fix:** Pass existing DB connection from main, or ensure identical settings.
### 13. NO MAX DIRTY ENTRY LIMIT
**Risk:** Memory exhaustion if backup down for long time
**Location:** `edition/replication.go:142`
**Current:**
```go
entries, err := lib.EntryListDirty(w.db, w.config.Replication.BatchSize)
```
**Issue:** Only limited by batch size, but if replication fails repeatedly, entries accumulate in DB (not memory). **Actually OK** - dirty entries stay in DB, not memory.
**Status:** Acceptable (disk-bound, not memory-bound)
---
## LOW SEVERITY / DEFENSE IN DEPTH
### 14. TIMING ATTACK ON TOKEN COMPARISON
**Risk:** Side-channel attack if comparing tokens naively
**Location:** Not implemented yet (backup auth)
**Fix:** Use constant-time comparison:
```go
if subtle.ConstantTimeCompare(receivedToken, expectedToken) != 1 {
return authFailed
}
```
### 15. NO REQUEST SIZE LIMITS ON BACKUP ENDPOINT
**Risk:** Large request could cause OOM
**Location:** Not implemented (backup receive)
**Fix:** `http.MaxBytesReader(w, r.Body, maxReplicationBodySize)`
### 16. LOG INJECTION VIA ENTRY DATA
**Risk:** If entry titles contain format strings or escape sequences
**Location:** `edition/replication.go:154`
**Current:**
```go
log.Printf("Replicating %d entries to %s", len(entries), w.config.BackupPOP.URL)
```
**Issue:** If somehow user-controlled data gets into log, could inject log formatters
**Fix:** Use structured logging, never printf with user data
---
## SECURITY RECOMMENDATIONS
### Immediate (Before Production)
1. **Implement mTLS between POPs** - Most secure option, no shared secrets
2. **Add request signing with HMAC** - Prevent replays
3. **Add replication audit logging** - Security monitoring
4. **Remove env var backup mode** - Config file only
5. **Add rate limiting** - Prevent DoS
### Short Term
6. **Token rotation mechanism** - Hot-swap without restart
7. **Control plane integration** - Central authority for POP roles
8. **Add security headers** - `X-Content-Type-Options`, etc.
### Design Improvements
9. **Consider using ed25519 signatures** instead of shared secrets
10. **Add replication watermark/epoch** - For detecting gaps
---
## SECURE CONFIGURATION EXAMPLE
```yaml
# /etc/clavitor/replication.yaml
pop_id: "calgary-01"
region: "north-america"
role: "primary"
backup_pop:
id: "zurich-01"
url: "https://zurich-01.clavitor.ai:8443" # TLS 1.3 only
# No auth_token_file here - use mTLS instead
auth:
method: "mtls" # or "bearer"
# For mTLS:
mtls_cert: "/etc/clavitor/replication.crt"
mtls_key: "/etc/clavitor/replication.key"
mtls_ca: "/etc/clavitor/ca.crt" # Verify backup cert
# For bearer token:
# token_file: "/etc/clavitor/replication.key"
# token_rotation_interval: 86400 # 24 hours
replication:
batch_size: 100
max_retries: 5
request_timeout: 30 # seconds
max_request_age: 300 # 5 minutes (anti-replay)
security:
audit_log: "/var/log/clavitor/replication.log"
rate_limit_per_minute: 60
ban_after_failures: 10
ban_duration: 3600 # 1 hour
```
---
## SUMMARY
| Risk Level | Count | Issues |
|------------|-------|--------|
| **CRITICAL** | 5 | Auth not implemented, replay attacks, env var backup mode, weak token validation, MITM risk |
| **HIGH** | 4 | HTTPS validation, audit logging, cascade prevention, info disclosure |
| **MEDIUM** | 4 | Rate limiting, backoff math, DB connection, size limits |
| **LOW** | 3 | Timing attacks, request size, log injection |
**Bottom Line:** The replication scaffolding is good, but the security-critical parts (auth, replay protection, audit) are TODO stubs. **Do not deploy to production** until CRITICAL and HIGH issues are resolved.