345 lines
9.8 KiB
Markdown
345 lines
9.8 KiB
Markdown
# Security Review: Replication Function (Commercial Edition)
|
|
|
|
**Scope:** Event-driven async replication between Primary and Backup POPs
|
|
**Files Reviewed:**
|
|
- `edition/config.go` - Config loading and validation
|
|
- `edition/replication.go` - Replication worker and client
|
|
- `edition/backup_mode.go` - Backup mode detection and write rejection
|
|
- `lib/dbcore.go` - Dirty marking functions
|
|
|
|
---
|
|
|
|
## CRITICAL ISSUES (Fix Before Production)
|
|
|
|
### 1. MISSING: Inter-POP Authentication Implementation
|
|
**Risk:** Unauthorized POPs could push malicious data to backups
|
|
**Location:** `edition/replication.go:150` (stub)
|
|
|
|
**Current State:**
|
|
```go
|
|
// TODO: POST to backup POP
|
|
// No authentication implemented
|
|
```
|
|
|
|
**Required:**
|
|
- mTLS (client cert) OR
|
|
- Shared secret in `Authorization: Bearer <token>` header
|
|
- Token must be read from file (not hardcoded)
|
|
|
|
**Fix:**
|
|
```go
|
|
req.Header.Set("Authorization", "Bearer " + loadTokenFromFile(w.config.Auth.TokenFile))
|
|
// OR use TLS client certificates
|
|
```
|
|
|
|
### 2. MISSING: Backup-Side Request Authentication
|
|
**Risk:** Anyone with network access could POST to `/api/replication/apply`
|
|
**Location:** Not yet implemented (backup receive endpoint)
|
|
|
|
**Required on Backup:**
|
|
- Verify `Authorization` header matches shared secret
|
|
- Verify source POP ID is in authorized list
|
|
- mTLS client cert validation (if using mTLS)
|
|
|
|
### 3. WEAK: Backup Mode Detection Uses Environment Variable
|
|
**Risk:** `CLAVITOR_BACKUP_MODE=true` can be set accidentally or maliciously
|
|
**Location:** `edition/backup_mode.go:23`
|
|
|
|
**Current:**
|
|
```go
|
|
if os.Getenv("CLAVITOR_BACKUP_MODE") == "true" {
|
|
return true
|
|
}
|
|
```
|
|
|
|
**Problem:**
|
|
- Env vars are visible to all processes (`ps e`)
|
|
- No validation that this POP is actually a backup
|
|
- Could cause split-brain if primary thinks it's backup
|
|
|
|
**Fix:**
|
|
- Read from config file only (`replication.yaml` with `role: backup`)
|
|
- Use control plane confirmation (call HQ to verify role)
|
|
- Remove env var option entirely
|
|
|
|
### 4. MISSING: Replay Attack Protection
|
|
**Risk:** Attacker could replay old replication requests
|
|
**Location:** `edition/replication.go:173-177`
|
|
|
|
**Current Request:**
|
|
```go
|
|
type ReplicationRequest struct {
|
|
SourcePOP string `json:"source_pop"`
|
|
Entries []ReplicatedEntry `json:"entries"`
|
|
Timestamp int64 `json:"timestamp"`
|
|
}
|
|
```
|
|
|
|
**Missing:**
|
|
- No nonce/unique ID per request
|
|
- No request signature
|
|
- No max age validation (5-minute window?)
|
|
|
|
**Fix:**
|
|
```go
|
|
type ReplicationRequest struct {
|
|
SourcePOP string `json:"source_pop"`
|
|
Entries []ReplicatedEntry `json:"entries"`
|
|
Timestamp int64 `json:"timestamp"`
|
|
Nonce string `json:"nonce"` // Random 16 bytes
|
|
Signature string `json:"signature"` // HMAC-SHA256 of body
|
|
}
|
|
```
|
|
|
|
### 5. HIGH: Shared Secret Token Validation Weak
|
|
**Risk:** Token file existence checked, but not token format/entropy
|
|
**Location:** `edition/config.go:58`
|
|
|
|
**Current:**
|
|
```go
|
|
if _, err := os.Stat(tokenFile); err != nil {
|
|
return nil, fmt.Errorf("auth token file not found: %s", tokenFile)
|
|
}
|
|
```
|
|
|
|
**Missing:**
|
|
- No token entropy validation (is it 32+ random bytes?)
|
|
- No token rotation mechanism
|
|
- No token expiration
|
|
|
|
**Fix:**
|
|
```go
|
|
token, err := os.ReadFile(tokenFile)
|
|
if len(token) < 32 {
|
|
return nil, fmt.Errorf("token must be at least 32 bytes")
|
|
}
|
|
// Check entropy (not all zeros, not repetitive)
|
|
```
|
|
|
|
---
|
|
|
|
## HIGH SEVERITY ISSUES
|
|
|
|
### 6. MISSING: HTTPS Certificate Validation Disabled?
|
|
**Risk:** MITM attack between primary and backup
|
|
**Location:** `edition/replication.go:150` (stub)
|
|
|
|
**Concern:** If using self-signed certs or `InsecureSkipVerify: true`
|
|
|
|
**Required:**
|
|
- Proper TLS cert validation
|
|
- Pin backup POP's cert or use internal CA
|
|
- Never `InsecureSkipVerify: true`
|
|
|
|
### 7. NO AUDIT LOGGING OF REPLICATION
|
|
**Risk:** Cannot detect unauthorized replication attempts
|
|
**Location:** Missing throughout
|
|
|
|
**Required:**
|
|
- Log every replication attempt (success and failure)
|
|
- Include: source POP ID, entry count, timestamp, auth result
|
|
- Alert on auth failures (possible attack)
|
|
|
|
### 8. CASCADE REPLICATION NOT PREVENTED
|
|
**Risk:** Backup could accidentally re-replicate to another POP, causing loops
|
|
**Location:** `edition/replication.go`
|
|
|
|
**Current:** Worker only checks `Role != "primary"`, but doesn't prevent config errors
|
|
|
|
**Fix:**
|
|
```go
|
|
// In startReplication():
|
|
if cfg.Role == "backup" {
|
|
log.Printf("Backup POP: replication worker disabled (receive-only)")
|
|
return // Don't start worker
|
|
}
|
|
```
|
|
|
|
### 9. INFORMATION DISCLOSURE IN ERRORS
|
|
**Risk:** Error messages leak internal paths and structure
|
|
**Location:** `edition/config.go:23, 28, 59`
|
|
|
|
**Current:**
|
|
```go
|
|
return nil, fmt.Errorf("cannot read replication config: %w", err)
|
|
return nil, fmt.Errorf("auth token file not found: %s", tokenFile)
|
|
```
|
|
|
|
**Problem:**
|
|
- Exposes `/etc/clavitor/replication.yaml` path
|
|
- Exposes `/etc/clavitor/replication.key` path
|
|
- Could help attacker locate secrets
|
|
|
|
**Fix:**
|
|
```go
|
|
return nil, fmt.Errorf("replication configuration error")
|
|
// Log full details internally only
|
|
```
|
|
|
|
---
|
|
|
|
## MEDIUM SEVERITY ISSUES
|
|
|
|
### 10. NO RATE LIMITING ON REPLICATION ENDPOINT
|
|
**Risk:** DoS via flood of replication requests
|
|
**Location:** Not implemented (backup receive endpoint)
|
|
|
|
**Required:**
|
|
- Rate limit `/api/replication/apply` per source POP
|
|
- Max 1 request/second per POP
|
|
- Ban source POP after 10 failed auth attempts
|
|
|
|
### 11. BACKOFF CALCULATION FLOAT CONVERSION
|
|
**Risk:** Minor - could panic or behave unexpectedly
|
|
**Location:** `edition/replication.go:109`
|
|
|
|
**Current:**
|
|
```go
|
|
backoff := time.Duration(math.Pow(5, float64(attempt-1))) * time.Second
|
|
```
|
|
|
|
**Issue:** For attempt=5, result is 625 seconds = 10.4 minutes, but truncated
|
|
|
|
**Fix:** Use integer math:
|
|
```go
|
|
delays := []time.Duration{1*time.Second, 5*time.Second, 25*time.Second, 125*time.Second, 625*time.Second}
|
|
if attempt > 0 && attempt <= len(delays) {
|
|
backoff = delays[attempt-1]
|
|
}
|
|
```
|
|
|
|
### 12. WORKER DB CONNECTION SEPARATE FROM MAIN
|
|
**Risk:** Inconsistent state if main DB has different connection params
|
|
**Location:** `edition/replication.go:56-61`
|
|
|
|
**Current:**
|
|
```go
|
|
dbPath := dataDir + "/clavitor-*.db" // TODO: proper vault lookup
|
|
db, err := lib.OpenDB(dbPath)
|
|
```
|
|
|
|
**Issue:** Replication worker uses separate DB connection from main handlers. If WAL mode settings differ, could cause lock issues.
|
|
|
|
**Fix:** Pass existing DB connection from main, or ensure identical settings.
|
|
|
|
### 13. NO MAX DIRTY ENTRY LIMIT
|
|
**Risk:** Memory exhaustion if backup down for long time
|
|
**Location:** `edition/replication.go:142`
|
|
|
|
**Current:**
|
|
```go
|
|
entries, err := lib.EntryListDirty(w.db, w.config.Replication.BatchSize)
|
|
```
|
|
|
|
**Issue:** Only limited by batch size, but if replication fails repeatedly, entries accumulate in DB (not memory). **Actually OK** - dirty entries stay in DB, not memory.
|
|
|
|
**Status:** Acceptable (disk-bound, not memory-bound)
|
|
|
|
---
|
|
|
|
## LOW SEVERITY / DEFENSE IN DEPTH
|
|
|
|
### 14. TIMING ATTACK ON TOKEN COMPARISON
|
|
**Risk:** Side-channel attack if comparing tokens naively
|
|
**Location:** Not implemented yet (backup auth)
|
|
|
|
**Fix:** Use constant-time comparison:
|
|
```go
|
|
if subtle.ConstantTimeCompare(receivedToken, expectedToken) != 1 {
|
|
return authFailed
|
|
}
|
|
```
|
|
|
|
### 15. NO REQUEST SIZE LIMITS ON BACKUP ENDPOINT
|
|
**Risk:** Large request could cause OOM
|
|
**Location:** Not implemented (backup receive)
|
|
|
|
**Fix:** `http.MaxBytesReader(w, r.Body, maxReplicationBodySize)`
|
|
|
|
### 16. LOG INJECTION VIA ENTRY DATA
|
|
**Risk:** If entry titles contain format strings or escape sequences
|
|
**Location:** `edition/replication.go:154`
|
|
|
|
**Current:**
|
|
```go
|
|
log.Printf("Replicating %d entries to %s", len(entries), w.config.BackupPOP.URL)
|
|
```
|
|
|
|
**Issue:** If somehow user-controlled data gets into log, could inject log formatters
|
|
|
|
**Fix:** Use structured logging, never printf with user data
|
|
|
|
---
|
|
|
|
## SECURITY RECOMMENDATIONS
|
|
|
|
### Immediate (Before Production)
|
|
|
|
1. **Implement mTLS between POPs** - Most secure option, no shared secrets
|
|
2. **Add request signing with HMAC** - Prevent replays
|
|
3. **Add replication audit logging** - Security monitoring
|
|
4. **Remove env var backup mode** - Config file only
|
|
5. **Add rate limiting** - Prevent DoS
|
|
|
|
### Short Term
|
|
|
|
6. **Token rotation mechanism** - Hot-swap without restart
|
|
7. **Control plane integration** - Central authority for POP roles
|
|
8. **Add security headers** - `X-Content-Type-Options`, etc.
|
|
|
|
### Design Improvements
|
|
|
|
9. **Consider using ed25519 signatures** instead of shared secrets
|
|
10. **Add replication watermark/epoch** - For detecting gaps
|
|
|
|
---
|
|
|
|
## SECURE CONFIGURATION EXAMPLE
|
|
|
|
```yaml
|
|
# /etc/clavitor/replication.yaml
|
|
pop_id: "calgary-01"
|
|
region: "north-america"
|
|
role: "primary"
|
|
|
|
backup_pop:
|
|
id: "zurich-01"
|
|
url: "https://zurich-01.clavitor.ai:8443" # TLS 1.3 only
|
|
# No auth_token_file here - use mTLS instead
|
|
|
|
auth:
|
|
method: "mtls" # or "bearer"
|
|
# For mTLS:
|
|
mtls_cert: "/etc/clavitor/replication.crt"
|
|
mtls_key: "/etc/clavitor/replication.key"
|
|
mtls_ca: "/etc/clavitor/ca.crt" # Verify backup cert
|
|
# For bearer token:
|
|
# token_file: "/etc/clavitor/replication.key"
|
|
# token_rotation_interval: 86400 # 24 hours
|
|
|
|
replication:
|
|
batch_size: 100
|
|
max_retries: 5
|
|
request_timeout: 30 # seconds
|
|
max_request_age: 300 # 5 minutes (anti-replay)
|
|
|
|
security:
|
|
audit_log: "/var/log/clavitor/replication.log"
|
|
rate_limit_per_minute: 60
|
|
ban_after_failures: 10
|
|
ban_duration: 3600 # 1 hour
|
|
```
|
|
|
|
---
|
|
|
|
## SUMMARY
|
|
|
|
| Risk Level | Count | Issues |
|
|
|------------|-------|--------|
|
|
| **CRITICAL** | 5 | Auth not implemented, replay attacks, env var backup mode, weak token validation, MITM risk |
|
|
| **HIGH** | 4 | HTTPS validation, audit logging, cascade prevention, info disclosure |
|
|
| **MEDIUM** | 4 | Rate limiting, backoff math, DB connection, size limits |
|
|
| **LOW** | 3 | Timing attacks, request size, log injection |
|
|
|
|
**Bottom Line:** The replication scaffolding is good, but the security-critical parts (auth, replay protection, audit) are TODO stubs. **Do not deploy to production** until CRITICAL and HIGH issues are resolved.
|