3.6 KiB
Issue: Database errors not properly handled in updateSpan
Domain: clavis-telemetry
Assignee: @hans
Labels: violation, cardinal-rule-part-1, error-handling, sql
Priority: High
Date: 2026-04-08
Violation
Cardinal Rule Violated: Part 1 — "Mandatory error handling with unique codes" AND "Every if needs an else"
Per CLAVITOR-AGENT-HANDBOOK.md Part 1:
Mandatory error handling with unique codes:
- Every
ifneeds anelse. Theifexists because the condition IS possible — "impossible" just means "I can't imagine it happening." The else handles the case where your imagination was wrong.- Use unique error codes:
ToLog("ERR-12345: L3 unavailable in decrypt")
Location
File: clavis/clavis-telemetry/main.go
Function: updateSpan() (lines 323-358)
Violation 1: Line 328 — Error from QueryRow not checked:
db.QueryRow(`SELECT COUNT(*) > 0 FROM maintenance WHERE end_at IS NULL`).Scan(&inMaint)
This returns an error that is silently ignored.
Violation 2: Line 332 — Error from QueryRow checked but not logged with unique code:
err := db.QueryRow(`SELECT id, end_at FROM uptime_spans WHERE node_id = ? ORDER BY end_at DESC LIMIT 1`, nodeID).Scan(&spanID, &spanEnd)
The err is used for flow control but never logged with an error code when it's non-nil.
Violation 3: Line 335 — Error from Exec not checked:
db.Exec(`UPDATE uptime_spans SET end_at = ? WHERE id = ?`, now, spanID)
Violation 4: Line 338 — Error from Exec not checked:
db.Exec(`UPDATE uptime_spans SET end_at = ? WHERE id = ?`, now, spanID)
Violation 5: Line 354 — Error from Exec not checked:
db.Exec(`INSERT INTO uptime_spans (node_id, start_at, end_at) VALUES (?, ?, ?)`, nodeID, now, now)
Why This Matters
In telemetry/ops infrastructure, silent database failures mean:
- Outage tracking becomes unreliable
- Maintenance mode detection fails silently
- Uptime span data becomes corrupt without anyone knowing
- No forensic trail when investigating incidents
When the "impossible" happens (disk full, DB locked, schema mismatch), we need to know immediately with a unique error code.
Required Fix
- Check and handle ALL
db.QueryRow()errors with unique codes - Check and handle ALL
db.Exec()errors with unique codes - Log errors with format:
ERR-TELEMETRY-XXX: <context> - <details> - Consider whether failures should affect the span tracking logic
Example Fix
func updateSpan(...) {
now := time.Now().Unix()
serverAge := now - processStartTime
var inMaint bool
err := db.QueryRow(`SELECT COUNT(*) > 0 FROM maintenance WHERE end_at IS NULL`).Scan(&inMaint)
if err != nil {
log.Printf("ERR-TELEMETRY-010: Failed to check maintenance mode - %v", err)
// Continue with inMaint=false as safe default
}
var spanID int64
var spanEnd int64
err = db.QueryRow(`SELECT id, end_at FROM uptime_spans WHERE node_id = ? ORDER BY end_at DESC LIMIT 1`, nodeID).Scan(&spanID, &spanEnd)
// ... rest of function with proper error handling
}
Verification Checklist
- All
db.QueryRow()calls check and log errors with unique codes - All
db.Exec()calls check and log errors with unique codes - Error handling doesn't break span tracking flow
- Test cases added for database failure scenarios
- Code reviewed by Yurii before merge
Reporter: Yurii (Code & Principle Review)
Reference: CLAVITOR-AGENT-HANDBOOK.md Part 1, "Mandatory error handling with unique codes"