clavitor/issues/002-unhandled-db-errors.md

3.6 KiB

Issue: Database errors not properly handled in updateSpan

Domain: clavis-telemetry
Assignee: @hans
Labels: violation, cardinal-rule-part-1, error-handling, sql
Priority: High
Date: 2026-04-08


Violation

Cardinal Rule Violated: Part 1 — "Mandatory error handling with unique codes" AND "Every if needs an else"

Per CLAVITOR-AGENT-HANDBOOK.md Part 1:

Mandatory error handling with unique codes:

  • Every if needs an else. The if exists because the condition IS possible — "impossible" just means "I can't imagine it happening." The else handles the case where your imagination was wrong.
  • Use unique error codes: ToLog("ERR-12345: L3 unavailable in decrypt")

Location

File: clavis/clavis-telemetry/main.go

Function: updateSpan() (lines 323-358)

Violation 1: Line 328 — Error from QueryRow not checked:

db.QueryRow(`SELECT COUNT(*) > 0 FROM maintenance WHERE end_at IS NULL`).Scan(&inMaint)

This returns an error that is silently ignored.

Violation 2: Line 332 — Error from QueryRow checked but not logged with unique code:

err := db.QueryRow(`SELECT id, end_at FROM uptime_spans WHERE node_id = ? ORDER BY end_at DESC LIMIT 1`, nodeID).Scan(&spanID, &spanEnd)

The err is used for flow control but never logged with an error code when it's non-nil.

Violation 3: Line 335 — Error from Exec not checked:

db.Exec(`UPDATE uptime_spans SET end_at = ? WHERE id = ?`, now, spanID)

Violation 4: Line 338 — Error from Exec not checked:

db.Exec(`UPDATE uptime_spans SET end_at = ? WHERE id = ?`, now, spanID)

Violation 5: Line 354 — Error from Exec not checked:

db.Exec(`INSERT INTO uptime_spans (node_id, start_at, end_at) VALUES (?, ?, ?)`, nodeID, now, now)

Why This Matters

In telemetry/ops infrastructure, silent database failures mean:

  1. Outage tracking becomes unreliable
  2. Maintenance mode detection fails silently
  3. Uptime span data becomes corrupt without anyone knowing
  4. No forensic trail when investigating incidents

When the "impossible" happens (disk full, DB locked, schema mismatch), we need to know immediately with a unique error code.


Required Fix

  1. Check and handle ALL db.QueryRow() errors with unique codes
  2. Check and handle ALL db.Exec() errors with unique codes
  3. Log errors with format: ERR-TELEMETRY-XXX: <context> - <details>
  4. Consider whether failures should affect the span tracking logic

Example Fix

func updateSpan(...) {
    now := time.Now().Unix()
    serverAge := now - processStartTime

    var inMaint bool
    err := db.QueryRow(`SELECT COUNT(*) > 0 FROM maintenance WHERE end_at IS NULL`).Scan(&inMaint)
    if err != nil {
        log.Printf("ERR-TELEMETRY-010: Failed to check maintenance mode - %v", err)
        // Continue with inMaint=false as safe default
    }

    var spanID int64
    var spanEnd int64
    err = db.QueryRow(`SELECT id, end_at FROM uptime_spans WHERE node_id = ? ORDER BY end_at DESC LIMIT 1`, nodeID).Scan(&spanID, &spanEnd)
    
    // ... rest of function with proper error handling
}

Verification Checklist

  • All db.QueryRow() calls check and log errors with unique codes
  • All db.Exec() calls check and log errors with unique codes
  • Error handling doesn't break span tracking flow
  • Test cases added for database failure scenarios
  • Code reviewed by Yurii before merge

Reporter: Yurii (Code & Principle Review)
Reference: CLAVITOR-AGENT-HANDBOOK.md Part 1, "Mandatory error handling with unique codes"