clavitor/issues/002-unhandled-db-errors.md

114 lines
3.6 KiB
Markdown

# Issue: Database errors not properly handled in updateSpan
**Domain:** clavis-telemetry
**Assignee:** @hans
**Labels:** `violation`, `cardinal-rule-part-1`, `error-handling`, `sql`
**Priority:** High
**Date:** 2026-04-08
---
## Violation
**Cardinal Rule Violated:** Part 1 — "Mandatory error handling with unique codes" AND "Every `if` needs an `else`"
Per CLAVITOR-AGENT-HANDBOOK.md Part 1:
> Mandatory error handling with unique codes:
> - Every `if` needs an `else`. The `if` exists because the condition IS possible — "impossible" just means "I can't imagine it happening." The else handles the case where your imagination was wrong.
> - Use unique error codes: `ToLog("ERR-12345: L3 unavailable in decrypt")`
---
## Location
File: `clavis/clavis-telemetry/main.go`
Function: `updateSpan()` (lines 323-358)
**Violation 1:** Line 328 — Error from `QueryRow` not checked:
```go
db.QueryRow(`SELECT COUNT(*) > 0 FROM maintenance WHERE end_at IS NULL`).Scan(&inMaint)
```
This returns an error that is silently ignored.
**Violation 2:** Line 332 — Error from `QueryRow` checked but not logged with unique code:
```go
err := db.QueryRow(`SELECT id, end_at FROM uptime_spans WHERE node_id = ? ORDER BY end_at DESC LIMIT 1`, nodeID).Scan(&spanID, &spanEnd)
```
The `err` is used for flow control but never logged with an error code when it's non-nil.
**Violation 3:** Line 335 — Error from `Exec` not checked:
```go
db.Exec(`UPDATE uptime_spans SET end_at = ? WHERE id = ?`, now, spanID)
```
**Violation 4:** Line 338 — Error from `Exec` not checked:
```go
db.Exec(`UPDATE uptime_spans SET end_at = ? WHERE id = ?`, now, spanID)
```
**Violation 5:** Line 354 — Error from `Exec` not checked:
```go
db.Exec(`INSERT INTO uptime_spans (node_id, start_at, end_at) VALUES (?, ?, ?)`, nodeID, now, now)
```
---
## Why This Matters
In telemetry/ops infrastructure, silent database failures mean:
1. Outage tracking becomes unreliable
2. Maintenance mode detection fails silently
3. Uptime span data becomes corrupt without anyone knowing
4. No forensic trail when investigating incidents
When the "impossible" happens (disk full, DB locked, schema mismatch), we need to know immediately with a unique error code.
---
## Required Fix
1. Check and handle ALL `db.QueryRow()` errors with unique codes
2. Check and handle ALL `db.Exec()` errors with unique codes
3. Log errors with format: `ERR-TELEMETRY-XXX: <context> - <details>`
4. Consider whether failures should affect the span tracking logic
---
## Example Fix
```go
func updateSpan(...) {
now := time.Now().Unix()
serverAge := now - processStartTime
var inMaint bool
err := db.QueryRow(`SELECT COUNT(*) > 0 FROM maintenance WHERE end_at IS NULL`).Scan(&inMaint)
if err != nil {
log.Printf("ERR-TELEMETRY-010: Failed to check maintenance mode - %v", err)
// Continue with inMaint=false as safe default
}
var spanID int64
var spanEnd int64
err = db.QueryRow(`SELECT id, end_at FROM uptime_spans WHERE node_id = ? ORDER BY end_at DESC LIMIT 1`, nodeID).Scan(&spanID, &spanEnd)
// ... rest of function with proper error handling
}
```
---
## Verification Checklist
- [ ] All `db.QueryRow()` calls check and log errors with unique codes
- [ ] All `db.Exec()` calls check and log errors with unique codes
- [ ] Error handling doesn't break span tracking flow
- [ ] Test cases added for database failure scenarios
- [ ] Code reviewed by Yurii before merge
---
**Reporter:** Yurii (Code & Principle Review)
**Reference:** CLAVITOR-AGENT-HANDBOOK.md Part 1, "Mandatory error handling with unique codes"