114 lines
3.6 KiB
Markdown
114 lines
3.6 KiB
Markdown
# Issue: Database errors not properly handled in updateSpan
|
|
|
|
**Domain:** clavis-telemetry
|
|
**Assignee:** @hans
|
|
**Labels:** `violation`, `cardinal-rule-part-1`, `error-handling`, `sql`
|
|
**Priority:** High
|
|
**Date:** 2026-04-08
|
|
|
|
---
|
|
|
|
## Violation
|
|
|
|
**Cardinal Rule Violated:** Part 1 — "Mandatory error handling with unique codes" AND "Every `if` needs an `else`"
|
|
|
|
Per CLAVITOR-AGENT-HANDBOOK.md Part 1:
|
|
> Mandatory error handling with unique codes:
|
|
> - Every `if` needs an `else`. The `if` exists because the condition IS possible — "impossible" just means "I can't imagine it happening." The else handles the case where your imagination was wrong.
|
|
> - Use unique error codes: `ToLog("ERR-12345: L3 unavailable in decrypt")`
|
|
|
|
---
|
|
|
|
## Location
|
|
|
|
File: `clavis/clavis-telemetry/main.go`
|
|
|
|
Function: `updateSpan()` (lines 323-358)
|
|
|
|
**Violation 1:** Line 328 — Error from `QueryRow` not checked:
|
|
```go
|
|
db.QueryRow(`SELECT COUNT(*) > 0 FROM maintenance WHERE end_at IS NULL`).Scan(&inMaint)
|
|
```
|
|
This returns an error that is silently ignored.
|
|
|
|
**Violation 2:** Line 332 — Error from `QueryRow` checked but not logged with unique code:
|
|
```go
|
|
err := db.QueryRow(`SELECT id, end_at FROM uptime_spans WHERE node_id = ? ORDER BY end_at DESC LIMIT 1`, nodeID).Scan(&spanID, &spanEnd)
|
|
```
|
|
The `err` is used for flow control but never logged with an error code when it's non-nil.
|
|
|
|
**Violation 3:** Line 335 — Error from `Exec` not checked:
|
|
```go
|
|
db.Exec(`UPDATE uptime_spans SET end_at = ? WHERE id = ?`, now, spanID)
|
|
```
|
|
|
|
**Violation 4:** Line 338 — Error from `Exec` not checked:
|
|
```go
|
|
db.Exec(`UPDATE uptime_spans SET end_at = ? WHERE id = ?`, now, spanID)
|
|
```
|
|
|
|
**Violation 5:** Line 354 — Error from `Exec` not checked:
|
|
```go
|
|
db.Exec(`INSERT INTO uptime_spans (node_id, start_at, end_at) VALUES (?, ?, ?)`, nodeID, now, now)
|
|
```
|
|
|
|
---
|
|
|
|
## Why This Matters
|
|
|
|
In telemetry/ops infrastructure, silent database failures mean:
|
|
1. Outage tracking becomes unreliable
|
|
2. Maintenance mode detection fails silently
|
|
3. Uptime span data becomes corrupt without anyone knowing
|
|
4. No forensic trail when investigating incidents
|
|
|
|
When the "impossible" happens (disk full, DB locked, schema mismatch), we need to know immediately with a unique error code.
|
|
|
|
---
|
|
|
|
## Required Fix
|
|
|
|
1. Check and handle ALL `db.QueryRow()` errors with unique codes
|
|
2. Check and handle ALL `db.Exec()` errors with unique codes
|
|
3. Log errors with format: `ERR-TELEMETRY-XXX: <context> - <details>`
|
|
4. Consider whether failures should affect the span tracking logic
|
|
|
|
---
|
|
|
|
## Example Fix
|
|
|
|
```go
|
|
func updateSpan(...) {
|
|
now := time.Now().Unix()
|
|
serverAge := now - processStartTime
|
|
|
|
var inMaint bool
|
|
err := db.QueryRow(`SELECT COUNT(*) > 0 FROM maintenance WHERE end_at IS NULL`).Scan(&inMaint)
|
|
if err != nil {
|
|
log.Printf("ERR-TELEMETRY-010: Failed to check maintenance mode - %v", err)
|
|
// Continue with inMaint=false as safe default
|
|
}
|
|
|
|
var spanID int64
|
|
var spanEnd int64
|
|
err = db.QueryRow(`SELECT id, end_at FROM uptime_spans WHERE node_id = ? ORDER BY end_at DESC LIMIT 1`, nodeID).Scan(&spanID, &spanEnd)
|
|
|
|
// ... rest of function with proper error handling
|
|
}
|
|
```
|
|
|
|
---
|
|
|
|
## Verification Checklist
|
|
|
|
- [ ] All `db.QueryRow()` calls check and log errors with unique codes
|
|
- [ ] All `db.Exec()` calls check and log errors with unique codes
|
|
- [ ] Error handling doesn't break span tracking flow
|
|
- [ ] Test cases added for database failure scenarios
|
|
- [ ] Code reviewed by Yurii before merge
|
|
|
|
---
|
|
|
|
**Reporter:** Yurii (Code & Principle Review)
|
|
**Reference:** CLAVITOR-AGENT-HANDBOOK.md Part 1, "Mandatory error handling with unique codes"
|