# Issue: Missing unique error codes in clavis-telemetry
**Domain:** clavis-telemetry
**Assignee:** @hans
**Labels:** `violation`, `cardinal-rule-part-1`, `error-handling`
**Priority:** Medium
**Date:** 2026-04-08

---
## Violation
**Cardinal Rule Violated:** Part 1 — "Mandatory error handling with unique codes"
Per CLAVITOR-AGENT-HANDBOOK.md Part 1:
> Mandatory error handling with unique codes:
> - Every `if` needs an `else`. The `if` exists because the condition IS possible
> - Use unique error codes: `ToLog("ERR-12345: L3 unavailable in decrypt")`
> - When your "impossible" case triggers in production, you need to know exactly which assumption failed and where.

**Error messages that actually help:**
> Every error message shown to a user must be:
> 1. **Uniquely recognizable** — include an error code: `ERR-12345: ...`
> 2. **Actionable** — the user must know what to do next
> 3. **Routed to the actor who can resolve it**
---
## Location
File: `clavis/clavis-telemetry/main.go`
Lines with violations:
| Line | Current Code | Violation |
|------|--------------|-----------|
| 41 | `log.Fatalf("Failed to open operations.db: %v", err)` | No unique error code |
| 47 | `log.Fatalf("Failed to load CA chain for mTLS: %v", err)` | No unique error code |
| 228 | `log.Printf("Invalid certificate from %s: %v", popID, err)` | No unique error code |
| 337 | `log.Printf("SPAN EXTEND node=%s gap=%ds...")` | No unique error code |
| 342-351 | `log.Printf("OUTAGE SPAN node=%s...")` | No unique error code |
| 367-370 | `log.Printf("OUTAGE SPAN... alerting disabled")` | No unique error code |
| 383 | `log.Printf("OUTAGE SPAN ntfy error creating request: %v", err)` | No unique error code |
| 395 | `log.Printf("OUTAGE SPAN ntfy error sending alert: %v", err)` | No unique error code |
| 398 | `log.Printf("OUTAGE SPAN ntfy alert sent for node=%s", nodeID)` | No unique error code |

File: `clavis/clavis-telemetry/kuma.go`
| Line | Current Code | Violation |
|------|--------------|-----------|
| 56-58 | Silent fail on Kuma push error | Missing error handling entirely |
---
## Required Fix
1. Assign a unique error code to each error path (e.g., `ERR-TELEMETRY-001` through `ERR-TELEMETRY-020`)
2. Format: `ERR-TELEMETRY-XXX: <actionable message>`
3. Include error codes in:
   - Fatal logs (database/CA loading failures)
   - Certificate validation failures
   - External alerting failures (ntfy)
   - Kuma push failures (currently silent)
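
One way to keep the codes unique and consistently formatted is to declare them in a single place and route every error log through a small helper. The sketch below is only an illustration: `ErrCode`, `FormatErr`, `LogErr`, and the specific code assignments are hypothetical and do not exist in `clavis-telemetry` today.

```go
package main

import (
	"fmt"
	"log"
)

// ErrCode is a hypothetical type for telemetry error codes. The assignments
// below are illustrative only, not an agreed registry.
type ErrCode int

const (
	ErrDBOpen      ErrCode = iota + 1 // 001: operations.db could not be opened
	ErrCAChain                        // 002: CA chain for mTLS could not be loaded
	ErrInvalidCert                    // 003: client certificate failed validation
	ErrNtfyRequest                    // 004: ntfy request could not be created
	ErrNtfySend                       // 005: ntfy alert could not be sent
	ErrKumaPush                       // 006: Kuma push failed
)

// String renders the code in the ERR-TELEMETRY-XXX format.
func (c ErrCode) String() string {
	return fmt.Sprintf("ERR-TELEMETRY-%03d", int(c))
}

// FormatErr builds "ERR-TELEMETRY-XXX: <message>".
func FormatErr(code ErrCode, format string, args ...any) string {
	return code.String() + ": " + fmt.Sprintf(format, args...)
}

// LogErr writes a coded, non-fatal error line through the standard logger.
func LogErr(code ErrCode, format string, args ...any) {
	log.Print(FormatErr(code, format, args...))
}
```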
---
## Example Fix
```go
// Before:
log.Fatalf("Failed to open operations.db: %v", err)
// After:
log.Fatalf("ERR-TELEMETRY-001: Failed to open operations.db at %s - %v. Check permissions and disk space.", dbPath, err)
```
---
## Verification Checklist
- [ ] All `log.Fatalf` calls include `ERR-TELEMETRY-XXX` codes
- [ ] All `log.Printf` error logs include `ERR-TELEMETRY-XXX` codes
- [ ] Kuma push errors are no longer silent (`kuma.go`, lines 56-58)
- [ ] Certificate validation failures include error codes
- [ ] External alert failures (ntfy) include error codes
- [ ] Test cases verify error codes appear in output
---
**Reporter:** Yurii (Code & Principle Review)
**Reference:** CLAVITOR-AGENT-HANDBOOK.md Part 1, "Mandatory error handling with unique codes"