# Issue: Missing unique error codes in clavis-telemetry

**Domain:** clavis-telemetry

**Assignee:** @hans

**Labels:** `violation`, `cardinal-rule-part-1`, `error-handling`

**Priority:** Medium

**Date:** 2026-04-08

---

## Violation

**Cardinal Rule Violated:** Part 1, "Mandatory error handling with unique codes"

Per CLAVITOR-AGENT-HANDBOOK.md Part 1:

> Mandatory error handling with unique codes:
>
> - Every `if` needs an `else`. The `if` exists because the condition IS possible
> - Use unique error codes: `ToLog("ERR-12345: L3 unavailable in decrypt")`
> - When your "impossible" case triggers in production, you need to know exactly which assumption failed and where.

**Error messages that actually help:**

> Every error message shown to a user must be:
>
> 1. **Uniquely recognizable** — include an error code: `ERR-12345: ...`
> 2. **Actionable** — the user must know what to do next
> 3. **Routed to the actor who can resolve it**

---

## Location

File: `clavis/clavis-telemetry/main.go`

Lines with violations:

| Line | Current Code | Violation |
|------|--------------|-----------|
| 41 | `log.Fatalf("Failed to open operations.db: %v", err)` | No unique error code |
| 47 | `log.Fatalf("Failed to load CA chain for mTLS: %v", err)` | No unique error code |
| 228 | `log.Printf("Invalid certificate from %s: %v", popID, err)` | No unique error code |
| 337 | `log.Printf("SPAN EXTEND node=%s gap=%ds...")` | No unique error code |
| 342-351 | `log.Printf("OUTAGE SPAN node=%s...")` | No unique error code |
| 367-370 | `log.Printf("OUTAGE SPAN... alerting disabled")` | No unique error code |
| 383 | `log.Printf("OUTAGE SPAN ntfy error creating request: %v", err)` | No unique error code |
| 395 | `log.Printf("OUTAGE SPAN ntfy error sending alert: %v", err)` | No unique error code |
| 398 | `log.Printf("OUTAGE SPAN ntfy alert sent for node=%s", nodeID)` | No unique error code |
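
The rows above could also be re-derived mechanically, which helps keep the list current as line numbers shift. This sketch uses the standard `go/parser` and `go/ast` packages; `findUncodedLogs` is a hypothetical helper that reports `log.Printf` and `log.Fatalf` calls whose format string literal does not start with `ERR-TELEMETRY-`. Calls whose format string is not a plain literal (for example, one built by concatenating a constant) are skipped, so the result is a starting point rather than a proof.

```go
package main

import (
	"go/ast"
	"go/parser"
	"go/token"
	"strconv"
	"strings"
)

// findUncodedLogs parses Go source and returns the line numbers of
// log.Printf / log.Fatalf calls whose literal format string does not
// start with "ERR-TELEMETRY-". Non-literal format strings are skipped.
func findUncodedLogs(filename, src string) ([]int, error) {
	fset := token.NewFileSet()
	file, err := parser.ParseFile(fset, filename, src, 0)
	if err != nil {
		return nil, err
	}
	var lines []int
	ast.Inspect(file, func(n ast.Node) bool {
		call, ok := n.(*ast.CallExpr)
		if !ok || len(call.Args) == 0 {
			return true
		}
		sel, ok := call.Fun.(*ast.SelectorExpr)
		if !ok {
			return true
		}
		pkg, ok := sel.X.(*ast.Ident)
		if !ok || pkg.Name != "log" || (sel.Sel.Name != "Printf" && sel.Sel.Name != "Fatalf") {
			return true
		}
		lit, ok := call.Args[0].(*ast.BasicLit)
		if !ok || lit.Kind != token.STRING {
			return true
		}
		// Unquote handles both "interpreted" and `raw` string literals.
		format, err := strconv.Unquote(lit.Value)
		if err == nil && !strings.HasPrefix(format, "ERR-TELEMETRY-") {
			lines = append(lines, fset.Position(call.Pos()).Line)
		}
		return true
	})
	return lines, nil
}
```

The scan does not distinguish error logs from informational ones such as the `ntfy alert sent` message, so its output needs the same judgment that went into the table.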

File: `clavis/clavis-telemetry/kuma.go`

| Line | Current Code | Violation |
|------|--------------|-----------|
| 56-58 | Silent failure on Kuma push error | Missing error handling entirely |

---

## Required Fix

1. Assign a unique error code to each error path (e.g., `ERR-TELEMETRY-001` through `ERR-TELEMETRY-020`).
2. Use the format `ERR-TELEMETRY-XXX: <actionable message>`.
3. Include error codes in:
   - Fatal logs (database and CA loading failures)
   - Certificate validation failures
   - External alerting failures (ntfy)
   - Kuma push failures (currently silent)
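
A minimal sketch of declaring the codes in one place, so each error path gets exactly one. The constant names and the mapping of numbers to call sites are hypothetical; the issue only fixes the `ERR-TELEMETRY-XXX` format. `validateCodes` lets a unit test check that every declared code is well-formed and unique.

```go
package main

import (
	"fmt"
	"regexp"
)

// Hypothetical code assignments; the numbering is illustrative only.
const (
	ErrOpenOperationsDB = "ERR-TELEMETRY-001" // main.go:41
	ErrLoadCAChain      = "ERR-TELEMETRY-002" // main.go:47
	ErrInvalidPopCert   = "ERR-TELEMETRY-003" // main.go:228
	ErrNtfyNewRequest   = "ERR-TELEMETRY-004" // main.go:383
	ErrNtfySend         = "ERR-TELEMETRY-005" // main.go:395
	ErrKumaPushRequest  = "ERR-TELEMETRY-006" // kuma.go:56-58
	ErrKumaPushSend     = "ERR-TELEMETRY-007" // kuma.go:56-58
	ErrKumaPushStatus   = "ERR-TELEMETRY-008" // kuma.go:56-58
)

var allCodes = []string{
	ErrOpenOperationsDB, ErrLoadCAChain, ErrInvalidPopCert, ErrNtfyNewRequest,
	ErrNtfySend, ErrKumaPushRequest, ErrKumaPushSend, ErrKumaPushStatus,
}

var codePattern = regexp.MustCompile(`^ERR-TELEMETRY-\d{3}$`)

// validateCodes returns an error for the first malformed or duplicated code.
func validateCodes(codes []string) error {
	seen := make(map[string]bool, len(codes))
	for _, c := range codes {
		if !codePattern.MatchString(c) {
			return fmt.Errorf("malformed error code %q", c)
		} else if seen[c] {
			return fmt.Errorf("duplicate error code %q", c)
		} else {
			seen[c] = true
		}
	}
	return nil
}
```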

---

## Example Fix

```go
// Before:
log.Fatalf("Failed to open operations.db: %v", err)

// After (dbPath stands for whatever variable holds the database path):
log.Fatalf("ERR-TELEMETRY-001: Failed to open operations.db at %s: %v. Check file permissions and disk space.", dbPath, err)
```
---

## Verification Checklist

- [ ] All `log.Fatalf` calls include `ERR-TELEMETRY-XXX` codes
- [ ] All `log.Printf` error logs include `ERR-TELEMETRY-XXX` codes
- [ ] Kuma push errors are no longer silent (`kuma.go`, lines 56-58)
- [ ] Certificate validation failures include error codes
- [ ] External alert failures (ntfy) include error codes
- [ ] Test cases verify that error codes appear in log output

---

**Reporter:** Yurii (Code & Principle Review)

**Reference:** CLAVITOR-AGENT-HANDBOOK.md Part 1, "Mandatory error handling with unique codes"