# Issue: Silent failure in Kuma push — no error handling
**Domain:** clavis-telemetry
**Assignee:** @hans
**Labels:** `violation`, `cardinal-rule-part-1`, `error-handling`, `silent-failure`
**Priority:** High
**Date:** 2026-04-08

---
## Violation
**Cardinal Rule Violated:** Part 1 — "Mandatory error handling with unique codes" AND Part 1 — "Silent fallbacks are not fixes"
Per CLAVITOR-AGENT-HANDBOOK.md Part 1:
> Quick fixes are not fixes. A "temporary" hack that ships is permanent.
> Every `if` needs an `else`. The `if` exists because the condition IS possible.
---
## Location
File: `clavis/clavis-telemetry/kuma.go`
Lines 53-60:
```go
// POST to Kuma
payload := `{"status":"` + status + `","msg":"` + strings.ReplaceAll(msg, `"`, `\"`) + `","ping":60}`
resp, err := http.Post(kumaURL, "application/json", strings.NewReader(payload))
if err != nil {
	// Silent fail - Kuma will detect silence as down
	return
}
resp.Body.Close()
```
---
## The Violation
1. **Silent failure:** The error is caught and completely ignored with only a comment
2. **No error code:** No `ERR-XXXXX` code for operational forensics
3. **No logging:** The failure is invisible in logs
4. **Comment is misleading:** "Kuma will detect silence as down" — but operators won't know WHY Kuma shows down
---
## Why This Matters
When Kuma shows "down", operators need to know if it's because:
- The telemetry service is actually down (DB failure)
- The telemetry service can't reach Kuma (network issue)
- Kuma itself is having issues
Silent failures create blind spots in operational monitoring. The telemetry service could fail to report health for hours, and the only symptom would be Kuma showing red, with no logs explaining why.

---
## Required Fix
1. Log Kuma push failures with a unique error code
2. Include the error details in the log
3. Consider retry logic or backoff (optional)
4. Document the failure mode
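
The optional retry in item 3 could look roughly like the sketch below. It is an illustration under stated assumptions, not the telemetry service's actual code: `pushWithRetry`, `errSimulated`, the attempt count, and the delays are all hypothetical, and `push` stands in for the Kuma POST. Every failed attempt is still logged with the error code, so a retry never hides a failure.

```go
package main

import (
	"errors"
	"fmt"
	"log"
	"time"
)

// errSimulated is a hypothetical error used only to exercise the sketch.
var errSimulated = errors.New("connection refused (simulated)")

// pushWithRetry calls push up to attempts times and doubles the delay after
// each failure. Hypothetical helper: push stands in for the Kuma POST.
func pushWithRetry(push func() error, attempts int, baseDelay time.Duration) error {
	if attempts < 1 {
		attempts = 1 // always try at least once
	}
	var err error
	delay := baseDelay
	for i := 1; i <= attempts; i++ {
		if err = push(); err == nil {
			return nil
		}
		// Log every failed attempt so retries never mask the failure.
		log.Printf("ERR-TELEMETRY-020: Kuma push attempt %d/%d failed - %v", i, attempts, err)
		if i < attempts {
			time.Sleep(delay)
			delay *= 2
		}
	}
	return fmt.Errorf("kuma push failed after %d attempts: %w", attempts, err)
}

func main() {
	calls := 0
	// Simulated push that fails twice, then succeeds.
	flaky := func() error {
		calls++
		if calls < 3 {
			return errSimulated
		}
		return nil
	}
	err := pushWithRetry(flaky, 5, 10*time.Millisecond)
	fmt.Println(calls, err) // prints "3 <nil>"
}
```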
---
## Example Fix
```go
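// POST to Kuma
payload := `{"status":"` + status + `","msg":"` + strings.ReplaceAll(msg, `"`, `\"`) + `","ping":60}`
resp, err := http.Post(kumaURL, "application/json", strings.NewReader(payload))
if err != nil {
	// Transport-level failure: the request never got a response from Kuma.
	log.Printf("ERR-TELEMETRY-020: Failed to push health to Kuma at %s - %v", kumaURL, err)
	return
}
// Close the body on every return path and log a failed close instead of discarding it.
defer func() {
	if cerr := resp.Body.Close(); cerr != nil {
		log.Printf("ERR-TELEMETRY-022: Failed to close Kuma response body - %v", cerr)
	}
}()
if resp.StatusCode != http.StatusOK {
	// Kuma answered but did not accept the push.
	log.Printf("ERR-TELEMETRY-021: Kuma returned non-OK status %d from %s", resp.StatusCode, kumaURL)
}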
```
---
## Additional Issue: resp.Body.Close() Error Ignored
Line 60: `resp.Body.Close()` returns an error that is silently discarded.
Fix:
```go
if err := resp.Body.Close(); err != nil {
	log.Printf("ERR-TELEMETRY-022: Failed to close Kuma response body - %v", err)
}
```
---
## Verification Checklist
- [ ] Kuma push failures logged with `ERR-TELEMETRY-020`
- [ ] Non-OK HTTP responses logged with `ERR-TELEMETRY-021`
- [ ] Response body close errors handled with `ERR-TELEMETRY-022`
- [ ] All errors include actionable context (URL, status, error details)
- [ ] Test case added for Kuma push failure scenario
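
One way to approach the last checklist item is sketched below, using `net/http/httptest` to simulate both an unreachable Kuma and a Kuma that rejects the push. This is a sketch under assumptions: `pushKuma`, `hasCode`, and `checkFailures` are hypothetical names, and `pushKuma` returns coded errors rather than logging them so the checks can assert on the code. In the repository the same checks would more naturally live in a `_test.go` file driven by `testing.T`.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// pushKuma is a hypothetical stand-in for the push in kuma.go. It returns
// coded errors instead of logging them so the checks below can inspect them.
func pushKuma(kumaURL, status, msg string) (err error) {
	payload := `{"status":"` + status + `","msg":"` + strings.ReplaceAll(msg, `"`, `\"`) + `","ping":60}`
	resp, postErr := http.Post(kumaURL, "application/json", strings.NewReader(payload))
	if postErr != nil {
		return fmt.Errorf("ERR-TELEMETRY-020: push to Kuma failed: %w", postErr)
	}
	defer func() {
		// Report a failed close unless an earlier error is already being returned.
		if cerr := resp.Body.Close(); cerr != nil && err == nil {
			err = fmt.Errorf("ERR-TELEMETRY-022: closing Kuma response body: %w", cerr)
		}
	}()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("ERR-TELEMETRY-021: Kuma returned status %d", resp.StatusCode)
	}
	return nil
}

// hasCode reports whether err is non-nil and starts with the given error code.
func hasCode(err error, code string) bool {
	return err != nil && strings.HasPrefix(err.Error(), code)
}

// checkFailures drives pushKuma through both failure modes and returns the errors.
func checkFailures() (unreachable, rejected error) {
	// A server that is started and closed right away leaves a URL nobody listens on.
	dead := httptest.NewServer(http.NotFoundHandler())
	deadURL := dead.URL
	dead.Close()
	unreachable = pushKuma(deadURL, "up", "ok")

	// A server that always answers 500 simulates Kuma rejecting the push.
	failing := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusInternalServerError)
	}))
	defer failing.Close()
	rejected = pushKuma(failing.URL, "up", "ok")
	return unreachable, rejected
}

func main() {
	unreachable, rejected := checkFailures()
	fmt.Println("unreachable Kuma reports ERR-TELEMETRY-020:", hasCode(unreachable, "ERR-TELEMETRY-020"))
	fmt.Println("rejecting Kuma reports ERR-TELEMETRY-021:", hasCode(rejected, "ERR-TELEMETRY-021"))
}
```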
---
**Reporter:** Yurii (Code & Principle Review)
**Reference:** CLAVITOR-AGENT-HANDBOOK.md Part 1, "Mandatory error handling with unique codes" and "Silent fallbacks are not fixes"