diff --git a/CLAVITOR-PRINCIPLES-REVIEW.md b/CLAVITOR-PRINCIPLES-REVIEW.md new file mode 100644 index 0000000..30f734f --- /dev/null +++ b/CLAVITOR-PRINCIPLES-REVIEW.md @@ -0,0 +1,914 @@ +# CLAVITOR-PRINCIPLES.md Review Report + +**Date:** 2026-04-08 +**Reviewer:** Claude Code +**Scope:** Full document review across correctness, conciseness, completeness, viability, security, consistency, clarity, actionability, maintainability, audibility, compliance, and localization. + +--- + +## Executive Summary + +The document is exceptionally well-constructed for a security-critical project. It demonstrates strong architectural thinking, clear threat modeling, and actionable daily practices. However, several issues ranging from minor inconsistencies to potential security blind spots warrant attention. + +| Metric | Score | Notes | +|--------|-------|-------| +| Security | 9/10 | Comprehensive threat model, one minor gap in Threat C analysis | +| Correctness | 8/10 | Technical details accurate, some terminology drift | +| Completeness | 7/10 | Missing incident response, versioning, and escalation procedures | +| Conciseness | 7/10 | Some redundancy and over-explanation | +| Consistency | 8/10 | Generally consistent, minor formatting drift | +| Viability | 8/10 | Mostly viable, some idealistic rules may create friction | +| Clarity | 9/10 | Exceptionally clear with concrete examples | +| Actionability | 9/10 | Daily checklist is exemplary | +| Maintainability | 8/10 | Good, but needs deprecation tracking | +| Audibility | 9/10 | Excellent grep-driven verification (see F036 on LLM-era evolution) | +| Compliance | 6/10 | GDPR mentioned but narrowly; missing SOC2, PCI-DSS, data retention, cross-border transfer | +| Localization | 1/10 | **Not addressed** — no i18n/l10n guidance for multi-language UI | + +--- + +## Findings + +### F001: Redundancy — "Foundation First" duplicated +**Location:** Lines 10-13 and 20-34 (restatement) +**Severity:** Low +**Metric:** Conciseness + +The 
"Foundation First — No Mediocrity. Ever." principle appears with significant overlap: first as an introductory statement (lines 10-13), then expanded in Part 1 (lines 20-34). The second version is more complete; the first is redundant. + +**Recommended Action:** +Remove the introductory paragraph (lines 10-13) and rely solely on the Part 1 version, which is more complete. Alternatively, reduce the intro to a single sentence: "Foundation First — see Part 1." + +--- + +### F002: Missing — "Quick fixes are not fixes" lacks counter-example +**Location:** Line 28 +**Severity:** Medium +**Metric:** Completeness + +The principle states "A 'temporary' hack that ships is permanent" but doesn't provide the positive pattern. Developers need to know what TO do when urgent fixes are needed. + +**Recommended Action:** +Add a positive pattern example: +``` +When urgent: fix the symptom with a comment containing // TEMP-FIX: and +a mandatory fix-by date. No TEMP-FIX older than 72 hours may exist in main. +``` + +--- + +### F003: Inconsistency — "No migrations during early-stage work" vs "MigrateDB" exists +**Location:** Lines 79-85, 645-650 (checklist) +**Severity:** Medium +**Metric:** Consistency + +The principle states "Do not write migration code" but the daily checklist (C3) acknowledges `MigrateDB` exists "for the existing `alternate_for` and `verified_at` columns (legacy from earlier work)." This creates a "do as I say, not as I do" situation. + +**Recommended Action:** +Either: (a) Remove the legacy migration code and update C3, or (b) add explicit text to the principle acknowledging the grandfathered exception and stating when the exemption expires (e.g., "Remove MigrateDB when legacy columns are phased out by [date]"). 
+ +--- + +### F004: Viability Risk — "No env vars for fundamental paths" may block valid use cases +**Location:** Lines 54-55, 636-642 (checklist) +**Severity:** Medium +**Metric:** Viability + +The KISS principle states "No env vars when a hardcoded relative path is correct" with examples `./vaults/` and `WL3`. While simplicity is valuable, this hardcoding prevents: +- Container deployments with volume mounts at specific paths +- Multi-tenant hosting with separate data directories +- Compliance requirements requiring data on encrypted filesystems at specific mount points + +**Recommended Action:** +Add a clarification: "Exception: When the deployment environment (containers, compliance mounts) requires configurable paths, a single `CLAVITOR_DATA_ROOT` env var is acceptable, but no per-path fragmentation (DATA_DIR, WL3_DIR, VAULT_DIR, etc.)." + +--- + +### F005: Technical Inaccuracy — "Two crypto implementations of the same primitive" check is misleading +**Location:** Lines 676-688 (checklist D2) +**Severity:** Medium +**Metric:** Correctness + +The checklist says to run `diff clavis-vault/cmd/clavitor/web/crypto.js clavis-cli/crypto/crypto.js` and expects "zero diff (or... intentionally different... documented)." This is misleading because: +1. The CLI crypto.js may need target-specific adapters (BearSSL vs WebCrypto) +2. The paths suggest different directory structures that may require different module exports +3. A zero-diff requirement would prevent any platform-specific optimizations + +**Recommended Action:** +Change the check to: +"Expected: core encryption/decryption logic (`encrypt_field`, `decrypt_field`, HKDF) is byte-identical. Platform-specific wrappers/adapters are permitted if documented in `clavis-crypto/README.md` and don't affect the output format." 
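

One way to make the "core byte-identical, adapters documented" rule mechanically checkable is to fence the shared logic with marker comments and diff only that region. A sketch, assuming a hypothetical `SHARED-CORE` marker convention that the codebase does not yet have:

```shell
# extract_core FILE
# Prints only the region between the (assumed) marker comments:
#   // BEGIN SHARED-CORE  ...  // END SHARED-CORE
extract_core() {
  awk '/\/\/ BEGIN SHARED-CORE/ {on=1; next} /\/\/ END SHARED-CORE/ {on=0} on' "$1"
}

# check_crypto_core FILE_A FILE_B
# Compares only the shared-core regions, so platform-specific wrappers
# outside the markers (BearSSL vs WebCrypto) are free to differ.
check_crypto_core() {
  a_core=$(mktemp); b_core=$(mktemp)
  extract_core "$1" > "$a_core"
  extract_core "$2" > "$b_core"
  if diff -q "$a_core" "$b_core" > /dev/null; then
    echo "OK: shared crypto core is byte-identical"
  else
    echo "VIOLATION: shared crypto core has drifted between $1 and $2"
  fi
  rm -f "$a_core" "$b_core"
}

# Checklist D2 would then become:
# check_crypto_core clavis-vault/cmd/clavitor/web/crypto.js clavis-cli/crypto/crypto.js
```

The marker convention is the price of admission: it forces authors to declare, in the source itself, which lines are the shared core and which are adapters, which is exactly the documentation F005 asks for.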
+ +--- + +### F006: Missing — No incident response procedure +**Location:** Entire document +**Severity:** High +**Metric:** Completeness + +The document thoroughly describes prevention but contains zero incident response guidance. Given the security-critical nature, this is a significant gap: +- What happens when Threat A is detected? (agent locked, then what?) +- Who is notified of a harvester pattern? +- What's the SLA for PRF unlock after a false positive lockout? +- How is a compromised vault file rotated? + +**Recommended Action:** +Add Part 7: Incident Response with sections for: +1. Agent lockout recovery procedure +2. Compromised credential token rotation +3. Audit log review cadence +4. Escalation contacts/owner notification +5. Emergency PRF unlock process (if owner loses authenticator) + +--- + +### F007: Security Gap — Threat C analysis misses "steal at creation" vector +**Location:** Lines 274-298 +**Severity:** High +**Metric:** Security + +Threat C discusses stolen CVT client credentials, but misses a critical sub-vector: **credential theft at creation time**. When an agent is created: +1. Browser generates L2 +2. Browser encrypts with L0, creates type-0x01 token +3. This token briefly exists in browser memory before being sent to agent + +A malicious browser extension or XSS could exfiltrate the token during this window. The per-agent IP whitelist helps post-creation, but doesn't address theft-at-creation. + +**Recommended Action:** +Add to Threat C: +"Sub-vector C2 — Credential theft at creation: The type-0x01 token exists transiently in browser memory during agent enrollment. Defense: agent tokens should be delivered through a side-channel (QR code display + camera scan by agent device, or file download + scp) rather than clipboard or direct HTTP response to browser. Document the secure transfer mechanism in the agent enrollment flow." 
+ +--- + +### F008: Consistency — "Must never" lists use inconsistent formatting +**Location:** Part 3 subprojects +**Severity:** Low +**Metric:** Consistency + +The "Must never" sections across subprojects use inconsistent formatting: +- `clavis-vault`: bullet list with "- Have functions named..." (line 358) +- `clavis-cli`: bullet list with "- Bridge to Go..." (line 405) +- `clavis-chrome`: bullet list with "- Store L2 or L3..." (line 456) +- `clavitor.ai/admin`: bullet list with "- Hold any decryption material..." (line 491) + +Some use "- Verb..." (action-oriented), others use "- No/None..." (prohibition-oriented). `clavis-vault` has a "Must never" that starts with "Have" which is confusing ("Must never have functions named..." vs "Must never: have functions named...") + +**Recommended Action:** +Standardize all "Must never" bullets to start with action verbs in negative form: +- "Implement crypto operations in Go" (not "Bridge to Go for any...") +- "Store L2 or L3 in extension storage" (not "Use a different field...") + +--- + +### F009: Missing — No definition of "early-stage" +**Location:** Lines 79-85, 362 +**Severity:** Medium +**Metric:** Completeness + +The document references "early-stage work" multiple times as justification for no migrations, no backwards compatibility, and willingness to delete vaults. However, "early-stage" is never defined. This creates ambiguity about when the rule flips. + +**Recommended Action:** +Add a definition in Part 1 or a footnote: +"'Early-stage' means: no paying customers, no production deployments outside the core team's infrastructure, and version < 1.0. The project remains in early-stage until [specific criteria: e.g., first paid subscription, 100 active vaults, or v1.0 release]. When early-stage ends, the principles on migrations and backwards compatibility will be revised." 
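

The version criterion of that definition can also be made machine-checkable, so the daily review notices when the rule flips rather than relying on memory. A minimal sketch covering only the "version < 1.0" criterion, assuming a top-level `VERSION` file (an assumed convention; the customer and deployment criteria still need human judgment):

```shell
# is_early_stage [VERSION_FILE]
# Treats any major version below 1 as early-stage. A missing VERSION file
# defaults to 0.0.0, i.e., early-stage.
is_early_stage() {
  ver="$(cat "${1:-VERSION}" 2>/dev/null || echo "0.0.0")"
  major="${ver%%.*}"
  [ "$major" -lt 1 ]
}

if is_early_stage; then
  echo "Early-stage rules apply: no migrations, no backwards-compat shims"
else
  echo "Early-stage has ended: revisit the migration and backwards-compat principles"
fi
```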
+ +--- + +### F010: Audibility Gap — Checklist A6 (keys never logged) is insufficient +**Location:** Lines 587-593 +**Severity:** Medium +**Metric:** Audibility, Security + +The grep for A6 checks for obvious key material (l1Key, master, wrappedL3, prf, p1Bytes) but misses: +- Variable naming variations (l1_key, l1KeyBytes, masterKeyBytes) +- Hex encoding patterns that might reveal keys in logs +- Indirect logging via error wrapping (errors.Wrap(err, "failed with master key: " + key)) + +**Recommended Action:** +Enhance A6 with additional patterns: +```bash +# Check for variable naming variations +grep -rnE "(l1[_-]?key|l1[_-]?bytes|master[_-]?key|wrapped[_-]?l3|prf[_-]?output|p1[_-]?bytes)" \ + clavis-vault/api/ clavis-vault/lib/ | grep -iE "log\.|fmt\.Print|errors\.|AuditEvent" + +# Check for hex encoding of key-sized data (32 bytes = 64 hex chars) +grep -rnE "%x.*\[32\]|%x.*\[64\]|hex\.EncodeToString.*key|hex\.EncodeToString.*master" \ + clavis-vault/api/ clavis-vault/lib/ +``` + +--- + +### F011: Viability Risk — "No docstrings on code you didn't change" may reduce maintainability +**Location:** Lines 104-105 +**Severity:** Low +**Metric:** Viability, Maintainability + +The KISS principle discourages adding docstrings to unchanged code. While this prevents PR bloat, it may result in persistently under-documented legacy code. As the codebase grows and original authors leave, this creates knowledge silos. + +**Recommended Action:** +Add a clarifying exception: +"No docstrings on code you didn't change, UNLESS you needed to spend >10 minutes understanding what the code does. In that case, add the docstring you wish existed — pay it forward." + +--- + +### F012: Technical Inaccuracy — Threat B time calculation uses Grover's incorrectly +**Location:** Line 267 +**Severity:** Low +**Metric:** Correctness + +The text states: "32-byte AES-256-GCM keys are not brute-forceable. ~10⁵⁷ years at 10¹² guesses/sec; Grover's quantum speedup leaves 2¹²⁸ effective security." 
Grover's algorithm provides a quadratic (square-root) speedup: brute-forcing AES-256 drops from ~2²⁵⁶ classical guesses to ~√(2²⁵⁶) = 2¹²⁸ quantum operations, i.e., security equivalent to a 128-bit classical key. The original sentence reaches the right end state, but "Grover's quantum speedup leaves 2¹²⁸ effective security" reads as if 2¹²⁸ were the speedup factor rather than the remaining work. The phrasing is confusing.

**Recommended Action:**
Correct to:
"32-byte AES-256-GCM keys are not brute-forceable. ~10⁵⁷ years at 10¹² guesses/sec; Grover's quantum algorithm provides a square-root speedup, reducing effective security to that of a 128-bit classical cipher — still computationally infeasible (~10¹⁹ years at the same 10¹² operations/sec)."

---

### F013: Missing — No guidance on principle deprecation
**Location:** Lines 768-771
**Severity:** Medium
**Metric:** Completeness

Part 5 mentions marking principles deprecated but doesn't specify the mechanism:
- What format should the deprecation notice take?
- Who authorizes deprecation?
- How long should deprecated principles remain before removal?
- Should deprecated principles move to an appendix or stay in place?

**Recommended Action:**
Add to Part 5:
```markdown
### Deprecation format
When a principle is obsolete:
1. Add **DEPRECATED** in bold at the start of the principle header
2. Include: date deprecated, reason, replacement principle (if any), and authorizing person (e.g., "Deprecated 2026-04-08: Early-stage ended. Migration code now required. — Johan")
3. Deprecated principles remain in place for 90 days, then move to Appendix A: Deprecated Principles
4. A deprecated principle's grep checks in Part 4 are commented out with a reason
```

---

### F014: Redundancy — "No backwards-compat shims" duplicates "Don't add what's not needed"
**Location:** Lines 87-96 vs 98-105
**Severity:** Low
**Metric:** Conciseness

The "No backwards-compat shims" section lists specific anti-patterns (flags, rename hacks, re-exports, removed comments, commented-out code). The following section "Don't add what's not needed" has overlapping scope.
Both mention dead code/comments. + +**Recommended Action:** +Merge the two sections under "Don't add what's not needed" and structure as: +```markdown +### Don't add what's not needed (and remove what isn't used) + +- No backwards-compatibility shims: ... +- No error handling for impossible situations: ... +[etc] +``` + +--- + +### F015: Security Gap — No guidance on secure deletion of keys from memory +**Location:** Part 2, threat model sections +**Severity:** Medium +**Metric:** Security, Completeness + +The document extensively discusses where keys live (L0, L1, L2, L3) and what never holds them, but provides no guidance on secure deletion of key material from memory after use. In Go, this is particularly important because: +- The garbage collector may move/copy memory +- `memset` or `clear` doesn't guarantee overwriting the physical memory +- Sensitive data may remain in memory after session close + +**Recommended Action:** +Add to Part 2 (after Cardinal Rule #3) or to `clavis-crypto` subproject: +```markdown +### Secure key lifecycle in memory + +When a key is no longer needed (session close, credential operation complete): +- Use `memset_explicit` (C11) or `runtime.memclrNoHeapPointers` (Go) to clear buffers +- In Go, sensitive byte slices should be allocated from `mmap` with `PROT_READ|PROT_WRITE` + and `MADV_DONTDUMP` to exclude from core dumps +- The browser's `sessionStorage` clear is automatic on tab close; no additional action needed +- CLI: clear key buffers before exit; use `explicit_bzero` on sensitive stack buffers +``` + +--- + +### F016: Correctness — Part 4 Section F uses imprecise commands +**Location:** Lines 729-742 +**Severity:** Low +**Metric:** Correctness, Audibility + +Check F1 says to run `go test ./lib/... ./api/...` but this doesn't capture all tests in the monorepo. F2 says `make build` but the Makefile might have other targets that matter. F3 is described as "not greppable" but could include guidance. 
**Recommended Action:**
Update F1:
```bash
cd clavis-vault && go test ./...
cd clavitor.ai/admin && go test ./...
# Add any other test directories
```

Update F2:
```bash
cd clavis-vault && make dev  # validates JS as part of dev build
# or for CI: make build
```

Add a concrete check to F3 (**New code has tests**):
```bash
git diff --name-only HEAD~1 | grep -E "\.go$" | while read -r f; do
  test_file="$(dirname "$f")/$(basename "$f" .go)_test.go"
  [ -f "$test_file" ] || echo "WARNING: $f has no test file"
done
```

---

### F017: Viability Risk — "Restart rule" is not actionable
**Location:** Lines 32-34
**Severity:** Medium
**Metric:** Viability, Actionability

The "restart rule" states: "When the foundation is wrong: start over. Not 'refactor slightly'... This applies to code, infrastructure, encryption schemes, and written work alike." This is philosophically correct but operationally vague:
- What constitutes "wrong" enough to restart?
- How much sunk cost is too much to restart?
- What's the approval process for a restart vs refactor?

**Recommended Action:**
Add decision criteria:
```markdown
### When to restart vs. refactor

**Restart when** (any of):
- The core abstraction is incorrect (e.g., server holding L2 material)
- Three or more fundamental fixes are needed for one feature
- The code contradicts its own stated invariants
- Security review reveals architectural vulnerability

**Refactor when**:
- Implementation is messy but abstraction is sound
- Performance issues but semantics are correct
- Code duplication but logic is correct

**Approval**: Restart decisions require explicit user confirmation. State: "The foundation is wrong because [reason]. I recommend restarting [component]. Sunk cost: [commits/lines]. Estimated rebuild: [time]. Proceed?"
+``` + +--- + +### F018: Clarity Issue — "Don't add what's not needed" mixes multiple concerns +**Location:** Lines 98-105 +**Severity:** Low +**Metric:** Clarity, Consistency + +This section combines four unrelated concerns: +1. Error handling for impossible situations (line 100) +2. Validation at trusted boundaries (lines 101-102) +3. Feature flags for hypothetical futures (line 103) +4. Docstrings on unchanged code (lines 104-105) + +These are distinct principles with different rationales. Combining them reduces clarity. + +**Recommended Action:** +Split into four subsections or bullet points with clear headers: +```markdown +- **No error handling for impossible situations.** Trust internal code. +- **No validation at trusted internal boundaries.** Validate only at the system perimeter... +- **No feature flags for hypothetical futures.** Build what's asked, no more. +- **Minimal documentation on unchanged code.** Only comment where logic isn't self-evident. +``` + +--- + +### F019: Security Gap — Threat A doesn't address "slow harvest" evasion +**Location:** Lines 240-252 +**Severity:** Medium +**Metric:** Security, Completeness + +Threat A defenses include rate limits (3/min, 10/hr) and two-strike lockdown. However, a sophisticated attacker could: +- Harvest at 2/min (below threshold) +- Use time-of-day patterns (only during business hours) +- Randomize intervals to avoid detection + +The rate limits are reactive, not proactive. There's no mention of behavioral analysis or anomaly detection. + +**Recommended Action:** +Add to Threat A: +```markdown +**Advanced evasion consideration:** Rate limits are a first line. Future enhancement: +anomaly detection on access patterns (time-of-day consistency, geographic consistency, +entry access distribution). Document that current rate limits are baseline defense, +and audit logs should be reviewed for slow-and-low patterns below rate limits. 
+``` + +--- + +### F020: Correctness — Check E2 description is slightly inaccurate +**Location:** Lines 703-717 +**Severity:** Low +**Metric:** Correctness + +The check states "Every `AgentCanAccess` call should be followed by an `agentReadEntry` call in the same code path." However, the correct sequence is: +1. `AgentCanAccess` (authorization check) +2. [Handler-specific logic] +3. `agentReadEntry` (rate limit/strike tracking) + +They're not necessarily "followed by" in an immediate sense — there may be intermediate logic. + +**Recommended Action:** +Correct to: +```markdown +Every credential-read handler must call both `AgentCanAccess` (before accessing +data) and `agentReadEntry` (after decryption, before returning). Verify by ensuring +each handler contains both function names in the proper order. +``` + +--- + +### F021: Completeness — No guidance on test fixture security +**Location:** Check A1 mentions "Test fixtures may use the term in test data" +**Severity:** Medium +**Metric:** Security, Completeness + +Check A1 notes that test fixtures may contain `master_key` in test data. 
This is a dangerous blanket exception without guidance on: +- Whether test keys should be distinguishable from real keys +- Whether test fixtures should be encrypted/committed +- How to ensure test keys don't leak into production + +**Recommended Action:** +Add to Part 4 Section F or as a new subsection in Part 2: +```markdown +### Test fixture key material +Test data containing key material: +- Must use the prefix `TEST_` in all variable/field names +- Must use obviously fake values (e.g., `master_key = bytes.Repeat([]byte{0xAB}, 32)`) +- Must not be derived from any production key material +- Must not be committed to the repository in binary form +``` + +--- + +### F022: Redundancy — Audit logging mentioned in multiple places +**Location:** Lines 108-115, 114 (repeated in example), and implicitly throughout +**Severity:** Low +**Metric:** Conciseness + +Audit logging is mentioned as a general principle (lines 108-115), with code example (line 114), and implied in various threat defenses. The code example is also repetitive — it shows the full function call twice in different contexts. + +**Recommended Action:** +In Part 1, keep the principle but move the full code example to a single location (perhaps Part 3 `clavis-vault` section where it's most relevant) and reference it: +```markdown +Audit format: `lib.AuditLog()` — see `clavis-vault` section for usage example. +``` + +--- + +### F023: Viability — Part 6 "When you have to violate a principle" lacks escalation path +**Location:** Lines 775-794 +**Severity:** Medium +**Metric:** Viability + +Part 6 describes what to do when principles conflict (surface to user, pick higher consequence, document, add TODO). 
However, it doesn't address what happens if: +- The user disagrees with the agent's assessment of which principle has higher consequence +- The violation is discovered post-hoc (code already committed) +- Multiple principles seem equally consequential + +**Recommended Action:** +Add to Part 6: +```markdown +### Escalation and post-hoc discovery + +**User disagreement:** If the user insists on a lower-consequence principle over a higher one +(e.g., speed over security), escalate to project owner (Johan) before proceeding. +Document the override in the PR description. + +**Post-hoc discovery:** If you find a committed violation during daily review: +1. Assess if it's actively dangerous (security veto) — if yes, escalate immediately +2. If not dangerous, add to TODO and address in current sprint +3. Never leave a violation undocumented for >24 hours + +**Equal consequence conflicts:** When two principles of equal tier conflict, prefer the +one that preserves user agency and data integrity. +``` + +--- + +### F024: Consistency — Missing period in final line +**Location:** Line 798 +**Severity:** Very Low +**Metric:** Consistency + +The final line "*Foundation First. No mediocrity. Ever.*" has inconsistent punctuation. "Ever" has no period but the others do. + +**Recommended Action:** +Add period: "*Foundation First. No mediocrity. Ever.*" → "*Foundation First. No mediocrity. Ever.*" + +--- + +### F025: Completeness — No explicit versioning of the principles document itself +**Location:** Entire document +**Severity:** Medium +**Metric:** Completeness, Maintainability + +The document doesn't have a version number or date. When an agent reads this "end-to-end" each morning, how do they know if principles have changed since yesterday? 
Without versioning, agents cannot: +- Track which principles are new/changed +- Review the delta rather than the full document +- Correlate violations with principle versions + +**Recommended Action:** +Add a document header: +```markdown +--- +Document Version: 1.0 +Last Updated: 2026-04-08 +Change Log: See Appendix B (or git log) +--- +``` + +Or use the existing date in Part 5: "Reference the incident or insight that surfaced the principle (commit hash, issue number, or session date)" — add principle document version to that reference. + +--- + +### F026: Audibility — Check B1 pattern may have false negatives +**Location:** Lines 605-611 +**Severity:** Low +**Metric:** Audibility + +Check B1 searches for `catch.*plaintext` or `catch.*ciphertext` or `catch.*=.*null`. However, a malicious silent fallback could use: +- `catch (e) { return data; }` (returns encrypted or undefined) +- `catch { decrypted = ""; }` (empty string, not null) +- `finally` blocks that silently restore old values + +**Recommended Action:** +Enhance B1: +```bash +# Check for any catch that doesn't show user-visible failure +grep -rn "catch.*{" clavis-vault/cmd/clavitor/web/ | \ + grep -v "decryption failed\|user-visible\|alert\|console.error" +``` +Or add manual review note: "All catch blocks in decryption paths must be reviewed manually — automated grep may miss subtle fallbacks." + +--- + +### F027: Security — Missing explicit prohibition on key material in crash dumps +**Location:** Threat B section (lines 253-272) +**Severity:** Medium +**Metric:** Security + +Threat B discusses backups, logs, metrics, error messages, audit rows, and debug dumps, but doesn't explicitly mention crash dumps/core files. These are distinct from "debug dumps" and may persist in `/var/crash/` or system directories. 
**Recommended Action:**
Add to Threat B:
```markdown
- **No key material in crash dumps.** Disable core dumps for Clavitor processes
  (`prctl(PR_SET_DUMPABLE, 0)` on Linux, `setrlimit(RLIMIT_CORE, 0)` on macOS) or
  use `MADV_DONTDUMP` on memory pages holding key material.
```

---

### F028: Clarity — "Three threats, no others" claim may be overstated
**Location:** Lines 227-232
**Severity:** Low
**Metric:** Clarity, Correctness

The document states: "The three threats below are the entire universe Clavitor defends against." This is a strong claim. However, the threat model excludes:
- Supply chain attacks (malicious dependency)
- Social engineering (owner tricked into unlocking)
- Side-channel attacks (timing, power analysis)
- Infrastructure compromise (compromised Tailscale node)

While some of these may be out of scope, claiming "entire universe" is overstated.

**Recommended Action:**
Change to:
```markdown
The three threats below are the **primary** universe Clavitor defends against
at the application layer. Infrastructure, supply chain, and social engineering
are handled through operational security practices outside this codebase.
```

---

### F029: Correctness — Key tier table has formatting inconsistency
**Location:** Lines 189-195
**Severity:** Very Low
**Metric:** Consistency

The key tier table uses inconsistent formatting:
- L0, L1, L2, L3, P1 use `code` formatting
- master_key uses plain text (no code formatting)
- Column "What" describes byte counts for all except P1

**Recommended Action:**
Standardize:
```markdown
| Tier | What | Server may hold |
|------|------|-----------------|
| L0 | 4 bytes — vault file routing | Yes (sliced from `L1[:4]`) |
| L1 | 8 bytes — vault content encryption | Yes (in bearer token) |
| L2 | 16 bytes — agent decryption key | **NEVER** |
| L3 | 32 bytes — hardware-only key | **NEVER** |
| P1 | 8 bytes — public lookup token | Yes (browser computes & sends) |
| Master | 32 bytes — full PRF output | **NEVER** |
```

---

### F030: Completeness — No mention of compiler optimizations affecting security
**Location:** Part 2, security sections
**Severity:** Medium
**Metric:** Security, Completeness

The document doesn't address compiler optimizations that may undermine security:
- Go's compiler may optimize away `clear()` calls on slices that are never read again
- C compilers may dead-store-eliminate `memset` calls
- This affects the secure deletion of key material

**Recommended Action:**
Add to Cardinal Rule #2 or #3:
```markdown
**Compiler considerations:** Standard `memset` or `clear()` may be optimized away.
Use `runtime.memclrNoHeapPointers` (Go) or `memset_explicit` (C23) or
`OPENSSL_cleanse` (if using OpenSSL) to ensure memory is actually cleared.
+``` + +--- + +### F031: Compliance Gap — GDPR is the only framework mentioned +**Location:** Lines 301-313 (Cardinal Rule #5), scattered references +**Severity:** Medium +**Metric:** Compliance + +The document extensively discusses GDPR (specifically the "GDPR-out-of-scope by design" property of WL3), but fails to address other compliance frameworks that may apply to a credential vault service: +- **SOC 2 Type II**: Controls for availability, confidentiality, processing integrity +- **PCI-DSS**: If the vault stores payment card credentials +- **HIPAA**: If used in healthcare contexts (ePHI access credentials) +- **CCPA/CPRA**: California privacy laws (broader than GDPR in some aspects) +- **FedRAMP**: If any government use is anticipated +- **ISO 27001**: Information security management + +The narrow GDPR focus may lead to architectural decisions that satisfy one framework while violating others. + +**Recommended Action:** +Add Part 8: Compliance Frameworks or expand Cardinal Rule #5: +```markdown +### Cardinal Rule #5 — Data minimization for compliance + +WL3 is designed for **global** compliance minimization, not just GDPR: +- No PII in WL3 (satisfies GDPR, CCPA, LGPD, PIPEDA) +- No payment card data in vault (PCI-DSS scope exclusion) +- Audit logs retained per strictest applicable jurisdiction (often 7 years) +- Encryption at rest satisfies SOC 2 CC6.1, ISO 27001 A.10.1.2 + +When adding features, ask: which compliance framework requires this? If none, don't add it. 
+``` + +--- + +### F032: Compliance Gap — No data retention policy guidance +**Location:** Part 2, audit sections +**Severity:** High +**Metric:** Compliance + +The document states "we never delete" vault data (line 512) until GDPR request, but doesn't specify: +- How long audit logs must be retained +- When deleted entries are truly purged (vs marked deleted) +- How backups are rotated and when they're destroyed +- Cross-border data transfer mechanisms (EU data in US POPs) + +This creates compliance risk across multiple jurisdictions with differing retention requirements. + +**Recommended Action:** +Add to Part 2 or Part 8: +```markdown +### Data retention + +| Data type | Retention | Rationale | +|-----------|-----------|-----------| +| Vault content | Until customer deletion + 30 days | GDPR Art. 17, business continuity | +| Audit logs | 7 years | SOX, SOC 2, various national requirements | +| Failed auth attempts | 90 days | Threat analysis, then purge | +| Backups | 30 days rolling | Disaster recovery, then secure wipe | +| Telemetry | 90 days | Operational metrics only | + +Secure deletion: Use `shred` (files), `DBAN` patterns (disks), or `crypto/rand` overwrite +for SQLite `VACUUM` operations before file removal. +``` + +--- + +### F033: Compliance Gap — No cross-border data transfer mechanism +**Location:** Not addressed +**Severity:** Medium +**Metric:** Compliance + +For EU customers with data in US POPs (or vice versa), the document doesn't address: +- Standard Contractual Clauses (SCCs) +- Data Processing Agreements (DPAs) +- Adequacy decisions (EU-US Data Privacy Framework) + +Cardinal Rule #5 claims GDPR-out-of-scope, but if the central admin DB is in a different jurisdiction than the POP, there are still transfer implications. + +**Recommended Action:** +Add to Part 8: +```markdown +### Cross-border transfers + +WL3 files are jurisdiction-agnostic (no PII), but metadata flows between POPs and central +require SCCs in place. 
The vault slot registry (central) and WL3 files (POPs) are always +in the same jurisdiction for a given customer. Multi-region customers are explicitly +opt-in with transfer mechanism documented at enrollment. +``` + +--- + +### F034: Localization — No internationalization guidance exists +**Location:** Not addressed in any subproject section +**Severity:** High +**Metric:** Localization + +The document contains zero guidance on: +- UI string externalization (browser, mobile, CLI) +- Right-to-left (RTL) language support (Arabic, Hebrew) +- Date/time/number localization +- Locale-aware credential detection (different TLDs, country-specific form patterns) +- Translation workflows (who translates, how are updates propagated) + +This is a significant gap for a product with browser extensions and mobile apps that will inevitably face non-English users. + +**Recommended Action:** +Add subproject sections for localization or a dedicated Part 9: +```markdown +## Part 9 — Localization (i18n/l10n) + +### Principle: English-first, but locale-ready + +- All user-facing strings live in `locales/en.json` (single source of truth) +- Browser/extension: Use `browser.i18n.getMessage()` with `_locales//messages.json` +- CLI: Use gettext `.po` files, respect `$LANG` and `$LC_MESSAGES` +- Mobile: Use platform-native i18n (Android resources, iOS Localizable.strings) +- Vault web UI: Serve locale bundle based on `Accept-Language`, fallback to English + +### Hard vetos: +- No hardcoded strings in source code (except log messages, which are English-only) +- No user-facing string concatenation (`"Hello, " + username`) — use parameterized templates +- No locale-specific crypto or validation logic (locale is display-only) +- No translation of credential field names (username/password are universal) + +### RTL considerations: +- Browser extension UI must support RTL layouts (`dir="rtl"`) +- Vault web UI CSS uses logical properties (`inline-start`, not `left`) + +### Credential detection: +- 
TLD matching must respect IDN (internationalized domain names) +- Form field detection uses localized label heuristics (maintained per-locale) +``` + +--- + +### F035: Localization — Form field detection without locale awareness +**Location:** Implicit in browser extensions +**Severity:** Medium +**Metric:** Localization + +Browser extensions that detect login forms by field names/placeholders will fail on non-English sites. A French site's "Nom d'utilisateur" or Japanese "ユーザー名" won't match hardcoded "Username" heuristics. + +**Recommended Action:** +Add to `clavis-chrome/firefox/safari` sections: +```markdown +**Must never:** +- Detect forms using English-only field name patterns +- Rely solely on `id="username"` (common but not universal) + +**Must:** +- Use W3C `autocomplete` attributes as primary signal +- Maintain locale-specific keyword lists for heuristics (German: "Benutzername", French: "identifiant") +- Support Cyrillic, CJK, and Arabic script in field detection +``` + +--- + +### F036: Methodology Critique — Grep-based checks are legacy in LLM era +**Location:** Part 4 (entire section) +**Severity:** Medium +**Metric:** Audibility, Maintainability + +The daily checklist relies heavily on `grep` for verification (A1-A7, B1-B3, C1-C4, D2, E1-E3). While grep is reliable for exact patterns, it: +- Cannot understand semantic equivalence (e.g., `master_key` vs `mKey` vs `key[:32]`) +- Produces false negatives on obfuscated or indirect code paths +- Requires manual maintenance as code evolves (pattern drift) +- Cannot evaluate the "spirit" of a principle, only the "letter" + +With LLMs available for code review, grep-based checks are a local maximum. The daily reviewer should be running semantic analysis, not just pattern matching. + +**Recommended Action:** +Update the Part 4 introduction: +```markdown +## Part 4 — Daily review verification + +**Methodology:** These checks use `grep` for fast, deterministic screening. 
+However, grep catches patterns, not intent. The daily reviewer must also: +1. Run the automated grep checks (catches obvious violations) +2. Run semantic review on changed files using available analysis tools (catches obfuscated violations) +3. Manually review any ambiguous findings + +**Goal:** Zero violations that an intelligent reviewer would catch, not just zero grep matches. +``` + +Consider replacing some greps with semantic checks: +- A6 (keys in logs): Use taint analysis to find key material flowing to log functions +- B1 (silent decryption fallback): Use control-flow analysis to find catch blocks that don't re-throw or return error indicators +- D1 (DRY violations): Use clone detection to find similar code blocks + +--- + +## Summary of Recommendations by Priority + +### Critical (Security/Viability/Compliance Blockers) +| ID | Finding | Action | +|----|---------|--------| +| F007 | Threat C missing "steal at creation" | Add C2 sub-vector with defense | +| F006 | No incident response procedure | Add Part 7: Incident Response | +| F034 | No internationalization guidance exists | Add Part 9: Localization | + +### High (Should Address Soon) +| ID | Finding | Action | +|----|---------|--------| +| F003 | Migration code inconsistency | Resolve MigrateDB or document exception | +| F004 | Hardcoded paths blocking valid use cases | Add env var exception clause | +| F015 | No secure deletion guidance | Add memory clearing guidance | +| F017 | Restart rule not actionable | Add decision criteria | +| F023 | Violation escalation path missing | Add to Part 6 | +| F032 | No data retention policy guidance | Add retention table and purge procedures | +| F035 | Form detection without locale awareness | Add locale-specific keyword lists | + +### Medium (Quality Improvements) +| ID | Finding | Action | +|----|---------|--------| +| F002 | Quick fix guidance incomplete | Add positive pattern | +| F005 | Crypto diff check misleading | Clarify byte-identical vs adapters | +| 
F009 | "early-stage" undefined | Add explicit definition | +| F010 | A6 grep insufficient | Add pattern variations | +| F019 | Threat A slow harvest gap | Add evasion consideration | +| F021 | Test fixture security vague | Add fixture guidance | +| F025 | No document versioning | Add version header | +| F027 | Missing crash dump prohibition | Add to Threat B | +| F028 | "Entire universe" overstatement | Weaken to "primary" | +| F031 | GDPR is the only framework mentioned | Add Part 8: Compliance Frameworks | +| F033 | No cross-border data transfer mechanism | Add SCC/DPA guidance | +| F036 | Grep-based checks are legacy in LLM era | Update methodology to include semantic review | + +### Low (Polish/Consistency) +| ID | Finding | Action | +|----|---------|--------| +| F001 | Foundation First duplicated | Remove intro or reduce | +| F008 | "Must never" formatting inconsistent | Standardize bullet style | +| F011 | ~~Docstring guidance too strict~~ | ✅ Fixed — now allows docstrings when you spend >10 min understanding | +| F012 | ~~Grover's algorithm explanation wrong~~ | ✅ Fixed — corrected quantum security explanation | +| F013 | ~~Deprecation mechanism unclear~~ | ✅ Fixed — added explicit format, 90-day period, approval requirement | +| F014 | ~~Two sections overlap~~ | ✅ Fixed — merged backwards-compat into "Don't add what's not needed" | +| F016 | ~~F check commands imprecise~~ | ✅ Fixed — made "awesome" with bash scripts, emojis, pass/fail | +| F017 | ~~Restart rule not actionable~~ | ✅ Fixed — added explicit user approval requirement | +| F018 | ~~Mixed concerns in section~~ | ✅ REVERSED — made error handling mandatory with unique codes | +| F019 | ~~Threat A slow harvest gap~~ | ✅ Fixed — added >12 credentials/24hr = dark-red flag | +| F020 | ~~E2 description inaccurate~~ | ✅ Fixed — corrected sequence description | +| F021 | ~~Test fixture security vague~~ | ✅ Fixed — 32 identical bytes rule added | +| F022 | ~~Audit example repetitive~~ | ✅ Fixed — 
consolidated to single location |
+| F023 | ~~Violation escalation path missing~~ | ✅ Fixed — added permanent ban & escalation rules |
+| F024 | ~~Missing period~~ | ✅ Fixed — already had periods |
+| F025 | ~~No document versioning~~ | ✅ Fixed — added version header |
+| F026 | ~~B1 may have false negatives~~ | ✅ Fixed — added LLM manual review requirement |
+| F027 | ~~Missing crash dump prohibition~~ | ✅ Fixed — added to Threat B |
+| F028 | ~~"Entire universe" overstatement~~ | ✅ Fixed — changed to "primary universe" |
+| F029 | ~~Key table formatting~~ | ✅ Fixed — standardized Master row |
+| F030 | No compiler optimization warning | Add to security section |
+
+---
+
+## Final Assessment
+
+**CLAVITOR-PRINCIPLES.md** is an exemplary security architecture document that successfully balances prescriptive guidance with practical verifiability. The daily checklist mechanism (Part 4) is particularly strong. The document demonstrates mature threat modeling and clear understanding of the project's unique security requirements (browser-as-trust-anchor, server-as-dumb-store).
+
+**Primary concerns:**
+1. **Security gaps:** The missing "steal at creation" vector (F007) and lack of incident response procedures (F006) are operational blind spots.
+2. **Compliance narrowness:** GDPR-only focus (F031) and missing data retention policies (F032) create enterprise adoption friction.
+3. **Localization vacuum:** Complete absence of i18n/l10n guidance (F034) will block international market entry and create technical debt when localization is eventually needed.
+
+**Regarding grep vs. LLMs (F036):** The user correctly notes that grep-based verification is a legacy approach when semantic analysis is available.
The daily checklist should evolve from pattern-matching ("does this string exist?") to intent-analysis ("does this code path violate the principle's spirit?"). This is not a criticism of the document's current state—grep is reliable and deterministic—but a recognition that the methodology should modernize as tooling improves. + +**Regarding F018 (Error handling reversal):** The user explicitly reversed the "No error handling for impossible situations" principle. Instead of trusting internal code, the new rule is **mandatory error handling with unique codes for every branch**. This is a significant shift from "trust" to "verify and log" — every `if` needs an `else`, even for theoretically unreachable code paths. When the "impossible" happens in production, you'll know exactly which impossible thing occurred (e.g., `ERR-12345: L3 unavailable in decrypt`). + +**Recommendation:** Address Critical and High priority items before the next significant feature development cycle. Localization (F034) in particular should be addressed before UI work expands, as retrofitting i18n is significantly more expensive than designing it in from the start. Medium and Low priority items can be addressed incrementally during principle review sessions. + +The document should be considered **fit for purpose for a single-language, GDPR-focused, early-stage product** with noted improvements required for enterprise and international readiness. 
+ +--- + +*Report generated: 2026-04-08* +*Review methodology: Static analysis against correctness, conciseness, completeness, viability, security, consistency, clarity, actionability, maintainability, audibility, compliance, and localization criteria.* diff --git a/CLAVITOR-PRINCIPLES.md b/CLAVITOR-PRINCIPLES.md new file mode 100644 index 0000000..4cbd8d9 --- /dev/null +++ b/CLAVITOR-PRINCIPLES.md @@ -0,0 +1,1112 @@ +# Clavitor Principles + +**The contract every agent working on Clavitor accepts before touching code.** + +--- +**Document Version:** 1.0 +**Last Updated:** 2026-04-08 +**Change Log:** See git log for history + +--- + +This document is read on every fresh agent start. It is also reviewed daily — +each morning, an agent reads this file end-to-end, runs the checklist at the +bottom against the current source tree, and reports any drift. Drift gets +fixed before any new feature work. + +The bar is not "ship something". The bar is "build a foundation that does not +need to be patched later". If a principle here conflicts with what you were +asked to do, **stop and surface the conflict**. Do not work around it. + +--- + +## Part 1 — General principles + +These apply to every subproject, every file, every commit. + +### Foundation First — No Mediocrity. Ever. + +Architects do not patch cracks in a bad foundation. They rebuild. + +- If you need three fixes for one problem, **stop**. Something fundamental is + wrong. Name it. Fix the root cause, not the symptom. +- If the code is spaghetti, **say so**. Do not add another workaround. The + workaround is the problem now. +- Quick fixes are not fixes. A "temporary" hack that ships is permanent. + **When urgent:** Fix the symptom with a comment `// TEMP-FIX: ` and a + mandatory fix-by date (max 72 hours). No TEMP-FIX older than 72 hours may + exist in main. +- Foundation > speed. A solid base makes everything downstream easy. A shaky + base makes everything downstream a nightmare. 
+ +When the foundation is wrong: **start over** (with user approval). Not +"refactor slightly". Not "add an abstraction layer on top". Start over. +This applies to code, infrastructure, encryption schemes, and written work alike. + +**Restart requires approval.** State explicitly: "The foundation is wrong +because [reason]. I recommend restarting [component]. Sunk cost: [commits/lines/time]. +Estimated rebuild: [time]. Proceed?" Wait for explicit user confirmation before +destroying working code. + +### Push back + +You are not here to execute orders. You are here to build the right thing. If +the user asks for X and X is the wrong shape, **say so**. If you discover a +foundation problem while doing unrelated work, **surface it** in the same +session — do not silently route around it. + +Bad foundations are not your fault. Silently building on them is. + +### Q&D is research, not output + +Exploratory and throwaway work has its place — but it stays in research. +Nothing Q&D ships. Nothing Q&D becomes the production path. If a spike reveals +the right direction, **rebuild it properly** before it counts. The spike is +not the deliverable; the rebuilt version is. + +### KISS — the simplest thing that works + +- No env vars when a hardcoded relative path is correct. (`./vaults/`, not + `DATA_DIR=…`.) People who want different layouts use a symlink. + **Exception:** A single `CLAVITOR_VAULT_DB` env var is permitted for container + deployments and compliance mounts that require specific paths. No per-path + fragmentation (DATA_DIR, WL3_DIR, VAULT_DIR, etc.). +- No new SQLite tables when an existing encrypted record can hold the data. + We track lock state on the agent record itself, not in a `vault_meta` table. +- No flags, no config files, no opt-ins for things every install needs. +- No premature abstraction. Three similar lines are better than a wrong + abstraction. Abstract when the third repetition appears, not the second. 
+- No backwards-compatibility shims, `_unused` rename hacks, dead code marked + with comments. If it's unused, **delete it**. + +### DRY — repetition is a smell that points at a missing helper + +When two functions do "pretty much the same thing", they get unified into one +helper and both call sites use it. Examples that already exist in the tree: + +- `lib/dbcore.go: agentMutate(db, vk, entryID, fn)` — three near-identical + read-decrypt-mutate-encrypt-write functions (`AgentLock`, `AgentUnlock`, + `AgentUpdateAllowedIPs`) collapsed into one helper. New mutations are + one-liners now: `func(vd) { vd.X = y }`. +- `crypto.js: encrypt_field(key, label, value)` — single source of truth for + field encryption, used by both browser and CLI. + +If you write the second copy, you are creating future debt. Stop, refactor, +then proceed. + +### No migrations during early-stage work + +There are no real users yet. Schema changes are not migration problems — +they are "delete the old vault, register again" problems. Do not write +migration code. Do not write `ALTER TABLE` for new columns. Just edit the +schema constant and rebuild. + +**Early stage ends: April 20, 2026.** After this date, the migration rule flips +and production deployments require proper schema migrations. + +### Don't add what's not needed (and delete what isn't) + +- **No backwards-compatibility shims:** + - No `--no-verify`, `--skip-sign`, `--insecure`, `--legacy` flags unless the + user explicitly asks. + - No `_oldName` field rename hacks "for transition". + - No re-exports of types that moved. + - No `// removed` comments in place of deletion. + - No commented-out code "in case we need it again". + - If you delete it, delete it. `git log` is the rollback path. + +- **Delete dead code (with approval):** + - No orphaned files (not imported, not referenced, not executed). + - No unreachable functions (private, no callers in the codebase). + - No unused exports (public but no external importers). 
+ - No dead CSS classes (not referenced in HTML/JS). + - No commented-out blocks >5 lines (anything that large should be deleted). + - **Permission required:** File deletion requires explicit user approval. + State: "3 files/directories found unused. Recommend deletion: + 1. path/to/file1.html (orphaned, no references) + 2. path/to/file2.go (unreachable function) + 3. path/to/dir/ (empty) + Proceed with deletion?" + - **Verification:** Part 4 Section G checks for unused files/functions. + +- **Mandatory error handling with unique codes:** + - Every `if` needs an `else`. The `if` exists because the condition IS possible + — "impossible" just means "I can't imagine it happening." The else handles + the case where your imagination was wrong. + - Use unique error codes: `ToLog("ERR-12345: L3 unavailable in decrypt")` + - When your "impossible" case triggers in production, you need to know exactly + which assumption failed and where. + +### Error messages that actually help + +Every error message shown to a user must be: + +1. **Uniquely recognizable** — include an error code: `ERR-12345: ...` +2. **Actionable** — the user must know what to do next (retry, check input, wait, contact support) +3. **Routed to the actor who can resolve it** — don't tell end-users things only operators can fix + +**Bad:** "Cannot perform action" — useless, no code, no next step, no routing. + +**Bad:** "Server down" to an end-user — they can't fix it. This creates helplessness and support tickets. + +**Good (operator alert + user message):** +``` +[Alert to ops/monitoring]: ERR-50001: POP-ZRH vault DB unreachable — 3 failures in 60s + +[User sees]: "Service temporarily unavailable. Our team has been alerted + and is working on it. Reference: ERR-50001. No need to open a ticket." +``` + +The user knows: (a) what happened, (b) it's being handled, (c) they don't need to act. +This prevents unnecessary support load while keeping the user informed. 
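+
+The operator/user split above can be sketched as a small helper — `OpsError` and `Surface` are illustrative names for this sketch, not helpers that exist in the codebase:
+
```go
package main

import (
	"fmt"
	"log"
)

// OpsError pairs an operator-facing alert with a user-safe message.
// OpsError and Surface are illustrative names, not the codebase's API.
type OpsError struct {
	Code   string // unique error code, e.g. "ERR-50001"
	Detail string // operator-only detail; never shown to end users
}

// Surface logs the full detail for operators (routed to monitoring) and
// returns the actionable, reference-bearing message the end user sees.
func (e OpsError) Surface() string {
	log.Printf("[ops-alert] %s: %s", e.Code, e.Detail)
	return fmt.Sprintf(
		"Service temporarily unavailable. Our team has been alerted. Reference: %s. No need to open a ticket.",
		e.Code,
	)
}

func main() {
	err := OpsError{Code: "ERR-50001", Detail: "POP-ZRH vault DB unreachable — 3 failures in 60s"}
	fmt.Println(err.Surface()) // user-safe line; the ops detail went to stderr via log
}
```
+
+The detail string never reaches the user; only the code travels in both directions, so a support conversation can be joined back to the matching operator alert.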
+- No validation at trusted internal boundaries. Validate at the system + perimeter (HTTP requests, file uploads, parsed user input) and nowhere else. +- No feature flags for hypothetical futures. Build what's asked, no more. +- Docstrings where understanding requires tracing >2 files or non-local reasoning. + If you had to grep across modules to figure out what a function does, the next + reader will too — document it. Obvious single-file logic needs no comment. + +### Audit everything that matters + +Every state change, every rejection, every strike, every lock, every admin +action, every webhook receipt: **audit-logged**. The audit log is the forensic +tool when something goes wrong. If it's not in the audit log, it didn't happen +and you can't reconstruct it. + +Audit format already in use: `lib.AuditLog(db, &lib.AuditEvent{ Action: ..., Actor: ..., Title: ..., IPAddr: ..., EntryID: ... })`. + +### Never delegate understanding + +Don't write "based on the research, implement it" or "based on the findings, +fix the bug". Those phrases push synthesis onto someone else (or onto a future +agent). Write code that proves you understood: include file paths, line +numbers, and what specifically to change. + +--- + +## Part 2 — Security: the cardinal rules + +The threat model for Clavitor is **a malicious skill that subverts an agent** +into harvesting credentials. Not random brute-forcers. Not script kiddies. The +attacker is already inside the trust envelope — they're an authenticated agent +with a real CVT token. Every defense in the system is built around slowing, +detecting, and stopping that attacker. + +### Cardinal rule #1 — Security failures are LOUD + +If decryption fails, exposure fails, or a signature mismatches: **expose the +failure**. Never fall back to plaintext. Never silently continue. Never return +a partial result that omits the failed item. 
+ +**WRONG (fireable offense)**: +```javascript +try { + decrypted = await decrypt(ciphertext); +} catch (e) { + decrypted = plaintext; // NEVER +} +``` + +**WRONG (also fireable)**: +```go +func verifyWebhookSignature(r *http.Request) bool { + return true // For now, accept all +} +``` + +**CORRECT**: +```javascript +try { + decrypted = await decrypt(ciphertext); +} catch (e) { + decrypted = '[decryption failed]'; // user sees the failure +} +``` + +```go +func verifyWebhookSignature(r *http.Request, body []byte) bool { + secret := os.Getenv("PADDLE_WEBHOOK_SECRET") + if secret == "" { + log.Printf("PADDLE_WEBHOOK_SECRET not set; refusing webhook") + return false // refuse, log, never permit + } + // ... real HMAC verification ... +} +``` + +Failures must be **noisy, visible, and blocking** — never silent, hidden, or +permissive. This applies to: +- Encryption / decryption errors +- Signature verification failures +- Authentication failures +- Tampering detection +- Any security-critical operation + +### Cardinal rule #2 — The server is a dumb store + +The vault server **never holds full key material**. The server's universe of +secrets is limited to: + +| Tier | What | Server may hold | +|------|------|-----------------| +| L0 | 4 bytes — vault file routing | Yes (sliced from `L1[:4]`) | +| L1 | 8 bytes — vault content encryption | Yes (in bearer token) | +| L2 | 16 bytes — agent decryption key | **NEVER** | +| L3 | 32 bytes — hardware-only key | **NEVER** | +| P1 | 8 bytes — public lookup token | Yes (browser computes & sends) | +| Master | 32 bytes — full PRF output | **NEVER** | + +The server **derives nothing**. It receives what the browser sends, validates +shapes, and persists. P1 is computed by the browser via HKDF and sent on the +wire — the server does not know how to derive it because it does not have the +master key. There is no `lib.DeriveP1` function; if you find one, delete it. 
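+
+What the browser computes — and the server merely receives — can be sketched in Go for concreteness. The real implementation is the browser's `auth.js`; `hkdf8` and its empty-salt, single-block HKDF layout are assumptions of this sketch, not the documented parameters:
+
```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"fmt"
)

// hkdf8 derives 8 bytes via HKDF-SHA256 (RFC 5869), assuming an empty
// salt and a single expand block. The authoritative salt/info layout
// lives in auth.js; this is an illustrative sketch only.
func hkdf8(master []byte, info string) []byte {
	// Extract: PRK = HMAC-SHA256(salt = zeros, IKM = master)
	ext := hmac.New(sha256.New, make([]byte, sha256.Size))
	ext.Write(master)
	prk := ext.Sum(nil)
	// Expand: T(1) = HMAC-SHA256(PRK, info || 0x01); keep the first 8 bytes
	exp := hmac.New(sha256.New, prk)
	exp.Write(append([]byte(info), 0x01))
	return exp.Sum(nil)[:8]
}

func main() {
	master := make([]byte, 32)            // raw PRF output (all-zero here, for illustration)
	l1 := master[0:8]                     // bearer token slice — the only key tier sent to the server
	l0 := l1[0:4]                         // vault file routing, sliced from L1
	p1 := hkdf8(master, "clavitor-p1-v1") // public lookup token, computed client-side
	fmt.Printf("l0=%x l1=%x p1=%x\n", l0, l1, p1)
}
```
+
+The point of the sketch is the direction of data flow: every value is computed client-side, and only `l0`, `l1`, and `p1` ever cross the wire.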
+ +The server is allowed to hold L1 because L1 is the bearer token used for +per-entry encryption keys (`DeriveEntryKey`). L1 grants the ability to read +vault content, but **not** the L2/L3 fields: L2 is the agent's own key +(never leaves the C CLI / browser trust anchor), and L3 is a random 32-byte +value the server has never seen in plaintext — it's only present in the WL3 +file, wrapped with the user's PRF. + +### Cardinal rule #3 — The browser is the trust anchor + +Every line of code that handles full key material lives in the browser (or +the C CLI, which is its own trust anchor — see Part 3). The browser: + +- Generates a random 32-byte L3 at vault creation. L3 is pure entropy — not + derived from anything. The PRF's job is to *wrap* L3, not to derive it. +- Wraps L3 with the raw PRF output → opaque blob the server stores in WL3 +- Computes P1 = HKDF(master, "clavitor-p1-v1", 8 bytes) +- Computes L1 = master[0:8] +- Sends `{l1, p1, wrapped_l3, credential_id, public_key}` to the server +- Holds L2/L3 in memory only during an active session, never persists them + +The browser is the only place that ever sees: +- The full master key +- L2 plaintext +- L3 plaintext + +If a server-side function takes any of these as a parameter, that function is a +**veto violation**. Delete it. + +**Exceptionally rare terms:** The strings `master_key`, `L3` (or `l3_plaintext`), +and `P3` should appear almost nowhere in the codebase. If you find yourself +using them, stop. These are browser-only concepts. Their presence outside +`auth.js`, `crypto.js`, or the C CLI is a sign of architectural drift. + +### Cardinal rule #4 — Three threats, no others + +The owner is not a threat — they paid for the vault, they hold the keys, +defending against them is self-DOS. The three threats below are the **primary** +universe Clavitor defends against at the application layer. 
Infrastructure, +supply chain, and social engineering are handled through operational security +practices outside this codebase. + +#### Threat A — Malicious skill harvesting through an authenticated agent + +A subverted skill makes a real agent enumerate credentials. Attacker is +already inside the trust envelope: valid CVT token, IP whitelist passes, +normal-looking client. + +Defenses: +- Per-agent unique-entries quota (default 3/min, 10/hour). Repeats of the same + credential are free; only distinct credentials count. +- Two-strike lockdown: second hour-limit hit within 2h → agent locked, owner + PRF unlock required. State persists in the encrypted agent record; restart + does not reset. +- **Slow-harvest detection:** An agent accessing >12 distinct credentials in any + 24-hour period triggers an immediate audit alert (regardless of rate limits). + This is a dark-red flag — normal agents need at most a dozen credentials. +- L3 fields require keys the server has never seen (Threat B's defenses + apply here too). +- Owner-only endpoints exempt from global rate limit (agents blocked at the + handler entry — rate-limiting that path defends against nothing). + +When reasoning about a rate limit, ask: *agent or owner?* If owner, it's wrong. + +#### Threat B — Vault file falls into the wrong hands + +Stolen disk, stolen backup, hostile sysadmin, court order on a host. The +attacker has every byte the server has. **Result must be: useless ciphertext.** + +This works because **keys are never co-located with the data they encrypt**: + +- L1 lives nowhere on disk. Derived from the user's WebAuthn PRF; the PRF + requires the physical authenticator. No keyfile, no recovery copy. +- L2/L3 plaintext are absent from the server's universe at compile time + (see the hard veto list and Part 4 → Section A). +- WL3 holds `wrapped_l3` = AES-256-GCM(L3, raw PRF). Without the + authenticator, this is uniformly random ciphertext. +- 32-byte AES-256-GCM keys are not brute-forceable. 
~10⁵⁷ years at 10¹²
+  guesses/sec classically. Grover's quantum algorithm provides a square-root
+  speedup, reducing effective security to that of a 128-bit classical cipher —
+  still computationally infeasible (~10¹⁹ years at the same guess rate).
+- Backups use `VACUUM INTO` — same encrypted blobs, same property.
+
+If you ever put L1, L2, L3, master_key, the raw PRF, or any decrypted field
+value into a log line, backup, metric, error message, audit row, or debug
+dump: you have broken Threat B. Stop.
+
+#### Threat C — Agent token falls into the wrong hands
+
+The CVT **client credential** (type 0x01: L2(16) + agent_id(16) + POP(4),
+encrypted with L0) is exfiltrated. The agent stores this locally; at request
+time it derives L1 = L2[:8] and mints a wire token to authenticate.
+
+A stolen credential gives the attacker **the agent's full decryption
+capability**, not just authentication: L2 decrypts L2-flagged fields the
+agent fetches. This is a complete agent impersonation. They are the agent.
+
+Defenses:
+- **Per-agent IP whitelist.** First contact fills the whitelist with the
+  agent's source IP; every subsequent request must come from a listed IP,
+  CIDR, or FQDN. A stolen credential used from anywhere else is refused at
+  `L1Middleware` before any handler runs.
+- Threat A's defenses still apply: unique-entries quota throttles harvest
+  attempts, two-strike lockdown stops sustained patterns. Refused-IP attempts
+  are audit-logged for forensic review.
+- Owner revokes the agent at any time (admin-token gated, PRF tap).
+
+**Gap — Sub-vector C2 (credential theft at creation):** When an agent is
+enrolled, the browser generates L2 and creates the type-0x01 credential. This
+token exists transiently in browser memory before being delivered to the agent.
+A malicious browser extension or XSS could exfiltrate it during this window.
+
+**Status:** Acknowledged.
The token exists in memory while the user is logged in
+(with L3 in sessionStorage), so side-channel delivery (QR, file+scp) doesn't
+eliminate the risk — it just moves it. Full mitigation requires a hardware
+attestation flow not implemented at the current stage. **Deferred until post-April 20**
+as current IP whitelist + rate limits provide sufficient baseline defense.
+Even during this window, L3 fields remain unreachable — L2 does not unlock L3.
+
+A perfect attacker (same IP, same access cadence, same scope of reads) is
+indistinguishable from the legitimate agent until the owner reviews the audit
+log. Whitelist + rate limit + audit are layered on purpose — no single
+defense carries the full load.
+
+### Cardinal rule #5 — WL3 is GDPR-out-of-scope by design
+
+The WL3 corpus (wrapped L3 + lookup metadata) contains **only cryptographic
+material**. No customer ID, no name, no email, no slot ID. The link from a
+credential to a human lives only in the central admin DB. Delete the central
+record → the WL3 file becomes a mathematically orphaned blob with no possible
+link to a person. Functionally anonymous.
+
+This is not a GDPR risk we tolerate. It is a property we get for free because
+we **designed the data out** of the WL3 file.
+
+If you ever add a personal-data field to the WL3 schema, you have just made
+the entire WL3 corpus subject to GDPR deletion requests, retention rules, and
+audit obligations across every POP that mirrors it. **Do not.**
+
+### Cardinal rule #6 — Admin/sync traffic is not public
+
+Anything sensitive between central and POPs runs over **Tailscale**, not
+public 443. The admin endpoints listen only on the Tailscale interface. Public
+clavitor.ai serves users; the ops control plane is invisible to the internet.
+
+App-level HMAC on top of Tailscale is defense-in-depth — if a tailnet node is
+ever compromised, the per-POP shared secret is the second lock. Both layers
+required, neither optional.
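+
+That second lock can be sketched as a plain HMAC-SHA256 sign/verify pair — `signBody`, `verifyBody`, and the hex-encoded signature are illustrative choices for this sketch, not the actual sync protocol:
+
```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// signBody computes the app-level HMAC that rides on top of Tailscale.
// Names and the hex wire format are illustrative, not the codebase's API.
func signBody(secret, body []byte) string {
	m := hmac.New(sha256.New, secret)
	m.Write(body)
	return hex.EncodeToString(m.Sum(nil))
}

// verifyBody refuses (returns false) on any mismatch or malformed input —
// loud and blocking, never permissive.
func verifyBody(secret, body []byte, sig string) bool {
	want, err := hex.DecodeString(sig)
	if err != nil {
		return false // malformed signature: refuse
	}
	m := hmac.New(sha256.New, secret)
	m.Write(body)
	return hmac.Equal(m.Sum(nil), want) // constant-time compare
}

func main() {
	secret := []byte("per-pop-shared-secret") // provisioned out of band
	body := []byte(`{"op":"wl3-sync"}`)
	sig := signBody(secret, body)
	fmt.Println(verifyBody(secret, body, sig))               // true
	fmt.Println(verifyBody(secret, []byte("tampered"), sig)) // false
}
```
+
+Verification refuses on malformed input and uses `hmac.Equal` for constant-time comparison — consistent with cardinal rule #1: failures are refusals, never silent passes.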
+ +### Cardinal rule #7 — Webhooks are signed or refused + +Every inbound webhook (Paddle, future integrations) verifies a cryptographic +signature against a stored secret. If the secret is missing, refuse. If the +signature is missing, refuse. If the signature mismatches, refuse. If the +timestamp is outside ±5 minutes of now (anti-replay), refuse. If the body has +been touched between read and verify, refuse. + +There is no debug mode that disables this. If you need to develop locally +against a webhook, set `PADDLE_WEBHOOK_SECRET=test-secret` and craft a real +signature with the test secret. The verification path is always on. + +--- + +## Part 3 — Per-subproject principles + +Each subproject inherits Parts 1 and 2 in full. The sections below add what's +specific to that subproject — what it owns, what it must never do, where its +boundaries are. + +### `clavis-vault` — the Go vault binary + +**Owns:** +- The SQLite vault file at `./vaults/clavitor-` (one per vault) +- The WL3 credential JSON files at `./WL3//.json` +- All HTTP API surface for entry CRUD, agents, audit, backups +- L1-encrypted entry storage with per-entry HKDF-derived AES keys + +**Must never:** +- Hold L2 or L3 plaintext in memory or on disk +- Receive a full `master_key` over the wire (only `l1`, `p1`, `wrapped_l3`) +- Derive P1 itself — receives from the browser +- Serve admin/sync endpoints on a public listener (Tailscale-only) +- Have functions named `MintCredential`, `ParseCredential`, `CredentialToWire` + (those handle L2 plaintext; if needed at all they live in the browser/CLI) +- Contain `lib.DeriveP1` +- Auto-create `data/` directories — vault files live in `./vaults/` only +- Have `// removed` comments where code used to live +- Have a `_old/` directory (deleted; never recreate) + +**Edition split:** +- Community (default build): no Tailscale, no central, no replication, no + enrollment tokens. Self-hosted, single-node. 
+- Commercial (`-tags commercial`): adds replication, enrollment gating, + central WL3 sync. All commercial-only code lives behind build tags and + inside `edition/commercial.go` (or similar). + +**Per-agent harvester defense:** +- `lib.AgentData.RateLimit` = max unique entries/minute (default 3 at creation) +- `lib.AgentData.RateLimitHour` = max unique entries/hour (default 10) +- Per-minute hit → 429, soft throttle, no strike +- Per-hour hit → 429, strike recorded in `LastStrikeAt` +- Second strike within 2 hours → `Locked = true`, persisted, `423 Locked` on + every subsequent request from that agent until owner PRF-unlocks +- Enforced in handlers that read decrypted credentials: `GetEntry`, + `MatchURL`, `HandleListAlternates`. Any new handler that returns decrypted + credential data MUST call `agentReadEntry(agent, entryID, db, vk)` after + the scope check. + +**Rate-limit exemptions:** +- `/api/entries/batch` is exempt from the global per-(IP, method, path) rate + limit. **The reason is not "owner is trusted"** — the reason is that + `CreateEntryBatch` returns 403 to any agent at the top of the handler, so + this path is owner-only by handler enforcement, and the agent harvester + defense lives elsewhere. If you reintroduce a rate limit on this path, you + will DOS the import flow with no defense gained. There is a comment in + `api/middleware.go` explaining this. Read it before touching. + +### `clavis-cli` — the C+JS command-line client + +**Owns:** +- C wrapper around an embedded JS engine +- `src/cvt.c` — full C implementation of CVT envelope parsing/minting + (`cvt_parse_credential`, `cvt_build_wire`) +- `crypto/crypto.js` — JS crypto via the embedded engine, mirrors the + browser's `crypto.js` exactly +- Local credential storage on the CLI host + +**Must never:** +- Bridge to Go for any cryptographic operation. The CLI is a self-contained + trust anchor. 
If you find yourself wanting to call into the Go vault binary + for crypto, **stop** — implement it in C/JS instead. +- Hold L1 from the user's vault unencrypted on disk — credentials are stored + as type-0x01 CVT tokens, encrypted with L0 +- Send L2 or L3 over the network — only ever sends type-0x00 wire tokens + (L1 + agent_id) to the vault + +**Single source of truth:** +- The CVT envelope format is documented in `src/cvt.h` and implemented in C + there. The Go vault has its own implementation in `lib/cvt.go` for the wire + token (0x00) only. **The L2-bearing client credential type (0x01) is not + implemented in Go and must never be.** Two implementations of the wire + token are tolerated because they're cross-language; the L2 credential + staying single-language is a hard veto. + +### `clavis-vault/cmd/clavitor/web` — the embedded web frontend + +**Owns:** +- `auth.js` — WebAuthn + PRF, master key derivation, P1, L3 wrapping +- `crypto.js` — single source of truth for field encryption (browser side) +- `import.js` + `importers.js` + `importers-parsers.js` — all import parsing + and field encryption (the Go server has zero importer code) +- All views: vault list, entry detail, agents, security, tokens, audit + +**Must never:** +- Send `master_key` to the server (send `l1` + `p1` + `wrapped_l3`) +- Persist L2 or L3 to localStorage or IndexedDB. Master key lives in + `sessionStorage` only — wiped on tab close. +- Hardcode the build version (it must come from `/api/version`, not a `var + BUILD = '...'` constant in `index.html`) +- Use silent fallback on decryption errors (show `[decryption failed]` in + the field, not the encrypted bytes, not the empty string, not nothing) +- Bypass the global `api()` helper for HTTP calls — that helper handles + bearer auth, admin tokens, and 401 redirect uniformly + +**Build process:** +- JS files are embedded into the binary via `go:embed`. Editing JS requires + rebuilding the Go binary. Run `make dev` from `clavis-vault/`. 
The Makefile
  validates JS syntax before building.
- The build fails if JS has a syntax error. Do not commit code that fails
  the syntax check.

### `clavis-chrome`, `clavis-firefox`, `clavis-safari` — browser extensions

**Owns:**
- Form detection and field filling (content scripts)
- Credential picker UI (popup)
- Service worker for vault API calls

**Must never:**
- Store L2 or L3 in any extension storage area (`chrome.storage.local`,
  `chrome.storage.sync`, `chrome.storage.session`). Active session keys live
  in memory only and die on service-worker restart.
- Use a different field encryption than `clavis-crypto` / `crypto.js`. If
  the extension reimplements crypto, it WILL drift and corrupt fields.
- Talk to the vault over HTTP. Always HTTPS, even on localhost (the vault
  serves a self-signed cert in dev).
- Have permissions broader than what's strictly needed. If you ask for
  `<all_urls>` when `https://*/*` would do, fix it.

### `clavis-crypto` — shared crypto primitives

**Owns:**
- The canonical implementation of `encrypt_field`, `decrypt_field`, HKDF
  derivation, AES-GCM, and any other primitive used by browser code, the
  CLI, or extensions.
- Tests that verify both browser-WebCrypto and CLI-BearSSL produce
  bit-identical output for the same inputs. This is the only thing keeping
  the two trust anchors in sync.

**Must never:**
- Diverge between targets. If a primitive behaves differently in WebCrypto
  vs BearSSL, the bug is in `clavis-crypto` and it gets fixed there before
  any caller compensates.

### `clavitor.ai/admin` — central admin / Paddle integration

**Owns:**
- Customer hierarchy (MSP → end-customer → vault slots)
- Subscription state and seat ledger
- Vault registry: `vault_slots` table mapping (customer, slot_index, l0, pop)
- Paddle webhook handler with signature verification
- Enrollment token issuance (commercial)

**Must never:**
- Hold any decryption material. 
Central is not a vault; it's a directory + service. The wrapped L3 stored centrally (for distribution) is opaque to + central. +- Trust an inbound webhook without verifying its HMAC signature. Paddle + webhook verification is mandatory; the secret comes from + `PADDLE_WEBHOOK_SECRET`. If the env var is unset, every webhook is + refused. +- Accept admin operations from outside Tailscale. The `vaults/claim`, + `issue-token`, `wl3/since`, `wl3/full` endpoints listen only on the + tailnet interface. +- Expose the Paddle API key, webhook secret, or any service credentials in + client-side code. Server-side env only. + +**Vault slot lifecycle:** +- Slots are pre-created when a customer subscribes. A Family-5 plan creates + 5 `vault_slots` rows immediately, all `status='unused'`. +- When the owner names a slot ("Anna") and issues a token, the slot moves + to `status='pending'` and the token gets stored on the POP (not central). +- When Anna enrolls, the slot moves to `status='active'` and `l0` is + populated. +- Cancellation/downgrade flows mark slots `status='archived'`. The vault + data and WL3 file persist (we never delete) until an explicit GDPR + request triggers a one-shot deletion script. + +### `clavis-android` / `clavis-ios` — mobile clients + +**Owns:** +- Native form-fill integration with platform autofill APIs +- Local credential picker UI + +**Must never:** +- Implement crypto natively. Crypto goes through `clavis-crypto` (compiled + for the platform) or an embedded JS engine running the same `crypto.js`. + Two crypto implementations on the same platform is a guaranteed drift. +- Persist L2 or L3 to platform keychains/keystores. The session key lives + in memory; biometric unlock re-derives via PRF from the platform's + WebAuthn equivalent. + +### `clavis-telemetry` — operator telemetry + +**Owns:** +- Heartbeat metrics from POPs to central +- Per-POP system metrics (CPU, memory, disk, vault count) + +**Must never:** +- Send any vault content. 
Telemetry is operational, not data.
- Send raw IP addresses of users. Aggregate counts only.
- Run in community edition by default. Community is offline-by-default;
  telemetry is commercial-only and opt-in.

---

## Part 4 — Daily review checklist

**Run this every morning. Any check that fails is a foundation alert — fix
before any new feature work.**

**Methodology evolution:** These checks were historically grep-driven for
reproducibility. With LLM analysis available, the daily reviewer must run a
**full LLM audit of all changed source files** to catch semantic violations,
not just pattern matches. Grep catches the letter; LLM catches the intent.

**Procedure:**
1. Run the automated tests (the test posture checks below)
2. Run LLM analysis on all files changed since last review
3. Address any findings before new feature work

### LLM Audit Prompt

Use this prompt with your LLM code review tool:

```
Review the following files for violations of CLAVITOR-PRINCIPLES.md:
- Cardinal Rule #1: Security failures must be LOUD (no silent fallbacks)
- Cardinal Rule #2: Server never holds L2/L3/master_key
- Cardinal Rule #3: Browser is the trust anchor
- Cardinal Rule #4: Defenses against Threats A, B, C
- Part 1: KISS, DRY, no migrations, no backwards-compat shims

Focus on:
1. Security: Any key material (L1, L2, L3, master_key, PRF) flowing to logs,
   errors, or responses
2. Silent fallbacks: catch blocks that don't expose failures to users
3. DRY violations: Similar logic that should be unified
4. Trust boundaries: Server-side code that derives or handles full keys

Report any violations with file paths and line numbers.
```

### Section A — Server hard vetos

**A1. 
The server never receives `master_key` over the wire.**
```bash
#!/bin/bash
# Verify server never sees full master key
matches=$(grep -r "master_key\|MasterKey\|masterKey" clavis-vault/api/ clavis-vault/lib/ --include="*.go" 2>/dev/null | grep -v "_test.go" | wc -l)
if [ "$matches" -gt 0 ]; then
  echo "❌ FAIL: master_key handling found on server ($matches matches)"
  grep -r "master_key\|MasterKey\|masterKey" clavis-vault/api/ clavis-vault/lib/ --include="*.go" | grep -v "_test.go"
  exit 1
else
  echo "✅ PASS: No master_key handling on server"
fi
```

**A2. `lib.DeriveP1` does not exist on the server.**
```bash
#!/bin/bash
# Check for forbidden P1 derivation on server
if grep -rn "DeriveP1\|derive_p1\|deriveP1" clavis-vault/lib/ clavis-vault/api/ 2>/dev/null; then
  echo "❌ FAIL: DeriveP1 found on server — P1 must be browser-only"
  exit 1
else
  echo "✅ PASS: No DeriveP1 on server"
fi
```

**A3. L2-handling CVT functions do not exist on the server.**
```bash
#!/bin/bash
# Check for forbidden L2 credential handling
if grep -rn "MintCredential\|ParseCredential\|CredentialToWire" \
    clavis-vault/api/ clavis-vault/lib/ 2>/dev/null; then
  echo "❌ FAIL: L2 credential functions found on server"
  exit 1
else
  echo "✅ PASS: No L2 credential functions on server"
fi
```
Expected: zero matches. The Go CVT module only handles type-0x00 wire tokens
(L1, agent_id). The L2-bearing type-0x01 is C-only in `clavis-cli/src/cvt.c`.

**A4. `AgentData` has no L2 field.**
```
grep -n "L2 *\[\]byte" clavis-vault/lib/types.go
```
Expected: zero matches. (Field-level `L2 bool` flag on `VaultField` is fine —
that's metadata, not key material.)

**A5. WL3 JSON contains no PII.**
```
grep -n "customer_id\|name\|email\|slot_id" clavis-vault/lib/wl3.go
```
Expected: zero matches in field definitions of `WL3Entry`. Adding any of these
makes the entire WL3 corpus GDPR-scoped.

**A6. 
Keys are never logged or audited.** +``` +grep -rn "log\.\|fmt\.Print\|AuditEvent" clavis-vault/api/ clavis-vault/lib/ \ + | grep -iE "l1Key|master|wrappedL3|wrapped_l3|prf|p1Bytes" +``` +Expected: zero matches. Threat B violation if any key material reaches a log, +metric, error message, or audit row — even in hex, even in debug builds. + +**A7. Per-agent IP whitelist enforcement is on.** +``` +grep -n "AgentIPAllowed\|allowed_ips\|AllowedIPs" clavis-vault/api/middleware.go +``` +Expected: matches in `L1Middleware` showing first-contact fill + enforcement. +This is Threat C's primary defense. If the check is missing or commented out, +a stolen agent token works from anywhere on the internet. + +### Section B — Silent fallback + +**B1. No silent decryption fallback in browser code.** +``` +grep -rn "catch.*plaintext\|catch.*ciphertext\|catch.*=.*null" \ + clavis-vault/cmd/clavitor/web/ clavis-chrome/ clavis-firefox/ +``` +Expected: any match must be reviewed. The acceptable pattern is +`catch (e) { result = '[decryption failed]'; }` — visible to the user. + +**B2. No `return true` in security check stubs.** +``` +grep -rn "return true.*//.*TODO\|return true.*//.*for now\|return true.*//.*accept" \ + clavitor.ai/admin/ clavis-vault/ +``` +Expected: zero matches. If you see one, you've found a SECURITY.md violation. +Implement it or delete the call site. + +**B3. No commented-out security checks.** +``` +grep -rn "// *if !verify\|// *if !auth\|// *return.*Forbidden" \ + clavis-vault/ clavitor.ai/admin/ +``` +Expected: zero matches. + +**B4. No silent fallbacks in catch blocks (manual LLM review required).** + +Grep catches patterns, not intent. All catch blocks in decryption/security paths +must be manually reviewed to ensure they don't silently swallow errors. The LLM +must verify: no `catch (e) { return data; }`, no `catch { decrypted = ""; }`, +no `finally` blocks that restore old values without logging. + +### Section C — Test fixture security + +**C0. 
Test fixture key material is obviously fake.** + +Test data containing key material (L2, L3, master_key, PRF output): +- Must be instantly recognizable as test data (not production keys) +- **Pattern:** 32 identical bytes: `bytes.Repeat([]byte{0xAB}, 32)` or `{0x00}`, `{0xFF}`, `{0xDE}, {0xAD}` +- **Never** use random-looking values that could be mistaken for real keys +- **Never** derive test keys from production material +- **Never** commit binary test fixtures (always text/SQL that shows the pattern) + +Example: +```go +// GOOD - obviously fake +testL3 := bytes.Repeat([]byte{0xAB}, 32) +testMaster := bytes.Repeat([]byte{0x00}, 32) + +// BAD - looks like a real key +testL3 := []byte{0x4a, 0x9f, 0x2c, ...} // 32 random bytes +``` + +### Section D — Foundation drift + +**C1. No `_old/` directory.** +``` +find . -type d -name "_old" +``` +Expected: zero results. Dead code is deleted, not parked. + +**C2. No env vars for fundamental paths.** +``` +grep -rn "os.Getenv(\"DATA_DIR\")\|os.Getenv(\"WL3_DIR\")" clavis-vault/lib/ +``` +Expected: zero matches. These are hardcoded in `lib/config.go` to `vaults` +and `WL3` respectively. Symlink if you want different paths. + +**C3. No migration code.** +``` +grep -rn "ALTER TABLE\|migration\|Migration" clavis-vault/lib/dbcore.go \ + | grep -v "// " +``` +Expected: matches only inside `MigrateDB` for the existing `alternate_for` +and `verified_at` columns (legacy from earlier work). No NEW `ALTER TABLE` +should appear during early-stage development. + +**C4. Hardcoded build constants in HTML.** +``` +grep -rn "var BUILD = '" clavis-vault/cmd/clavitor/web/ +``` +Expected: zero matches. Build version must come from `/api/version` at +runtime, not from a hardcoded constant that goes stale. + +### Section E — DRY violations + +**E1. 
Two functions doing the same thing.**

This one is a judgment call, not greppable, but the daily reviewer should look
for repeated patterns of:
- read row from DB → unpack → mutate one field → repack → write
- HTTP handler boilerplate (admin check, decode body, validate, audit, respond)
- Crypto primitive wrappers that differ only by info string

If you see the third copy of any pattern, **stop and refactor** before adding
the fourth.

The canonical example to model after: `lib/dbcore.go:agentMutate` — replaced
three near-identical functions with a higher-order helper. Net negative line
count, easier to add new mutations, single point of correctness.

**E2. Two crypto implementations of the same primitive.**

Browser code uses `clavis-vault/cmd/clavitor/web/crypto.js`. CLI uses
`clavis-cli/crypto/crypto.js`. These should be **byte-identical**. If they
diverge, the encrypted output diverges, and a credential created in one
client becomes unreadable in the other.

```
diff clavis-vault/cmd/clavitor/web/crypto.js clavis-cli/crypto/crypto.js
```

Expected: zero diff (or, if intentionally different for a target-specific
adapter, the difference is documented in `clavis-crypto/README.md`).

### Section F — Threat model alignment

**F1. Per-agent rate limits have safe defaults.**
```
grep -n "RateLimit *: *3\|RateLimit *= *3\|RateLimitMinute.*3" \
  clavis-vault/api/handlers.go
grep -n "RateLimitHour *: *10\|RateLimitHour *= *10" \
  clavis-vault/api/handlers.go
```
Expected: matches in `HandleCreateAgent`. If the defaults drift (e.g. someone
changes them to 0 = unlimited), the harvester defense is silently disabled
for new agents.

**F2. `agentReadEntry` is called from every credential-read handler.**

For each handler that returns decrypted credential data, the call is
mandatory. 
The current set:
- `GetEntry`
- `MatchURL`
- `HandleListAlternates`

If you add a new handler that returns decrypted credentials and forget to
call `agentReadEntry`, you have a harvester hole. Verify by grepping:
```
grep -n "AgentCanAccess\|agentReadEntry" clavis-vault/api/handlers.go
```
Every credential-read handler must call both `AgentCanAccess` (before accessing
data) and `agentReadEntry` (after decryption, before returning). Verify both
functions appear in the handler code path.

**F3. `/api/entries/batch` is exempt from global rate limit.**
```
grep -A2 "/api/entries/batch" clavis-vault/api/middleware.go
```
Expected: a `next.ServeHTTP(w, r); return` block with the explanatory
comment intact. If the comment is missing, the next reviewer will reintroduce
the rate limit and break import.

### Section G — Test posture

**G1. All tests pass.**
```
cd clavis-vault && go test ./lib/... ./api/...
cd clavitor.ai/admin && go test .
```
Expected: all green. **Failing tests are foundation alerts**, not "tomorrow"
problems.

**G2. JS syntax check passes.**
```
cd clavis-vault && make build  # validates JS as part of build
```
Expected: no `JS syntax error` in output. JS files are embedded in the binary;
syntax errors break the running app silently in some browsers.

**G3. New code has tests.**

Not greppable. Judgment call. The rule: if you added a new function with
non-trivial logic, there should be a test for it in the same commit. The
tests model what the function is supposed to do. If you can't write a test
because "it's hard to test", the function is probably too coupled — refactor.

### Section H — Dead code elimination

**H1. No empty directories.**
```bash
find . -type d -empty | grep -v ".git" | grep -v "vendor"
```
Expected: zero results. Empty directories are deleted.

**H2. 
No orphaned HTML/JS/CSS files (with exceptions).**
```bash
# Find HTML files not referenced in Go code
# Exceptions: design-system/, marketing/, test-*.html (dev/test files)
for f in $(find . -name "*.html" | grep -v vendor | grep -v "design-system" | grep -v "marketing" | grep -v "test-"); do
  name=$(basename "$f")
  if ! grep -r "$name" --include="*.go" . 2>/dev/null | grep -q .; then
    echo "Orphaned: $f"
  fi
done
```
Review any matches — may be standalone pages or may be dead.

**Exceptions (intentionally not checked):**
- `design-system/` — standalone design documentation, not embedded
- `marketing/` — marketing assets, separate from application
- `test-*.html` — development/test files (manual review for cleanup)

**H3. No unreachable Go functions (requires go-dead-code or similar tool).**
Manual check: review private functions (lowercase) with no callers in the same
package. Public functions with no importers are also suspicious — verify they
are API endpoints or intentionally exported.

**H4. No commented-out blocks >5 lines.**
```bash
grep -rn "^\s*//\s*\w" --include="*.go" --include="*.js" . | \
  awk -F: '{print $1}' | sort | uniq -c | sort -rn | head -20
```
Review files with many commented lines — likely dead code.

---

## Part 5 — How to add a new principle

When a new lesson surfaces (today's "rate limit is for the agent, not the
owner" was one), add it here in the same session:

1. Pick the right section: General (Part 1), Cardinal (Part 2), or
   subproject-specific (Part 3).
2. State the principle in one sentence at the top of the entry.
3. Explain WHY in 1–3 sentences. The reasoning matters more than the rule —
   future reviewers need it to apply the principle correctly to new
   situations.
4. If the principle is auditable, add a grep check to Part 4.
5. Reference the incident or insight that surfaced the principle (commit
   hash, issue number, or session date).

Principles are additive. 
Don't delete old ones to make room — they exist +because something went wrong once. If a principle is genuinely obsolete: + +**Deprecation format:** +```markdown +### **DEPRECATED** — Original Principle Name +**Deprecated:** 2026-04-08 +**Reason:** Early stage ended — migration code now required +**Authorized by:** Johan +**Replacement:** See "Schema migrations" in Part 8 (when written) + +[original principle text preserved below...] +``` + +- Deprecated principles remain in place for 90 days, then move to Appendix A +- A deprecated principle's grep checks in Part 4 are commented out with reason +- Deprecation requires explicit user approval (state: "Deprecating principle X + because Y. Authorized by you?") + +--- + +## Part 6 — When you have to violate a principle + +You can't always honor every principle. Sometimes the principles themselves +conflict. When that happens: + +1. **Surface the conflict to the user.** Don't choose silently. +2. **Pick the principle with the higher consequence.** Security vetos beat + DRY. Foundation rules beat KISS. The Cardinal Rules in Part 2 beat + everything else. +3. **Document the violation in the code itself**, with a comment explaining + what principle was broken, why, and what would have to be true for the + violation to be reversed. +4. **Add a TODO to the daily checklist** so the violation is reviewed + tomorrow and either fixed or formally accepted. + +Example of an acceptable, documented violation: the IP whitelist first-contact +race in `api/middleware.go` — there's a 30-line comment explaining that the +race is theoretically possible, why it cannot be reproduced in practice, and +why fixing it would require complexity disproportionate to the threat. The +violation is acknowledged, scoped, and dated. + +### Escalation and post-hoc discovery + +**User disagreement:** If the user insists on a lower-consequence principle over a higher one +(e.g., speed over security), escalate to project owner (Johan) before proceeding. 
+Document the override in the PR description. + +**Post-hoc discovery:** If you find a committed violation during daily review: +1. Assess if it's actively dangerous (security veto) — if yes, escalate immediately +2. If not dangerous, add to TODO and address in current sprint +3. Never leave a violation undocumented for >24 hours + +**Equal consequence conflicts:** When two principles of equal tier conflict, prefer the +one that preserves user agency and data integrity. + +**Severe security violations:** For critical security breaches (harvester detected, +token compromise, unauthorized data exfiltration): +- Immediate escalation to Johan +- Consider permanent ban of the agent and/or **all agents running on that model** +- Quarantine affected vault(s) until forensic review complete +- Legal/compliance review if customer data was accessed + +--- + +## Part 7 — Incident Response + +When a security event is detected, the system and operators respond according +to these procedures. Every response action is audit-logged. + +### Agent compromise detection (Threat A) + +When an agent triggers harvester defenses: + +1. **Rate limit hit (per-minute)** → 429 Too Many Requests + - Agent receives: `"Rate limit exceeded — slow down"` + - Audit log: `action=agent_rate_limit_minute, actor=agent_id, entry_id=requested_entry` + +2. **Hour-limit hit (first strike)** → 429 + strike recorded + - Agent receives: `"Rate limit exceeded — warning recorded"` + - Audit log: `action=agent_strike_1, actor=agent_id, last_strike_at=now` + - Agent remains functional + +3. **Second strike within 2 hours** → Agent locked + - Agent receives: `423 Locked` with message: + ``` + "This agent has been locked due to suspicious activity. + Contact the vault owner to restore access." + ``` + - Audit log: `action=agent_locked, actor=agent_id, reason=second_strike` + - All subsequent requests from this agent return 423 until unlocked + +4. 
**Owner unlock procedure** + - Owner performs PRF tap (WebAuthn assertion) + - POST `/api/agents/{id}/unlock` with admin token + - Agent's `Locked=false`, `LastStrikeAt=null`, `StrikeCount=0` + - Audit log: `action=agent_unlocked, actor=owner_prf, entry_id=agent_id` + - Agent may retry from clean slate + +### Post-incident review + +After any agent lock event: + +1. Review audit logs for the 2-hour window before lock +2. Identify accessed entries and their sensitivity (scope classification) +3. If any financial/admin credentials accessed: force rotation of those credentials +4. Document findings in incident tracker (outside this codebase) + +### Token compromise (Threat C) + +When a stolen agent token is used from an unauthorized IP: + +1. **IP whitelist rejection** → 403 Forbidden + - Response: `"Request from unauthorized IP — token may be compromised"` + - Audit log: `action=agent_ip_rejected, actor=agent_id, ip_addr=source_ip, allowed_ips=[whitelist]` + +2. **Owner notification** (commercial edition) + - Alert POST to central: `alert=potential_token_theft, agent_id, source_ip, timestamp` + - Central emails owner with instructions to revoke agent + +3. **Revocation** + - Owner performs PRF tap + - DELETE `/api/agents/{id}` with admin token + - All tokens for this agent invalidated + - Audit log: `action=agent_revoked, actor=owner_prf, reason=token_compromise_suspected` + +### Emergency procedures + +**Owner locked out (lost authenticator):** +- No self-service recovery. Contact support with identity verification. +- Support triggers `vault_recover` procedure (offline, requires two operators). +- New vault created, credentials imported from backups (if available). + +**Complete vault compromise suspected:** +- Rotate all credentials in vault immediately +- Archive vault, create new vault with new L0 +- Investigation mode: all agent access suspended until root cause identified + +--- + +*Foundation First. No mediocrity. 
Ever.* diff --git a/clavis/clavis-vault/lib/errors.go b/clavis/clavis-vault/lib/errors.go new file mode 100644 index 0000000..f64e9c4 --- /dev/null +++ b/clavis/clavis-vault/lib/errors.go @@ -0,0 +1,616 @@ +package lib + +import ( + "bytes" + "context" + "encoding/json" + "fmt" + "log" + "net/http" + "os" + "sync" + "time" +) + +// --------------------------------------------------------------------------- +// Severity Levels +// --------------------------------------------------------------------------- + +type Severity int + +const ( + SeverityDebug Severity = iota // Development/debug only + SeverityInfo // Informational, no action needed + SeverityWarning // Anomaly detected, monitoring required + SeverityError // User-impacting issue + SeverityCritical // System down / security breach +) + +func (s Severity) String() string { + switch s { + case SeverityDebug: + return "debug" + case SeverityInfo: + return "info" + case SeverityWarning: + return "warning" + case SeverityError: + return "error" + case SeverityCritical: + return "critical" + default: + return "unknown" + } +} + +// --------------------------------------------------------------------------- +// Error/Event Definition Registry +// --------------------------------------------------------------------------- + +// EventDef defines a class of events in the registry. +// This is the single source of truth for all errors and warnings. +type EventDef struct { + Code string // Unique code: ERR-12345 or WRN-12345 + Category string // auth, database, network, security, etc. + Severity Severity // How serious + Message string // User-facing message (for errors shown to users) + Description string // Internal description for operators + AutoResolve bool // True if system can auto-resolve (e.g., retry succeeded) + CreateTicket bool // True if this creates a trackable incident +} + +// EventRegistry is the canonical list of all events. +// Errors = user-impacting and create tickets. 
+// Warnings = anomalies that may indicate future issues. +var EventRegistry = map[string]EventDef{ + // WARNINGS (WRN-1xxxx) - Anomalies, not yet user-impacting + "WRN-10001": { + Code: "WRN-10001", + Category: "performance", + Severity: SeverityWarning, + Description: "Response time >500ms for credential lookup", + AutoResolve: true, + CreateTicket: false, // Log only, no ticket unless sustained + }, + "WRN-10002": { + Code: "WRN-10002", + Category: "security", + Severity: SeverityWarning, + Description: "Agent rate limit at 80% of threshold", + AutoResolve: true, + CreateTicket: false, + }, + "WRN-10003": { + Code: "WRN-10003", + Category: "storage", + Severity: SeverityWarning, + Description: "Disk usage >80% on vault storage", + AutoResolve: false, + CreateTicket: true, // Creates ticket for ops to expand storage + }, + "WRN-10004": { + Code: "WRN-10004", + Category: "network", + Severity: SeverityWarning, + Description: "TLS handshake latency elevated", + AutoResolve: true, + CreateTicket: false, + }, + + // AUTH ERRORS (ERR-1xxxx) + "ERR-10001": { + Code: "ERR-10001", + Category: "auth", + Severity: SeverityError, + Message: "Authentication failed. Please check your credentials and try again.", + Description: "WebAuthn challenge verification failed", + AutoResolve: false, + CreateTicket: false, // User error, not system error + }, + "ERR-10002": { + Code: "ERR-10002", + Category: "auth", + Severity: SeverityError, + Message: "Access denied. You don't have permission for this operation.", + Description: "Actor attempted operation outside their scope", + AutoResolve: false, + CreateTicket: false, + }, + "ERR-10003": { + Code: "ERR-10003", + Category: "auth", + Severity: SeverityError, + Message: "Invalid agent token. 
The agent may need to be re-enrolled.", + Description: "CVT token validation failed", + AutoResolve: false, + CreateTicket: true, + }, + "ERR-10004": { + Code: "ERR-10004", + Category: "security", + Severity: SeverityCritical, + Message: "Agent locked due to suspicious activity. Contact the vault owner.", + Description: "Agent triggered harvester defenses (two-strike lockdown)", + AutoResolve: false, + CreateTicket: true, + }, + "ERR-10005": { + Code: "ERR-10005", + Category: "security", + Severity: SeverityCritical, + Message: "Request from unauthorized IP. Token may be compromised.", + Description: "Agent request from IP not in whitelist", + AutoResolve: false, + CreateTicket: true, + }, + + // INPUT ERRORS (ERR-2xxxx) + "ERR-20001": { + Code: "ERR-20001", + Category: "input", + Severity: SeverityError, + Message: "Invalid request. Please check your input and try again.", + Description: "JSON parse failed or required field missing", + AutoResolve: false, + CreateTicket: false, + }, + "ERR-20002": { + Code: "ERR-20002", + Category: "input", + Severity: SeverityError, + Message: "Invalid ID format. The requested item may not exist.", + Description: "Entry/agent ID parsing failed", + AutoResolve: false, + CreateTicket: false, + }, + + // NOT FOUND ERRORS (ERR-3xxxx) + "ERR-30001": { + Code: "ERR-30001", + Category: "not_found", + Severity: SeverityError, + Message: "Vault not found. Please register or check your vault path.", + Description: "No vault exists at the requested path", + AutoResolve: false, + CreateTicket: false, + }, + "ERR-30002": { + Code: "ERR-30002", + Category: "not_found", + Severity: SeverityError, + Message: "Entry not found. It may have been deleted.", + Description: "Requested entry ID does not exist", + AutoResolve: false, + CreateTicket: false, + }, + + // SYSTEM ERRORS (ERR-5xxxx) - These POST to central + "ERR-50001": { + Code: "ERR-50001", + Category: "database", + Severity: SeverityCritical, + Message: "Service temporarily unavailable. 
Our team has been alerted and is working on it. Reference: ERR-50001.", + Description: "Vault database connection failed or file inaccessible", + AutoResolve: false, + CreateTicket: true, + }, + "ERR-50002": { + Code: "ERR-50002", + Category: "database", + Severity: SeverityError, + Message: "Failed to save data. Please try again in a moment.", + Description: "Database write operation failed", + AutoResolve: true, + CreateTicket: true, + }, + "ERR-50003": { + Code: "ERR-50003", + Category: "storage", + Severity: SeverityCritical, + Message: "Service temporarily unavailable. Our team has been alerted. Reference: ERR-50003.", + Description: "WL3 credential storage write failed", + AutoResolve: false, + CreateTicket: true, + }, + "ERR-50004": { + Code: "ERR-50004", + Category: "network", + Severity: SeverityError, + Message: "Connection issue. Please try again.", + Description: "Failed to connect to central admin for sync", + AutoResolve: true, + CreateTicket: true, + }, + + // RATE LIMITING (ERR-6xxxx) + "ERR-60001": { + Code: "ERR-60001", + Category: "rate_limit", + Severity: SeverityError, + Message: "Too many requests. Please slow down and try again in a moment.", + Description: "Global per-IP rate limit exceeded", + AutoResolve: true, + CreateTicket: false, + }, + "ERR-60002": { + Code: "ERR-60002", + Category: "rate_limit", + Severity: SeverityWarning, + Message: "Agent rate limit warning. Reduce request frequency.", + Description: "Per-agent unique-entry quota at 90%", + AutoResolve: true, + CreateTicket: false, + }, + + // INTERNAL ERRORS (ERR-9xxxx) - Invariant violations + "ERR-90001": { + Code: "ERR-90001", + Category: "invariant", + Severity: SeverityCritical, + Message: "An unexpected error occurred. Please try again or contact support. 
Reference: ERR-90001.", + Description: "Condition assumed impossible was triggered", + AutoResolve: false, + CreateTicket: true, + }, +} + +// --------------------------------------------------------------------------- +// Incident Tracking - Centralized Error Management +// --------------------------------------------------------------------------- + +// CentralEvent is sent to clavitor.ai for every ticket-creating event. +// No deduplication - "let it rain". Central handles aggregation if needed. +type CentralEvent struct { + EventID string `json:"event_id"` // UUID generated locally + Code string `json:"code"` // ERR-50001, etc. + Category string `json:"category"` // database, network, etc. + Severity string `json:"severity"` + Resource string `json:"resource"` // uk1, db-primary, etc. + POP string `json:"pop"` // Which POP reported this + Operation string `json:"operation"` // GetEntry, agent unlock, etc. + ErrorDetail string `json:"error_detail"` // The actual error message + Actor string `json:"actor"` // web, agent:abc123 + Timestamp int64 `json:"timestamp"` + UserMessage string `json:"user_message"` // Shown to users +} + +// CentralClient posts events to central (clavitor.ai). +// Only created in commercial edition; community edition logs locally only. +type CentralClient struct { + centralURL string + popID string + apiKey string +} + +var ( + globalCentralClient *CentralClient + onceCentral sync.Once +) + +// InitCentralClient creates the global central reporter. +// Called at startup from main.go. +func InitCentralClient(centralURL, popID, apiKey string) { + onceCentral.Do(func() { + globalCentralClient = &CentralClient{ + centralURL: centralURL, + popID: popID, + apiKey: apiKey, + } + }) +} + +// postEvent sends an event to central asynchronously. +// Every event is sent individually - no local deduplication. 
+func (c *CentralClient) postEvent(ev CentralEvent) { + if c == nil { + return // Community edition - no central reporting + } + + payload, _ := json.Marshal(ev) + req, err := http.NewRequest("POST", c.centralURL+"/v1/events", bytes.NewReader(payload)) + if err != nil { + log.Printf("[ERROR] Failed to create event request: %v", err) + return + } + req.Header.Set("Authorization", "Bearer "+c.apiKey) + req.Header.Set("Content-Type", "application/json") + + client := &http.Client{Timeout: 10 * time.Second} + resp, err := client.Do(req) + if err != nil { + log.Printf("[ERROR] Failed to post event to central: %v", err) + return + } + defer resp.Body.Close() + + if resp.StatusCode != http.StatusOK && resp.StatusCode != http.StatusCreated { + log.Printf("[ERROR] Central rejected event: %d", resp.StatusCode) + } +} + +// --------------------------------------------------------------------------- +// Event Logging - The main interface for handlers +// --------------------------------------------------------------------------- + +// EventContext holds all the searchable fields for an event. +type EventContext struct { + Code string // ERR-50001, WRN-10001, etc. + Category string // auth, database, network, security + Resource string // Which resource: uk1, db-primary, vault-file-123 + Operation string // What was being attempted: "GetEntry", "agent unlock" + Actor string // web, agent:abc123, extension + ErrorDetail string // The actual error: "connection refused", "disk full" + IPAddr string // Client IP + Severity Severity +} + +// LogEvent is the main entry point for all events (errors and warnings). +// This replaces direct AuditLog calls for error cases. 
+func LogEvent(db *DB, ctx context.Context, ec EventContext) {
+	def, ok := EventRegistry[ec.Code]
+	if !ok {
+		log.Printf("[ERROR] Unknown event code: %s", ec.Code)
+		// Preserve the unknown code in the detail before reclassifying,
+		// so the original call site stays traceable in the audit log.
+		ec.ErrorDetail = fmt.Sprintf("unknown code %s: %s", ec.Code, ec.ErrorDetail)
+		def = EventRegistry["ERR-90001"]
+		ec.Code = "ERR-90001"
+		ec.Severity = SeverityCritical
+	}
+
+	// Log to local audit (searchable fields)
+	// Title contains structured data: resource, error, operation
+	// Searchable: grep "resource=uk1" audit.log
+	auditTitle := fmt.Sprintf("op=%s | resource=%s | error=%s | %s",
+		ec.Operation, ec.Resource, ec.ErrorDetail, def.Description)
+	AuditLog(db, &AuditEvent{
+		Action: ec.Code,
+		Actor:  ec.Actor,
+		Title:  auditTitle,
+		IPAddr: ec.IPAddr,
+	})
+
+	// Log to operator logs (human readable)
+	log.Printf("[%s] %s | resource=%s | op=%s | actor=%s | error=%s | severity=%s",
+		ec.Code,
+		def.Description,
+		ec.Resource,
+		ec.Operation,
+		ec.Actor,
+		ec.ErrorDetail,
+		def.Severity.String(),
+	)
+
+	// Post to central if ticket-creating event
+	// "Let it rain" - every event is sent individually, no deduplication
+	if def.CreateTicket && globalCentralClient != nil {
+		go globalCentralClient.postEvent(CentralEvent{
+			EventID:     generateEventID(),
+			Code:        ec.Code,
+			Category:    def.Category,
+			Severity:    def.Severity.String(),
+			Resource:    ec.Resource,
+			POP:         globalCentralClient.popID,
+			Operation:   ec.Operation,
+			ErrorDetail: ec.ErrorDetail,
+			Actor:       ec.Actor,
+			Timestamp:   time.Now().Unix(),
+			UserMessage: def.Message,
+		})
+	}
+}
+
+// generateEventID creates a timestamp-based identifier. Full nanosecond
+// precision (rather than a modulo-1000 suffix) keeps IDs distinct when
+// events arrive in bursts.
+func generateEventID() string {
+	now := time.Now()
+	return fmt.Sprintf("EVT-%d-%d", now.Unix(), now.UnixNano())
+}
+
+// LogWarning for non-critical anomalies.
+func LogWarning(db *DB, ctx context.Context, code, resource, operation, detail, actor, ip string) { + LogEvent(db, ctx, EventContext{ + Code: code, + Category: "performance", // default, override via registry + Resource: resource, + Operation: operation, + Actor: actor, + ErrorDetail: detail, + IPAddr: ip, + Severity: SeverityWarning, + }) +} + +// LogError for user-impacting issues. +func LogError(db *DB, ctx context.Context, code, resource, operation, detail, actor, ip string) { + LogEvent(db, ctx, EventContext{ + Code: code, + Category: "system", // default, override via registry + Resource: resource, + Operation: operation, + Actor: actor, + ErrorDetail: detail, + IPAddr: ip, + Severity: SeverityError, + }) +} + +// --------------------------------------------------------------------------- +// HTTP Response Helpers +// --------------------------------------------------------------------------- + +// ErrorResponse sends a user-facing error with code. +// Use this for API responses to clients. +func ErrorResponse(w http.ResponseWriter, status int, code string) { + def, ok := EventRegistry[code] + if !ok { + def = EventDef{ + Code: code, + Message: "An error occurred. Please try again.", + } + } + + w.Header().Set("Content-Type", "application/json") + w.WriteHeader(status) + + response := map[string]string{ + "error": code, + "message": def.Message, + } + + // For critical/system errors, add incident reference if available + if def.Severity >= SeverityError && def.CreateTicket { + response["reference"] = code // User can reference this when contacting support + response["status"] = "Our team has been alerted" + } + + json.NewEncoder(w).Encode(response) +} + +// HandleError is the full-flow helper for handlers. +// Logs event, posts to central if needed, and returns user response. 
+func HandleError(
+	w http.ResponseWriter,
+	r *http.Request,
+	db *DB,
+	code string,
+	resource string,
+	operation string,
+	internalErr error,
+	httpStatus int,
+) {
+	ctx := r.Context()
+
+	// Build actor from context (set by middleware)
+	actor := "unknown"
+	if a, ok := ctx.Value("actor").(string); ok {
+		actor = a
+	}
+
+	// Guard against a nil error so HandleError cannot panic when a caller
+	// reports a failure path without an underlying error value.
+	detail := ""
+	if internalErr != nil {
+		detail = internalErr.Error()
+	}
+
+	// Log the full event
+	LogEvent(db, ctx, EventContext{
+		Code:        code,
+		Resource:    resource,
+		Operation:   operation,
+		Actor:       actor,
+		ErrorDetail: detail,
+		IPAddr:      r.RemoteAddr,
+	})
+
+	// Return user-facing response
+	ErrorResponse(w, httpStatus, code)
+}
+
+// LookupEvent returns the event definition for documentation.
+func LookupEvent(code string) (EventDef, bool) {
+	def, ok := EventRegistry[code]
+	return def, ok
+}
+
+// ListEventsByCategory returns all events in a category.
+func ListEventsByCategory(category string) []EventDef {
+	var results []EventDef
+	for _, def := range EventRegistry {
+		if def.Category == category {
+			results = append(results, def)
+		}
+	}
+	return results
+}
+
+// ---------------------------------------------------------------------------
+// Status Page Integration (for central/clavitor.ai)
+// ---------------------------------------------------------------------------
+
+// StatusPageEntry is returned by central's /status endpoint.
+type StatusPageEntry struct {
+	Component   string `json:"component"`             // uk1, db-primary, etc.
+	Status      string `json:"status"`                // operational, degraded, down
+	IncidentID  string `json:"incident_id,omitempty"`
+	UpdatedAt   int64  `json:"updated_at"`
+	UserMessage string `json:"user_message,omitempty"`
+}
+
+// IsResourceAffected queries central for recent events on a resource.
+// This queries central rather than local cache (no local deduplication).
+func IsResourceAffected(resource string) bool { + // TODO: Query central /v1/events?resource=uk1&since=5m + // For now, returns false - central is source of truth + return false +} + +// GetStatusForResource returns current status by querying central. +func GetStatusForResource(resource string) (StatusPageEntry, bool) { + // TODO: Query central /v1/status?resource=uk1 + // For now, assumes operational - central drives status page + return StatusPageEntry{ + Component: resource, + Status: "operational", + UpdatedAt: time.Now().Unix(), + }, false +} + +// --------------------------------------------------------------------------- +// Central Query Helpers (for clavitor.ai implementation) +// --------------------------------------------------------------------------- + +// ActiveEventSummary is what central's dashboard shows. +// SQL equivalent: +// SELECT code, resource, pop, COUNT(*) as count, MAX(timestamp) as last_seen +// FROM events +// WHERE status != 'resolved' +// GROUP BY code, resource, pop +// ORDER BY count DESC +// +// This gives you: "uk1 has 12 ERR-50001 in the last hour" +type ActiveEventSummary struct { + Code string `json:"code"` // ERR-50001 + Resource string `json:"resource"` // uk1 + POP string `json:"pop"` // zrh + Count int `json:"count"` // How many events + FirstSeen int64 `json:"first_seen"` // First event timestamp + LastSeen int64 `json:"last_seen"` // Most recent event timestamp + Status string `json:"status"` // investigating, identified, monitoring, resolved +} + +// CentralQuery represents the query parameters for the central endpoint. 
+// The central API should support: +// GET /v1/events?status=active&group_by=code,resource,pop +// GET /v1/events?code=ERR-50001&resource=uk1&since=1h +// POST /v1/events/bulk-resolve { "code": "ERR-50001", "resource": "uk1" } +type CentralQuery struct { + Status string `json:"status,omitempty"` // active, resolved, all + Code string `json:"code,omitempty"` // ERR-50001 + Resource string `json:"resource,omitempty"` // uk1 + POP string `json:"pop,omitempty"` // zrh + Since string `json:"since,omitempty"` // 1h, 24h, 7d + GroupBy []string `json:"group_by,omitempty"` // code, resource, pop + Severity string `json:"severity,omitempty"` // error, critical + CreateTicket bool `json:"create_ticket,omitempty"` // true = ticket-creating only +} + +// BulkResolveRequest marks events as resolved in bulk. +// Use case: uk1 fixed, resolve all ERR-50001 for uk1 at once. +type BulkResolveRequest struct { + Code string `json:"code"` // Required + Resource string `json:"resource,omitempty"` // Optional: resolve for specific resource + POP string `json:"pop,omitempty"` // Optional: resolve for specific POP + Since int64 `json:"since,omitempty"` // Optional: resolve events after this time + Message string `json:"message"` // Resolution message: "Disk space freed, service restored" + ResolvedBy string `json:"resolved_by"` // Who fixed it: "ops-johan" +} + +// --------------------------------------------------------------------------- +// Environment-based initialization helper +// --------------------------------------------------------------------------- + +// InitErrorsFromEnv sets up the central client from environment variables. 
+// Call this from main(): +// lib.InitErrorsFromEnv() +func InitErrorsFromEnv() { + centralURL := os.Getenv("CLAVITOR_CENTRAL_URL") + popID := os.Getenv("CLAVITOR_POP_ID") + apiKey := os.Getenv("CLAVITOR_API_KEY") + + if centralURL != "" && popID != "" && apiKey != "" { + InitCentralClient(centralURL, popID, apiKey) + log.Printf("[INIT] Central event reporting enabled: POP=%s -> %s", popID, centralURL) + } else { + log.Printf("[INIT] Central event reporting disabled (community edition or missing config)") + } +}
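To make the logging and response contract above concrete, here is a standalone sketch. The `eventDef`, `registry`, and `auditTitle` names are minimal stand-ins re-declared here only so the snippet compiles on its own; they mirror a slice of `EventDef`, `EventRegistry`, and the structured title that `LogEvent` writes to the audit log.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// eventDef is a stand-in mirroring the fields of EventDef used here.
type eventDef struct {
	Code         string
	Message      string
	CreateTicket bool
}

// registry is a one-entry stand-in for EventRegistry.
var registry = map[string]eventDef{
	"ERR-50001": {
		Code:         "ERR-50001",
		Message:      "Service temporarily unavailable. Our team has been alerted and is working on it. Reference: ERR-50001.",
		CreateTicket: true,
	},
}

// auditTitle reproduces the structured title format LogEvent emits, so the
// grep contract (e.g. grep "resource=uk1" audit.log) is visible end to end.
func auditTitle(op, resource, errDetail, desc string) string {
	return fmt.Sprintf("op=%s | resource=%s | error=%s | %s", op, resource, errDetail, desc)
}

func main() {
	// The grep-able audit line for a vault database failure on uk1.
	fmt.Println(auditTitle("GetEntry", "uk1", "connection refused",
		"Vault database connection failed or file inaccessible"))

	// The JSON body ErrorResponse would send for a ticket-creating error.
	def := registry["ERR-50001"]
	body := map[string]string{
		"error":     def.Code,
		"message":   def.Message,
		"reference": def.Code,
		"status":    "Our team has been alerted",
	}
	out, _ := json.Marshal(body)
	fmt.Println(string(out))
}
```

Running the sketch prints an audit line that `grep "resource=uk1"` matches, followed by the user-facing JSON payload carrying the `reference` and `status` fields a support ticket would cite.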