JKHeadley · JKHeadley · May 28, 2026
diff --git a/docs/postmortems/2026-05-27-release-readiness-eval-failure-topics.md b/docs/postmortems/2026-05-27-release-readiness-eval-failure-topics.md
@@ -0,0 +1,68 @@
+# Post-mortem — Release-readiness eval-failure Telegram topics (2026-05-27)
+
+## Summary
+
+A new monitoring sentinel (`ReleaseReadinessSentinel`, shipped over PRs #433 / #442 / #443) emitted a per-stage Attention item, and therefore a new Telegram topic, every time the watchdog's own fetch / analyzer / tick stage broke. Across the v1.3.38 → v1.3.43 dogfood window on Echo, two such topics surfaced ("Release-readiness check could not evaluate"), with bodies that were inscrutable to a user ("analyze-release returned no report"). This pattern was banned five days earlier by the silently-stopped-trio fix (2026-05-22, post-topic-spam flood): internal-plumbing failures belong in the audit log + server log, not on the user's Telegram surface.
+
+The user caught it. The spec passed conformance. The conformance gate did not see this class of violation.
+
+## Timeline
+
+- **2026-05-22** — Silently-stopped-trio fix lands (#334, then wired in #340). Establishes the canonical "Sentinel Notifications" pattern: housekeeping by default → `logs/sentinel-events.jsonl` + `server.log`, Telegram escalation off by default, coalesced into ONE consolidated message in the existing system topic when opted in. Codified in agent `CLAUDE.md` and `docs/specs/silently-stopped-trio.md`.
+- **2026-05-26..27** — `RELEASE-READINESS-VISIBILITY-SPEC.md` converges and lands as #433/#442/#443. §4.2.4 says the spec is "near-silent" (✓), and §4.2 explicitly says **any evaluation failure raises a low-priority Attention item — a silent catch is forbidden**. The two-option framing (loud-attention vs silent-catch) skipped over the housekeeping path the trio standard establishes. No cross-reference to `silently-stopped-trio.md`.
+- **2026-05-27 (Echo dogfood window)** — Echo enabled the sentinel. Several ticks ran. The 23:54Z tick fetched canonical and failed (`canonical ref unreachable`); the 01:25Z tick reached the analyzer and got back no report. Each emitted a new Telegram topic via the Attention queue's "create-a-topic-per-item" design.
+- **2026-05-27 18:30 PT** — User: "These topics keep popping up in Instar agents which goes directly against instar standards: they produce topic clutter; the messages are completely unhelpful."
+- **2026-05-27 18:30..18:46 PT** — Diagnosis → branch `echo/release-readiness-housekeeping` → fix + tests + migrator + side-effects artifact + this post-mortem.
+- **2026-05-27 18:35 PT** — Two stale items live-cleaned on Echo via `DELETE /attention/release-readiness-eval-failure-{fetch,analyzer}` (soft-delete; topics closed).
+
+## Root cause
+
+A spec-time framing error. The spec author treated the choice as binary:
+1. **Loud signal** → post to Attention queue (creates Telegram topic).
+2. **Silent catch** → eat the error → recreate the very bug §3 fixes.
+
+The trio standard establishes a third path:
+3. **Housekeeping** → write to `logs/sentinel-events.jsonl` + `server.log` + emit an in-process event. Fully observable for diagnostics, never a user-facing topic. Optional, coalesced, single-hub-topic escalation behind a config flag.
+
+For evaluator-self-failures (the watchdog's own fetch / analyzer / tick stages), path 3 is the correct fit — they are internal plumbing the user can't act on. Path 1 was the wrong choice but was actively defended by the spec text. Path 2 was never on the table.
+
+## Contributing factors
+
+1. **No conformance check for sentinel emit-sites.** The Self-Hosting conformance gate exercises many checks (near-silent, 3-tier testing, migration parity, structure-over-willpower, no-manual-work). It does NOT, today, flag "this new `*Sentinel.ts` calls `postAttention` directly without classifying the emit-site against the silently-stopped-trio housekeeping/escalation taxonomy."
+2. **No cross-spec consistency requirement.** A spec referencing the trio standard's pattern was not required. The spec mentioned "near-silent" but didn't cite the trio doc as a peer authority.
+3. **No structural primitive.** `SocketDisconnectSentinel` / `ActiveWorkSilenceSentinel` implement the housekeeping pattern by hand. There is no shared `SentinelEmitter` primitive that bakes in the housekeeping default + escalation gate. Each new sentinel re-derives (or fails to re-derive) the pattern from prose.
+4. **Dogfood-to-ship caught it, at the topic-clutter cost.** The "Echo dogfoods first" gate worked: the issue was caught by a real user before the sentinel shipped enabled by default. But the catch came AFTER the user saw two topics, not before. Relying on dogfooding as the only safety net is a smell; design-time review should have caught this.
+5. **Spec language reinforced the bug.** "A silent catch is forbidden" framed loud-Attention as the only acceptable alternative. Housekeeping is not silent — it's persistent, structured, queryable observability — but the spec used "silent" pejoratively without distinguishing from "audited but not chat-surfacing."
+
+## What we're changing
+
+### Immediate (this PR)
+
+- `ReleaseReadinessSentinel.failLoud()` is demoted to audit-only by default; Attention escalation is opt-in via `monitoring.releaseReadiness.escalateEvalFailures`.
+- `migrateRetireStaleReleaseReadinessEvalFailureAttention()` cleans up stale rows on existing agents.
+- Spec text (next slice) — see "Follow-ups" below.
+
+### Follow-ups (tracked as separate work)
+
+1. **Sentinel-emit-site lint.** A pre-commit / CI lint that scans `src/monitoring/**/*Sentinel*.ts` for direct `postAttention(` calls and flags any that aren't either:
+   - Behind a config flag of the shape `*TelegramEscalation` / `escalate*Failures` / `*ChatEscalation`, OR
+   - Annotated `// @user-actionable-attention-ok — <one-line justification>` in the same expression.
+   This is the structural equivalent of the trio standard. Implements "structure > willpower" for the housekeeping taxonomy.
+
+2. **Sentinel emitter primitive.** Extract a small `SentinelEmitter` class with two methods:
+   - `recordHousekeeping(event, payload)` → audit + event (no user-facing emit by default)
+   - `escalate(item)` → routes to Attention iff the per-sentinel escalation flag is on, with built-in coalescing per the trio standard.
+   New sentinels use the primitive. Existing housekeeping-pattern sentinels (`SocketDisconnectSentinel`, `ActiveWorkSilenceSentinel`) migrate at leisure. Spec-time discussion becomes "which emit-sites are housekeeping vs user-actionable," not "do we postAttention."
+
+3. **Spec template update.** Any spec introducing a sentinel must include a "Failure-mode emit-site table" classifying each error path as (a) user-actionable Attention, (b) housekeeping audit-only, (c) opt-in escalation. The `/spec-converge` conformance pass requires this section.
+
+4. **Cross-reference rule.** `/spec-converge` flags any spec touching `src/monitoring/` that does NOT cite `docs/specs/silently-stopped-trio.md`. Mechanical, easy.
+
+5. **Spec text fix on `RELEASE-READINESS-VISIBILITY-SPEC.md`.** Replace the §4.2 "fail-loud Attention" language with the housekeeping default + escalation flag pattern; cite the trio standard. This lands as a follow-up PR: the spec is converged and the runtime behaviour now contradicts it, so the doc must be updated to match the code.
+
+## Lessons
+
+- **Two coexisting standards is one standard not yet generalized.** When a class of failure (silently-stopped trio) gets a careful design and a separate class (release-readiness eval) reinvents a worse version of it, that's not two design problems — that's the trio standard wanting to be extracted into a primitive. Do the primitive.
+- **"Fail-loud" is not a synonym for "Telegram topic."** Loud means observable and surfaced where the next operator looks. For internal-plumbing failures, that's `logs/sentinel-events.jsonl` and `server.log`. For user-actionable failures, it's the Attention queue. The spec should classify each emit-site explicitly.
+- **Dogfood-to-ship works but is the last line of defense.** Catches at design time are cheaper than catches at dogfood time. Conformance checks are how we move catches earlier without slowing review.
+- **A bad framing in a spec writes itself into every implementation.** "A silent catch is forbidden" was true but framed the choice wrongly. Better: "Every failure must be audited; user-facing emission is a separate decision." Words matter; choose them so they don't preclude the right answer.
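
Review note on follow-up 2 of the post-mortem: a minimal sketch of what the proposed `SentinelEmitter` primitive might look like. Only the class name and the two method names (`recordHousekeeping`, `escalate`) come from the post-mortem. The `AttentionItem` and `SentinelEmitterDeps` shapes, the constructor arguments, and the `flush()` coalescing step are assumptions made for illustration, not the codebase's actual types.

```typescript
// Hypothetical shapes for illustration; the real Attention and audit types
// live elsewhere in the codebase and will differ.
interface AttentionItem {
  id: string;
  title: string;
  summary: string;
  priority: 'LOW' | 'NORMAL' | 'HIGH';
}

interface SentinelEmitterDeps {
  audit(entry: Record<string, unknown>): void;
  emit(event: string, payload: Record<string, unknown>): void;
  postAttention(item: AttentionItem): Promise<void>;
}

class SentinelEmitter {
  private readonly sentinel: string;
  private readonly deps: SentinelEmitterDeps;
  private readonly escalationEnabled: boolean;
  private pending: AttentionItem[] = [];

  constructor(sentinel: string, deps: SentinelEmitterDeps, escalationEnabled = false) {
    this.sentinel = sentinel;
    this.deps = deps;
    this.escalationEnabled = escalationEnabled;
  }

  /** Housekeeping path: audit line plus in-process event, never user-facing. */
  recordHousekeeping(event: string, payload: Record<string, unknown>): void {
    this.deps.audit({ kind: this.sentinel, event, ...payload });
    this.deps.emit(event, payload);
  }

  /** Queues an item for chat only when the per-sentinel flag is on. */
  escalate(item: AttentionItem): void {
    if (!this.escalationEnabled) return;
    // De-duplicate by id so a repeating failure is queued once per batch.
    if (!this.pending.some((p) => p.id === item.id)) this.pending.push(item);
  }

  /** Hypothetical flush step: posts ONE consolidated item for the whole batch. */
  async flush(): Promise<void> {
    if (this.pending.length === 0) return;
    const batch = this.pending;
    this.pending = [];
    await this.deps.postAttention({
      id: `${this.sentinel}-escalations`,
      title: `${this.sentinel}: ${batch.length} escalated event(s)`,
      summary: batch.map((i) => `${i.title}: ${i.summary}`).join('\n'),
      priority: 'LOW',
    });
  }
}

// Usage with in-memory fakes. Escalation is off by default, so the failure
// is audited but nothing reaches the Attention queue.
const auditLog: Record<string, unknown>[] = [];
const posted: AttentionItem[] = [];
const deps: SentinelEmitterDeps = {
  audit: (entry) => { auditLog.push(entry); },
  emit: () => {},
  postAttention: async (item) => { posted.push(item); },
};

const emitter = new SentinelEmitter('release-readiness', deps);
emitter.recordHousekeeping('eval-failed', { stage: 'fetch', error: 'canonical ref unreachable' });
emitter.escalate({
  id: 'eval-failure-fetch',
  title: 'Fetch failed',
  summary: 'canonical ref unreachable',
  priority: 'LOW',
});
void emitter.flush(); // nothing was queued, so nothing is posted
```

With a primitive of this shape, the spec-time question becomes which method each emit-site calls, which is the classification the post-mortem asks specs to make explicit.
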
diff --git a/src/commands/server.ts b/src/commands/server.ts
@@ -8147,6 +8147,7 @@ export async function startServer(options: StartOptions): Promise<void> {
             backlogAgeDaysHigh: rrCfg.backlogAgeDaysHigh,
             hysteresisHours: rrCfg.hysteresisHours,
             staleEpisodeTtlDays: rrCfg.staleEpisodeTtlDays,
+            escalateEvalFailures: rrCfg.escalateEvalFailures,
           });
           console.log(pc.green('  ReleaseReadinessSentinel enabled (release-hygiene watchdog — job-driven)'));
         } else {

diff --git a/src/config/ConfigDefaults.ts b/src/config/ConfigDefaults.ts
@@ -88,6 +88,12 @@ const SHARED_DEFAULTS: Record<string, unknown> = {
       hysteresisHours: 12,
       staleEpisodeTtlDays: 30,
       fetchTimeoutMs: 30_000,
+      // Evaluator-self-failures (fetch / analyzer / top-level tick) are
+      // HOUSEKEEPING by default — they write to the audit log + server.log
+      // but do not post a per-stage Attention item / Telegram topic. The
+      // user-actionable "release blocked" signal is unaffected. Set true to
+      // surface catastrophic watchdog failures in chat. Sentinel-trio standard.
+      escalateEvalFailures: false,
     },
     // Master gate for Telegram delivery of silently-stopped-sentinel
     // escalations. Default false → sentinel notices are housekeeping and stay

diff --git a/src/core/PostUpdateMigrator.ts b/src/core/PostUpdateMigrator.ts
@@ -229,10 +229,82 @@ export class PostUpdateMigrator {
     this.migrateBootWrapperAbiCheck(result);
     this.migrateStaleLifelineSignal(result);
     this.migrateThreadlineConversationStore(result);
+    this.migrateRetireStaleReleaseReadinessEvalFailureAttention(result);
 
     return result;
   }
 
+  /**
+   * Retire stale `release-readiness-eval-failure-*` attention items left behind
+   * by the pre-housekeeping watchdog. Through v1.3.43, ReleaseReadinessSentinel
+   * posted an Attention item — and therefore a new Telegram topic — every time
+   * the watchdog's own fetch / analyzer / tick stage broke. That violated the
+   * sentinel-trio standard (post-2026-05-22 topic-spam fix): internal-plumbing
+   * failures are housekeeping and belong in logs/sentinel-events.jsonl +
+   * server.log, not on the user's Telegram surface.
+   *
+   * The code-level fix demotes those emissions to audit-only (gated behind
+   * `monitoring.releaseReadiness.escalateEvalFailures`, default false). This
+   * migration cleans up the stragglers already on-disk so the topics don't
+   * keep haunting the topic list after update.
+   *
+   * Behaviour:
+   *   - Reads .instar/state/attention-items.json. If absent, skip.
+   *   - For every item whose id starts with `release-readiness-eval-failure-`:
+   *     drop it from the items array. (The Telegram topic itself is left as-is;
+   *     it was either /done'd by the user already, or will be unreferenced. We
+   *     don't synchronously call Telegram from PostUpdateMigrator — the
+   *     adapter isn't constructed at this point in startup.)
+   *   - Atomic write (tmp + rename) so a crash mid-migration can't corrupt
+   *     attention-items.json.
+   *   - Idempotent: a second run finds zero matches and no-ops.
+   *
+   * Origin: 2026-05-27 dogfood feedback on Echo — repeated
+   * "Release-readiness check could not evaluate" topics violating the user's
+   * "no topic clutter for housekeeping" standard.
+   */
+  private migrateRetireStaleReleaseReadinessEvalFailureAttention(result: MigrationResult): void {
+    const attentionPath = path.join(this.config.stateDir, 'state', 'attention-items.json');
+    if (!fs.existsSync(attentionPath)) {
+      result.skipped.push('retire-stale-release-readiness-eval-failure-attention: no attention-items.json');
+      return;
+    }
+
+    let parsed: { items?: Array<{ id?: string }> };
+    try {
+      parsed = JSON.parse(fs.readFileSync(attentionPath, 'utf-8')) as { items?: Array<{ id?: string }> };
+    } catch (err) {
+      result.errors.push(`retire-stale-release-readiness-eval-failure-attention read: ${err instanceof Error ? err.message : String(err)}`);
+      return;
+    }
+
+    if (!parsed || !Array.isArray(parsed.items) || parsed.items.length === 0) {
+      result.skipped.push('retire-stale-release-readiness-eval-failure-attention: empty attention items');
+      return;
+    }
+
+    const before = parsed.items.length;
+    const filtered = parsed.items.filter((it) => {
+      const id = typeof it?.id === 'string' ? it.id : '';
+      return !id.startsWith('release-readiness-eval-failure-');
+    });
+    const dropped = before - filtered.length;
+    if (dropped === 0) {
+      result.skipped.push('retire-stale-release-readiness-eval-failure-attention: none on disk');
+      return;
+    }
+
+    parsed.items = filtered;
+    try {
+      const tmpPath = `${attentionPath}.${process.pid}.tmp`;
+      fs.writeFileSync(tmpPath, JSON.stringify(parsed, null, 2));
+      fs.renameSync(tmpPath, attentionPath);
+      result.upgraded.push(`retire-stale-release-readiness-eval-failure-attention: dropped ${dropped} stale item(s)`);
+    } catch (err) {
+      result.errors.push(`retire-stale-release-readiness-eval-failure-attention write: ${err instanceof Error ? err.message : String(err)}`);
+    }
+  }
+
   /**
    * Regenerate the boot wrapper when it predates the ABI-aware node
    * self-heal (recurring-SQLite-bane fix).

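Review note on the migration above: its core is a prefix filter over the items array, and the idempotence claim follows from that filter alone. The sketch below isolates the filter as a pure function so both properties can be checked without touching disk. `AttentionRecord`, `retireStale`, and the `release-blocked-abc123` id are hypothetical names for illustration; only the `release-readiness-eval-failure-` prefix and the two stage suffixes come from the diff and the post-mortem.

```typescript
// Hypothetical minimal shape; real attention items carry more fields.
interface AttentionRecord {
  id?: string;
}

const STALE_PREFIX = 'release-readiness-eval-failure-';

// Pure version of the migration's filter step: keeps every item whose id
// does not start with the stale prefix and reports how many were dropped.
function retireStale(items: AttentionRecord[]): { kept: AttentionRecord[]; dropped: number } {
  const kept = items.filter((it) => {
    const id = typeof it.id === 'string' ? it.id : '';
    return !id.startsWith(STALE_PREFIX);
  });
  return { kept, dropped: items.length - kept.length };
}

const onDisk: AttentionRecord[] = [
  { id: 'release-readiness-eval-failure-fetch' },
  { id: 'release-readiness-eval-failure-analyzer' },
  { id: 'release-blocked-abc123' }, // hypothetical user-actionable item, must survive
  {}, // item without an id, must also survive
];

const firstRun = retireStale(onDisk); // drops the two eval-failure items
const secondRun = retireStale(firstRun.kept); // finds nothing left to drop
```

Because the second pass drops nothing, the real migration takes its "none on disk" skip branch on re-run and never rewrites the file.
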
diff --git a/src/core/types.ts b/src/core/types.ts
@@ -2769,6 +2769,17 @@ export interface MonitoringConfig {
     canonicalRemote?: string;
     /** Override the instar repo path to analyze (default: the agent home). */
     repoPath?: string;
+    /**
+     * When true, evaluator-self-failures (fetch / analyzer / top-level tick
+     * stages of the watchdog itself) post a LOW-priority Attention item — and
+     * therefore a Telegram topic — in addition to the audit-log entry. Default
+     * false: per the sentinel-trio standard ("Sentinel Notifications" in the
+     * agent CLAUDE.md, post-2026-05-22 topic-spam fix), internal-plumbing
+     * failures are housekeeping and stay in logs/sentinel-events.jsonl +
+     * server.log. The user-actionable "release blocked — unreleased work
+     * piling up" signal always posts regardless of this flag. Flip on only if
+     * you also want catastrophic-failure surfacing in chat. */
+    escalateEvalFailures?: boolean;
   };
   /**
    * Master gate for Telegram delivery of silently-stopped-sentinel escalations

diff --git a/src/monitoring/ReleaseReadinessSentinel.ts b/src/monitoring/ReleaseReadinessSentinel.ts
@@ -20,9 +20,17 @@
  *     item per stall episode above it, keyed on the OLDEST unreleased commit
  *     SHA (stable across ticks — not a resettable per-tick id), priority scaled
  *     by backlog age, 12h hysteresis on re-raise after an auto-resolve.
- *   - Fail-loud: any evaluation failure (fetch error, analyzer error) raises a
- *     low-priority Attention item — never a silent catch (that would re-create
- *     the exact bug this fixes).
+ *   - Fail-loud: any evaluation failure (fetch error, analyzer error, top-level
+ *     tick error) writes a structured audit entry (sentinel-events.jsonl) and a
+ *     dedup-keyed `eval-failed` emit — never a silent catch (that would re-create
+ *     the exact bug this fixes). User-facing Telegram escalation of these
+ *     evaluator-self-failures is HOUSEKEEPING by default, gated behind
+ *     `escalateEvalFailures` (config: `monitoring.releaseReadiness.escalateEvalFailures`,
+ *     default false), per the sentinel-trio standard ("Sentinel Notifications"
+ *     in CLAUDE.md, post-2026-05-22 topic-spam fix). The audit log + server.log
+ *     are the canonical observability surface; only the user-actionable
+ *     "release blocked / unreleased work piling up" signal posts to Attention
+ *     by default — that one is genuinely actionable.
  *   - Lifecycle owner: detect → surface → auto-resolve → reap, with
  *     resolveEpisodesInRange consulted by the publish-finalize path.
  *   - Repo-gated: needs an analyzable instar git repo (dev/maintainer env). On
@@ -120,6 +128,16 @@ export interface ReleaseReadinessSentinelConfig {
   backlogAgeDaysHigh?: number;
   hysteresisHours?: number;
   staleEpisodeTtlDays?: number;
+  /**
+   * When true, evaluator-self-failures (fetch / analyzer / top-level tick stages)
+   * post a low-priority Attention item in addition to the audit log. Default
+   * false: housekeeping per the sentinel-trio standard — the audit log
+   * (logs/sentinel-events.jsonl) + server.log are the canonical observability
+   * surface for internal-plumbing failures, so the user is not spammed with a
+   * Telegram topic per stage that breaks. The user-actionable "release blocked"
+   * signal is unaffected by this flag and always posts to Attention.
+   */
+  escalateEvalFailures?: boolean;
 }
 
 const DEFAULTS: Required<ReleaseReadinessSentinelConfig> = {
@@ -131,6 +149,7 @@ const DEFAULTS: Required<ReleaseReadinessSentinelConfig> = {
   backlogAgeDaysHigh: 7,
   hysteresisHours: 12,
   staleEpisodeTtlDays: 30,
+  escalateEvalFailures: false,
 };
 
 const DAY_MS = 24 * 60 * 60 * 1000;
@@ -346,17 +365,28 @@ export class ReleaseReadinessSentinel extends EventEmitter {
 
   private async failLoud(state: ReadinessState, stage: string, err: unknown): Promise<void> {
     const key = `failure:${stage}`;
+    // Always audit — the audit log is the canonical observability surface for
+    // evaluator-self-failures. Both the dedup-suppressed and the un-suppressed
+    // paths produce an audit line so frequency is countable from disk.
     this.deps.audit({ kind: 'release-readiness', event: 'eval-failed', stage, error: String(err) });
     if (state.lastFailureKey === key) return; // dedupe per failure episode
     state.lastFailureKey = key;
-    await this.deps.postAttention({
-      id: `release-readiness-eval-failure-${stage}`,
-      title: 'Release-readiness check could not evaluate',
-      summary: `The release-readiness check failed at the "${stage}" stage: ${String(err)}. Last evaluated ${state.lastSignalAt ? new Date(state.lastSignalAt).toISOString() : 'never'}.`,
-      category: 'degradation',
-      priority: 'LOW',
-    });
-    state.lastSignalAt = this.deps.now();
+    // HOUSEKEEPING by default: do NOT post a per-stage Attention item (which
+    // would create a per-event Telegram topic — the exact anti-pattern banned
+    // by the sentinel-trio standard post-2026-05-22 topic-spam fix). The user
+    // hears about this kind of failure only when escalateEvalFailures is
+    // explicitly enabled. The audit emission above + the `eval-failed` event
+    // remain the supported observability handles.
+    if (this.cfg.escalateEvalFailures) {
+      await this.deps.postAttention({
+        id: `release-readiness-eval-failure-${stage}`,
+        title: 'Release-readiness check could not evaluate',
+        summary: `The release-readiness check failed at the "${stage}" stage: ${String(err)}. Last evaluated ${state.lastSignalAt ? new Date(state.lastSignalAt).toISOString() : 'never'}.`,
+        category: 'degradation',
+        priority: 'LOW',
+      });
+      state.lastSignalAt = this.deps.now();
+    }
     this.emit('eval-failed', { stage });
   }
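
Review note on follow-up 1 of the post-mortem: a rough sketch of how the proposed sentinel-emit-site lint could classify the `failLoud` change above. This is a line-based heuristic under stated assumptions, not the planned implementation: a real lint would more likely walk the TypeScript AST. The function name `findUngatedPostAttention` is hypothetical. The flag-name shapes and the `@user-actionable-attention-ok` annotation are taken from the post-mortem, and accepting the annotation on the preceding line is a simplification of "in the same expression".

```typescript
// Flag-name shapes listed in the post-mortem's lint proposal.
const FLAG_SHAPE = /\b(\w*TelegramEscalation|escalate\w*Failures|\w*ChatEscalation)\b/;
const ANNOTATION = '@user-actionable-attention-ok';

// Returns the 1-based line numbers of `postAttention(` calls that are neither
// inside an `if (...)` block whose header mentions an escalation flag nor
// annotated as user-actionable on the same or the preceding line.
function findUngatedPostAttention(source: string): number[] {
  const offenders: number[] = [];
  const openBlocks: string[] = []; // header line of each currently open `{`
  const lines = source.split('\n');

  lines.forEach((line, index) => {
    if (line.includes('postAttention(')) {
      const previous = lines[index - 1] ?? '';
      const annotated = line.includes(ANNOTATION) || previous.includes(ANNOTATION);
      const gated = openBlocks.some((h) => /\bif\s*\(/.test(h) && FLAG_SHAPE.test(h));
      if (!annotated && !gated) offenders.push(index + 1);
    }
    // Track brace nesting after the check so the call's own `{` is not
    // mistaken for an enclosing block.
    for (let i = 0; i < line.length; i++) {
      if (line[i] === '{') openBlocks.push(line);
      else if (line[i] === '}') openBlocks.pop();
    }
  });
  return offenders;
}

// Simplified stand-ins for failLoud before and after this PR.
const beforeSource = [
  'private async failLoud(state, stage, err) {',
  '  await this.deps.postAttention({ id: `x-${stage}` });',
  '}',
].join('\n');

const afterSource = [
  'private async failLoud(state, stage, err) {',
  '  if (this.cfg.escalateEvalFailures) {',
  '    await this.deps.postAttention({ id: `x-${stage}` });',
  '  }',
  '}',
].join('\n');

const beforeOffenders = findUngatedPostAttention(beforeSource); // flags line 2
const afterOffenders = findUngatedPostAttention(afterSource); // flags nothing
```

The heuristic does not understand strings, comments, or multi-line conditions, so it is only meant to show the classification rule: the pre-fix emit-site is flagged, while the flag-gated one passes.
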