Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
39 changes: 39 additions & 0 deletions docs/specs/quiet-update-mechanics.eli16.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,39 @@
# Quiet Update Mechanics — Plain-English Overview

> The one-line version: update messages full of version numbers and restart plumbing now go to the logs instead of buzzing the user — they only hear about an update when something is genuinely new, a restart is about to interrupt them, or an update is truly stuck.

## The problem in one breath

The agent's "Updates" channel was filling up with messages the user can't use: "Just updated to v1.3.217. Restarting…", "v1.3.217 was applied but I'm still running v1.3.218, the next restart should pick it up", "rolling into the pending restart at 02:42". To the user these read as meaningless version churn — exactly the "notifications that reference things I have no clue about" they complained about. None of it is something they need to read or act on.

## What already exists

- **The maturity layer (shipped as #698)** — makes *feature announcements* silent-by-default and honest about how finished a feature is (Experimental / Preview / Stable). It only governs "here's a new capability" messages, NOT restart/version status.
- **The auto-updater + restart handshake** — the machinery that downloads an update, restarts the server to load it, verifies the new version booted, and coordinates so two updates don't cause two back-to-back restarts. Along the way it emits a dozen hardcoded status messages — and those are what leaked to the user.
- **The Agent Updates topic** — the dedicated channel all of these messages route to.

## What this adds

A single rule that sorts every update message into one of four buckets before it can reach the user: **mechanics** (version/restart churn — goes to the logs), **interruption** (a restart is hitting the user's active work right now — they hear a plain "back in a few seconds"), **actionable** (auto-updates are off, so they must say "update"), and **stuck** (an update genuinely failed after retries). Only the last three reach the user, and all the restart/interruption wording was rewritten to drop version numbers entirely.

Secondary changes: the default bucket is "mechanics" (silent), so any future update message a developer forgets to classify stays quiet instead of accidentally spamming. And there's an opt-in flag for people who'd rather get one quiet "just refreshed in the background" note than total silence.

## The new pieces

- **`updateNotifyPolicy`** — a tiny, pure decision function: given a message's bucket, it answers "does this reach the user, yes or no?" It does no I/O and holds no state, so it's easy to prove correct on every branch. It is NOT allowed to send anything itself or know anything about Telegram — it only decides; the caller sends. That line matters because it keeps the "who hears what" rule in one tested place instead of scattered across a dozen message sites.

## The safeguards

**Prevents the version-churn flood from coming back.** The opt-in "background heartbeat" flag can only ever surface ONE specific message (the post-restart "I'm current" note). Every other mechanics message stays silent even with the flag on, so the flag can't be a backdoor that reopens the flood.

**Prevents accidental future spam.** Because the default bucket is "silent mechanics," a new update message added later is quiet unless someone deliberately marks it user-facing. Forgetting is safe; the failure mode points at silence, not noise.

**Prevents losing a genuinely-important signal.** A restart that actually interrupts the user, and an update that's truly stuck after retries, still reach them — just without the version jargon. We did not silence everything; we silenced the noise.

## What ships when

One change, one PR: the policy module, the classification wired through the auto-updater and the restart handshake, the version-free message rewrites, the config flag, full unit tests, the agent-awareness note (template + migration), and the release fragment. It ships on the next npm update; existing agents get the behavior automatically and the awareness note through the migration.

## What you actually need to decide

You already decided: option A (full silence for no-op updates), which is the default this ships with — do you agree this is clear to ship?
112 changes: 112 additions & 0 deletions docs/specs/quiet-update-mechanics.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,112 @@
# Spec — Quiet Update Mechanics

**Status:** shipped
**Parent principle:** Near-Silent Notifications (housekeeping → logs, not the user)
**Sibling:** `mature-update-announcements.md` (the *feature-announcement* layer; this is the *mechanics* layer)

## Problem

The Agent Updates topic was flooding with update **mechanics** — raw version
numbers and restart plumbing the user has no use for. Real messages observed in
a live Updates topic:

- `Just updated to v1.3.181. Restarting to pick up the changes.`
- `Update to v1.3.183 was applied but I'm still running v1.3.184. The next restart should pick it up.`
- `Update v1.3.215 queued — rolling into the pending restart at 02:21 (about 10m)…`
- `Update to v1.3.217 was applied but I'm still running v1.3.218. The next restart should pick it up.`

None of this is user-relevant. It is operational status that leaked into a
user-facing topic, and it reads as meaningless version churn — "notifications
that reference features the user has no clue about" (user feedback, 2026-06-04).

`mature-update-announcements` (PR #698) made the **feature-announcement** path
silent-by-default and maturity-honest, but it never touched the **mechanics**
path — these messages are not announcements, they are hardcoded restart/version
status emitted from `AutoUpdater` and the restart handshake. So the noise
remained.

## Policy (option A — full silence, the default)

The user hears about an update only when one of these is true:

1. **A genuinely new capability shipped** — governed by the `user_announcement`
front-matter layer (`mature-update-announcements`), NOT by this module.
2. **A restart is actually interrupting their active work right now**
(`interruption`) — a plain, **version-free** heads-up ("back in a few
seconds"), never "v1.3.X".
3. **They must take an action** (`actionable`) — e.g. auto-apply is off and a
manual update is available.
4. **An update is genuinely stuck after retries** (`failure-escalated`) — the
restart isn't taking the new code after repeated attempts.

Everything else — version churn, restart-batch coordination, transient version
skew that self-heals on the next restart, a transient apply failure that retries
next cycle — is `mechanics` and goes to the **logs only**.

### Option B — background-refresh heartbeat (opt-in)

Some operators prefer a single quiet "I just refreshed in the background, I'm
current" note over total silence. Opt in with
`updates.backgroundRefreshHeartbeat: true` in `.instar/config.json` (default
`false` = option A). It surfaces ONLY the post-restart background-refresh
confirmation as a plain, version-free line; every other `mechanics` event stays
silent regardless, so the flag can never re-introduce the version-churn flood.

## Design — single classification at the funnel ("Structure > Willpower")

A pure policy module, `src/core/updateNotifyPolicy.ts`, exports
`decideUpdateNotify(kind, opts) → { reachUser, reason }` over four kinds:
`mechanics | interruption | actionable | failure-escalated`.

`AutoUpdater.notify(message, kind = 'mechanics', opts)` consults the policy at
the single notify funnel: a non-`reachUser` decision logs and returns; otherwise
it sends to the Updates topic exactly as before. The **default kind is
`mechanics`**, so any future, un-audited `notify()` callsite is silent by default
rather than accidentally spamming the user. The restart handshake's failure emit
in `src/commands/server.ts` applies the same policy (non-escalated mismatch →
silent; escalated → reaches the user, version-free).

### Callsite map

| Source | Old message (gist) | Kind | Reaches user? |
|---|---|---|---|
| `AutoUpdater` version-skew nudge | "vX downloaded, still running vY" | `mechanics` | no (silent) |
| `AutoUpdater` manual-update-available (auto-apply off) | "new version vX available, say update" | `actionable` | yes (version-free) |
| `AutoUpdater` transient apply failure | "tried vX, didn't work, retry next cycle" | `mechanics` | no (silent) |
| `AutoUpdater` max-deferral forced restart | "deferred… max wait reached, restarting now" | `interruption` | yes (version-free) |
| `AutoUpdater` restart narration (active sessions) | "Just updated to vX. Restarting…" | `interruption` | yes (version-free) |
| `AutoUpdater` idle restart | (was silent) | `mechanics` + bg-confirmation | only under option B |
| `AutoUpdater` deferral threshold warnings | "vX installed, restart in ~5m / ~30m" | `interruption` | yes (version-free) |
| `AutoUpdater` cascade-batch | "rolling into the pending restart at HH:MM" | `mechanics` | no (silent) |
| `server.ts` handshake mismatch (non-escalated) | "vX applied but still running vY" | `mechanics` | no (silent) |
| `server.ts` handshake mismatch (escalated) | "restart didn't pick up the new code" | `failure-escalated` | yes (version-free) |

Patch-only restart narration remains suppressed (Fork 3 of
`mature-update-announcements`); the handshake still runs for verification.

## Tests

- `tests/unit/update-notify-policy.test.ts` — pure policy, both sides of every
branch (silent vs reach-user, heartbeat on/off, confirmation vs not, unknown
kind → silent fail-safe).
- `tests/unit/update-notify-routing.test.ts` — wiring integrity: a real
`AutoUpdater.notify()` of kind `mechanics` never calls `telegram.sendToTopic`,
while `interruption`/`actionable`/`failure-escalated` do; option-B gating.
- Updated `notification-spam-prevention`, `auto-updater-failures`,
`graceful-updates-phase2`, `update-notification-topic-lock` to assert the new
contract (mechanics silent; interruption/actionable version-free).

## Migration parity

- **Code-only behavior change** — the silencing ships in `AutoUpdater` /
`server.ts` / `updateNotifyPolicy`, so existing agents get it on npm update; no
config migration is required (absence of `backgroundRefreshHeartbeat` = option
A = the desired default).
- **Agent awareness** — a "Quiet update mechanics" block is added to the
CLAUDE.md template (`generateClaudeMd`) and backfilled to existing agents via
`PostUpdateMigrator.migrateClaudeMd()` under its own content-sniff guard
(separate from the maturity-honesty marker, so agents that already have the
maturity section still receive this one).
- **Type** — `UpdateConfig.backgroundRefreshHeartbeat?: boolean` and
`AutoUpdaterConfig.backgroundRefreshHeartbeat?: boolean`, wired through the
`server.ts` construction.
27 changes: 21 additions & 6 deletions src/commands/server.ts
Original file line number Diff line number Diff line change
Expand Up @@ -40,6 +40,7 @@ import { lifelineOwnsPoll as lifelineOwnsTelegramPoll } from '../lifeline/Telegr
import { DispatchManager } from '../core/DispatchManager.js';
import { UpdateChecker } from '../core/UpdateChecker.js';
import { AutoUpdater } from '../core/AutoUpdater.js';
import { decideUpdateNotify } from '../core/updateNotifyPolicy.js';
import { UpdateRestartHandshake, verifyRestartHandshake } from '../core/UpdateRestartHandshake.js';
import { AutoDispatcher } from '../core/AutoDispatcher.js';
import { DispatchExecutor } from '../core/DispatchExecutor.js';
Expand Down Expand Up @@ -5475,18 +5476,28 @@ export async function startServer(options: StartOptions): Promise<void> {
}
restartHandshake.clearHandshake();
} else if (outcome.kind === 'failed') {
const msg = outcome.escalate
? `Heads up — I tried to update to v${outcome.expectedVersion} but the restart didn't pick up the new code. Still running v${outcome.runningVersion}. Retry ${outcome.retryCount}.`
: `Update to v${outcome.expectedVersion} was applied but I'm still running v${outcome.runningVersion}. The next restart should pick it up.`;
// A non-escalated mismatch is the transient, self-healing case the user
// flagged as noise ("vX applied but I'm still running vY — the next
// restart should pick it up"). It's update mechanics → log only. Only an
// ESCALATED failure (the restart genuinely isn't taking the new code
// after repeated retries) reaches the user, and version-free.
const kind = outcome.escalate ? 'failure-escalated' : 'mechanics';
const userMsg =
`Heads up — I tried to apply an update but the restart didn't take, so I'm still on the previous version. ` +
`I'll keep retrying automatically; flagging it in case it persists.`;
const logMsg =
`[restart-handshake] failed (escalate=${outcome.escalate}, retry=${outcome.retryCount}): ` +
`expected v${outcome.expectedVersion}, running v${outcome.runningVersion}`;
const decision = decideUpdateNotify(kind);
const topicId = state.get<number>('agent-updates-topic') || 0;
if (telegram && topicId) {
if (decision.reachUser && telegram && topicId) {
try {
await telegram.sendToTopic(topicId, msg);
await telegram.sendToTopic(topicId, userMsg);
} catch (err) {
console.warn(`[restart-handshake] failure notification failed: ${err instanceof Error ? err.message : String(err)}`);
}
} else {
console.warn(`[restart-handshake] ${msg}`);
console.warn(`${logMsg} — ${decision.reason}`);
}
}
} catch (err) {
Expand All @@ -5509,6 +5520,10 @@ export async function startServer(options: StartOptions): Promise<void> {
// Primary-developer mode (per-agent opt-in) — never defer a restart for
// active sessions or the restart window; always roll onto the latest.
restartImmediately: config.updates?.restartImmediately ?? false,
// Option B for update messaging — surface a single quiet "refreshed in
// the background" heartbeat instead of full silence. Default false
// (= option A, full silence). Spec: docs/specs/quiet-update-mechanics.md.
backgroundRefreshHeartbeat: config.updates?.backgroundRefreshHeartbeat ?? false,
// codex-instar audit Item 4 — wire the handshake into AutoUpdater so
// the pre-restart notification is DEFERRED into the marker file.
restartHandshake,
Expand Down
Loading
Loading