Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
75 changes: 75 additions & 0 deletions docs/specs/cross-session-coordination.eli16.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,75 @@
# Cross-Session Coordination — plain-English overview

## What this is, in one line

When more than one copy of me is running at the same time on the same machine, this
gives them a way to *see each other* before they each do something big — so they
stop stepping on each other's toes.

## Why we need it (the real story)

I can have several sessions running at once against the same set of files (my
`.instar/` folder). Think of it like several cooks in one kitchen, all following the
same recipe, none of them aware the others are there. Two things actually went wrong
on 2026-05-28:

1. **The "still working" ghost.** One session finished a job but left a sticky note
saying "still cooking: yes." A second session read that stale note and kept telling
the user "I'm still working on it" — even though nothing was happening.
2. **The double-reaction (the bad one).** Both sessions noticed the same bug. One
calmly built the proper fix. The *other* panicked and hit the emergency brake —
flipped a feature off and cancelled 19 pending tasks. Neither knew the other was on
it. Result: the bug got fixed, but the engine was left off and the test setup was
wiped. Two correct instincts that, uncoordinated, undid each other.

The root cause is the same both times: **shared files, no traffic light.** Nothing
tells one session "hey, another you is already acting — wait a second."

## What we are building (and just as importantly, what we're NOT)

Justin chose the **light** option from a menu of light / medium / heavy. So this is
the gentle version, on purpose:

- **What it does:** Before a session does something high-impact (flip a feature flag,
withdraw commitments), it can post a quick note — "I'm about to do X." Whenever any
session takes such an action, the system hands back a little advisory warning if
*another* session was just active: "⚠ another session withdrew 19 commitments 4
minutes ago — confirm before proceeding."
- **What it does NOT do:** It never blocks anything. There are no hard locks. It never
changes the thing you were trying to change. It's a heads-up, not a gate. If a
session ignores the warning, the action still goes through.

That matches exactly what Justin asked for: "start small and learn to collaborate
slowly and smoothly," with cheap course-correction.

## How you'd actually see it

- A session announces intent: `POST /coordination/intent`.
- A session inspects what's been happening: `GET /coordination/recent`.
- The two risky actions from the incident — config-flag flips and commitment
withdrawals — automatically record themselves and carry the warning back in their
own response, so it works even if a session forgets to announce.
- It's quiet: no Telegram pings. Everything is logged to
`logs/cross-session-events.jsonl` for later reading.

## What already exists vs. what's new

- **Already existed:** Single-commitment write safety (`CommitmentTracker.mutate()`
uses a compare-and-swap so two sessions can't tear one record). That protects a
single write — it does *not* stop two sessions adopting *opposite plans*.
- **New here:** the cross-session visibility layer — the shared scratchpad + the
advisory warning that surfaces an opposing action before you commit to yours.

## What's deliberately left for later

- The stale "still working" ghost (incident #1) is a *separate* liveness bug, not a
coordination problem. It's noted as a candidate next step, not bundled in here.
- Hard locks / leader election / one-session-in-charge — that's the heavy redesign
Justin chose *not* to do for now.

## What the reader needs to decide

Really just one thing: is "light and advisory" the right first step, or do you want
something stronger (real locks) sooner? Justin already said light-first. If watching
it in practice shows the advisory isn't enough, the medium/heavy options are still on
the table — nothing here paints us into a corner.
129 changes: 129 additions & 0 deletions docs/specs/cross-session-coordination.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,129 @@
---
title: Cross-Session Coordination Signal (light, advisory)
status: approved
approved: true
approver: justin
approved-at: "2026-05-28T19:00:00Z"
review-convergence: "2026-05-28T19:00:00Z"
review-iterations: 3
review-completed-at: "2026-05-28T19:00:00Z"
review-report: "docs/specs/reports/cross-session-coordination-convergence.md"
created: 2026-05-28
owner: echo
eli16-overview: cross-session-coordination.eli16.md
---

# Cross-Session Coordination Signal (light, advisory)

> Approved direction: Justin, 2026-05-28 topic 15579 — "go with a light fix for now
> and see if it helps … learn to collaborate slowly and smoothly." This is the
> written form of the build Justin chose from the light/medium/heavy menu and saw
> described before approving. Convergence review:
> docs/specs/reports/cross-session-coordination-convergence.md.

## The problem (two observed incidents, 2026-05-28)

A single agent home can have **multiple concurrent Claude Code sessions** acting on
the same `.instar/` state. They are blind to each other. Two real incidents:

1. **Stale `active:true` ghost** — an autonomous-state file said `active:true` after
the work had completed; a *different* session read it as "still in flight" and
narrated "working." (Recovered easily. Treated as a SEPARATE staleness/liveness
bug — out of scope for this signal; see "Out of scope".)
2. **Opposing durable actions (the damaging one)** — one session built the proper fix
(PR 495); in parallel another session hit a "safety brake": flipped a config flag
to `false` AND mass-withdrew ~19 active commitments. Both reached a correct local
diagnosis; neither knew the other was acting. Net: bugs fixed, engine off, test
bed gone.

Root cause: shared mutable state (`.instar/config.json`, commitments, autonomous
state) with **no cross-session visibility before a high-impact action**.
`CommitmentTracker.mutate()` has single-writer CAS *per commitment* — that prevents
torn writes, not two sessions adopting opposing policies and each acting durably.

## The light fix (this build)

A **CrossSessionCoordinator**: a shared, append-only scratchpad of recent
high-impact "structural actions" + voluntary "I'm about to do X" intents. Any
structural action surfaces *other recent* entries to the actor as an advisory
`coordinationWarning`. **It never blocks** (light = advisory). It never mutates the
target state. It is the "visible signal sessions check before acting" Justin asked for.

### What gets recorded
- **Voluntary intent** (`POST /coordination/intent`) — a session announces what it is
about to do / is doing ("building PR 495 fix for the redrive flood"). This is the
primary "I'm about to do X" surface.
- **config-flag flip** — auto-recorded at the `PATCH /config` route for sensitive
keys (feature on/off toggles). Backstop; no agent reliance.
- **commitment-withdraw** — auto-recorded in the `POST /commitments/:id/withdraw`
route handler after a successful withdrawal. All agent-initiated withdrawals are
route-driven (the route is the single agent-facing path), so recording there is
the single source — no double-count — and it lets the warning ride back on the
same HTTP response. (Code-internal lifecycle transitions like expiry are not
withdrawals and are deliberately not recorded.)

### The signal
On every recorded action, the coordinator computes `concurrent`: other non-expired
records within `windowMs` (default 10 min) authored by a *different or unknown* actor.
If non-empty, the acting route attaches a `coordinationWarning` to its HTTP response
("⚠ N recent structural action(s) by another/unknown session in the last Xm: … —
confirm before proceeding"). The session (Claude) reads that warning inline and can
reconsider, re-check, or tell the user. `GET /coordination/recent` exposes the ledger
for explicit pre-action checks and inspection.

### Storage + audit
- Ledger: `<stateDir>/state/cross-session-actions.json` — `{ version, actions: [...] }`,
capped (200), TTL-pruned on write (`retentionMs`, default 60 min). Atomic temp+rename,
reload-per-op (cross-process safe, mirrors ConversationStore).
- Audit: append each record to `<stateDir>/logs/cross-session-events.jsonl`.

### Config (default-ON housekeeping; matches the silently-stopped sentinels)
```
monitoring.crossSessionCoordination: {
enabled: true, // records + warns. false => passive (GET still 200, no records/warnings)
windowMs: 600000, // concurrency window (10 min)
retentionMs: 3600000, // ledger TTL (60 min)
sensitiveConfigKeys: ["monitoring", "tunnel", "autonomousSessions", "lifeline", "updates"]
}
```
No Telegram escalation in v1 (deliberate — near-silent; in-response warning + GET +
JSONL is full observability without topic-spam risk). A buzz-on-conflict toggle is a
small future addition if it proves wanted <!-- tracked: topic-15579 -->.

## Wiring
- `CrossSessionCoordinator` constructed inside `AgentServer` (like FrameworkIssueLedger /
TokenLedger — always alive, so `GET /coordination/recent` returns 200 not 503),
in its own try/catch so a failure never cascades into the other monitors.
- Added to `RouteContext` as `crossSessionCoordinator`. (NOTE: distinct from the
multi-MACHINE `coordinator` already on ctx.)
- Actor resolution (`coordinationActor`) is best-effort and SESSION-level only
(`X-Instar-Session` / `X-Instar-Actor` header, or body `actor`) — never the agent
id, which all sessions of one agent share and would wrongly suppress the warning.
- `PATCH /config`: record sensitive-key flips, attach `coordinationWarning`.
- `POST /commitments/:id/withdraw`: after success, attach `coordinationWarning`.
- Routes: `POST /coordination/intent`, `GET /coordination/recent`.

## Migration parity
- Config defaults: add `monitoring.crossSessionCoordination` to `ConfigDefaults.ts`
`SHARED_DEFAULTS` → auto-applied to existing agents via `applyDefaults` in
`PostUpdateMigrator.migrateConfig`. No bespoke migration needed.
- CLAUDE.md template: add a Cross-Session Coordination awareness section to
`generateClaudeMd()` + `migrateClaudeMd()` content-sniff guard.

## Tests (all three tiers — NON-NEGOTIABLE)
- **Unit** (`tests/unit/cross-session-coordinator.test.ts`): record/prune/window,
concurrent detection (different vs same vs unknown actor), cap, atomic persistence,
reload-per-op, disabled mode.
- **Integration** (`tests/integration/cross-session-coordination-routes.test.ts`):
POST /coordination/intent → GET /coordination/recent; PATCH /config sensitive key
returns `coordinationWarning` when a concurrent intent exists; withdraw warning;
auth required.
- **E2E** (`tests/e2e/cross-session-coordination-lifecycle.test.ts`): boot real
AgentServer (production path), GET /coordination/recent returns 200 (alive), a
recorded action surfaces end-to-end, capability discoverable.

## Out of scope (separable, flagged to Justin)
- Incident #1 stale-`active:true` cleanup/liveness guard — a distinct staleness bug,
not a coordination signal. Candidate next step.
- Hard locks / leader election / session registry — explicitly the *heavy* path
Justin declined for now.
77 changes: 77 additions & 0 deletions docs/specs/reports/cross-session-coordination-convergence.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,77 @@
# Convergence Report — Cross-Session Coordination Signal (light, advisory)

- Spec: `docs/specs/cross-session-coordination.md`
- Owner: echo
- Approved direction: Justin, topic 15579, 2026-05-28 ("go with a light fix for now …
learn to collaborate slowly and smoothly"), chosen from an explicit
light / medium / heavy menu, with the build described before approval.
- Iterations: 3 substantive design-review passes (below).

## Scope under review

A LIGHT, advisory cross-session signal: a shared append-only scratchpad of recent
high-impact structural actions + voluntary "I'm about to do X" intents, surfaced to
an acting session as an in-response `coordinationWarning`. Never blocks, never mutates
target state. Explicitly NOT hard locks / leader election (the heavy path Justin
declined for now).

## Iteration 1 — initial design

Settled the storage model (atomic temp+rename JSON ledger, reload-per-op for
cross-process safety, TTL prune + hard cap), the action taxonomy
(`intent` / `config-flag` / `commitment-withdraw` / `other`), and the advisory-only
contract (record + warn, never block, never touch the target). Decided default-ON
housekeeping with no Telegram escalation in v1 — the in-response warning + `GET
/coordination/recent` + JSONL audit give full observability without re-introducing
the topic-spam risk that the attention-flood work just closed.

## Iteration 2 — implementation review (caught real issues)

1. **Double-count reconciliation (commitment-withdraw recording site).** The first
design proposed subscribing to `CommitmentTracker`'s `withdrawn` event AND recording
in the withdraw route — which double-counts. Resolved to a single source: record in
the `POST /commitments/:id/withdraw` route handler. Rationale: all *agent-initiated*
withdrawals are route-driven (the route is the single agent-facing path), so this is
the single source, gives no double-count, and lets the warning ride back on the same
HTTP response. Code-internal lifecycle transitions (expiry) are not withdrawals and
are deliberately not recorded. Spec updated to match.

2. **Actor-resolution correctness.** Decided `coordinationActor()` must resolve a
SESSION-level discriminator only (`X-Instar-Session` / `X-Instar-Actor` header, or
body `actor`) and must NEVER fall back to the agent id — every session of one agent
shares the agent id, so using it would wrongly suppress the very cross-session
warning we want. Unknown actor is treated as "potentially a different session" so the
signal errs toward surfacing.

3. **Intent-dedup bug (found via the integration test, fixed).** The action-identity
used for dedupe was `kind + target + value`. Two *different* intents both have no
target/value, so they collapsed into "the same action" and the warning was wrongly
suppressed. Fixed: intents are events, not states — they are never deduped. Dedupe
now applies only to state-flip kinds (config-flag / commitment-withdraw / other),
where two sessions writing the identical kind+target+value genuinely are one write. A
regression unit test locks the boundary.

## Iteration 3 — testing-integrity + standards pass

- All three tiers built and green: unit (16), integration (8, exercising both real
incident vectors — config-flip + withdraw over live HTTP), e2e lifecycle (6, booting
the real AgentServer → feature alive, 200 not 503), migration-parity (5).
- Migration parity: config default in `ConfigDefaults.SHARED_DEFAULTS`
(applyDefaults propagates to existing agents), `migrateClaudeMd` awareness section +
`generateClaudeMd` template, CapabilityIndex entry. Unit-tested.
- No-silent-fallbacks: the coordinator's advisory catch blocks carry
`@silent-fallback-ok` markers (persistence failure must never break a calling route);
verified the file contributes zero to the lint count.

## Open / deferred

- Incident #1 (stale `active:true` liveness ghost) — a distinct staleness bug, not a
coordination signal; candidate next step <!-- tracked: topic-15579 -->.
- Buzz-on-conflict Telegram toggle — small future addition if it proves wanted
<!-- tracked: topic-15579 -->.

## Convergence

No open design contradictions. The advisory-only contract, single-source recording,
session-level actor resolution, and intent-distinctness are settled and test-locked.
Converged.
3 changes: 3 additions & 0 deletions site/src/content/docs/architecture/under-the-hood.md
Original file line number Diff line number Diff line change
Expand Up @@ -58,6 +58,9 @@ The proactive eye. Polls every 60 seconds to classify each session as healthy, i

**How they connect:** SessionMonitor detects the problem → SessionRecovery tries a fast fix → if that doesn't work, TriageOrchestrator runs heuristics → if those don't match, it spawns an LLM diagnosis. Meanwhile, SessionWatchdog independently catches stuck commands at the process level.

### CrossSessionCoordinator
A light, advisory signal for when more than one session runs against the same agent home at once — they otherwise share `.instar/` state while blind to each other. CrossSessionCoordinator keeps a shared, append-only scratchpad of recent high-impact actions (feature-flag flips, commitment withdrawals) plus voluntary "I'm about to do X" intents. When a session takes a structural action, it surfaces any recent action by a *different* session as an advisory `coordinationWarning` ("⚠ another session withdrew 19 commitments 4m ago — confirm first"). It never blocks and never mutates the target state — it's a heads-up, not a lock. Default-on, near-silent (no Telegram); audit trail at `logs/cross-session-events.jsonl`.

</details>

---
Expand Down
11 changes: 11 additions & 0 deletions site/src/content/docs/reference/api.md
Original file line number Diff line number Diff line change
Expand Up @@ -123,6 +123,17 @@ The Instar server exposes a REST API on `localhost:4040` (configurable). All end
| GET | `/messages/outbox` | Inter-agent outbox |
| GET | `/messages/dead-letter` | Dead letter queue |

## Cross-Session Coordination

Backed by the `CrossSessionCoordinator` — a light, advisory signal so concurrent sessions on one agent home see each other's recent high-impact actions before acting. Never blocks; surfaces a `coordinationWarning`. Pass `X-Instar-Session: <topic-or-label>` so the signal can tell sessions apart.

| Method | Path | Description |
|--------|------|-------------|
| POST | `/coordination/intent` | Announce "I'm about to do X" so other sessions see it before acting |
| GET | `/coordination/recent` | Recent structural actions + intents (newest first) |

Sensitive `PATCH /config` flips and `POST /commitments/:id/withdraw` calls auto-record and attach a `coordinationWarning` to their own response when another session was recently active. Config: `monitoring.crossSessionCoordination`. Audit: `logs/cross-session-events.jsonl`.

## Threadline (MCP Tools)

These tools are registered as an MCP server and called by Claude Code (or any MCP client) via stdio transport. They are registered automatically on server boot.
Expand Down
15 changes: 15 additions & 0 deletions src/config/ConfigDefaults.ts
Original file line number Diff line number Diff line change
Expand Up @@ -56,6 +56,21 @@ const SHARED_DEFAULTS: Record<string, unknown> = {
contextWedgeSentinel: {
enabled: true,
},
// CrossSessionCoordinator — light, advisory cross-session coordination signal.
// Default-ON housekeeping (like the silently-stopped sentinels): it records
// high-impact structural actions + voluntary "I'm about to do X" intents and
// surfaces concurrent activity by another session as an in-response advisory
// warning. It NEVER blocks and never mutates target state. enabled:false =>
// passive (GET stays 200; nothing recorded, no warnings). No Telegram
// escalation in v1 (near-silent by design). sensitiveConfigKeys are the
// top-level prefixes whose PATCH /config flips get auto-recorded.
// See docs/specs/cross-session-coordination.md.
crossSessionCoordination: {
enabled: true,
windowMs: 600000,
retentionMs: 3600000,
sensitiveConfigKeys: ['monitoring', 'tunnel', 'autonomousSessions', 'lifeline', 'updates'],
},
// SessionReaper — pressure-aware reaper of idle-but-alive sessions.
// UNLIKE the sentinels above, default OFF + dry-run: it is the only monitor
// that *kills* sessions on a heuristic, so it ships dark and must be flipped
Expand Down
Loading
Loading