fix: pre-check for session.shutdown before sending prompts (fixes #397)#802
fix: pre-check for session.shutdown before sending prompts (fixes #397)#802
Conversation
Before sending a prompt, SendPromptAsync now checks if events.jsonl ends with session.shutdown. If so, it forces a reconnect before sending instead of sending to a dead session and discovering the failure 10+ minutes later via the watchdog. The GetLastEventType helper (tail-read, last 4KB) keeps overhead minimal on the normal send path. Also increases the structural test search window in PrematureIdleSignal_ResetInSendPromptAsync to accommodate the new code. Fixes #397 Co-authored-by: copilot-agentic-workflow[bot] <224017+copilot-agentic-workflow[bot]@users.noreply.github.com> Co-authored-by: Copilot <[email protected]>
🧪 Integration Test Report — PR #802
✅ All platforms passed |
Cross-Platform Verification — PR #802Build Results
✅ All platforms verified Triggered by: verify-build run |
This comment has been minimized.
This comment has been minimized.
There was a problem hiding this comment.
Multi-Model Code Review — PR #802
Reviewed by 3 independent reviewers with adversarial consensus. Findings that ≥2/3 reviewers agreed on are included; single-reviewer findings were cross-validated and discarded if disputed.
Findings Summary
| # | Severity | Consensus | Finding |
|---|---|---|---|
| 1 | 🟡 MODERATE | 2/3 | Spurious double-reconnect on first-send-after-restart — Pre-check reads stale events.jsonl after lazy-resume just created a fresh session, causing an unnecessary second reconnect cycle (~3-5s latency) |
| 2 | 🟡 MODERATE | 2/3 | OperationCanceledException wrapping — Catch block wraps cancellation in InvalidOperationException, losing cancellation identity and showing a misleading error message to the user |
| 3 | 🟡 MODERATE | 3/3 | Test coverage gap — All unit tests exercise GetLastEventType only, not the actual dispose→reconnect→send flow. Integration tests are UI smoke tests unrelated to shutdown pre-check |
| 4 | 🟢 MINOR | 2/3 | Misleading error message — Failure message attributes all reconnect errors to "server shutdown" regardless of actual cause (auth, network, etc.) |
| 5 | 🟢 MINOR | 2/3 | Fragile structural test window — MultiAgentRegressionTests window bump from 8000→10000 will break again on next feature addition to SendPromptAsync |
What's Good
- Core logic is sound (3/3 agree): The pre-check correctly detects server-killed sessions via
GetLastEventType, and theSendingFlagguard prevents concurrent-send races. The error path correctly releasesSendingFlag(sinceIsProcessinghasn't been set yet at this point). - No data loss, security, or stuck-session risks from the happy path.
- The
GetLastEventTypeutility is well-designed (tail-read,FileShare.ReadWrite, null-safe). - The TOCTOU on
events.jsonlis benign (false positives cause an unnecessary reconnect, not a missed shutdown).
Discarded Findings (1/3 only, disputed)
- Missing orphan guard on old session callbacks — Reviewer 1 explicitly noted processing hasn't started and the old session's transport is disposed, so stale event handlers won't fire. Agreed.
state.Session = nulloutsideconnectLockraces with poll loop — The poll loop checksIsProcessingand skips active sessions;SendingFlag=1prevents concurrent sends. Narrow window, minimal risk.
CI & Prior Reviews
- CI: No check runs found on this PR.
- Prior reviews: None.
- Test coverage: The existing 3644 unit tests pass per PR description, but the new behavior-under-test (dispose → reconnect flow) has no automated behavioral test coverage.
Warning
⚠️ Firewall blocked 2 domains
The following domains were blocked by the firewall during workflow execution:
api.nuget.orgdc.services.visualstudio.com
To allow these domains, add them to the network.allowed list in your workflow frontmatter:
network:
allowed:
- defaults
- "api.nuget.org"
- "dc.services.visualstudio.com"See Network Configuration for more information.
Generated by Expert Code Review · ● 61M
| try | ||
| { | ||
| var shutdownCheckSid = state.Info.SessionId; | ||
| if (!string.IsNullOrEmpty(shutdownCheckSid)) | ||
| { | ||
| var eventsPath = Path.Combine(SessionStatePath, shutdownCheckSid, "events.jsonl"); | ||
| var lastEvent = GetLastEventType(eventsPath); | ||
| if (lastEvent == "session.shutdown") | ||
| { | ||
| Debug($"[SEND-SHUTDOWN-PRECHECK] '{sessionName}' events.jsonl ends with session.shutdown — forcing reconnect before send"); | ||
| try { await state.Session.DisposeAsync(); } catch { /* session may already be disposed */ } | ||
| state.Session = null; | ||
| await EnsureSessionConnectedAsync(sessionName, state, cancellationToken); |
There was a problem hiding this comment.
🟡 MODERATE — Spurious double-reconnect on first-send-after-restart (2/3 reviewers)
When a session was killed by the server (events.jsonl ends with session.shutdown) and PolyPilot restarts, the lazy-resume block at line 3498 fires first (because state.Session == null). If ResumeSessionAsync succeeds (server still has session data), state.Session is now set to a fresh session — but events.jsonl still shows session.shutdown as the last event because the server hasn't flushed a new event to disk yet.
The pre-check then reads the stale file, detects session.shutdown, disposes the freshly-resumed session, and reconnects again — wasting ~3-5 seconds and creating unnecessary resource churn.
Concrete scenario:
- Server idle-kills session → writes
session.shutdowntoevents.jsonl - User restarts PolyPilot, sends a prompt
- Lazy-resume at line 3498 →
EnsureSessionConnectedAsync→state.Session = sessionA✓ - Pre-check reads events.jsonl → still sees
session.shutdown(stale) - Disposes
sessionA, reconnects tosessionB— unnecessary double-reconnect
Fix: Gate the pre-check on whether the session was already connected when we entered:
bool wasAlreadyConnected = state.Session != null;
// Lazy resume ...
if (state.Session == null) { ... }
// Only check when we entered with an existing (possibly stale) session.
// If we just performed a lazy-resume, the session is fresh and events.jsonl is stale.
if (wasAlreadyConnected)
{
// ... shutdown pre-check ...
}| Assert.True(sendIdx >= 0, "SendPromptAsync must exist in CopilotService.cs"); | ||
|
|
||
| var sendBlock = source.Substring(sendIdx, Math.Min(8000, source.Length - sendIdx)); | ||
| var sendBlock = source.Substring(sendIdx, Math.Min(10000, source.Length - sendIdx)); |
There was a problem hiding this comment.
🟢 MINOR — Fragile structural test window will break on next feature addition (2/3 reviewers)
This test searches for PrematureIdleSignal.Reset() within the first N characters from the SendPromptAsync signature. Bumping from 8000→10000 to accommodate the new pre-check block means this test breaks every time code is added to the early part of SendPromptAsync.
Fix: Search from a closer anchor (e.g., IsProcessing = true) instead of the method signature, or scan the entire method body by finding the matching closing brace.
| catch (Exception ex) | ||
| { | ||
| Debug($"[SEND-SHUTDOWN-PRECHECK] '{sessionName}' reconnect after shutdown detection failed: {ex.Message}"); | ||
| Interlocked.Exchange(ref state.SendingFlag, 0); | ||
| throw new InvalidOperationException( | ||
| $"Session '{sessionName}' was shut down by the server and reconnection failed. Try creating a new session.", ex); |
There was a problem hiding this comment.
🟡 MODERATE — OperationCanceledException wrapping loses cancellation identity (2/3 reviewers)
EnsureSessionConnectedAsync calls connectLock.WaitAsync(cancellationToken) and ResumeSessionAsync(..., cancellationToken), both of which throw OperationCanceledException when the user cancels (e.g., clicks Stop or navigates away during reconnect).
This catch wraps it in InvalidOperationException, which:
- Loses the
OperationCanceledExceptionidentity — callers checkingex is OperationCanceledExceptionwon't match - Shows the user a misleading "shut down by the server" message when they simply cancelled
Fix: Add a specific catch before the general one:
catch (OperationCanceledException)
{
Interlocked.Exchange(ref state.SendingFlag, 0);
throw; // preserve cancellation semantics
}
catch (Exception ex)
{
// ... existing wrap for real failures ...
}| throw new InvalidOperationException( | ||
| $"Session '{sessionName}' was shut down by the server and reconnection failed. Try creating a new session.", ex); |
There was a problem hiding this comment.
🟢 MINOR — Error message attributes all failures to "server shutdown" (2/3 reviewers)
EnsureSessionConnectedAsync can fail for many reasons (auth failure → "Go to Settings → Save & Reconnect", network error, server not started, etc.). This message always says "shut down by the server and reconnection failed", hiding the actionable root cause.
The Dashboard.razor caller extracts t.Exception?.InnerException?.Message, so the user sees the generic message while the actual fix instruction is buried deeper.
Fix: Include the inner exception's message:
throw new InvalidOperationException(
$"Session '{sessionName}' needs reconnection after detecting shutdown state: {ex.Message}", ex);| // --- Behavioral test: SendPromptAsync on a shutdown session --- | ||
| // We can't call SendPromptAsync directly (requires SDK infrastructure), but we can | ||
| // verify the detection logic that guards it. |
There was a problem hiding this comment.
🟡 MODERATE — Tests don't cover the actual behavior change (3/3 reviewers)
All 8 unit tests exercise GetLastEventType (a pre-existing static utility) or manually replicate the lastEvent == "session.shutdown" boolean check. None of them invoke SendPromptAsync or test the dispose → null → EnsureSessionConnectedAsync → send sequence that this PR introduces.
The comment here admits this limitation, but the result is that the most critical aspects of the fix are untested:
- The
DisposeAsync→Session = null→EnsureSessionConnectedAsyncreconnect flow - The
SendingFlagrelease in the catch block on reconnect failure - That a successfully reconnected session proceeds to
SendAsync - The double-reconnect scenario (lazy-resume then pre-check)
Similarly, the integration tests in PolyPilot.IntegrationTests/ShutdownPreCheckTests.cs only check that the dashboard loads and a prompt input exists — no shutdown scenario at all.
Fix: Use the existing stub/mock infrastructure (same pattern as ProcessingWatchdogTests or TurnEndFallbackTests) to create a behavioral test that:
- Constructs a session state with a mock
CopilotSession - Writes
session.shutdownto the test events.jsonl - Verifies the old session was disposed and reconnection was attempted
- (Error path) Mocks
EnsureSessionConnectedAsyncto throw → verifiesSendingFlagis released
🔄 Review-Fix Loop — Round 1 of 3Findings addressed: 6/6 (3 moderate, 3 minor)
Findings skipped: 0 Tests: ✅ All 3649 tests pass (5 new) Next: Expert review round 2 dispatched, build verification dispatched. Warning
|
Cross-Platform Verification — PR #802Build Results
✅ All platforms verified Previous Review HistoryFound 5 automated review(s) on this PR. Build verification validates that all review-driven fixes compile and pass tests across platforms. Triggered by: verify-build run |
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
There was a problem hiding this comment.
Multi-Model Code Review — PR #802
Reviewed by 3 independent reviewers with adversarial consensus. Findings that ≥2/3 reviewers agreed on are included; the single 1/3 finding was validated against prior round analysis and discarded.
Findings Summary
| # | Severity | Consensus | Finding |
|---|---|---|---|
| 1 | 🟡 MODERATE | 3/3 | OperationCanceledException wrapping loses cancellation identity — catch block wraps OCE in InvalidOperationException, breaking Task.IsCanceled, multi-agent worker shutdown, and violating the codebase's consistent catch (OperationCanceledException) { throw; } pattern |
| 2 | 🟡 MODERATE | 2/3 | Spurious double-reconnect on first-send-after-restart — lazy-resume at line 3498 reconnects, then pre-check reads stale events.jsonl (still ending session.shutdown) and reconnects again; only affects resume-success path (fresh-create updates SessionId, avoiding the stale read) |
| 3 | 🟡 MODERATE | 3/3 | Tests don't cover the actual behavior change — all unit tests exercise GetLastEventType or replicate a boolean check manually; the critical dispose→reconnect→send flow, SendingFlag release on failure, and double-reconnect scenario have zero automated coverage |
| 4 | 🟢 MINOR | 3/3 | Misleading error message — always says "shut down by the server" regardless of actual cause (auth, network, cancellation) |
| 5 | 🟢 MINOR | 3/3 | Fragile structural test window — 8000→10000 character window will break on next feature addition to SendPromptAsync |
What's Good
- Core detection logic is sound (3/3 agree):
GetLastEventTypeis well-designed (tail-read,FileShare.ReadWrite, null-safe, swallows parse errors) - SendingFlag guard is correct — the
Interlocked.Exchangerelease in the catch path runs beforeIsProcessingis set, so no stuck-session risk from the error path - The TOCTOU on
events.jsonlis benign — false positives cause an unnecessary reconnect, not a missed shutdown - No data loss, security, or information disclosure risks identified
Discarded Finding (1/3 only)
state.Session = nulloutsideconnectLockraces with concurrent readers — Reviewer 2 raised this; Reviewers 1 and 3 did not. The prior round 1 review (same PR) ran the same follow-up and discarded it:SendingFlag=1prevents concurrentSendPromptAsync,IsProcessing=falsecauses watchdog/poll to skip, and otherstate.Sessionreaders guard with null checks. Narrow window, minimal risk.
Prior Findings Verdict
| # | Prior Finding | Verdict | Rationale |
|---|---|---|---|
| 1 | Spurious double-reconnect on first-send-after-restart | ✅ Confirmed (2/3) | Valid for the resume-success path (SessionId unchanged → stale events.jsonl). One reviewer correctly noted the fresh-create fallback (Persistence.cs:463 updates SessionId) avoids this — but resume-success is the more common path for this PR's target scenario. |
| 2 | OperationCanceledException wrapping loses cancellation identity |
✅ Confirmed (3/3) | connectLock.WaitAsync(ct) and ResumeSessionAsync(..., ct) both throw OCE on cancel. The catch (Exception) wraps it, breaking the codebase's consistent pattern (11+ sites use catch (OCE) { throw; }). |
| 3 | Tests only cover GetLastEventType, not dispose→reconnect→send flow |
✅ Confirmed (3/3) | All "behavioral" tests duplicate a boolean comparison. No test exercises SendPromptAsync internals or verifies SendingFlag release, DisposeAsync invocation, or reconnect attempt. Integration tests verify UI elements exist, not shutdown behavior. |
| 4 | Misleading reconnect failure message always blames server shutdown | ✅ Confirmed (3/3) | EnsureSessionConnectedAsync can fail for auth, network, init, or cancellation reasons. The fixed message always says "shut down by the server", hiding actionable root cause. Dashboard.razor surfaces InnerException.Message but users see the outer wrapper first. |
| 5 | Structural test window bump remains fragile | ✅ Confirmed (3/3) | Bumping a character window is a maintenance liability. A closer anchor or full-method scan would be stable. |
CI & Prior Reviews
- CI: No check runs found on this PR head commit.
- Prior reviews: One automated review (round 1) already identified these same 5 findings. PR #803 was created to address them.
- Test coverage: 3644 existing tests pass per PR description. The new behavior-under-test (dispose → reconnect → send) has no automated behavioral coverage.
⚠️ Firewall blocked 2 domains during review (api.nuget.org,dc.services.visualstudio.com) — build verification was not possible in this environment.
Warning
⚠️ Firewall blocked 3 domains
The following domains were blocked by the firewall during workflow execution:
api.nuget.orgdc.services.visualstudio.compkgs.dev.azure.com
To allow these domains, add them to the network.allowed list in your workflow frontmatter:
network:
allowed:
- defaults
- "api.nuget.org"
- "dc.services.visualstudio.com"
- "pkgs.dev.azure.com"See Network Configuration for more information.
Note
🔒 Integrity filter blocked 2 items
The following items were blocked because they don't meet the GitHub integrity level.
- 38e6ac5
list_commits: has lower integrity than agent requires. The agent cannot read data with integrity below "approved". - 1dba30c
list_commits: has lower integrity than agent requires. The agent cannot read data with integrity below "approved".
To allow these resources, lower min-integrity in your GitHub frontmatter:
tools:
github:
min-integrity: approved # merged | approved | unapproved | noneGenerated by Expert Code Review · ● 74.2M
| catch (Exception ex) | ||
| { | ||
| Debug($"[SEND-SHUTDOWN-PRECHECK] '{sessionName}' reconnect after shutdown detection failed: {ex.Message}"); | ||
| Interlocked.Exchange(ref state.SendingFlag, 0); | ||
| throw new InvalidOperationException( | ||
| $"Session '{sessionName}' was shut down by the server and reconnection failed. Try creating a new session.", ex); |
There was a problem hiding this comment.
🟡 MODERATE — OperationCanceledException wrapping loses cancellation identity (3/3 reviewers)
EnsureSessionConnectedAsync calls connectLock.WaitAsync(cancellationToken) (Persistence.cs:413) and ResumeSessionAsync(..., cancellationToken) (Persistence.cs:442), both of which throw OperationCanceledException on user cancel. This catch (Exception ex) wraps it in InvalidOperationException, breaking cancellation semantics:
Task.IsCanceledbecomesTask.IsFaulted— callers checkingex is OperationCanceledExceptionwon't match- Multi-agent worker dispatch at line ~3981 catches
OperationCanceledExceptionfor graceful shutdown — a wrapped OCE misidentifies cancellation as permanent failure - Every other catch site in this file (lines 1226, 1261, 1289, 1494) uses
catch (OperationCanceledException) { throw; }before the general catch — this is the sole exception to that established pattern
Concrete scenario: User clicks Stop during pre-check reconnect → they see "shut down by the server and reconnection failed" instead of clean cancellation. In multi-agent orchestration, a cancelled worker surfaces as an error instead of a cancellation.
Fix: Add a specific catch before the general one:
catch (OperationCanceledException)
{
Interlocked.Exchange(ref state.SendingFlag, 0);
throw; // preserve cancellation semantics
}
catch (Exception ex)
{
// ... existing wrap for real failures ...
}| Assert.True(sendIdx >= 0, "SendPromptAsync must exist in CopilotService.cs"); | ||
|
|
||
| var sendBlock = source.Substring(sendIdx, Math.Min(8000, source.Length - sendIdx)); | ||
| var sendBlock = source.Substring(sendIdx, Math.Min(10000, source.Length - sendIdx)); |
There was a problem hiding this comment.
🟢 MINOR — Structural test window bump remains fragile (3/3 reviewers)
Math.Min(8000, ...) → Math.Min(10000, ...) is forced by the new 28-line pre-check block added between the method signature and PrematureIdleSignal.Reset(). This will break again the next time substantive code is added to the early part of SendPromptAsync.
Fix (long-term): Search from a closer anchor (e.g., IsProcessing = true) instead of the method signature, or scan the entire method body by finding the matching closing brace.
| try | ||
| { | ||
| var shutdownCheckSid = state.Info.SessionId; | ||
| if (!string.IsNullOrEmpty(shutdownCheckSid)) | ||
| { | ||
| var eventsPath = Path.Combine(SessionStatePath, shutdownCheckSid, "events.jsonl"); | ||
| var lastEvent = GetLastEventType(eventsPath); | ||
| if (lastEvent == "session.shutdown") | ||
| { | ||
| Debug($"[SEND-SHUTDOWN-PRECHECK] '{sessionName}' events.jsonl ends with session.shutdown — forcing reconnect before send"); | ||
| try { await state.Session.DisposeAsync(); } catch { /* session may already be disposed */ } | ||
| state.Session = null; | ||
| await EnsureSessionConnectedAsync(sessionName, state, cancellationToken); |
There was a problem hiding this comment.
🟡 MODERATE — Spurious double-reconnect on first send after app restart (2/3 reviewers)
When state.Session == null (app restart), the lazy-resume block at line 3498 fires first and calls EnsureSessionConnectedAsync. If resume succeeds (server still has the session), state.Session is now a live connection — but state.Info.SessionId is unchanged, so events.jsonl still ends with session.shutdown (the server writes session.resume asynchronously without a synchronous flush guarantee before return).
This pre-check then reads the stale file, detects session.shutdown, disposes the freshly-resumed valid session, and reconnects a second time — wasting 3-5s and creating unnecessary resource churn.
Note: One reviewer pointed out that the fresh-create fallback path (Persistence.cs:463) updates SessionId to a new value, so the pre-check reads a different events path and avoids the double-reconnect. This is correct — but the resume-success path (the more common case for this PR's target scenario) is still affected.
Concrete scenario:
- Server idle-kills session → writes
session.shutdownto events.jsonl - User restarts PolyPilot, sends a prompt
- Lazy-resume succeeds →
state.Session = sessionA✓ - Pre-check reads stale events.jsonl → still sees
session.shutdown - Disposes
sessionA, reconnects tosessionB— unnecessary double-reconnect
Fix: Track whether the lazy-resume path just ran:
bool wasAlreadyConnected = state.Session != null;
// Lazy resume ...
if (state.Session == null) { ... }
// Only run pre-check when we entered with an existing (possibly stale) session.
if (wasAlreadyConnected)
{
// ... shutdown pre-check ...
}| Debug($"[SEND-SHUTDOWN-PRECHECK] '{sessionName}' reconnect after shutdown detection failed: {ex.Message}"); | ||
| Interlocked.Exchange(ref state.SendingFlag, 0); | ||
| throw new InvalidOperationException( | ||
| $"Session '{sessionName}' was shut down by the server and reconnection failed. Try creating a new session.", ex); |
There was a problem hiding this comment.
🟢 MINOR — Error message attributes all failures to "server shutdown" (3/3 reviewers)
EnsureSessionConnectedAsync can fail for many reasons — auth failure ("Go to Settings → Save & Reconnect"), network error, server not started, cancellation. This message always says "shut down by the server and reconnection failed", hiding the actionable root cause.
Dashboard.razor extracts t.Exception?.InnerException?.Message for display, so the user sees the generic message while the specific fix instruction is buried in the inner exception.
Fix: Include the inner exception's message:
throw new InvalidOperationException(
$"Session '{sessionName}' needs reconnection after detecting shutdown state: {ex.Message}", ex);| // verify the detection logic that guards it. | ||
|
|
||
| [Fact] | ||
| public void ShutdownPreCheck_SessionWithShutdownEvent_IsDetected() | ||
| { | ||
| // Simulate the exact check from SendPromptAsync: | ||
| // 1. Get session ID |
There was a problem hiding this comment.
🟡 MODERATE — Tests don't cover the actual behavior change (3/3 reviewers)
All 8 unit tests exercise GetLastEventType (a pre-existing static utility) or manually replicate the lastEvent == "session.shutdown" boolean check. None invoke SendPromptAsync or test the dispose → null → EnsureSessionConnectedAsync → send sequence that this PR introduces.
The most critical aspects of the fix are untested:
- The
DisposeAsync→Session = null→EnsureSessionConnectedAsyncreconnect flow - The
SendingFlagrelease in the catch block on reconnect failure - That a successfully reconnected session proceeds to
SendAsync - The double-reconnect scenario (lazy-resume then pre-check)
Similarly, the integration tests check dashboard/prompt-input visibility — no shutdown scenario at all.
Fix: Use existing stub infrastructure (same pattern as ProcessingWatchdogTests or TurnEndFallbackTests) to add a behavioral test that:
- Constructs a session state with a mock
CopilotSession - Writes
session.shutdownto the test events.jsonl - Verifies the old session was disposed and reconnection was attempted
- (Error path) Mocks reconnect to throw → verifies
SendingFlagis released
| catch (Exception ex) | ||
| { | ||
| Debug($"[SEND-SHUTDOWN-PRECHECK] '{sessionName}' reconnect after shutdown detection failed: {ex.Message}"); | ||
| Interlocked.Exchange(ref state.SendingFlag, 0); | ||
| throw new InvalidOperationException( | ||
| $"Session '{sessionName}' was shut down by the server and reconnection failed. Try creating a new session.", ex); |
There was a problem hiding this comment.
🔴 CRITICAL — OperationCanceledException wrapping breaks cooperative cancellation (3/3 reviewers)
catch (Exception ex) catches all exceptions — including OperationCanceledException — and wraps them in InvalidOperationException. EnsureSessionConnectedAsync calls connectLock.WaitAsync(cancellationToken) (Persistence.cs:413) which throws OperationCanceledException when the user aborts or navigates away.
Concrete scenario: User sends prompt to a shutdown session → pre-check fires → EnsureSessionConnectedAsync begins reconnecting → user clicks Stop → OperationCanceledException is caught here and wrapped as InvalidOperationException("Session was shut down by the server..."). The 15+ catch (OperationCanceledException) { throw; } handlers up the call stack won't match. The Dashboard's ContinueWith handler sees IsFaulted=true instead of IsCanceled=true, showing a confusing "shut down by the server" error. For multi-agent workers, the catch (OperationCanceledException) when (...) filter at Organization.cs:2747 can't distinguish cancellation from genuine failure.
The lazy-resume block directly above (lines 3504–3508) correctly uses bare throw; to preserve exception identity — this new block breaks the established pattern.
Fix:
catch (OperationCanceledException)
{
Interlocked.Exchange(ref state.SendingFlag, 0);
throw; // preserve cancellation semantics
}
catch (Exception ex)
{
Debug($"[SEND-SHUTDOWN-PRECHECK] '{sessionName}' reconnect after shutdown detection failed: {ex.Message}");
Interlocked.Exchange(ref state.SendingFlag, 0);
throw new InvalidOperationException(
$"Session '{sessionName}' needs reconnection after detecting shutdown state: {ex.Message}", ex);
}| // --- Behavioral test: SendPromptAsync on a shutdown session --- | ||
| // We can't call SendPromptAsync directly (requires SDK infrastructure), but we can | ||
| // verify the detection logic that guards it. | ||
|
|
||
| [Fact] | ||
| public void ShutdownPreCheck_SessionWithShutdownEvent_IsDetected() | ||
| { | ||
| // Simulate the exact check from SendPromptAsync: |
There was a problem hiding this comment.
🟡 MODERATE — Tests don't cover the actual behavior change (3/3 reviewers)
All 8 unit tests exercise GetLastEventType (an existing utility already tested in TurnEndFallbackTests) or manually replicate the lastEvent == "session.shutdown" boolean check. None invoke SendPromptAsync or test the dispose → reconnect → send sequence this PR introduces.
What's untested:
- The
DisposeAsync→Session = null→EnsureSessionConnectedAsyncreconnect flow SendingFlagrelease in the catch block (line 3534) — a leak here permanently deadlocks the sessionOperationCanceledExceptionpropagation (the CRITICAL finding above)- The double-reconnect interaction with lazy-resume
- That a successfully reconnected session proceeds to
SendAsync
The integration tests (PolyPilot.IntegrationTests/ShutdownPreCheckTests.cs) only check that the dashboard loads and a prompt input exists — no shutdown scenario at all.
Fix: Use the existing stub/mock infrastructure (same pattern as ProcessingWatchdogTests or TurnEndFallbackTests) to add structural invariant tests that verify:
- The pre-check block exists in
SendPromptAsyncand containsGetLastEventType,DisposeAsync,EnsureSessionConnectedAsync - The catch block contains
OperationCanceledExceptionhandling (once the CRITICAL fix is applied) - The
SendingFlagrelease exists in the error path
| Assert.True(sendIdx >= 0, "SendPromptAsync must exist in CopilotService.cs"); | ||
|
|
||
| var sendBlock = source.Substring(sendIdx, Math.Min(8000, source.Length - sendIdx)); | ||
| var sendBlock = source.Substring(sendIdx, Math.Min(10000, source.Length - sendIdx)); |
There was a problem hiding this comment.
🟢 MINOR — Fragile structural test window will break on next feature addition (2/3 reviewers)
Bumping from 8000→10000 to accommodate the new pre-check is a "push the number until it fits" pattern. Adding more code before PrematureIdleSignal.Reset() in a future PR will silently push it past the scan window.
Fix: Search from a closer anchor (e.g., IsProcessing = true) or use IndexOf within the full method bounds rather than a fixed character window.
| { | ||
| Debug($"[SEND-SHUTDOWN-PRECHECK] '{sessionName}' events.jsonl ends with session.shutdown — forcing reconnect before send"); | ||
| try { await state.Session.DisposeAsync(); } catch { /* session may already be disposed */ } | ||
| state.Session = null; |
There was a problem hiding this comment.
🟡 MODERATE — null vs null! nullable annotation violation (3/3 reviewers)
SessionState.Session is declared as public required CopilotSession Session { get; set; } (line 612) — a non-nullable reference type under #nullable enable. Every other null assignment in the codebase (12+ instances across Bridge.cs:512, Persistence.cs:847/879, Providers.cs:129/382, CopilotService.cs:2465/2547/2723/2738/2871/3169) uses null!.
This line uses bare null, producing CS8625. If TreatWarningsAsErrors is ever enabled, this becomes a build failure. It also corrupts the nullable flow analysis — the compiler may believe state.Session is non-null after this line.
Fix: state.Session = null!;
| // Pre-check: if events.jsonl ends with session.shutdown, the server killed this | ||
| // session but our event stream was dead so we never received the notification. | ||
| // Force a reconnect NOW instead of sending to a dead session and discovering the | ||
| // failure 10+ minutes later via the watchdog. (Issue #397) | ||
| try | ||
| { | ||
| var shutdownCheckSid = state.Info.SessionId; | ||
| if (!string.IsNullOrEmpty(shutdownCheckSid)) | ||
| { | ||
| var eventsPath = Path.Combine(SessionStatePath, shutdownCheckSid, "events.jsonl"); | ||
| var lastEvent = GetLastEventType(eventsPath); | ||
| if (lastEvent == "session.shutdown") | ||
| { | ||
| Debug($"[SEND-SHUTDOWN-PRECHECK] '{sessionName}' events.jsonl ends with session.shutdown — forcing reconnect before send"); | ||
| try { await state.Session.DisposeAsync(); } catch { /* session may already be disposed */ } | ||
| state.Session = null; | ||
| await EnsureSessionConnectedAsync(sessionName, state, cancellationToken); |
There was a problem hiding this comment.
🟡 MODERATE — Spurious double-reconnect on first-send-after-restart (3/3 reviewers)
When state.Session == null (cold start / app restart), the lazy-resume block at line 3498 fires first → EnsureSessionConnectedAsync → session is now connected. But events.jsonl still contains session.shutdown from before the restart because the server hasn't flushed a new event to disk yet.
The pre-check then reads the stale file, sees session.shutdown, disposes the just-successfully-resumed session, nulls state.Session, and calls EnsureSessionConnectedAsync again — a full second reconnect cycle (~3-5s latency, two server RPCs).
Concrete scenario:
- Server idle-kills session → writes
session.shutdowntoevents.jsonl - User restarts PolyPilot, sends prompt
- Lazy-resume (line 3498) →
EnsureSessionConnectedAsync→state.Session = sessionA✓ - Pre-check → reads stale
events.jsonl→ still seessession.shutdown - Disposes
sessionA, reconnects tosessionB— unnecessary double-reconnect
In multi-agent contexts where several workers send concurrently after restart, this doubles CLI server traffic for the entire team.
Fix: Track whether the lazy-resume block ran and skip the pre-check:
bool justLazyResumed = false;
if (state.Session == null)
{
justLazyResumed = true;
try { await EnsureSessionConnectedAsync(sessionName, state, cancellationToken); }
catch { Interlocked.Exchange(ref state.SendingFlag, 0); throw; }
}
if (!justLazyResumed)
{
// shutdown pre-check ...
}
🔄 Review-Fix Loop — Round 2 of 3Findings addressed: 5/5 (2 moderate, 3 minor)
Findings skipped: 0 Tests: All 3647 unit tests pass ✅ Next round: Expert review re-dispatched for round 3. Warning
|
Cross-Platform Verification — PR #802Build Results
✅ All platforms verified Previous Review HistoryFound 11 automated review(s) on this PR. Build verification validates that all review-driven fixes compile and pass tests across platforms. Triggered by: verify-build run |
Expert Code Review — PR #802Methodology: 3 independent reviewers with adversarial consensus Findings: 5 posted as inline comments (3 moderate, 2 minor)
What's Good
Discarded Findings
CI & Test Coverage
|
There was a problem hiding this comment.
Code Review — Shutdown Pre-Check (Issue #397)
Reviewed full source context around SendPromptAsync, EnsureSessionConnectedAsync, GetLastEventType, the SendingFlag lifecycle, and all callers. Validated 5 prior findings, found 1 new.
🟡 MODERATE — OperationCanceledException wrapping (Prior #2, confirmed)
The catch block wraps ALL exceptions in InvalidOperationException, including OperationCanceledException from cancellationToken during EnsureSessionConnectedAsync. The codebase convention (lines 1226, 1261, 1289, 1494) is catch (OperationCanceledException) { throw; } before the general catch. This causes a misleading "shut down by the server" error in the UI when the actual cause is user cancellation.
Fix: Add before the general catch:
catch (OperationCanceledException)
{
Interlocked.Exchange(ref state.SendingFlag, 0);
throw;
}🟡 MODERATE — Tests don't cover actual behavior (Prior #3, confirmed)
All 8 tests exercise GetLastEventType detection, which already has 7 behavioral tests in TurnEndFallbackTests.cs (lines 371–462). Zero tests verify the actual SendPromptAsync changes: session disposal, reconnect call, SendingFlag cleanup on failure, or error propagation.
🟢 MINOR — Double-reconnect on first-send-after-restart (Prior #1, downgraded)
After restart, the lazy-resume block runs first. If the server rejected the dead session, EnsureSessionConnectedAsync fresh-creates with a new session ID → pre-check reads the new session's events.jsonl → no shutdown → no double reconnect. Only triggers if ResumeSessionAsync succeeds despite shutdown AND the session.resume event hasn't flushed — a narrow race with no harmful side effect.
🟢 MINOR — Fragile structural test window (Prior #5, confirmed)
8000→10000 will need bumping on the next SendPromptAsync change.
🟢 MINOR — Error message (Prior #4, confirmed) — acceptable as-is.
🟢 MINOR (NEW) — state.Session = null on partial failure
If DisposeAsync succeeds but EnsureSessionConnectedAsync throws, the session is left without an SDK handle. Self-healing: next send enters the lazy-resume block and reconnects. No data loss.
Generated by Expert Code Review · ● 12.6M
| if (lastEvent == "session.shutdown") | ||
| { | ||
| Debug($"[SEND-SHUTDOWN-PRECHECK] '{sessionName}' events.jsonl ends with session.shutdown — forcing reconnect before send"); | ||
| try { await state.Session.DisposeAsync(); } catch { /* session may already be disposed */ } | ||
| state.Session = null; | ||
| await EnsureSessionConnectedAsync(sessionName, state, cancellationToken); |
There was a problem hiding this comment.
🟡 MODERATE — Spurious double-reconnect on first-send-after-restart (3/3 reviewers)
When the app restarts with a shutdown session, the lazy-resume block (above this code) fires first because state.Session == null. If ResumeSessionAsync succeeds, state.Info.SessionId stays the same (pointing at the old session directory). Then this pre-check reads the same stale events.jsonl (still ending in session.shutdown), disposes the just-resumed session, and calls EnsureSessionConnectedAsync a second time — creating an orphaned server-side session.
Fix: Skip the pre-check when the lazy-resume block just ran successfully (e.g., set a local bool justResumed = false; before the resume block, set it to true on success, and guard this check with if (!justResumed)).
| Assert.True(sendIdx >= 0, "SendPromptAsync must exist in CopilotService.cs"); | ||
|
|
||
| var sendBlock = source.Substring(sendIdx, Math.Min(8000, source.Length - sendIdx)); | ||
| var sendBlock = source.Substring(sendIdx, Math.Min(10000, source.Length - sendIdx)); |
There was a problem hiding this comment.
🟢 MINOR — Fragile structural test window (3/3 reviewers)
Bumping 8000→10000 is a band-aid — the next feature addition to SendPromptAsync will push PrematureIdleSignal.Reset() beyond this window again. SendPromptAsync is ~450 lines and growing.
Fix: Use source.IndexOf("PrematureIdleSignal.Reset()", sendIdx, StringComparison.Ordinal) to search from the method start without a fixed window, or extract the full method body and assert on that.
| throw new InvalidOperationException( | ||
| $"Session '{sessionName}' was shut down by the server and reconnection failed. Try creating a new session.", ex); |
There was a problem hiding this comment.
🟢 MINOR — Misleading error message (3/3 reviewers)
This message always says "was shut down by the server" regardless of the actual reconnect failure cause (auth, network, timeout, cancellation). EnsureSessionConnectedAsync already tries to create a new session internally (the CreateSessionAsync fallback), so suggesting "Try creating a new session" is unhelpful when the underlying issue is connectivity or auth.
Fix: Use a more contextual message, e.g.: $"Session '{sessionName}' reconnection failed: {ex.Message}. Check connectivity or try again."
| } | ||
| } | ||
| } | ||
| catch (Exception ex) |
There was a problem hiding this comment.
🟡 MODERATE — OperationCanceledException wrapping loses cancellation identity (3/3 reviewers)
EnsureSessionConnectedAsync can throw OperationCanceledException when the cancellation token fires (e.g., user navigates away, reconnect race). The broad catch (Exception ex) wraps it in InvalidOperationException, so callers can no longer distinguish cancellation from failure — Task.IsCanceled won't be set, retry logic treats it as a hard error, and the user sees a misleading "shut down by the server" message instead of silent cancellation.
The codebase consistently uses catch (OperationCanceledException) { throw; } before broad catches (11+ sites per prior review).
Fix: Add before this catch:
catch (OperationCanceledException)
{
Interlocked.Exchange(ref state.SendingFlag, 0);
throw;
}| // 3. Check GetLastEventType | ||
|
|
||
| var svc = CreateService(); | ||
| var baseDir = TestSetup.TestBaseDir; | ||
| var sessionStatePath = Path.Combine(baseDir, "session-state"); | ||
| var sessionId = Guid.NewGuid().ToString(); | ||
| var sessionDir = Path.Combine(sessionStatePath, sessionId); |
There was a problem hiding this comment.
🟡 MODERATE — Tests cover only GetLastEventType, not the actual behavior change (3/3 reviewers)
The "behavioral" tests below this comment just call GetLastEventType and check a boolean — they don't exercise SendPromptAsync's dispose→reconnect→send flow. As the comment admits: "We can't call SendPromptAsync directly" — but CopilotService is already instantiated via CreateService() in this class, and using ConnectionMode.Demo with the existing stubs could test:
- That
SendingFlagis released on reconnect failure - That
OperationCanceledExceptionpropagates correctly - That a normal send succeeds after pre-check reconnect
Without these, the critical error-path behavior (flag release, exception propagation) has zero automated coverage.
🔄 Review-Fix Loop — Round 3 of 3Findings addressed: 5/5 (3 moderate, 2 minor)
Tests: All 3648 passed (3644 existing + 4 new) Warning
|
Summary
Before sending a prompt,
SendPromptAsyncnow checks ifevents.jsonlends withsession.shutdown. If so, it forces a reconnect before sending instead of sending to a dead session and discovering the failure 10+ minutes later via the watchdog.Problem
When a user sends a prompt, PolyPilot may send it to a session the server has already killed. The server wrote
session.shutdowntoevents.jsonlbut the client event stream was dead, so PolyPilot never received it. The prompt appears to start but then the new session also gets a dead event stream and eventually times out.Fix
Added a pre-check in
SendPromptAsync(inCopilotService.cs) after the lazy-resume block and before the main send logic:events.jsonlviaGetLastEventType()(tail-read, last 4KB — minimal overhead)session.shutdown, disposes the dead session and callsEnsureSessionConnectedAsyncto force a reconnect[SEND-SHUTDOWN-PRECHECK]diagnostic entry for observabilitySendingFlagand throws a clear error messageChanges
PolyPilot/Services/CopilotService.cs— Added shutdown pre-check block inSendPromptAsyncPolyPilot.Tests/ShutdownPreCheckTests.cs— 8 unit tests coveringGetLastEventTypedetection and the shutdown pre-check logicPolyPilot.Tests/MultiAgentRegressionTests.cs— Increased structural test search window (8000→10000 chars) to accommodate the new codePolyPilot.IntegrationTests/ShutdownPreCheckTests.cs— Integration tests for dashboard accessibilityTest Results
All 3644 unit tests pass. Integration tests build successfully.
Fixes #397
Warning
The following domain was blocked by the firewall during workflow execution:
192.0.2.1To allow these domains, add them to the
network.allowedlist in your workflow frontmatter:See Network Configuration for more information.
Note
🔒 Integrity filter blocked 1 item
The following item were blocked because they don't meet the GitHub integrity level.
search_pull_requests: has lower integrity than agent requires. The agent cannot read data with integrity below "approved".To allow these resources, lower
min-integrityin your GitHub frontmatter: