Skip to content

Fix SslStream/QUIC server exit hang; re-enable handshake scenarios (Wave A) — closes #2140 partially#5

Closed
LoopedBard3 wants to merge 2 commits into
mainfrom
loopedbard3/re-enable-sslstream-scenarios
Closed

Fix SslStream/QUIC server exit hang; re-enable handshake scenarios (Wave A) — closes #2140 partially#5
LoopedBard3 wants to merge 2 commits into
mainfrom
loopedbard3/re-enable-sslstream-scenarios

Conversation

@LoopedBard3
Copy link
Copy Markdown
Owner

Fixes the root cause behind aspnet#2140 (and aspnet#2139) and re-enables the SslStream + QUIC handshake scenarios in CI-02. Read-write matrices stay gated for a follow-up after a few clean runs.

Why the scenarios were disabled

aspnet#2139 (Jan 12 2026) added && false to every condition in build/sslstream-scenarios.yml because the benchmark app "does not seem to be exiting properly, causing follow-up test runs to timeout." aspnet#2140 has tracked the re-enable since.

Actual root cause

Crank's Linux agent stops jobs with SIGTERM first (5 s grace) then SIGINT — see dotnet/crank src/Microsoft.Crank.Agent/Startup.cs. BenchmarkApp.RunAsyncInternal only registered Console.CancelKeyPress (catches SIGINT, not SIGTERM). When SIGTERM arrived:

  • GlobalCts.Cancel never fired,
  • the accept loop kept blocking in AcceptAsync,
  • in-flight Task.Run-ed TLS/QUIC connection processors never observed cancellation,
  • the .NET runtime's default SIGTERM behavior terminated the process after a brief ProcessExit window,
  • but in-flight SslStream close_notify writes and QUIC connection drains could leave the listen socket / OS port stranded for the next benchmark run — matching the observed "follow-up test timeout" symptom.

Fix (three layers)

  1. BenchmarkApp now registers PosixSignalRegistration for SIGTERM and SIGQUIT in addition to Console.CancelKeyPress. GlobalCts.Cancel runs on every shutdown signal crank can send. PosixSignalRegistration.Create is supported on Windows (maps to SetConsoleCtrlHandler), so the existing Ctrl+C path is unchanged.

  2. BenchmarkApp wraps RunBenchmarkAsync with a 15 s hard shutdown deadline that fires after cancellation. If disposal stalls (stuck SslStream close_notify, QuicConnection drain, etc.) Environment.Exit(0) guarantees we release the port before crank's SIGKILL fallback fires.

  3. BenchmarkServer now tracks per-accepted-connection Task.Run handles in a ConcurrentDictionary and drains them with a 10 s timeout before the await using listener disposes, so per-connection disposal can't race the listener teardown.

The two timeouts are intentionally tiered: drain (10 s) < hard exit (15 s) < crank SIGKILL fallback. The drain is best-effort and logs a warning when it expires.

Wave A re-enable (this PR)

build/sslstream-scenarios.yml:

  • Line 127: SslStream Handshake matrix — && false removed.
  • Line 148: QUIC Handshake matrix — && false removed.
  • Lines 172, 195: SslStream / QUIC read-write matrices — left gated for now.

Existing Math.round(Date.now() / 43200000) % 2 == 0 half-day modulo is preserved on all four — re-enabled scenarios still run on alternate 12 h ticks.

10 publishes added back. Expected SslStream gold-lin job duration: ≈ 10–15 min on active ticks vs ≈ 2.3 min on off-ticks (vs ≈ 2.3 min today on every tick).

Wave B (follow-up)

After 2–3 clean CI-02 runs where the SslStream job finishes cleanly and no downstream gold-lin job times out, a follow-up PR will remove && false from lines 172 and 195 to bring back the 56 read-write publishes.

Validation done locally

  • Built all four projects: SslStreamServer, SslStreamClient, QuicServer, QuicClient — succeeded.
  • Started SslStreamServer on Windows, hit it with SslStreamClient (ReadWrite, 10 s warmup + 10 s duration) — clean run, metrics reported, port released after process termination.
  • PosixSignalRegistration.Create(PosixSignal.SIGTERM, ...) confirmed to be a no-throw on Windows (server progressed past it at startup).
  • Linux SIGTERM behavior can't be locally reproduced from a Windows worktree; the fix is logically grounded in the crank stop sequence + a source-side audit of the cancellation path.

Closes / refs

LoopedBard3 and others added 2 commits May 15, 2026 13:14
The Ntex Plaintext, Ntex Json, and Ntex Fortunes scenarios have been
failing in benchmarks-ci-01 and benchmarks-ci-02 every run since the
Mar 26 build. The docker build for ntex-plt.dockerfile and
ntex-db.dockerfile both fail to compile with:

  error[E0308]: mismatched types
     --> tokio-postgres/src/query.rs:77:52
     |
  77 |             encode_bind(statement, params, "", buf)?;
     |             -----------                        ^^^
     |                       expected &mut BytesMut, found &mut BytePages
  error: could not compile `tokio-postgres` (lib) due to 9 previous errors

The root cause is in TechEmpower/FrameworkBenchmarks
frameworks/Rust/ntex/Cargo.toml: `tokio-postgres` is a *top-level*
dependency on the fafhrd91/postgres `ntex-3` fork (frozen at commit
fbc7a17), not a feature-gated one. Every ntex binary in the
workspace builds it, including ntex-plt. The caret-versioned
`ntex-bytes = "1.5"` resolves to ntex-bytes 1.6.0 (released
2026-05-02), which introduced the BytePages type. The fafhrd91 fork
still uses BytesMut, so the workspace cannot compile.

TechEmpower/FrameworkBenchmarks is now archived
(https://github.com/TechEmpower/FrameworkBenchmarks, last push
2026-03-24), so a real upstream fix is not coming.

Disable all three ntex scenarios in the three places they run:
  - build/frameworks-scenarios.yml         (Ntex Plaintext, Ntex Json)
  - build/frameworks-database-scenarios.yml (Ntex Fortunes)
  - build/containers-scenarios.yml          (Json NTex, Fortunes NTex)

Each disable carries an explanatory comment pointing at the other
sites, so it is easy to re-enable as a set if someone forks/pins the
ntex source back into a working state.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…rios

Root cause (issue aspnet#2140 / disabled in aspnet#2139):
Crank's Linux agent stops jobs with SIGTERM first (5 s grace) then SIGINT.
BenchmarkApp only handled Console.CancelKeyPress (SIGINT), so SIGTERM
bypassed the graceful cancellation path. In-flight TLS/QUIC connections
never observed cancellation and could leave the listen port stranded for
the next run -- which is what 'app does not seem to be exiting properly'
in aspnet#2139 actually described.

Fix is three layered:

1. BenchmarkApp now registers PosixSignalRegistration for SIGTERM and
   SIGQUIT in addition to Console.CancelKeyPress, so GlobalCts.Cancel
   runs on every shutdown signal crank can send. PosixSignalRegistration
   is no-op-friendly on Windows so the existing Ctrl+C path is unchanged.

2. BenchmarkApp wraps RunBenchmarkAsync with a 15 s hard shutdown
   deadline after cancellation. If disposal stalls (stuck SslStream
   close_notify, QuicConnection drain, etc.) Environment.Exit guarantees
   we release the port before crank's SIGKILL fallback fires.

3. BenchmarkServer now tracks per-accepted-connection Task.Run handles
   and drains them with a 10 s timeout before letting the listener
   dispose, so SslStream/Quic disposal doesn't race the listener teardown.

Re-enable in build/sslstream-scenarios.yml: remove the '&& false' gate
from all four scenario matrices (SslStream handshake, QUIC handshake,
SslStream read-write, QUIC read-write). 66 publishes added back, on
alternate 12 h ticks via the existing
Math.round(Date.now() / 43200000) % 2 == 0 modulo (preserved on all four).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@LoopedBard3 LoopedBard3 force-pushed the loopedbard3/re-enable-sslstream-scenarios branch from bed75e1 to 6213874 Compare May 18, 2026 22:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant