MinGW + SSL CI and investigation by stephenberry · Pull Request #2448 · stephenberry/glaze

stephenberry · 2026-04-08T02:17:50Z

Summary

Adds a MinGW + SSL CI workflow and documents an intermittent heap corruption (0xc0000374) on MinGW/GCC + Windows when OpenSSL is linked.

Root cause: OpenSSL + MinGW runtime interaction bug (not a Glaze bug). The crash occurs purely from linking OpenSSL libraries — even with GLZ_ENABLE_SSL undefined and no SSL code compiled. MSVC builds are unaffected.

See full investigation writeup for details.
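When reading CI logs, note that Windows shells and IDEs often print the crash status as a signed decimal exit code rather than the hexadecimal form used in this summary. The helper below is a minimal sketch of that conversion. The function names are illustrative only and are not part of Glaze; the two status names are the standard NTSTATUS names for these values.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <string>

// Reinterpret a signed 32-bit process exit code as an unsigned NTSTATUS value.
constexpr std::uint32_t to_ntstatus(std::int32_t exit_code) {
    return static_cast<std::uint32_t>(exit_code);
}

// Name the two status codes that appear in this thread.
inline const char* describe_ntstatus(std::uint32_t status) {
    switch (status) {
        case 0xC0000005u: return "STATUS_ACCESS_VIOLATION";
        case 0xC0000374u: return "STATUS_HEAP_CORRUPTION";
        default:          return "other";
    }
}

// Format an exit code the way it is written in the summary, e.g. 0xC0000374.
inline std::string hex_status(std::int32_t exit_code) {
    char buf[11];  // "0x" + 8 hex digits + terminating NUL
    std::snprintf(buf, sizeof buf, "0x%08X",
                  static_cast<unsigned>(to_ntstatus(exit_code)));
    return buf;
}
```

For example, `hex_status(-1073740940)` yields the heap corruption code quoted above, while an exit code of -1073741819 maps to the access violation status instead. Both can plausibly result from the same underlying memory corruption, so a log showing either one is worth treating as a reproduction.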

Changes

- msys2-ssl.yml: New CI workflow for MinGW + SSL testing (continue-on-error: true)
- http_server.hpp: Move ssl_context to conditional base class (ssl_context_holder) so http_server<false> has no OpenSSL members
- http_client.hpp: Add io_context drain in stop_workers() (good practice for clean shutdown)
- tests/CMakeLists.txt: Add ws2_32/mswsock Winsock linking for MinGW
- mingw_ssl_diag/: Diagnostic test suite and minimal reproducer
- Documentation: Full writeup of the investigation and findings

Investigation summary

| Hypothesis | Result |
| --- | --- |
| `GLZ_ENABLE_SSL` macro / template changes | Crashes even without the macro |
| DLL boundary / CRT heap mismatch | Crashes with static OpenSSL too |
| GCC optimizer bug | Crashes with `-O0` |
| Pending ASIO handlers during io_context destruction | Drain didn't fix it |
| OpenSSL TLS cleanup on thread exit | `OPENSSL_thread_stop()` didn't fix it |
| Merely linking OpenSSL | Confirmed root cause |

Test plan

- All existing CI workflows pass (gcc, clang, msvc, standalone-asio, boost-asio, etc.)
- MinGW SSL workflow runs with continue-on-error: true (known intermittent failure)
- http_server<true> (TLS servers) still compile and work via ssl_context_holder base class

packit-as-a-service · 2026-04-08T14:30:40Z

One of the tests failed for 0ad8abc. @admin check logs None, packit dashboard https://dashboard.packit.dev/jobs/srpm/581076 and external service dashboard https://copr.fedorainfracloud.org/coprs/build/10301902/

packit-as-a-service · 2026-04-08T15:08:38Z

One of the tests failed for b08b8ba. @admin check logs None, packit dashboard https://dashboard.packit.dev/jobs/srpm/581122 and external service dashboard https://copr.fedorainfracloud.org/coprs/build/10302067/

packit-as-a-service · 2026-04-08T15:25:15Z

One of the tests failed for bb44ab5. @admin check logs None, packit dashboard https://dashboard.packit.dev/jobs/srpm/581154 and external service dashboard https://copr.fedorainfracloud.org/coprs/build/10302183/

packit-as-a-service · 2026-04-08T15:59:37Z

One of the tests failed for 894cdf5. @admin check logs https://download.copr.fedorainfracloud.org/results/packit/stephenberry-glaze-2448/srpm-builds/10302242/builder-live.log, packit dashboard https://dashboard.packit.dev/jobs/srpm/581176 and external service dashboard https://copr.fedorainfracloud.org/coprs/build/10302242/

packit-as-a-service · 2026-04-08T16:53:58Z

One of the tests failed for 85a3f55. @admin check logs https://download.copr.fedorainfracloud.org/results/packit/stephenberry-glaze-2448/fedora-rawhide-aarch64/10302317-glaze/builder-live.log, packit dashboard https://dashboard.packit.dev/jobs/copr/3451123 and external service dashboard https://copr.fedorainfracloud.org/coprs/build/10302317/

packit-as-a-service · 2026-04-08T16:53:59Z

One of the tests failed for 85a3f55. @admin check logs None, packit dashboard https://dashboard.packit.dev/jobs/copr/3451122 and external service dashboard https://copr.fedorainfracloud.org/coprs/build/10302317/

Three parallel CI jobs:
- ssl-enabled: GLZ_ENABLE_SSL + dynamic OpenSSL (baseline, crashes)
- link-only: OpenSSL DLLs linked, GLZ_ENABLE_SSL undefined (tests DLL loading)
- static-ssl: GLZ_ENABLE_SSL + static OpenSSL (tests DLL boundary issue)

Temporarily remove the ssl_context data member to determine if its presence in http_server<false> causes the MinGW heap corruption. The member's unique_ptr<asio::ssl::context> destructor instantiates OpenSSL cleanup code even when the pointer is null, which may be the trigger. All usages are inside if constexpr (EnableTLS) so http_server<false> compiles without it.

Use ssl_context_holder<EnableTLS> base class so that ssl_context only exists as a data member when EnableTLS=true. This avoids instantiating unique_ptr<asio::ssl::context> destructor for http_server<false>. All references use this->ssl_context to defer name lookup past GCC 15's -Wtemplate-body checking of discarded if constexpr branches.
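The conditional base class described above can be sketched as follows. This is a minimal illustration, not the code in http_server.hpp: `fake_ssl_context` is a hypothetical stand-in for asio::ssl::context so the example compiles without OpenSSL, and `server_sketch` is a hypothetical stand-in for http_server.

```cpp
#include <cassert>
#include <memory>
#include <type_traits>

// Hypothetical stand-in for asio::ssl::context (keeps the sketch OpenSSL-free).
struct fake_ssl_context { int verify_depth = 0; };

// Primary template: the non-TLS holder is empty, so no unique_ptr member
// (and therefore no deleter for the context type) exists in that case.
template <bool EnableTLS>
struct ssl_context_holder {};

// TLS specialization: only here does the data member exist.
template <>
struct ssl_context_holder<true> {
    std::unique_ptr<fake_ssl_context> ssl_context;
};

// Hypothetical stand-in for http_server<EnableTLS>.
template <bool EnableTLS>
struct server_sketch : ssl_context_holder<EnableTLS> {
    // Returns true when a TLS context was created.
    bool init() {
        if constexpr (EnableTLS) {
            // this-> makes ssl_context a dependent name, so lookup is
            // deferred until the template is instantiated.
            this->ssl_context = std::make_unique<fake_ssl_context>();
            return this->ssl_context != nullptr;
        } else {
            return false;
        }
    }
};

static_assert(std::is_empty_v<ssl_context_holder<false>>);
static_assert(!std::is_empty_v<ssl_context_holder<true>>);
```

Writing `ssl_context` without `this->` would ask the compiler to resolve the name while parsing the template, where the dependent base class is not searched; the member is only found once `this->` defers lookup to instantiation.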

The previous -UGLZ_ENABLE_SSL approach didn't reliably override the interface target's -D flag, so the link-only test was still compiled with SSL code paths. Now the link-only job configures with glaze_ENABLE_SSL=OFF so the glaze headers have no SSL code at all.

Root cause identified: heap corruption occurs on MinGW/GCC + Windows purely from linking OpenSSL, even with GLZ_ENABLE_SSL undefined and no SSL code compiled. This is an OpenSSL + MinGW runtime interaction bug, not a Glaze issue.

Changes:
- Document findings in docs/networking/mingw-ssl-heap-corruption.md
- Simplify msys2-ssl.yml CI workflow with continue-on-error: true
- Move ssl_context to conditional base class (ssl_context_holder) so http_server<false> has no OpenSSL members (cleaner architecture)
- Add io_context drain in http_client::stop_workers() (good practice)
- Restore http_client_test.cpp to clean state (remove debug traces)
- Keep diagnostic tests for tracking upstream fixes
- Add ws2_32/mswsock Winsock linking for MinGW in tests/CMakeLists.txt

GTruf · 2026-04-13T00:36:43Z

@stephenberry, hello.
I don't know if you did this investigation on your own or if you have a whole team, but that's impressive... Could you please tell me if there’s any resolution regarding the issue you found: have you fixed it and all the tests are passing, or is additional testing needed (Status: Root cause identified, fix applied)? And, by the way, since we’re discussing all this in the context of WebSocket implementation, why are you using shared_ptr and weak_ptr patterns everywhere there? Why not unique_ptr or something else? Won’t there be a performance hit from the use of shared_ptr everywhere?

And I have a question about the Correct usage pattern: isn't it possible that new handlers might appear after a restart (clear the stopped flag)? I understand the basic idea: we stop the io_context, wait for the workers to finish, then restart and complete the work of any other handlers, if there are any. But couldn’t more handlers be added at that point?

stephenberry · 2026-04-13T12:52:17Z

@GTruf, this was me pounding away at the problem using AI. The PR description was out of date and I just updated it.

What is extremely relevant is that merely linking OpenSSL on MinGW without compiling any SSL code causes crashes. So, the primary issue isn't a Glaze bug.

I'm still digging through the problem, but this looks like a major concern with MinGW. I'd be surprised if it was OpenSSL's fault.

stephenberry · 2026-04-13T12:54:21Z

@GTruf, take a look at this writeup: https://github.com/stephenberry/glaze/blob/msys2-mingw-ssl/docs/networking/mingw-ssl-heap-corruption.md

GTruf · 2026-04-13T13:05:31Z

@stephenberry, I’ve also done a lot of research on the glaze and asio code using AI, and for the most part, all the major AI models produced results that were more or less the same as what’s in your notes.

Yes, initially in the PR about WebSocket, I suggested that the problem might be at the OpenSSL level, but then I deleted or edited that comment. Basically, yes, the main assumption is that the crash occurs at the OpenSSL level, built in MSYS2 using MinGW. I don’t know exactly what’s crashing. I think the problem will be easy to find if we add logging at the OpenSSL level; we’ll just have to rebuild the library for each new log entry, but that’s not a big deal. Do you have time for this, or do you need help? Everything is set up on my end; I can test it soon. And we could also use GDB to see what’s going on there. It also annoys me that there are no sanitizers for MinGW, that makes it very hard to debug this problem...

Also, if you could reply in this PR thread about shared_ptr, that would be great, or I can create a separate issue for that discussion.

stephenberry · 2026-04-13T13:17:58Z

@GTruf, as much as you're able to help, I would appreciate it, because my time is limited right now. But, I'd also like to understand what is going on. I'm going to build a minimal reproduction of the issue.

> And, by the way, since we’re discussing all this in the context of WebSocket implementation, why are you using shared_ptr and weak_ptr patterns everywhere there? Why not unique_ptr or something else? Won’t there be a performance hit from the use of shared_ptr everywhere?

As for use of shared_ptr, it is extremely helpful for asynchronous calls, where we might want the data or connection to outlive the call. The shared/weak pairing also works well for connections that might be dropped, as we can use the weak_ptr to query if the connection is live and we therefore pair lifetimes with utility.

However, there is a lot of higher-level complexity due to dealing with the asio architecture, and after trying to optimize I've realized that in the long run we probably want to drop asio's event loop logic. This is a much deeper discussion to have, but it would also allow the architecture to be cleaner.

shared_ptr isn't really a performance concern unless constantly being constructed and destroyed.
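A minimal sketch of the shared/weak pairing described above, using only the standard library. The `connection` and `registry` types are hypothetical and are not Glaze's WebSocket classes; the handler is a plain std::function standing in for an asynchronous completion callback.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical connection type; must be owned by a shared_ptr.
struct connection : std::enable_shared_from_this<connection> {
    std::string last_message;

    // Build a deferred completion handler. Capturing a shared_ptr keeps the
    // connection alive until the handler has run and been destroyed.
    std::function<void(std::string)> make_read_handler() {
        auto self = shared_from_this();
        return [self](std::string data) { self->last_message = std::move(data); };
    }
};

// Hypothetical registry that observes connections without owning them.
struct registry {
    std::vector<std::weak_ptr<connection>> peers;

    // Count the connections that are still alive.
    std::size_t live_count() const {
        std::size_t n = 0;
        for (const auto& peer : peers) {
            if (!peer.expired()) ++n;
        }
        return n;
    }
};
```

In this sketch the owner can drop its shared_ptr while a handler is still pending: the connection survives until the handler is destroyed, and the registry's weak_ptr then reports it as gone. The reference count is only touched when a pointer is copied or destroyed, which matches the point above that the cost matters mainly when pointers are created and released constantly.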

stephenberry · 2026-04-13T13:25:30Z

> And I have a question about the Correct usage pattern: isn't it possible that new handlers might appear after a restart (clear the stopped flag)? I understand the basic idea: we stop the io_context, wait for the workers to finish, then restart and complete the work of any other handlers, if there are any. But couldn’t more handlers be added at that point?

AI explanation:

After restart() clears the stopped flag, new handlers could theoretically be dispatched. But in this sequence it's safe:

io_context->stop();       // 1. Prevent new handler dispatch
join_worker_threads();    // 2. All threads exit run(), no code left to post handlers
io_context->restart();    // 3. Clear stopped flag
io_context->poll();       // 4. Drain remaining queued handlers on THIS thread

After step 2, no other threads are alive to post work. The only handlers in the queue are ones that were posted before stop() but never dispatched. poll() runs them synchronously on the calling thread.

If a drained handler itself posts new work (e.g., a completion callback that calls async_read), poll() picks that up too, which is actually desirable, since it lets cleanup chains complete fully. Once the queue is empty and no new work is posted, poll() returns.

That said, this drain pattern is not currently applied on main; it's proposed in this PR as a preventive improvement for http_client::stop_workers(). The MinGW + OpenSSL heap corruption turned out to be an upstream runtime issue unrelated to handler draining (the crash occurs even when no handlers are pending). The drain is still good practice for clean ASIO shutdown.

GTruf · 2026-04-13T13:36:12Z

@stephenberry, By the way, I’m comparing the performance of the libwebsockets and glaze libraries when working with WebSockets. For libwebsockets, I’m also using OpenSSL compiled under MinGW in MSYS2 (basically the same binary as for glaze), and everything works fine there, without any crash.

GTruf · 2026-04-14T00:09:20Z

@stephenberry, I tested OpenSSL 3.0.8, 3.1.8, 3.6.2, and 4.0.0-beta-1. All of them crash, though the issue occurs much less frequently in earlier versions. Using AI, I discovered that it is possible to set certain callback functions that are called at specific points during event handling in OpenSSL. I defined the callbacks at the very top of websocket_client.hpp and set them up immediately after allocating memory for ssl_socket_. What’s characteristic is that for all versions, in the event of a crash, a handshake occurs first, followed by two CCS packets of 5 and 1 bytes, respectively, after which the crash occurs.

// AI-based
#include <iostream>
#include <openssl/ssl.h>
void ssl_info_callback(const SSL* ssl, int where, int ret)
{
    const char* str = "undefined";
    int w = where & ~SSL_ST_MASK;

    if (w & SSL_ST_CONNECT)      str = "SSL_connect";
    else if (w & SSL_ST_ACCEPT)  str = "SSL_accept";

    if (where & SSL_CB_LOOP) {
        std::cout << str << ": " << SSL_state_string_long(ssl) << "\n" << std::flush;
    }
    else if (where & SSL_CB_ALERT) {
        const char* dir = (where & SSL_CB_READ) ? "received" : "sent";
        std::cout << "Alert " << dir << ": "
                  << SSL_alert_type_string_long(ret) << " : "
                  << SSL_alert_desc_string_long(ret) << "\n";
    }
    else if (where & SSL_CB_EXIT) {
        if (ret == 0)
            std::cout << str << ": failed in " << SSL_state_string_long(ssl) << "\n" << std::flush;
        else if (ret < 0)
            std::cout << str << ": error in " << SSL_state_string_long(ssl) << "\n" << std::flush;
    }
    if (where & SSL_CB_HANDSHAKE_DONE)
        std::cout << "=== HANDSHAKE DONE ===\n" << std::flush << "VERSION: " << OpenSSL_version(OPENSSL_VERSION_STRING) << std::endl;
}

// AI-based
void msg_callback(int write_p, int version, int content_type,
                  const void* buf, size_t len, SSL* ssl, void* arg)
{
    const char* dir = write_p ? "sent" : "received";
    const char* type = (content_type == SSL3_RT_HANDSHAKE) ? "handshake" :
                       (content_type == SSL3_RT_ALERT) ? "alert" : "ccs";

    std::cout << dir << " " << type << " (" << len << " bytes)\n" << std::flush;
}

...
// Inside `void connect(std::string_view url_str)`
ssl_socket_ = std::make_shared<asio::ssl::stream<asio::ip::tcp::socket>>(*ctx, *ssl_ctx_); // <--- glaze code

// Added code for logging via callbacks
SSL* ssl = ssl_socket_->native_handle();
SSL_set_info_callback(ssl, ssl_info_callback);
SSL_set_msg_callback(ssl, msg_callback);

Output:

SSL_connect: before SSL initialization
sent ccs (5 bytes)
sent handshake (1539 bytes)
SSL_connect: SSLv3/TLS write client hello
SSL_connect: error in SSLv3/TLS write client hello
SSL_connect: error in SSLv3/TLS write client hello
received ccs (5 bytes)
SSL_connect: SSLv3/TLS write client hello
received handshake (1210 bytes)
received ccs (5 bytes)
received ccs (1 bytes)
received ccs (5 bytes)
SSL_connect: error in SSLv3/TLS read server hello
received ccs (1 bytes)
SSL_connect: SSLv3/TLS read server hello
received handshake (10 bytes)
SSL_connect: TLSv1.3 read encrypted extensions
received handshake (2519 bytes)
SSL_connect: SSLv3/TLS read server certificate
received handshake (78 bytes)
SSL_connect: TLSv1.3 read server certificate verify
received handshake (52 bytes)
SSL_connect: SSLv3/TLS read finished
sent ccs (5 bytes)
sent ccs (1 bytes)
SSL_connect: SSLv3/TLS write change cipher spec
sent ccs (5 bytes)
sent ccs (1 bytes)
sent handshake (52 bytes)
SSL_connect: SSLv3/TLS write finished
=== HANDSHAKE DONE ===
VERSION: 3.6.2
sent ccs (5 bytes)
sent ccs (1 bytes)

Process finished with exit code -1073741819 (0xC0000005)
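One caveat when reading this log: msg_callback labels every content type other than handshake and alert as "ccs". According to the OpenSSL documentation for the message callback, it is also invoked with pseudo content types, including SSL3_RT_HEADER for record headers (a TLS record header is 5 bytes) and SSL3_RT_INNER_CONTENT_TYPE for the single inner content type byte of an encrypted TLS 1.3 record. The 5-byte and 1-byte "ccs" lines may therefore be record bookkeeping rather than real ChangeCipherSpec messages. The sketch below shows a finer-grained label function. The constants are local copies of the values I believe `<openssl/ssl3.h>` defines, so that the example compiles without OpenSSL; real code should include the header and use the macros directly.

```cpp
#include <cassert>
#include <string>

// Local copies of OpenSSL content type values (assumed to match <openssl/ssl3.h>).
namespace rt {
constexpr int change_cipher_spec = 20;     // SSL3_RT_CHANGE_CIPHER_SPEC
constexpr int alert              = 21;     // SSL3_RT_ALERT
constexpr int handshake          = 22;     // SSL3_RT_HANDSHAKE
constexpr int application_data   = 23;     // SSL3_RT_APPLICATION_DATA
constexpr int header             = 0x100;  // SSL3_RT_HEADER (pseudo type)
constexpr int inner_content_type = 0x101;  // SSL3_RT_INNER_CONTENT_TYPE (pseudo type)
}  // namespace rt

// Map the content_type argument of the message callback to a readable label.
inline std::string content_type_label(int content_type) {
    switch (content_type) {
        case rt::change_cipher_spec: return "ccs";
        case rt::alert:              return "alert";
        case rt::handshake:          return "handshake";
        case rt::application_data:   return "application data";
        case rt::header:             return "record header";
        case rt::inner_content_type: return "inner content type";
        default: return "unknown(" + std::to_string(content_type) + ")";
    }
}
```

Swapping this into msg_callback in place of the three-way conditional would show whether the last lines before the crash are record headers, inner content type bytes, or genuine ChangeCipherSpec messages.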

GTruf · 2026-04-14T01:03:32Z

@stephenberry, Overall, I think that building OpenSSL under MSYS2's MinGW is incompatible with using standard MinGW in our case. It might be an incompatibility between UCRT and MSVCRT or something else. But the whole thing is pretty strange; an incompatibility issue like this would clearly have been detected earlier. To reiterate, the libwebsockets code is fully functional, and OpenSSL is sourced from MSYS2. I think there is no crash in the libwebsockets version because the handshake and cleanup occur within the library, which was built in a more consistent environment (or uses slightly different internal calls).

Anyway, if you create any more PRs on this topic, feel free to ping me so I’m at least aware that they exist.

GTruf · 2026-04-14T01:20:10Z

@stephenberry, If we ultimately can't resolve the issue, I suggest failing the compilation using CMake message(FATAL_ERROR ...) and C++ static_assert that checks for the use of glaze with MinGW and OpenSSL, as a temporary solution.
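The C++ half of that suggestion could look roughly like the sketch below. `GLZ_ALLOW_MINGW_SSL` is a hypothetical opt-out macro that does not exist in Glaze, and the `guard_sketch` namespace is illustrative; `__MINGW32__` is the macro MinGW compilers predefine for both 32-bit and 64-bit targets.

```cpp
#include <cassert>

namespace guard_sketch {

#if defined(__MINGW32__)
inline constexpr bool is_mingw = true;
#else
inline constexpr bool is_mingw = false;
#endif

#if defined(GLZ_ENABLE_SSL)
inline constexpr bool ssl_enabled = true;
#else
inline constexpr bool ssl_enabled = false;
#endif

// True for the configuration reported as crashing in this thread.
inline constexpr bool blocked = is_mingw && ssl_enabled;

}  // namespace guard_sketch

// GLZ_ALLOW_MINGW_SSL is a hypothetical escape hatch for users who accept the risk.
#if !defined(GLZ_ALLOW_MINGW_SSL)
static_assert(!guard_sketch::blocked,
              "Glaze with OpenSSL on MinGW is known to crash intermittently; "
              "define GLZ_ALLOW_MINGW_SSL to build anyway.");
#endif
```

A header-level check like this only sees the GLZ_ENABLE_SSL macro. Since the investigation found that merely linking OpenSSL is enough to trigger the crash, the link-only case would have to be caught on the CMake side.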

stephenberry · 2026-04-14T02:26:12Z

@GTruf, thanks for all this analysis. This is probably something that the OpenSSL team would be very interested in, because it could potentially be an attack vector. Do you plan to submit an error report or would you like me to?

GTruf · 2026-04-14T02:42:35Z

@stephenberry, To be honest, I'm not quite sure how to put all this into words for them. If you could send them all of this, that would be great. If you need to run the text by me to make sure we're both confident we're sending them the right report, just ping me (especially if you need code and reproduction steps, starting with the OpenSSL build, etc.).

P.S. I think they’ll need the Glaze code that uses asio so they can see what’s wrong there, and also to point out that everything works fine with libwebsockets.

GTruf · 2026-04-14T02:45:04Z

@stephenberry,

> because it could potentially be an attack vector

I can't really imagine how this could be used for malicious purposes

stephenberry added 6 commits April 7, 2026 21:17

msys2-mingw-ssl error replication test

83b33e9

Update CMakeLists.txt

5a56229

Minimal diagnostic test for MinGW + SSL heap corruption

f63fbff

Update msys2-ssl.yml

73d1100

Update msys2-ssl.yml

0ad8abc

more stress testing

b08b8ba

On job 20 iterations to tease out the issue

bb44ab5

Update msys2-ssl.yml

894cdf5

Update msys2-ssl.yml

85a3f55

Update msys2-ssl.yml

2b71889

stephenberry added 14 commits April 8, 2026 15:31

updates

9ee4791

Update msys2-ssl.yml

51e3846

better diagnostics

a2d70b7

Update msys2-ssl.yml

4682dda

Update msys2-ssl.yml

43a5eae

Update msys2-ssl.yml

4ad07d8

Update http_client_test.cpp

a87c467

Update http_client_test.cpp

499ec41

Update http_client_test.cpp

e036476

Update http_client_test.cpp

7912f79

Update http_client_test.cpp

7916a5f

Update http_server.hpp

952ad64

http_server fix

c980c12

Update http_server.hpp

aeccc6a

stephenberry added 13 commits April 9, 2026 21:29

Close all tracked sockets to cancel pending I/O

73ce111

Update http_server.hpp

59688bf

Update http_server.hpp

d49405b

Update http_client_test.cpp

dcf485a

Update http_server.hpp

26eddac

Update http_server.hpp

bcc59c3

Update mingw_ssl_diag.cpp

ace39a9

Run minimal reproducer before http_client_test in CI

051e5ec

Run mingw_ssl_diag first (with || true) so it always produces output, then http_client_test with if: always(). This ensures both tests run even if one crashes.

stephenberry changed the title ~~msys2-mingw-ssl error replication test~~ MinGW + SSL CI and investigation Apr 13, 2026

Add standalone MinGW + OpenSSL heap corruption reproducer

b13a078

No Glaze dependency — just ASIO + OpenSSL + MinGW. Useful for reporting upstream to MSYS2/OpenSSL. Includes a drop-in GitHub Action workflow.

Add standalone reproducer workflow to .github/workflows/

74e166a

Triggers on push to msys2-mingw-ssl and debug/* branches, plus manual workflow_dispatch. Runs the ASIO + OpenSSL reproducer (no Glaze dependency) 3 times to catch the intermittent crash.

Fix standalone reproducer: use OS-assigned port

8ec4c8b

Fixed port 19876 caused bind failures on reuse. Now uses port 0 so the OS assigns a free port each round.
