MinGW + SSL CI and investigation #2448

Open
stephenberry wants to merge 41 commits into main from msys2-mingw-ssl

@stephenberry
Owner

@stephenberry stephenberry commented Apr 8, 2026

Summary

Adds a MinGW + SSL CI workflow and documents an intermittent heap corruption (0xc0000374) on MinGW/GCC + Windows when OpenSSL is linked.

Root cause: OpenSSL + MinGW runtime interaction bug (not a Glaze bug). The crash occurs purely from linking OpenSSL libraries — even with GLZ_ENABLE_SSL undefined and no SSL code compiled. MSVC builds are unaffected.

See full investigation writeup for details.

Changes

  • msys2-ssl.yml — New CI workflow for MinGW + SSL testing (continue-on-error: true)
  • http_server.hpp — Move ssl_context to conditional base class (ssl_context_holder) so http_server<false> has no OpenSSL members
  • http_client.hpp — Add io_context drain in stop_workers() (good practice for clean shutdown)
  • tests/CMakeLists.txt — Add ws2_32/mswsock Winsock linking for MinGW
  • mingw_ssl_diag/ — Diagnostic test suite and minimal reproducer
  • Documentation — Full writeup of the investigation and findings
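The conditional base class mentioned for http_server.hpp could look roughly like the sketch below. This is not the code from this PR: fake_ssl_context is a hypothetical stand-in for asio::ssl::context so the example builds without OpenSSL, and http_server is reduced to a single illustrative member function.

```cpp
#include <cassert>
#include <memory>
#include <type_traits>

// Hypothetical stand-in for asio::ssl::context, so the sketch needs no OpenSSL.
struct fake_ssl_context {
    int verify_depth = 0;
};

// Primary template: TLS enabled, so the holder owns an ssl_context member.
template <bool EnableTLS>
struct ssl_context_holder {
    std::unique_ptr<fake_ssl_context> ssl_context;
};

// Specialization: TLS disabled, so there is no member at all and the
// unique_ptr destructor is never instantiated for this case.
template <>
struct ssl_context_holder<false> {};

template <bool EnableTLS>
struct http_server : ssl_context_holder<EnableTLS> {
    bool has_tls_context() {
        if constexpr (EnableTLS) {
            // this-> makes the name dependent, so lookup is deferred
            // until the template is instantiated.
            if (!this->ssl_context) {
                this->ssl_context = std::make_unique<fake_ssl_context>();
            }
            assert(this->ssl_context != nullptr);
            return true;
        } else {
            return false;
        }
    }
};

// The non-TLS holder carries no data; the TLS holder does.
static_assert(std::is_empty_v<ssl_context_holder<false>>);
static_assert(!std::is_empty_v<ssl_context_holder<true>>);
```

With this layout, http_server<false> never names the context type in a data member, which is the property the change is after.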

Investigation summary

Hypothesis | Result
--- | ---
GLZ_ENABLE_SSL macro / template changes | Crashes even without the macro
DLL boundary / CRT heap mismatch | Crashes with static OpenSSL too
GCC optimizer bug | Crashes with -O0
Pending ASIO handlers during io_context destruction | Drain didn't fix it
OpenSSL TLS cleanup on thread exit | OPENSSL_thread_stop() didn't fix it
Merely linking OpenSSL | Confirmed root cause

Test plan

  • All existing CI workflows pass (gcc, clang, msvc, standalone-asio, boost-asio, etc.)
  • MinGW SSL workflow runs with continue-on-error: true (known intermittent failure)
  • http_server<true> (TLS servers) still compile and work via ssl_context_holder base class

@packit-as-a-service

One of the tests failed for 0ad8abc. @admin check logs None, packit dashboard https://dashboard.packit.dev/jobs/srpm/581076 and external service dashboard https://copr.fedorainfracloud.org/coprs/build/10301902/

@packit-as-a-service

One of the tests failed for b08b8ba. @admin check logs None, packit dashboard https://dashboard.packit.dev/jobs/srpm/581122 and external service dashboard https://copr.fedorainfracloud.org/coprs/build/10302067/

@packit-as-a-service

One of the tests failed for bb44ab5. @admin check logs None, packit dashboard https://dashboard.packit.dev/jobs/srpm/581154 and external service dashboard https://copr.fedorainfracloud.org/coprs/build/10302183/

@packit-as-a-service

One of the tests failed for 85a3f55. @admin check logs None, packit dashboard https://dashboard.packit.dev/jobs/copr/3451122 and external service dashboard https://copr.fedorainfracloud.org/coprs/build/10302317/

Run mingw_ssl_diag first (with || true) so it always produces output,
then http_client_test with if: always(). This ensures both tests run
even if one crashes.

Three parallel CI jobs:
- ssl-enabled: GLZ_ENABLE_SSL + dynamic OpenSSL (baseline, crashes)
- link-only: OpenSSL DLLs linked, GLZ_ENABLE_SSL undefined (tests DLL loading)
- static-ssl: GLZ_ENABLE_SSL + static OpenSSL (tests DLL boundary issue)

Temporarily remove the ssl_context data member to determine if its
presence in http_server<false> causes the MinGW heap corruption.
The member's unique_ptr<asio::ssl::context> destructor instantiates
OpenSSL cleanup code even when the pointer is null, which may be
the trigger. All usages are inside if constexpr (EnableTLS) so
http_server<false> compiles without it.

Use ssl_context_holder<EnableTLS> base class so that ssl_context only
exists as a data member when EnableTLS=true. This avoids instantiating
unique_ptr<asio::ssl::context> destructor for http_server<false>.

All references use this->ssl_context to defer name lookup past
GCC 15's -Wtemplate-body checking of discarded if constexpr branches.

The previous -UGLZ_ENABLE_SSL approach didn't reliably override the
interface target's -D flag, so the link-only test was still compiled
with SSL code paths. Now the link-only job configures with
glaze_ENABLE_SSL=OFF so the glaze headers have no SSL code at all.

Root cause identified: heap corruption occurs on MinGW/GCC + Windows
purely from linking OpenSSL, even with GLZ_ENABLE_SSL undefined and
no SSL code compiled. This is an OpenSSL + MinGW runtime interaction
bug, not a Glaze issue.

Changes:
- Document findings in docs/networking/mingw-ssl-heap-corruption.md
- Simplify msys2-ssl.yml CI workflow with continue-on-error: true
- Move ssl_context to conditional base class (ssl_context_holder) so
  http_server<false> has no OpenSSL members (cleaner architecture)
- Add io_context drain in http_client::stop_workers() (good practice)
- Restore http_client_test.cpp to clean state (remove debug traces)
- Keep diagnostic tests for tracking upstream fixes
- Add ws2_32/mswsock Winsock linking for MinGW in tests/CMakeLists.txt
@GTruf

GTruf commented Apr 13, 2026

@stephenberry, hello.
I don't know if you did this investigation on your own or if you have a whole team, but that's impressive... Could you please tell me if there’s any resolution regarding the issue you found: have you fixed it and all the tests are passing, or is additional testing needed (Status: Root cause identified, fix applied)? And, by the way, since we’re discussing all this in the context of WebSocket implementation, why are you using shared_ptr and weak_ptr patterns everywhere there? Why not unique_ptr or something else? Won’t there be a performance hit from the use of shared_ptr everywhere?

And I have a question about the Correct usage pattern: isn't it possible that new handlers might appear after a restart (clear the stopped flag)? I understand the basic idea: we stop the io_context, wait for the workers to finish, then restart and complete the work of any other handlers, if there are any. But couldn’t more handlers be added at that point?

@stephenberry stephenberry changed the title from "msys2-mingw-ssl error replication test" to "MinGW + SSL CI and investigation" Apr 13, 2026
@stephenberry
Owner Author

@GTruf, this was me pounding away at the problem using AI. The PR description was out of date and I just updated it.

What is extremely relevant is that merely linking OpenSSL on MinGW without compiling any SSL code causes crashes. So, the primary issue isn't a Glaze bug.

I'm still digging through the problem, but this looks like a major concern with MinGW. I'd be surprised if it was OpenSSL's fault.

@stephenberry
Owner Author

@GTruf, take a look at this writeup: https://github.com/stephenberry/glaze/blob/msys2-mingw-ssl/docs/networking/mingw-ssl-heap-corruption.md

No Glaze dependency — just ASIO + OpenSSL + MinGW. Useful for
reporting upstream to MSYS2/OpenSSL. Includes a drop-in GitHub
Action workflow.
@GTruf

GTruf commented Apr 13, 2026

@stephenberry, I’ve also done a lot of research on the glaze and asio code using AI, and for the most part, all the major AI models produced results that were more or less the same as what’s in your notes.

Yes, initially in the PR about WebSocket, I suggested that the problem might be at the OpenSSL level, but then I deleted or edited that comment. Basically, yes, the main assumption is that the crash occurs at the OpenSSL level, built in MSYS2 using MinGW. I don’t know exactly what’s crashing. I think the problem will be easy to find if we add logging at the OpenSSL level; we’ll just have to rebuild the library for each new log entry, but that’s not a big deal. Do you have time for this, or do you need help? Everything is set up on my end; I can test it soon. And we could also use GDB to see what’s going on there. It also annoys me that there are no sanitizers for MinGW, that makes it very hard to debug this problem...

Also, if you could reply in this PR thread about shared_ptr, that would be great, or I can create a separate issue for that discussion.

@stephenberry
Owner Author

@GTruf, as much as you're able to help, I would appreciate it, because my time is limited right now. But, I'd also like to understand what is going on. I'm going to build a minimal reproduction of the issue.

> And, by the way, since we’re discussing all this in the context of WebSocket implementation, why are you using shared_ptr and weak_ptr patterns everywhere there? Why not unique_ptr or something else? Won’t there be a performance hit from the use of shared_ptr everywhere?

As for use of shared_ptr, it is extremely helpful for asynchronous calls, where we might want the data or connection to outlive the call. The shared/weak pairing also works well for connections that might be dropped, as we can use the weak_ptr to query if the connection is live and we therefore pair lifetimes with utility.

However, there is a lot of higher-level complexity due to dealing with the asio architecture, and after trying to optimize I've realized that in the long run we probably want to drop asio's event loop logic. This is a much deeper discussion to have, but it would also allow the architecture to be cleaner.

shared_ptr isn't really a performance concern unless constantly being constructed and destroyed.
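To illustrate the shared/weak pairing described above, here is a minimal sketch using only the standard library. The connection type, make_send_task, and run_demo are hypothetical stand-ins for a socket and its deferred completion handlers; none of this is Glaze code.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical connection type; a real one would wrap a socket.
struct connection {
    std::vector<std::string> sent;
    void send(const std::string& msg) { sent.push_back(msg); }
};

// Build a deferred task that holds only a weak_ptr, so the task itself
// does not keep the connection alive. The task returns true if it sent.
inline std::function<bool()> make_send_task(const std::shared_ptr<connection>& conn,
                                            std::string msg) {
    std::weak_ptr<connection> weak = conn;
    return [weak, msg = std::move(msg)]() {
        if (auto live = weak.lock()) {  // still alive: pin it for this call
            live->send(msg);
            return true;
        }
        return false;  // connection was dropped before the task ran
    };
}

// Runs one task while the connection is alive and one after it is dropped.
inline std::vector<bool> run_demo() {
    auto conn = std::make_shared<connection>();
    auto first = make_send_task(conn, "hello");
    auto second = make_send_task(conn, "world");

    std::vector<bool> results;
    results.push_back(first());  // connection alive, message is sent
    assert(conn->sent.size() == 1);
    conn.reset();                // the last owner goes away
    results.push_back(second()); // lock() fails, task becomes a no-op
    return results;
}
```

The reference-count cost is paid when a handler is created or when lock() succeeds, not on every access through the pointer, which matches the point that shared_ptr only hurts when it is constantly constructed and destroyed.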

@stephenberry
Owner Author

> And I have a question about the Correct usage pattern: isn't it possible that new handlers might appear after a restart (clear the stopped flag)? I understand the basic idea: we stop the io_context, wait for the workers to finish, then restart and complete the work of any other handlers, if there are any. But couldn’t more handlers be added at that point?

AI explanation:

After restart() clears the stopped flag, new handlers could theoretically be dispatched. But in this sequence it's safe:

io_context->stop();       // 1. Prevent new handler dispatch
join_worker_threads();    // 2. All threads exit run(), no code left to post handlers
io_context->restart();    // 3. Clear stopped flag
io_context->poll();       // 4. Drain remaining queued handlers on THIS thread

After step 2, no other threads are alive to post work. The only handlers in the queue are ones that were posted before stop() but never dispatched. poll() runs them synchronously on the calling thread.

If a drained handler itself posts new work (e.g., a completion callback that calls async_read), poll() picks that up too, which is actually desirable, since it lets cleanup chains complete fully. Once the queue is empty and no new work is posted, poll() returns.

That said, this drain pattern is not currently applied on main; it is proposed in this PR as a preventive improvement for http_client::stop_workers(). The MinGW + OpenSSL heap corruption turned out to be an upstream runtime issue unrelated to handler draining (the crash occurs even when no handlers are pending). The drain is still good practice for clean ASIO shutdown.
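The stop / join / restart / poll sequence can be modelled without ASIO. The sketch below is a toy, single-threaded stand-in (toy_context and drain_demo are hypothetical names, not asio types) that only mirrors the behaviour described above: poll() does nothing while stopped, and after restart() it drains queued handlers, including follow-up handlers posted during the drain.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Toy stand-in for an io_context handler queue (hypothetical, not asio).
class toy_context {
public:
    void post(std::function<void()> handler) { queue_.push_back(std::move(handler)); }
    void stop() { stopped_ = true; }
    void restart() { stopped_ = false; }

    // Run ready handlers until the queue is empty; returns how many ran.
    // While stopped, returns immediately without running anything.
    std::size_t poll() {
        std::size_t ran = 0;
        while (!stopped_ && !queue_.empty()) {
            auto handler = std::move(queue_.front());
            queue_.pop_front();
            handler();  // may post follow-up handlers, which are drained too
            ++ran;
        }
        return ran;
    }

private:
    std::deque<std::function<void()>> queue_;
    bool stopped_ = false;
};

inline std::vector<std::string> drain_demo() {
    toy_context ctx;
    std::vector<std::string> log;

    // A handler queued before stop() that chains one more cleanup step.
    ctx.post([&] {
        log.push_back("close socket");
        ctx.post([&] { log.push_back("release buffers"); });
    });

    ctx.stop();                                    // 1. no more dispatch
    const std::size_t while_stopped = ctx.poll();  // nothing runs while stopped
    assert(while_stopped == 0);
    // 2. worker threads would be joined here in the real sequence
    ctx.restart();                                 // 3. clear the stopped flag
    const std::size_t drained = ctx.poll();        // 4. drain on this thread
    assert(drained == 2);
    return log;
}
```

The toy omits threads on purpose: once the workers are joined, the calling thread is the only one left, so the drain is effectively single-threaded.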

Triggers on push to msys2-mingw-ssl and debug/* branches, plus
manual workflow_dispatch. Runs the ASIO + OpenSSL reproducer
(no Glaze dependency) 3 times to catch the intermittent crash.
@GTruf

GTruf commented Apr 13, 2026

@stephenberry, By the way, I’m comparing the performance of the libwebsockets and glaze libraries when working with WebSockets. For libwebsockets, I’m also using OpenSSL compiled under MinGW in MSYS2 (basically the same binary as for glaze), and everything works fine there, without any crash.

Fixed port 19876 caused bind failures on reuse. Now uses port 0
so the OS assigns a free port each round.
@GTruf

GTruf commented Apr 14, 2026

@stephenberry, I tested OpenSSL 3.0.8, 3.1.8, 3.6.2, and 4.0.0-beta-1. All of them crash, though the issue occurs much less frequently in earlier versions. Using AI, I discovered that it is possible to set certain callback functions that are called at specific points during event handling in OpenSSL. I defined the callbacks at the very top of websocket_client.hpp and set them up immediately after allocating memory for ssl_socket_. What’s characteristic is that for all versions, in the event of a crash, a handshake occurs first, followed by two CCS packets of 5 and 1 bytes, respectively, after which the crash occurs.

// AI-based
void ssl_info_callback(const SSL* ssl, int where, int ret)
{
    const char* str = "undefined";
    int w = where & ~SSL_ST_MASK;

    if (w & SSL_ST_CONNECT)      str = "SSL_connect";
    else if (w & SSL_ST_ACCEPT)  str = "SSL_accept";

    if (where & SSL_CB_LOOP) {
        std::cout << str << ": " << SSL_state_string_long(ssl) << "\n" << std::flush;
    }
    else if (where & SSL_CB_ALERT) {
        const char* dir = (where & SSL_CB_READ) ? "received" : "sent";
        std::cout << "Alert " << dir << ": "
                  << SSL_alert_type_string_long(ret) << " : "
                  << SSL_alert_desc_string_long(ret) << "\n";
    }
    else if (where & SSL_CB_EXIT) {
        if (ret == 0)
            std::cout << str << ": failed in " << SSL_state_string_long(ssl) << "\n" << std::flush;
        else if (ret < 0)
            std::cout << str << ": error in " << SSL_state_string_long(ssl) << "\n" << std::flush;
    }
    if (where & SSL_CB_HANDSHAKE_DONE)
        std::cout << "=== HANDSHAKE DONE ===\n" << std::flush << "VERSION: " << OpenSSL_version(OPENSSL_VERSION_STRING) << std::endl;
}

// AI-based
void msg_callback(int write_p, int version, int content_type,
                  const void* buf, size_t len, SSL* ssl, void* arg)
{
    const char* dir = write_p ? "sent" : "received";
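    // Any content type other than handshake or alert is labelled "ccs" here,
    // so that label is not limited to real ChangeCipherSpec records.
    const char* type = (content_type == SSL3_RT_HANDSHAKE) ? "handshake" :
                       (content_type == SSL3_RT_ALERT) ? "alert" : "ccs";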

    std::cout << dir << " " << type << " (" << len << " bytes)\n" << std::flush;
}

...
// Inside `void connect(std::string_view url_str)`
ssl_socket_ = std::make_shared<asio::ssl::stream<asio::ip::tcp::socket>>(*ctx, *ssl_ctx_); // <--- glaze code

// Added code for logging via callbacks
SSL* ssl = ssl_socket_->native_handle();
SSL_set_info_callback(ssl, ssl_info_callback);
SSL_set_msg_callback(ssl, msg_callback);

Output:

SSL_connect: before SSL initialization
sent ccs (5 bytes)
sent handshake (1539 bytes)
SSL_connect: SSLv3/TLS write client hello
SSL_connect: error in SSLv3/TLS write client hello
SSL_connect: error in SSLv3/TLS write client hello
received ccs (5 bytes)
SSL_connect: SSLv3/TLS write client hello
received handshake (1210 bytes)
received ccs (5 bytes)
received ccs (1 bytes)
received ccs (5 bytes)
SSL_connect: error in SSLv3/TLS read server hello
received ccs (1 bytes)
SSL_connect: SSLv3/TLS read server hello
received handshake (10 bytes)
SSL_connect: TLSv1.3 read encrypted extensions
received handshake (2519 bytes)
SSL_connect: SSLv3/TLS read server certificate
received handshake (78 bytes)
SSL_connect: TLSv1.3 read server certificate verify
received handshake (52 bytes)
SSL_connect: SSLv3/TLS read finished
sent ccs (5 bytes)
sent ccs (1 bytes)
SSL_connect: SSLv3/TLS write change cipher spec
sent ccs (5 bytes)
sent ccs (1 bytes)
sent handshake (52 bytes)
SSL_connect: SSLv3/TLS write finished
=== HANDSHAKE DONE ===
VERSION: 3.6.2
sent ccs (5 bytes)
sent ccs (1 bytes)

Process finished with exit code -1073741819 (0xC0000005)

@GTruf

GTruf commented Apr 14, 2026

@stephenberry, Overall, I think that building OpenSSL under MSYS2's MinGW is incompatible with using standard MinGW in our case. It might be an incompatibility between UCRT and MSVCRT or something else. But the whole thing is pretty strange; an incompatibility issue like this would clearly have been detected earlier. To reiterate, the libwebsockets code is fully functional, and OpenSSL is sourced from MSYS2. I think there is no crash in the libwebsockets version because the handshake and cleanup occur within the library, which was built in a more consistent environment (or uses slightly different internal calls).

Anyway, if you create any more PRs on this topic, feel free to ping me so I’m at least aware that they exist.

@GTruf

GTruf commented Apr 14, 2026

@stephenberry, If we ultimately can't resolve the issue, I suggest failing the compilation using CMake message(FATAL_ERROR ...) and C++ static_assert that checks for the use of glaze with MinGW and OpenSSL, as a temporary solution.
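On the C++ side, such a guard might look like the sketch below. __MINGW32__ is the macro MinGW toolchains predefine and GLZ_ENABLE_SSL is the macro discussed in this PR; the variable and helper names are hypothetical, and nothing here has been adopted by Glaze.

```cpp
#include <cassert>

// True when compiling with a MinGW toolchain (__MINGW32__ is predefined
// by both 32-bit and 64-bit MinGW-w64 compilers).
#if defined(__MINGW32__)
inline constexpr bool building_with_mingw = true;
#else
inline constexpr bool building_with_mingw = false;
#endif

// True when the SSL code paths are requested via GLZ_ENABLE_SSL.
#if defined(GLZ_ENABLE_SSL)
inline constexpr bool ssl_enabled = true;
#else
inline constexpr bool ssl_enabled = false;
#endif

// Hypothetical helper: the combination this thread found to be crash-prone.
constexpr bool is_unsupported_combo(bool mingw, bool ssl) { return mingw && ssl; }

// Fails the build only when both conditions hold.
static_assert(!is_unsupported_combo(building_with_mingw, ssl_enabled),
              "MinGW + OpenSSL is known to crash intermittently; see PR #2448");
```

Since the crash was also reproduced with OpenSSL merely linked and GLZ_ENABLE_SSL undefined, a macro check like this cannot see the link-only case; that part would have to be covered by the CMake-side check.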

@stephenberry
Owner Author

@GTruf, thanks for all this analysis. This is probably something that the OpenSSL team would be very interested in, because it could potentially be an attack vector. Do you plan to submit an error report or would you like me to?

@GTruf

GTruf commented Apr 14, 2026

@stephenberry, To be honest, I'm not quite sure how to put all this into words for them. If you could send them all of this, that would be great. If you need to run the text by me to make sure we're both confident we're sending them the right report, just ping me (especially if you need code and reproduction steps, starting with the OpenSSL build, etc.).

P.S. I think they’ll need the Glaze code that uses asio so they can see what’s wrong there, and also to point out that everything works fine with libwebsockets.

@GTruf

GTruf commented Apr 14, 2026

@stephenberry,

> because it could potentially be an attack vector

I can't really imagine how this could be used for malicious purposes
