MinGW + SSL CI and investigation #2448
|
One of the tests failed for 0ad8abc. @admin check logs None, packit dashboard https://dashboard.packit.dev/jobs/srpm/581076 and external service dashboard https://copr.fedorainfracloud.org/coprs/build/10301902/ |
|
One of the tests failed for b08b8ba. @admin check logs None, packit dashboard https://dashboard.packit.dev/jobs/srpm/581122 and external service dashboard https://copr.fedorainfracloud.org/coprs/build/10302067/ |
|
One of the tests failed for bb44ab5. @admin check logs None, packit dashboard https://dashboard.packit.dev/jobs/srpm/581154 and external service dashboard https://copr.fedorainfracloud.org/coprs/build/10302183/ |
|
One of the tests failed for 894cdf5. @admin check logs https://download.copr.fedorainfracloud.org/results/packit/stephenberry-glaze-2448/srpm-builds/10302242/builder-live.log, packit dashboard https://dashboard.packit.dev/jobs/srpm/581176 and external service dashboard https://copr.fedorainfracloud.org/coprs/build/10302242/ |
|
One of the tests failed for 85a3f55. @admin check logs https://download.copr.fedorainfracloud.org/results/packit/stephenberry-glaze-2448/fedora-rawhide-aarch64/10302317-glaze/builder-live.log, packit dashboard https://dashboard.packit.dev/jobs/copr/3451123 and external service dashboard https://copr.fedorainfracloud.org/coprs/build/10302317/ |
|
One of the tests failed for 85a3f55. @admin check logs None, packit dashboard https://dashboard.packit.dev/jobs/copr/3451122 and external service dashboard https://copr.fedorainfracloud.org/coprs/build/10302317/ |
Run mingw_ssl_diag first (with || true) so it always produces output, then http_client_test with if: always(). This ensures both tests run even if one crashes.
Three parallel CI jobs:
- ssl-enabled: GLZ_ENABLE_SSL + dynamic OpenSSL (baseline, crashes)
- link-only: OpenSSL DLLs linked, GLZ_ENABLE_SSL undefined (tests DLL loading)
- static-ssl: GLZ_ENABLE_SSL + static OpenSSL (tests DLL boundary issue)
Temporarily remove the ssl_context data member to determine if its presence in http_server<false> causes the MinGW heap corruption. The member's unique_ptr<asio::ssl::context> destructor instantiates OpenSSL cleanup code even when the pointer is null, which may be the trigger. All usages are inside if constexpr (EnableTLS) so http_server<false> compiles without it.
Use ssl_context_holder<EnableTLS> base class so that ssl_context only exists as a data member when EnableTLS=true. This avoids instantiating unique_ptr<asio::ssl::context> destructor for http_server<false>. All references use this->ssl_context to defer name lookup past GCC 15's -Wtemplate-body checking of discarded if constexpr branches.
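The conditional base class described in this commit can be sketched as follows. This is a minimal illustration, not the Glaze source: `fake_ssl_context` is a hypothetical stand-in for `asio::ssl::context` so the snippet builds without ASIO or OpenSSL, and `configure_tls()` and `port` are invented for the example.

```cpp
#include <cassert>
#include <memory>
#include <type_traits>

// Hypothetical stand-in for asio::ssl::context (assumption for this sketch).
struct fake_ssl_context {
    int verify_depth = 0;
};

// Primary template: TLS disabled, so the holder has no data member at all.
template <bool EnableTLS>
struct ssl_context_holder {};

// TLS specialization: the unique_ptr member exists only here.
template <>
struct ssl_context_holder<true> {
    std::unique_ptr<fake_ssl_context> ssl_context;
};

template <bool EnableTLS>
struct http_server : ssl_context_holder<EnableTLS> {
    // Hypothetical member, only here to show how the base member is reached.
    bool configure_tls() {
        if constexpr (EnableTLS) {
            // this-> makes the name dependent, so lookup waits for instantiation.
            this->ssl_context = std::make_unique<fake_ssl_context>();
            return true;
        } else {
            return false;
        }
    }
    int port = 0;
};

// The non-TLS holder is an empty class; the TLS holder carries the pointer.
static_assert(std::is_empty_v<ssl_context_holder<false>>);
static_assert(!std::is_empty_v<ssl_context_holder<true>>);
```

With this shape, `http_server<false>` never instantiates the `unique_ptr` destructor, because the member is simply absent from its base.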
The previous -UGLZ_ENABLE_SSL approach didn't reliably override the interface target's -D flag, so the link-only test was still compiled with SSL code paths. Now the link-only job configures with glaze_ENABLE_SSL=OFF so the glaze headers have no SSL code at all.
Root cause identified: heap corruption occurs on MinGW/GCC + Windows purely from linking OpenSSL, even with GLZ_ENABLE_SSL undefined and no SSL code compiled. This is an OpenSSL + MinGW runtime interaction bug, not a Glaze issue.
Changes:
- Document findings in docs/networking/mingw-ssl-heap-corruption.md
- Simplify msys2-ssl.yml CI workflow with continue-on-error: true
- Move ssl_context to conditional base class (ssl_context_holder) so http_server<false> has no OpenSSL members (cleaner architecture)
- Add io_context drain in http_client::stop_workers() (good practice)
- Restore http_client_test.cpp to clean state (remove debug traces)
- Keep diagnostic tests for tracking upstream fixes
- Add ws2_32/mswsock Winsock linking for MinGW in tests/CMakeLists.txt
|
@stephenberry, hello. And I have a question about the |
|
@GTruf, this was me pounding away at the problem using AI. The PR description was out of date and I just updated it. What is extremely relevant is that merely linking OpenSSL on MinGW without compiling any SSL code causes crashes. So, the primary issue isn't a Glaze bug. I'm still digging through the problem, but this looks like a major concern with MinGW. I'd be surprised if it was OpenSSL's fault. |
|
@GTruf, take a look at this writeup: https://github.com/stephenberry/glaze/blob/msys2-mingw-ssl/docs/networking/mingw-ssl-heap-corruption.md |
No Glaze dependency — just ASIO + OpenSSL + MinGW. Useful for reporting upstream to MSYS2/OpenSSL. Includes a drop-in GitHub Action workflow.
|
@stephenberry, I’ve also done a lot of research on the glaze and asio code using AI, and for the most part, all the major AI models produced results that were more or less the same as what’s in your notes. Yes, initially in the PR about WebSocket, I suggested that the problem might be at the OpenSSL level, but then I deleted or edited that comment. Basically, yes, the main assumption is that the crash occurs at the OpenSSL level, built in MSYS2 using MinGW. I don’t know exactly what’s crashing. I think the problem will be easy to find if we add logging at the OpenSSL level; we’ll just have to rebuild the library for each new log entry, but that’s not a big deal. Do you have time for this, or do you need help? Everything is set up on my end; I can test it soon. And we could also use GDB to see what’s going on there. It also annoys me that there are no sanitizers for MinGW, that makes it very hard to debug this problem... Also, if you could reply in this PR thread about |
|
@GTruf, as much as you're able to help, I would appreciate it, because my time is limited right now. But, I'd also like to understand what is going on. I'm going to build a minimal reproduction of the issue.
As for the use of shared_ptr, it is extremely helpful for asynchronous calls, where we might want the data or connection to outlive the call. The shared/weak pairing also works well for connections that might be dropped: we can use the weak_ptr to query whether the connection is still live, so we pair lifetimes with utility. However, there is a lot of higher-level complexity due to dealing with the asio architecture, and after trying to optimize I've realized that in the long run we probably want to drop asio's event loop logic. This is a much deeper discussion to have, but it would also allow the architecture to be cleaner. shared_ptr isn't really a performance concern unless it is constantly being constructed and destroyed. |
AI explanation: After
io_context->stop(); // 1. Prevent new handler dispatch
join_worker_threads(); // 2. All threads exit run(), no code left to post handlers
io_context->restart(); // 3. Clear stopped flag
io_context->poll(); // 4. Drain remaining queued handlers on THIS thread
After step 2, no other threads are alive to post work. The only handlers in the queue are ones that were posted before If a drained handler itself posts new work (e.g., a completion callback that calls That said, this drain pattern is not currently applied on |
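The four numbered steps can be tried out without ASIO using a toy executor. This is only a sketch under stated assumptions: `mini_context` is a hypothetical, heavily simplified stand-in for `asio::io_context` (same method names, far less behavior), and `stop_and_drain` is an invented helper, not the `stop_workers()` implementation from the PR.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical, simplified stand-in for asio::io_context.
class mini_context {
public:
    void post(std::function<void()> handler) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            queue_.push_back(std::move(handler));
        }
        ready_.notify_one();
    }

    // Worker loop: runs handlers until stop() is called.
    void run() {
        std::unique_lock<std::mutex> lock(mutex_);
        for (;;) {
            ready_.wait(lock, [this] { return stopped_ || !queue_.empty(); });
            if (stopped_) return; // handlers still queued are left behind
            auto handler = std::move(queue_.front());
            queue_.pop_front();
            lock.unlock();
            handler();
            lock.lock();
        }
    }

    void stop() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stopped_ = true;
        }
        ready_.notify_all();
    }

    void restart() {
        std::lock_guard<std::mutex> lock(mutex_);
        stopped_ = false;
    }

    // Runs ready handlers on the calling thread without blocking.
    std::size_t poll() {
        std::size_t ran = 0;
        std::unique_lock<std::mutex> lock(mutex_);
        while (!stopped_ && !queue_.empty()) {
            auto handler = std::move(queue_.front());
            queue_.pop_front();
            lock.unlock();
            handler(); // work posted by this handler is picked up by the same loop
            lock.lock();
            ++ran;
        }
        return ran;
    }

private:
    std::mutex mutex_;
    std::condition_variable ready_;
    std::deque<std::function<void()>> queue_;
    bool stopped_ = false;
};

// Stop, join, restart, poll. Returns how many handlers the drain executed.
std::size_t stop_and_drain(mini_context& ctx, std::vector<std::thread>& workers) {
    ctx.stop();                          // 1. no new handler dispatch
    for (auto& t : workers) t.join();    // 2. every worker has left run()
    workers.clear();
    ctx.restart();                       // 3. clear the stopped flag
    return ctx.poll();                   // 4. drain on this thread
}
```

Because the drain happens after the join, leftover handlers run on a single known thread, which is the property the explanation relies on.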
Triggers on push to msys2-mingw-ssl and debug/* branches, plus manual workflow_dispatch. Runs the ASIO + OpenSSL reproducer (no Glaze dependency) 3 times to catch the intermittent crash.
|
@stephenberry, By the way, I’m comparing the performance of the libwebsockets and glaze libraries when working with WebSockets. For |
Fixed port 19876 caused bind failures on reuse. Now uses port 0 so the OS assigns a free port each round.
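The port 0 technique is platform independent: bind to port 0, then ask the socket which port it actually received. The sketch below uses plain POSIX sockets for brevity, which is an assumption of this example. The reproducer in the PR is described as using ASIO on Windows, and `bind_ephemeral` is a hypothetical helper name.

```cpp
#include <arpa/inet.h>
#include <cassert>
#include <cstdint>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

// Binds a TCP socket to 127.0.0.1:0 and returns the port the OS picked,
// or 0 on failure. On success *out_fd holds the open socket; the caller
// is responsible for closing it.
std::uint16_t bind_ephemeral(int* out_fd) {
    int fd = ::socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) return 0;

    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = htons(0); // 0 asks the OS for any free port

    socklen_t len = sizeof(addr);
    if (::bind(fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) != 0 ||
        ::getsockname(fd, reinterpret_cast<sockaddr*>(&addr), &len) != 0) {
        ::close(fd);
        return 0;
    }
    *out_fd = fd;
    return ntohs(addr.sin_port); // the port that was actually assigned
}
```

Each round of a repeated test can call this afresh and pass the returned port to the client, so a port left over from an earlier round cannot cause a bind failure.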
|
@stephenberry, I tested OpenSSL
// AI-based
void ssl_info_callback(const SSL* ssl, int where, int ret)
{
const char* str = "undefined";
int w = where & ~SSL_ST_MASK;
if (w & SSL_ST_CONNECT) str = "SSL_connect";
else if (w & SSL_ST_ACCEPT) str = "SSL_accept";
if (where & SSL_CB_LOOP) {
std::cout << str << ": " << SSL_state_string_long(ssl) << "\n" << std::flush;
}
else if (where & SSL_CB_ALERT) {
const char* dir = (where & SSL_CB_READ) ? "received" : "sent";
std::cout << "Alert " << dir << ": "
<< SSL_alert_type_string_long(ret) << " : "
<< SSL_alert_desc_string_long(ret) << "\n";
}
else if (where & SSL_CB_EXIT) {
if (ret == 0)
std::cout << str << ": failed in " << SSL_state_string_long(ssl) << "\n" << std::flush;
else if (ret < 0)
std::cout << str << ": error in " << SSL_state_string_long(ssl) << "\n" << std::flush;
}
if (where & SSL_CB_HANDSHAKE_DONE)
std::cout << "=== HANDSHAKE DONE ===\n" << std::flush << "VERSION: " << OpenSSL_version(OPENSSL_VERSION_STRING) << std::endl;
}
// AI-based
void msg_callback(int write_p, int version, int content_type,
const void* buf, size_t len, SSL* ssl, void* arg)
{
const char* dir = write_p ? "sent" : "received";
const char* type = (content_type == SSL3_RT_HANDSHAKE) ? "handshake" :
(content_type == SSL3_RT_ALERT) ? "alert" : "ccs";
std::cout << dir << " " << type << " (" << len << " bytes)\n" << std::flush;
}
...
// Inside `void connect(std::string_view url_str)`
ssl_socket_ = std::make_shared<asio::ssl::stream<asio::ip::tcp::socket>>(*ctx, *ssl_ctx_); // <--- glaze code
// Added code for logging via callbacks
SSL* ssl = ssl_socket_->native_handle();
SSL_set_info_callback(ssl, ssl_info_callback);
SSL_set_msg_callback(ssl, msg_callback);
Output:
SSL_connect: before SSL initialization
sent ccs (5 bytes)
sent handshake (1539 bytes)
SSL_connect: SSLv3/TLS write client hello
SSL_connect: error in SSLv3/TLS write client hello
SSL_connect: error in SSLv3/TLS write client hello
received ccs (5 bytes)
SSL_connect: SSLv3/TLS write client hello
received handshake (1210 bytes)
received ccs (5 bytes)
received ccs (1 bytes)
received ccs (5 bytes)
SSL_connect: error in SSLv3/TLS read server hello
received ccs (1 bytes)
SSL_connect: SSLv3/TLS read server hello
received handshake (10 bytes)
SSL_connect: TLSv1.3 read encrypted extensions
received handshake (2519 bytes)
SSL_connect: SSLv3/TLS read server certificate
received handshake (78 bytes)
SSL_connect: TLSv1.3 read server certificate verify
received handshake (52 bytes)
SSL_connect: SSLv3/TLS read finished
sent ccs (5 bytes)
sent ccs (1 bytes)
SSL_connect: SSLv3/TLS write change cipher spec
sent ccs (5 bytes)
sent ccs (1 bytes)
sent handshake (52 bytes)
SSL_connect: SSLv3/TLS write finished
=== HANDSHAKE DONE ===
VERSION: 3.6.2
sent ccs (5 bytes)
sent ccs (1 bytes)
Process finished with exit code -1073741819 (0xC0000005) |
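The signed exit code and the hexadecimal value in the last output line are the same 32-bit pattern. A small sketch of the conversion follows. The helper names are hypothetical, and only the two NTSTATUS codes that appear in this thread are mapped.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Reinterprets a signed 32-bit process exit code as an unsigned NTSTATUS value.
constexpr std::uint32_t to_ntstatus(std::int32_t exit_code) {
    return static_cast<std::uint32_t>(exit_code);
}

// Names for the two status codes mentioned in this thread.
std::string describe_ntstatus(std::uint32_t status) {
    switch (status) {
        case 0xC0000005u: return "STATUS_ACCESS_VIOLATION";
        case 0xC0000374u: return "STATUS_HEAP_CORRUPTION";
        default: return "unrecognized";
    }
}

// 4294967296 - 1073741819 = 3221225477 = 0xC0000005
static_assert(to_ntstatus(-1073741819) == 0xC0000005u);
```

Note that this run ended with an access violation (0xC0000005), while the PR summary reports heap corruption (0xc0000374). The two are different status codes.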
|
@stephenberry, Overall, I think that building OpenSSL under MSYS2's MinGW is incompatible with using standard MinGW in our case. It might be an incompatibility between UCRT and MSVCRT or something else. But the whole thing is pretty strange; an incompatibility issue like this would clearly have been detected earlier. To reiterate, the Anyway, if you create any more PRs on this topic, feel free to ping me so I’m at least aware that they exist. |
|
@stephenberry, If we ultimately can't resolve the issue, I suggest failing the compilation using CMake |
|
@GTruf, thanks for all this analysis. This is probably something that the OpenSSL team would be very interested in, because it could potentially be an attack vector. Do you plan to submit an error report or would you like me to? |
|
@stephenberry, To be honest, I'm not quite sure how to put all this into words for them. If you could send them all of this, that would be great. If you need to run the text by me to make sure we're both confident we're sending them the right report, just ping me (especially if you need code and reproduction steps, starting with the OpenSSL build, etc.). P.S. I think they’ll need the |
I can't really imagine how this could be used for malicious purposes |
Summary
Adds a MinGW + SSL CI workflow and documents an intermittent heap corruption (0xc0000374) on MinGW/GCC + Windows when OpenSSL is linked.
Root cause: OpenSSL + MinGW runtime interaction bug (not a Glaze bug). The crash occurs purely from linking OpenSSL libraries, even with GLZ_ENABLE_SSL undefined and no SSL code compiled. MSVC builds are unaffected.
See full investigation writeup for details.
Changes
- msys2-ssl.yml: New CI workflow for MinGW + SSL testing (continue-on-error: true)
- http_server.hpp: Move ssl_context to conditional base class (ssl_context_holder) so http_server<false> has no OpenSSL members
- http_client.hpp: Add io_context drain in stop_workers() (good practice for clean shutdown)
- tests/CMakeLists.txt: Add ws2_32/mswsock Winsock linking for MinGW
- mingw_ssl_diag/: Diagnostic test suite and minimal reproducer
Investigation summary
- GLZ_ENABLE_SSL macro / template changes
- -O0
- OPENSSL_thread_stop() didn't fix it
Test plan
- continue-on-error: true (known intermittent failure)
- http_server<true> (TLS servers) still compile and work via ssl_context_holder base class