websocket optimization and benchmarking #2399
|
@stephenberry, hi, are there any plans to optimize the WebSockets/HTTP part? And what about the benchmarks you were planning to do, at least against uWebSockets? |
|
@GTruf, optimizing websockets is an aim. uWebSockets is pretty well optimized, and in my work to make Glaze just as fast I realized there are core limitations due to the design of asio. I need to decide whether to optimize asio or rework the core networking logic to support extreme optimization. I think in the meantime I'll merge optimizations that still use the asio architecture and include benchmarks. But I want to get this right and not make any API-breaking changes without good reason. |
|
@stephenberry, By |
|
This is an interesting dilemma; are there any better alternatives to asio at this point or is this a situation in which one would have to roll their own? |
|
There is a lot I like about asio, but it also has some core flaws and lots of unfixed bugs. I've thought of trying to contribute heavily to asio, but I'd rather work on a more modern library that doesn't need to support the old C++ versions asio has to care about. I have an experimental fork of asio that I have massively cleaned up: it drops lots of deprecated code and tidies things up with modern C++20 concepts, etc. If I could get enough developers to help maintain it I would consider this direction, but I'm wary of the required time investment. The other option is to implement custom cross-platform networking code for websockets, but this is more prone to bugs and bifurcates the networking codebase. A couple of years ago I was using uWebSockets heavily, and core bugs and design flaws forced us into strange hacks and quick exits to avoid segfaults. So tightly optimized networking code is hard, but also very desirable. I might open source my asio fork soon for feedback. It's a tough call. The fork removes many thousands of lines of code that exist only because of historical debt, but I wouldn't want to aim for parity with asio any more. This would be a completely new library with a similar API that evolves in another direction for the sake of modern C++ and performance. |
|
@RazielXYZ, DPDK is fully implemented for Linux. It's already available for Windows, but is still under active development there. No one is suggesting that the backend should be implemented exclusively on F-Stack. I mean supporting it as an additional option, to maximize performance at the kernel-bypass level. |
|
@stephenberry, I think you'll have no trouble finding developers on Reddit. |
I don't think there's necessarily much point in aiming for parity with asio anyway, since asio still exists and is still under development, so if people want parity with asio, well, they can use asio. Something nicer and modern-er, as you describe it there, would certainly be welcome. As for other ws libraries: I did try uWebSockets a bit, but did not like the API or design much. Anything slightly different than expected or somewhat involved was either not easily doable or really damn ugly. I've used IXWebSocket quite a bit recently and it was fine in that regard, but it is quite limited by the one-thread-per-connection/client design and is not really under active development anymore. Way back in the day I also used websocketpp, which is quite far removed from the niceties we have nowadays. |
|
Thanks for the feedback and encouragement. I do think an asio fork would be great for the future C++ community. And I'm excited to do more networking in C++. I'm just short on time at the moment and I don't want to do something half-baked. |
WebSocket Optimization and Benchmarking
Performance Optimizations
Shared receive buffers: All WebSocket connections on a given thread now share a single 512 KB receive buffer (one allocation per thread instead of one per connection). Unconsumed partial-frame bytes spill to a small per-connection buffer. Deferred reclamation avoids thrashing. Enabled by default; disable with ws_recv_buffer_size(0).
Fused unmask + ASCII detection: XOR unmasking now processes 8 bytes at a time and simultaneously checks whether all bytes are ASCII. For ASCII-only text frames, the separate UTF-8 validation pass is skipped entirely.
Zero-allocation write fast path: When no write is in flight, outgoing frames are built directly in a persistent per-connection buffer whose capacity is reused across messages. Frames are heap-allocated and queued only when a concurrent write is already in progress.
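The write fast path can be sketched roughly as follows. This is an illustration under assumptions, not the PR's implementation: write_path_sketch, append_frame, start_write and on_write_complete are hypothetical names, and the asynchronous socket write is replaced by a counter so the example stays self-contained. The frame header layout follows RFC 6455 for unmasked server-to-client frames.

```cpp
#include <cstddef>
#include <cstdint>
#include <deque>
#include <utility>
#include <vector>

// Appends an unmasked frame (FIN set) to `out`, header layout per RFC 6455.
inline void append_frame(std::vector<std::uint8_t>& out, std::uint8_t opcode,
                         const std::uint8_t* payload, std::size_t n)
{
   out.push_back(static_cast<std::uint8_t>(0x80 | (opcode & 0x0F)));
   if (n < 126) {
      out.push_back(static_cast<std::uint8_t>(n));
   }
   else if (n <= 0xFFFF) {
      out.push_back(126); // 16-bit big-endian length follows
      out.push_back(static_cast<std::uint8_t>(n >> 8));
      out.push_back(static_cast<std::uint8_t>(n & 0xFF));
   }
   else {
      out.push_back(127); // 64-bit big-endian length follows
      for (int shift = 56; shift >= 0; shift -= 8) {
         out.push_back(static_cast<std::uint8_t>((static_cast<std::uint64_t>(n) >> shift) & 0xFF));
      }
   }
   out.insert(out.end(), payload, payload + n);
}

// Hypothetical connection-side state; the real write would be asynchronous.
struct write_path_sketch
{
   std::vector<std::uint8_t> fast_buffer;        // persistent; clear() keeps its capacity
   std::deque<std::vector<std::uint8_t>> queue;  // used only while a write is in flight
   std::vector<std::uint8_t> draining;           // keeps a dequeued frame alive during its write
   bool write_in_flight = false;
   std::size_t writes_started = 0;               // stand-in for issuing a socket write

   void send(std::uint8_t opcode, const std::uint8_t* payload, std::size_t n)
   {
      if (!write_in_flight) {
         // Fast path: reuse the persistent buffer, no new allocation once it has grown.
         fast_buffer.clear();
         append_frame(fast_buffer, opcode, payload, n);
         start_write();
      }
      else {
         // Slow path: the fast buffer is busy, so build a separate frame and queue it.
         std::vector<std::uint8_t> frame;
         frame.reserve(n + 10); // 10 bytes is the largest unmasked header
         append_frame(frame, opcode, payload, n);
         queue.push_back(std::move(frame));
      }
   }

   // Would be called from the write completion handler.
   void on_write_complete()
   {
      if (queue.empty()) {
         write_in_flight = false;
         return;
      }
      draining = std::move(queue.front());
      queue.pop_front();
      start_write();
   }

   void start_write()
   {
      write_in_flight = true;
      ++writes_started;
   }
};
```

The key invariant in this sketch is that fast_buffer is only ever rebuilt while no write is in flight, so the bytes handed to an in-progress write are never modified underneath it.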
Write queue simplification: Replaced std::deque<std::unique_ptr<std::vector<uint8_t>>> with std::deque<std::vector<uint8_t>>, removing a level of indirection.
Benchmark Suite
Added benchmarks/ws_benchmark/, comparing Glaze against uWebSockets using Boost.Beast as a neutral client. Tests cover: