server: set SO_REUSEPORT on the listen socket to avoid spurious EADDRINUSE on rapid restart #387
bryancall wants to merge 1 commit
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
This PR improves server startup robustness by allowing a new verifier server instance to bind to a port even when a previous instance is still actively listening (common in back-to-back test runs).
Changes:
- Adds `SO_REUSEPORT` socket option setup in `do_listen()` after `SO_REUSEADDR`.
- Documents the rationale for `SO_REUSEPORT` to mitigate `EADDRINUSE` during rapid restarts.
    } else if (setsockopt(socket_fd, SOL_SOCKET, SO_REUSEPORT, &ONE, sizeof(int)) < 0) {
      // SO_REUSEADDR alone permits binding a port left in TIME_WAIT, but not one
      // still actively bound by a live listening socket. When the server is run
      // back-to-back on the same port (e.g. an autest harness recycling ports
      // across tests, where a prior server instance has not fully exited), the
      // bind() below can fail with EADDRINUSE and the server exits without ever
      // listening. SO_REUSEPORT lets the new listener bind alongside the
      // departing one, eliminating that spurious startup failure.
      errata.note(
        S_ERROR,
        R"(Could not set reuseport on socket {}: {}.)",
        socket_fd,
        swoc::bwf::Errno{});

        R"(Could not set reuseaddr on socket {}: {}.)",
        socket_fd,
        swoc::bwf::Errno{});
    } else if (setsockopt(socket_fd, SOL_SOCKET, SO_REUSEPORT, &ONE, sizeof(int)) < 0) {

Closing. `SO_REUSEPORT` is the wrong fix here: it co-binds with a live listener (verified: my own test left two servers listening on the same port), which can mask a genuine 'another instance is already running' conflict. More fundamentally, I had not actually confirmed the root cause from real evidence before opening this; closing until the actual failure is captured.
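
The co-binding behaviour described in this comment can be reproduced in isolation. The following is a minimal sketch, not code from the PR: `open_listener` and `bound_port` are hypothetical helpers, and it assumes a POSIX platform that defines `SO_REUSEPORT` (Linux 3.9 or later, or macOS) with a usable loopback interface.

```cpp
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>
#include <cerrno>
#include <cstdint>

// Hypothetical helper (not from the PR): open a TCP listener on
// 127.0.0.1:port. SO_REUSEADDR is always set; SO_REUSEPORT only on request.
// Returns the listening fd, or -1 with errno describing the failure.
int open_listener(uint16_t port, bool reuseport) {
  int fd = socket(AF_INET, SOCK_STREAM, 0);
  if (fd < 0) {
    return -1;
  }
  int const one = 1;
  bool ok = setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one)) == 0;
  if (ok && reuseport) {
    ok = setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one)) == 0;
  }
  sockaddr_in addr{};
  addr.sin_family = AF_INET;
  addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
  addr.sin_port = htons(port); // port 0 asks the kernel for an ephemeral port
  ok = ok && bind(fd, reinterpret_cast<sockaddr *>(&addr), sizeof(addr)) == 0;
  ok = ok && listen(fd, 16) == 0;
  if (!ok) {
    int const saved = errno; // close() may clobber errno
    close(fd);
    errno = saved;
    return -1;
  }
  return fd;
}

// Hypothetical helper: report the local port a socket is bound to (0 on error).
uint16_t bound_port(int fd) {
  sockaddr_in addr{};
  socklen_t len = sizeof(addr);
  if (getsockname(fd, reinterpret_cast<sockaddr *>(&addr), &len) != 0) {
    return 0;
  }
  return ntohs(addr.sin_port);
}
```

On Linux, opening a listener with `open_listener(0, false)` and then calling `open_listener(bound_port(first), false)` is expected to fail with `EADDRINUSE`, matching the startup failure in the PR description. Repeating the sequence with `reuseport = true` on both sockets leaves two live listeners on one port, which is the hazard this comment raises: a genuinely conflicting instance would no longer be reported.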
Problem

When the server is started on a port that a previous server instance has not fully released, `bind()` fails with `EADDRINUSE` and the process exits (`process_exit_code = 1`) without ever calling `listen()`; there is no bind retry. `SO_REUSEADDR` is already set, but it only permits rebinding a port in `TIME_WAIT`; it does not allow binding a port still held by a live listening socket. So a rapidly restarted server (or two server instances briefly overlapping during teardown/startup) races on the port and the new one dies on startup.

This shows up in test harnesses that recycle a pool of ports across many back-to-back tests (e.g. ATS autest under `--network=host`): intermittently a verifier server is handed a port whose prior owner is still finishing its exit, `bind()` returns `EADDRINUSE`, and the server exits before listening. The harness, which only polls "is the port open?", reports a generic "process failed to become ready in time", masking the real cause.

Fix

Set `SO_REUSEPORT` alongside the existing `SO_REUSEADDR` on the server listen socket. This lets the new listener bind even while the departing instance still holds the port, eliminating the spurious startup failure. The kernel routes new connections to the live listener once the old one closes.

One line (plus a comment) in `do_listen()`. Scoped to the server TCP listen socket only; the client `do_connect()` path and the QUIC UDP socket are intentionally left unchanged.

Verification

`SO_REUSEPORT` compiles on Linux/Fedora and macOS (the supported build platforms); `verifier-server.cc` already includes `<sys/socket.h>`.
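
The Problem section notes that there is no bind retry. Purely as a point of comparison, a bounded retry that keeps `SO_REUSEADDR` and never co-binds with a live listener might look like the sketch below. `listen_with_retry` and its parameters are hypothetical and not part of the verifier code base, and whether a retry would address the actual failure is unconfirmed.

```cpp
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>
#include <cerrno>
#include <cstdint>

// Hypothetical sketch: bind 127.0.0.1:port with SO_REUSEADDR only, retrying
// while bind() reports EADDRINUSE. delay_ms is assumed to be under one second.
// Returns the listening fd, or -1 with errno set to the last failure.
int listen_with_retry(uint16_t port, int max_attempts, unsigned delay_ms) {
  int fd = socket(AF_INET, SOCK_STREAM, 0);
  if (fd < 0) {
    return -1;
  }
  int const one = 1;
  sockaddr_in addr{};
  addr.sin_family = AF_INET;
  addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
  addr.sin_port = htons(port);

  int saved = EINVAL; // reported when max_attempts <= 0
  if (setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one)) < 0) {
    saved = errno;
    max_attempts = 0; // skip the bind loop entirely
  }
  for (int attempt = 1; attempt <= max_attempts; ++attempt) {
    if (bind(fd, reinterpret_cast<sockaddr *>(&addr), sizeof(addr)) == 0) {
      if (listen(fd, 16) == 0) {
        return fd;
      }
      saved = errno;
      break;
    }
    saved = errno;
    if (saved != EADDRINUSE) {
      break; // only an in-use port is worth waiting for
    }
    if (attempt < max_attempts) {
      usleep(delay_ms * 1000); // give the previous owner time to exit
    }
  }
  close(fd);
  errno = saved;
  return -1;
}
```

Unlike `SO_REUSEPORT`, this still fails with `EADDRINUSE` when another instance keeps the port for the whole retry window, so a real "already running" conflict stays visible.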