fix: teardown blocks indefinitely when a reconnect thread is active and the robot is unreachable (#368)#519
Open
urrsk wants to merge 9 commits into
Open
Conversation
TCPSocket::setup() slept for the full reconnection_time between consecutive failed connect() attempts with no way to be woken early. When RTDEClient's reconnect thread was in this sleep, ~RTDEClient() would block indefinitely in reconnecting_thread_.join() even though stop_reconnection_ had been set. Replace the monolithic sleep_for(reconnection_time) with a 100 ms-sliced loop that exits as soon as state_ transitions to SocketState::Closed. URStream::disconnect() (called by RTDEClient::disconnect()) already calls TCPSocket::close() which sets state_ = Closed, so no new API is needed. Also reset state_ to Invalid at the top of setup() (after the Connected guard) so that a Closed state left by a previous disconnect() cannot prematurely terminate the sliced sleep on a brand-new connection attempt. Co-authored-by: Rune Søe-Knudsen <urrsk@users.noreply.github.com>
…isconnect stream Two related fixes for RTDEClient teardown: 1. ~RTDEClient() now calls disconnect() before joining reconnecting_thread_. Previously the join came first, so the reconnect thread could be sleeping inside TCPSocket::setup()'s between-attempt sleep when ~RTDEClient() ran, causing the join to block until the sleep completed (up to reconnection_timeout, default 10 s, and indefinitely with max_connection_attempts=0). Moving disconnect() before join() triggers the new sliced-sleep exit path in TCPSocket::setup() so the thread wakes within 100 ms. 2. RTDEClient::disconnect() now calls stream_.disconnect() and writer_.stop() unconditionally, regardless of client_state_. Previously, a failed negotiateProtocolVersion() would reset client_state_ to UNINITIALIZED, causing the conditional guard to skip stream_.disconnect(). The stream was then left in SocketState::Connected, and the next init() retry's TCPSocket::setup() would return false immediately (due to the Connected early-return guard), throwing 'Failed to connect to robot' even though the server was reachable. TCPSocket::close() and RTDEWriter::stop() are both idempotent, so removing the guard is safe. Co-authored-by: Rune Søe-Knudsen <urrsk@users.noreply.github.com>
TCPSocketTest.setup_interruptible_by_close (test_tcp_socket.cpp): Unit test that runs without INTEGRATION_TESTS. Starts TCPSocket::setup() with max_num_tries=0 and a 5 s reconnection_timeout against a non-listening port, then calls close() from the main thread and asserts the background thread joins within 2 s. Directly exercises the interruptible-sleep fix. RTDEClientTest.destructor_not_blocked_by_stuck_reconnect_thread (test_rtde_client.cpp): Integration-level test using the existing RTDEServer fake. Initialises an RTDEClient with reconnection_timeout=5 s, drops the fake server to trigger the reconnect thread, then asserts ~RTDEClient() completes in < 2 s. The test skips gracefully when the fake server cannot complete the RTDE handshake within the socket read timeout (environment-dependent timing); in that case TCPSocketTest.setup_interruptible_by_close provides full coverage of the underlying fix. Co-authored-by: Rune Søe-Knudsen <urrsk@users.noreply.github.com>
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## master #519 +/- ##
==========================================
+ Coverage 78.87% 79.00% +0.13%
==========================================
Files 116 115 -1
Lines 6612 6825 +213
Branches 2920 3004 +84
==========================================
+ Hits 5215 5392 +177
- Misses 1031 1050 +19
- Partials 366 383 +17
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Harness. |
…erruptible producer backoff Extends the RTDEClient teardown fix to the PrimaryClient pipeline, which is the path used by UrDriver. ~PrimaryClient() could block indefinitely (up to the unbounded reconnect timeout) when the robot dropped the primary connection at teardown time. Root cause: on connection loss TCPSocket::read() leaves the socket in SocketState::Disconnected, so URProducer::tryGetImpl() enters its reconnect loop (sleep backoff + stream_.connect() with unlimited retries). ~PrimaryClient() and PrimaryClient::stop() joined the producer thread (pipeline_->stop()) without first closing the stream, so the join blocked for the full reconnect duration. Two related fixes: 1. ~PrimaryClient() and PrimaryClient::stop() now call stream_.close() BEFORE pipeline_->stop(). stream_.close() sets SocketState::Closed, waking a producer stuck in its reconnect path so the join returns promptly. Previously stop() closed the stream after the join (too late) and the destructor never closed it. 2. URProducer::tryGetImpl()'s reconnect backoff slept sleep_for(timeout_) (growing up to 120 s) with no cancellation point. It now sleeps in 100 ms slices and bails out as soon as running_ becomes false or the stream is closed, mirroring the TCPSocket::setup() interruptible-sleep fix. test: add PrimaryClientReconnectTest.destructor_not_blocked_by_stuck_reconnect_thread (test_primary_client_reconnect.cpp). The PrimaryClient counterpart of RTDEClientTest.destructor_not_blocked_by_stuck_reconnect_thread, using the in-process FakePrimaryServer so it runs without a robot (unlike the INTEGRATION_TESTS-gated primary_client_test_headless). It starts a client against the fake server, drops the server to drive the producer into its reconnect loop, then asserts ~PrimaryClient() completes in < 2 s. Verified to hang without the fix and pass in ~1.5 s with it. Co-authored-by: Rune Søe-Knudsen <urrsk@users.noreply.github.com>
1 task
Replace the state-based, sleep-polling reconnect interruption with a dedicated, sticky cancellation flag (stop_requested_) plus requestStop()/clearStop(). The flag is orthogonal to SocketState, so setup()'s internal state resets can no longer race away a teardown signal -- the lost-wakeup that hung ~PrimaryClient() (and the Windows CI test) indefinitely. setup() now connects with a non-blocking socket polled in short slices (poll()/WSAPoll()), so a connect attempt against a genuinely unreachable host is interruptible too -- not just the between-attempt back-off. This makes the destructors return promptly instead of blocking for the reconnection timeout (or forever with unlimited attempts). - TCPSocket: add stop_requested_ + requestStop()/clearStop(); interruptible, non-blocking openInterruptible(); honor the flag in setup()'s connect loop and back-off. - PrimaryClient/RTDEClient: call requestStop() before joining the reconnect thread; clear the flag on (re)start (URProducer::setupProducer, RTDEClient::init). - tests: drive setup() interruption via requestStop(); add a non-routable-address blocking-connect regression test; guard the teardown tests with a watchdog; add CTest TIMEOUT properties so a hang fails fast instead of timing out the job.
reconnectStream() did stream_.close() + stream_.connect() without clearing the sticky stop_requested_ cancellation flag introduced in 5b22c97 ("make connection attempts cancellable to unblock teardown"). PrimaryClient::stop()/~PrimaryClient() set the flag via stream_.requestStop() so a producer thread stuck in TCPSocket::setup() aborts promptly at teardown. The flag is sticky and only cleared on (re)start (URProducer::setupProducer, RTDEClient::init). A deliberate reconnect via reconnectStream() never cleared it, so after stopPrimaryClientCommunication() any resendRobotProgram() call failed: TCPSocket::setup() bailed out at the up-front stop_requested_ check and logged "Failed to reconnect primary stream!". This regressed UrDriverTest.send_robot_program_retry_on_failure on every integration runner. Clear the flag in reconnectStream() before connecting, mirroring URProducer::setupProducer(); a deliberate reconnect is a restart and must not honor a stale teardown cancellation. Co-authored-by: Rune Søe-Knudsen <urrsk@users.noreply.github.com>
…imit select() cannot watch file descriptors whose number is >= FD_SETSIZE (1024 on glibc). When the hosting process holds many descriptors (e.g. a JVM such as MATLAB's), accepted socket FDs exceed that limit; the previous code either rejected the connection (the set_size_exceeded guard added for select()) or risked the "bit out of range" fd_set crash. Replace the select()/fd_set machinery in TCPServer with poll() (WSAPoll() on Windows), which has no FD_SETSIZE limitation. The pollfd set is rebuilt each spin() from the listen socket plus client_fds_ (already tracked), so the masterfds_/tempfds_/maxfd_ members and the FD_SET/FD_CLR/FD_ZERO/FD_ISSET and set_size_exceeded code are removed entirely. tests: add two TCPServer regression tests - services_client_with_high_fd_number (POSIX): consumes low FDs so the accepted client FD exceeds FD_SETSIZE, then asserts connect, bidirectional data and disconnect all work. Fails on select(), passes on poll(). - receives_from_many_concurrent_clients: many clients send simultaneously; asserts the server observes activity on every client FD (guards the poll() revents loop).
…tuck reconnect Covers the second symptom of issue #368: PrimaryClient::stop() (the implementation behind UrDriver::stopPrimaryClientCommunication()) must return promptly when the producer thread is stuck in its reconnect loop against an unreachable robot, instead of blocking on the pipeline join. Also asserts the stop()/start() restart path reconnects, verifying the sticky cancellation flag is cleared via clearStop() on (re)start.
Replace the stop_requested_ flag and requestStop()/clearStop() API with an
enriched SocketState machine (Connecting/Connected/Reconnecting/LostConnection/
Disconnecting/Disconnected). The public lifecycle is now just connect() and
disconnect(): connect() implicitly clears a prior deliberate stop, so callers no
longer have to remember a separate clear step.
The deliberate-stop set {Disconnecting, Disconnected} is sticky and is never
overwritten by the connect/retry loop (setup()) or close(), and is cleared only
by an explicit connect(). The in-loop auto-reconnect (reconnect()) never clears
it, keeping ~PrimaryClient()/~RTDEClient() teardown race-free (the original
#368 fix). The transient-drop state is renamed Disconnected -> LostConnection;
SocketState::Disconnected is repurposed for the deliberate stop.
Update URStream, URProducer, PrimaryClient and RTDEClient to the new API, and
adjust the affected unit tests.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
Tearing down a client while its reconnect thread is active blocks for an arbitrarily long time — up to the full
reconnection_timeout, or indefinitely whenmax_connection_attempts == 0(the default) and the robot is unreachable.Affected teardown paths:
~RTDEClient()~PrimaryClient()andPrimaryClient::stop()(the latter is whatUrDriver::stopPrimaryClientCommunication()calls)~UrDriver(), which owns both clients as membersFixes #368 ("UrDriver destructor hangs if arm is unreachable"). The repeated
Failed to connect to robot ... Retrying in 10 secondslog in that issue is exactly the reconnect loop this PR makes interruptible.Root cause
TCPSocket::setup()retriesconnect()with no upper bound whenmax_num_tries == 0. Both the blockingconnect()itself and the monolithicsleep_for(reconnection_time)between attempts were uninterruptible, so a thread sitting insetup()could not be woken at teardown.setup()and could not observe the flag — so the join blocked.SocketState::Closedis racy:setup()resetsstate_on every attempt, so a teardown signal carried purely by the socket state could be erased by a concurrent reset.RTDEClient::disconnect()guardedstream_.disconnect()onclient_state_ > UNINITIALIZED. A failednegotiateProtocolVersion()resetsclient_state_toUNINITIALIZED, leaving the stream inSocketState::Connected; the nextsetup()then returnedfalsevia itsConnectedguard even though the server was reachable.Fix
1.
TCPSocket— sticky, race-free cancellation + interruptible connect (src/comm/tcp_socket.cpp,include/ur_client_library/comm/tcp_socket.h)requestStop()/clearStop()backed by an atomicstop_requested_flag that is orthogonal tostate_.requestStop()sets the flag and closes the socket; the flag stays set untilclearStop(), so it cannot be clobbered bysetup()'s internalstate_resets.connect()withopenInterruptible(): a non-blocking connect polled in 100 ms slices (poll()/WSAPoll()), re-checkingstop_requested_each slice. This aborts a connect to a genuinely unreachable host promptly instead of waiting out the OS connect timeout.stop_requested_.2.
URProducer— interruptible reconnect backoff (include/ur_client_library/comm/producer.h)running_ == false) or the stream is closed.clearStop()insetupProducer()so a stream that was stopped at a previous teardown can be reused on (re)start.3.
RTDEClient— stop the stream before joining (src/rtde/rtde_client.cpp)~RTDEClient()now callsrequestStop()+disconnect()before joiningreconnecting_thread_, so a thread stuck insetup()aborts within one poll slice.disconnect()now disconnects the stream and stops the writer unconditionally (both are idempotent), fixing the case where a failed handshake left the streamConnectedand blocked re-initialization.init()clears the stop flag before connecting.4.
PrimaryClient— stop the stream before joining (src/primary/primary_client.cpp)~PrimaryClient()andstop()callrequestStop()beforepipeline_->stop(), so a producer stuck in its reconnect path is aborted and the pipeline join returns promptly.reconnectStream()clears the stop flag so a deliberate reconnect after astop()works.5.
TCPServer—poll()instead ofselect()(src/comm/tcp_server.cpp,include/ur_client_library/comm/tcp_server.h)select()withpoll(), removing theFD_SETSIZElimit so the server keeps functioning with high-numbered file descriptors and many concurrent clients.Tests
All of the following run in the normal (non-
INTEGRATION_TESTS) build using in-process fakes — no robot required.TCPSocketTest.setup_interruptible_by_close—setup()is interrupted during the between-attempt wait.TCPSocketTest.setup_interruptible_during_blocking_connect—setup()is interrupted while blocked inconnect()to an unreachable host (the direct UrDriver destructor hangs if arm is unreachable #368 case).RTDEClientTest.destructor_not_blocked_by_stuck_reconnect_thread—~RTDEClient()returns in well under 2 s with a stuck reconnect thread.PrimaryClientReconnectTest.destructor_not_blocked_by_stuck_reconnect_thread—~PrimaryClient()returns promptly.PrimaryClientReconnectTest.stop_not_blocked_by_stuck_reconnect_thread—PrimaryClient::stop()returns promptly, and a subsequentstart()reconnects (verifies theclearStop()reuse path).TCPServerTest.services_client_with_high_fd_number/TCPServerTest.receives_from_many_concurrent_clients— exercise thepoll()migration.