fix: teardown blocks indefinitely when a reconnect thread is active and the robot is unreachable (#368) by urrsk · Pull Request #519 · UniversalRobots/Universal_Robots_Client_Library

urrsk · 2026-06-11T13:32:45Z

Problem

Tearing down a client while its reconnect thread is active blocks for an arbitrarily long time — up to the full reconnection_timeout, or indefinitely when max_connection_attempts == 0 (the default) and the robot is unreachable.

Affected teardown paths:

~RTDEClient()
~PrimaryClient() and PrimaryClient::stop() (the latter is what UrDriver::stopPrimaryClientCommunication() calls)
transitively ~UrDriver(), which owns both clients as members

Fixes #368 ("UrDriver destructor hangs if arm is unreachable"). The repeated Failed to connect to robot ... Retrying in 10 seconds log in that issue is exactly the reconnect loop this PR makes interruptible.

Root cause

TCPSocket::setup() retries connect() with no upper bound when max_num_tries == 0. Both the blocking connect() itself and the monolithic sleep_for(reconnection_time) between attempts were uninterruptible, so a thread sitting in setup() could not be woken at teardown.
The client destructors set a stop flag and then immediately joined the reconnect thread, but that thread was blocked inside setup() and could not observe the flag — so the join blocked.
Signalling cancellation only through SocketState::Closed is racy: setup() resets state_ on every attempt, so a teardown signal carried purely by the socket state could be erased by a concurrent reset.
RTDEClient::disconnect() guarded stream_.disconnect() on client_state_ > UNINITIALIZED. A failed negotiateProtocolVersion() resets client_state_ to UNINITIALIZED, leaving the stream in SocketState::Connected; the next setup() then returned false via its Connected guard even though the server was reachable.

Fix

1. `TCPSocket` — sticky, race-free cancellation + interruptible connect (`src/comm/tcp_socket.cpp`, `include/ur_client_library/comm/tcp_socket.h`)

Add requestStop() / clearStop() backed by an atomic stop_requested_ flag that is orthogonal to state_. requestStop() sets the flag and closes the socket; the flag stays set until clearStop(), so it cannot be clobbered by setup()'s internal state_ resets.
Replace the blocking connect() with openInterruptible(): a non-blocking connect polled in 100 ms slices (poll() / WSAPoll()), re-checking stop_requested_ each slice. This aborts a connect to a genuinely unreachable host promptly instead of waiting out the OS connect timeout.
Slice the between-attempt back-off so it also exits on stop_requested_.
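
The reason a separate flag survives where a state-only signal does not can be sketched as follows. This is a minimal illustration, not the library's code: `FakeSocket`, its `State` enum, the 1 ms stand-in for a failed attempt, and the `attempts()` counter are all hypothetical. Only the names `requestStop()`, `clearStop()`, `stop_requested_`, `state_` and `setup()` come from the description above.

```cpp
#include <atomic>
#include <chrono>
#include <thread>

// Hypothetical stand-in for a socket whose setup() retries without an upper bound.
class FakeSocket
{
public:
  enum class State { Invalid, Connected, Closed };

  // Sticky cancellation: the flag stays set until clearStop() is called.
  void requestStop()
  {
    stop_requested_.store(true);
    state_.store(State::Closed);
  }

  void clearStop() { stop_requested_.store(false); }

  // Simulates connecting to an unreachable host: every attempt fails.
  // Returns false once cancelled; this sketch never connects successfully.
  bool setup()
  {
    while (!stop_requested_.load())
    {
      // Per-attempt reset. It may overwrite a Closed state written by another
      // thread, but it cannot touch stop_requested_.
      state_.store(State::Invalid);
      ++attempts_;
      std::this_thread::sleep_for(std::chrono::milliseconds(1));  // pretend the attempt failed
    }
    return false;
  }

  int attempts() const { return attempts_.load(); }
  State state() const { return state_.load(); }

private:
  std::atomic<bool> stop_requested_{ false };
  std::atomic<State> state_{ State::Invalid };
  std::atomic<int> attempts_{ 0 };
};
```

If the loop condition were `state_ != State::Closed` instead, a `requestStop()` landing just before the per-attempt reset would be overwritten and the loop would keep running, which is the race described under "Root cause".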

2. `URProducer` — interruptible reconnect backoff (`include/ur_client_library/comm/producer.h`)

Sleep the reconnect backoff in 100 ms slices, exiting early when the producer is stopped (running_ == false) or the stream is closed.
Call clearStop() in setupProducer() so a stream that was stopped at a previous teardown can be reused on (re)start.
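
A sliced back-off of this kind could be written as a small helper. The sketch below is an assumption about the shape of such a helper, not the code in `producer.h`: `slicedSleep` and its `should_exit` predicate are hypothetical names, and the 100 ms default slice mirrors the figure quoted above.

```cpp
#include <algorithm>
#include <chrono>
#include <functional>
#include <thread>

// Sleeps for `total`, waking every `slice` to evaluate `should_exit`.
// Returns true if the wait was cut short, false if it ran to completion.
inline bool slicedSleep(std::chrono::milliseconds total, const std::function<bool()>& should_exit,
                        std::chrono::milliseconds slice = std::chrono::milliseconds(100))
{
  using Clock = std::chrono::steady_clock;
  const Clock::time_point deadline = Clock::now() + total;
  for (;;)
  {
    if (should_exit())
      return true;
    const Clock::time_point now = Clock::now();
    if (now >= deadline)
      return false;
    // Never sleep past the deadline, so the total wait stays close to `total`.
    std::this_thread::sleep_for(std::min<Clock::duration>(slice, deadline - now));
  }
}
```

A producer could then call something like `slicedSleep(backoff, [&] { return !running; })`, where `running` stands in for whatever stop condition the caller tracks. The worst-case teardown latency is one slice rather than the whole back-off.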

3. `RTDEClient` — stop the stream before joining (`src/rtde/rtde_client.cpp`)

~RTDEClient() now calls requestStop() + disconnect() before joining reconnecting_thread_, so a thread stuck in setup() aborts within one poll slice.
disconnect() now disconnects the stream and stops the writer unconditionally (both are idempotent), fixing the case where a failed handshake left the stream Connected and blocked re-initialization.
init() clears the stop flag before connecting.
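
The ordering that matters in the destructor (signal first, join second) can be illustrated with a self-contained sketch. `ReconnectingClient` and `reconnectLoop()` are hypothetical; the real `~RTDEClient()` calls `requestStop()` and `disconnect()` on its stream rather than setting a member flag directly.

```cpp
#include <atomic>
#include <chrono>
#include <thread>

// Hypothetical client that owns a reconnect thread which never succeeds.
class ReconnectingClient
{
public:
  ReconnectingClient() : reconnecting_thread_([this] { reconnectLoop(); }) {}

  ~ReconnectingClient()
  {
    stop_requested_.store(true);  // 1. make the thread's wait interruptible ...
    if (reconnecting_thread_.joinable())
      reconnecting_thread_.join();  // 2. ... and only then join it
  }

private:
  void reconnectLoop()
  {
    using namespace std::chrono;
    while (!stop_requested_.load())
    {
      // Pretend a connect attempt just failed, then back off for 10 s,
      // re-checking the stop flag every 100 ms.
      const auto deadline = steady_clock::now() + seconds(10);
      while (!stop_requested_.load() && steady_clock::now() < deadline)
        std::this_thread::sleep_for(milliseconds(100));
    }
  }

  // Declared before the thread so it is initialized before the thread starts.
  std::atomic<bool> stop_requested_{ false };
  std::thread reconnecting_thread_;
};
```

With the two destructor steps in this order, destruction takes roughly one slice. If the loop ignored the flag during its back-off, the same `join()` would block for the remainder of the 10 s wait.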

4. `PrimaryClient` — stop the stream before joining (`src/primary/primary_client.cpp`)

~PrimaryClient() and stop() call requestStop() before pipeline_->stop(), so a producer stuck in its reconnect path is aborted and the pipeline join returns promptly.
reconnectStream() clears the stop flag so a deliberate reconnect after a stop() works.

5. `TCPServer` — `poll()` instead of `select()` (`src/comm/tcp_server.cpp`, `include/ur_client_library/comm/tcp_server.h`)

Replace select() with poll(), removing the FD_SETSIZE limit so the server keeps functioning with high-numbered file descriptors and many concurrent clients.
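
For readers unfamiliar with the difference: `poll()` takes an array of descriptors instead of a fixed-size bitset, so descriptor numbers at or above `FD_SETSIZE` are not a problem. The following POSIX-only sketch shows the pattern of rebuilding the `pollfd` array on each call. `readableFds` is a hypothetical helper, not a function from `tcp_server.cpp`, and the Windows `WSAPoll()` variant is not shown.

```cpp
#include <poll.h>
#include <unistd.h>

#include <vector>

// Returns the descriptors from `fds` that have data to read within timeout_ms.
// An empty result means timeout or error.
std::vector<int> readableFds(const std::vector<int>& fds, int timeout_ms)
{
  // Rebuild the pollfd array from the tracked descriptors on every call.
  std::vector<pollfd> pfds;
  pfds.reserve(fds.size());
  for (int fd : fds)
  {
    pollfd p{};
    p.fd = fd;
    p.events = POLLIN;
    pfds.push_back(p);
  }

  std::vector<int> ready;
  if (::poll(pfds.data(), static_cast<nfds_t>(pfds.size()), timeout_ms) <= 0)
    return ready;

  for (const pollfd& p : pfds)
  {
    if (p.revents & POLLIN)
      ready.push_back(p.fd);
  }
  return ready;
}
```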

Tests

All of the following run in the normal (non-INTEGRATION_TESTS) build using in-process fakes — no robot required.

TCPSocketTest.setup_interruptible_by_close — setup() is interrupted during the between-attempt wait.
TCPSocketTest.setup_interruptible_during_blocking_connect — setup() is interrupted while blocked in connect() to an unreachable host (the direct #368 case, "UrDriver destructor hangs if arm is unreachable").
RTDEClientTest.destructor_not_blocked_by_stuck_reconnect_thread — ~RTDEClient() returns in well under 2 s with a stuck reconnect thread.
PrimaryClientReconnectTest.destructor_not_blocked_by_stuck_reconnect_thread — ~PrimaryClient() returns promptly.
PrimaryClientReconnectTest.stop_not_blocked_by_stuck_reconnect_thread — PrimaryClient::stop() returns promptly, and a subsequent start() reconnects (verifies the clearStop() reuse path).
TCPServerTest.services_client_with_high_fd_number / TCPServerTest.receives_from_many_concurrent_clients — exercise the poll() migration.

TCPSocket::setup() slept for the full reconnection_time between consecutive failed connect() attempts with no way to be woken early. When RTDEClient's reconnect thread was in this sleep, ~RTDEClient() would block indefinitely in reconnecting_thread_.join() even though stop_reconnection_ had been set.

Replace the monolithic sleep_for(reconnection_time) with a 100 ms-sliced loop that exits as soon as state_ transitions to SocketState::Closed. URStream::disconnect() (called by RTDEClient::disconnect()) already calls TCPSocket::close() which sets state_ = Closed, so no new API is needed.

Also reset state_ to Invalid at the top of setup() (after the Connected guard) so that a Closed state left by a previous disconnect() cannot prematurely terminate the sliced sleep on a brand-new connection attempt.

Co-authored-by: Rune Søe-Knudsen <urrsk@users.noreply.github.com>

…isconnect stream

Two related fixes for RTDEClient teardown:

1. ~RTDEClient() now calls disconnect() before joining reconnecting_thread_. Previously the join came first, so the reconnect thread could be sleeping inside TCPSocket::setup()'s between-attempt sleep when ~RTDEClient() ran, causing the join to block until the sleep completed (up to reconnection_timeout, default 10 s, and indefinitely with max_connection_attempts=0). Moving disconnect() before join() triggers the new sliced-sleep exit path in TCPSocket::setup() so the thread wakes within 100 ms.

2. RTDEClient::disconnect() now calls stream_.disconnect() and writer_.stop() unconditionally, regardless of client_state_. Previously, a failed negotiateProtocolVersion() would reset client_state_ to UNINITIALIZED, causing the conditional guard to skip stream_.disconnect(). The stream was then left in SocketState::Connected, and the next init() retry's TCPSocket::setup() would return false immediately (due to the Connected early-return guard), throwing 'Failed to connect to robot' even though the server was reachable. TCPSocket::close() and RTDEWriter::stop() are both idempotent, so removing the guard is safe.

Co-authored-by: Rune Søe-Knudsen <urrsk@users.noreply.github.com>

TCPSocketTest.setup_interruptible_by_close (test_tcp_socket.cpp): Unit test that runs without INTEGRATION_TESTS. Starts TCPSocket::setup() with max_num_tries=0 and a 5 s reconnection_timeout against a non-listening port, then calls close() from the main thread and asserts the background thread joins within 2 s. Directly exercises the interruptible-sleep fix.

RTDEClientTest.destructor_not_blocked_by_stuck_reconnect_thread (test_rtde_client.cpp): Integration-level test using the existing RTDEServer fake. Initialises an RTDEClient with reconnection_timeout=5 s, drops the fake server to trigger the reconnect thread, then asserts ~RTDEClient() completes in < 2 s. The test skips gracefully when the fake server cannot complete the RTDE handshake within the socket read timeout (environment-dependent timing); in that case TCPSocketTest.setup_interruptible_by_close provides full coverage of the underlying fix.

Co-authored-by: Rune Søe-Knudsen <urrsk@users.noreply.github.com>

codecov · 2026-06-11T13:33:55Z

Codecov Report

❌ Patch coverage is 80.00000% with 22 lines in your changes missing coverage. Please review.
✅ Project coverage is 79.00%. Comparing base (ca25b13) to head (9025946).
⚠️ Report is 5 commits behind head on master.
✅ All tests successful. No failed tests found.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/comm/tcp_socket.cpp | 77.14% | 1 Missing and 15 partials ⚠️ |
| include/ur_client_library/comm/producer.h | 66.66% | 0 Missing and 3 partials ⚠️ |
| src/comm/tcp_server.cpp | 86.66% | 1 Missing and 1 partial ⚠️ |
| include/ur_client_library/comm/stream.h | 85.71% | 0 Missing and 1 partial ⚠️ |

Additional details and impacted files

@@            Coverage Diff             @@
##           master     #519      +/-   ##
==========================================
+ Coverage   78.87%   79.00%   +0.13%     
==========================================
  Files         116      115       -1     
  Lines        6612     6825     +213     
  Branches     2920     3004      +84     
==========================================
+ Hits         5215     5392     +177     
- Misses       1031     1050      +19     
- Partials      366      383      +17

| Flag | Coverage Δ | |
|---|---|---|
| check_version_ur10-3.15.8 | `11.82% <37.27%> (-1.31%)` | ⬇️ |
| check_version_ur10e-10.11.0 | `11.58% <34.54%> (+0.09%)` | ⬆️ |
| check_version_ur10e-5.15.2 | `12.16% <48.18%> (+0.36%)` | ⬆️ |
| check_version_ur12e-10.12.1 | `11.62% <34.54%> (+0.19%)` | ⬆️ |
| check_version_ur12e-5.25.1 | `11.58% <34.54%> (-0.82%)` | ⬇️ |
| check_version_ur15-10.12.1 | `11.58% <34.54%> (+0.14%)` | ⬆️ |
| check_version_ur15-5.25.1 | `11.58% <34.54%> (+0.14%)` | ⬆️ |
| check_version_ur16e-10.12.1 | `11.58% <34.54%> (+0.14%)` | ⬆️ |
| check_version_ur16e-5.25.1 | `11.58% <34.54%> (-0.05%)` | ⬇️ |
| check_version_ur18-10.12.1 | `11.58% <34.54%> (+0.14%)` | ⬆️ |
| check_version_ur18-5.25.1 | `11.58% <34.54%> (-0.05%)` | ⬇️ |
| check_version_ur20-10.12.1 | `11.58% <34.54%> (+0.14%)` | ⬆️ |
| check_version_ur20-5.25.1 | `11.58% <34.54%> (+0.19%)` | ⬆️ |
| check_version_ur3-3.14.3 | `11.82% <37.27%> (-1.29%)` | ⬇️ |
| check_version_ur30-10.12.1 | `11.58% <34.54%> (+0.09%)` | ⬆️ |
| check_version_ur30-5.25.1 | `11.58% <34.54%> (-0.28%)` | ⬇️ |
| check_version_ur3e-10.11.0 | `11.58% <34.54%> (+0.14%)` | ⬆️ |
| check_version_ur3e-5.9.4 | `11.82% <37.27%> (+0.34%)` | ⬆️ |
| check_version_ur5-3.15.8 | `11.82% <37.27%> (-0.62%)` | ⬇️ |
| check_version_ur5e-10.11.0 | `11.58% <34.54%> (+0.19%)` | ⬆️ |
| check_version_ur5e-5.12.8 | `11.62% <34.54%> (-0.23%)` | ⬇️ |
| check_version_ur7e-10.11.0 | `11.58% <34.54%> (+0.19%)` | ⬆️ |
| check_version_ur7e-5.22.2 | `11.78% <37.27%> (-0.03%)` | ⬇️ |
| check_version_ur8long-10.12.1 | `11.62% <34.54%> (+0.19%)` | ⬆️ |
| check_version_ur8long-5.25.1 | `11.78% <37.27%> (+0.20%)` | ⬆️ |
| python_scripts | `75.90% <ø> (ø)` | |
| start_ursim | `83.63% <ø> (-1.57%)` | ⬇️ |
| ur5-3.14.3 | `74.60% <80.00%> (-0.03%)` | ⬇️ |
| ur5e-10.11.0 | `69.41% <80.00%> (+0.30%)` | ⬆️ |
| ur5e-10.12.0 | `70.54% <79.09%> (+0.42%)` | ⬆️ |
| ur5e-10.7.0 | `68.84% <79.09%> (+0.12%)` | ⬆️ |
| ur5e-5.9.4 | `75.38% <79.09%> (+0.15%)` | ⬆️ |

Flags with carried forward coverage won't be shown.

…erruptible producer backoff

Extends the RTDEClient teardown fix to the PrimaryClient pipeline, which is the path used by UrDriver. ~PrimaryClient() could block indefinitely (up to the unbounded reconnect timeout) when the robot dropped the primary connection at teardown time.

Root cause: on connection loss TCPSocket::read() leaves the socket in SocketState::Disconnected, so URProducer::tryGetImpl() enters its reconnect loop (sleep backoff + stream_.connect() with unlimited retries). ~PrimaryClient() and PrimaryClient::stop() joined the producer thread (pipeline_->stop()) without first closing the stream, so the join blocked for the full reconnect duration.

Two related fixes:

1. ~PrimaryClient() and PrimaryClient::stop() now call stream_.close() BEFORE pipeline_->stop(). stream_.close() sets SocketState::Closed, waking a producer stuck in its reconnect path so the join returns promptly. Previously stop() closed the stream after the join (too late) and the destructor never closed it.

2. URProducer::tryGetImpl()'s reconnect backoff slept sleep_for(timeout_) (growing up to 120 s) with no cancellation point. It now sleeps in 100 ms slices and bails out as soon as running_ becomes false or the stream is closed, mirroring the TCPSocket::setup() interruptible-sleep fix.

test: add PrimaryClientReconnectTest.destructor_not_blocked_by_stuck_reconnect_thread (test_primary_client_reconnect.cpp). The PrimaryClient counterpart of RTDEClientTest.destructor_not_blocked_by_stuck_reconnect_thread, using the in-process FakePrimaryServer so it runs without a robot (unlike the INTEGRATION_TESTS-gated primary_client_test_headless). It starts a client against the fake server, drops the server to drive the producer into its reconnect loop, then asserts ~PrimaryClient() completes in < 2 s. Verified to hang without the fix and pass in ~1.5 s with it.

Co-authored-by: Rune Søe-Knudsen <urrsk@users.noreply.github.com>

Replace the state-based, sleep-polling reconnect interruption with a dedicated, sticky cancellation flag (stop_requested_) plus requestStop()/clearStop(). The flag is orthogonal to SocketState, so setup()'s internal state resets can no longer race away a teardown signal -- the lost-wakeup that hung ~PrimaryClient() (and the Windows CI test) indefinitely.

setup() now connects with a non-blocking socket polled in short slices (poll()/WSAPoll()), so a connect attempt against a genuinely unreachable host is interruptible too -- not just the between-attempt back-off. This makes the destructors return promptly instead of blocking for the reconnection timeout (or forever with unlimited attempts).

- TCPSocket: add stop_requested_ + requestStop()/clearStop(); interruptible, non-blocking openInterruptible(); honor the flag in setup()'s connect loop and back-off.
- PrimaryClient/RTDEClient: call requestStop() before joining the reconnect thread; clear the flag on (re)start (URProducer::setupProducer, RTDEClient::init).
- tests: drive setup() interruption via requestStop(); add a non-routable-address blocking-connect regression test; guard the teardown tests with a watchdog; add CTest TIMEOUT properties so a hang fails fast instead of timing out the job.
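
A non-blocking connect polled in slices generally follows the pattern below. This is a POSIX-only sketch under stated assumptions, not the library's `openInterruptible()`: the function name `connectInterruptible`, its signature, the IPv4-only address handling and the `-1` error convention are all hypothetical. A Windows build would use `WSAPoll()` and different error codes.

```cpp
#include <arpa/inet.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <poll.h>
#include <sys/socket.h>
#include <unistd.h>

#include <atomic>
#include <cerrno>
#include <cstdint>

// Returns a connected (still non-blocking) descriptor, or -1 on failure or cancellation.
int connectInterruptible(const char* ipv4, std::uint16_t port, const std::atomic<bool>& stop_requested,
                         int slice_ms = 100)
{
  if (stop_requested.load())
    return -1;  // a stop requested earlier is honored before any work is done

  sockaddr_in addr{};
  addr.sin_family = AF_INET;
  addr.sin_port = htons(port);
  if (::inet_pton(AF_INET, ipv4, &addr.sin_addr) != 1)
    return -1;  // not a valid dotted-quad address

  const int fd = ::socket(AF_INET, SOCK_STREAM, 0);
  if (fd < 0)
    return -1;
  auto fail = [fd]() -> int {
    ::close(fd);
    return -1;
  };

  const int flags = ::fcntl(fd, F_GETFL, 0);
  if (flags < 0 || ::fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0)
    return fail();

  if (::connect(fd, reinterpret_cast<const sockaddr*>(&addr), sizeof(addr)) == 0)
    return fd;  // connected immediately
  if (errno != EINPROGRESS && errno != EINTR)
    return fail();

  for (;;)
  {
    if (stop_requested.load())
      return fail();  // cancelled between slices

    pollfd pfd{};
    pfd.fd = fd;
    pfd.events = POLLOUT;
    const int n = ::poll(&pfd, 1, slice_ms);
    if (n < 0 && errno != EINTR)
      return fail();
    if (n > 0)
    {
      // The socket became writable or errored; SO_ERROR tells which.
      int err = 0;
      socklen_t len = sizeof(err);
      if (::getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &len) != 0 || err != 0)
        return fail();
      return fd;
    }
  }
}
```

The sketch has no overall deadline of its own. It relies on the operating system eventually reporting a timeout through `SO_ERROR`, and on the caller setting `stop_requested` to abort sooner.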

reconnectStream() did stream_.close() + stream_.connect() without clearing the sticky stop_requested_ cancellation flag introduced in 5b22c97 ("make connection attempts cancellable to unblock teardown").

PrimaryClient::stop()/~PrimaryClient() set the flag via stream_.requestStop() so a producer thread stuck in TCPSocket::setup() aborts promptly at teardown. The flag is sticky and only cleared on (re)start (URProducer::setupProducer, RTDEClient::init). A deliberate reconnect via reconnectStream() never cleared it, so after stopPrimaryClientCommunication() any resendRobotProgram() call failed: TCPSocket::setup() bailed out at the up-front stop_requested_ check and logged "Failed to reconnect primary stream!". This regressed UrDriverTest.send_robot_program_retry_on_failure on every integration runner.

Clear the flag in reconnectStream() before connecting, mirroring URProducer::setupProducer(); a deliberate reconnect is a restart and must not honor a stale teardown cancellation.

Co-authored-by: Rune Søe-Knudsen <urrsk@users.noreply.github.com>

…imit

select() cannot watch file descriptors whose number is >= FD_SETSIZE (1024 on glibc). When the hosting process holds many descriptors (e.g. a JVM such as MATLAB's), accepted socket FDs exceed that limit; the previous code either rejected the connection (the set_size_exceeded guard added for select()) or risked the "bit out of range" fd_set crash.

Replace the select()/fd_set machinery in TCPServer with poll() (WSAPoll() on Windows), which has no FD_SETSIZE limitation. The pollfd set is rebuilt each spin() from the listen socket plus client_fds_ (already tracked), so the masterfds_/tempfds_/maxfd_ members and the FD_SET/FD_CLR/FD_ZERO/FD_ISSET and set_size_exceeded code are removed entirely.

tests: add two TCPServer regression tests
- services_client_with_high_fd_number (POSIX): consumes low FDs so the accepted client FD exceeds FD_SETSIZE, then asserts connect, bidirectional data and disconnect all work. Fails on select(), passes on poll().
- receives_from_many_concurrent_clients: many clients send simultaneously; asserts the server observes activity on every client FD (guards the poll() revents loop).

…tuck reconnect

Covers the second symptom of issue #368: PrimaryClient::stop() (the implementation behind UrDriver::stopPrimaryClientCommunication()) must return promptly when the producer thread is stuck in its reconnect loop against an unreachable robot, instead of blocking on the pipeline join. Also asserts the stop()/start() restart path reconnects, verifying the sticky cancellation flag is cleared via clearStop() on (re)start.

Replace the stop_requested_ flag and requestStop()/clearStop() API with an enriched SocketState machine (Connecting/Connected/Reconnecting/LostConnection/Disconnecting/Disconnected). The public lifecycle is now just connect() and disconnect(): connect() implicitly clears a prior deliberate stop, so callers no longer have to remember a separate clear step.

The deliberate-stop set {Disconnecting, Disconnected} is sticky and is never overwritten by the connect/retry loop (setup()) or close(), and is cleared only by an explicit connect(). The in-loop auto-reconnect (reconnect()) never clears it, keeping ~PrimaryClient()/~RTDEClient() teardown race-free (the original #368 fix).

The transient-drop state is renamed Disconnected -> LostConnection; SocketState::Disconnected is repurposed for the deliberate stop. Update URStream, URProducer, PrimaryClient and RTDEClient to the new API, and adjust the affected unit tests.
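
One way to make a set of states sticky is to route every internal transition through a compare-and-swap that refuses to leave the deliberate-stop set. The sketch below illustrates that idea only. The state names are taken from the commit message above, but `StateHolder`, `tryTransition()`, `isDeliberateStop()` and the initial state are hypothetical and say nothing about how the library implements it.

```cpp
#include <atomic>

enum class SocketState
{
  Connecting,
  Connected,
  Reconnecting,
  LostConnection,
  Disconnecting,
  Disconnected
};

constexpr bool isDeliberateStop(SocketState s)
{
  return s == SocketState::Disconnecting || s == SocketState::Disconnected;
}

// Hypothetical holder for the socket state.
class StateHolder
{
public:
  // For internal callers (retry loop, close()): never overwrites a deliberate stop.
  // Returns false if the transition was refused.
  bool tryTransition(SocketState next)
  {
    SocketState current = state_.load();
    while (!isDeliberateStop(current))
    {
      // On failure, compare_exchange_weak reloads `current`, and the loop
      // condition re-checks it, so a concurrent disconnect() always wins.
      if (state_.compare_exchange_weak(current, next))
        return true;
    }
    return false;
  }

  // Explicit connect(): the only path that clears a deliberate stop.
  void connect() { state_.store(SocketState::Connecting); }

  // Explicit disconnect(): enters the sticky set unconditionally.
  void disconnect() { state_.store(SocketState::Disconnected); }

  SocketState state() const { return state_.load(); }

private:
  // Assumption for this sketch: a fresh holder starts out deliberately stopped.
  std::atomic<SocketState> state_{ SocketState::Disconnected };
};
```

A retry loop would call `tryTransition(SocketState::Reconnecting)` before each attempt and give up as soon as it returns false, so a teardown that calls `disconnect()` cannot be undone by a concurrent attempt.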

cursoragent and others added 3 commits June 11, 2026 13:31

urfeex linked an issue Jun 15, 2026 that may be closed by this pull request: UrDriver destructor hangs if arm is unreachable #368 (Open)

urrsk and others added 3 commits June 15, 2026 15:02

urrsk marked this pull request as ready for review June 16, 2026 05:21

urrsk changed the title ~~fix: RTDEClient destructor blocks indefinitely on reconnect thread sleeping in TCPSocket::setup()~~ fix: teardown blocks indefinitely when a reconnect thread is active and the robot is unreachable (#368) Jun 16, 2026

urrsk wants to merge 9 commits into master from fix/rtde-reconnect-thread-blocking-destructor
