
fix: teardown blocks indefinitely when a reconnect thread is active and the robot is unreachable (#368) #519

Open
urrsk wants to merge 9 commits into master from fix/rtde-reconnect-thread-blocking-destructor



@urrsk urrsk commented Jun 11, 2026

Member

Problem

Tearing down a client while its reconnect thread is active blocks for an arbitrarily long time — up to the full reconnection_timeout, or indefinitely when max_connection_attempts == 0 (the default) and the robot is unreachable.

Affected teardown paths:

  • ~RTDEClient()
  • ~PrimaryClient() and PrimaryClient::stop() (the latter is what UrDriver::stopPrimaryClientCommunication() calls)
  • transitively ~UrDriver(), which owns both clients as members

Fixes #368 ("UrDriver destructor hangs if arm is unreachable"). The repeated "Failed to connect to robot ... Retrying in 10 seconds" log message in that issue comes from exactly the reconnect loop that this PR makes interruptible.

Root cause

  • TCPSocket::setup() retries connect() with no upper bound when max_num_tries == 0. Both the blocking connect() itself and the monolithic sleep_for(reconnection_time) between attempts were uninterruptible, so a thread sitting in setup() could not be woken at teardown.
  • The client destructors set a stop flag and then immediately joined the reconnect thread, but that thread was blocked inside setup() and could not observe the flag — so the join blocked.
  • Signalling cancellation only through SocketState::Closed is racy: setup() resets state_ on every attempt, so a teardown signal carried purely by the socket state could be erased by a concurrent reset.
  • RTDEClient::disconnect() guarded stream_.disconnect() on client_state_ > UNINITIALIZED. A failed negotiateProtocolVersion() resets client_state_ to UNINITIALIZED, leaving the stream in SocketState::Connected; the next setup() then returned false via its Connected guard even though the server was reachable.

Fix

1. TCPSocket — sticky, race-free cancellation + interruptible connect (src/comm/tcp_socket.cpp, include/ur_client_library/comm/tcp_socket.h)

  • Add requestStop() / clearStop() backed by an atomic stop_requested_ flag that is orthogonal to state_. requestStop() sets the flag and closes the socket; the flag stays set until clearStop(), so it cannot be clobbered by setup()'s internal state_ resets.
  • Replace the blocking connect() with openInterruptible(): a non-blocking connect polled in 100 ms slices (poll() / WSAPoll()), re-checking stop_requested_ each slice. This aborts a connect to a genuinely unreachable host promptly instead of waiting out the OS connect timeout.
  • Slice the between-attempt back-off so it also exits on stop_requested_.
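The interruptible connect described above can be sketched roughly as follows. This is a minimal, POSIX-only illustration and not the code from this PR: the function name `connectInterruptible`, its signature, and the choice to return a file descriptor are assumptions made for the example. Only the idea (non-blocking connect() polled in 100 ms slices while re-checking a stop flag) comes from the description.

```cpp
// Hypothetical sketch; names and signature are illustrative, not the library's API.
#include <arpa/inet.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <poll.h>
#include <sys/socket.h>
#include <unistd.h>

#include <atomic>
#include <cassert>
#include <cerrno>
#include <chrono>

// Returns a connected fd, or -1 if the attempt failed, timed out, or was cancelled.
int connectInterruptible(const sockaddr_in& addr, const std::atomic<bool>& stop_requested,
                         std::chrono::milliseconds timeout,
                         std::chrono::milliseconds slice = std::chrono::milliseconds(100))
{
  assert(slice.count() > 0);
  if (stop_requested.load())
    return -1;  // cancellation requested before we even started

  int fd = ::socket(AF_INET, SOCK_STREAM, 0);
  if (fd < 0)
    return -1;

  // In non-blocking mode connect() returns immediately, usually with EINPROGRESS.
  const int flags = ::fcntl(fd, F_GETFL, 0);
  if (flags < 0 || ::fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0)
  {
    ::close(fd);
    return -1;
  }

  bool connected = ::connect(fd, reinterpret_cast<const sockaddr*>(&addr), sizeof(addr)) == 0;
  if (!connected && errno != EINPROGRESS)
  {
    ::close(fd);
    return -1;
  }

  const auto deadline = std::chrono::steady_clock::now() + timeout;
  while (!connected)
  {
    // Re-check the stop flag and the deadline once per slice.
    if (stop_requested.load() || std::chrono::steady_clock::now() >= deadline)
    {
      ::close(fd);
      return -1;
    }
    pollfd pfd{ fd, POLLOUT, 0 };
    const int n = ::poll(&pfd, 1, static_cast<int>(slice.count()));
    if (n < 0 && errno != EINTR)
    {
      ::close(fd);
      return -1;
    }
    if (n > 0)
    {
      // The socket became writable or errored; SO_ERROR holds the connect result.
      int err = 0;
      socklen_t len = sizeof(err);
      if (::getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &len) < 0 || err != 0)
      {
        ::close(fd);
        return -1;
      }
      connected = true;
    }
  }

  ::fcntl(fd, F_SETFL, flags);  // restore the original (blocking) mode
  return fd;
}
```

With this shape, a teardown path only needs to set the flag; the connecting thread notices it within one slice regardless of the OS connect timeout.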

2. URProducer — interruptible reconnect backoff (include/ur_client_library/comm/producer.h)

  • Sleep the reconnect backoff in 100 ms slices, exiting early when the producer is stopped (running_ == false) or the stream is closed.
  • Call clearStop() in setupProducer() so a stream that was stopped at a previous teardown can be reused on (re)start.
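A sliced backoff of the kind described here might look like the sketch below. The helper name `sleepInterruptible` and its parameters are hypothetical; the PR only states that the backoff is slept in 100 ms slices and exits early when `running_` becomes false or the stream is closed. The stream-closed check is omitted to keep the example small.

```cpp
// Hypothetical helper; not the actual URProducer code.
#include <algorithm>
#include <atomic>
#include <cassert>
#include <chrono>
#include <thread>

// Sleeps for up to `total`, waking every `slice` to re-check `running`.
// Returns true if the full backoff elapsed, false if it was cut short.
bool sleepInterruptible(std::chrono::milliseconds total, const std::atomic<bool>& running,
                        std::chrono::milliseconds slice = std::chrono::milliseconds(100))
{
  assert(slice.count() > 0);
  const auto deadline = std::chrono::steady_clock::now() + total;
  while (running.load())
  {
    const auto now = std::chrono::steady_clock::now();
    if (now >= deadline)
      return true;  // backoff completed without interruption
    const auto remaining = std::chrono::duration_cast<std::chrono::milliseconds>(deadline - now);
    // Never sleep past the deadline, and never sleep for 0 ms (avoids a busy loop).
    std::this_thread::sleep_for(std::min(slice, std::max(remaining, std::chrono::milliseconds(1))));
  }
  return false;  // the producer was stopped while backing off
}
```

The worst-case teardown latency of such a loop is one slice, independent of how large the backoff has grown.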

3. RTDEClient — stop the stream before joining (src/rtde/rtde_client.cpp)

  • ~RTDEClient() now calls requestStop() + disconnect() before joining reconnecting_thread_, so a thread stuck in setup() aborts within one poll slice.
  • disconnect() now disconnects the stream and stops the writer unconditionally (both are idempotent), fixing the case where a failed handshake left the stream Connected and blocked re-initialization.
  • init() clears the stop flag before connecting.
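The ordering change in the destructor (signal first, join second) can be illustrated with a small stand-alone class. `ReconnectingClient` below is a hypothetical stand-in for the real client, and the "connect attempt" is modelled as a counter increment against a peer that is assumed to be unreachable.

```cpp
// Hypothetical stand-in for a client with a reconnect thread; not RTDEClient itself.
#include <atomic>
#include <cassert>
#include <chrono>
#include <thread>

class ReconnectingClient
{
public:
  explicit ReconnectingClient(std::chrono::milliseconds backoff)
    : backoff_(backoff), worker_([this] { reconnectLoop(); })
  {
    assert(backoff.count() >= 0);
  }

  ~ReconnectingClient()
  {
    // Signal cancellation BEFORE joining; joining first would wait out the backoff.
    stop_requested_.store(true);
    if (worker_.joinable())
      worker_.join();
  }

  int attempts() const
  {
    return attempts_.load();
  }

private:
  void reconnectLoop()
  {
    while (!stop_requested_.load())
    {
      ++attempts_;  // stands in for a failed connect attempt
      // Sliced backoff: re-check the stop flag every 100 ms.
      const auto deadline = std::chrono::steady_clock::now() + backoff_;
      while (!stop_requested_.load() && std::chrono::steady_clock::now() < deadline)
        std::this_thread::sleep_for(std::chrono::milliseconds(100));
    }
  }

  std::atomic<bool> stop_requested_{ false };
  std::atomic<int> attempts_{ 0 };
  const std::chrono::milliseconds backoff_;
  std::thread worker_;  // declared last so the flags exist before the thread starts
};
```

Destroying an instance with a 5 s backoff should return within roughly one slice, which is the property the new destructor tests assert.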

4. PrimaryClient — stop the stream before joining (src/primary/primary_client.cpp)

  • ~PrimaryClient() and stop() call requestStop() before pipeline_->stop(), so a producer stuck in its reconnect path is aborted and the pipeline join returns promptly.
  • reconnectStream() clears the stop flag so a deliberate reconnect after a stop() works.
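Why a deliberate reconnect has to clear the flag can be seen in a toy model of the sticky cancellation. The class `CancellableSocket`, its `State` enum, and the `peer_reachable` parameter are assumptions for illustration; only `requestStop()`, `clearStop()`, `setup()`, and the per-attempt state reset are taken from the description.

```cpp
// Toy model of a sticky stop flag that is orthogonal to the socket state.
#include <atomic>
#include <cassert>

class CancellableSocket
{
public:
  enum class State { Invalid, Connected, Closed };

  void requestStop()
  {
    stop_requested_.store(true);  // sticky until clearStop()
    state_.store(State::Closed);
  }

  void clearStop()
  {
    stop_requested_.store(false);
  }

  // Models setup(): state_ is reset on every attempt, the stop flag is never touched.
  bool setup(bool peer_reachable, int max_attempts)
  {
    assert(max_attempts > 0);
    for (int i = 0; i < max_attempts; ++i)
    {
      if (stop_requested_.load())
        return false;  // a stale or fresh stop request aborts the attempt
      state_.store(State::Invalid);  // this reset would erase a state-only stop signal
      if (peer_reachable)
      {
        state_.store(State::Connected);
        return true;
      }
    }
    return false;
  }

  State state() const
  {
    return state_.load();
  }

private:
  std::atomic<bool> stop_requested_{ false };
  std::atomic<State> state_{ State::Invalid };
};
```

In this model, calling `setup()` after `requestStop()` fails even though the peer is reachable, and succeeds again once `clearStop()` has been called, which mirrors the behaviour the `reconnectStream()` change is meant to restore.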

5. TCPServer: poll() instead of select() (src/comm/tcp_server.cpp, include/ur_client_library/comm/tcp_server.h)

  • Replace select() with poll(), removing the FD_SETSIZE limit so the server keeps functioning with high-numbered file descriptors and many concurrent clients.
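As a rough, POSIX-only illustration of the poll()-based approach, the sketch below rebuilds a `pollfd` set from a listen descriptor plus a list of client descriptors and reports which ones have activity. The function name `readableFds` and its return type are hypothetical and do not reflect the actual TCPServer interface.

```cpp
// Hypothetical helper showing poll() over a listen fd plus client fds.
#include <poll.h>
#include <unistd.h>

#include <cassert>
#include <vector>

// Returns the descriptors that are readable (or hung up / in error) within timeout_ms.
std::vector<int> readableFds(int listen_fd, const std::vector<int>& client_fds, int timeout_ms)
{
  assert(listen_fd >= 0);

  // The set is rebuilt on every call; unlike fd_set it has no FD_SETSIZE ceiling.
  std::vector<pollfd> pfds;
  pfds.reserve(client_fds.size() + 1);
  pfds.push_back(pollfd{ listen_fd, POLLIN, 0 });
  for (int fd : client_fds)
    pfds.push_back(pollfd{ fd, POLLIN, 0 });

  std::vector<int> ready;
  const int n = ::poll(pfds.data(), static_cast<nfds_t>(pfds.size()), timeout_ms);
  if (n <= 0)
    return ready;  // timeout or error: nothing to service

  for (const pollfd& p : pfds)
  {
    if (p.revents & (POLLIN | POLLHUP | POLLERR))
      ready.push_back(p.fd);
  }
  return ready;
}
```

Because `pollfd` carries the descriptor value itself rather than a bit index, a descriptor numbered above 1024 is handled the same way as any other.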

Tests

All of the following run in the normal (non-INTEGRATION_TESTS) build using in-process fakes — no robot required.

  • TCPSocketTest.setup_interruptible_by_close: setup() is interrupted during the between-attempt wait.
  • TCPSocketTest.setup_interruptible_during_blocking_connect: setup() is interrupted while blocked in connect() to an unreachable host (the direct #368 case, "UrDriver destructor hangs if arm is unreachable").
  • RTDEClientTest.destructor_not_blocked_by_stuck_reconnect_thread: ~RTDEClient() returns in well under 2 s with a stuck reconnect thread.
  • PrimaryClientReconnectTest.destructor_not_blocked_by_stuck_reconnect_thread: ~PrimaryClient() returns promptly.
  • PrimaryClientReconnectTest.stop_not_blocked_by_stuck_reconnect_thread: PrimaryClient::stop() returns promptly, and a subsequent start() reconnects (verifies the clearStop() reuse path).
  • TCPServerTest.services_client_with_high_fd_number / TCPServerTest.receives_from_many_concurrent_clients: exercise the poll() migration.
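Teardown tests of this kind need a guard so that a regression shows up as a failed assertion rather than a hung test run. One possible shape for such a guard is sketched below; `completesWithin` is a hypothetical helper and not necessarily how the tests in this PR implement their watchdog.

```cpp
// Hypothetical watchdog helper for "returns promptly" assertions.
#include <cassert>
#include <chrono>
#include <future>
#include <memory>
#include <thread>
#include <utility>

// Runs fn on a detached thread and reports whether it finished before the deadline.
// A detached thread is used so that a hung fn cannot block the caller's own teardown.
template <typename Fn>
bool completesWithin(Fn fn, std::chrono::milliseconds deadline)
{
  assert(deadline.count() >= 0);
  auto done = std::make_shared<std::promise<void>>();
  std::future<void> finished = done->get_future();
  std::thread([done, fn = std::move(fn)]() mutable {
    fn();
    done->set_value();  // only reached if fn returns
  }).detach();
  return finished.wait_for(deadline) == std::future_status::ready;
}
```

A test could wrap the destructor call in the lambda and assert that the helper returns true for a 2 s deadline. Note that anything the lambda captures by reference must outlive the detached thread, which matters when the function under test really does hang.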

cursoragent and others added 3 commits June 11, 2026 13:31
TCPSocket::setup() slept for the full reconnection_time between consecutive
failed connect() attempts with no way to be woken early.  When RTDEClient's
reconnect thread was in this sleep, ~RTDEClient() would block indefinitely
in reconnecting_thread_.join() even though stop_reconnection_ had been set.

Replace the monolithic sleep_for(reconnection_time) with a 100 ms-sliced
loop that exits as soon as state_ transitions to SocketState::Closed.
URStream::disconnect() (called by RTDEClient::disconnect()) already calls
TCPSocket::close() which sets state_ = Closed, so no new API is needed.

Also reset state_ to Invalid at the top of setup() (after the Connected
guard) so that a Closed state left by a previous disconnect() cannot
prematurely terminate the sliced sleep on a brand-new connection attempt.

Co-authored-by: Rune Søe-Knudsen <urrsk@users.noreply.github.com>
…isconnect stream

Two related fixes for RTDEClient teardown:

1. ~RTDEClient() now calls disconnect() before joining reconnecting_thread_.
   Previously the join came first, so the reconnect thread could be sleeping
   inside TCPSocket::setup()'s between-attempt sleep when ~RTDEClient() ran,
   causing the join to block until the sleep completed (up to reconnection_timeout,
   default 10 s, and indefinitely with max_connection_attempts=0).
   Moving disconnect() before join() triggers the new sliced-sleep exit path in
   TCPSocket::setup() so the thread wakes within 100 ms.

2. RTDEClient::disconnect() now calls stream_.disconnect() and writer_.stop()
   unconditionally, regardless of client_state_.  Previously, a failed
   negotiateProtocolVersion() would reset client_state_ to UNINITIALIZED, causing
   the conditional guard to skip stream_.disconnect().  The stream was then left
   in SocketState::Connected, and the next init() retry's TCPSocket::setup()
   would return false immediately (due to the Connected early-return guard),
   throwing 'Failed to connect to robot' even though the server was reachable.
   TCPSocket::close() and RTDEWriter::stop() are both idempotent, so removing
   the guard is safe.

Co-authored-by: Rune Søe-Knudsen <urrsk@users.noreply.github.com>
TCPSocketTest.setup_interruptible_by_close (test_tcp_socket.cpp):
  Unit test that runs without INTEGRATION_TESTS.  Starts TCPSocket::setup()
  with max_num_tries=0 and a 5 s reconnection_timeout against a non-listening
  port, then calls close() from the main thread and asserts the background
  thread joins within 2 s.  Directly exercises the interruptible-sleep fix.

RTDEClientTest.destructor_not_blocked_by_stuck_reconnect_thread (test_rtde_client.cpp):
  Integration-level test using the existing RTDEServer fake.  Initialises an
  RTDEClient with reconnection_timeout=5 s, drops the fake server to trigger
  the reconnect thread, then asserts ~RTDEClient() completes in < 2 s.
  The test skips gracefully when the fake server cannot complete the RTDE
  handshake within the socket read timeout (environment-dependent timing);
  in that case TCPSocketTest.setup_interruptible_by_close provides full
  coverage of the underlying fix.

Co-authored-by: Rune Søe-Knudsen <urrsk@users.noreply.github.com>

codecov Bot commented Jun 11, 2026


Codecov Report

❌ Patch coverage is 80.00000% with 22 lines in your changes missing coverage. Please review.
✅ Project coverage is 79.00%. Comparing base (ca25b13) to head (9025946).
⚠️ Report is 5 commits behind head on master.
✅ All tests successful. No failed tests found.

Files with missing lines                     Patch %   Lines
src/comm/tcp_socket.cpp                      77.14%    1 Missing and 15 partials ⚠️
include/ur_client_library/comm/producer.h    66.66%    0 Missing and 3 partials ⚠️
src/comm/tcp_server.cpp                      86.66%    1 Missing and 1 partial ⚠️
include/ur_client_library/comm/stream.h      85.71%    0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #519      +/-   ##
==========================================
+ Coverage   78.87%   79.00%   +0.13%     
==========================================
  Files         116      115       -1     
  Lines        6612     6825     +213     
  Branches     2920     3004      +84     
==========================================
+ Hits         5215     5392     +177     
- Misses       1031     1050      +19     
- Partials      366      383      +17     
Flag Coverage Δ
check_version_ur10-3.15.8 11.82% <37.27%> (-1.31%) ⬇️
check_version_ur10e-10.11.0 11.58% <34.54%> (+0.09%) ⬆️
check_version_ur10e-5.15.2 12.16% <48.18%> (+0.36%) ⬆️
check_version_ur12e-10.12.1 11.62% <34.54%> (+0.19%) ⬆️
check_version_ur12e-5.25.1 11.58% <34.54%> (-0.82%) ⬇️
check_version_ur15-10.12.1 11.58% <34.54%> (+0.14%) ⬆️
check_version_ur15-5.25.1 11.58% <34.54%> (+0.14%) ⬆️
check_version_ur16e-10.12.1 11.58% <34.54%> (+0.14%) ⬆️
check_version_ur16e-5.25.1 11.58% <34.54%> (-0.05%) ⬇️
check_version_ur18-10.12.1 11.58% <34.54%> (+0.14%) ⬆️
check_version_ur18-5.25.1 11.58% <34.54%> (-0.05%) ⬇️
check_version_ur20-10.12.1 11.58% <34.54%> (+0.14%) ⬆️
check_version_ur20-5.25.1 11.58% <34.54%> (+0.19%) ⬆️
check_version_ur3-3.14.3 11.82% <37.27%> (-1.29%) ⬇️
check_version_ur30-10.12.1 11.58% <34.54%> (+0.09%) ⬆️
check_version_ur30-5.25.1 11.58% <34.54%> (-0.28%) ⬇️
check_version_ur3e-10.11.0 11.58% <34.54%> (+0.14%) ⬆️
check_version_ur3e-5.9.4 11.82% <37.27%> (+0.34%) ⬆️
check_version_ur5-3.15.8 11.82% <37.27%> (-0.62%) ⬇️
check_version_ur5e-10.11.0 11.58% <34.54%> (+0.19%) ⬆️
check_version_ur5e-5.12.8 11.62% <34.54%> (-0.23%) ⬇️
check_version_ur7e-10.11.0 11.58% <34.54%> (+0.19%) ⬆️
check_version_ur7e-5.22.2 11.78% <37.27%> (-0.03%) ⬇️
check_version_ur8long-10.12.1 11.62% <34.54%> (+0.19%) ⬆️
check_version_ur8long-5.25.1 11.78% <37.27%> (+0.20%) ⬆️
python_scripts 75.90% <ø> (ø)
start_ursim 83.63% <ø> (-1.57%) ⬇️
ur5-3.14.3 74.60% <80.00%> (-0.03%) ⬇️
ur5e-10.11.0 69.41% <80.00%> (+0.30%) ⬆️
ur5e-10.12.0 70.54% <79.09%> (+0.42%) ⬆️
ur5e-10.7.0 68.84% <79.09%> (+0.12%) ⬆️
ur5e-5.9.4 75.38% <79.09%> (+0.15%) ⬆️


…erruptible producer backoff

Extends the RTDEClient teardown fix to the PrimaryClient pipeline, which is the
path used by UrDriver. ~PrimaryClient() could block indefinitely (up to the
unbounded reconnect timeout) when the robot dropped the primary connection at
teardown time.

Root cause: on connection loss TCPSocket::read() leaves the socket in
SocketState::Disconnected, so URProducer::tryGetImpl() enters its reconnect loop
(sleep backoff + stream_.connect() with unlimited retries). ~PrimaryClient() and
PrimaryClient::stop() joined the producer thread (pipeline_->stop()) without
first closing the stream, so the join blocked for the full reconnect duration.

Two related fixes:

1. ~PrimaryClient() and PrimaryClient::stop() now call stream_.close() BEFORE
   pipeline_->stop(). stream_.close() sets SocketState::Closed, waking a producer
   stuck in its reconnect path so the join returns promptly. Previously stop()
   closed the stream after the join (too late) and the destructor never closed it.

2. URProducer::tryGetImpl()'s reconnect backoff slept sleep_for(timeout_) (growing
   up to 120 s) with no cancellation point. It now sleeps in 100 ms slices and
   bails out as soon as running_ becomes false or the stream is closed, mirroring
   the TCPSocket::setup() interruptible-sleep fix.

test: add PrimaryClientReconnectTest.destructor_not_blocked_by_stuck_reconnect_thread
(test_primary_client_reconnect.cpp). The PrimaryClient counterpart of
RTDEClientTest.destructor_not_blocked_by_stuck_reconnect_thread, using the
in-process FakePrimaryServer so it runs without a robot (unlike the
INTEGRATION_TESTS-gated primary_client_test_headless). It starts a client against
the fake server, drops the server to drive the producer into its reconnect loop,
then asserts ~PrimaryClient() completes in < 2 s. Verified to hang without the fix
and pass in ~1.5 s with it.

Co-authored-by: Rune Søe-Knudsen <urrsk@users.noreply.github.com>
@urfeex urfeex linked an issue Jun 15, 2026 that may be closed by this pull request
1 task
urrsk and others added 3 commits June 15, 2026 15:02
Replace the state-based, sleep-polling reconnect interruption with a dedicated,
sticky cancellation flag (stop_requested_) plus requestStop()/clearStop(). The
flag is orthogonal to SocketState, so setup()'s internal state resets can no
longer race away a teardown signal -- the lost-wakeup that hung ~PrimaryClient()
(and the Windows CI test) indefinitely.

setup() now connects with a non-blocking socket polled in short slices
(poll()/WSAPoll()), so a connect attempt against a genuinely unreachable host is
interruptible too -- not just the between-attempt back-off. This makes the
destructors return promptly instead of blocking for the reconnection timeout
(or forever with unlimited attempts).

- TCPSocket: add stop_requested_ + requestStop()/clearStop(); interruptible,
  non-blocking openInterruptible(); honor the flag in setup()'s connect loop and
  back-off.
- PrimaryClient/RTDEClient: call requestStop() before joining the reconnect
  thread; clear the flag on (re)start (URProducer::setupProducer, RTDEClient::init).
- tests: drive setup() interruption via requestStop(); add a non-routable-address
  blocking-connect regression test; guard the teardown tests with a watchdog;
  add CTest TIMEOUT properties so a hang fails fast instead of timing out the job.
reconnectStream() did stream_.close() + stream_.connect() without clearing
the sticky stop_requested_ cancellation flag introduced in
5b22c97 ("make connection attempts cancellable to unblock teardown").

PrimaryClient::stop()/~PrimaryClient() set the flag via stream_.requestStop()
so a producer thread stuck in TCPSocket::setup() aborts promptly at teardown.
The flag is sticky and only cleared on (re)start (URProducer::setupProducer,
RTDEClient::init). A deliberate reconnect via reconnectStream() never cleared
it, so after stopPrimaryClientCommunication() any resendRobotProgram() call
failed: TCPSocket::setup() bailed out at the up-front stop_requested_ check and
logged "Failed to reconnect primary stream!".

This regressed UrDriverTest.send_robot_program_retry_on_failure on every
integration runner. Clear the flag in reconnectStream() before connecting,
mirroring URProducer::setupProducer(); a deliberate reconnect is a restart and
must not honor a stale teardown cancellation.

Co-authored-by: Rune Søe-Knudsen <urrsk@users.noreply.github.com>
…imit

select() cannot watch file descriptors whose number is >= FD_SETSIZE (1024
on glibc). When the hosting process holds many descriptors (e.g. a JVM such
as MATLAB's), accepted socket FDs exceed that limit; the previous code either
rejected the connection (the set_size_exceeded guard added for select()) or
risked the "bit out of range" fd_set crash.

Replace the select()/fd_set machinery in TCPServer with poll() (WSAPoll() on
Windows), which has no FD_SETSIZE limitation. The pollfd set is rebuilt each
spin() from the listen socket plus client_fds_ (already tracked), so the
masterfds_/tempfds_/maxfd_ members and the FD_SET/FD_CLR/FD_ZERO/FD_ISSET and
set_size_exceeded code are removed entirely.

tests: add two TCPServer regression tests
- services_client_with_high_fd_number (POSIX): consumes low FDs so the
  accepted client FD exceeds FD_SETSIZE, then asserts connect, bidirectional
  data and disconnect all work. Fails on select(), passes on poll().
- receives_from_many_concurrent_clients: many clients send simultaneously;
  asserts the server observes activity on every client FD (guards the poll()
  revents loop).
@urrsk urrsk marked this pull request as ready for review June 16, 2026 05:21
…tuck reconnect

Covers the second symptom of issue #368: PrimaryClient::stop() (the
implementation behind UrDriver::stopPrimaryClientCommunication()) must return
promptly when the producer thread is stuck in its reconnect loop against an
unreachable robot, instead of blocking on the pipeline join. Also asserts the
stop()/start() restart path reconnects, verifying the sticky cancellation flag
is cleared via clearStop() on (re)start.
@urrsk urrsk changed the title from "fix: RTDEClient destructor blocks indefinitely on reconnect thread sleeping in TCPSocket::setup()" to "fix: teardown blocks indefinitely when a reconnect thread is active and the robot is unreachable (#368)" Jun 16, 2026
Replace the stop_requested_ flag and requestStop()/clearStop() API with an
enriched SocketState machine (Connecting/Connected/Reconnecting/LostConnection/
Disconnecting/Disconnected). The public lifecycle is now just connect() and
disconnect(): connect() implicitly clears a prior deliberate stop, so callers no
longer have to remember a separate clear step.

The deliberate-stop set {Disconnecting, Disconnected} is sticky and is never
overwritten by the connect/retry loop (setup()) or close(), and is cleared only
by an explicit connect(). The in-loop auto-reconnect (reconnect()) never clears
it, keeping ~PrimaryClient()/~RTDEClient() teardown race-free (the original
#368 fix). The transient-drop state is renamed Disconnected -> LostConnection;
SocketState::Disconnected is repurposed for the deliberate stop.

Update URStream, URProducer, PrimaryClient and RTDEClient to the new API, and
adjust the affected unit tests.
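
The sticky deliberate-stop rule from this commit message can be modelled in a few lines. The state names below are the ones listed in the message; the class `SocketStateMachine`, the method `tryInternalTransition`, the initial state, and the simplification that `disconnect()` jumps straight to Disconnected are all assumptions made for this sketch.

```cpp
// Simplified model of the enriched state machine; not the library's implementation.
#include <atomic>
#include <cassert>

enum class SocketState { Connecting, Connected, Reconnecting, LostConnection, Disconnecting, Disconnected };

// The deliberate-stop set is sticky: only an explicit connect() may leave it.
constexpr bool isDeliberateStop(SocketState s)
{
  return s == SocketState::Disconnecting || s == SocketState::Disconnected;
}

class SocketStateMachine
{
public:
  // Explicit connect(): implicitly clears a prior deliberate stop.
  void connect()
  {
    state_.store(SocketState::Connecting);
  }

  // Explicit disconnect(): records the deliberate stop (Disconnecting is skipped here).
  void disconnect()
  {
    state_.store(SocketState::Disconnected);
  }

  // Transition requested by the retry loop, auto-reconnect, or close().
  // It is refused once a deliberate stop has been recorded.
  bool tryInternalTransition(SocketState next)
  {
    assert(!isDeliberateStop(next));  // internal code never enters the stop set in this model
    SocketState current = state_.load();
    while (!isDeliberateStop(current))
    {
      // compare_exchange reloads `current` on failure, so a concurrent stop is re-checked.
      if (state_.compare_exchange_weak(current, next))
        return true;
    }
    return false;
  }

  SocketState state() const
  {
    return state_.load();
  }

private:
  std::atomic<SocketState> state_{ SocketState::Disconnected };  // assumed initial state
};
```

The compare-and-swap loop is what keeps a concurrent `disconnect()` from being overwritten by a retry-loop transition, which is the race the earlier flag-based design addressed.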


Development

Successfully merging this pull request may close these issues.

UrDriver destructor hangs if arm is unreachable

2 participants