Skip to content

fix: Prevent client destruction race in services.call_service#1255

Open
JWhitleyWork wants to merge 1 commit into
RobotWebTools:ros2from
PickNikRobotics:fix/services-destroy-client-race-ros2
Open

fix: Prevent client destruction race in services.call_service#1255
JWhitleyWork wants to merge 1 commit into
RobotWebTools:ros2from
PickNikRobotics:fix/services-destroy-client-race-ros2

Conversation

@JWhitleyWork
Copy link
Copy Markdown

@JWhitleyWork JWhitleyWork commented May 18, 2026

Summary

services.call_service and AdvertisedActionHandler both create rclpy entities (rclpy.Client and rclpy.action.ActionServer, respectively) on the rosbridge node directly from a worker thread, while the rclpy executor is concurrently spinning that same node. The executor's wait-set rebuild can race with the worker thread's entity registration / destruction.

The symptom varies by entity:

  • Service clients silently lose their backing handle: the executor ends up holding a Client whose PyCapsule has been freed, and the next rclpy binding call on the affected node raises TypeError: Object of type 'NoneType' is not an instance of 'capsule'. After this trips once, every subsequent call_service on the affected websocket returns result: false and the connection is unrecoverable until the rosbridge process restarts.

  • Action servers crash harder: the worker thread reaches _rclpy.ActionServer(...) in rclpy/action/server.py:__init__ while the executor is mid-wait-set-rebuild and the process SIGSEGVs. This is the actual cause of the test_capabilities hang on Rolling/Lyrical CI — pytest's xunit fires after the segfault, but ctest waits the full 60 s before declaring the test timed out, masking the real signal in run logs.

Approach

Same idea as the IncomingQueue race fixed in #1183: route the lifecycle calls through the executor via executor.create_task(...) so they serialize with the executor's own work instead of competing with it.

Extract the executor-routing pattern into rosbridge_library.internal.executor_helpers — one module, two helpers:

Helper Returns Used for
run_on_executor(node, fn) fn()'s return value Entity creation, where the caller needs the new handle before continuing
schedule_on_executor(node, fn) nothing (fire-and-forget) Entity destruction, where the caller just needs the teardown to happen on the right thread

Both fall back to inline invocation when the node has no attached executor, which keeps unit-test setups (no executor) working unchanged.

What changed

  • services.call_service now routes create_client through run_on_executor (it needs the resulting Client before it can call wait_for_service / call_async). The three destroy_client sites — wait-for-service failure, response timeout, and the successful path — go through schedule_on_executor.
  • AdvertisedActionHandler.__init__ routes ActionServer(...) through run_on_executor. (ActionServer destruction was already routed through the executor in feat: Improve action unadvertising #1248, so only the creation side needed the fix.)
  • ActionTester in test/internal/actions/test_actions.py builds its own ActionServer from the test thread while the fixture's executor is spinning. The same SIGSEGV reproduced there, so its ActionServer(...) call now goes through run_on_executor too.
  • Scope is intentionally limited to the create/destroy lifecycle. The read-only setup calls in call_service (resolve_service_name, get_service_names_and_types) stay on the worker thread.

Regression test

rosbridge_library/test/internal/services/test_stress_service_clients.py exercises the race window:

  • test_many_parallel_service_calls — 40 ServiceCallers in flight at once against the same service.
  • test_barrier_synchronized_service_calls — 20 callers start at the same instant via a threading.Barrier.
  • test_repeated_rapid_service_calls — 100 sequential calls with no settle time between iterations.
  • test_interleaved_waves_of_service_calls — each wave of callers starts while the previous wave is still tearing down its clients.

After each scenario, the test asserts the executor's spin thread is still alive, has logged no errors, and can still process newly submitted tasks. The action-server race is covered by the existing test_action_capabilities and test_actions tests, which previously SIGSEGV'd and now pass.

Back-ports

services.py and advertise_action.py are identical on jazzy and humble; sibling PRs target those branches. Happy to defer those to mergify if preferred.

@JWhitleyWork
Copy link
Copy Markdown
Author

Working on CI failures.

@bjsowa
Copy link
Copy Markdown
Member

bjsowa commented May 18, 2026

Rolling and Lyrical CI is broken for now. There are mypy errors and topic subscription tests fail randomly. This is not related to this PR but I would appreciate any help in fixing these failures.

@JWhitleyWork JWhitleyWork force-pushed the fix/services-destroy-client-race-ros2 branch from f3a1016 to a70f997 Compare May 18, 2026 22:08
@JWhitleyWork
Copy link
Copy Markdown
Author

Rolling and Lyrical CI is broken for now. There are mypy errors and topic subscription tests fail randomly. This is not related to this PR but I would appreciate any help in fixing these failures.

On it. Splitting the CI fixes out into a new PR.

…lities

Both `services.call_service` and `advertise_action`'s `AdvertisedActionHandler`
create rclpy entities (`rclpy.Client` and `rclpy.action.ActionServer`,
respectively) on the rosbridge node directly from a worker thread.  The same
node is concurrently spun by the rclpy executor, so the executor's wait-set
rebuild can race with the worker thread's entity registration / destruction
and either crash inside `_rclpy` (action servers SIGSEGV in
`rclpy/action/server.py:__init__`) or leave the executor holding a handle
whose underlying PyCapsule has been freed (services surface this later as
`TypeError: Object of type 'NoneType' is not an instance of 'capsule'`).

This is the service- and action-capability analogue of the IncomingQueue
race fixed in RobotWebTools#1183: route the lifecycle calls through the executor via
`executor.create_task(...)` so they serialize with the executor's own work
instead of competing with it.

- Extract the executor-routing pattern into
  `rosbridge_library.internal.executor_helpers`, exposing `run_on_executor`
  (synchronous; returns the value, used when the caller needs the entity
  back) and `schedule_on_executor` (fire-and-forget, used for destructors).
  Both fall back to inline invocation when the node has no attached
  executor, preserving unit-test behaviour.
- Apply `run_on_executor` to client creation in `services.call_service` and
  ActionServer creation in `AdvertisedActionHandler.__init__`. The three
  `destroy_client` sites in `services.call_service` now use
  `schedule_on_executor`. ActionServer destruction was already routed
  through the executor in RobotWebTools#1248.
- `ActionTester` in `test_actions.py` constructs its own ActionServer from
  the test thread while the fixture's executor is already spinning. Route
  that creation through `run_on_executor` for the same reason; without it
  the rclpy SIGSEGV reproduces on a workspace with up-to-date rclpy.

`test_stress_service_clients.py` exercises the service race window
(parallel callers, barrier-synchronized starts, rapid sequential cycles,
and interleaved waves of concurrent callers) and asserts the executor
remains healthy and responsive afterwards.
@JWhitleyWork JWhitleyWork force-pushed the fix/services-destroy-client-race-ros2 branch from 3eb66ff to d0a816f Compare May 18, 2026 23:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants