Skip to content

fix: Prevent client destruction race in services.call_service (backport #1255)#1256

Closed
JWhitleyWork wants to merge 1 commit into
RobotWebTools:jazzyfrom
PickNikRobotics:fix/services-destroy-client-race-jazzy
Closed

fix: Prevent client destruction race in services.call_service (backport #1255)#1256
JWhitleyWork wants to merge 1 commit into
RobotWebTools:jazzyfrom
PickNikRobotics:fix/services-destroy-client-race-jazzy

Conversation

@JWhitleyWork
Copy link
Copy Markdown
Contributor

Back-port of #1255 to jazzy. rosbridge_library/internal/services.py is identical between ros2 and jazzy, so the patch cherry-picks cleanly with no conflicts and the diff is the same. See #1255 for the full rationale, root cause analysis, and regression-test description.

In short: services.call_service creates and destroys an rclpy.Client directly from a worker thread, racing the executor that is spinning the same node. The race occasionally leaves the executor referencing a freed PyCapsule, and the next rclpy binding call on the node surfaces as:

TypeError: Object of type 'NoneType' is not an instance of 'capsule'

after which every subsequent call_service on the rosbridge process fails the same way until the process is restarted. The fix routes the create/destroy calls through executor.create_task(...), mirroring the same pattern that #1183 applied to the disconnect path.

A new test/internal/services/test_stress_service_clients.py regression test exercises the race window (parallel callers, barrier-synchronized start, rapid sequential cycles, interleaved waves) and asserts the executor stays healthy and responsive afterwards.

Each invocation of `services.call_service` runs on a worker thread (when
`call_services_in_new_thread` is enabled, the default) and creates and
destroys an rclpy `Client` on the rosbridge node directly from that
thread. The same node is concurrently spun by the rclpy executor, so the
executor's wait-set rebuild can race with the worker thread's
create/destroy, leaving the executor holding a handle whose underlying
PyCapsule has been freed. The next rclpy binding call on the same node
then raises:

    TypeError: Object of type 'NoneType' is not an instance of 'capsule'

After that, every subsequent service call on the affected websocket
returns `result: false` with that message, and the connection is
unrecoverable for the lifetime of the rosbridge process.

This is the service-call analogue of the IncomingQueue race fixed in
RobotWebTools#1183: route the lifecycle calls through the executor via
`executor.create_task(...)` so they serialize with the executor's own
work instead of competing with it.

- Add `_run_on_executor` (synchronous) for client creation, which needs
  the resulting `Client` before continuing.
- Add `_destroy_client_async` (fire-and-forget) for the three
  `destroy_client` sites in the success / wait-for-service-failed /
  response-timeout paths.
- Fall back to inline calls when the node has no attached executor,
  which is the typical unit-test configuration and unchanged behavior.

A new stress test `test_stress_service_clients.py` exercises the race
window (parallel callers, barrier-synchronized start, rapid sequential
cycles, and interleaved waves of concurrent callers) and asserts the
executor remains healthy and responsive afterwards.
@bjsowa
Copy link
Copy Markdown
Member

bjsowa commented May 18, 2026

You don't have to create backports yourself. I run mergify to create backports after merging each PR to ros2 branch, when applicable.

@JWhitleyWork
Copy link
Copy Markdown
Contributor Author

You don't have to create backports yourself. I run mergify to create backports after merging each PR to ros2 branch, when applicable.

I can close these if you want your bot to do it instead. Claude did them for me so it was negligible effort.

@JWhitleyWork
Copy link
Copy Markdown
Contributor Author

Closing because #1255 got some updates for a related fix.

@JWhitleyWork JWhitleyWork deleted the fix/services-destroy-client-race-jazzy branch May 18, 2026 23:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants