fix: Prevent client destruction race in services.call_service (backport #1255)#1256
Closed
JWhitleyWork wants to merge 1 commit into
Closed
Conversation
Each invocation of `services.call_service` runs on a worker thread (when
`call_services_in_new_thread` is enabled, the default) and creates and
destroys an rclpy `Client` on the rosbridge node directly from that
thread. The same node is concurrently spun by the rclpy executor, so the
executor's wait-set rebuild can race with the worker thread's
create/destroy, leaving the executor holding a handle whose underlying
PyCapsule has been freed. The next rclpy binding call on the same node
then raises:
TypeError: Object of type 'NoneType' is not an instance of 'capsule'
After that, every subsequent service call on the affected websocket
returns `result: false` with that message, and the connection is
unrecoverable for the lifetime of the rosbridge process.
This is the service-call analogue of the IncomingQueue race fixed in
RobotWebTools#1183: route the lifecycle calls through the executor via
`executor.create_task(...)` so they serialize with the executor's own
work instead of competing with it.
- Add `_run_on_executor` (synchronous) for client creation, which needs
the resulting `Client` before continuing.
- Add `_destroy_client_async` (fire-and-forget) for the three
`destroy_client` sites in the success / wait-for-service-failed /
response-timeout paths.
- Fall back to inline calls when the node has no attached executor,
which is the typical unit-test configuration and unchanged behavior.
A new stress test `test_stress_service_clients.py` exercises the race
window (parallel callers, barrier-synchronized start, rapid sequential
cycles, and interleaved waves of concurrent callers) and asserts the
executor remains healthy and responsive afterwards.
Member
|
You don't have to create backports yourself. I run mergify to create backports after merging each PR to |
Contributor
Author
I can close these if you want your bot to do it instead. Claude did them for me so it was negligible effort. |
Contributor
Author
|
Closing because #1255 got some updates for a related fix. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Back-port of #1255 to
jazzy.rosbridge_library/internal/services.pyis identical betweenros2andjazzy, so the patch cherry-picks cleanly with no conflicts and the diff is the same. See #1255 for the full rationale, root cause analysis, and regression-test description.In short:
services.call_servicecreates and destroys anrclpy.Clientdirectly from a worker thread, racing the executor that is spinning the same node. The race occasionally leaves the executor referencing a freed PyCapsule, and the next rclpy binding call on the node surfaces as:after which every subsequent
call_serviceon the rosbridge process fails the same way until the process is restarted. The fix routes the create/destroy calls throughexecutor.create_task(...), mirroring the same pattern that #1183 applied to the disconnect path.A new
test/internal/services/test_stress_service_clients.pyregression test exercises the race window (parallel callers, barrier-synchronized start, rapid sequential cycles, interleaved waves) and asserts the executor stays healthy and responsive afterwards.