fix: Prevent client destruction race in services.call_service#1255
Open
JWhitleyWork wants to merge 1 commit into
Open
fix: Prevent client destruction race in services.call_service#1255JWhitleyWork wants to merge 1 commit into
JWhitleyWork wants to merge 1 commit into
Conversation
This was referenced May 18, 2026
Author
|
Working on CI failures. |
Member
|
Rolling and Lyrical CI is broken for now. There are mypy errors and topic subscription tests fail randomly. This is not related to this PR but I would appreciate any help in fixing these failures. |
f3a1016 to
a70f997
Compare
Author
On it. Splitting the CI fixes out into a new PR. |
a70f997 to
3eb66ff
Compare
…lities Both `services.call_service` and `advertise_action`'s `AdvertisedActionHandler` create rclpy entities (`rclpy.Client` and `rclpy.action.ActionServer`, respectively) on the rosbridge node directly from a worker thread. The same node is concurrently spun by the rclpy executor, so the executor's wait-set rebuild can race with the worker thread's entity registration / destruction and either crash inside `_rclpy` (action servers SIGSEGV in `rclpy/action/server.py:__init__`) or leave the executor holding a handle whose underlying PyCapsule has been freed (services surface this later as `TypeError: Object of type 'NoneType' is not an instance of 'capsule'`). This is the service- and action-capability analogue of the IncomingQueue race fixed in RobotWebTools#1183: route the lifecycle calls through the executor via `executor.create_task(...)` so they serialize with the executor's own work instead of competing with it. - Extract the executor-routing pattern into `rosbridge_library.internal.executor_helpers`, exposing `run_on_executor` (synchronous; returns the value, used when the caller needs the entity back) and `schedule_on_executor` (fire-and-forget, used for destructors). Both fall back to inline invocation when the node has no attached executor, preserving unit-test behaviour. - Apply `run_on_executor` to client creation in `services.call_service` and ActionServer creation in `AdvertisedActionHandler.__init__`. The three `destroy_client` sites in `services.call_service` now use `schedule_on_executor`. ActionServer destruction was already routed through the executor in RobotWebTools#1248. - `ActionTester` in `test_actions.py` constructs its own ActionServer from the test thread while the fixture's executor is already spinning. Route that creation through `run_on_executor` for the same reason; without it the rclpy SIGSEGV reproduces on a workspace with up-to-date rclpy. `test_stress_service_clients.py` exercises the service race window (parallel callers, barrier-synchronized starts, rapid sequential cycles, and interleaved waves of concurrent callers) and asserts the executor remains healthy and responsive afterwards.
3eb66ff to
d0a816f
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
services.call_serviceandAdvertisedActionHandlerboth create rclpy entities (rclpy.Clientandrclpy.action.ActionServer, respectively) on the rosbridge node directly from a worker thread, while the rclpy executor is concurrently spinning that same node. The executor's wait-set rebuild can race with the worker thread's entity registration / destruction.The symptom varies by entity:
Service clients silently lose their backing handle: the executor ends up holding a
Clientwhose PyCapsule has been freed, and the next rclpy binding call on the affected node raisesTypeError: Object of type 'NoneType' is not an instance of 'capsule'. After this trips once, every subsequentcall_serviceon the affected websocket returnsresult: falseand the connection is unrecoverable until the rosbridge process restarts.Action servers crash harder: the worker thread reaches
_rclpy.ActionServer(...)inrclpy/action/server.py:__init__while the executor is mid-wait-set-rebuild and the process SIGSEGVs. This is the actual cause of thetest_capabilitieshang on Rolling/Lyrical CI — pytest's xunit fires after the segfault, but ctest waits the full 60 s before declaring the test timed out, masking the real signal in run logs.Approach
Same idea as the IncomingQueue race fixed in #1183: route the lifecycle calls through the executor via
executor.create_task(...)so they serialize with the executor's own work instead of competing with it.Extract the executor-routing pattern into
rosbridge_library.internal.executor_helpers— one module, two helpers:run_on_executor(node, fn)fn()'s return valueschedule_on_executor(node, fn)Both fall back to inline invocation when the node has no attached executor, which keeps unit-test setups (no executor) working unchanged.
What changed
services.call_servicenow routescreate_clientthroughrun_on_executor(it needs the resultingClientbefore it can callwait_for_service/call_async). The threedestroy_clientsites — wait-for-service failure, response timeout, and the successful path — go throughschedule_on_executor.AdvertisedActionHandler.__init__routesActionServer(...)throughrun_on_executor. (ActionServer destruction was already routed through the executor in feat: Improve action unadvertising #1248, so only the creation side needed the fix.)ActionTesterintest/internal/actions/test_actions.pybuilds its own ActionServer from the test thread while the fixture's executor is spinning. The same SIGSEGV reproduced there, so itsActionServer(...)call now goes throughrun_on_executortoo.call_service(resolve_service_name,get_service_names_and_types) stay on the worker thread.Regression test
rosbridge_library/test/internal/services/test_stress_service_clients.pyexercises the race window:test_many_parallel_service_calls— 40ServiceCallers in flight at once against the same service.test_barrier_synchronized_service_calls— 20 callers start at the same instant via athreading.Barrier.test_repeated_rapid_service_calls— 100 sequential calls with no settle time between iterations.test_interleaved_waves_of_service_calls— each wave of callers starts while the previous wave is still tearing down its clients.After each scenario, the test asserts the executor's spin thread is still alive, has logged no errors, and can still process newly submitted tasks. The action-server race is covered by the existing
test_action_capabilitiesandtest_actionstests, which previously SIGSEGV'd and now pass.Back-ports
services.pyandadvertise_action.pyare identical onjazzyandhumble; sibling PRs target those branches. Happy to defer those to mergify if preferred.