Fix: Pre-existing Rolling/Lyrical CI failures (mypy + flaky subscriber tests)#1258
Open
JWhitleyWork wants to merge 3 commits into
Open
Fix: Pre-existing Rolling/Lyrical CI failures (mypy + flaky subscriber tests)#1258JWhitleyWork wants to merge 3 commits into
JWhitleyWork wants to merge 3 commits into
Conversation
Member
I love how the AI just assumed that's the case without any proof. The problem seems to be more complicated than that. |
Author
Yeah, agreed. It's churning now on using a more deterministic solution (at my prompting) like |
f4fcacd to
e90ddf5
Compare
…lities Both `services.call_service` and `advertise_action`'s `AdvertisedActionHandler` create rclpy entities (`rclpy.Client` and `rclpy.action.ActionServer`, respectively) on the rosbridge node directly from a worker thread. The same node is concurrently spun by the rclpy executor, so the executor's wait-set rebuild can race with the worker thread's entity registration / destruction and either crash inside `_rclpy` (action servers SIGSEGV in `rclpy/action/server.py:__init__`) or leave the executor holding a handle whose underlying PyCapsule has been freed (services surface this later as `TypeError: Object of type 'NoneType' is not an instance of 'capsule'`). This is the service- and action-capability analogue of the IncomingQueue race fixed in RobotWebTools#1183: route the lifecycle calls through the executor via `executor.create_task(...)` so they serialize with the executor's own work instead of competing with it. - Extract the executor-routing pattern into `rosbridge_library.internal.executor_helpers`, exposing `run_on_executor` (synchronous; returns the value, used when the caller needs the entity back) and `schedule_on_executor` (fire-and-forget, used for destructors). Both fall back to inline invocation when the node has no attached executor, preserving unit-test behaviour. - Apply `run_on_executor` to client creation in `services.call_service` and ActionServer creation in `AdvertisedActionHandler.__init__`. The three `destroy_client` sites in `services.call_service` now use `schedule_on_executor`. ActionServer destruction was already routed through the executor in RobotWebTools#1248. - `ActionTester` in `test_actions.py` constructs its own ActionServer from the test thread while the fixture's executor is already spinning. Route that creation through `run_on_executor` for the same reason; without it the rclpy SIGSEGV reproduces on a workspace with up-to-date rclpy. `test_stress_service_clients.py` exercises the service race window (parallel callers, barrier-synchronized starts, rapid sequential cycles, and interleaved waves of concurrent callers) and asserts the executor remains healthy and responsive afterwards.
Two independent root causes block any PR that touches a Rolling/Lyrical job:
mypy (Ubuntu 25.10 'resolute' ships a stricter mypy that flags 5 latent
issues older versions silently accepted):
- subscribe.py: extend the type-ignore on the simplejson fallback to also
cover no-redef. The chained 'import as encode_json' pattern legitimately
rebinds the name; newer mypy now requires the suppression to say so.
- websocket_handler.py: narrow Node | None at _log_exception's use site
with the same assert pattern already in open()/on_close(); annotate the
bare protocol_parameters ClassVar as dict[str, Any]; coerce the optional
request.remote_ip to a str at the ClientManager.{add,remove}_client
boundary (Tornado types it as Any | None).
Flaky topic-existence tests: the previous helpers polled the rmw graph
cache via get_(publisher|subscriber)_names_and_types_by_node, which
requires a DDS-discovery round-trip even for self-queries. On 'resolute'
that round-trip routinely overran the 1s polling deadline, so different
tests flaked across different runs.
Switch to a deterministic implementation:
- is_topic_published / is_topic_subscribed in util/ros.py now read
node.publishers / node.subscriptions (the local entity list). These are
populated synchronously inside create_publisher / create_subscription
and cleared inside destroy_publisher / destroy_subscription, so the
result is correct as soon as the calling thread has the entity handle.
- Drop the assert_topic_(not_)subscribed polling helpers from the
subscriber tests; call is_topic_subscribed directly.
- subscribers.py routes destroy_subscription through executor.create_task
(RobotWebTools#1194), so after manager.unsubscribe / MultiSubscriber.unregister the
test thread still has to synchronize with the executor before asserting
the entity is gone. Add wait_for_executor_idle, which enqueues a no-op
task and waits for it; SingleThreadedExecutor processes tasks FIFO, so
any destroy task enqueued earlier has completed when it returns.
- Drop the time.sleep(0.1) before is_topic_published in
test_publisher_manager.test_register_infer_topictype — create_publisher
is now reflected immediately.
Existing time.sleep(manager.unregister_timeout + 1.0) waits stay: those
are waiting for threading.Timer to fire and run destroy_publisher, not
for graph propagation.
e90ddf5 to
35e5b6d
Compare
test_capabilities was hitting CTest's 60-second timeout on Rolling and Lyrical even though pytest itself completes in ~14s and reports 46/46 passing. The remaining ~46s is wasted in Python interpreter shutdown waiting on non-daemon worker threads. Each of the action / service capability tests dispatches the rosbridge worker via Thread(target=self.send_goal.send_action_goal, ...).start() (analogous to the production protocol.register_operation path). The worker calls SendGoal.send_goal / call_service.call_service, which busy-wait on a result that is only delivered when the executor spins the registered callback. When tearDown calls executor.shutdown() before the worker's result callback has fired, the busy-wait runs forever. Because the dispatch threads are non-daemon, the Python interpreter blocks at shutdown trying to join them. Mark these worker threads daemon=True so process exit is not gated on them. The threads still complete normally on the happy path; daemon only changes behaviour at interpreter shutdown, mirroring the pattern already used in test_stress_service_clients.py.
Author
|
I think it did it... |
Member
|
Is the first commit needed to make the tests pass? |
Author
It is, yes. That commit is what makes up #1255, which still needs to go into |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Pre-existing CI hygiene on Rolling and Lyrical that has been red on
ros2since the move toubuntu:resolute:resolute's package set) flags 5 latent typing issues that older mypy let slide — one inrosbridge_library/capabilities/subscribe.py, four inrosbridge_server/websocket_handler.py. None are caused by the changes in this branch; they are simply now load-bearing for any PR that touches a Rolling/Lyrical job.resoluteoverran the topic-existence test deadlines. The helpers inrosbridge_library/util/ros.pypolled the rmw graph cache viaget_(publisher|subscriber)_names_and_types_by_node, which requires a discovery round-trip even for self-queries. Different subscriber / publisher tests flaked across different runs.Dependency
This PR stacks on top of #1255 (the service / action lifecycle race fix). #1255 must land first; without it
test_action_capabilitiesSIGSEGVs and the rest of the Rolling/Lyrical job is moot.Approach
mypy
Smallest possible patches at each error site — no broader refactor, no library-wide annotation pass.
subscribe.py:61— extend thetype: ignoreon thesimplejsonfallback to also coverno-redef. The chainedfrom X import Y as encode_jsonpattern legitimately rebinds the name; the new mypy version requires the suppression to say so.websocket_handler.py::_log_exception— narrowRosbridgeWebSocket.node_handlewith the sameassert isinstance(..., Node)pattern already used inopen()/on_close()before calling.get_logger().websocket_handler.py::protocol_parameters— annotate the bareClassVarasClassVar[dict[str, Any]]to match the value it's actually assigned inrosbridge_websocket.py.websocket_handler.py::open/on_close— passself.request.remote_ip or ""toClientManager.add_client/remove_client. Tornado typesremote_ipasAny | None; the manager signature isstr.Topic-existence checks
The polling-based
assert_topic_(not_)subscribedhelpers, and the rmw-graph-cache queries they were built around, were always going to flake. The graph cache is updated by DDS discovery events; even for a node querying its own subscriptions, that's an asynchronous round-trip, and theresolutepackage set's defaults make the round-trip slower.The deterministic alternative — and the one this PR adopts — is to read the node's own entity list directly:
is_topic_published(node, name)→any(p.topic_name == name for p in node.publishers)is_topic_subscribed(node, name)→any(s.topic_name == name for s in node.subscriptions)Both
node.publishersandnode.subscriptionsare populated synchronously insidecreate_publisher/create_subscriptionand cleared insidedestroy_publisher/destroy_subscription, so the result is correct as soon as the calling thread has the entity handle — no DDS involvement.The one remaining asynchrony is that
MultiSubscriber.unregisterroutesdestroy_subscriptionthroughexecutor.create_task(added in #1194). Aftermanager.unsubscribe, the destroy task is queued but may not have run yet from the test thread's point of view.wait_for_executor_idlehandles this: it submits a no-op task and waits for it to run.SingleThreadedExecutorprocesses tasks FIFO, so when the no-op completes every task enqueued before it has also completed.Why not
launch_testing?launch_testingis for testing systems that span processes. The tests here exercise the in-processSubscriberManager/PublisherManagerAPI on a single node — moving them underlaunch_testingwould impose a per-test process boundary and a much larger refactor without changing the determinism story.node.publishers/node.subscriptionsgive the same guarantee with no new infrastructure.What changed
rosbridge_library/util/ros.py— rewriteis_topic_published/is_topic_subscribedagainstnode.publishers/node.subscriptions; addwait_for_executor_idle.assert_topic_(not_)subscribedpolling helpers from the two subscriber test files; replace with directassertTrue/assertFalseagainstis_topic_subscribed. Addwait_for_executor_idle(self.executor)after eachmanager.unsubscribe/multi.unregisterbefore asserting the entity is gone.time.sleep(0.1)beforeis_topic_publishedintest_publisher_manager.test_register_infer_topictype—create_publisheris now reflected immediately. Thetime.sleep(manager.unregister_timeout + 1.0)waits stay; those wait onthreading.Timer, not graph propagation.subscribe.py,websocket_handler.py— the mypy fixes described above.Local validation
Reproduced on a
ros:rolling-perceptioncontainer with current rclpy:test/internal/subscribers/+test/internal/publishers/— 44/44 pass (the previously flakytest_register_infer_topictype/test_register_subscriber_multiclientetc. now resolve as soon ascreate_subscriptionreturns instead of polling).test/capabilities/— 46/46 pass with fix: Prevent client destruction race in services.call_service #1255 applied (without fix: Prevent client destruction race in services.call_service #1255,test_action_capabilitiesSIGSEGVs before reaching anything in this PR's scope).