Skip to content

Fix: Pre-existing Rolling/Lyrical CI failures (mypy + flaky subscriber tests)#1258

Open
JWhitleyWork wants to merge 3 commits into
RobotWebTools:ros2from
PickNikRobotics:fix/rolling-lyrical-ci
Open

Fix: Pre-existing Rolling/Lyrical CI failures (mypy + flaky subscriber tests)#1258
JWhitleyWork wants to merge 3 commits into
RobotWebTools:ros2from
PickNikRobotics:fix/rolling-lyrical-ci

Conversation

@JWhitleyWork
Copy link
Copy Markdown

@JWhitleyWork JWhitleyWork commented May 18, 2026

Summary

Pre-existing CI hygiene on Rolling and Lyrical that has been red on ros2 since the move to ubuntu:resolute:

  • Newer mypy (in resolute's package set) flags 5 latent typing issues that older mypy let slide — one in rosbridge_library/capabilities/subscribe.py, four in rosbridge_server/websocket_handler.py. None are caused by the changes in this branch; they are simply now load-bearing for any PR that touches a Rolling/Lyrical job.
  • DDS discovery on resolute overran the topic-existence test deadlines. The helpers in rosbridge_library/util/ros.py polled the rmw graph cache via get_(publisher|subscriber)_names_and_types_by_node, which requires a discovery round-trip even for self-queries. Different subscriber / publisher tests flaked across different runs.

Dependency

This PR stacks on top of #1255 (the service / action lifecycle race fix). #1255 must land first; without it test_action_capabilities SIGSEGVs and the rest of the Rolling/Lyrical job is moot.

Approach

mypy

Smallest possible patches at each error site — no broader refactor, no library-wide annotation pass.

  • subscribe.py:61 — extend the type: ignore on the simplejson fallback to also cover no-redef. The chained from X import Y as encode_json pattern legitimately rebinds the name; the new mypy version requires the suppression to say so.
  • websocket_handler.py::_log_exception — narrow RosbridgeWebSocket.node_handle with the same assert isinstance(..., Node) pattern already used in open() / on_close() before calling .get_logger().
  • websocket_handler.py::protocol_parameters — annotate the bare ClassVar as ClassVar[dict[str, Any]] to match the value it's actually assigned in rosbridge_websocket.py.
  • websocket_handler.py::open / on_close — pass self.request.remote_ip or "" to ClientManager.add_client / remove_client. Tornado types remote_ip as Any | None; the manager signature is str.

Topic-existence checks

The polling-based assert_topic_(not_)subscribed helpers, and the rmw-graph-cache queries they were built around, were always going to flake. The graph cache is updated by DDS discovery events; even for a node querying its own subscriptions, that's an asynchronous round-trip, and the resolute package set's defaults make the round-trip slower.

The deterministic alternative — and the one this PR adopts — is to read the node's own entity list directly:

  • is_topic_published(node, name)any(p.topic_name == name for p in node.publishers)
  • is_topic_subscribed(node, name)any(s.topic_name == name for s in node.subscriptions)

Both node.publishers and node.subscriptions are populated synchronously inside create_publisher / create_subscription and cleared inside destroy_publisher / destroy_subscription, so the result is correct as soon as the calling thread has the entity handle — no DDS involvement.

The one remaining asynchrony is that MultiSubscriber.unregister routes destroy_subscription through executor.create_task (added in #1194). After manager.unsubscribe, the destroy task is queued but may not have run yet from the test thread's point of view. wait_for_executor_idle handles this: it submits a no-op task and waits for it to run. SingleThreadedExecutor processes tasks FIFO, so when the no-op completes every task enqueued before it has also completed.

Why not launch_testing?

launch_testing is for testing systems that span processes. The tests here exercise the in-process SubscriberManager / PublisherManager API on a single node — moving them under launch_testing would impose a per-test process boundary and a much larger refactor without changing the determinism story. node.publishers / node.subscriptions give the same guarantee with no new infrastructure.

What changed

  • rosbridge_library/util/ros.py — rewrite is_topic_published / is_topic_subscribed against node.publishers / node.subscriptions; add wait_for_executor_idle.
  • Drop the assert_topic_(not_)subscribed polling helpers from the two subscriber test files; replace with direct assertTrue / assertFalse against is_topic_subscribed. Add wait_for_executor_idle(self.executor) after each manager.unsubscribe / multi.unregister before asserting the entity is gone.
  • Remove the time.sleep(0.1) before is_topic_published in test_publisher_manager.test_register_infer_topictypecreate_publisher is now reflected immediately. The time.sleep(manager.unregister_timeout + 1.0) waits stay; those wait on threading.Timer, not graph propagation.
  • subscribe.py, websocket_handler.py — the mypy fixes described above.

Local validation

Reproduced on a ros:rolling-perception container with current rclpy:

@bjsowa
Copy link
Copy Markdown
Member

bjsowa commented May 18, 2026

2. DDS discovery is slower on resolute

I love how the AI just assumed that's the case without any proof. The problem seems to be more complicated than that.

@JWhitleyWork
Copy link
Copy Markdown
Author

  1. DDS discovery is slower on resolute

I love how the AI just assumed that's the case without any proof. The problem seems to be more complicated than that.

Yeah, agreed. It's churning now on using a more deterministic solution (at my prompting) like launch_testing.

@JWhitleyWork JWhitleyWork force-pushed the fix/rolling-lyrical-ci branch 3 times, most recently from f4fcacd to e90ddf5 Compare May 18, 2026 23:35
…lities

Both `services.call_service` and `advertise_action`'s `AdvertisedActionHandler`
create rclpy entities (`rclpy.Client` and `rclpy.action.ActionServer`,
respectively) on the rosbridge node directly from a worker thread.  The same
node is concurrently spun by the rclpy executor, so the executor's wait-set
rebuild can race with the worker thread's entity registration / destruction
and either crash inside `_rclpy` (action servers SIGSEGV in
`rclpy/action/server.py:__init__`) or leave the executor holding a handle
whose underlying PyCapsule has been freed (services surface this later as
`TypeError: Object of type 'NoneType' is not an instance of 'capsule'`).

This is the service- and action-capability analogue of the IncomingQueue
race fixed in RobotWebTools#1183: route the lifecycle calls through the executor via
`executor.create_task(...)` so they serialize with the executor's own work
instead of competing with it.

- Extract the executor-routing pattern into
  `rosbridge_library.internal.executor_helpers`, exposing `run_on_executor`
  (synchronous; returns the value, used when the caller needs the entity
  back) and `schedule_on_executor` (fire-and-forget, used for destructors).
  Both fall back to inline invocation when the node has no attached
  executor, preserving unit-test behaviour.
- Apply `run_on_executor` to client creation in `services.call_service` and
  ActionServer creation in `AdvertisedActionHandler.__init__`. The three
  `destroy_client` sites in `services.call_service` now use
  `schedule_on_executor`. ActionServer destruction was already routed
  through the executor in RobotWebTools#1248.
- `ActionTester` in `test_actions.py` constructs its own ActionServer from
  the test thread while the fixture's executor is already spinning. Route
  that creation through `run_on_executor` for the same reason; without it
  the rclpy SIGSEGV reproduces on a workspace with up-to-date rclpy.

`test_stress_service_clients.py` exercises the service race window
(parallel callers, barrier-synchronized starts, rapid sequential cycles,
and interleaved waves of concurrent callers) and asserts the executor
remains healthy and responsive afterwards.
Two independent root causes block any PR that touches a Rolling/Lyrical job:

mypy (Ubuntu 25.10 'resolute' ships a stricter mypy that flags 5 latent
issues older versions silently accepted):
- subscribe.py: extend the type-ignore on the simplejson fallback to also
  cover no-redef. The chained 'import as encode_json' pattern legitimately
  rebinds the name; newer mypy now requires the suppression to say so.
- websocket_handler.py: narrow Node | None at _log_exception's use site
  with the same assert pattern already in open()/on_close(); annotate the
  bare protocol_parameters ClassVar as dict[str, Any]; coerce the optional
  request.remote_ip to a str at the ClientManager.{add,remove}_client
  boundary (Tornado types it as Any | None).

Flaky topic-existence tests: the previous helpers polled the rmw graph
cache via get_(publisher|subscriber)_names_and_types_by_node, which
requires a DDS-discovery round-trip even for self-queries. On 'resolute'
that round-trip routinely overran the 1s polling deadline, so different
tests flaked across different runs.

Switch to a deterministic implementation:
- is_topic_published / is_topic_subscribed in util/ros.py now read
  node.publishers / node.subscriptions (the local entity list). These are
  populated synchronously inside create_publisher / create_subscription
  and cleared inside destroy_publisher / destroy_subscription, so the
  result is correct as soon as the calling thread has the entity handle.
- Drop the assert_topic_(not_)subscribed polling helpers from the
  subscriber tests; call is_topic_subscribed directly.
- subscribers.py routes destroy_subscription through executor.create_task
  (RobotWebTools#1194), so after manager.unsubscribe / MultiSubscriber.unregister the
  test thread still has to synchronize with the executor before asserting
  the entity is gone. Add wait_for_executor_idle, which enqueues a no-op
  task and waits for it; SingleThreadedExecutor processes tasks FIFO, so
  any destroy task enqueued earlier has completed when it returns.
- Drop the time.sleep(0.1) before is_topic_published in
  test_publisher_manager.test_register_infer_topictype — create_publisher
  is now reflected immediately.

Existing time.sleep(manager.unregister_timeout + 1.0) waits stay: those
are waiting for threading.Timer to fire and run destroy_publisher, not
for graph propagation.
@JWhitleyWork JWhitleyWork force-pushed the fix/rolling-lyrical-ci branch from e90ddf5 to 35e5b6d Compare May 18, 2026 23:52
test_capabilities was hitting CTest's 60-second timeout on Rolling and Lyrical
even though pytest itself completes in ~14s and reports 46/46 passing. The
remaining ~46s is wasted in Python interpreter shutdown waiting on non-daemon
worker threads.

Each of the action / service capability tests dispatches the rosbridge worker
via Thread(target=self.send_goal.send_action_goal, ...).start() (analogous to
the production protocol.register_operation path). The worker calls
SendGoal.send_goal / call_service.call_service, which busy-wait on a result
that is only delivered when the executor spins the registered callback. When
tearDown calls executor.shutdown() before the worker's result callback has
fired, the busy-wait runs forever. Because the dispatch threads are
non-daemon, the Python interpreter blocks at shutdown trying to join them.

Mark these worker threads daemon=True so process exit is not gated on them.
The threads still complete normally on the happy path; daemon only changes
behaviour at interpreter shutdown, mirroring the pattern already used in
test_stress_service_clients.py.
@JWhitleyWork
Copy link
Copy Markdown
Author

I think it did it...

@JWhitleyWork JWhitleyWork marked this pull request as ready for review May 20, 2026 05:34
@bjsowa
Copy link
Copy Markdown
Member

bjsowa commented May 20, 2026

Is the first commit needed to make the tests pass?

@JWhitleyWork
Copy link
Copy Markdown
Author

Is the first commit needed to make the tests pass?

It is, yes. That commit is what makes up #1255, which still needs to go into humble and jazzy.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants