[CI] Enable multi-GPU pytest workflow by hujc7 · Pull Request #5738 · isaac-sim/IsaacLab

hujc7 · 2026-05-21T23:12:56Z

1. Summary

Re-enables on: pull_request trigger in .github/workflows/test-fabric-multi-gpu.yaml. Single file changed, 5 insertions / 12 deletions.
Unblocks the cuda:1-parameterized unit tests added in Enable mgpu in FrameView #5514 (source/isaaclab_physx/test/sim/test_views_xform_prim_fabric.py) — they are currently only reachable via workflow_dispatch.
First end-to-end exercise of the new multi-GPU runner pool (g6e.12xlarge, label [self-hosted, linux, x64, gpu, multi-gpu]) provisioned 2026-05-21.

2. Why now

The trigger block was commented out at #5514 merge time because no self-hosted runner advertised the multi-gpu label; any pull_request event would have queued indefinitely. The infra precondition is now satisfied, so the original trigger can be restored verbatim.

3. Scope

No test code changes. The cuda:1 tests, the ISAACLAB_TEST_MULTI_GPU=1 env gate, the 2-GPU pre-flight check, and the runner label all stay as-is.
This PR is the minimal "is the runner alive" check. If the workflow goes green, follow-ups can broaden coverage (path filter, additional cuda:1 parameterized tests).

4. CI footprint

The diff only touches .github/workflows/test-fabric-multi-gpu.yaml, which is not in build.yaml's Detect Changes allowlist. The standard L40 docker-test matrix self-skips with run_docker_tests=false, so iteration on this branch does not consume the production single-GPU pool. Only the new multi-GPU workflow runs.

5. Test plan

Workflow appears in the checks list on this draft PR.
2-GPU pre-flight step passes (torch.cuda.device_count() >= 2).
All three cuda:1-parameterized tests in test_views_xform_prim_fabric.py pass.
If multi-GPU pool capacity is constrained (us-west-2d g6e.12xlarge shortage per Alexander 2026-05-21), confirm the job queues rather than fails and picks up once capacity returns.

The multi-GPU runner pool went live on 2026-05-21 (g6e.12xlarge, self-hosted, linux, x64, gpu, multi-gpu). The on: block in this workflow was previously commented out because no runner advertised that label; that precondition is now satisfied. This change uncomments the original pull_request paths trigger and removes the DISABLED comment. No test or workflow logic changes.

isaaclab-review-bot

🤖 Isaac Lab Review Bot

PR #5738 — [CI] Enable FabricFrameView multi-GPU workflow trigger

Summary

This PR re-enables the pull_request trigger for the FabricFrameView multi-GPU test workflow. The trigger was previously commented out awaiting infrastructure provisioning of a multi-GPU runner.

Changes Reviewed

File	Change Type	Lines
`.github/workflows/test-fabric-multi-gpu.yaml`	Modified	+5/-12

Analysis

✅ Clean change — The diff simply uncomments the original pull_request paths trigger and removes the DISABLED comment block. No workflow logic changes.

✅ Path filters retained — The workflow appropriately triggers only on changes to:

source/isaaclab_physx/isaaclab_physx/sim/views/fabric_frame_view.py
source/isaaclab_physx/test/sim/test_views_xform_prim_fabric.py
.github/workflows/test-fabric-multi-gpu.yaml

✅ workflow_dispatch preserved — Manual triggering capability remains for debugging/re-runs.

✅ Excellent PR description — Clear rationale, scope, and test plan provided.

CI Status

Standard test jobs correctly self-skip (not in Detect Changes allowlist)
FabricFrameView multi-GPU unit tests job is pending/queued as expected

Verdict

No issues found. This is a minimal, well-documented infrastructure enablement change.

_{🤖 Automated review by isaaclab-review-bot | Commit: baf811f}

Update (commit c21377a): New commit adds a Set up Python step using setup-python@v5 with Python 3.12. This is a sensible addition to ensure consistent Python environment setup before running the Isaac Lab install script. ✅ No issues — LGTM.

Update (commit 4f136d1): Adds pip install cmake step before Isaac Lab installation. The detailed comment explains this bypasses the sudo apt-get path in install.py (line 35) by placing cmake on PATH via the pip wheel. Smart workaround for CI environments. ✅ No issues — LGTM.

Update (commit 43ab9b9): Changes install command to ./isaaclab.sh --install none (core submodules only). Well-commented explanation: the FabricFrameView cuda:1 tests only need isaaclab core + isaaclab_physx + Isaac Sim, and this avoids building robomimic's egl_probe wheel which requires libEGL/X11 headers missing on the multi-GPU runner image. ✅ Sensible minimal install approach — LGTM.

Update (commit a5ca71d): Adds explicit pip install 'isaacsim[all,extscache]==5.1.0' step. Logical follow-up to the --install none change — since Isaac Sim is an optional extra not pulled by minimal install, it must be installed explicitly. The comment correctly notes that without it, AppLauncher's import isaacsim silently fails, leaving EXP_PATH unset and causing KeyError at test collection. ✅ Well-documented fix — LGTM.

Update (commit 1106337): Bumps Isaac Sim version from 5.1.0 to 6.0.0.0. Updated comment explains the version pin: isaacsim 6.0.0.0 is the only release line with Python 3.12 wheels (the 5.x line on PyPI requires Python 3.11). This aligns correctly with the Python 3.12 setup added in c21377a. ✅ Necessary version bump — LGTM.

Update (commit 93b45a8): Adds EULA acceptance and headless environment variables (OMNI_KIT_ACCEPT_EULA, ACCEPT_EULA, ISAAC_SIM_HEADLESS) to the test step. The comment explains these bypass Kit's interactive EULA prompt on first launch, preventing isaacsim.bootstrap_kernel from reading stdin during test collection (which causes pytest to abort). ✅ Standard CI headless configuration — LGTM.

Update (commit 48b42bc): Increases job timeout from 30 to 60 minutes. Sensible buffer given the heavy Isaac Sim installation and multi-GPU test setup overhead. ✅ No issues — LGTM.

Matches the IL pyproject.toml requires-python (>=3.12,<3.13) and the pin every other active workflow uses (docs, wheel, pre-commit, changelog-check, license-check, nightly-changelog). Without this, the runner falls back to system Python which lacks tomllib (added in 3.11+ stdlib), and ./isaaclab.sh --install crashes in isaaclab.cli.commands.install on the tomllib import.

…ncy group

The multi-GPU runner has no passwordless sudo for the github-runner user; ./isaaclab.sh --install crashes on 'sudo apt-get update' during the system-deps step. install.py:_install_system_deps short-circuits if shutil.which("cmake") finds cmake on PATH. The pip wheel ships a cmake binary under the setup-python@v5 venv's bin/, so adding 'pip install cmake' before the install step makes the apt-get branch unreachable.

./isaaclab.sh --install (default 'all') pulls in isaaclab_mimic, which depends on robomimic, which depends on egl_probe. egl_probe builds from source via CMake and needs libEGL/X11 headers that the multi-GPU runner image lacks — its setup.py runs 'cmake --version' inside an isolated build env where the host system libs aren't visible. FabricFrameView cuda:1 tests only need isaaclab core + isaaclab_physx + Isaac Sim. '--install none' covers exactly that and avoids the egl_probe build entirely.

isaacsim is an optional extra in source/isaaclab/setup.py and is not pulled by ./isaaclab.sh --install (any scope). Production runner images ship Isaac Sim baked in; this multi-GPU runner image does not. Without isaacsim installed, AppLauncher's import isaacsim is silently suppressed by contextlib.suppress(ModuleNotFoundError), leaving EXP_PATH unset. Test collection then crashes with KeyError: 'EXP_PATH' at AppLauncher(headless=True). Pinned to isaacsim[all,extscache]==5.1.0 to match the IL extras_require pin in source/isaaclab/setup.py.

isaacsim==5.1.0 (the pin in source/isaaclab/setup.py) only ships a Python 3.11 wheel, so pip cannot resolve it under the 3.12 baseline declared in pyproject.toml. 6.0.0.0 is the first line with a 3.12 wheel and matches the version installed in the local env_isaaclab conda env. Note: source/isaaclab/setup.py's extras_require['isaacsim'] pin is stale and should be bumped as a follow-up; out of scope here.

isaacsim.bootstrap_kernel runs on import and prompts for EULA acceptance on first launch. Under pytest the stdin capture causes 'reading from stdin while output is captured' and the kit kernel exits non-zero before test collection completes. OMNI_KIT_ACCEPT_EULA=yes / ACCEPT_EULA=Y / ISAAC_SIM_HEADLESS=1 are the same env-var bypass build.yaml uses in its docker-test debug recipe.

Attempt 8 reached the test step (all install/preflight steps green) but the 30-min budget ran out during pytest execution. First-launch Kit-app cold start primes extscache, which is slow on a bare runner without a prebaked cache. Subsequent runs on the same runner will be faster, but the first one needs more headroom.

github-actions Bot added the infrastructure label May 21, 2026

hujc7 changed the title ~~[CI] Enable FabricFrameView multi-GPU workflow trigger~~ [CI] Enable multi-GPU pytest workflow May 21, 2026

isaaclab-review-bot Bot reviewed May 21, 2026

View reviewed changes

hujc7 added 8 commits May 22, 2026 20:33

Retrigger: queue stalled 8+ hours, push empty commit to drop concurre…

c21377a

…ncy group

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[CI] Enable multi-GPU pytest workflow#5738

[CI] Enable multi-GPU pytest workflow#5738
hujc7 wants to merge 9 commits into
isaac-sim:developfrom
hujc7:jichuanh/enable-fabric-mgpu-ci

hujc7 commented May 21, 2026

Uh oh!

isaaclab-review-bot Bot left a comment •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

hujc7 commented May 21, 2026

1. Summary

2. Why now

3. Scope

4. CI footprint

5. Test plan

Uh oh!

isaaclab-review-bot Bot left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

🤖 Isaac Lab Review Bot

Summary

Changes Reviewed

Analysis

CI Status

Verdict

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

isaaclab-review-bot Bot left a comment •

edited

Loading