Skip to content

[CI] Enable multi-GPU pytest workflow#5738

Draft
hujc7 wants to merge 9 commits into
isaac-sim:developfrom
hujc7:jichuanh/enable-fabric-mgpu-ci
Draft

[CI] Enable multi-GPU pytest workflow#5738
hujc7 wants to merge 9 commits into
isaac-sim:developfrom
hujc7:jichuanh/enable-fabric-mgpu-ci

Conversation

@hujc7
Copy link
Copy Markdown
Collaborator

@hujc7 hujc7 commented May 21, 2026

1. Summary

  • Re-enables on: pull_request trigger in .github/workflows/test-fabric-multi-gpu.yaml. Single file changed, 5 insertions / 12 deletions.
  • Unblocks the cuda:1-parameterized unit tests added in Enable mgpu in FrameView #5514 (source/isaaclab_physx/test/sim/test_views_xform_prim_fabric.py) — they are currently only reachable via workflow_dispatch.
  • First end-to-end exercise of the new multi-GPU runner pool (g6e.12xlarge, label [self-hosted, linux, x64, gpu, multi-gpu]) provisioned 2026-05-21.

2. Why now

The trigger block was commented out at #5514 merge time because no self-hosted runner advertised the multi-gpu label; any pull_request event would have queued indefinitely. The infra precondition is now satisfied, so the original trigger can be restored verbatim.

3. Scope

  • No test code changes. The cuda:1 tests, the ISAACLAB_TEST_MULTI_GPU=1 env gate, the 2-GPU pre-flight check, and the runner label all stay as-is.
  • This PR is the minimal "is the runner alive" check. If the workflow goes green, follow-ups can broaden coverage (path filter, additional cuda:1 parameterized tests).

4. CI footprint

The diff only touches .github/workflows/test-fabric-multi-gpu.yaml, which is not in build.yaml's Detect Changes allowlist. The standard L40 docker-test matrix self-skips with run_docker_tests=false, so iteration on this branch does not consume the production single-GPU pool. Only the new multi-GPU workflow runs.

5. Test plan

  • Workflow appears in the checks list on this draft PR.
  • 2-GPU pre-flight step passes (torch.cuda.device_count() >= 2).
  • All three cuda:1-parameterized tests in test_views_xform_prim_fabric.py pass.
  • If multi-GPU pool capacity is constrained (us-west-2d g6e.12xlarge shortage per Alexander 2026-05-21), confirm the job queues rather than fails and picks up once capacity returns.

The multi-GPU runner pool went live on 2026-05-21 (g6e.12xlarge,
self-hosted, linux, x64, gpu, multi-gpu).  The on: block in this
workflow was previously commented out because no runner advertised
that label; that precondition is now satisfied.

This change uncomments the original pull_request paths trigger and
removes the DISABLED comment.  No test or workflow logic changes.
@hujc7 hujc7 changed the title [CI] Enable FabricFrameView multi-GPU workflow trigger [CI] Enable multi-GPU pytest workflow May 21, 2026
Copy link
Copy Markdown

@isaaclab-review-bot isaaclab-review-bot Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 Isaac Lab Review Bot

PR #5738 — [CI] Enable FabricFrameView multi-GPU workflow trigger

Summary

This PR re-enables the pull_request trigger for the FabricFrameView multi-GPU test workflow. The trigger was previously commented out awaiting infrastructure provisioning of a multi-GPU runner.

Changes Reviewed

File Change Type Lines
.github/workflows/test-fabric-multi-gpu.yaml Modified +5/-12

Analysis

Clean change — The diff simply uncomments the original pull_request paths trigger and removes the DISABLED comment block. No workflow logic changes.

Path filters retained — The workflow appropriately triggers only on changes to:

  • source/isaaclab_physx/isaaclab_physx/sim/views/fabric_frame_view.py
  • source/isaaclab_physx/test/sim/test_views_xform_prim_fabric.py
  • .github/workflows/test-fabric-multi-gpu.yaml

workflow_dispatch preserved — Manual triggering capability remains for debugging/re-runs.

Excellent PR description — Clear rationale, scope, and test plan provided.

CI Status

  • Standard test jobs correctly self-skip (not in Detect Changes allowlist)
  • FabricFrameView multi-GPU unit tests job is pending/queued as expected

Verdict

No issues found. This is a minimal, well-documented infrastructure enablement change.


🤖 Automated review by isaaclab-review-bot | Commit: baf811f


Update (commit c21377a): New commit adds a Set up Python step using setup-python@v5 with Python 3.12. This is a sensible addition to ensure consistent Python environment setup before running the Isaac Lab install script. ✅ No issues — LGTM.


Update (commit 4f136d1): Adds pip install cmake step before Isaac Lab installation. The detailed comment explains this bypasses the sudo apt-get path in install.py (line 35) by placing cmake on PATH via the pip wheel. Smart workaround for CI environments. ✅ No issues — LGTM.


Update (commit 43ab9b9): Changes install command to ./isaaclab.sh --install none (core submodules only). Well-commented explanation: the FabricFrameView cuda:1 tests only need isaaclab core + isaaclab_physx + Isaac Sim, and this avoids building robomimic's egl_probe wheel which requires libEGL/X11 headers missing on the multi-GPU runner image. ✅ Sensible minimal install approach — LGTM.


Update (commit a5ca71d): Adds explicit pip install 'isaacsim[all,extscache]==5.1.0' step. Logical follow-up to the --install none change — since Isaac Sim is an optional extra not pulled by minimal install, it must be installed explicitly. The comment correctly notes that without it, AppLauncher's import isaacsim silently fails, leaving EXP_PATH unset and causing KeyError at test collection. ✅ Well-documented fix — LGTM.


Update (commit 1106337): Bumps Isaac Sim version from 5.1.0 to 6.0.0.0. Updated comment explains the version pin: isaacsim 6.0.0.0 is the only release line with Python 3.12 wheels (the 5.x line on PyPI requires Python 3.11). This aligns correctly with the Python 3.12 setup added in c21377a. ✅ Necessary version bump — LGTM.


Update (commit 93b45a8): Adds EULA acceptance and headless environment variables (OMNI_KIT_ACCEPT_EULA, ACCEPT_EULA, ISAAC_SIM_HEADLESS) to the test step. The comment explains these bypass Kit's interactive EULA prompt on first launch, preventing isaacsim.bootstrap_kernel from reading stdin during test collection (which causes pytest to abort). ✅ Standard CI headless configuration — LGTM.


Update (commit 48b42bc): Increases job timeout from 30 to 60 minutes. Sensible buffer given the heavy Isaac Sim installation and multi-GPU test setup overhead. ✅ No issues — LGTM.

hujc7 added 8 commits May 22, 2026 20:33
Matches the IL pyproject.toml requires-python (>=3.12,<3.13) and the
pin every other active workflow uses (docs, wheel, pre-commit,
changelog-check, license-check, nightly-changelog).

Without this, the runner falls back to system Python which lacks
tomllib (added in 3.11+ stdlib), and ./isaaclab.sh --install crashes
in isaaclab.cli.commands.install on the tomllib import.
The multi-GPU runner has no passwordless sudo for the github-runner
user; ./isaaclab.sh --install crashes on 'sudo apt-get update' during
the system-deps step.

install.py:_install_system_deps short-circuits if shutil.which("cmake")
finds cmake on PATH.  The pip wheel ships a cmake binary under the
setup-python@v5 venv's bin/, so adding 'pip install cmake' before
the install step makes the apt-get branch unreachable.
./isaaclab.sh --install (default 'all') pulls in isaaclab_mimic, which
depends on robomimic, which depends on egl_probe.  egl_probe builds
from source via CMake and needs libEGL/X11 headers that the multi-GPU
runner image lacks — its setup.py runs 'cmake --version' inside an
isolated build env where the host system libs aren't visible.

FabricFrameView cuda:1 tests only need isaaclab core + isaaclab_physx
+ Isaac Sim.  '--install none' covers exactly that and avoids the
egl_probe build entirely.
isaacsim is an optional extra in source/isaaclab/setup.py and is not
pulled by ./isaaclab.sh --install (any scope).  Production runner images
ship Isaac Sim baked in; this multi-GPU runner image does not.

Without isaacsim installed, AppLauncher's import isaacsim is silently
suppressed by contextlib.suppress(ModuleNotFoundError), leaving
EXP_PATH unset.  Test collection then crashes with
KeyError: 'EXP_PATH' at AppLauncher(headless=True).

Pinned to isaacsim[all,extscache]==5.1.0 to match the IL extras_require
pin in source/isaaclab/setup.py.
isaacsim==5.1.0 (the pin in source/isaaclab/setup.py) only ships a
Python 3.11 wheel, so pip cannot resolve it under the 3.12 baseline
declared in pyproject.toml.  6.0.0.0 is the first line with a 3.12
wheel and matches the version installed in the local env_isaaclab
conda env.

Note: source/isaaclab/setup.py's extras_require['isaacsim'] pin is
stale and should be bumped as a follow-up; out of scope here.
isaacsim.bootstrap_kernel runs on import and prompts for EULA
acceptance on first launch. Under pytest the stdin capture causes
'reading from stdin while output is captured' and the kit kernel
exits non-zero before test collection completes.

OMNI_KIT_ACCEPT_EULA=yes / ACCEPT_EULA=Y / ISAAC_SIM_HEADLESS=1
are the same env-var bypass build.yaml uses in its docker-test debug
recipe.
Attempt 8 reached the test step (all install/preflight steps green)
but the 30-min budget ran out during pytest execution.  First-launch
Kit-app cold start primes extscache, which is slow on a bare runner
without a prebaked cache.  Subsequent runs on the same runner will be
faster, but the first one needs more headroom.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant