Enable mgpu in FrameView #5514

Merged
kellyguo11 merged 8 commits into isaac-sim:develop from pv-nvidia:feat/frame-view-enable-mgpu
May 20, 2026

@pv-nvidia
Contributor

@pv-nvidia pv-nvidia commented May 6, 2026

Description

Removes the cuda:0-only restriction in FabricFrameView. USDRT SelectPrims now accepts any CUDA device index, so Fabric acceleration runs on the simulation device (e.g., cuda:1) instead of silently falling back to the slower USD path. This unblocks distributed training where each process is pinned to a specific GPU.
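
For context, here is a minimal sketch of how a launcher might pin each process to its own GPU in a distributed run. It assumes the `LOCAL_RANK` environment variable that torchrun exports for each worker; the helper name `device_for_local_rank` is hypothetical and not part of Isaac Lab.

```python
import os


def device_for_local_rank(environ=None, default_rank=0):
    """Map the per-process local rank to a CUDA device string (hypothetical helper)."""
    environ = os.environ if environ is None else environ
    # torchrun exports LOCAL_RANK for each worker process; fall back to rank 0.
    local_rank = int(environ.get("LOCAL_RANK", default_rank))
    if local_rank < 0:
        raise ValueError(f"LOCAL_RANK must be non-negative, got {local_rank}")
    return f"cuda:{local_rank}"


# Rank 1 maps to cuda:1, the case that previously fell back to the USD path.
print(device_for_local_rank({"LOCAL_RANK": "1"}))
```

The resulting string is the kind of value that would be passed as the simulation device, so each rank keeps Fabric acceleration on its own GPU.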

Changes:

Type of change

  • New feature (non-breaking change which adds functionality)

cuda:0 continues to work exactly as before; cuda:1+ now also works instead of silently falling back to USD. No public API surface changed.

Checklist

  • I have read and understood the contribution guidelines
  • I have run the pre-commit checks with ./isaaclab.sh --format
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • I have updated the changelog and the corresponding version in the extension's config/extension.toml file
  • I have added my name to the CONTRIBUTORS.md or my name already exists there

Note: this PR uses a fragment file at source/isaaclab_physx/changelog.d/feat-frame-view-enable-mgpu.rst per the fragment-based changelog system.

Test plan

Three new tests gated by ISAACLAB_TEST_MULTI_GPU=1 and parameterized with ["cuda:1"]:

  • test_fabric_cuda1_world_pose_roundtrip: set_world_poses → get_world_poses returns the same values on a non-primary CUDA device.
  • test_fabric_cuda1_no_usd_writeback: Fabric writes on cuda:1 do not write back to USD.
  • test_fabric_cuda1_scales_roundtrip: covers the set_scales write path on cuda:1.
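
The environment-variable gate described above can be sketched as follows. This is an illustration only: the helper `multi_gpu_tests_enabled` and the exact comparison against `"1"` are assumptions, not the code in the test file.

```python
import os


def multi_gpu_tests_enabled(environ=None):
    """Return True only when ISAACLAB_TEST_MULTI_GPU is set to exactly "1"."""
    environ = os.environ if environ is None else environ
    return environ.get("ISAACLAB_TEST_MULTI_GPU") == "1"


# Reason string a skip marker could report when the gate is closed.
SKIP_REASON = "set ISAACLAB_TEST_MULTI_GPU=1 to run multi-GPU tests"

# With pytest, the gate would typically be applied like this:
#   @pytest.mark.skipif(not multi_gpu_tests_enabled(), reason=SKIP_REASON)
#   @pytest.mark.parametrize("device", ["cuda:1"])
#   def test_fabric_cuda1_world_pose_roundtrip(device): ...
```

Gating on an environment variable keeps the default invocation unchanged: without the variable, the cuda:1 tests are skipped rather than collected as failures.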

A dedicated CI workflow (test-fabric-multi-gpu.yaml) runs on the [self-hosted, linux, x64, gpu, multi-gpu] runner with ISAACLAB_TEST_MULTI_GPU=1 set. It pre-flights with nvidia-smi and torch.cuda.device_count(), and fails loudly if the runner has fewer than 2 GPUs.
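
A rough sketch of what such a pre-flight check could look like in Python. The function name `preflight_gpu_count` and the error message are hypothetical; the workflow's actual step may be written differently.

```python
def preflight_gpu_count(device_count, required=2):
    """Exit with a clear message if fewer than `required` GPUs are visible."""
    if device_count < required:
        raise SystemExit(
            f"Multi-GPU pre-flight failed: found {device_count} GPU(s), "
            f"need at least {required}."
        )
    return device_count


# In CI the count would come from PyTorch, for example:
#   import torch
#   preflight_gpu_count(torch.cuda.device_count())
```

Raising `SystemExit` with a message makes the step exit non-zero, so a runner that regresses to one GPU fails the job instead of silently skipping every cuda:1 test.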

To verify locally on a multi-GPU machine:

ISAACLAB_TEST_MULTI_GPU=1 ./isaaclab.sh -p -m pytest \
    source/isaaclab_physx/test/sim/test_views_xform_prim_fabric.py -v

To verify the cuda:0 path is unchanged (multi-GPU tests auto-skip):

./isaaclab.sh -p -m pytest \
    source/isaaclab_physx/test/sim/test_views_xform_prim_fabric.py -v

@github-actions github-actions Bot added isaac-lab Related to Isaac Lab team infrastructure labels May 6, 2026
@pv-nvidia pv-nvidia marked this pull request as draft May 6, 2026 12:32
@pv-nvidia pv-nvidia self-assigned this May 6, 2026
@greptile-apps
Contributor

greptile-apps Bot commented May 6, 2026

Greptile Summary

This PR removes the cuda:0-only restriction from FabricFrameView, allowing Fabric GPU acceleration on any CUDA device index (e.g. cuda:1), which unblocks distributed training. It also drops the deprecated wp.to_torch() calls in favour of the .torch accessor on ProxyArray, adds three cuda:1-parameterised multi-GPU tests, and ships a dedicated CI workflow with a GPU pre-flight guard.

  • Device allowlist removed: _fabric_supported_devices, the __init__ guard, and the _initialize_fabric assertion are all deleted; fabric_stage.SelectPrims is now called with self._device directly, letting USDRT handle any CUDA index.
  • Return type asymmetry: get_world_poses() wraps its Fabric result in ProxyArray (exposing .torch), but get_scales() still returns a raw wp.array. The new test_fabric_cuda1_scales_roundtrip test calls .torch on that raw array, which will raise AttributeError on the multi-GPU runner and void the intended coverage.
  • Multi-GPU CI workflow: test-fabric-multi-gpu.yaml includes the GPU pre-flight step (torch.cuda.device_count() >= 2) that fails loud before pytest is invoked, addressing the gap called out in a prior review round.

Confidence Score: 4/5

Safe to merge after the get_scales() return-type fix; the multi-GPU test for scales will throw AttributeError at runtime without it.

The core Fabric device-allowlist removal is straightforward and the cuda:0 path is unaffected. The blocking concern is that get_scales() returns a raw wp.array while the new test expects .torch on it — ProxyArray provides .torch but wp.array does not — so test_fabric_cuda1_scales_roundtrip will fail with AttributeError on the multi-GPU runner, defeating its coverage purpose.

fabric_frame_view.py (get_scales return type) and test_views_xform_prim_fabric.py (test_fabric_cuda1_scales_roundtrip) need the matching fix before the multi-GPU runner runs.
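
To illustrate the asymmetry, here is a dependency-free sketch using stand-in classes. `RawArray` and `Proxy` are hypothetical substitutes for the raw array and the wrapper described above; they are not the real Warp or Isaac Lab types.

```python
class RawArray:
    """Stand-in for an array type that has no `.torch` accessor (hypothetical)."""

    def __init__(self, data):
        self.data = list(data)


class Proxy:
    """Stand-in for a wrapper that exposes its data through `.torch` (hypothetical)."""

    def __init__(self, raw):
        self._raw = raw

    @property
    def torch(self):
        # A real wrapper would return a tensor; a list keeps this sketch dependency-free.
        return self._raw.data


def read_values(result):
    """Read values from either return type without assuming `.torch` exists."""
    if hasattr(result, "torch"):
        return result.torch
    return result.data


scales = RawArray([1.0, 2.0, 3.0])
print(read_values(scales))         # raw path: no .torch attribute
print(read_values(Proxy(scales)))  # wrapped path: goes through .torch
```

A test that calls `.torch` unconditionally works only for the wrapped type, which is the failure mode flagged for the scales roundtrip test.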

Important Files Changed

| Filename | Overview |
| --- | --- |
| source/isaaclab_physx/isaaclab_physx/sim/views/fabric_frame_view.py | Removes the cuda:0-only device allowlist and the assertion in _initialize_fabric; drops the CPU fallback guard; adds follow-up TODOs. The get_scales() return type is a raw wp.array while get_world_poses() returns ProxyArray, an asymmetry that affects the new scale tests. |
| source/isaaclab_physx/test/sim/test_views_xform_prim_fabric.py | Adds three cuda:1-gated multi-GPU tests and refines _skip_if_unavailable. The scales roundtrip test calls .torch on the return value of get_scales(), which returns a raw wp.array not a ProxyArray, so the accessor may be absent at runtime. |
| .github/workflows/test-fabric-multi-gpu.yaml | New dedicated CI workflow for multi-GPU Fabric tests; includes a GPU pre-flight step that fails loudly if fewer than 2 GPUs are present, closing the gap noted in a previous review. |
| source/isaaclab_physx/changelog.d/feat-frame-view-enable-mgpu.rst | Changelog fragment describing the multi-GPU Fabric fix; accurate and concise. |

Sequence Diagram

sequenceDiagram
    participant Caller
    participant FabricFrameView
    participant USDRTSelectPrims
    participant WarpKernel
    participant UsdFrameView

    Caller->>FabricFrameView: "__init__(device=cuda:N)"
    Note over FabricFrameView: No device allowlist check (removed)

    Caller->>FabricFrameView: set_world_poses(positions)
    alt Fabric enabled
        FabricFrameView->>USDRTSelectPrims: "SelectPrims(device=cuda:N)"
        FabricFrameView->>WarpKernel: launch(compose_fabric_transformation)
        FabricFrameView->>FabricFrameView: _prepare_for_reuse()
    else Fabric disabled
        FabricFrameView->>UsdFrameView: set_world_poses(...)
    end

    Caller->>FabricFrameView: get_scales()
    alt Fabric enabled
        FabricFrameView->>WarpKernel: launch(decompose_fabric_transformation)
        FabricFrameView-->>Caller: wp.array (raw — no ProxyArray wrap)
    else Fabric disabled
        FabricFrameView->>UsdFrameView: get_scales()
        FabricFrameView-->>Caller: result
    end

    Caller->>FabricFrameView: get_world_poses()
    alt Fabric enabled
        FabricFrameView->>WarpKernel: launch(decompose_fabric_transformation)
        FabricFrameView-->>Caller: ProxyArray(positions), ProxyArray(orientations)
    end

Reviews (6): Last reviewed commit: "Split FabricFrameView multi-GPU tests in..." | Re-trigger Greptile

Comment thread source/isaaclab_physx/isaaclab_physx/sim/views/fabric_frame_view.py
Comment thread source/isaaclab_physx/isaaclab_physx/sim/views/fabric_frame_view.py
@pv-nvidia pv-nvidia force-pushed the feat/frame-view-enable-mgpu branch 4 times, most recently from a6cd73e to 2c619fe Compare May 7, 2026 08:44
isaaclab-review-bot[bot]

This comment was marked as outdated.

@pv-nvidia pv-nvidia marked this pull request as ready for review May 11, 2026 11:29
isaaclab-review-bot[bot]

This comment was marked as off-topic.

Comment thread .github/workflows/test-multi-gpu.yaml Outdated
@pv-nvidia pv-nvidia changed the title Feat/frame view enable mgpu Enable mgpu in FrameView May 12, 2026
@pv-nvidia pv-nvidia changed the title Enable mgpu in FrameView pref: Enable mgpu in FrameView May 12, 2026
@pv-nvidia pv-nvidia added the enhancement New feature or request label May 12, 2026
@pv-nvidia pv-nvidia force-pushed the feat/frame-view-enable-mgpu branch 4 times, most recently from 1c2e02d to 8de9a39 Compare May 17, 2026 22:23
@pv-nvidia pv-nvidia force-pushed the feat/frame-view-enable-mgpu branch from 8de9a39 to e206ba9 Compare May 20, 2026 14:11
pv-nvidia added 5 commits May 20, 2026 15:35
- Allow FabricFrameView to run on cuda:N for any N; USDRT SelectPrims
  no longer needs cuda:0.
- Refactor the Fabric write path into a single _compose_fabric_transform
  helper shared by set_world_poses, set_scales, and the initial
  USD->Fabric sync, collapsing the sync to one kernel launch with one
  PrepareForReuse.
- Replace the topology-invariant assert with RuntimeError so it survives
  python -O.
- Add multi_gpu pytest marker plus cuda:1 unit-test coverage for both
  Fabric write paths, and run them in the existing test-multi-gpu CI
  job (one extra step, no new job).
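
A small sketch of why an explicit raise is preferred over assert for an invariant like this. The function name and the prim-count comparison are hypothetical; the real invariant in FabricFrameView may check something else.

```python
def check_topology_unchanged(expected_count, actual_count):
    """Raise RuntimeError if the prim count changed (hypothetical invariant check)."""
    # `assert expected_count == actual_count` would be stripped under `python -O`;
    # an explicit raise is always executed.
    if expected_count != actual_count:
        raise RuntimeError(
            f"Fabric topology changed: expected {expected_count} prims, "
            f"got {actual_count}."
        )
```

Python removes assert statements when run with `-O`, so a check that must hold in production needs an explicit exception.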
The standard pytest invocation in CI runs the fabric test file without
filtering on the ``multi_gpu`` marker, so the ``cuda:1`` tests get
scheduled on every runner including the single-GPU ones.  Previously
``_skip_if_unavailable`` hard-failed via ``pytest.fail`` whenever
``GITHUB_ACTIONS=true`` and the requested device was missing, on the
theory that this would catch a misconfigured multi-GPU runner.  In
practice it just broke the standard CI: the dedicated
``test-fabric-multi-gpu`` workflow already pre-flights
``torch.cuda.device_count() >= 2`` before invoking pytest, so a
genuinely misconfigured multi-GPU runner is already caught there.

Always skip rather than fail when the requested ``cuda:N`` index isn't
available.  Drop the now-unused ``import os``.
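
The skip-instead-of-fail policy can be sketched as below. `skip_reason_for_device` is a hypothetical helper for illustration; the real `_skip_if_unavailable` may be structured differently.

```python
def skip_reason_for_device(device, cuda_device_count):
    """Return a skip reason if `device` is unavailable, else None (hypothetical helper)."""
    if not device.startswith("cuda"):
        return None  # CPU needs no GPU check
    # A bare "cuda" without an index refers to device 0.
    index = int(device.split(":", 1)[1]) if ":" in device else 0
    if index >= cuda_device_count:
        return f"{device} not available ({cuda_device_count} CUDA device(s) visible)"
    return None


# In a pytest helper the reason would feed pytest.skip(reason), never pytest.fail(reason).
print(skip_reason_for_device("cuda:1", 1))
```

Returning a reason rather than raising keeps the decision testable without a GPU, and leaves the hard failure to the workflow's pre-flight step.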
Kit's CLI parser reads sys.argv directly at startup and segfaults on
pytest flags that collide with its own short options.  Running

    pytest -m multi_gpu source/isaaclab_physx/test/sim/test_views_xform_prim_fabric.py

crashes during collection because Kit sees ``-m multi_gpu`` and exits
with ``Ill formed parameter: -m`` followed by SIGSEGV (exit code 245)
inside ``simulation_app._start_app``.

Strip sys.argv to argv[0] before instantiating AppLauncher.  The test
file takes no CLI arguments of its own, mirroring the broader pattern
used by ``test_tiled_camera_env.py`` which assigns
``sys.argv[1:] = args_cli.unittest_args`` after argparse.
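
The argv strip amounts to the following. This sketch works on a sample list so it has no side effects; the helper name `strip_to_program_name` is hypothetical.

```python
def strip_to_program_name(argv):
    """Return a copy of argv that keeps only argv[0]."""
    return list(argv[:1])


# Example: pytest flags that another CLI parser might misread are dropped.
example_argv = ["pytest", "-m", "multi_gpu", "test_views_xform_prim_fabric.py"]
print(strip_to_program_name(example_argv))

# In the test module, the equivalent in-place form would run before the
# launcher is instantiated:
#   import sys
#   sys.argv[1:] = []
```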
wp.to_torch on a ProxyArray is deprecated in favor of the .torch
accessor.  Switch the three call sites that consume the ProxyArray
returned by get_world_poses; leave get_scales call sites alone since
that method still returns a raw wp.array (no .torch accessor).
- Add a GPU-count pre-flight step to the test-fabric-multi-gpu CI job
  so a runner regression to a single GPU fails the workflow instead of
  silently skipping every cuda:1 test. This is what the comment in
  _skip_if_unavailable already promised existed.
- Note that the sys.argv strip in test_views_xform_prim_fabric.py must
  stay between the AppLauncher import and its instantiation; any CLI
  parser or reordering re-exposes Kit to pytest argv and segfaults at
  startup.
- Document the _fabric_usd_sync_done side effect on
  _compose_fabric_transform so callers can see why subsequent getters
  stop pulling from USD.
The class docstring and __init__ device-param doc still claimed
``cuda:0`` only.  Refresh both to note that Fabric acceleration runs on
any CUDA index, so the autodoc API page reflects the actual contract.
@pv-nvidia pv-nvidia force-pushed the feat/frame-view-enable-mgpu branch from e206ba9 to 96f159e Compare May 20, 2026 15:38
isaaclab-review-bot[bot]

This comment was marked as outdated.

@pv-nvidia pv-nvidia force-pushed the feat/frame-view-enable-mgpu branch 2 times, most recently from ffb3e91 to f4dd500 Compare May 20, 2026 15:53
isaaclab-review-bot[bot]

This comment was marked as outdated.

@pv-nvidia pv-nvidia force-pushed the feat/frame-view-enable-mgpu branch from f4dd500 to cf57d31 Compare May 20, 2026 16:00
isaaclab-review-bot[bot]

This comment was marked as outdated.

@pv-nvidia pv-nvidia force-pushed the feat/frame-view-enable-mgpu branch from cf57d31 to a7a6956 Compare May 20, 2026 16:07
@isaaclab-review-bot isaaclab-review-bot Bot left a comment

🤖 Isaac Lab Review Bot — Updated Review (4f262aa)

Commit: 4f262aa6710b19679b5ab94015f0dde9a4fed38b
Previous review: 556b74b (workflow separation in progress)


📋 What Changed Since Last Review

Commit 4f262aa finalizes the workflow separation with a clean split:

| Change | Description |
| --- | --- |
| test-fabric-multi-gpu.yaml | New dedicated workflow (60 lines): self-contained CI for Fabric tests |
| test-multi-gpu.yaml | ✅ Restored to upstream/develop (removed Fabric test job) |
| fabric_frame_view.py | Minor: relocated TODO comments |
| changelog.d/*.rst | Simplified wording |
| test_views_xform_prim_fabric.py | Style cleanup only |

Key improvement: Complete workflow separation. FabricFrameView changes now trigger only test-fabric-multi-gpu.yaml (via path filter), while test-multi-gpu.yaml returns to its upstream state for distributed-training validation. The two workflows are completely decoupled.


✅ Full PR Summary

This PR removes the cuda:0-only restriction from FabricFrameView, enabling Fabric GPU acceleration on any CUDA device. This unblocks distributed training where each rank is pinned to a non-primary GPU (e.g., cuda:1).

🔍 Code Review

Architecture:

  • ✅ Clean removal of _fabric_supported_devices allowlist and associated guards
  • ✅ Minimal, surgical change — core write paths unchanged
  • ✅ Well-scoped TODO comments reference follow-up PRs (#5673, #5674)
  • ✅ Docstrings updated to reflect multi-GPU support

Error Handling:

  • ✅ RuntimeError replaces assert for topology-change invariant (survives python -O)
  • ✅ _skip_if_unavailable() gracefully skips tests on single-GPU runners

Test Coverage:

  • ✅ Three cuda:1-parameterized tests: roundtrip poses, no-writeback, scales roundtrip
  • ✅ New multi_gpu pytest marker registered in pyproject.toml
  • ✅ Kit argv stripping prevents segfault from pytest flags
  • ✅ Uses .torch accessor instead of deprecated wp.to_torch()

CI Design:

  • ✅ test-fabric-multi-gpu.yaml: dedicated 60-line workflow with GPU pre-flight
  • ✅ Path-filtered triggers: only runs on changes to FabricFrameView or its tests
  • ✅ Runner pre-flight (torch.cuda.device_count() >= 2) fails loudly if misconfigured
  • ✅ test-multi-gpu.yaml byte-identical to upstream/develop: zero risk of regression

🚦 CI Status

| Check | Status |
| --- | --- |
| Build Wheel | ✅ Pass |
| changelog fragments | ✅ Pass |
| Broken Links | ✅ Pass |
| labeler | ✅ Pass |
| pre-commit | ⏳ Pending |
| FabricFrameView multi-GPU tests | ⏳ Pending (self-hosted runner) |
| license-check | ⏳ Pending |
| Installation Tests | ⏳ Pending |
| Docs | ⏳ Pending |

📝 Verdict

LGTM — ready for human review. The workflow separation is now complete and clean. Core functionality is well-tested with appropriate CI coverage. The change is low-risk: cuda:0 path is unchanged, and the new cuda:1 tests provide explicit coverage. Once the pending CI checks pass (particularly the multi-GPU tests on the self-hosted runner), this PR is ready to merge.


Reviewed at: 2026-05-20T16:25 UTC

@pv-nvidia pv-nvidia force-pushed the feat/frame-view-enable-mgpu branch from a7a6956 to 556b74b Compare May 20, 2026 16:22
@isaac-sim isaac-sim deleted a comment from isaaclab-review-bot Bot May 20, 2026
@pv-nvidia pv-nvidia force-pushed the feat/frame-view-enable-mgpu branch from 556b74b to 4f262aa Compare May 20, 2026 16:24
Comment thread source/isaaclab_physx/test/sim/test_views_xform_prim_fabric.py Outdated
@pv-nvidia pv-nvidia force-pushed the feat/frame-view-enable-mgpu branch from 4f262aa to cc1d789 Compare May 20, 2026 16:34
isaaclab-review-bot[bot]

This comment was marked as outdated.

@pv-nvidia pv-nvidia force-pushed the feat/frame-view-enable-mgpu branch from cc1d789 to c42ec86 Compare May 20, 2026 16:37
@isaaclab-review-bot isaaclab-review-bot Bot left a comment

🤖 Isaac Lab Review Bot — Updated Review (a039313)

Commit: a03931317a2be2486357370d03326e9c398f86ef
Previous review: c42ec86 → Now fully finalized


📋 What Changed Since Last Review

Two commits since c42ec86:

| Commit | Change |
| --- | --- |
| 60d99b86 | Split FabricFrameView multi-GPU tests into dedicated workflow |
| a0393131 | Disable multi-GPU workflow (no self-hosted runner available) |

Key changes:

  1. Workflow separation complete: test-fabric-multi-gpu.yaml is now a standalone workflow file, completely decoupled from test-multi-gpu.yaml
  2. Workflow disabled: multi-GPU workflow set to workflow_dispatch only, with a clear comment explaining why (no multi-GPU runner provisioned yet)
  3. Pytest marker removed: multi_gpu marker dropped from pyproject.toml; tests now use the ISAACLAB_TEST_MULTI_GPU=1 env var gate via @pytest.mark.skipif
  4. Changelog simplified: removed the extraneous "Changed" section; it now focuses purely on the core fix
  5. Code cleanup: removed redundant comments in fabric_frame_view.py; tests use torch.as_tensor() consistently

✅ Full PR Summary

This PR removes the cuda:0-only restriction from FabricFrameView, enabling Fabric GPU acceleration on any CUDA device (cuda:0, cuda:1, etc.). This unblocks distributed training where each rank is pinned to a non-primary GPU.

🔍 Code Review

Architecture:

  • ✅ Clean removal of _fabric_supported_devices allowlist
  • ✅ Minimal, surgical change — core Warp kernel paths unchanged
  • ✅ TODO comments reference follow-up PRs (#5673, #5674)
  • ✅ Docstrings updated for multi-GPU support

Test Coverage:

  • ✅ Three cuda:1-parameterized tests gated by ISAACLAB_TEST_MULTI_GPU=1
  • ✅ _skip_if_unavailable() gracefully skips on single-GPU runners
  • ✅ Uses torch.as_tensor() for Warp→Torch conversion (consistent API)

CI Design:

  • ✅ test-fabric-multi-gpu.yaml: dedicated workflow (disabled until runner available)
  • ✅ Path-filtered triggers ready for when runner is provisioned
  • ✅ test-multi-gpu.yaml returned to upstream state (zero diff risk)
  • ✅ Env var gating avoids pytest marker complexity

🚦 CI Status

| Check | Status |
| --- | --- |
| pre-commit | ✅ Pass |
| Check changelog fragments | ✅ Pass |
| Build Wheel | ✅ Pass |
| Check for Broken Links | ✅ Pass |
| Detect Changes | ✅ Pass |
| labeler | ✅ Pass |
| Load Config | ✅ Pass |
| Installation Tests | ⏳ Pending |
| Build Latest Docs | ⏳ Pending |
| license-check | ⏳ Pending |

📝 Verdict

LGTM — ready for human review. The workflow separation is clean and complete. The PR is low-risk:

  • cuda:0 behavior unchanged
  • New cuda:1 tests provide explicit coverage (will run when multi-GPU runner is provisioned)
  • Multi-GPU workflow correctly disabled to avoid queue-indefinitely issue

Once CI passes, this is ready to merge.


Reviewed at: 2026-05-20T19:34 UTC

Move the test-fabric-multi-gpu job out of test-multi-gpu.yaml and into
a dedicated test-fabric-multi-gpu.yaml.  The two workflows share the
same runner label, install step, and GPU pre-flight, but trigger on
disjoint path sets so changes to FabricFrameView no longer gate the
distributed-training validation and vice versa.

test-multi-gpu.yaml is now byte-identical to upstream/develop.
@pv-nvidia pv-nvidia force-pushed the feat/frame-view-enable-mgpu branch from c42ec86 to 60d99b8 Compare May 20, 2026 16:46
Comment thread source/isaaclab_physx/test/sim/test_views_xform_prim_fabric.py
No self-hosted runner with the 'multi-gpu' label is registered.
All runs queue indefinitely. Kept as workflow_dispatch only so it
can be manually triggered once a runner is provisioned.

See also .github/workflows/test-multi-gpu.yaml (same issue).
@kellyguo11 kellyguo11 changed the title pref: Enable mgpu in FrameView Enable mgpu in FrameView May 20, 2026
@kellyguo11 kellyguo11 merged commit aa19b08 into isaac-sim:develop May 20, 2026
64 of 65 checks passed
Labels

enhancement (New feature or request), infrastructure, isaac-lab (Related to Isaac Lab team)

3 participants