Enable mgpu in FrameView by pv-nvidia · Pull Request #5514 · isaac-sim/IsaacLab

pv-nvidia · 2026-05-06T12:30:46Z

Description

Removes the cuda:0-only restriction in FabricFrameView. USDRT SelectPrims now accepts any CUDA device index, so Fabric acceleration runs on the simulation device (e.g., cuda:1) instead of silently falling back to the slower USD path. This unblocks distributed training where each process is pinned to a specific GPU.

Changes:

Drop device allowlist. Removes _fabric_supported_devices, the device guard in __init__, and the corresponding assertion in _initialize_fabric. Any CUDA device (or CPU) now works.
Multi-GPU test coverage. Three cuda:1-parameterized tests gated by ISAACLAB_TEST_MULTI_GPU=1 env var, plus a dedicated CI workflow on the multi-GPU runner that sets it.
Fix deprecated wp.to_torch() calls. Replaced with .torch accessor on ProxyArray (avoids DeprecationWarning).
TODOs for follow-up PRs:
- refactor: move Fabric/USD dispatch from FabricFrameView to FrameView factory #5673
- refactor: fuse set_world_poses/set_scales into single _compose_fabric_transform #5674
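
The effect of dropping the device allowlist can be pictured with a small, self-contained sketch. It does not reproduce the FabricFrameView code: `OLD_SUPPORTED_DEVICES` is a hypothetical stand-in for `_fabric_supported_devices` (only the cuda:0 entry is confirmed by the description above), and both helper functions are invented for illustration.

```python
import re

# Hypothetical stand-in for the removed allowlist. Its real contents are an
# assumption; the PR description only confirms that cuda:0 was accepted.
OLD_SUPPORTED_DEVICES = ("cuda:0",)


def fabric_enabled_before(device: str) -> bool:
    """Assumed old behavior: devices outside the allowlist fall back to USD."""
    return device in OLD_SUPPORTED_DEVICES


def fabric_enabled_after(device: str) -> bool:
    """New behavior per the description: CPU or any CUDA index keeps Fabric on."""
    return device == "cpu" or re.fullmatch(r"cuda:\d+", device) is not None


for dev in ("cuda:0", "cuda:1"):
    print(dev, fabric_enabled_before(dev), fabric_enabled_after(dev))
```

In this sketch cuda:0 is accepted by both versions, while cuda:1 is accepted only by the second, which matches the "no silent fallback" behavior the PR describes.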

Type of change

New feature (non-breaking change which adds functionality)

cuda:0 continues to work exactly as before; cuda:1+ now also works instead of silently falling back to USD. No public API surface changed.

Checklist

I have read and understood the contribution guidelines
I have run the pre-commit checks with ./isaaclab.sh --format
I have made corresponding changes to the documentation
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
I have updated the changelog and the corresponding version in the extension's config/extension.toml file
I have added my name to the CONTRIBUTORS.md or my name already exists there

Note: this PR uses a fragment file at source/isaaclab_physx/changelog.d/feat-frame-view-enable-mgpu.rst per the fragment-based changelog system.

Test plan

Three new tests gated by ISAACLAB_TEST_MULTI_GPU=1 and parameterized with ["cuda:1"]:

test_fabric_cuda1_world_pose_roundtrip — set_world_poses → get_world_poses returns the same values on a non-primary CUDA device.
test_fabric_cuda1_no_usd_writeback — Fabric writes on cuda:1 do not write back to USD.
test_fabric_cuda1_scales_roundtrip — covers the set_scales write path on cuda:1.
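
As a rough illustration of the environment-variable gate, the sketch below computes a skip reason from `ISAACLAB_TEST_MULTI_GPU`. The helper name `multi_gpu_skip_reason` and the reason text are assumptions made for this example; the actual test module may express the gate differently.

```python
import os


def multi_gpu_skip_reason(env=None):
    """Return a skip reason, or None when the multi-GPU tests should run."""
    env = os.environ if env is None else env
    if env.get("ISAACLAB_TEST_MULTI_GPU") != "1":
        return "set ISAACLAB_TEST_MULTI_GPU=1 to run multi-GPU tests"
    return None


# With pytest, the same condition would typically feed a skipif marker, e.g.
#   @pytest.mark.skipif(multi_gpu_skip_reason() is not None, reason="...")
#   @pytest.mark.parametrize("device", ["cuda:1"])
#   def test_fabric_cuda1_world_pose_roundtrip(device): ...
print(multi_gpu_skip_reason({"ISAACLAB_TEST_MULTI_GPU": "1"}))
print(multi_gpu_skip_reason({}))
```

Passing the environment as a parameter keeps the gate testable without touching the real process environment.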

A dedicated CI workflow (test-fabric-multi-gpu.yaml) runs on the [self-hosted, linux, x64, gpu, multi-gpu] runner with ISAACLAB_TEST_MULTI_GPU=1 set. It pre-flights with nvidia-smi and torch.cuda.device_count() and fails loudly if the runner has fewer than 2 GPUs.
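
A minimal sketch of such a pre-flight check follows. The function name and message are hypothetical; in CI the count would come from `torch.cuda.device_count()`, but fixed numbers are used here so the example runs on any machine.

```python
def preflight_gpu_count(found: int, required: int = 2) -> None:
    """Abort with a non-zero exit status when too few GPUs are visible."""
    if found < required:
        raise SystemExit(
            f"pre-flight failed: found {found} GPU(s), need >= {required}"
        )


# Enough GPUs: returns silently and the workflow would continue to pytest.
preflight_gpu_count(2)

# Too few GPUs: SystemExit carries the message and a failing exit status.
try:
    preflight_gpu_count(1)
except SystemExit as exc:
    print(exc)
```

Raising `SystemExit` with a message makes the step fail before pytest starts, so a runner that has regressed to a single GPU cannot pass by skipping every cuda:1 test.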

To verify locally on a multi-GPU machine:

ISAACLAB_TEST_MULTI_GPU=1 ./isaaclab.sh -p -m pytest \
    source/isaaclab_physx/test/sim/test_views_xform_prim_fabric.py -v

To verify the cuda:0 path is unchanged (multi-GPU tests auto-skip):

./isaaclab.sh -p -m pytest \
    source/isaaclab_physx/test/sim/test_views_xform_prim_fabric.py -v

greptile-apps · 2026-05-06T12:35:49Z

Greptile Summary

This PR removes the cuda:0-only restriction from FabricFrameView, allowing Fabric GPU acceleration on any CUDA device index (e.g. cuda:1), which unblocks distributed training. It also drops the deprecated wp.to_torch() calls in favour of the .torch accessor on ProxyArray, adds three cuda:1-parameterised multi-GPU tests, and ships a dedicated CI workflow with a GPU pre-flight guard.

Device allowlist removed: _fabric_supported_devices, the __init__ guard, and the _initialize_fabric assertion are all deleted; fabric_stage.SelectPrims is now called with self._device directly, letting USDRT handle any CUDA index.
Return type asymmetry: get_world_poses() wraps its Fabric result in ProxyArray (exposing .torch), but get_scales() still returns a raw wp.array. The new test_fabric_cuda1_scales_roundtrip test calls .torch on that raw array, which will raise AttributeError on the multi-GPU runner and void the intended coverage.
Multi-GPU CI workflow: test-fabric-multi-gpu.yaml includes the GPU pre-flight step (torch.cuda.device_count() >= 2) that fails loud before pytest is invoked, addressing the gap called out in a prior review round.

Confidence Score: 4/5

Safe to merge after the get_scales() return-type fix; the multi-GPU test for scales will throw AttributeError at runtime without it.

The core Fabric device-allowlist removal is straightforward and the cuda:0 path is unaffected. The blocking concern is that get_scales() returns a raw wp.array while the new test expects .torch on it — ProxyArray provides .torch but wp.array does not — so test_fabric_cuda1_scales_roundtrip will fail with AttributeError on the multi-GPU runner, defeating its coverage purpose.

fabric_frame_view.py (get_scales return type) and test_views_xform_prim_fabric.py (test_fabric_cuda1_scales_roundtrip) need the matching fix before the multi-GPU runner runs.
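
The return-type asymmetry can be illustrated without Warp installed. In the sketch below, `ProxyLike` and `RawLike` are hypothetical stand-ins for a wrapper with a `.torch` accessor and a raw array without one, and `as_tensor_like` is an invented helper showing one defensive way a caller could handle both. It is not the fix adopted in the PR.

```python
class ProxyLike:
    """Hypothetical stand-in for a wrapper that exposes a .torch accessor."""

    def __init__(self, data):
        self._data = data

    @property
    def torch(self):
        return self._data


class RawLike:
    """Hypothetical stand-in for a raw array type with no .torch accessor."""

    def __init__(self, data):
        self.data = data


def as_tensor_like(value, convert_raw):
    """Use .torch when present, otherwise fall back to an explicit converter."""
    if hasattr(value, "torch"):
        return value.torch
    return convert_raw(value)


poses = ProxyLike([1.0, 2.0, 3.0])
scales = RawLike([1.0, 1.0, 1.0])

print(as_tensor_like(poses, convert_raw=lambda v: v.data))
print(as_tensor_like(scales, convert_raw=lambda v: v.data))

try:
    scales.torch  # mirrors the failure mode flagged in the review
except AttributeError as exc:
    print("AttributeError:", exc)
```

In real code the fallback converter would be whatever conversion the raw array type supports; the lambda here only unwraps the stand-in.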

Important Files Changed

Filename	Overview
source/isaaclab_physx/isaaclab_physx/sim/views/fabric_frame_view.py	Removes the cuda:0-only device allowlist and the assertion in _initialize_fabric; drops the CPU fallback guard; adds follow-up TODOs. The get_scales() return type is a raw wp.array while get_world_poses() returns ProxyArray — asymmetry that affects the new scale tests.
source/isaaclab_physx/test/sim/test_views_xform_prim_fabric.py	Adds three cuda:1-gated multi-GPU tests and refines _skip_if_unavailable. The scales roundtrip test calls .torch on the return value of get_scales(), which returns a raw wp.array not a ProxyArray, so the accessor may be absent at runtime.
.github/workflows/test-fabric-multi-gpu.yaml	New dedicated CI workflow for multi-GPU Fabric tests; includes a GPU pre-flight step that fails loudly if fewer than 2 GPUs are present, closing the gap noted in a previous review.
source/isaaclab_physx/changelog.d/feat-frame-view-enable-mgpu.rst	Changelog fragment describing the multi-GPU Fabric fix; accurate and concise.

Sequence Diagram

sequenceDiagram
    participant Caller
    participant FabricFrameView
    participant USDRTSelectPrims
    participant WarpKernel
    participant UsdFrameView

    Caller->>FabricFrameView: "__init__(device=cuda:N)"
    Note over FabricFrameView: No device allowlist check (removed)

    Caller->>FabricFrameView: set_world_poses(positions)
    alt Fabric enabled
        FabricFrameView->>USDRTSelectPrims: "SelectPrims(device=cuda:N)"
        FabricFrameView->>WarpKernel: launch(compose_fabric_transformation)
        FabricFrameView->>FabricFrameView: _prepare_for_reuse()
    else Fabric disabled
        FabricFrameView->>UsdFrameView: set_world_poses(...)
    end

    Caller->>FabricFrameView: get_scales()
    alt Fabric enabled
        FabricFrameView->>WarpKernel: launch(decompose_fabric_transformation)
        FabricFrameView-->>Caller: wp.array (raw — no ProxyArray wrap)
    else Fabric disabled
        FabricFrameView->>UsdFrameView: get_scales()
        FabricFrameView-->>Caller: result
    end

    Caller->>FabricFrameView: get_world_poses()
    alt Fabric enabled
        FabricFrameView->>WarpKernel: launch(decompose_fabric_transformation)
        FabricFrameView-->>Caller: ProxyArray(positions), ProxyArray(orientations)
    end

Reviews (6): Last reviewed commit: "Split FabricFrameView multi-GPU tests in..."

- Allow FabricFrameView to run on cuda:N for any N; USDRT SelectPrims no longer needs cuda:0.
- Refactor the Fabric write path into a single _compose_fabric_transform helper shared by set_world_poses, set_scales, and the initial USD->Fabric sync, collapsing the sync to one kernel launch with one PrepareForReuse.
- Replace the topology-invariant assert with RuntimeError so it survives python -O.
- Add multi_gpu pytest marker plus cuda:1 unit-test coverage for both Fabric write paths, and run them in the existing test-multi-gpu CI job (one extra step, no new job).
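
The assert-to-RuntimeError change matters because `python -O` strips `assert` statements. The sketch below contrasts the two styles; the function names, the prim-count comparison, and the message text are assumptions made for illustration, not the code from fabric_frame_view.py.

```python
def check_topology_assert(expected: int, actual: int) -> None:
    # Removed entirely when Python runs with -O, so the invariant goes unchecked.
    assert expected == actual, "prim count changed"


def check_topology_raise(expected: int, actual: int) -> None:
    # An explicit raise is ordinary control flow and survives python -O.
    if expected != actual:
        raise RuntimeError(
            f"Fabric topology changed: expected {expected} prims, got {actual}"
        )


try:
    check_topology_raise(4, 5)
except RuntimeError as exc:
    print(exc)
```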

The standard pytest invocation in CI runs the fabric test file without filtering on the ``multi_gpu`` marker, so the ``cuda:1`` tests get scheduled on every runner including the single-GPU ones. Previously ``_skip_if_unavailable`` hard-failed via ``pytest.fail`` whenever ``GITHUB_ACTIONS=true`` and the requested device was missing, on the theory that this would catch a misconfigured multi-GPU runner. In practice it just broke the standard CI: the dedicated ``test-fabric-multi-gpu`` workflow already pre-flights ``torch.cuda.device_count() >= 2`` before invoking pytest, so a genuinely misconfigured multi-GPU runner is already caught there. Always skip rather than fail when the requested ``cuda:N`` index isn't available. Drop the now-unused ``import os``.
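
A sketch of the "always skip, never fail" rule is shown below. The helper names are hypothetical and the visible GPU count is passed in as a plain integer so the example runs without torch; the real `_skip_if_unavailable` may be structured differently.

```python
def cuda_index(device: str):
    """Return N for a 'cuda:N' string, or None for any other device string."""
    prefix = "cuda:"
    if device.startswith(prefix) and device[len(prefix):].isdigit():
        return int(device[len(prefix):])
    return None


def skip_reason(device: str, visible_gpus: int):
    """Return a reason to skip when the requested CUDA index is not present."""
    idx = cuda_index(device)
    if idx is not None and idx >= visible_gpus:
        return f"{device} not available ({visible_gpus} GPU(s) visible)"
    return None


# On a single-GPU runner the cuda:1 tests are skipped instead of failing.
print(skip_reason("cuda:1", visible_gpus=1))
print(skip_reason("cuda:1", visible_gpus=2))
```

Inside a pytest helper, a non-None reason would be passed to `pytest.skip(...)`; detecting a misconfigured multi-GPU runner is left to the workflow pre-flight described above.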

Kit's CLI parser reads sys.argv directly at startup and segfaults on pytest flags that collide with its own short options. Running pytest -m multi_gpu source/isaaclab_physx/test/sim/test_views_xform_prim_fabric.py crashes during collection because Kit sees ``-m multi_gpu`` and exits with ``Ill formed parameter: -m`` followed by SIGSEGV (exit code 245) inside ``simulation_app._start_app``. Strip sys.argv to argv[0] before instantiating AppLauncher. The test file takes no CLI arguments of its own, mirroring the broader pattern used by ``test_tiled_camera_env.py`` which assigns ``sys.argv[1:] = args_cli.unittest_args`` after argparse.
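
The argv strip can be sketched in a few lines. The example operates on a local list rather than the real `sys.argv` so it is safe to run anywhere; in the test module the same operation would apply to `sys.argv` before AppLauncher is instantiated. The helper name is hypothetical.

```python
def strip_to_program_name(argv: list) -> None:
    """Drop every argument after argv[0], in place."""
    del argv[1:]


# Simulate what pytest leaves behind when invoked with a marker filter.
argv = ["pytest", "-m", "multi_gpu", "test_views_xform_prim_fabric.py"]
strip_to_program_name(argv)

# A downstream parser that reads the argument list directly now sees no flags.
print(argv)
```

Mutating the list in place matters when other code already holds a reference to the same list object.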

wp.to_torch on a ProxyArray is deprecated in favor of the .torch accessor. Switch the three call sites that consume the ProxyArray returned by get_world_poses; leave get_scales call sites alone since that method still returns a raw wp.array (no .torch accessor).

- Add a GPU-count pre-flight step to the test-fabric-multi-gpu CI job so a runner regression to a single GPU fails the workflow instead of silently skipping every cuda:1 test. This is what the comment in _skip_if_unavailable already promised existed.
- Note that the sys.argv strip in test_views_xform_prim_fabric.py must stay between the AppLauncher import and its instantiation; any CLI parser or reordering re-exposes Kit to pytest argv and segfaults at startup.
- Document the _fabric_usd_sync_done side effect on _compose_fabric_transform so callers can see why subsequent getters stop pulling from USD.

The class docstring and __init__ device-param doc still claimed ``cuda:0`` only. Refresh both to note that Fabric acceleration runs on any CUDA index, so the autodoc API page reflects the actual contract.

isaaclab-review-bot

🤖 Isaac Lab Review Bot — Updated Review (`4f262aa`)

Commit: 4f262aa6710b19679b5ab94015f0dde9a4fed38b
Previous review: 556b74b (workflow separation in progress)

📋 What Changed Since Last Review

Commit 4f262aa finalizes the workflow separation with a clean split:

Change	Description
`test-fabric-multi-gpu.yaml`	✅ New dedicated workflow (60 lines) — self-contained CI for Fabric tests
`test-multi-gpu.yaml`	✅ Restored to upstream/develop (removed Fabric test job)
`fabric_frame_view.py`	Minor: relocated TODO comments
`changelog.d/*.rst`	Simplified wording
`test_views_xform_prim_fabric.py`	Style cleanup only

Key improvement: Complete workflow separation. FabricFrameView changes now trigger only test-fabric-multi-gpu.yaml (via path filter), while test-multi-gpu.yaml returns to its upstream state for distributed-training validation. The two workflows are completely decoupled.

✅ Full PR Summary

This PR removes the cuda:0-only restriction from FabricFrameView, enabling Fabric GPU acceleration on any CUDA device. This unblocks distributed training where each rank is pinned to a non-primary GPU (e.g., cuda:1).

🔍 Code Review

Architecture:

✅ Clean removal of _fabric_supported_devices allowlist and associated guards
✅ Minimal, surgical change — core write paths unchanged
✅ Well-scoped TODO comments reference follow-up PRs (#5673, #5674)
✅ Docstrings updated to reflect multi-GPU support

Error Handling:

✅ RuntimeError replaces assert for topology-change invariant (survives python -O)
✅ _skip_if_unavailable() gracefully skips tests on single-GPU runners

Test Coverage:

✅ Three cuda:1-parameterized tests: roundtrip poses, no-writeback, scales roundtrip
✅ New multi_gpu pytest marker registered in pyproject.toml
✅ Kit argv stripping prevents segfault from pytest flags
✅ Uses .torch accessor instead of deprecated wp.to_torch()

CI Design:

✅ test-fabric-multi-gpu.yaml — dedicated 60-line workflow with GPU pre-flight
✅ Path-filtered triggers: only runs on changes to FabricFrameView or its tests
✅ Runner pre-flight (torch.cuda.device_count() >= 2) fails loudly if misconfigured
✅ test-multi-gpu.yaml byte-identical to upstream/develop — zero risk of regression

🚦 CI Status

Check	Status
Build Wheel	✅ Pass
changelog fragments	✅ Pass
Broken Links	✅ Pass
labeler	✅ Pass
pre-commit	⏳ Pending
FabricFrameView multi-GPU tests	⏳ Pending (self-hosted runner)
license-check	⏳ Pending
Installation Tests	⏳ Pending
Docs	⏳ Pending

📝 Verdict

LGTM — ready for human review. The workflow separation is now complete and clean. Core functionality is well-tested with appropriate CI coverage. The change is low-risk: cuda:0 path is unchanged, and the new cuda:1 tests provide explicit coverage. Once the pending CI checks pass (particularly the multi-GPU tests on the self-hosted runner), this PR is ready to merge.

Reviewed at: 2026-05-20T16:25 UTC

isaaclab-review-bot

🤖 Isaac Lab Review Bot — Updated Review (`a039313`)

Commit: a03931317a2be2486357370d03326e9c398f86ef
Previous review: c42ec86 → Now fully finalized

📋 What Changed Since Last Review

Two commits since c42ec86:

Commit	Change
`60d99b86`	Split FabricFrameView multi-GPU tests into dedicated workflow
`a0393131`	Disable multi-GPU workflow (no self-hosted runner available)

Key changes:

Workflow separation complete — test-fabric-multi-gpu.yaml is now a standalone workflow file, completely decoupled from test-multi-gpu.yaml
Workflow disabled — Multi-GPU workflow set to workflow_dispatch only with clear comment explaining why (no multi-GPU runner provisioned yet)
Pytest marker removed — multi_gpu marker dropped from pyproject.toml; tests now use ISAACLAB_TEST_MULTI_GPU=1 env var gate via @pytest.mark.skipif
Changelog simplified — Removed extraneous "Changed" section, now focuses purely on the core fix
Code cleanup — Removed redundant comments in fabric_frame_view.py, uses torch.as_tensor() consistently in tests

✅ Full PR Summary

This PR removes the cuda:0-only restriction from FabricFrameView, enabling Fabric GPU acceleration on any CUDA device (cuda:0, cuda:1, etc.). This unblocks distributed training where each rank is pinned to a non-primary GPU.

🔍 Code Review

Architecture:

✅ Clean removal of _fabric_supported_devices allowlist
✅ Minimal, surgical change — core Warp kernel paths unchanged
✅ TODO comments reference follow-up PRs (#5673, #5674)
✅ Docstrings updated for multi-GPU support

Test Coverage:

✅ Three cuda:1-parameterized tests gated by ISAACLAB_TEST_MULTI_GPU=1
✅ _skip_if_unavailable() gracefully skips on single-GPU runners
✅ Uses torch.as_tensor() for Warp→Torch conversion (consistent API)

CI Design:

✅ test-fabric-multi-gpu.yaml — dedicated workflow (disabled until runner available)
✅ Path-filtered triggers ready for when runner is provisioned
✅ test-multi-gpu.yaml returned to upstream state (zero diff risk)
✅ Env var gating avoids pytest marker complexity

🚦 CI Status

Check	Status
pre-commit	✅ Pass
Check changelog fragments	✅ Pass
Build Wheel	✅ Pass
Check for Broken Links	✅ Pass
Detect Changes	✅ Pass
labeler	✅ Pass
Load Config	✅ Pass
Installation Tests	⏳ Pending
Build Latest Docs	⏳ Pending
license-check	⏳ Pending

📝 Verdict

LGTM — ready for human review. The workflow separation is clean and complete. The PR is low-risk:

cuda:0 behavior unchanged
New cuda:1 tests provide explicit coverage (will run when multi-GPU runner is provisioned)
Multi-GPU workflow correctly disabled to avoid queue-indefinitely issue

Once CI passes, this is ready to merge.

Reviewed at: 2026-05-20T19:34 UTC

Move the test-fabric-multi-gpu job out of test-multi-gpu.yaml and into a dedicated test-fabric-multi-gpu.yaml. The two workflows share the same runner label, install step, and GPU pre-flight, but trigger on disjoint path sets so changes to FabricFrameView no longer gate the distributed-training validation and vice versa. test-multi-gpu.yaml is now byte-identical to upstream/develop.

No self-hosted runner with the 'multi-gpu' label is registered. All runs queue indefinitely. Kept as workflow_dispatch only so it can be manually triggered once a runner is provisioned. See also .github/workflows/test-multi-gpu.yaml (same issue).

pv-nvidia requested review from hhansen-bdai, kellyguo11 and pascal-roth as code owners May 6, 2026 12:30

github-actions Bot added isaac-lab Related to Isaac Lab team infrastructure labels May 6, 2026

pv-nvidia marked this pull request as draft May 6, 2026 12:32

pv-nvidia self-assigned this May 6, 2026

greptile-apps Bot reviewed May 6, 2026

Comment thread source/isaaclab_physx/isaaclab_physx/sim/views/fabric_frame_view.py

Comment thread source/isaaclab_physx/isaaclab_physx/sim/views/fabric_frame_view.py

pv-nvidia force-pushed the feat/frame-view-enable-mgpu branch 4 times, most recently from a6cd73e to 2c619fe Compare May 7, 2026 08:44

pv-nvidia marked this pull request as ready for review May 11, 2026 11:29

greptile-apps Bot reviewed May 11, 2026

Comment thread .github/workflows/test-multi-gpu.yaml Outdated

pv-nvidia changed the title ~~Feat/frame view enable mgpu~~ Enable mgpu in FrameView May 12, 2026

pv-nvidia changed the title ~~Enable mgpu in FrameView~~ pref: Enable mgpu in FrameView May 12, 2026

pv-nvidia added the enhancement New feature or request label May 12, 2026

pv-nvidia force-pushed the feat/frame-view-enable-mgpu branch 4 times, most recently from 1c2e02d to 8de9a39 Compare May 17, 2026 22:23

pv-nvidia force-pushed the feat/frame-view-enable-mgpu branch from 8de9a39 to e206ba9 Compare May 20, 2026 14:11

pv-nvidia added 5 commits May 20, 2026 15:35

Update FabricFrameView docstrings for multi-GPU support

711264b

The class docstring and __init__ device-param doc still claimed ``cuda:0`` only. Refresh both to note that Fabric acceleration runs on any CUDA index, so the autodoc API page reflects the actual contract.

pv-nvidia force-pushed the feat/frame-view-enable-mgpu branch from e206ba9 to 96f159e Compare May 20, 2026 15:38

pv-nvidia force-pushed the feat/frame-view-enable-mgpu branch 2 times, most recently from ffb3e91 to f4dd500 Compare May 20, 2026 15:53

pv-nvidia force-pushed the feat/frame-view-enable-mgpu branch from f4dd500 to cf57d31 Compare May 20, 2026 16:00

pv-nvidia force-pushed the feat/frame-view-enable-mgpu branch from cf57d31 to a7a6956 Compare May 20, 2026 16:07

isaaclab-review-bot Bot reviewed May 20, 2026

pv-nvidia force-pushed the feat/frame-view-enable-mgpu branch from a7a6956 to 556b74b Compare May 20, 2026 16:22

isaac-sim deleted a comment from isaaclab-review-bot Bot May 20, 2026

pv-nvidia force-pushed the feat/frame-view-enable-mgpu branch from 556b74b to 4f262aa Compare May 20, 2026 16:24

greptile-apps Bot reviewed May 20, 2026

Comment thread source/isaaclab_physx/test/sim/test_views_xform_prim_fabric.py Outdated

pv-nvidia force-pushed the feat/frame-view-enable-mgpu branch from 4f262aa to cc1d789 Compare May 20, 2026 16:34

pv-nvidia force-pushed the feat/frame-view-enable-mgpu branch from cc1d789 to c42ec86 Compare May 20, 2026 16:37

isaaclab-review-bot Bot reviewed May 20, 2026

pv-nvidia force-pushed the feat/frame-view-enable-mgpu branch from c42ec86 to 60d99b8 Compare May 20, 2026 16:46

greptile-apps Bot reviewed May 20, 2026

Comment thread source/isaaclab_physx/test/sim/test_views_xform_prim_fabric.py

pbarejko approved these changes May 20, 2026

kellyguo11 changed the title ~~pref: Enable mgpu in FrameView~~ Enable mgpu in FrameView May 20, 2026

kellyguo11 merged commit aa19b08 into isaac-sim:develop May 20, 2026
64 of 65 checks passed

isaaclab-review-bot Bot mentioned this pull request May 21, 2026

Implements passive fixed tendons with mjwarp #5522

Open

4 tasks

pv-nvidia mentioned this pull request May 21, 2026

feat: Full Fabric acceleration stack — local poses, stage cache, fused compose #5728

Draft

hujc7 mentioned this pull request May 21, 2026

[CI] Enable multi-GPU pytest workflow #5738

Draft

4 tasks

Enable mgpu in FrameView #5514
kellyguo11 merged 8 commits into isaac-sim:develop from pv-nvidia:feat/frame-view-enable-mgpu

3 participants