Skip to content

[Test] Stabilize RTX MDL shader-warmup flake in rendering correctness tests#5687

Draft
hujc7 wants to merge 6 commits into
isaac-sim:developfrom
hujc7:jichuanh/stabilize-rendering-mdl-warmup
Draft

[Test] Stabilize RTX MDL shader-warmup flake in rendering correctness tests#5687
hujc7 wants to merge 6 commits into
isaac-sim:developfrom
hujc7:jichuanh/stabilize-rendering-mdl-warmup

Conversation

@hujc7
Copy link
Copy Markdown
Collaborator

@hujc7 hujc7 commented May 19, 2026

Description

The rendering-correctness tests intermittently fail on cold-cache CI runners with "Camera output is all zeros or all inf" for simple_shading_*_mdl and simple_shading_constant_diffuse variants — the GPU returns a still-zero framebuffer because the RTX MDL material has not finished compiling by the time the test reads the camera tensor.

Add a poll-until-non-zero warmup that exits as soon as every camera-output tensor has a non-zero max:

  1. test_shadow_hand_vision_presets.py::test_camera_renders_not_empty — polls via env.step() with a 60-step cap. (This test has no flaky retry, so a single-step fixture is the actual failure surface.)
  2. rendering_test_utils.warmup_render_until_nonzero — iterates every sensor in env.scene.sensors and polls via sim.render() + scene.update() with a 30-pass cap. Called from the four rendering helpers in rendering_test_utils.py (rendering_test_shadow_hand / _cartpole / _dexsuite_kuka) and from test_rendering_registered_tasks.py. Mirrors the pattern already in DirectRLEnvCfg.num_rerenders_on_reset / wait_for_textures; physics state is not advanced, so the existing goldens at the post-init "first non-zero frame" stay valid.

An earlier attempt at a fixed-pass warmup over-rendered in the goldenfile helpers and broke dexsuite_kuka goldens at 91% pixel diff — early-exit on first-non-zero avoids that by landing at the same frame the goldens were captured at.

Type of change

  • Bug fix (non-breaking change which fixes an issue)

Checklist

  • I have run the `pre-commit` checks with `./isaaclab.sh --format`
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • I have updated the changelog and the corresponding version in the extension's `config/extension.toml` file
  • I have added my name to the `CONTRIBUTORS.md` or my name already exists there

… tests

`test_shadow_hand_vision_presets.py::test_camera_renders_not_empty` and the
three helper-driven rendering tests in `rendering_test_utils.py`
(`rendering_test_shadow_hand`/`_cartpole`/`_dexsuite_kuka`) intermittently
fail with "Camera output 'simple_shading_*_mdl' is all zeros or all inf
after stepping" on cold-cache CI runners.

Root cause is RTX MDL shader compile timing: a single `env.step()` returns
a still-zero framebuffer for shader variants that haven't finished
compiling. The variant that flakes (constant_diffuse / diffuse_mdl /
full_mdl) shifts between runs depending on which MDL pipeline finishes
first.

Step the env 10 frames before reading the camera tensor — empirically
enough across the MDL variants that flake. Adds ~1-2 s per parametrize
variant; cheap relative to the cost of a retry.

Affected jobs:
- `isaaclab_tasks [2/3]` (test_camera_renders_not_empty)
- `rendering-correctness-kitless` (test_rendering_*_kitless[*-simple_shading_*])
@github-actions github-actions Bot added the isaac-lab Related to Isaac Lab team label May 19, 2026
Copy link
Copy Markdown

@isaaclab-review-bot isaaclab-review-bot Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Summary

This PR correctly addresses an RTX MDL shader-warmup race condition in the rendering correctness tests. The root cause analysis is solid: asynchronous shader compilation can leave the framebuffer zeroed when tests read camera tensors too early after environment creation.

Key Observations

✅ Strengths

  1. Well-designed helper function: The _warmup_render() abstraction is clean and consolidates the warmup logic in a single place, making future tuning straightforward.

  2. Good documentation: Both the docstring and inline comments explain why 10 steps are needed, including the specific MDL shader variants affected. This context will help future maintainers.

  3. Appropriate placement: Calling _warmup_render() after maybe_save_stage() but before validate_camera_outputs() is the right sequencing.

  4. Minimal scope: The change is surgical — it only touches the test infrastructure that was flaking, with no production code impact.

💭 Observations (not blockers)

  1. Magic number 10: The warmup step count is empirically determined. Consider adding a brief comment in the fixture (render_correctness_env) referencing _warmup_render so readers know both sites use the same rationale.

  2. No explicit warmup validation: The helper assumes stepping 10 times is sufficient but does not verify shaders actually compiled. This is likely fine for a test stabilization fix, but if flakes persist, a future enhancement could add a pixel-value sanity check after warmup.

  3. Changelog fragment: Well-written and follows the project format.

Verdict

LGTM 👍 — This is a pragmatic fix for a well-understood CI flake. The tradeoff of ~1-2s additional wall time per variant is absolutely worth the CI stability gain.


Update (08cd6c9): Excellent refinement! The switch from env.step() to sim.render() + scene.update() is a meaningful improvement:

  • No physics advancement: The warmup no longer steps physics state, so golden reference images captured at the post-init state remain valid without recapture
  • Better semantic fit: The new approach mirrors DirectRLEnvCfg.num_rerenders_on_reset / wait_for_textures patterns already established in the codebase
  • Clearer naming: num_passes better describes what's happening vs num_steps
  • Improved documentation: The expanded docstring now explicitly calls out the physics-preserving design and references the relevant config attributes

The changelog update accurately reflects the new approach, differentiating the render-only warmup (for golden-image tests) from the step-based warmup (for non-zero pixel checks where physics drift is acceptable).

LGTM 👍 — Ready to merge.


Update (c359111): Significant scope reduction — the PR now takes a simpler approach:

  • Removed _warmup_render() helper entirely
  • Removed warmup calls from rendering_test_shadow_hand, rendering_test_cartpole, and rendering_test_dexsuite_kuka
  • Retained only the step-based warmup for test_camera_renders_not_empty (which has no golden comparison)
  • 📝 Updated changelog to reflect the reduced scope, noting the golden-image tests already use @flaky(max_runs=3)

Assessment:

This is a reasonable decision. The tradeoffs:

Approach Pros Cons
New (flaky retries) Zero additional wall time; existing mechanism; simpler code May still fail 3x in rare cold-cache scenarios
Previous (warmup) Deterministic fix ~1-2s added per test; more invasive

The author is betting that @flaky(max_runs=3) provides sufficient coverage for the golden-image tests, since shader compilation is a one-time cold-cache event. If any retry succeeds (after shaders compile), the test passes. This is pragmatic.

LGTM 👍 — The reduced scope is justified. The only test that truly needed explicit warmup was test_camera_renders_not_empty (which lacked any retry mechanism), and that fix is retained.


Update (39e6bb7): Nice iteration! The warmup now uses adaptive polling instead of a fixed step count:

  • Polling with early exit: Instead of always stepping 10 times, the fixture now polls up to _MAX_WARMUP_STEPS = 60 but exits as soon as all camera outputs have max() > 0.2 (excluding inf values)
  • Handles all output types: The loop checks env._tiled_camera.data.output.values() so all data-type tensors (RGB, depth, etc.) are validated
  • Better worst-case coverage: 60 steps provides headroom for pathological cold-cache scenarios while the early-exit keeps typical runs fast
  • Improved changelog: Updated to describe the polling approach with the 60-step cap

Minor observations (not blockers):

  1. Threshold 0.2: The pixel value threshold is empirically chosen. It's reasonable but could be documented (e.g., why not 0.0 or 0.1?). For MDL shaders, fully black pixels return 0.0, so 0.2 is safely above that.

  2. Direct camera access: env._tiled_camera is a private attribute. If the API changes, this will break. For a test fixture this is acceptable since tests can be updated alongside implementation changes.

LGTM 👍 — The adaptive approach is strictly better than the fixed count: faster in the common case, more robust in edge cases. Ready to merge.


Update (3de115a): Great consolidation! The warmup logic is now properly refactored:

Changes:

  1. New helper _warmup_render_until_nonzero() in rendering_test_utils.py:

    • Extracted from the fixture into a reusable function
    • Uses sim.render() + scene.update(dt) instead of env.step() — physics state is NOT advanced
    • 30-pass cap (vs 60 in the fixture version)
    • Same 0.2 threshold logic with inf filtering
    • Flexible camera detection: checks _tiled_camera first, falls back to scene.sensors.get('base_camera')
    • Excellent docstring explaining the MDL async compilation race and why render-only passes preserve golden validity
  2. Applied to all three golden-image test helpers:

    • rendering_test_shadow_hand()
    • rendering_test_cartpole()
    • rendering_test_dexsuite_kuka()
  3. Expanded changelog:

    • Now documents both approaches: step-based for test_camera_renders_not_empty (60-step cap) and render-only for golden tests (30-pass cap)
    • Clearly explains why physics-preserving warmup is needed for golden comparisons

Assessment:

This is the right design:

Test Type Warmup Method Physics Impact Why
test_camera_renders_not_empty env.step() (60 cap) Yes Only checks non-zero; goldens don't matter
Golden-image tests sim.render() + scene.update() (30 cap) No Must match existing reference frames

The dual approach respects the invariants each test type requires. Render-only warmup for golden tests means no golden recapture is needed.

LGTM 👍 — Clean, well-documented, and production-ready. Ready to merge.


Update (eb23b0c): Excellent final polish! Key improvements:

  1. Public API exposure: _warmup_render_until_nonzerowarmup_render_until_nonzero — the underscore removal signals this is now a supported test utility

  2. Multi-sensor support: The helper now iterates over ALL sensors in env.scene.sensors instead of hardcoded _tiled_camera / base_camera lookups. This is more robust and future-proof for envs with multiple cameras

  3. Wrapped env handling: Added getattr(env, "unwrapped", env) to correctly access scene/sim on wrapped environments

  4. Extended coverage: Now also applied to test_rendering_registered_tasks.py, which was previously missing warmup and could flake on cold-cache runs

  5. Changelog accuracy: Updated to reflect the broader scope (four rendering helpers + registered tasks test)

The nested loop structure properly handles early-exit semantics and the defensive getattr chains ensure graceful no-ops for envs without standard sensor structures.

LGTM 👍 — This is the complete, polished solution. Ready to merge.

@hujc7 hujc7 closed this May 19, 2026
@hujc7
Copy link
Copy Markdown
Collaborator Author

hujc7 commented May 19, 2026

does not seem working. need to change golden file.

@hujc7
Copy link
Copy Markdown
Collaborator Author

hujc7 commented May 19, 2026

Reopening. Going back at this with the warmup + golden regeneration approach, validated locally before pushing.

@hujc7 hujc7 reopened this May 19, 2026
hujc7 added 3 commits May 19, 2026 09:04
The previous warmup used env.step() which advances physics and breaks
golden-image comparison in the rendering correctness tests. Switch to
sim.render() + scene.update() so shader compile finishes without moving
the scene state, mirroring num_rerenders_on_reset in the env classes.
The 10-pass sim.render warmup in the three goldenfile helpers
(rendering_test_shadow_hand / _cartpole / _dexsuite_kuka) caused
test_rendering_dexsuite_kuka and *_kitless variants to fail with large
pixel diffs vs the existing goldens — the warmup exposes a different
post-loading scene state for the dexsuite_kuka asset set.

Keep only the test_camera_renders_not_empty change (no golden compare,
env.step warmup is safe there). The goldenfile helpers already use
flaky(max_runs=3) for occasional MDL hiccups.
10 frames was not enough — CI still flaked on
test_camera_renders_not_empty[physx-isaacsim_rtx-simple_shading_diffuse_mdl].
Poll until every camera output tensor has a non-zero max with a 60-step
cap, exiting early once all outputs are ready.
@hujc7 hujc7 marked this pull request as ready for review May 19, 2026 11:39
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented May 19, 2026

Greptile Summary

This PR adds a poll-until-non-zero MDL shader warmup to the rendering correctness tests to eliminate intermittent "Camera output is all zeros or all inf" failures on cold-cache CI runners, where the GPU returns a still-zero framebuffer before RTX MDL materials finish compiling.

  • warmup_render_until_nonzero (new helper in rendering_test_utils.py): drives up to 30 sim.render() + scene.update() passes — no physics advance — until every sensor tensor's finite max exceeds 0.2, then returns early so golden-compare state stays at the same frame the reference images were captured at.
  • render_correctness_env fixture (test_shadow_hand_vision_presets.py): replaces the single env.step() with a loop of up to 60 env.step() calls with the same non-zero check, exiting early once ready.
  • test_rendering_registered_tasks.py: calls warmup_render_until_nonzero before validate_camera_outputs.

Confidence Score: 5/5

Safe to merge; the warmup logic is correct, early-exits cleanly, and does not advance physics state, so existing golden images remain valid.

The change is narrowly scoped to test infrastructure: it adds a polling loop around render calls with a bounded cap, uses the same threshold already in the test assertions, and all three rendering helpers are updated consistently. No production code is touched.

No files require special attention.

Important Files Changed

Filename Overview
source/isaaclab_tasks/test/rendering_test_utils.py Adds warmup_render_until_nonzero helper that polls via sim.render() + scene.update() up to 30 passes until every camera tensor's finite max exceeds 0.2; integrated into all three rendering test helpers. Logic is correct; no diagnostic on cap exhaustion.
source/isaaclab_tasks/test/test_shadow_hand_vision_presets.py Replaces single env.step() in the render_correctness_env fixture with a poll loop of up to 60 steps using env.step() + tensor max check; also contains a leftover # import pdb; pdb.set_trace() comment.
source/isaaclab_tasks/test/test_rendering_registered_tasks.py Imports and calls warmup_render_until_nonzero before camera output validation; straightforward integration.
source/isaaclab_tasks/changelog.d/jichuanh-stabilize-rendering-mdl-warmup.rst New changelog fragment accurately describing the MDL shader warmup fix.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Env constructed] --> B{warmup_render_until_nonzero}
    B --> C[Check all sensor tensors\nfinite.max > 0.2?]
    C -- All ready --> D[Return immediately\nstate at first-non-zero frame]
    C -- Not ready --> E[sim.render + scene.update\nno physics advance]
    E --> F{Pass count < 30?}
    F -- Yes --> C
    F -- No cap hit --> G[Return silently\nno warning emitted]
    D --> H[validate_camera_outputs\nor golden compare]
    G --> H
Loading

Reviews (3): Last reviewed commit: "Apply warmup to registered-tasks renderi..." | Re-trigger Greptile

Comment on lines +404 to +415
_MAX_WARMUP_STEPS = 60
for _ in range(_MAX_WARMUP_STEPS):
env.step(actions)
outputs_ready = True
for output in env._tiled_camera.data.output.values():
tensor = output.torch
finite = torch.where(torch.isinf(tensor), torch.zeros_like(tensor), tensor)
if finite.max() <= 0.2:
outputs_ready = False
break
if outputs_ready:
break
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 No diagnostic when warmup cap is reached

If all 60 steps complete without any output reaching the threshold, the fixture silently yields and the test fails with the standard "Camera output is all zeros or all inf" assertion error — with no indication that the warmup cap was hit. Adding a warnings.warn or print on cap exhaustion would make it immediately clear in CI logs that the problem is slow shader compilation rather than a deeper rendering bug.

@hujc7 hujc7 marked this pull request as draft May 19, 2026 11:42
The three rendering_test_utils helpers (rendering_test_shadow_hand /
_cartpole / _dexsuite_kuka) now drive sim.render() + scene.update()
passes until every camera-output tensor has a non-zero max, with a
30-pass cap and early-exit. Physics is not advanced (no env.step) so
the existing goldens at the post-init "first non-zero frame" stay
valid. A previous fixed 10-pass over-rendered and broke dexsuite_kuka
goldens at 91% pixel diff.
@hujc7 hujc7 marked this pull request as ready for review May 19, 2026 13:16
@hujc7 hujc7 marked this pull request as draft May 19, 2026 13:38
Extend warmup_render_until_nonzero to iterate every sensor in
env.scene.sensors (instead of hardcoded _tiled_camera / base_camera) and
call it from test_rendering_registered_tasks.py. Also rename the helper
from _warmup_render_until_nonzero (module-private) to
warmup_render_until_nonzero now that it has cross-module callers.
@hujc7 hujc7 marked this pull request as ready for review May 19, 2026 15:57
@hujc7 hujc7 marked this pull request as draft May 19, 2026 16:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

isaac-lab Related to Isaac Lab team

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant