Skip to content

[Diag][DO NOT MERGE] Negative arm — numpy 2.3.5 atfork crash (no 5656)#5655

Draft
hujc7 wants to merge 1 commit into
isaac-sim:developfrom
hujc7:jichuanh/validate-numpy-pin-5642
Draft

[Diag][DO NOT MERGE] Negative arm — numpy 2.3.5 atfork crash (no 5656)#5655
hujc7 wants to merge 1 commit into
isaac-sim:developfrom
hujc7:jichuanh/validate-numpy-pin-5642

Conversation

@hujc7
Copy link
Copy Markdown
Collaborator

@hujc7 hujc7 commented May 16, 2026

Purpose

Negative arm of the single-variable A/B that validates #5656.

Companion positive arm: #5684.

Setup

  • Branch base: develop (a9b62101ca6), no other workarounds.
  • One commit on top: source/conftest.py that prints the resolved numpy version and bundled OpenBLAS .so filename at pytest session start.
  • No isaaclab.sh --install overrides, no force-reinstall, no action.yml edits.

git diff jichuanh/validate-numpy-pin-5642..jichuanh/validate-numpy-2-3-4-worstcase between this branch and the positive arm equals exactly 5656's 10-site numpy!=2.3.5 exclusion — nothing else.

Expected result

  • Install resolves to numpy 2.3.5 (cmeel-boost cap funnels to 2.3.x; pip picks the highest available).
  • [dep-manifest] log line shows numpy 2.3.5 + libscipy_openblas64_-fdde5778.so.
  • Canary jobs SIGSEGV in libomni.platforminfo fork():
    • isaaclab_physx
    • isaaclab_newton
    • isaaclab_rl

If any canary passes, the negative test is invalid (something else on develop is already deferring/masking the bug — investigate before trusting the positive arm).

Lifecycle

Diagnostic-only. Do not merge. Closes once evidence is captured.

Related

@github-actions github-actions Bot added isaac-lab Related to Isaac Lab team infrastructure labels May 16, 2026
Copy link
Copy Markdown

@isaaclab-review-bot isaaclab-review-bot Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

📋 Code Review Summary

This PR adds diagnostic instrumentation to validate that PR #5642's numpy!=2.3.5 pin is working correctly in CI. The approach is sound for gathering evidence from real CI logs.

✅ What Looks Good

  1. Consistent dependency pinning: The numpy>=2,!=2.3.5 constraint is applied uniformly across all four packages (isaaclab, isaaclab_tasks, isaaclab_rl, isaaclab_visualizers), ensuring no package inadvertently pulls in the problematic version.

  2. Well-documented: The commit messages, changelog fragments, and inline comments thoroughly explain the root cause (OpenBLAS pthread_atfork handler crash) with references to upstream issues.

  3. Appropriate diagnostic scope: The CI instrumentation prints version info without modifying test behavior—suitable for validation purposes.

📝 Suggestions

  1. Confirm removal before merge: The PR description notes this is diagnostic-only and should not be merged. Ensure the CI diagnostic block in .github/actions/run-tests/action.yml (lines 231-241) is removed before any final merge of the underlying fix.

  2. Consider adding a CI check assertion (optional): For stronger validation, the diagnostic could be modified to fail explicitly if numpy 2.3.5 or the -fdde5778 OpenBLAS hash is detected:

    if [[ "$NUMPY_VERSION" == "2.3.5" ]]; then
      echo "FATAL: numpy 2.3.5 detected despite exclusion pin"
      exit 1
    fi

    However, since this is temporary diagnostic code, the current passive logging approach is reasonable.

  3. isaaclab_rl constraint change: Note that isaaclab_rl/setup.py previously had an unconstrained numpy (no version specifier), while others had numpy>=2. The new constraint numpy>=2,!=2.3.5 is appropriate, but this effectively tightens the lower bound for that package—this should be intentional and is documented in the changelog.

🔍 Verification Steps

After CI completes, search the GitHub Actions logs for:

=== Dep manifest (numpy/scipy/openblas) ===

Confirm that:

  • numpy version is NOT 2.3.5
  • OpenBLAS filename does NOT contain -fdde5778

Overall, this is a clean diagnostic PR that serves its purpose of validating the numpy pin fix.


Update (c2d7833): Reviewed incremental changes. Two files modified:

1. .github/actions/run-tests/action.yml

  • Change: Reorders pytest execution to run before the dependency manifest diagnostic.
  • Why: The diagnostic imports numpy, which registers numpy's vendored OpenBLAS pthread_atfork handler. Running the diagnostic first would contaminate the pytest process and trigger Kit's libomni.platforminfo fork() SIGSEGV.
  • Implementation: Uses set +e / set -e with PYTEST_EXIT to capture pytest's exit code, allowing the diagnostic to still emit even on test failure, then exits with the original pytest status.
  • ✅ This is a sound fix—ensures diagnostics don't interfere with test execution.

2. source/isaaclab/isaaclab/cli/commands/install.py

  • Change: Adds forced numpy upgrade to ≥2.4.1 after pink-ik dependencies are installed.
  • Why: cmeel-boost caps numpy<2.4, but pip's resolution might pull numpy 2.3.5 (with the broken OpenBLAS). The explicit upgrade bypasses this cap.
  • Safety note: Comment explains numpy's stable C ABI keeps cmeel's compiled extensions working despite the version bump.
  • Missing return: A return statement was added to the error path, which is correct to avoid executing the numpy upgrade after a pink-ik install failure.
  • ✅ Appropriate defensive fix with clear documentation.

Overall: Changes are well-reasoned and address the root cause properly. The execution ordering fix in CI is particularly important for avoiding intermittent SIGSEGVs.


Update (f8e1633): Reviewed incremental changes.

.github/actions/run-tests/action.yml

  • Change: Reverts the pytest-before-diagnostic ordering from the previous commit. Now runs diagnostic first, then pytest.
  • Removed: The set +e / PYTEST_EXIT / set -e pattern and the early "Starting pytest" echo.
  • New flow: Diagnostic → pytest (simpler structure).

⚠️ Concern: This change re-introduces the ordering that was specifically avoided in c2d7833. The previous commit's rationale was:

"importing numpy in this shell beforehand registers numpy's vendored OpenBLAS pthread_atfork handler in the wrong process, which trips Kit's libomni.platforminfo fork() and SIGSEGVs"

Running the diagnostic before pytest may still trigger this issue unless:

  1. The diagnostic runs in a subprocess (via ./isaaclab.sh -p -c) that doesn't affect the parent shell's state
  2. Or the numpy pin has already eliminated the problematic OpenBLAS

Likely intentional: Since the diagnostic uses ./isaaclab.sh -p -c (a subprocess), the parent shell where pytest runs should not inherit the handler registration. The c2d7833 ordering change may have been overly cautious. This simplification is likely safe.

✅ Code is cleaner now, and the diagnostic isolation via subprocess should prevent contamination.

hujc7 added a commit to hujc7/IsaacLab that referenced this pull request May 16, 2026
The setup.py constraint "numpy>=2,!=2.3.5" landed in isaac-sim#5642 is silently
overridden during isaaclab.sh --install: each pip install -e <submodule>
runs an independent resolve, and the final pin-pink force-reinstall in
_ensure_pink_ik_dependencies_installed lands on numpy 2.3.5 because pip
sees only pin-pink's own deps (numpy>=1.19) plus cmeel-boost's numpy<2.4
cap.  numpy 2.3.5 ships a vendored OpenBLAS
(libscipy_openblas64_-fdde5778.so) whose pthread_atfork handler crashes
Kit's libomni.platforminfo fork() during SimulationApp startup.

Two changes, both restating an explicit "pip install --upgrade
numpy>=2.4.1" as the *last* pip invocation in each install path:

1. _ensure_numpy_above_openblas_atfork_bug() in install.py — runs
   unconditionally at the end of --install (not gated by the pink-ik
   probe outcome), so upgrades on an already-functioning env also
   pull numpy forward.
2. Dockerfile.curobo — apply the same upgrade after its post-install
   steps (nvidia-curobo + isaaclab_teleop editable install), which
   otherwise drag numpy back to 2.3.5 via dex-retargeting -> pin ->
   cmeel-boost.

pip prints a resolver warning about cmeel-boost's cap then installs
numpy 2.4.5 anyway.  numpy 2.4.1+ ships the upstream OpenBLAS atfork
fix, so the entire 2.3.x risk class is bypassed.  numpy's stable C
ABI keeps cmeel's compiled extensions (libpinocchio, libcoal, ...)
working at runtime.

Validated:
- env_isaaclab_test smoke test (numpy 2.4.5 + cmeel pinocchio + pink +
  daqp + qpsolvers all import; toy IK solve OK).
- IsaacLab Pink IK unit tests: 54/54 pass against numpy 2.4.5
  (test_pink_ik_components 21/21, test_local_frame_task 24/24,
   test_null_space_posture_task 9/9).
- PR isaac-sim#5655 (validation): every base-image test job reports numpy 2.4.5
  + openblas -32a4b2a6 (clean, not the broken -fdde5778).  Worst-case
  import order (numpy imported before pytest spawns Kit) also passes —
  confirming the upstream atfork fix is real, not just dodge-by-order.

Related: numpy/numpy#30092, OpenMathLib/OpenBLAS#5520
hujc7 added a commit to hujc7/IsaacLab that referenced this pull request May 16, 2026
The setup.py constraint "numpy>=2,!=2.3.5" landed in isaac-sim#5642 is silently
overridden during isaaclab.sh --install: each pip install -e <submodule>
runs an independent resolve, and the final pin-pink force-reinstall in
_ensure_pink_ik_dependencies_installed lands on numpy 2.3.5 because pip
sees only pin-pink's own deps (numpy>=1.19) plus cmeel-boost's numpy<2.4
cap.  numpy 2.3.5 ships a vendored OpenBLAS
(libscipy_openblas64_-fdde5778.so) whose pthread_atfork handler crashes
Kit's libomni.platforminfo fork() during SimulationApp startup.

Two changes, both restating an explicit "pip install --upgrade
numpy>=2.4.1" as the *last* pip invocation in each install path:

1. _ensure_numpy_above_openblas_atfork_bug() in install.py — runs
   unconditionally at the end of --install (not gated by the pink-ik
   probe outcome), so upgrades on an already-functioning env also
   pull numpy forward.
2. Dockerfile.curobo — apply the same upgrade after its post-install
   steps (nvidia-curobo + isaaclab_teleop editable install), which
   otherwise drag numpy back to 2.3.5 via dex-retargeting -> pin ->
   cmeel-boost.

pip prints a resolver warning about cmeel-boost's cap then installs
numpy 2.4.5 anyway.  numpy 2.4.1+ ships the upstream OpenBLAS atfork
fix, so the entire 2.3.x risk class is bypassed.  numpy's stable C
ABI keeps cmeel's compiled extensions (libpinocchio, libcoal, ...)
working at runtime.

Validated:
- env_isaaclab_test smoke test (numpy 2.4.5 + cmeel pinocchio + pink +
  daqp + qpsolvers all import; toy IK solve OK).
- IsaacLab Pink IK unit tests: 54/54 pass against numpy 2.4.5
  (test_pink_ik_components 21/21, test_local_frame_task 24/24,
   test_null_space_posture_task 9/9).
- PR isaac-sim#5655 (validation): every base-image test job reports numpy 2.4.5
  + openblas -32a4b2a6 (clean, not the broken -fdde5778).  Worst-case
  import order (numpy imported before pytest spawns Kit) also passes —
  confirming the upstream atfork fix is real, not just dodge-by-order.

Related: numpy/numpy#30092, OpenMathLib/OpenBLAS#5520
hujc7 added a commit to hujc7/IsaacLab that referenced this pull request May 18, 2026
numpy 2.3.5 ships a vendored OpenBLAS
(libscipy_openblas64_-fdde5778.so) whose pthread_atfork handler crashes
Kit's libomni.platforminfo fork() during SimulationApp startup. The
release is excluded at every site that pulls numpy directly or
transitively, so no pip resolve during isaaclab.sh --install or any
Docker image build can land on it -- even transiently:

  source/isaaclab/setup.py
  source/isaaclab_tasks/setup.py
  source/isaaclab_rl/setup.py
  source/isaaclab_visualizers/setup.py
  source/isaaclab_teleop/setup.py        (transitive via dex-retargeting)
  source/isaaclab_mimic/setup.py         (transitive via h5py)
  isaaclab.cli.commands.install._ensure_pink_ik_dependencies_installed
  isaaclab.cli.commands.install._maybe_preinstall_arm_nlopt
  docker/Dockerfile.base                 (ARM nlopt prep)
  docker/Dockerfile.curobo               (ARM nlopt prep + nvidia-curobo install)

Each touchpoint adds only the ``!=2.3.5`` exclusion; no other version
constraints are introduced.

Validated:
- env_isaaclab_test smoke test (numpy 2.4.5 + cmeel pinocchio + pink + daqp
  + qpsolvers all import; toy IK solve OK).
- IsaacLab Pink IK unit tests: 54/54 pass against numpy 2.4.5.
- PR isaac-sim#5655 worst-case run (diagnostic imports numpy before pytest spawns
  Kit, the order that originally crashed): 36 pass / 0 fail. The
  isaaclab_physx surface gripper SIGSEGV is gone.

Related: numpy/numpy#30092, OpenMathLib/OpenBLAS#5520
@hujc7
Copy link
Copy Markdown
Collaborator Author

hujc7 commented May 18, 2026

Validation/diagnostic-only PR — never intended to merge. The data captured here (numpy 2.3.5 + libscipy_openblas64_-fdde5778.so in every test container after isaaclab.sh --install on the #5642 branch, and 36/0 worst-case re-validation after #5656's fix landed locally) confirmed the fix that's now in #5656. Closing.

Adds source/conftest.py so pytest dumps the resolved numpy version and the
bundled OpenBLAS .so filename at session start. Used by the negative/positive
arm validation PRs to capture which numpy bundle each CI test container ends
up with after isaaclab.sh --install completes.

The conftest.py imports numpy at module load. This is what pytest does
naturally via isaaclab module imports anyway -- making the import explicit
here only adds visibility, not crash conditions.
@hujc7 hujc7 force-pushed the jichuanh/validate-numpy-pin-5642 branch from f8e1633 to 66a2147 Compare May 19, 2026 23:57
hujc7 added a commit to hujc7/IsaacLab that referenced this pull request May 19, 2026
numpy 2.3.5 ships a vendored OpenBLAS
(libscipy_openblas64_-fdde5778.so) whose pthread_atfork handler crashes
Kit's libomni.platforminfo fork() during SimulationApp startup. The
release is excluded at every site that pulls numpy directly or
transitively, so no pip resolve during isaaclab.sh --install or any
Docker image build can land on it -- even transiently:

  source/isaaclab/setup.py
  source/isaaclab_tasks/setup.py
  source/isaaclab_rl/setup.py
  source/isaaclab_visualizers/setup.py
  source/isaaclab_teleop/setup.py        (transitive via dex-retargeting)
  source/isaaclab_mimic/setup.py         (transitive via h5py)
  isaaclab.cli.commands.install._ensure_pink_ik_dependencies_installed
  isaaclab.cli.commands.install._maybe_preinstall_arm_nlopt
  docker/Dockerfile.base                 (ARM nlopt prep)
  docker/Dockerfile.curobo               (ARM nlopt prep + nvidia-curobo install)

Each touchpoint adds only the ``!=2.3.5`` exclusion; no other version
constraints are introduced.

Validated:
- env_isaaclab_test smoke test (numpy 2.4.5 + cmeel pinocchio + pink + daqp
  + qpsolvers all import; toy IK solve OK).
- IsaacLab Pink IK unit tests: 54/54 pass against numpy 2.4.5.
- PR isaac-sim#5655 worst-case run (diagnostic imports numpy before pytest spawns
  Kit, the order that originally crashed): 36 pass / 0 fail. The
  isaaclab_physx surface gripper SIGSEGV is gone.

Related: numpy/numpy#30092, OpenMathLib/OpenBLAS#5520
@hujc7 hujc7 changed the title [Diag] Validate #5642 keeps numpy!=2.3.5 / openblas!=-fdde5778 [Diag] Negative arm — numpy 2.3.5 atfork crash (no 5656) May 19, 2026
Copy link
Copy Markdown

@isaaclab-review-bot isaaclab-review-bot Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

📋 Review: Diagnostic Conftest for numpy/OpenBLAS Validation

This is the negative arm of an A/B validation for PR #5656, designed to verify that the numpy!=2.3.5 exclusion pin is necessary.

✅ Diagnostic Implementation

The source/conftest.py is correctly implemented:

  1. Proper pytest hook location: Placed in source/ where pytest discovers it during test collection.

  2. Clean dependency detection: The script correctly:

    • Imports numpy to get numpy.__version__
    • Locates numpy.libs/ directory (one level up from numpy package)
    • Iterates sorted filenames looking for openblas (case-insensitive)
  3. Flush for CI visibility: flush=True ensures log lines appear in CI output even with buffering.

  4. Minimal footprint: No functional changes to test behavior — purely observational.

📝 Minor Notes

  • os.pardir usage: os.path.join(..., os.pardir, "numpy.libs") works but could also use pathlib for clarity. Fine as-is for a throwaway diagnostic.

  • Session-level output: The print happens at module load (collection time), which matches the docstring's intent of capturing what numpy landed before any tests run.

🔍 Expected Outcome

Per the PR description:

  • This branch (negative arm): Should resolve to numpy 2.3.5 + libscipy_openblas64_-fdde5778.so, and canary tests should SIGSEGV.
  • Positive arm (#5684): With numpy!=2.3.5 pin, should resolve to a different version and pass.

✅ Verdict

Diagnostic correctly implemented. Do not merge — intended for CI evidence collection only.


Reviewed at commit 66a2147 — note: this is a simplified rebase from the earlier review; prior commits with CI action changes are no longer in this branch.

@hujc7 hujc7 changed the title [Diag] Negative arm — numpy 2.3.5 atfork crash (no 5656) [Diag][DO NOT MERGE] Negative arm — numpy 2.3.5 atfork crash (no 5656) May 20, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

infrastructure isaac-lab Related to Isaac Lab team

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant