[Diag][DO NOT MERGE] Negative arm — numpy 2.3.5 atfork crash (no 5656)#5655
[Diag][DO NOT MERGE] Negative arm — numpy 2.3.5 atfork crash (no 5656)#5655hujc7 wants to merge 1 commit into
Conversation
There was a problem hiding this comment.
📋 Code Review Summary
This PR adds diagnostic instrumentation to validate that PR #5642's numpy!=2.3.5 pin is working correctly in CI. The approach is sound for gathering evidence from real CI logs.
✅ What Looks Good
-
Consistent dependency pinning: The
numpy>=2,!=2.3.5constraint is applied uniformly across all four packages (isaaclab,isaaclab_tasks,isaaclab_rl,isaaclab_visualizers), ensuring no package inadvertently pulls in the problematic version. -
Well-documented: The commit messages, changelog fragments, and inline comments thoroughly explain the root cause (OpenBLAS
pthread_atforkhandler crash) with references to upstream issues. -
Appropriate diagnostic scope: The CI instrumentation prints version info without modifying test behavior—suitable for validation purposes.
📝 Suggestions
-
Confirm removal before merge: The PR description notes this is diagnostic-only and should not be merged. Ensure the CI diagnostic block in
.github/actions/run-tests/action.yml(lines 231-241) is removed before any final merge of the underlying fix. -
Consider adding a CI check assertion (optional): For stronger validation, the diagnostic could be modified to fail explicitly if numpy 2.3.5 or the
-fdde5778OpenBLAS hash is detected:if [[ "$NUMPY_VERSION" == "2.3.5" ]]; then echo "FATAL: numpy 2.3.5 detected despite exclusion pin" exit 1 fi
However, since this is temporary diagnostic code, the current passive logging approach is reasonable.
-
isaaclab_rl constraint change: Note that
isaaclab_rl/setup.pypreviously had an unconstrainednumpy(no version specifier), while others hadnumpy>=2. The new constraintnumpy>=2,!=2.3.5is appropriate, but this effectively tightens the lower bound for that package—this should be intentional and is documented in the changelog.
🔍 Verification Steps
After CI completes, search the GitHub Actions logs for:
=== Dep manifest (numpy/scipy/openblas) ===
Confirm that:
- numpy version is NOT 2.3.5
- OpenBLAS filename does NOT contain
-fdde5778
Overall, this is a clean diagnostic PR that serves its purpose of validating the numpy pin fix.
Update (c2d7833): Reviewed incremental changes. Two files modified:
1. .github/actions/run-tests/action.yml
- Change: Reorders pytest execution to run before the dependency manifest diagnostic.
- Why: The diagnostic imports numpy, which registers numpy's vendored OpenBLAS
pthread_atforkhandler. Running the diagnostic first would contaminate the pytest process and trigger Kit'slibomni.platforminfofork() SIGSEGV. - Implementation: Uses
set +e/set -ewithPYTEST_EXITto capture pytest's exit code, allowing the diagnostic to still emit even on test failure, then exits with the original pytest status. - ✅ This is a sound fix—ensures diagnostics don't interfere with test execution.
2. source/isaaclab/isaaclab/cli/commands/install.py
- Change: Adds forced numpy upgrade to ≥2.4.1 after pink-ik dependencies are installed.
- Why: cmeel-boost caps
numpy<2.4, but pip's resolution might pull numpy 2.3.5 (with the broken OpenBLAS). The explicit upgrade bypasses this cap. - Safety note: Comment explains numpy's stable C ABI keeps cmeel's compiled extensions working despite the version bump.
- Missing
return: Areturnstatement was added to the error path, which is correct to avoid executing the numpy upgrade after a pink-ik install failure. - ✅ Appropriate defensive fix with clear documentation.
Overall: Changes are well-reasoned and address the root cause properly. The execution ordering fix in CI is particularly important for avoiding intermittent SIGSEGVs.
Update (f8e1633): Reviewed incremental changes.
.github/actions/run-tests/action.yml
- Change: Reverts the pytest-before-diagnostic ordering from the previous commit. Now runs diagnostic first, then pytest.
- Removed: The
set +e/PYTEST_EXIT/set -epattern and the early "Starting pytest" echo. - New flow: Diagnostic → pytest (simpler structure).
"importing numpy in this shell beforehand registers numpy's vendored OpenBLAS pthread_atfork handler in the wrong process, which trips Kit's libomni.platforminfo fork() and SIGSEGVs"
Running the diagnostic before pytest may still trigger this issue unless:
- The diagnostic runs in a subprocess (via
./isaaclab.sh -p -c) that doesn't affect the parent shell's state - Or the numpy pin has already eliminated the problematic OpenBLAS
Likely intentional: Since the diagnostic uses ./isaaclab.sh -p -c (a subprocess), the parent shell where pytest runs should not inherit the handler registration. The c2d7833 ordering change may have been overly cautious. This simplification is likely safe.
✅ Code is cleaner now, and the diagnostic isolation via subprocess should prevent contamination.
The setup.py constraint "numpy>=2,!=2.3.5" landed in isaac-sim#5642 is silently overridden during isaaclab.sh --install: each pip install -e <submodule> runs an independent resolve, and the final pin-pink force-reinstall in _ensure_pink_ik_dependencies_installed lands on numpy 2.3.5 because pip sees only pin-pink's own deps (numpy>=1.19) plus cmeel-boost's numpy<2.4 cap. numpy 2.3.5 ships a vendored OpenBLAS (libscipy_openblas64_-fdde5778.so) whose pthread_atfork handler crashes Kit's libomni.platforminfo fork() during SimulationApp startup. Two changes, both restating an explicit "pip install --upgrade numpy>=2.4.1" as the *last* pip invocation in each install path: 1. _ensure_numpy_above_openblas_atfork_bug() in install.py — runs unconditionally at the end of --install (not gated by the pink-ik probe outcome), so upgrades on an already-functioning env also pull numpy forward. 2. Dockerfile.curobo — apply the same upgrade after its post-install steps (nvidia-curobo + isaaclab_teleop editable install), which otherwise drag numpy back to 2.3.5 via dex-retargeting -> pin -> cmeel-boost. pip prints a resolver warning about cmeel-boost's cap then installs numpy 2.4.5 anyway. numpy 2.4.1+ ships the upstream OpenBLAS atfork fix, so the entire 2.3.x risk class is bypassed. numpy's stable C ABI keeps cmeel's compiled extensions (libpinocchio, libcoal, ...) working at runtime. Validated: - env_isaaclab_test smoke test (numpy 2.4.5 + cmeel pinocchio + pink + daqp + qpsolvers all import; toy IK solve OK). - IsaacLab Pink IK unit tests: 54/54 pass against numpy 2.4.5 (test_pink_ik_components 21/21, test_local_frame_task 24/24, test_null_space_posture_task 9/9). - PR isaac-sim#5655 (validation): every base-image test job reports numpy 2.4.5 + openblas -32a4b2a6 (clean, not the broken -fdde5778). Worst-case import order (numpy imported before pytest spawns Kit) also passes — confirming the upstream atfork fix is real, not just dodge-by-order. Related: numpy/numpy#30092, OpenMathLib/OpenBLAS#5520
The setup.py constraint "numpy>=2,!=2.3.5" landed in isaac-sim#5642 is silently overridden during isaaclab.sh --install: each pip install -e <submodule> runs an independent resolve, and the final pin-pink force-reinstall in _ensure_pink_ik_dependencies_installed lands on numpy 2.3.5 because pip sees only pin-pink's own deps (numpy>=1.19) plus cmeel-boost's numpy<2.4 cap. numpy 2.3.5 ships a vendored OpenBLAS (libscipy_openblas64_-fdde5778.so) whose pthread_atfork handler crashes Kit's libomni.platforminfo fork() during SimulationApp startup. Two changes, both restating an explicit "pip install --upgrade numpy>=2.4.1" as the *last* pip invocation in each install path: 1. _ensure_numpy_above_openblas_atfork_bug() in install.py — runs unconditionally at the end of --install (not gated by the pink-ik probe outcome), so upgrades on an already-functioning env also pull numpy forward. 2. Dockerfile.curobo — apply the same upgrade after its post-install steps (nvidia-curobo + isaaclab_teleop editable install), which otherwise drag numpy back to 2.3.5 via dex-retargeting -> pin -> cmeel-boost. pip prints a resolver warning about cmeel-boost's cap then installs numpy 2.4.5 anyway. numpy 2.4.1+ ships the upstream OpenBLAS atfork fix, so the entire 2.3.x risk class is bypassed. numpy's stable C ABI keeps cmeel's compiled extensions (libpinocchio, libcoal, ...) working at runtime. Validated: - env_isaaclab_test smoke test (numpy 2.4.5 + cmeel pinocchio + pink + daqp + qpsolvers all import; toy IK solve OK). - IsaacLab Pink IK unit tests: 54/54 pass against numpy 2.4.5 (test_pink_ik_components 21/21, test_local_frame_task 24/24, test_null_space_posture_task 9/9). - PR isaac-sim#5655 (validation): every base-image test job reports numpy 2.4.5 + openblas -32a4b2a6 (clean, not the broken -fdde5778). Worst-case import order (numpy imported before pytest spawns Kit) also passes — confirming the upstream atfork fix is real, not just dodge-by-order. Related: numpy/numpy#30092, OpenMathLib/OpenBLAS#5520
numpy 2.3.5 ships a vendored OpenBLAS (libscipy_openblas64_-fdde5778.so) whose pthread_atfork handler crashes Kit's libomni.platforminfo fork() during SimulationApp startup. The release is excluded at every site that pulls numpy directly or transitively, so no pip resolve during isaaclab.sh --install or any Docker image build can land on it -- even transiently: source/isaaclab/setup.py source/isaaclab_tasks/setup.py source/isaaclab_rl/setup.py source/isaaclab_visualizers/setup.py source/isaaclab_teleop/setup.py (transitive via dex-retargeting) source/isaaclab_mimic/setup.py (transitive via h5py) isaaclab.cli.commands.install._ensure_pink_ik_dependencies_installed isaaclab.cli.commands.install._maybe_preinstall_arm_nlopt docker/Dockerfile.base (ARM nlopt prep) docker/Dockerfile.curobo (ARM nlopt prep + nvidia-curobo install) Each touchpoint adds only the ``!=2.3.5`` exclusion; no other version constraints are introduced. Validated: - env_isaaclab_test smoke test (numpy 2.4.5 + cmeel pinocchio + pink + daqp + qpsolvers all import; toy IK solve OK). - IsaacLab Pink IK unit tests: 54/54 pass against numpy 2.4.5. - PR isaac-sim#5655 worst-case run (diagnostic imports numpy before pytest spawns Kit, the order that originally crashed): 36 pass / 0 fail. The isaaclab_physx surface gripper SIGSEGV is gone. Related: numpy/numpy#30092, OpenMathLib/OpenBLAS#5520
|
Validation/diagnostic-only PR — never intended to merge. The data captured here (numpy 2.3.5 + libscipy_openblas64_-fdde5778.so in every test container after isaaclab.sh --install on the #5642 branch, and 36/0 worst-case re-validation after #5656's fix landed locally) confirmed the fix that's now in #5656. Closing. |
Adds source/conftest.py so pytest dumps the resolved numpy version and the bundled OpenBLAS .so filename at session start. Used by the negative/positive arm validation PRs to capture which numpy bundle each CI test container ends up with after isaaclab.sh --install completes. The conftest.py imports numpy at module load. This is what pytest does naturally via isaaclab module imports anyway -- making the import explicit here only adds visibility, not crash conditions.
f8e1633 to
66a2147
Compare
numpy 2.3.5 ships a vendored OpenBLAS (libscipy_openblas64_-fdde5778.so) whose pthread_atfork handler crashes Kit's libomni.platforminfo fork() during SimulationApp startup. The release is excluded at every site that pulls numpy directly or transitively, so no pip resolve during isaaclab.sh --install or any Docker image build can land on it -- even transiently: source/isaaclab/setup.py source/isaaclab_tasks/setup.py source/isaaclab_rl/setup.py source/isaaclab_visualizers/setup.py source/isaaclab_teleop/setup.py (transitive via dex-retargeting) source/isaaclab_mimic/setup.py (transitive via h5py) isaaclab.cli.commands.install._ensure_pink_ik_dependencies_installed isaaclab.cli.commands.install._maybe_preinstall_arm_nlopt docker/Dockerfile.base (ARM nlopt prep) docker/Dockerfile.curobo (ARM nlopt prep + nvidia-curobo install) Each touchpoint adds only the ``!=2.3.5`` exclusion; no other version constraints are introduced. Validated: - env_isaaclab_test smoke test (numpy 2.4.5 + cmeel pinocchio + pink + daqp + qpsolvers all import; toy IK solve OK). - IsaacLab Pink IK unit tests: 54/54 pass against numpy 2.4.5. - PR isaac-sim#5655 worst-case run (diagnostic imports numpy before pytest spawns Kit, the order that originally crashed): 36 pass / 0 fail. The isaaclab_physx surface gripper SIGSEGV is gone. Related: numpy/numpy#30092, OpenMathLib/OpenBLAS#5520
There was a problem hiding this comment.
📋 Review: Diagnostic Conftest for numpy/OpenBLAS Validation
This is the negative arm of an A/B validation for PR #5656, designed to verify that the numpy!=2.3.5 exclusion pin is necessary.
✅ Diagnostic Implementation
The source/conftest.py is correctly implemented:
-
Proper pytest hook location: Placed in
source/where pytest discovers it during test collection. -
Clean dependency detection: The script correctly:
- Imports numpy to get
numpy.__version__ - Locates
numpy.libs/directory (one level up from numpy package) - Iterates sorted filenames looking for
openblas(case-insensitive)
- Imports numpy to get
-
Flush for CI visibility:
flush=Trueensures log lines appear in CI output even with buffering. -
Minimal footprint: No functional changes to test behavior — purely observational.
📝 Minor Notes
-
os.pardirusage:os.path.join(..., os.pardir, "numpy.libs")works but could also usepathlibfor clarity. Fine as-is for a throwaway diagnostic. -
Session-level output: The print happens at module load (collection time), which matches the docstring's intent of capturing what numpy landed before any tests run.
🔍 Expected Outcome
Per the PR description:
- This branch (negative arm): Should resolve to numpy 2.3.5 +
libscipy_openblas64_-fdde5778.so, and canary tests should SIGSEGV. - Positive arm (#5684): With
numpy!=2.3.5pin, should resolve to a different version and pass.
✅ Verdict
Diagnostic correctly implemented. Do not merge — intended for CI evidence collection only.
Reviewed at commit 66a2147 — note: this is a simplified rebase from the earlier review; prior commits with CI action changes are no longer in this branch.
Purpose
Negative arm of the single-variable A/B that validates #5656.
Companion positive arm: #5684.
Setup
a9b62101ca6), no other workarounds.source/conftest.pythat prints the resolved numpy version and bundled OpenBLAS .so filename at pytest session start.isaaclab.sh --installoverrides, no force-reinstall, no action.yml edits.git diff jichuanh/validate-numpy-pin-5642..jichuanh/validate-numpy-2-3-4-worstcasebetween this branch and the positive arm equals exactly 5656's 10-sitenumpy!=2.3.5exclusion — nothing else.Expected result
[dep-manifest]log line showsnumpy 2.3.5+libscipy_openblas64_-fdde5778.so.libomni.platforminfofork():isaaclab_physxisaaclab_newtonisaaclab_rlIf any canary passes, the negative test is invalid (something else on develop is already deferring/masking the bug — investigate before trusting the positive arm).
Lifecycle
Diagnostic-only. Do not merge. Closes once evidence is captured.
Related