[packaging] Reserve RPATH pad to avoid patchelf breaking ELF layout on RHEL 8.10 (#4271) #4656

Closed
lucbruni-amd wants to merge 3 commits into main from
users/lucbruni-amd/patchelf-post-processing-rework

@lucbruni-amd
Contributor

Motivation

Fixes #4271. pip install rocm[libraries] on RHEL 8.10 baremetal failed to
compile any HIP program because clang-offload-bundler (and every other
shipped executable) crashed with SIGSEGV before main() ever ran. The
crash is in the 4.18 kernel's execve() path, not in the binary itself.

Root cause: py_packaging.py runs patchelf --add-rpath + --set-rpath
on every shipped ELF to rewrite $ORIGIN relative paths into the wheel's
final _rocm_sdk_core layout. The new RPATH strings are longer than the
originals, so patchelf has to extend .dynstr. To make room, patchelf
prepends a writable PT_LOAD segment ahead of the canonical read-only
first segment. RHEL 8.10's 4.18 kernel rejects that layout in
load_elf_binary() and signals SIGSEGV to the parent. Newer kernels
(>= 5.4) tolerate it, which is why the nightly only broke on RHEL 8.10.

The issue report frames this in ET_EXEC terms (base 0x400000 → 0x3ff000),
but TheRock ships only ET_DYN / PIE executables, so the same kernel bug
manifests as a writable first PT_LOAD (base stays at 0x0). Either way,
the invariant "first PT_LOAD is not writable" is what the kernel's ELF
loader checks.
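
The invariant above can be checked directly from the program headers. The following is a minimal sketch, not the code in this PR: the function name first_load_is_writable is hypothetical, and it assumes a 64-bit little-endian ELF image already read into memory. Field offsets follow the standard ELF64 header and program header layout.

```python
import struct

PT_LOAD = 1   # loadable segment type
PF_W = 0x2    # "writable" bit in p_flags


def first_load_is_writable(data: bytes) -> bool:
    """Return True if the first PT_LOAD of a 64-bit LE ELF image is writable.

    Hypothetical helper for illustration only.
    """
    # e_ident: magic, EI_CLASS == 2 (ELFCLASS64), EI_DATA == 1 (little-endian)
    if data[:4] != b"\x7fELF" or data[4] != 2 or data[5] != 1:
        raise ValueError("not a 64-bit little-endian ELF image")

    # ELF64 header: e_phoff at offset 32, e_phentsize at 54, e_phnum at 56
    (e_phoff,) = struct.unpack_from("<Q", data, 32)
    e_phentsize, e_phnum = struct.unpack_from("<HH", data, 54)

    for i in range(e_phnum):
        # ELF64 program header starts with p_type (u32) then p_flags (u32)
        p_type, p_flags = struct.unpack_from("<II", data, e_phoff + i * e_phentsize)
        if p_type == PT_LOAD:
            return bool(p_flags & PF_W)
    raise ValueError("no PT_LOAD segment found")
```

Typical use would be first_load_is_writable(open(path, "rb").read()) on each shipped executable; a True result corresponds to the RW-first layout described above.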

Technical Details

The fix pre-allocates enough RPATH string space at link time that the
later patchelf --set-rpath can overwrite in place without ever needing
to grow .dynstr. No patchelf behaviour change, no submodule changes,
no kernel workaround.

CMake side (cmake/therock_subproject_utils.cmake,
cmake/therock_subproject.cmake):

  • Define THEROCK_INSTALL_RPATH_PAD_SIZE (1024), a marker string
    __therock_patchelf_pad__, and both a CMake-list form
    (THEROCK_INSTALL_RPATH_PAD) and colon-joined form
    (..._PAD_COLON) of the pad.
  • Inject all four variables into every subproject's _init.cmake so the
    pad propagates to every ExternalProject_Add configure step without
    per-component boilerplate.
  • therock_set_install_rpath appends the pad to the auto-managed
    INSTALL_RPATH on non-Darwin ELF platforms.
  • Log output is kept clean: message(STATUS ...) prints
    (+patchelf pad) instead of the full 1 KB filler, so from-source
    build logs stay readable.
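
To make the shape of the pad concrete, here is a rough Python sketch of how such an entry could be built. The real definition lives in CMake and is not shown in this description, so the details are assumptions: the sketch treats 1024 as the total length of the entry, and it infers the $ORIGIN/.<marker>_XXX... layout from the entry form mentioned in the packaging notes. The function names are hypothetical.

```python
MARKER = "__therock_patchelf_pad__"  # marker string named in this PR
PAD_SIZE = 1024                      # THEROCK_INSTALL_RPATH_PAD_SIZE


def make_pad_entry(size=PAD_SIZE):
    """Build one RPATH entry of exactly `size` characters carrying the marker.

    Assumption: the pad is a single entry of the form
    $ORIGIN/.<marker>_XXXX..., filled with 'X' up to `size`.
    """
    prefix = "$ORIGIN/." + MARKER + "_"
    if size < len(prefix):
        raise ValueError("pad size is smaller than the marker prefix")
    return prefix + "X" * (size - len(prefix))


def append_pad(rpath_entries):
    """Colon-join the real entries followed by the pad entry."""
    return ":".join(list(rpath_entries) + [make_pad_entry()])
```

Under these assumptions, append_pad(["$ORIGIN/../lib"]) yields the real entry followed by a 1 KB filler entry, which is the string space a later in-place rewrite can reuse.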

NO_INSTALL_RPATH carve-outs — three subprojects set their own
linker flags and bypass therock_set_install_rpath. They get the pad
from the super-project so no submodule changes are needed:

  • compiler/pre_hook_amd-llvm.cmake — appends the pad to
    CMAKE_INSTALL_RPATH and LIBOMP_INSTALL_RPATH.
  • compiler/pre_hook_amd-comgr.cmake — appends the pad to
    CMAKE_INSTALL_RPATH.
  • compiler/post_hook_hipify.cmake (new) — hipify-clang bakes its
    RPATH via LINK_FLAGS in the submodule's CMakeLists.txt. The
    post-hook appends the colon-joined pad to those LINK_FLAGS from the
    super-project, guarded on the exact -Wl,--rpath,$ORIGIN/../lib form
    the submodule currently uses. If that form ever changes upstream the
    CI gate below catches it.

Python packaging side (build_tools/_therock_utils/py_packaging.py):

  • _extend_rpath + _normalize_rpath are collapsed into a single
    _rewrite_rpath that reads the current RPATH, strips the pad marker
    and any trailing $ORIGIN/.__therock_patchelf_pad___XXX... entries,
    appends the dependency RPATHs computed for the wheel layout, and
    writes the result back with one patchelf --set-rpath --force-rpath
    call. Because we never exceed the pre-allocated string space,
    .dynstr is overwritten in place and the ELF program headers stay
    exactly as the linker produced them.
  • Packaging log output collapses the pad entry to <pad> in the
    before/after RPATH line.
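
The string handling described above can be sketched as a pure function. This is an illustration under assumptions, not the actual _rewrite_rpath: the name rewrite_rpath and its signature are hypothetical, and the explicit length check is added here to make the "fits in the reserved space" condition visible.

```python
MARKER = "__therock_patchelf_pad__"  # marker string named in this PR


def rewrite_rpath(current, dep_rpaths):
    """Drop pad entries from `current`, append `dep_rpaths`, verify the fit.

    `current` is the colon-joined RPATH read from the staged ELF;
    `dep_rpaths` are the entries computed for the wheel layout.
    Hypothetical helper for illustration only.
    """
    # Keep every non-empty entry that does not carry the pad marker.
    kept = [entry for entry in current.split(":") if entry and MARKER not in entry]
    kept.extend(dep_rpaths)
    new = ":".join(kept)

    # The in-place overwrite only works if the new string is no longer
    # than the one the linker reserved (real entries plus pad).
    if len(new) > len(current):
        raise ValueError(
            "rewritten RPATH (%d bytes) exceeds reserved space (%d bytes)"
            % (len(new), len(current))
        )
    return new
```

The returned string would then be written back with a single patchelf --set-rpath --force-rpath call, as described above. Raising on overflow rather than silently letting .dynstr grow is a design choice of this sketch.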

Test Plan

Two CI gates added to rocm_sdk.tests.core_test (runs via rocm-sdk test in test_rocm_wheels.yml on every PR and nightly):

  • testPatchelfPadStripped — greps every shipped file for
    __therock_patchelf_pad__; asserts zero hits. Catches regressions
    where a newly added target bypasses both the auto-RPATH path and
    _rewrite_rpath (pad would leak into the wheel).
  • testExecutableElfLayoutIntact — parses the program headers of every
    64-bit LE ELF under _rocm_sdk_core/bin and asserts the first
    PT_LOAD segment is not writable. This is the exact invariant the
    RHEL 8.10 kernel checks; an RW-first binary would reproduce #4271.
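
The pad-leak gate amounts to a byte search over the installed tree. A minimal sketch of that idea follows; the function name find_pad_leaks and the directory-walk approach are assumptions for illustration, and the real test in rocm_sdk.tests.core_test may be structured differently.

```python
import os

MARKER = b"__therock_patchelf_pad__"  # marker string named in this PR


def find_pad_leaks(root):
    """Return sorted relative paths of regular files under `root` that
    still contain the pad marker. Hypothetical helper for illustration."""
    leaks = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks and anything that is not a regular file.
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            # Reads each file whole, which is simple but memory-hungry
            # for very large binaries.
            with open(path, "rb") as f:
                if MARKER in f.read():
                    leaks.append(os.path.relpath(path, root))
    return sorted(leaks)
```

A test built on this would assert that find_pad_leaks(install_dir) returns an empty list, so any offending file names appear in the failure message.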

Local validation against a from-source build and build_python_packages.py
output:

  • CMake configure: every subproject's _init.cmake carries the pad
    definitions.
  • Compile + link: every auto-managed target, amd-llvm, amd-comgr,
    and hipify-clang ship with the 1 KB pad in their DT_RUNPATH /
    DT_RPATH at stage time.
  • build_python_packages.py: every executable logs REWRITE_RPATH: ... -> <final-rpath> with no pad entry in the final form.
  • rocm_sdk_core-*.whl: __therock_patchelf_pad__ not present in any
    wheel member; census of all 102 shipped ELF executables shows first
    PT_LOAD is R-only and base vaddr is 0x0 (canonical PIE).

End-to-end on the RHEL 8.10 baremetal system from the original repro:

  • readelf -lW on the installed clang-offload-bundler shows
    first PT_LOAD flags: R vaddr=0x0000000000000000 and a clean
    DT_RPATH with no pad marker.
  • strace -e execve returns execve(...) = 0 — the kernel accepts the
    layout. The original bug is cured.

Test Result

Repro environment:

  • RHEL 8.10, kernel 4.18.0-553.117.1.el8_10.x86_64, glibc 2.28
  • Wheel built locally from this branch (from-source, Ubuntu 24.04 host)

Before (baseline nightly build):

  • execve() → SIGSEGV before ld.so runs, dmesg shows a segfault
    record for the exec'd binary.

After (this branch):

  • execve() → 0, ld.so runs, and the binary proceeds to dynamic linking.
  • The wheel installed on RHEL 8.10 still fails on glibc 2.38 /
    GLIBCXX_3.4.30 symbol references inherited from the newer build host
    (an unrelated ABI mismatch: the local build did not use a manylinux
    container). CI's manylinux_2_28 pipeline will produce the ABI-correct
    wheel; the kernel-layout fix demonstrated here is independent of the
    ABI issue.

Full from-source build and packaging logs available on request.

Follow-up

  • Land this, then run the existing manylinux_2_28 CI wheel on the RHEL
    8.10 box to confirm full end-to-end pip install rocm[libraries] +
    HIP compile works in the intended deployment configuration.
  • The new compiler/post_hook_hipify.cmake is the only place where the
    super-project special-cases hipify's hand-rolled LINK_FLAGS. If
    hipify upstream grows richer RPATH handling, the
    testExecutableElfLayoutIntact CI gate will flag any regression and
    the post-hook can be dropped.

Submission Checklist

py_packaging's patchelf --add-rpath + --set-rpath pass was extending
.dynstr on every shipped ELF, which made patchelf prepend a writable
PT_LOAD segment at a non-canonical base address. RHEL 8.10 / EL 4.18
kernels reject that layout in execve() with SIGSEGV, so `pip install
rocm[libraries]` failed to compile any HIP program on RHEL 8.10 baremetal.

Reserve ~1KB of RPATH string space at link time in every shipped ELF
so the packaging rewrite fits in place without resizing .dynstr. The
pad is defined once in cmake/therock_subproject_utils.cmake and injected
into every subproject's init file. Auto-managed targets get the pad via
therock_set_install_rpath; the three NO_INSTALL_RPATH carve-outs pick it
up via their pre-hook (amd-llvm, amd-comgr) or a new super-project
post-hook (hipify), so no submodule changes are required.

py_packaging's _extend_rpath + _normalize_rpath are collapsed into a
single _rewrite_rpath that strips the pad and writes the final entries
with one patchelf --set-rpath --force-rpath call.

Two gates added to rocm_sdk.tests.core_test (runs under rocm-sdk test on
every PR and nightly): testPatchelfPadStripped greps shipped files for
the pad marker, and testExecutableElfLayoutIntact parses 64-bit LE ELFs
under core/bin and asserts the first PT_LOAD segment is read-only.

Made-with: Cursor

The ~1KB RPATH pad added for issue #4271 makes from-source build logs
nearly unreadable when printed verbatim. Dump a compact placeholder
instead:

- therock_set_install_rpath now prints "(+patchelf pad)" once per
  target instead of the full $ORIGIN/.__therock_patchelf_pad___XXX...
  string; CMake still writes the real pad into INSTALL_RPATH.
- py_packaging._rewrite_rpath collapses the pad entry to "<pad>" when
  logging the before/after RPATH; patchelf still overwrites the real
  .dynstr bytes in place.

Shipped ELFs are unchanged. Cosmetic only.

Made-with: Cursor
@lucbruni-amd
Contributor Author

Please mind the length of this PR in terms of both code changes and description. It is quite large, but we can treat this as a draft/experimental solution to the issue and as a means to discuss the final solution or alternatives (hence my request for more reviewers - feel free to remove yourself if your queue is overloaded).

Co-authored-by: Claude <noreply@anthropic.com>
Made-with: Cursor
@ScottTodd
Member

Is this still needed after we upgraded patchelf in #4568 to fix #4561? The symptoms look similar.

@lucbruni-amd
Contributor Author

lucbruni-amd commented Apr 17, 2026

Just tested the reproducer in #4271 again with the latest nightly 7.13.0a20260417 which should include the upgraded patchelf, and I no longer hit the issue. clang-offload-bundler is invoked during link and the build completes with [100%] Built target hip_hello_world, so execve() now succeeds on kernel 4.18. Apologies for the noise.

@pbhandar-amd, confirm on your side that this is fixed and we can close both the issue and this PR.

@lucbruni-amd
Contributor Author

Closing as this issue has been resolved by #4568.

github-project-automation (bot) moved this from TODO to Done in TheRock Triage on Apr 21, 2026.
lucbruni-amd deleted the users/lucbruni-amd/patchelf-post-processing-rework branch on April 21, 2026 13:40.