
merge main into amd-staging #2536

Closed

ronlieb wants to merge 52 commits into amd-staging from
amd/merge/upstream_merge_20260514051204


Conversation

@ronlieb
Collaborator

@ronlieb ronlieb commented May 14, 2026

No description provided.

rampitec and others added 30 commits May 13, 2026 16:37
This commit adds initial documentation for the instrumentor to the
HTML/man pages and provides a script that helps new users set up the
config and stubs file interactively.

The script and docs were created with Claude (AI) but
proofread/tested and modified afterwards.
A previous commit switched us to use the value of the AT_EXECFN, which
is an entry in the aux vector, as the executable path. As it turns out,
if a symlink is used to launch a program, the symlink path will be in
the AT_EXECFN string in core file memory. The PRPSINFO also contains a
basename of the program, and it will also be the symlink basename. The
best source of information to figure out the executable name is from the
NT_FILE note. This always has the resolved path to the executable.

Now the executable name is found in a reliable way, starting with the
NT_FILE entry for the main executable. That entry can be reliably
identified as the one whose address range contains the AT_PHDR aux
vector value, which is the address of the main executable's program
headers. If no NT_FILE entry can be found, we fall back to the
AT_EXECFN entry from memory, and then to the basename in the
PRPSINFO. This patch also creates a placeholder as the main executable
when the executable can't be found, to ensure users can see which
executable they will need to track down in order to load the core file.

The added tests verify the order of precedence by creating a core file
with:
- NT_FILE entry with a path of "/path/nt_file_foo"
- AT_EXECFN in the aux vector with a path of "/path/execfn_foo"
- NT_PRPSINFO entry with a path of "prpsinfo_foo"

We then test that the correct entry is found as the best path option is
removed from the core file.
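The fallback order described above can be sketched roughly as follows. This is a hypothetical helper with illustrative names, not the actual LLDB code, which works on the NT_FILE, auxv, and PRPSINFO note data directly:

```cpp
#include <optional>
#include <string>

// Illustrative sketch of the precedence: NT_FILE path first, then
// AT_EXECFN, then the PRPSINFO basename, then a placeholder.
std::string pickExecutablePath(const std::optional<std::string> &nt_file_path,
                               const std::optional<std::string> &at_execfn,
                               const std::optional<std::string> &prpsinfo_name) {
  if (nt_file_path)
    return *nt_file_path; // resolved path from the NT_FILE note
  if (at_execfn)
    return *at_execfn; // may be a symlink path
  if (prpsinfo_name)
    return *prpsinfo_name; // basename only
  return "<unknown-main-executable>"; // placeholder so the missing binary is visible
}
```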
When passing multiple flags to check_cxx_compiler_flag, we must separate
them using a SEMICOLON-separated list, not spaces (e.g. `-Werror;-mcrc`
rather than `-Werror -mcrc`). These checks sometimes succeed incorrectly
because "-Werror -mcrc" has a different return value than
"-Werror" "-mcrc" on some systems.

This issue was verified with LLVM_ENABLE_PROJECTS=llvm;compiler-rt,
and I'm uncertain whether it exists in runtime CMake builds.
Nonetheless, it's still a bug.

See:
https://cmake.org/cmake/help/latest/module/CheckCXXCompilerFlag.html

This issue was identified downstream in ChromiumOS.

ChromiumOS Bug:
https://issuetracker.google.com/507177988
before

```SystemVerilog
(* x = "x" *) foreach(x[x]) x = x;
```

after

```SystemVerilog
(* x = "x" *) foreach (x[x])
  x = x;
```

The code for handling statements like the `foreach` preceded the part
for handling the attributes inside `(* *)`. So there was a problem with
some of the statements following attributes. The patch moves the part
for the statements down. The loop in the code was also unnecessary.
This fixes 882d025.

Co-authored-by: Google Bazel Bot <google-bazel-bot@google.com>
Add a test case to verify that initFlags() correctly reads the
SCUDO_ALLOCATION_RING_BUFFER_SIZE environment variable and updates the
corresponding flag. This increases line coverage for flags.cpp to 100%.
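A minimal sketch of the behavior the test exercises; the names here are illustrative only, and Scudo's real flag parser is more involved:

```cpp
#include <cstdlib>
#include <string>

// Illustrative only: read SCUDO_ALLOCATION_RING_BUFFER_SIZE from the
// environment and let it override a compile-time default, analogous to
// what initFlags() does for the corresponding flag.
int ringBufferSize(int default_size) {
  if (const char *env = std::getenv("SCUDO_ALLOCATION_RING_BUFFER_SIZE"))
    return std::stoi(env);
  return default_size;
}
```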
…VM IR input (llvm#197566)

1. Replace the C++ source test that required compiling with %clangxx and
separate Input files with self-contained .ll tests using split-file.

2. Split the test into two files:
- clang-sycl-linker.ll: basic tool behavior (link, dev libs, AOT,
errors)
  - clang-sycl-linker-split-mode.ll: device code split mode handling

Co-Authored-By: Claude
Add first class support for building test inferiors without debug info,
instead of having to pass `-g0` in the Makefile or the build dictionary.

```
def test(self):
    self.build(debug_info="none")
```

rdar://164923931
Summary:
There are two ways to put multiple binaries in the section: either
use the version-two multi-binary support or just concatenate them. This
PR changes the llvm-offload-binary tool to use the multi-binary support
rather than directly concatenating them.

The motivation for this is to save space and make it easier to support
compression in the future. Compression would be a flag in the header and
the compression is only really valuable if it can combine the
architecture variants. ELF section compression is a little spotty but
would be another good solution.
This operator creates a new ``list`` containing the same elements as
*list* but in sorted order. To determine the order, TableGen binds the
variable *var* to each element and evaluates the *key* expression, which
presumably refers to *var*. The key must produce a ``string`` or integer
value (``bit``, ``bits``, or ``int``); all keys must be of the same
type. Elements with equal keys preserve their original relative order,
resulting in a stable sort.

For example, to sort a list of records by their ``Name`` field::

  list<Thing> sorted = !sort(t, Things, t.Name);
…WARD_SLASH is ON (llvm#184556)

This patch fixes several LLVM test failures on Windows that occur when
the LLVM_WINDOWS_PREFER_FORWARD_SLASH CMake option is enabled.

The failures were caused by tests either hardcoding backslash
expectations in FileCheck or constructing paths with strict backslashes
in C++ unit tests, both of which break when the environment is
configured to prefer forward slashes.

Specific changes:
- `llvm-cov` lit tests: Changed the path separators with
`-DSEP=%{fs-sep}`.
- `llvm-objdump` lit test: Relaxed
`source-interleave-prefix-windows.test` to accept either forward or
backward slashes using the `{{[/\\]}}` regex. This makes the path
matching resilient to the underlying separator preference without losing
precision.
- CommandLineTest.cpp: Conditionalized the TestRoot variable to use
`C:/` instead of `C:\` based on the build configuration.
- Path.cpp (makeLongFormPath test):
  - Updated the OneDir string literal to conditionally use `/` or `\`.
- Updated the ContainsDotAndDotDot lambda to check for `.` and `..`
components with the correct separator style based on the build
configuration.
This change defines 4 new output patterns, `PAIR8`, `EVEN8`, `AEXT8`, and
`TRUNC4`, and uses them to implement the lowering of the intrinsics
`int_ppc_amo_l[dw]at` and `int_ppc_amo_l[dw]at_cond` in TableGen. As a
result, the output pattern that generates the instructions becomes more
understandable, and the C++ code can be removed.
…#197096)

This PR adds BF16 to I8 saturating FP to int convert custom lowering.
…trinsic (llvm#197380)

Fix HLSL builtin to SPIR-V intrinsic lowering: most intrinsic calls
must use CallingConv::C.

Relates to llvm#197608 which tries to add CallingConv CHECK to IR Verifier.
In 2021, Augusto changed the Target::ReadMemory API from taking a
`prefer_file_cache` argument to taking a `force_live_memory` argument,
with opposite meanings - where we used to pass true, the callers now
needed to pass false. The default argument was false, so many callers
omitted the argument altogether after the change.

One of the edits to
UnwindAssemblyInstEmulation::GetNonCallSiteUnwindPlanFromAssembly
unintentionally swapped the intended behavior -- this method which reads
the bytes of a function's instructions for emulation should get the
bytes from the local binary, if possible, else read from live memory.
But it was changed to force reading from live memory unconditionally.
This leads to an extra memory read for every function we see for the
first time in a single `lldb` process run (the UnwindTable they are
added to is part of the Module, and kept in the global Module cache).

It's not a major perf regression, but these are extra memory reads that
we don't need to be doing.

I audited all the other changes in the 2021 PR and this was the only
mistake like this.

rdar://177026608
This is the last patch for global/namespace thread-local variables. This
patch emits the final 'init' function, which calls all other init
functions, plus does the guard variable for the unordered variants.
This is a pretty trivial set of adjustments that have to happen when
emitting a materialized temporary, and is effectively a clone of classic
codegen. Our output is effectively identical (other than some minor
re-ordering problems).
…llvm#197474)

On MSVC, Profile-* tests must link with the same CRT model as the
clang_rt.profile static archive they exercise. When that archive pulls
in RTInterception / RTSanitizerCommon object libraries, those are built
with MultiThreadedDLL (/MD), so the .objs reference `__imp_*` symbols.
The test binary defaults to /MT and fails to link with LNK2019
(`__imp__stricmp` from `interception_win.cpp`) and LNK4098 default-lib
conflicts.

Match the DLL CRT on the test side so test executables and the static
archive use the same runtime. The change is gated on
`COMPILER_RT_HAS_INTERCEPTION` and `!COMPILER_RT_PROFILE_BAREMETAL`, so
configurations that don't pull interception into profile are unaffected.

Split out as NFC from llvm#177665 per review feedback.
The
[RFC](https://discourse.llvm.org/t/rfc-remove-80-column-limit-in-documentation-files/89678/41)
on removing the 80-column limit in documentation files was accepted, so
we should no longer enforce that rule in clang-tidy's code-linter
workflow.
Right now it takes the validation path of an inline constant if the
value fits, even though it is forced to literal encoding.
…ed} (llvm#197518)

A device-typed dummy with `!dir$ ignore_tkr(m)` is meant to be an
overload discriminator (only selected for actuals with an explicit
`device/managed/unified` attribute). Skip the host->device relaxation in
AreCompatibleCUDADataAttrs when `IgnoreTKR::Managed` is set so
unattributed host actuals no longer bind to such a dummy.

Also document the §3.2.3 matching distance table next to
GetMatchingDistance and add LIT tests for the full Table 2 grid
and the ignore_tkr(m) carve-out.
This PR allows duplicate OpenACC `private` and `firstprivate` clauses,
while maintaining the restriction on `reduction` clauses.
This method needs to match the set of cases handled in parseSummaryEntry.
…97565)

Add support for DWARF opcodes seen in GCC-generated binaries:

- DW_FORM_ref_udata: ULEB128-encoded CU-relative DIE reference.

- DW_OP_regval_type (0xa5): DWARF5 expression opcode with operands
(SizeLEB, BaseTypeRef). The BaseTypeRef was not being updated when DIEs
were relocated because cloneExpression only handled (Size1, BaseTypeRef)
patterns. Generalized the first-operand copying to use raw bytes from
the data stream instead of assuming a single byte.

Fixes llvm#188250

Assisted-by: Claude Opus 4.6/4.7
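For reference, `DW_FORM_ref_udata` operands use the standard ULEB128 variable-length encoding, which can be decoded as follows. This is a generic sketch of the encoding, not the dsymutil code:

```cpp
#include <cstddef>
#include <cstdint>

// Decode one ULEB128 value (the encoding used by DW_FORM_ref_udata and
// the LEB-encoded operands of DW_OP_regval_type). Each byte contributes
// 7 payload bits; the high bit marks continuation. Returns the value,
// and *length receives the number of bytes consumed.
uint64_t decodeULEB128(const uint8_t *data, size_t *length) {
  uint64_t value = 0;
  unsigned shift = 0;
  size_t i = 0;
  uint8_t byte;
  do {
    byte = data[i++];
    value |= static_cast<uint64_t>(byte & 0x7f) << shift;
    shift += 7;
  } while (byte & 0x80);
  if (length)
    *length = i;
  return value;
}
```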
kevinsala and others added 21 commits May 13, 2026 22:51
This commit fixes the handling of `launch_bounds` within OpenMP's
`ompx_attribute`. The third attribute value, the maximum blocks, was not
parsed correctly.
…#197537)

Adds a new CMake option, OFF by default, to gate entrypoints with
known-incomplete implementations. This lets developers build and test
partially-implemented functions without exposing them to production
users.

The motivating case is `sysconf`, which only handles three of the
required `_SC_*` constants (`_SC_PAGESIZE`, `_SC_NPROCESSORS_CONF`,
`_SC_NPROCESSORS_ONLN`) and returns `EINVAL` for everything else.
Functions like this are useful to have in a build for testing progress,
but shouldn't be part of a default full build until the implementation
is complete.

Changes:
- `libc/CMakeLists.txt`: adds
`option(LLVM_LIBC_ENABLE_EXPERIMENTAL_ENTRYPOINTS ... OFF)`
- `libc/cmake/modules/LLVMLibCCompileOptionRules.cmake`: propagates
`-DLIBC_EXPERIMENTAL_ENTRYPOINTS` when ON
- `libc/cmake/modules/LLVMLibCTestRules.cmake`: same for test compile
options
- `libc/config/linux/{x86_64,aarch64,riscv}/entrypoints.txt`: moves
`sysconf` behind the new flag

The flag does not require `LLVM_LIBC_FULL_BUILD` since overlay builds
may also have incomplete entrypoints that benefit from this gating.
The combine was added in D48569 8 years ago with the aim of preserving
flags, but the current LangRef says the status flags are not observable
in the default FP environment.

The main motivation for this change is to enable scalar float reciprocal
generation v_s_rcp_f32 on newer hardware. There is no v_s_rcp_iflag_f32,
so the combine effectively blocks the selection.
See: pseudo-scalar-transcendental.ll.
We were losing the MMO when converting the load. Make sure we copy it
over, which apparently alters codegen more than I expected and helps
keep postinc generation after llvm#196305.
llvm#183506 revealed a pre-existing
use-after-scope in createInstrInfo (MSan bot:
https://lab.llvm.org/buildbot/#/builders/164/builds/21562 [*]).

This patch fixes the issue by changing the stack-allocated
AArch64Subtarget (which goes out of scope once createInstrInfo()
returns) to a heap-allocated one, allowing it to be safely stored in the
returned AArch64InstrInfo.

-----

[*] WARNING: MemorySanitizer: use-of-uninitialized-value
#0 0x55555666fabd in
llvm::AArch64InstrInfo::getInstSizeInBytes(llvm::MachineInstr const&)
const
/home/b/sanitizer-x86_64-linux-bootstrap-msan/build/llvm-project/llvm/lib/Target/AArch64/AArch64InstrInfo.cpp:247:5
...

/home/b/sanitizer-x86_64-linux-bootstrap-msan/build/llvm-project/llvm/unittests/Target/AArch64/InstSizes.cpp:85:3
#9 0x555556508559 in InstSizes_MOVaddrTagged_Test::TestBody()
/home/b/sanitizer-x86_64-linux-bootstrap-msan/build/llvm-project/llvm/unittests/Target/AArch64/InstSizes.cpp:301:3
...

  Member fields were destroyed
#0 0x555556498a1d in __sanitizer_dtor_callback_fields
/home/b/sanitizer-x86_64-linux-bootstrap-msan/build/llvm-project/compiler-rt/lib/msan/msan_interceptors.cpp:1074:5
#1 0x5555564fbda6 in ~Triple
/home/b/sanitizer-x86_64-linux-bootstrap-msan/build/llvm-project/llvm/include/llvm/TargetParser/Triple.h:348:12
#2 0x5555564fbda6 in ~Triple
/home/b/sanitizer-x86_64-linux-bootstrap-msan/build/llvm-project/llvm/include/llvm/TargetParser/Triple.h:47:7
#3 0x5555564fbda6 in llvm::AArch64Subtarget::~AArch64Subtarget()
/home/b/sanitizer-x86_64-linux-bootstrap-msan/build/llvm-project/llvm/lib/Target/AArch64/AArch64Subtarget.h:38:7
#4 0x555556503396 in (anonymous
namespace)::createInstrInfo(llvm::TargetMachine*)
/home/b/sanitizer-x86_64-linux-bootstrap-msan/build/llvm-project/llvm/unittests/Target/AArch64/InstSizes.cpp:38:1
#5 0x5555565084cb in InstSizes_MOVaddrTagged_Test::TestBody()
/home/b/sanitizer-x86_64-linux-bootstrap-msan/build/llvm-project/llvm/unittests/Target/AArch64/InstSizes.cpp:299:42
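The lifetime bug and its fix can be reduced to the following pattern. The names here are hypothetical stand-ins; the real code involves AArch64Subtarget and AArch64InstrInfo:

```cpp
#include <memory>
#include <string>
#include <utility>

// Sketch of the fix: InstrInfo holds a reference to a Subtarget, so the
// Subtarget must outlive it. Owning the Subtarget on the heap inside the
// returned object (instead of a stack local in the factory function)
// removes the use-after-scope.
struct Subtarget {
  std::string triple;
};

struct InstrInfo {
  std::unique_ptr<Subtarget> owned; // keeps the referenced Subtarget alive
  const Subtarget &subtarget;
  explicit InstrInfo(std::unique_ptr<Subtarget> st)
      : owned(std::move(st)), subtarget(*owned) {}
};

InstrInfo createInstrInfo() {
  // Before the fix, this was a stack-allocated Subtarget whose reference
  // escaped through the returned InstrInfo.
  return InstrInfo(std::make_unique<Subtarget>(Subtarget{"aarch64-unknown-linux"}));
}
```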
### Summary

part of : llvm#185382

Lower `vtrn1` and `vtrn2` intrinsics in
https://arm-software.github.io/acle/neon_intrinsics/advsimd.html#transpose-elements

All the intrinsics are handled inline in
llvm-project/build/lib/clang/23/include/arm_neon.h like:

```
#ifdef __LITTLE_ENDIAN__
__ai __attribute__((target("neon"))) int8x8x2_t vtrn_s8(int8x8_t __p0, int8x8_t __p1) {
  int8x8x2_t __ret;
  __builtin_neon_vtrn_v(&__ret, __builtin_bit_cast(int8x8_t, __p0), __builtin_bit_cast(int8x8_t, __p1), 0);
  return __ret;
}
#else
__ai __attribute__((target("neon"))) int8x8x2_t vtrn_s8(int8x8_t __p0, int8x8_t __p1) {
  int8x8x2_t __ret;
  int8x8_t __rev0;  __rev0 = __builtin_shufflevector(__p0, __p0, __lane_reverse_64_8);
  int8x8_t __rev1;  __rev1 = __builtin_shufflevector(__p1, __p1, __lane_reverse_64_8);
  __builtin_neon_vtrn_v(&__ret, __builtin_bit_cast(int8x8_t, __rev0), __builtin_bit_cast(int8x8_t, __rev1), 0);

  __ret.val[0] = __builtin_shufflevector(__ret.val[0], __ret.val[0], __lane_reverse_64_8);
  __ret.val[1] = __builtin_shufflevector(__ret.val[1], __ret.val[1], __lane_reverse_64_8);
  return __ret;
}
#endif
```

So no additional special lowering logic is needed.
In certain codebases (e.g. embedded), function declarations can
accumulate a long prefix of specifiers and attributes (`static`,
`inline`, `__attribute__((...))`, project-specific `AttributeMacros`,
etc.) before the return type, which buries the core prototype and pushes
parameters past the column limit.

This patch adds a `BreakBeforeReturnType` style option that places that
prefix on its own line(s):

```cpp
__attribute__((always_inline)) static inline
int do_thing(int a, int b, int c);
```

The recognized prefix tokens are function/storage specifiers (`static`,
`extern`, `inline`, `virtual`, `constexpr`, `consteval`, `friend`,
`export`, `_Noreturn`, `__forceinline`), C++11 attribute groups
`[[...]]`, GNU/MSVC attribute groups `__attribute__((...))` /
`__declspec(...)`, and identifiers configured via `AttributeMacros`.

The new `BreakBeforeReturnTypeStyle` enum has values `None`, `All`,
`TopLevel`, `AllDefinitions`, and `TopLevelDefinitions`. The default is
`None`, preserving previous behavior. Constructors and destructors are
not affected. The option composes with `BreakAfterReturnType`,
`BreakAfterAttributes`, and `BreakTemplateDeclarations`.

`ContinuationIndenter::getNewLineColumn` is adjusted so the wrapped
return type is dedented to the line's base indent when the preceding
token is a function/storage specifier keyword, matching the behavior
already used after attribute groups.

Adds tests in `FormatTest.cpp`.

Assisted-by: Claude (claude-opus-4-7, Claude Code)
This auto-assigns PR reviewers, per the GitHub documentation.
This code path is not really used with upstream code generation.
Optimized AArch32 implementations of `muldf3` and `divdf3` are provided.
The division function is particularly tricky because its Newton-Raphson
approximation strategy requires a rigorous error bound. In this version
of the commit I've left out the full supporting machinery that validates
the error bound via Gappa and Rocq, but full details are provided via
links to the upstream version of this code in the Arm Optimized Routines
repository, and to a pair of Arm Community blog posts.
While working on a PR to add a cost model for VPDerivedIV recipes I
noticed that a loop in or_reduction_with_freeze:

test/Transforms/LoopVectorize/AArch64/reduction-cost.ll

stopped vectorising because the cost model decided it was no longer
worth it. However, the main cause of this was the incredibly high cost
(14) of freeze for VF=2. We were using the cost of a vector mul
instruction as a proxy for the freeze cost, which is incredibly bad for
an AArch64 target without SVE since the operation needs scalarising.

As far as I understand, the freeze instruction does not lead to any
actual code being generated and acts merely as a barrier to potentially
unsafe optimisations. As such, I've updated the cost model to return 0
instead.
…9924)

The structure of these comparison functions consists of a header file
containing the main code, and several `.S` files that include that
header with different macro definitions, so that they can use the same
procedure to determine the logical comparison result and then just
translate it into a return value in different ways.
Add implicit uses to ds_bvh_stack instructions to avoid reuse of VGPRs
allocated to bvh_intersect_ray results prior to ds_bvh_stack. This
reduces the likelihood of a premature s_wait_bvhcnt occurring due to
partial reallocation of unused bvh_intersect_ray result registers.
…vm#197249)

When completing in the middle of an existing identifier (e.g.
`fo^o<int>(42)`), the next-token check lexes the character immediately
after the cursor, which prevents parens suppression from kicking in.

After the fix, we go to the end of the current identifier first, and
only then start lexing for the next token, which handles redundant
parens even when the cursor is mid-identifier.

This also fixes the parens suppression in the replace mode which by
design is used mid-identifier.

Fixes clangd/clangd#387
…types (llvm#197141)

Co-authored-by: Acim Maravic <Acim.Maravic@amd.com>
Fixes llvm#196662.

---------

Co-authored-by: owenca <owenpiano@gmail.com>
…9925)

These comparison functions follow the same structure as the
double-precision ones in a prior commit, of a header file containing the
main logic and some entry points varying the construction of the return
value.

In this case, we have provided versions for Thumb1 as well as
Arm/Thumb2.
Allow 32-bit targets to correctly lower i64 ISD::VECREDUCE min/max nodes
via ReplaceNodeResults - this is necessary once we're finally ready for
llvm#194473 and can remove combineMinMaxReduction entirely.

Improve handling of v2iXX reduction stages by consistently preferring
binop(extract(),extract()) scalarisation on SSE targets (if the vector
binop isn't legal).
@ronlieb ronlieb requested review from a team, A-Skvortsov, dpalermo and skganesan008 May 14, 2026 10:29
@ronlieb ronlieb requested a review from vangthao95 as a code owner May 14, 2026 10:29
@ronlieb ronlieb closed this May 14, 2026