
merge main into amd-staging #2536

Closed

ronlieb wants to merge 52 commits into amd-staging from
amd/merge/upstream_merge_20260514051204


Conversation

@ronlieb
Collaborator

@ronlieb ronlieb commented May 14, 2026

No description provided.

rampitec and others added 30 commits May 13, 2026 16:37
This commit adds initial documentation for the instrumentor to the
HTML/man pages and provides a script that helps new users set up the
config and stubs file interactively.

The script and docs were created with Claude (AI) but
proofread/tested and modified afterwards.
A previous commit switched us to use the value of the AT_EXECFN, which
is an entry in the aux vector, as the executable path. As it turns out,
if a symlink is used to launch a program, the symlink path will be in
the AT_EXECFN string in core file memory. The PRPSINFO also contains a
basename of the program, and it will also be the symlink basename. The
best source of information to figure out the executable name is from the
NT_FILE note. This always has the resolved path to the executable.

Now the executable name is found in a reliable way, starting with the
NT_FILE entry for the main executable. That entry can be reliably
identified as the one whose address range contains the AT_PHDR aux
vector value, which is the address of the main executable's program
headers. If no NT_FILE entry can be found, we fall back to the
AT_EXECFN entry from memory, and then to the basename in the
PRPSINFO. This patch also creates a placeholder as the main executable
when the executable can't be found, to ensure users can see which
executable they will need to track down in order to load the core file.

The added tests verify the order of precedence by creating a core file
with:
- NT_FILE entry with a path of "/path/nt_file_foo"
- AT_EXECFN in the aux vector with a path of "/path/execfn_foo"
- NT_PRPSINFO entry with a path of "prpsinfo_foo"

We then test that the correct entry is found as the best path option is
removed from the core file.
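The fallback order described above can be sketched roughly as follows. This is a hypothetical helper with illustrative names, not the actual LLDB code, which works on the NT_FILE, auxv, and PRPSINFO note data directly:

```cpp
#include <optional>
#include <string>

// Illustrative sketch of the precedence: NT_FILE path first, then
// AT_EXECFN, then the PRPSINFO basename, then a placeholder.
std::string pickExecutablePath(const std::optional<std::string> &nt_file_path,
                               const std::optional<std::string> &at_execfn,
                               const std::optional<std::string> &prpsinfo_name) {
  if (nt_file_path)
    return *nt_file_path; // resolved path from the NT_FILE note
  if (at_execfn)
    return *at_execfn; // may be a symlink path
  if (prpsinfo_name)
    return *prpsinfo_name; // basename only
  return "<unknown-main-executable>"; // placeholder so the missing binary is visible
}
```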
When passing multiple flags to check_cxx_compiler_flag, we must separate
them using a SEMICOLON-separated list, not spaces (e.g. `-Werror;-mcrc`
rather than `-Werror -mcrc`). These checks sometimes succeed incorrectly
because "-Werror -mcrc" has a different return value than
"-Werror" "-mcrc" on some systems.

This issue was verified with LLVM_ENABLE_PROJECTS=llvm;compiler-rt,
and I'm uncertain whether it exists in runtime CMake builds.
Nonetheless, it's still a bug.

See:
https://cmake.org/cmake/help/latest/module/CheckCXXCompilerFlag.html

This issue was identified downstream in ChromiumOS.

ChromiumOS Bug:
https://issuetracker.google.com/507177988
before

```SystemVerilog
(* x = "x" *) foreach(x[x]) x = x;
```

after

```SystemVerilog
(* x = "x" *) foreach (x[x])
  x = x;
```

The code for handling statements like the `foreach` preceded the part
for handling the attributes inside `(* *)`. So there was a problem with
some of the statements following attributes. The patch moves the part
for the statements down. The loop in the code was also unnecessary.
This fixes 882d025.

Co-authored-by: Google Bazel Bot <google-bazel-bot@google.com>
Add a test case to verify that initFlags() correctly reads the
SCUDO_ALLOCATION_RING_BUFFER_SIZE environment variable and updates the
corresponding flag. This increases line coverage for flags.cpp to 100%.
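A minimal sketch of the behavior the test exercises; the names here are illustrative only, and Scudo's real flag parser is more involved:

```cpp
#include <cstdlib>
#include <string>

// Illustrative only: read SCUDO_ALLOCATION_RING_BUFFER_SIZE from the
// environment and let it override a compile-time default, analogous to
// what initFlags() does for the corresponding flag.
int ringBufferSize(int default_size) {
  if (const char *env = std::getenv("SCUDO_ALLOCATION_RING_BUFFER_SIZE"))
    return std::stoi(env);
  return default_size;
}
```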
…VM IR input (llvm#197566)

1. Replace the C++ source test that required compiling with %clangxx and
separate Input files with self-contained .ll tests using split-file.

2. Split the test into two files:
- clang-sycl-linker.ll: basic tool behavior (link, dev libs, AOT,
errors)
  - clang-sycl-linker-split-mode.ll: device code split mode handling

Co-Authored-By: Claude
Add first class support for building test inferiors without debug info,
instead of having to pass `-g0` in the Makefile or the build dictionary.

```
def test(self):
    self.build(debug_info="none")
```

rdar://164923931
Summary:
There are two ways to put multiple binaries in the section: either
use the version-two multi-binary support or just concatenate them. This
PR changes the llvm-offload-binary tool to use the multi-binary support
rather than directly concatenating them.

The motivation for this is to save space and make it easier to support
compression in the future. Compression would be a flag in the header and
the compression is only really valuable if it can combine the
architecture variants. ELF section compression is a little spotty but
would be another good solution.
This operator creates a new ``list`` containing the same elements as
*list* but in sorted order. To determine the order, TableGen binds the
variable *var* to each element and evaluates the *key* expression, which
presumably refers to *var*. The key must produce a ``string`` or integer
value (``bit``, ``bits``, or ``int``); all keys must be of the same
type. Elements with equal keys preserve their original relative order,
resulting in a stable sort.

For example, to sort a list of records by their ``Name`` field::

  list<Thing> sorted = !sort(t, Things, t.Name);
…WARD_SLASH is ON (llvm#184556)

This patch fixes several LLVM test failures on Windows that occur when
the LLVM_WINDOWS_PREFER_FORWARD_SLASH CMake option is enabled.

The failures were caused by tests either hardcoding backslash
expectations in FileCheck or constructing paths with strict backslashes
in C++ unit tests, both of which break when the environment is
configured to prefer forward slashes.

Specific changes:
- `llvm-cov` lit tests: Changed the path separators with
`-DSEP=%{fs-sep}`.
- `llvm-objdump` lit test: Relaxed
`source-interleave-prefix-windows.test` to accept either forward or
backward slashes using the `{{[/\\]}}` regex. This makes the path
matching resilient to the underlying separator preference without losing
precision.
- CommandLineTest.cpp: Conditionalized the TestRoot variable to use
`C:/` instead of `C:\` based on the build configuration.
- Path.cpp (makeLongFormPath test):
  - Updated the OneDir string literal to conditionally use `/` or `\`.
- Updated the ContainsDotAndDotDot lambda to check for `.` and `..`
components with the correct separator style based on the build
configuration.
This change defines 4 new output patterns, `PAIR8`, `EVEN8`, `AEXT8`, and
`TRUNC4`, and uses them to implement the lowering of the intrinsics
`int_ppc_amo_l[dw]at` and `int_ppc_amo_l[dw]at_cond` in TableGen. As a
result, the output pattern that generates the instructions becomes more
understandable, and the C++ code can be removed.
…#197096)

This PR adds BF16 to I8 saturating FP to int convert custom lowering.
…trinsic (llvm#197380)

Fix HLSL builtin to SPIR-V intrinsic lowering: most intrinsic calls
must use CallingConv::C.

Relates to llvm#197608 which tries to add CallingConv CHECK to IR Verifier.
In 2021, Augusto changed the Target::ReadMemory API from taking a
`prefer_file_cache` argument to taking a `force_live_memory` argument,
with opposite meanings - where we used to pass true, the callers now
needed to pass false. The default argument was false, so many callers
omitted the argument altogether after the change.

One of the edits to
UnwindAssemblyInstEmulation::GetNonCallSiteUnwindPlanFromAssembly
unintentionally swapped the intended behavior -- this method which reads
the bytes of a function's instructions for emulation should get the
bytes from the local binary, if possible, else read from live memory.
But it was changed to force reading from live memory unconditionally.
This leads to an extra memory read for every function we see for the
first time in a single `lldb` process run (the UnwindTable they are
added to is part of the Module, and kept in the global Module cache).

It's not a major perf regression, but these are extra memory reads that
we don't need to be doing.

I audited all the other changes in the 2021 PR and this was the only
mistake like this.

rdar://177026608
This is the last patch for global/namespace thread-local variables. This
patch emits the final 'init' function, which calls all other init
functions, plus does the guard variable for the unordered variants.
This is a pretty trivial set of adjustments that have to happen when
emitting a materialized temporary, and is effectively a clone of classic
codegen. Our output is effectively identical (other than some minor
re-ordering problems).
…llvm#197474)

On MSVC, Profile-* tests must link with the same CRT model as the
clang_rt.profile static archive they exercise. When that archive pulls
in RTInterception / RTSanitizerCommon object libraries, those are built
with MultiThreadedDLL (/MD), so the .objs reference `__imp_*` symbols.
The test binary defaults to /MT and fails to link with LNK2019
(`__imp__stricmp` from `interception_win.cpp`) and LNK4098 default-lib
conflicts.

Match the DLL CRT on the test side so test executables and the static
archive use the same runtime. The change is gated on
`COMPILER_RT_HAS_INTERCEPTION` and `!COMPILER_RT_PROFILE_BAREMETAL`, so
configurations that don't pull interception into profile are unaffected.

Split out as NFC from llvm#177665 per review feedback.
The
[RFC](https://discourse.llvm.org/t/rfc-remove-80-column-limit-in-documentation-files/89678/41)
on removing the 80-column limit in documentation files was accepted, so
we should no longer enforce that rule in clang-tidy's code-linter
workflow.
Right now it takes the validation path of an inline constant if the
value fits, even though it is forced to literal encoding.
…ed} (llvm#197518)

A device-typed dummy with `!dir$ ignore_tkr(m)` is meant to be an
overload discriminator (only selected for actuals with an explicit
`device/managed/unified` attribute). Skip the host->device relaxation in
AreCompatibleCUDADataAttrs when `IgnoreTKR::Managed` is set so
unattributed host actuals no longer bind to such a dummy.

Also document the §3.2.3 matching distance table next to
GetMatchingDistance and add LIT tests for the full Table 2 grid
and the ignore_tkr(m) carve-out.
This PR allows duplicate OpenACC `private` and `firstprivate` clauses,
while maintaining the restriction on `reduction` clauses.
This method needs to match the set of cases handled in parseSummaryEntry.
…97565)

Add support for DWARF opcodes seen in GCC-generated binaries:

- DW_FORM_ref_udata: ULEB128-encoded CU-relative DIE reference.

- DW_OP_regval_type (0xa5): DWARF5 expression opcode with operands
(SizeLEB, BaseTypeRef). The BaseTypeRef was not being updated when DIEs
were relocated because cloneExpression only handled (Size1, BaseTypeRef)
patterns. Generalized the first-operand copying to use raw bytes from
the data stream instead of assuming a single byte.

Fixes llvm#188250

Assisted-by: Claude Opus 4.6/4.7
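For reference, `DW_FORM_ref_udata` operands use the standard ULEB128 variable-length encoding, which can be decoded as follows. This is a generic sketch of the encoding, not the dsymutil code:

```cpp
#include <cstddef>
#include <cstdint>

// Decode one ULEB128 value (the encoding used by DW_FORM_ref_udata and
// the LEB-encoded operands of DW_OP_regval_type). Each byte contributes
// 7 payload bits; the high bit marks continuation. Returns the value,
// and *length receives the number of bytes consumed.
uint64_t decodeULEB128(const uint8_t *data, size_t *length) {
  uint64_t value = 0;
  unsigned shift = 0;
  size_t i = 0;
  uint8_t byte;
  do {
    byte = data[i++];
    value |= static_cast<uint64_t>(byte & 0x7f) << shift;
    shift += 7;
  } while (byte & 0x80);
  if (length)
    *length = i;
  return value;
}
```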
kevinsala and others added 21 commits May 13, 2026 22:51
This commit fixes the handling of `launch_bounds` within OpenMP's
`ompx_attribute`. The third attribute value, the maximum blocks, was not
parsed correctly.
…#197537)

Adds a new CMake option, OFF by default, to gate entrypoints with
known-incomplete implementations. This lets developers build and test
partially-implemented functions without exposing them to production
users.

The motivating case is `sysconf`, which only handles three of the
required `_SC_*` constants (`_SC_PAGESIZE`, `_SC_NPROCESSORS_CONF`,
`_SC_NPROCESSORS_ONLN`) and returns `EINVAL` for everything else.
Functions like this are useful to have in a build for testing progress,
but shouldn't be part of a default full build until the implementation
is complete.

Changes:
- `libc/CMakeLists.txt`: adds
`option(LLVM_LIBC_ENABLE_EXPERIMENTAL_ENTRYPOINTS ... OFF)`
- `libc/cmake/modules/LLVMLibCCompileOptionRules.cmake`: propagates
`-DLIBC_EXPERIMENTAL_ENTRYPOINTS` when ON
- `libc/cmake/modules/LLVMLibCTestRules.cmake`: same for test compile
options
- `libc/config/linux/{x86_64,aarch64,riscv}/entrypoints.txt`: moves
`sysconf` behind the new flag

The flag does not require `LLVM_LIBC_FULL_BUILD` since overlay builds
may also have incomplete entrypoints that benefit from this gating.
The combine was added in D48569 8 years ago with the aim of preserving
flags, but the current LangRef says the status flags are not observable
in the default FP environment.

The main motivation for this change is to enable scalar float reciprocal
generation v_s_rcp_f32 on newer hardware. There is no v_s_rcp_iflag_f32,
so the combine effectively blocks the selection.
See: pseudo-scalar-transcendental.ll.
We were losing the MMO when converting the load. Make sure we copy it
over, which apparently alters codegen more than I expected and helps
keep postinc generation after llvm#196305.
llvm#183506 revealed a pre-existing
use-after-scope in createInstrInfo (MSan bot:
https://lab.llvm.org/buildbot/#/builders/164/builds/21562 [*]).

This patch fixes the issue by changing the stack-allocated
AArch64Subtarget (which goes out of scope once createInstrInfo()
returns) to a heap-allocated one, allowing it to be safely stored in the
returned AArch64InstrInfo.

-----

[*] WARNING: MemorySanitizer: use-of-uninitialized-value
#0 0x55555666fabd in
llvm::AArch64InstrInfo::getInstSizeInBytes(llvm::MachineInstr const&)
const
/home/b/sanitizer-x86_64-linux-bootstrap-msan/build/llvm-project/llvm/lib/Target/AArch64/AArch64InstrInfo.cpp:247:5
...

/home/b/sanitizer-x86_64-linux-bootstrap-msan/build/llvm-project/llvm/unittests/Target/AArch64/InstSizes.cpp:85:3
#9 0x555556508559 in InstSizes_MOVaddrTagged_Test::TestBody()
/home/b/sanitizer-x86_64-linux-bootstrap-msan/build/llvm-project/llvm/unittests/Target/AArch64/InstSizes.cpp:301:3
...

  Member fields were destroyed
#0 0x555556498a1d in __sanitizer_dtor_callback_fields
/home/b/sanitizer-x86_64-linux-bootstrap-msan/build/llvm-project/compiler-rt/lib/msan/msan_interceptors.cpp:1074:5
#1 0x5555564fbda6 in ~Triple
/home/b/sanitizer-x86_64-linux-bootstrap-msan/build/llvm-project/llvm/include/llvm/TargetParser/Triple.h:348:12
#2 0x5555564fbda6 in ~Triple
/home/b/sanitizer-x86_64-linux-bootstrap-msan/build/llvm-project/llvm/include/llvm/TargetParser/Triple.h:47:7
#3 0x5555564fbda6 in llvm::AArch64Subtarget::~AArch64Subtarget()
/home/b/sanitizer-x86_64-linux-bootstrap-msan/build/llvm-project/llvm/lib/Target/AArch64/AArch64Subtarget.h:38:7
#4 0x555556503396 in (anonymous
namespace)::createInstrInfo(llvm::TargetMachine*)
/home/b/sanitizer-x86_64-linux-bootstrap-msan/build/llvm-project/llvm/unittests/Target/AArch64/InstSizes.cpp:38:1
#5 0x5555565084cb in InstSizes_MOVaddrTagged_Test::TestBody()
/home/b/sanitizer-x86_64-linux-bootstrap-msan/build/llvm-project/llvm/unittests/Target/AArch64/InstSizes.cpp:299:42
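The lifetime bug and its fix can be reduced to the following pattern. The names here are hypothetical stand-ins; the real code involves AArch64Subtarget and AArch64InstrInfo:

```cpp
#include <memory>
#include <string>
#include <utility>

// Sketch of the fix: InstrInfo holds a reference to a Subtarget, so the
// Subtarget must outlive it. Owning the Subtarget on the heap inside the
// returned object (instead of a stack local in the factory function)
// removes the use-after-scope.
struct Subtarget {
  std::string triple;
};

struct InstrInfo {
  std::unique_ptr<Subtarget> owned; // keeps the referenced Subtarget alive
  const Subtarget &subtarget;
  explicit InstrInfo(std::unique_ptr<Subtarget> st)
      : owned(std::move(st)), subtarget(*owned) {}
};

InstrInfo createInstrInfo() {
  // Before the fix, this was a stack-allocated Subtarget whose reference
  // escaped through the returned InstrInfo.
  return InstrInfo(std::make_unique<Subtarget>(Subtarget{"aarch64-unknown-linux"}));
}
```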
### Summary

part of : llvm#185382

Lower `vtrn1` and `vtrn2` intrinsics in
https://arm-software.github.io/acle/neon_intrinsics/advsimd.html#transpose-elements

All the intrinsics are handled inline in
llvm-project/build/lib/clang/23/include/arm_neon.h like:

```
#ifdef __LITTLE_ENDIAN__
__ai __attribute__((target("neon"))) int8x8x2_t vtrn_s8(int8x8_t __p0, int8x8_t __p1) {
  int8x8x2_t __ret;
  __builtin_neon_vtrn_v(&__ret, __builtin_bit_cast(int8x8_t, __p0), __builtin_bit_cast(int8x8_t, __p1), 0);
  return __ret;
}
#else
__ai __attribute__((target("neon"))) int8x8x2_t vtrn_s8(int8x8_t __p0, int8x8_t __p1) {
  int8x8x2_t __ret;
  int8x8_t __rev0;  __rev0 = __builtin_shufflevector(__p0, __p0, __lane_reverse_64_8);
  int8x8_t __rev1;  __rev1 = __builtin_shufflevector(__p1, __p1, __lane_reverse_64_8);
  __builtin_neon_vtrn_v(&__ret, __builtin_bit_cast(int8x8_t, __rev0), __builtin_bit_cast(int8x8_t, __rev1), 0);

  __ret.val[0] = __builtin_shufflevector(__ret.val[0], __ret.val[0], __lane_reverse_64_8);
  __ret.val[1] = __builtin_shufflevector(__ret.val[1], __ret.val[1], __lane_reverse_64_8);
  return __ret;
}
#endif
```

So no additional special lowering logic is needed.
In certain codebases (e.g. embedded), function declarations can
accumulate a long prefix of specifiers and attributes (`static`,
`inline`, `__attribute__((...))`, project-specific `AttributeMacros`,
etc.) before the return type, which buries the core prototype and pushes
parameters past the column limit.

This patch adds a `BreakBeforeReturnType` style option that places that
prefix on its own line(s):

```cpp
__attribute__((always_inline)) static inline
int do_thing(int a, int b, int c);
```

The recognized prefix tokens are function/storage specifiers (`static`,
`extern`, `inline`, `virtual`, `constexpr`, `consteval`, `friend`,
`export`, `_Noreturn`, `__forceinline`), C++11 attribute groups
`[[...]]`, GNU/MSVC attribute groups `__attribute__((...))` /
`__declspec(...)`, and identifiers configured via `AttributeMacros`.

The new `BreakBeforeReturnTypeStyle` enum has values `None`, `All`,
`TopLevel`, `AllDefinitions`, and `TopLevelDefinitions`. The default is
`None`, preserving previous behavior. Constructors and destructors are
not affected. The option composes with `BreakAfterReturnType`,
`BreakAfterAttributes`, and `BreakTemplateDeclarations`.

`ContinuationIndenter::getNewLineColumn` is adjusted so the wrapped
return type is dedented to the line's base indent when the preceding
token is a function/storage specifier keyword, matching the behavior
already used after attribute groups.

Adds tests in `FormatTest.cpp`.

Assisted-by: Claude (claude-opus-4-7, Claude Code)
This auto-assigns PR reviewers, per the GitHub documentation.
This code path is not really used with upstream code generation.
Optimized AArch32 implementations of `muldf3` and `divdf3` are provided.
The division function is particularly tricky because its Newton-Raphson
approximation strategy requires a rigorous error bound. In this version
of the commit I've left out the full supporting machinery that validates
the error bound via Gappa and Rocq, but full details are provided via
links to the upstream version of this code in the Arm Optimized Routines
repository, and to a pair of Arm Community blog posts.
While working on a PR to add a cost model for VPDerivedIV recipes I
noticed that a loop in or_reduction_with_freeze:

test/Transforms/LoopVectorize/AArch64/reduction-cost.ll

stopped vectorising because the cost model decided it was no longer
worth it. However, the main cause of this was the incredibly high cost
(14) of freeze for VF=2. We were using the cost of a vector mul
instruction as a proxy for the freeze cost, which is incredibly bad for
an AArch64 target without SVE since the operation needs scalarising.

As far as I understand, the freeze instruction does not lead to any
actual code being generated and acts merely as a barrier to potentially
unsafe optimisations. As such, I've updated the cost model to return 0
instead.
…9924)

The structure of these comparison functions consists of a header file
containing the main code, and several `.S` files that include that
header with different macro definitions, so that they can use the same
procedure to determine the logical comparison result and then just
translate it into a return value in different ways.
Add implicit uses to ds_bvh_stack instructions to avoid reuse of VGPRs
allocated to bvh_intersect_ray results prior to ds_bvh_stack. This
reduces the likelihood of a premature s_wait_bvhcnt occurring due to
partial reallocation of unused bvh_intersect_ray result registers.
…vm#197249)

When completing in the middle of an existing identifier (e.g.
`fo^o<int>(42)`), the next-token check lexes the character immediately
after the cursor, which prevents parens suppression from kicking in.

After the fix, we go to the end of the current identifier first, and
only then start lexing for the next token, which handles redundant
parens even when the cursor is mid-identifier.

This also fixes the parens suppression in the replace mode which by
design is used mid-identifier.

Fixes clangd/clangd#387
…types (llvm#197141)

Co-authored-by: Acim Maravic <Acim.Maravic@amd.com>
Fixes llvm#196662.

---------

Co-authored-by: owenca <owenpiano@gmail.com>
…9925)

These comparison functions follow the same structure as the
double-precision ones in a prior commit, of a header file containing the
main logic and some entry points varying the construction of the return
value.

In this case, we have provided versions for Thumb1 as well as
Arm/Thumb2.
Allow 32-bit targets to correctly lower i64 ISD::VECREDUCE min/max nodes
via ReplaceNodeResults - this is necessary once we're finally ready for
llvm#194473 and can remove combineMinMaxReduction entirely.

Improve handling of v2iXX reduction stages by consistently preferring
binop(extract(),extract()) scalarisation on SSE targets (if the vector
binop isn't legal).
@ronlieb ronlieb requested review from a team, A-Skvortsov, dpalermo and skganesan008 May 14, 2026 10:29
@ronlieb ronlieb requested a review from vangthao95 as a code owner May 14, 2026 10:29
@ronlieb ronlieb closed this May 14, 2026