8382713: [VectorAPI] Perform late inlining of failed vector intrinsics by jatin-bhateja · Pull Request #30876 · openjdk/jdk

jatin-bhateja · 2026-04-22T09:49:38Z

Currently, we attempt lazy intrinsification of vector intrinsics during the incremental inlining stage. If intrinsification fails because the inline expander does not see the constant context it expects, a static call is generated, which incurs call overhead.

As per the following comment from @iwanowww on the JDK-8303762 pull request:
#24104 (comment)

We should attempt procedure inlining of failed vector intrinsics to avoid the penalties associated with call overhead. For vector operations whose fallback implementation uses other vector APIs, it will also save the boxing penalty.

This patch addresses the concern by adding a new hybrid call generator (LateInlineVectorCallGenerator) which encapsulates both the intrinsic and the parser call generator. During incremental inlining, the intrinsic gets multiple chances to succeed. If all attempts fail, the fallback implementation is inlined instead, which avoids the call overhead.
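The retry-then-fall-back behaviour described above can be sketched as a toy model. This is not the HotSpot code: `HybridGeneratorSketch`, `Outcome`, `try_intrinsic`, and the round counter are hypothetical names used only for illustration, and the callback stands in for whatever makes intrinsification succeed in a later round.

```cpp
#include <cassert>
#include <functional>

// Hypothetical outcome labels; these are not HotSpot types.
enum class Outcome { Intrinsified, FallbackInlined, StaticCall };

// Toy stand-in for a hybrid generator that knows both an intrinsic path
// and a fallback path for the same call site.
struct HybridGeneratorSketch {
  // Returns true when the intrinsic can be expanded in the given round,
  // e.g. because its arguments have become constants by then.
  std::function<bool(int round)> try_intrinsic;
  // true models the patched behaviour, false the behaviour described as current.
  bool inline_fallback;

  Outcome generate(int max_rounds) const {
    for (int round = 0; round < max_rounds; ++round) {
      if (try_intrinsic(round)) {
        return Outcome::Intrinsified;  // a later round may still succeed
      }
    }
    // Every attempt failed: inline the fallback body, or leave a call behind.
    return inline_fallback ? Outcome::FallbackInlined : Outcome::StaticCall;
  }
};
```

With a callback that only succeeds from round 2 onward, `generate(3)` yields `Outcome::Intrinsified`. With a callback that never succeeds, it yields `FallbackInlined` or `StaticCall` depending on the flag.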

Please review and share your feedback.

Best Regards,
Jatin

I confirm that I make this contribution in accordance with the OpenJDK Interim AI Policy.

Progress

Change must not contain extraneous whitespace
Commit message must refer to an issue
Change must be properly reviewed (2 reviews required, with at least 1 Reviewer, 1 Author)

Issue

JDK-8382713: [VectorAPI] Perform late inlining of failed vector intrinsics (Enhancement - P4)

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/30876/head:pull/30876
$ git checkout pull/30876

Update a local copy of the PR:
$ git checkout pull/30876
$ git pull https://git.openjdk.org/jdk.git pull/30876/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 30876

View PR using the GUI difftool:
$ git pr show -t 30876

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/30876.diff

Using Webrev

Link to Webrev Comment

jatin-bhateja · 2026-04-22T09:49:53Z

/label add hotspot-compiler-dev

bridgekeeper · 2026-04-22T09:51:31Z

👋 Welcome back jbhateja! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

openjdk · 2026-04-22T09:52:04Z

❗ This change is not yet ready to be integrated.
See the Progress checklist in the description for automated requirements.

openjdk · 2026-04-22T09:53:14Z

@jatin-bhateja
The hotspot-compiler label was successfully added.

openjdk · 2026-04-22T09:53:21Z

The total number of required reviews for this PR has been set to 2 based on the presence of this label: hotspot-compiler. This can be overridden with the /reviewers command.

openjdk · 2026-04-22T09:53:58Z

@jatin-bhateja To determine the appropriate audience for reviewing this pull request, one or more labels corresponding to different subsystems will normally be applied automatically. However, no automatic labelling rule matches the changes in this pull request. In order to have an "RFR" email sent to the correct mailing list, you will need to add one or more applicable labels manually using the /label pull request command.

Applicable Labels

build
client
compiler
core-libs
hotspot
hotspot-compiler
hotspot-gc
hotspot-jfr
hotspot-runtime
i18n
ide-support
javadoc
jdk
net
nio
security
serviceability
shenandoah

mlbridge · 2026-04-22T09:57:48Z

Webrevs

iwanowww

Thanks, Jatin!

iwanowww · 2026-04-22T19:12:26Z

+  }
+};
+
+bool LateInlineVectorCallGenerator::inline_fallback() const {


What's the purpose of this method? All vector intrinsics have fallback implementations. If any cases are added later, they don't have to rely on LateInlineVectorCallGenerator.

jatin-bhateja · 2026-04-27T05:36:27Z

Hi @iwanowww, your comments have been addressed.

jatin-bhateja · 2026-04-29T04:23:31Z

I modified the BlackScholes benchmark to use FloatVector.SPECIES_512 and then explicitly passed
-XX:UseAVX=2 to force intrinsic failure. The following are the performance numbers with and without
InlineVectorFallback; we see some improvement despite the error margins.

CommandLine: java -jar target/benchmarks.jar -f 1 -i 5 -wi 1 -w 30 -jvmArgs "-XX:UseAVX=2 --add-modules=jdk.incubator.vector -XX:+UnlockDiagnosticVMOptions -XX:+InlineVectorFallback" BlackScholes.vector_black_scholes


With -XX:-InlineVectorFallback
Benchmark                          (size)   Mode  Cnt     Score      Error  Units
BlackScholes.vector_black_scholes    1024  thrpt    5  7460.391 ± 1412.273  ops/s

With -XX:+InlineVectorFallback
Benchmark                          (size)   Mode  Cnt     Score      Error  Units
BlackScholes.vector_black_scholes    1024  thrpt    5  7851.062 ± 1765.271  ops/s
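To put the two rows side by side, the arithmetic below compares the score intervals. It is a minimal sketch that assumes the Error column is the half-width of the interval around the score, as JMH reports by default; `Measurement`, `intervals_overlap`, and `relative_gain` are illustrative names, not part of JMH.

```cpp
#include <cassert>
#include <cmath>

// One JMH result row: score and the half-width of its error interval.
struct Measurement {
  double score;
  double error;
  double low() const { return score - error; }
  double high() const { return score + error; }
};

// True when the two [low, high] intervals share at least one point.
bool intervals_overlap(const Measurement& a, const Measurement& b) {
  return a.low() <= b.high() && b.low() <= a.high();
}

// Relative change of the score, e.g. 0.05 for a 5% gain.
double relative_gain(const Measurement& base, const Measurement& changed) {
  return (changed.score - base.score) / base.score;
}
```

For the numbers above, the intervals are [6048.118, 8872.664] ops/s without the flag and [6085.791, 9616.333] ops/s with it. They overlap, and the mean score is about 5.2% higher with the flag enabled.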

jatin-bhateja · 2026-05-01T06:58:58Z

Hi @iwanowww, your comments have been addressed.

iwanowww

Overall, looks good. Minor suggestions follow.

iwanowww · 2026-05-04T20:58:24Z

  product(bool, EnableVectorAggressiveReboxing, false, EXPERIMENTAL,        \
          "Enables aggressive reboxing of vectors")                         \
                                                                            \
+  product(bool, InlineVectorFallback, true, DIAGNOSTIC,                     \


Let's call it IncrementalInlineVector and put it next to IncrementalInline et al.

iwanowww · 2026-05-05T16:57:03Z


    if (failing())  return;
+
+    if (_late_inlines.length() == 0 && _vector_late_inlines.length() > 0) {


Can you extract it into a helper method? Otherwise, the patch looks good. I'll submit it for testing.

I suggest renaming transfer_vector_late_inlines() to process_vector_late_inlines() and moving the _vector_late_inlines.length() > 0 guard there.
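The shape of that refactoring can be sketched on a toy structure. Everything here is an assumption made for illustration: `CompileSketch` and `drain_step` are hypothetical, the queues hold strings rather than call generators, and "processing" is assumed to mean handing the deferred vector call sites over to the regular late-inline queue.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy stand-in for the compilation state with two late-inline queues.
struct CompileSketch {
  std::vector<std::string> late_inlines;
  std::vector<std::string> vector_late_inlines;

  // The emptiness guard lives inside the helper, so callers no longer
  // repeat it. Returns true if any work was moved.
  bool process_vector_late_inlines() {
    if (vector_late_inlines.empty()) {
      return false;
    }
    // Assumed behaviour: append the deferred vector call sites to the
    // regular queue and clear the vector-specific one.
    late_inlines.insert(late_inlines.end(),
                        vector_late_inlines.begin(),
                        vector_late_inlines.end());
    vector_late_inlines.clear();
    return true;
  }

  // Call site after the refactoring: only the regular-queue check remains.
  bool drain_step() {
    return late_inlines.empty() && process_vector_late_inlines();
  }
};
```

The call site shrinks to a single condition, and the helper stays safe to call when there is nothing to do.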

jatin-bhateja · 2026-05-06T07:44:06Z

Hi @iwanowww, your comments have been addressed. Please share the results of your test run.

iwanowww · 2026-05-06T20:47:11Z

Unfortunately, I see multiple failures in Vector API-related tests. They failed mostly on linux-aarch64, but there were a few linux-x64 failures [1] as well. I'll take a closer look, but it seems the problem on linux-aarch64 is that fallback implementations are unconditionally inlined, which causes problems (multiple tests on 512-bit vectors fail due to memory exhaustion [2]).

[1] In particular:

compiler/vectorapi/TestVectorTest.java (w/ -XX:UseAVX=0)

Failed IR Rules (3) of Methods (2)
----------------------------------
1) Method "compiler.vectorapi.TestVectorTest::branch" - [Failed IR rules: 1]:
   * @IR rule 1: "@compiler.lib.ir_framework.IR(phase={DEFAULT}, applyIfPlatformAnd={}, applyIfCPUFeatureOr={}, counts={}, failOn={"_#CMP_I#_", "_#CMOVE_I#_"}, applyIfPlatform={}, applyIfPlatformOr={}, applyIfOr={}, applyIfCPUFeatureAnd={}, applyIf={}, applyIfCPUFeature={}, applyIfAnd={}, applyIfNot={})"
     > Phase "PrintIdeal":
       - failOn: Graph contains forbidden nodes:
         * Constraint 1: "(\\d+(\\s){2}(CmpI.*)+(\\s){2}===.*)"

         * Constraint 2: "(\\d+(\\s){2}(CMoveI.*)+(\\s){2}===.*)"

2) Method "compiler.vectorapi.TestVectorTest::cmove" - [Failed IR rules: 2]:
   * @IR rule 1: "@compiler.lib.ir_framework.IR(phase={DEFAULT}, applyIfPlatformAnd={}, applyIfCPUFeatureOr={}, counts={}, failOn={"_#CMP_I#_"}, applyIfPlatform={}, applyIfPlatformOr={}, applyIfOr={}, applyIfCPUFeatureAnd={}, applyIf={}, applyIfCPUFeature={}, applyIfAnd={}, applyIfNot={})"
     > Phase "PrintIdeal":
       - failOn: Graph contains forbidden nodes:
         * Constraint 1: "(\\d+(\\s){2}(CmpI.*)+(\\s){2}===.*)"

   * @IR rule 2: "@compiler.lib.ir_framework.IR(phase={DEFAULT}, applyIfPlatformAnd={}, applyIfCPUFeatureOr={}, counts={"_#VECTOR_TEST#_", "1", "_#CMOVE_I#_", "1"}, failOn={}, applyIfPlatform={}, applyIfPlatformOr={}, applyIfOr={}, applyIfCPUFeatureAnd={}, applyIf={}, applyIfCPUFeature={}, applyIfAnd={}, applyIfNot={})"
     > Phase "PrintIdeal":
       - counts: Graph contains wrong number of nodes:
         * Constraint 2: "(\\d+(\\s){2}(CMoveI.*)+(\\s){2}===.*)"
           - Failed comparison: [found] 7 = 1 [given]

compiler/vectorapi/VectorMaskCompareNotTest.java (w/ -ea -esa -XX:CompileThreshold=100 -XX:+UnlockExperimentalVMOptions -server -XX:-TieredCompilation)

Failed IR Rules (1) of Methods (1)
----------------------------------
1) Method "compiler.vectorapi.VectorMaskCompareNotTest::testCompareULEMaskNotByte" - [Failed IR rules: 1]:
   * @IR rule 1: "@compiler.lib.ir_framework.IR(phase={DEFAULT}, applyIfPlatformAnd={}, applyIfCPUFeatureOr={"asimd", "true", "avx", "true", "rvv", "true"}, counts={"_#XOR_V_MASK#_", "= 0", "_#XOR_V#_", "= 0", "_#VECTOR_MASK_CAST#_", "= 1", "_#VECTOR_MASK_CMP#_", "= 3"}, applyIfPlatform={}, applyIfPlatformOr={}, failOn={}, applyIfOr={}, applyIfCPUFeatureAnd={}, applyIf={}, applyIfCPUFeature={}, applyIfAnd={}, applyIfNot={})"
     > Phase "PrintIdeal":
       - counts: Graph contains wrong number of nodes:
         * Constraint 1: "(\\d+(\\s){2}(XorVMask.*)+(\\s){2}===.*)"
           - Failed comparison: [found] 3 = 0 [given]
         
         * Constraint 2: "(\\d+(\\s){2}(XorV.*)+(\\s){2}===.*)"
           - Failed comparison: [found] 3 = 0 [given]

[2] jdk/incubator/vector/ByteVector512LoadStoreTests.java

    --- Allocation timelime by phase ---
        Phase seq. number                             Bytes                  Nodes
                 (380 older entries lost)
          >4                  incrementalInline 157724064 (+0)        32839 (+0) 
...
           >227          incrementalInline_igvn 174587184 (+261984)   34935 (-18) 
          <4 (cont.)          incrementalInline 174587184 (+0)        34935 (+0) 
         <2 (cont.)                   optimizer 180612856 (+6025672)  34697 (-238) 
...
         <295 (cont.)                  regalloc 177447520 (+0)        83805 (+0) 
          >305                         buildIFG 221892144 (+44444624)  83398 (-407) 
...
         <295 (cont.)                  regalloc 299650328 (+0)        81331 (+0) 
          >318                    regAllocSplit 1073764680 (+774114352)  81331 (+0) 
    ---

#  Internal Error (.../src/hotspot/share/compiler/compilationMemoryStatistic.cpp:935), pid=1510298, tid=1510316
#  fatal error: c2 (1695) jdk/incubator/vector/ByteVector512$ByteShuffle512::intoMemorySegment((Ljava/lang/foreign/MemorySegment;JLjava/nio/ByteOrder;)V): Hit MemLimit - limit: 1073741824 now: 1073764680

jatin-bhateja · 2026-05-08T10:58:31Z

[1] In particular:

compiler/vectorapi/TestVectorTest.java (w/ -XX:UseAVX=0)

Failed IR Rules (3) of Methods (2)
----------------------------------
1) Method "compiler.vectorapi.TestVectorTest::branch" - [Failed IR rules: 1]:
   * @IR rule 1: "@compiler.lib.ir_framework.IR(phase={DEFAULT}, applyIfPlatformAnd={}, applyIfCPUFeatureOr={}, counts={}, failOn={"_#CMP_I#_", "_#CMOVE_I#_"}, applyIfPlatform={}, applyIfPlatformOr={}, applyIfOr={}, applyIfCPUFeatureAnd={}, applyIf={}, applyIfCPUFeature={}, applyIfAnd={}, applyIfNot={})"
     > Phase "PrintIdeal":
       - failOn: Graph contains forbidden nodes:
         * Constraint 1: "(\\d+(\\s){2}(CmpI.*)+(\\s){2}===.*)"

         * Constraint 2: "(\\d+(\\s){2}(CMoveI.*)+(\\s){2}===.*)"

2) Method "compiler.vectorapi.TestVectorTest::cmove" - [Failed IR rules: 2]:
   * @IR rule 1: "@compiler.lib.ir_framework.IR(phase={DEFAULT}, applyIfPlatformAnd={}, applyIfCPUFeatureOr={}, counts={}, failOn={"_#CMP_I#_"}, applyIfPlatform={}, applyIfPlatformOr={}, applyIfOr={}, applyIfCPUFeatureAnd={}, applyIf={}, applyIfCPUFeature={}, applyIfAnd={}, applyIfNot={})"
     > Phase "PrintIdeal":
       - failOn: Graph contains forbidden nodes:
         * Constraint 1: "(\\d+(\\s){2}(CmpI.*)+(\\s){2}===.*)"

   * @IR rule 2: "@compiler.lib.ir_framework.IR(phase={DEFAULT}, applyIfPlatformAnd={}, applyIfCPUFeatureOr={}, counts={"_#VECTOR_TEST#_", "1", "_#CMOVE_I#_", "1"}, failOn={}, applyIfPlatform={}, applyIfPlatformOr={}, applyIfOr={}, applyIfCPUFeatureAnd={}, applyIf={}, applyIfCPUFeature={}, applyIfAnd={}, applyIfNot={})"
     > Phase "PrintIdeal":
       - counts: Graph contains wrong number of nodes:
         * Constraint 2: "(\\d+(\\s){2}(CMoveI.*)+(\\s){2}===.*)"
           - Failed comparison: [found] 7 = 1 [given]

Hi @iwanowww, this failure is related to the use of UseAVX=0. Here fromBitsCoerced is not intrinsified; earlier it remained a CallStaticJavaNode, but now it gets inlined. The newly inlined context has a graph shape that introduces CMoveI and CmpI nodes, and the test fails because the IR rules don't expect them. One target-agnostic fix is to guard these IR rules with the -XX:-IncrementalInlineVector flag, but that would defeat the purpose of this test since IncrementalInlineVector is on by default. Since the test runs on multiple targets, guarding by UseAVX > 0 may not be desirable.

Let me know what you think.

compiler/vectorapi/VectorMaskCompareNotTest.java (w/ -ea -esa -XX:CompileThreshold=100 -XX:+UnlockExperimentalVMOptions -server -XX:-TieredCompilation)

Failed IR Rules (1) of Methods (1)
----------------------------------
1) Method "compiler.vectorapi.VectorMaskCompareNotTest::testCompareULEMaskNotByte" - [Failed IR rules: 1]:
   * @IR rule 1: "@compiler.lib.ir_framework.IR(phase={DEFAULT}, applyIfPlatformAnd={}, applyIfCPUFeatureOr={"asimd", "true", "avx", "true", "rvv", "true"}, counts={"_#XOR_V_MASK#_", "= 0", "_#XOR_V#_", "= 0", "_#VECTOR_MASK_CAST#_", "= 1", "_#VECTOR_MASK_CMP#_", "= 3"}, applyIfPlatform={}, applyIfPlatformOr={}, failOn={}, applyIfOr={}, applyIfCPUFeatureAnd={}, applyIf={}, applyIfCPUFeature={}, applyIfAnd={}, applyIfNot={})"
     > Phase "PrintIdeal":
       - counts: Graph contains wrong number of nodes:
         * Constraint 1: "(\\d+(\\s){2}(XorVMask.*)+(\\s){2}===.*)"
           - Failed comparison: [found] 3 = 0 [given]

         * Constraint 2: "(\\d+(\\s){2}(XorV.*)+(\\s){2}===.*)"
           - Failed comparison: [found] 3 = 0 [given]

With the patch, inlining the vector intrinsic fallbacks generates more code in the compilation unit, which affects the inlining of
other methods, e.g. AbstractMask::intoArray. When intoArray is NOT inlined, the mask must be boxed before it is passed to the non-inlined method. As a result, the VectorMaskCmp encapsulated in the VectorBoxNode gets an additional user, a VectorStoreMask created at https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/vector.cpp#L270 during VectorBoxNode scalarization.

This increases the outcnt of the VectorMaskCmpNode and inhibits the optimization that folds XorVMask(VectorMaskCmp, maskAll(true)):
https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/vectornode.cpp#L2366

Increasing InlineSmallCode to 10000 allows intoArray to be inlined: the mask is not boxed, VectorMaskCmp has outcnt=1, XorVMask is folded, and the test passes.
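The dependence on the user count can be illustrated with a toy predicate. This is only a sketch of the gating described above, not the code at the linked vectornode.cpp line: `MaskCmpSketch`, `can_fold_not_into_compare`, and `remaining_xor_nodes` are hypothetical names.

```cpp
#include <cassert>

// Toy node: only the number of users of the compare result matters here.
struct MaskCmpSketch {
  int outcnt;  // how many nodes consume this compare
};

// Assumed gate: the XorVMask with an all-true mask is folded into the
// compare only when that XorVMask is the compare's single user.
bool can_fold_not_into_compare(const MaskCmpSketch& cmp) {
  return cmp.outcnt == 1;
}

// Number of XorVMask nodes left for one such compare after the attempt.
int remaining_xor_nodes(const MaskCmpSketch& cmp) {
  return can_fold_not_into_compare(cmp) ? 0 : 1;
}
```

With `outcnt == 1` no XorVMask remains, which is what the IR rule expects. Once boxing adds a second user (`outcnt == 2`), the node survives and the count check fails.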

[2] jdk/incubator/vector/ByteVector512LoadStoreTests.java

    --- Allocation timelime by phase ---
        Phase seq. number                             Bytes                  Nodes
                 (380 older entries lost)
          >4                  incrementalInline 157724064 (+0)        32839 (+0) 
...
           >227          incrementalInline_igvn 174587184 (+261984)   34935 (-18) 
          <4 (cont.)          incrementalInline 174587184 (+0)        34935 (+0) 
         <2 (cont.)                   optimizer 180612856 (+6025672)  34697 (-238) 
...
         <295 (cont.)                  regalloc 177447520 (+0)        83805 (+0) 
          >305                         buildIFG 221892144 (+44444624)  83398 (-407) 
...
         <295 (cont.)                  regalloc 299650328 (+0)        81331 (+0) 
          >318                    regAllocSplit 1073764680 (+774114352)  81331 (+0) 
    ---

#  Internal Error (.../src/hotspot/share/compiler/compilationMemoryStatistic.cpp:935), pid=1510298, tid=1510316
#  fatal error: c2 (1695) jdk/incubator/vector/ByteVector512$ByteShuffle512::intoMemorySegment((Ljava/lang/foreign/MemorySegment;JLjava/nio/ByteOrder;)V): Hit MemLimit - limit: 1073741824 now: 1073764680

Overall, there is a tradeoff in unconditionally inlining vector intrinsic fallbacks: most of them are bulky, and inlining them may affect inlining decisions within their calling context.

Do you think it's beneficial to initially limit the scope of inlining to only a few intrinsics, e.g.
https://github.com/jatin-bhateja/jdk/blob/46fcc9acc05bdef5fd01f4972ed9a66de5f07198/src/hotspot/share/opto/callGenerator.cpp#L463

Please let me know your views.

openjdk · 2026-05-08T11:06:40Z

@jatin-bhateja Please do not rebase or force-push to an active PR as it invalidates existing review comments. Note for future reference, the bots always squash all changes into a single commit automatically as part of the integration. See OpenJDK Developers’ Guide for more information.

iwanowww · 2026-05-08T19:28:31Z

compiler/vectorapi/TestVectorTest.java (w/ -XX:UseAVX=0)

target agnostic fix is to guard these IR rules with -XX:-IncrementalInlineVector flag, but it will defeat the purpose of this test since IncrementalInlineVector is default on.

I don't see why it defeats the purpose of the test. It's an IR test and limiting possible IR shapes is fine.

compiler/vectorapi/VectorMaskCompareNotTest.java

With the patch, the vector intrinsic fallback inlining generates more code in the compilation unit, this affects inlining of
other methods, e.g. AbstractMask::intoArray.

Are we missing @ForceInline on AbstractMask::intoArray? Are any other methods not inlined?

Do you think it's beneficial to limit the scope of inlining to only a few intrinsics initially.

I think regular inlining heuristics should be applied to vector fallback implementations.

8382713: [VectorAPI] Perform late inlining of failed vector intrinsics

c7e6fce

openjdk Bot added the hotspot-compiler hotspot-compiler-dev@openjdk.org label Apr 22, 2026

openjdk Bot added the rfr Pull request is ready for review label Apr 22, 2026

jatin-bhateja mentioned this pull request Apr 22, 2026

8303762: Optimize vector slice operation with constant index using VPALIGNR instruction #24104

Open

4 tasks

iwanowww reviewed Apr 22, 2026

Review comments resolutions

931d45e

iwanowww reviewed Apr 27, 2026

Review comments resolutions

e779a2f

iwanowww reviewed May 4, 2026

Review comments resolutions

5f46f5b

iwanowww reviewed May 5, 2026

Review comments resolution

d18bd2a

Review comment resolution

8171911

jatin-bhateja force-pushed the JDK-8382713 branch from 1cceb24 to 8171911 on May 8, 2026 11:04

