8382713: [VectorAPI] Perform late inlining of failed vector intrinsics #30876

Open
jatin-bhateja wants to merge 6 commits into openjdk:master from jatin-bhateja:JDK-8382713

Conversation

@jatin-bhateja
Member

@jatin-bhateja jatin-bhateja commented Apr 22, 2026

Currently, we attempt lazy intrinsification of vector intrinsics during the incremental inlining stage. If intrinsification fails because the inline expander expects a constant context that is not available, a static call is generated, which incurs a call overhead penalty.

As per the following comments from @iwanowww on the JDK-8303762 pull request:
#24104 (comment)

We should attempt procedure inlining of failed vector intrinsics to avoid the penalties associated with call overhead. For vector operations whose fallback implementation uses other Vector API operations, it will also save the boxing penalty.

This patch addresses the concern by adding a new hybrid call generator (LateInlineVectorCallGenerator) which encapsulates both the intrinsic and the parser call generators. During incremental inlining, the intrinsic gets multiple chances to succeed. If all attempts fail, the fallback implementation is inlined instead, absorbing the call overhead penalty.
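
To make the "multiple chances, then inline the fallback" behaviour concrete, here is a minimal toy sketch. It is not the code from this patch: `ToyVectorCallSite`, `Outcome`, `process_call_site` and the `max_rounds` parameter are hypothetical names invented for illustration, and real incremental inlining in C2 is driven by the compiler's own worklists rather than a simple loop.

```cpp
#include <cassert>
#include <functional>

// Toy model only: these types are hypothetical and are not HotSpot's classes.
enum class Outcome { Intrinsified, FallbackInlined };

struct ToyVectorCallSite {
  std::function<bool()> try_intrinsic;  // returns true once intrinsification succeeds
  int attempts;                         // how many rounds tried the intrinsic
};

// Give the intrinsic one chance per incremental-inlining round. If every
// round fails, inline the fallback rather than leaving a static call behind.
inline Outcome process_call_site(ToyVectorCallSite& site, int max_rounds) {
  assert(max_rounds > 0);
  for (int round = 0; round < max_rounds; ++round) {
    ++site.attempts;
    if (site.try_intrinsic()) {
      return Outcome::Intrinsified;
    }
  }
  return Outcome::FallbackInlined;
}
```

In this model a call site whose `try_intrinsic` never succeeds ends up as `Outcome::FallbackInlined` after `max_rounds` attempts, which mirrors the description above: the call overhead is avoided either way.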

Please review and share your feedback.

Best Regards,
Jatin



Progress

  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue
  • Change must be properly reviewed (2 reviews required, with at least 1 Reviewer, 1 Author)

Issue

  • JDK-8382713: [VectorAPI] Perform late inlining of failed vector intrinsics (Enhancement - P4)

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/30876/head:pull/30876
$ git checkout pull/30876

Update a local copy of the PR:
$ git checkout pull/30876
$ git pull https://git.openjdk.org/jdk.git pull/30876/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 30876

View PR using the GUI difftool:
$ git pr show -t 30876

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/30876.diff

Using Webrev

Link to Webrev Comment

@jatin-bhateja
Member Author

/label add hotspot-compiler-dev

@bridgekeeper

bridgekeeper Bot commented Apr 22, 2026

👋 Welcome back jbhateja! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk

openjdk Bot commented Apr 22, 2026

❗ This change is not yet ready to be integrated.
See the Progress checklist in the description for automated requirements.

@openjdk openjdk Bot added the hotspot-compiler hotspot-compiler-dev@openjdk.org label Apr 22, 2026
@openjdk

openjdk Bot commented Apr 22, 2026

@jatin-bhateja
The hotspot-compiler label was successfully added.

@openjdk

openjdk Bot commented Apr 22, 2026

The total number of required reviews for this PR has been set to 2 based on the presence of this label: hotspot-compiler. This can be overridden with the /reviewers command.

@openjdk

openjdk Bot commented Apr 22, 2026

@jatin-bhateja To determine the appropriate audience for reviewing this pull request, one or more labels corresponding to different subsystems will normally be applied automatically. However, no automatic labelling rule matches the changes in this pull request. In order to have an "RFR" email sent to the correct mailing list, you will need to add one or more applicable labels manually using the /label pull request command.

Applicable Labels
  • build
  • client
  • compiler
  • core-libs
  • hotspot
  • hotspot-compiler
  • hotspot-gc
  • hotspot-jfr
  • hotspot-runtime
  • i18n
  • ide-support
  • javadoc
  • jdk
  • net
  • nio
  • security
  • serviceability
  • shenandoah

@openjdk openjdk Bot added the rfr Pull request is ready for review label Apr 22, 2026
@mlbridge

mlbridge Bot commented Apr 22, 2026

Contributor

@iwanowww iwanowww left a comment

Thanks, Jatin!

}
};

bool LateInlineVectorCallGenerator::inline_fallback() const {
Contributor

What's the purpose of this method? All vector intrinsics do have a fallback implementation. If any cases without one are added later, they don't have to rely on LateInlineVectorCallGenerator.

Comment thread src/hotspot/share/opto/compile.cpp Outdated
Comment thread src/hotspot/share/opto/compile.cpp Outdated
Comment thread src/hotspot/share/opto/compile.cpp Outdated
Comment thread src/hotspot/share/opto/doCall.cpp Outdated
@jatin-bhateja
Member Author

Hi @iwanowww , your comments have been addressed.

Comment thread src/hotspot/share/opto/compile.cpp Outdated
Comment thread src/hotspot/share/opto/compile.cpp Outdated
Comment thread src/hotspot/share/opto/callGenerator.cpp Outdated
@jatin-bhateja
Member Author

jatin-bhateja commented Apr 29, 2026

I modified the BlackScholes benchmark to use FloatVector.SPECIES_512 and then explicitly passed
-XX:UseAVX=2 to force intrinsic failure. The following are the performance numbers with and without
InlineVectorFallback; we see some improvement despite the error margins.

CommandLine: java -jar target/benchmarks.jar -f 1 -i 5 -wi 1 -w 30 -jvmArgs "-XX:UseAVX=2 --add-modules=jdk.incubator.vector -XX:+UnlockDiagnosticVMOptions -XX:+InlineVectorFallback" BlackScholes.vector_black_scholes


With -XX:-InlineVectorFallback
Benchmark                          (size)   Mode  Cnt     Score      Error  Units
BlackScholes.vector_black_scholes    1024  thrpt    5  7460.391 ± 1412.273  ops/s

With -XX:+InlineVectorFallback
Benchmark                          (size)   Mode  Cnt     Score      Error  Units
BlackScholes.vector_black_scholes    1024  thrpt    5  7851.062 ± 1765.271  ops/s
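
As a quick way to read the two rows above, the sketch below computes the relative gain and checks whether the score ± error ranges overlap. The `Score` struct and both helper functions are hypothetical names for illustration; the only inputs are the numbers reported in the tables.

```cpp
#include <cassert>

// Hypothetical holder for one benchmark row: mean throughput and reported error.
struct Score {
  double mean;
  double error;
};

// Relative change of the patched score over the baseline score.
inline double relative_gain(const Score& base, const Score& patched) {
  assert(base.mean > 0.0);
  return (patched.mean - base.mean) / base.mean;
}

// True if [mean - error, mean + error] of the two rows intersect.
inline bool intervals_overlap(const Score& a, const Score& b) {
  return (a.mean - a.error) <= (b.mean + b.error) &&
         (b.mean - b.error) <= (a.mean + a.error);
}
```

With `Score{7460.391, 1412.273}` as the baseline and `Score{7851.062, 1765.271}` as the patched run, `relative_gain` is roughly 0.052 (about 5%) and `intervals_overlap` is true, so the gain is smaller than the reported error ranges.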

@jatin-bhateja
Member Author

Hi @iwanowww , your comments have been addressed.

Contributor

@iwanowww iwanowww left a comment

Overall, looks good. Minor suggestions follow.

Comment thread src/hotspot/share/opto/c2_globals.hpp Outdated
product(bool, EnableVectorAggressiveReboxing, false, EXPERIMENTAL, \
"Enables aggressive reboxing of vectors") \
\
product(bool, InlineVectorFallback, true, DIAGNOSTIC, \
Contributor

Let's call it IncrementalInlineVector and put it next to IncrementalInline et al.

Comment thread src/hotspot/share/opto/callGenerator.cpp Outdated
Comment thread src/hotspot/share/opto/compile.cpp Outdated
Comment thread src/hotspot/share/opto/compile.cpp Outdated

if (failing()) return;

if (_late_inlines.length() == 0 && _vector_late_inlines.length() > 0) {
Contributor

@iwanowww iwanowww May 5, 2026

Can you extract it into a helper method? Otherwise, the patch looks good. I'll submit it for testing.

Contributor

I suggest renaming transfer_vector_late_inlines() to process_vector_late_inlines() and moving the _vector_late_inlines.length() > 0 guard there.
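
The shape of that suggestion can be sketched with a toy model. This is not the patch's code: `ToyCompile` and its `int` call sites are hypothetical, and the assumption that the helper moves pending vector call sites onto the regular late-inline list is only a guess based on the old name `transfer_vector_late_inlines()`.

```cpp
#include <cassert>
#include <vector>

// Toy model only: ToyCompile and its int "call sites" are hypothetical.
struct ToyCompile {
  std::vector<int> late_inlines;         // stands in for _late_inlines
  std::vector<int> vector_late_inlines;  // stands in for _vector_late_inlines

  // The emptiness guard lives inside the helper, so the caller does not
  // need to test vector_late_inlines itself. Returns true if work was moved.
  bool process_vector_late_inlines() {
    if (vector_late_inlines.empty()) {
      return false;
    }
    late_inlines.insert(late_inlines.end(),
                        vector_late_inlines.begin(),
                        vector_late_inlines.end());
    vector_late_inlines.clear();
    assert(vector_late_inlines.empty());
    return true;
  }
};
```

With the guard inside the helper, the call site only has to keep the `_late_inlines.length() == 0` half of the original condition.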

@jatin-bhateja
Member Author

Hi @iwanowww , your comments have been addressed, please share the results of your test run.

@iwanowww
Contributor

iwanowww commented May 6, 2026

Unfortunately, I see multiple failures in Vector API-related tests. They failed mostly on linux-aarch64, but there were a few linux-x64 failures [1] as well. I'll take a closer look, but it seems the problem on linux-aarch64 is that fallback implementations are unconditionally inlined, which causes problems (multiple tests on 512-bit vectors fail due to memory exhaustion [2]).

[1] In particular:

  • compiler/vectorapi/TestVectorTest.java (w/ -XX:UseAVX=0)
Failed IR Rules (3) of Methods (2)
----------------------------------
1) Method "compiler.vectorapi.TestVectorTest::branch" - [Failed IR rules: 1]:
   * @IR rule 1: "@compiler.lib.ir_framework.IR(phase={DEFAULT}, applyIfPlatformAnd={}, applyIfCPUFeatureOr={}, counts={}, failOn={"_#CMP_I#_", "_#CMOVE_I#_"}, applyIfPlatform={}, applyIfPlatformOr={}, applyIfOr={}, applyIfCPUFeatureAnd={}, applyIf={}, applyIfCPUFeature={}, applyIfAnd={}, applyIfNot={})"
     > Phase "PrintIdeal":
       - failOn: Graph contains forbidden nodes:
         * Constraint 1: "(\\d+(\\s){2}(CmpI.*)+(\\s){2}===.*)"

         * Constraint 2: "(\\d+(\\s){2}(CMoveI.*)+(\\s){2}===.*)"

2) Method "compiler.vectorapi.TestVectorTest::cmove" - [Failed IR rules: 2]:
   * @IR rule 1: "@compiler.lib.ir_framework.IR(phase={DEFAULT}, applyIfPlatformAnd={}, applyIfCPUFeatureOr={}, counts={}, failOn={"_#CMP_I#_"}, applyIfPlatform={}, applyIfPlatformOr={}, applyIfOr={}, applyIfCPUFeatureAnd={}, applyIf={}, applyIfCPUFeature={}, applyIfAnd={}, applyIfNot={})"
     > Phase "PrintIdeal":
       - failOn: Graph contains forbidden nodes:
         * Constraint 1: "(\\d+(\\s){2}(CmpI.*)+(\\s){2}===.*)"

   * @IR rule 2: "@compiler.lib.ir_framework.IR(phase={DEFAULT}, applyIfPlatformAnd={}, applyIfCPUFeatureOr={}, counts={"_#VECTOR_TEST#_", "1", "_#CMOVE_I#_", "1"}, failOn={}, applyIfPlatform={}, applyIfPlatformOr={}, applyIfOr={}, applyIfCPUFeatureAnd={}, applyIf={}, applyIfCPUFeature={}, applyIfAnd={}, applyIfNot={})"
     > Phase "PrintIdeal":
       - counts: Graph contains wrong number of nodes:
         * Constraint 2: "(\\d+(\\s){2}(CMoveI.*)+(\\s){2}===.*)"
           - Failed comparison: [found] 7 = 1 [given]
  • compiler/vectorapi/VectorMaskCompareNotTest.java (w/ -ea -esa -XX:CompileThreshold=100 -XX:+UnlockExperimentalVMOptions -server -XX:-TieredCompilation)
Failed IR Rules (1) of Methods (1)
----------------------------------
1) Method "compiler.vectorapi.VectorMaskCompareNotTest::testCompareULEMaskNotByte" - [Failed IR rules: 1]:
   * @IR rule 1: "@compiler.lib.ir_framework.IR(phase={DEFAULT}, applyIfPlatformAnd={}, applyIfCPUFeatureOr={"asimd", "true", "avx", "true", "rvv", "true"}, counts={"_#XOR_V_MASK#_", "= 0", "_#XOR_V#_", "= 0", "_#VECTOR_MASK_CAST#_", "= 1", "_#VECTOR_MASK_CMP#_", "= 3"}, applyIfPlatform={}, applyIfPlatformOr={}, failOn={}, applyIfOr={}, applyIfCPUFeatureAnd={}, applyIf={}, applyIfCPUFeature={}, applyIfAnd={}, applyIfNot={})"
     > Phase "PrintIdeal":
       - counts: Graph contains wrong number of nodes:
         * Constraint 1: "(\\d+(\\s){2}(XorVMask.*)+(\\s){2}===.*)"
           - Failed comparison: [found] 3 = 0 [given]
         
         * Constraint 2: "(\\d+(\\s){2}(XorV.*)+(\\s){2}===.*)"
           - Failed comparison: [found] 3 = 0 [given]

[2] jdk/incubator/vector/ByteVector512LoadStoreTests.java

    --- Allocation timelime by phase ---
        Phase seq. number                             Bytes                  Nodes
                 (380 older entries lost)
          >4                  incrementalInline 157724064 (+0)        32839 (+0) 
...
           >227          incrementalInline_igvn 174587184 (+261984)   34935 (-18) 
          <4 (cont.)          incrementalInline 174587184 (+0)        34935 (+0) 
         <2 (cont.)                   optimizer 180612856 (+6025672)  34697 (-238) 
...
         <295 (cont.)                  regalloc 177447520 (+0)        83805 (+0) 
          >305                         buildIFG 221892144 (+44444624)  83398 (-407) 
...
         <295 (cont.)                  regalloc 299650328 (+0)        81331 (+0) 
          >318                    regAllocSplit 1073764680 (+774114352)  81331 (+0) 
    ---

#  Internal Error (.../src/hotspot/share/compiler/compilationMemoryStatistic.cpp:935), pid=1510298, tid=1510316
#  fatal error: c2 (1695) jdk/incubator/vector/ByteVector512$ByteShuffle512::intoMemorySegment((Ljava/lang/foreign/MemorySegment;JLjava/nio/ByteOrder;)V): Hit MemLimit - limit: 1073741824 now: 1073764680

@jatin-bhateja
Member Author

[1] In particular:

  • compiler/vectorapi/TestVectorTest.java (w/ -XX:UseAVX=0)
Failed IR Rules (3) of Methods (2)
----------------------------------
1) Method "compiler.vectorapi.TestVectorTest::branch" - [Failed IR rules: 1]:
   * @IR rule 1: "@compiler.lib.ir_framework.IR(phase={DEFAULT}, applyIfPlatformAnd={}, applyIfCPUFeatureOr={}, counts={}, failOn={"_#CMP_I#_", "_#CMOVE_I#_"}, applyIfPlatform={}, applyIfPlatformOr={}, applyIfOr={}, applyIfCPUFeatureAnd={}, applyIf={}, applyIfCPUFeature={}, applyIfAnd={}, applyIfNot={})"
     > Phase "PrintIdeal":
       - failOn: Graph contains forbidden nodes:
         * Constraint 1: "(\\d+(\\s){2}(CmpI.*)+(\\s){2}===.*)"

         * Constraint 2: "(\\d+(\\s){2}(CMoveI.*)+(\\s){2}===.*)"

2) Method "compiler.vectorapi.TestVectorTest::cmove" - [Failed IR rules: 2]:
   * @IR rule 1: "@compiler.lib.ir_framework.IR(phase={DEFAULT}, applyIfPlatformAnd={}, applyIfCPUFeatureOr={}, counts={}, failOn={"_#CMP_I#_"}, applyIfPlatform={}, applyIfPlatformOr={}, applyIfOr={}, applyIfCPUFeatureAnd={}, applyIf={}, applyIfCPUFeature={}, applyIfAnd={}, applyIfNot={})"
     > Phase "PrintIdeal":
       - failOn: Graph contains forbidden nodes:
         * Constraint 1: "(\\d+(\\s){2}(CmpI.*)+(\\s){2}===.*)"

   * @IR rule 2: "@compiler.lib.ir_framework.IR(phase={DEFAULT}, applyIfPlatformAnd={}, applyIfCPUFeatureOr={}, counts={"_#VECTOR_TEST#_", "1", "_#CMOVE_I#_", "1"}, failOn={}, applyIfPlatform={}, applyIfPlatformOr={}, applyIfOr={}, applyIfCPUFeatureAnd={}, applyIf={}, applyIfCPUFeature={}, applyIfAnd={}, applyIfNot={})"
     > Phase "PrintIdeal":
       - counts: Graph contains wrong number of nodes:
         * Constraint 2: "(\\d+(\\s){2}(CMoveI.*)+(\\s){2}===.*)"
           - Failed comparison: [found] 7 = 1 [given]

Hi @iwanowww, this failure is related to the use of UseAVX=0. Here fromBitsCoerced is not intrinsified. Earlier it remained a CallStaticJavaNode, but now it gets inlined, and the newly inlined context has a graph shape which contains CMoveI and CmpI nodes. The test fails since the IR rules don't expect these nodes. One target-agnostic fix is to guard these IR rules with the -XX:-IncrementalInlineVector flag, but that would defeat the purpose of this test since IncrementalInlineVector is on by default. Since the test runs on multiple targets, guarding by UseAVX > 0 may not be desirable.

Let me know what you think.


  • compiler/vectorapi/VectorMaskCompareNotTest.java (w/ -ea -esa -XX:CompileThreshold=100 -XX:+UnlockExperimentalVMOptions -server -XX:-TieredCompilation)
Failed IR Rules (1) of Methods (1)
----------------------------------
1) Method "compiler.vectorapi.VectorMaskCompareNotTest::testCompareULEMaskNotByte" - [Failed IR rules: 1]:
   * @IR rule 1: "@compiler.lib.ir_framework.IR(phase={DEFAULT}, applyIfPlatformAnd={}, applyIfCPUFeatureOr={"asimd", "true", "avx", "true", "rvv", "true"}, counts={"_#XOR_V_MASK#_", "= 0", "_#XOR_V#_", "= 0", "_#VECTOR_MASK_CAST#_", "= 1", "_#VECTOR_MASK_CMP#_", "= 3"}, applyIfPlatform={}, applyIfPlatformOr={}, failOn={}, applyIfOr={}, applyIfCPUFeatureAnd={}, applyIf={}, applyIfCPUFeature={}, applyIfAnd={}, applyIfNot={})"
     > Phase "PrintIdeal":
       - counts: Graph contains wrong number of nodes:
         * Constraint 1: "(\\d+(\\s){2}(XorVMask.*)+(\\s){2}===.*)"
           - Failed comparison: [found] 3 = 0 [given]

         * Constraint 2: "(\\d+(\\s){2}(XorV.*)+(\\s){2}===.*)"
           - Failed comparison: [found] 3 = 0 [given]

With the patch, inlining the vector intrinsic fallback generates more code in the compilation unit, which affects the inlining of other methods, e.g. AbstractMask::intoArray. When intoArray is NOT inlined, the mask must be boxed before being passed to the non-inlined method. As a result, the VectorMaskCmp encapsulated in the VectorBoxNode gets an additional user, a VectorStoreMask created at https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/vector.cpp#L270 during VectorBoxNode scalarization.

This increases the outcnt of the VectorMaskCmpNode and inhibits the optimization which folds XorVMask(VectorMaskCmp, maskAll(true)):
https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/vectornode.cpp#L2366

Increasing InlineSmallCode to 10000 allows intoArray to be inlined; the mask is not boxed, VectorMaskCmp has outcnt=1, XorVMask is folded, and the test passes.
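
The effect of the extra user can be illustrated with a toy model of a single-use guard. This is not the code in vectornode.cpp: `ToyMaskCmp` and `try_fold_mask_not` are hypothetical, and modelling the fold as flipping a `negated` flag on the comparison is an assumption made for illustration.

```cpp
#include <cassert>

// Toy model only: a mask comparison with a user count and a "negated" flag.
struct ToyMaskCmp {
  int outcnt;    // number of users of the comparison node
  bool negated;  // whether the comparison predicate has been inverted
};

// Fold "not(cmp)" into the comparison itself, but only when the comparison
// has exactly one user. Any extra user (for example a store of the boxed
// mask) blocks the fold, because that user still needs the original result.
inline bool try_fold_mask_not(ToyMaskCmp& cmp) {
  assert(cmp.outcnt >= 1);
  if (cmp.outcnt != 1) {
    return false;
  }
  cmp.negated = !cmp.negated;
  return true;
}
```

In this model a comparison with `outcnt == 1` is folded, while one with `outcnt == 2` is left alone, which matches the failing and passing cases described above.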

[2] jdk/incubator/vector/ByteVector512LoadStoreTests.java

    --- Allocation timelime by phase ---
        Phase seq. number                             Bytes                  Nodes
                 (380 older entries lost)
          >4                  incrementalInline 157724064 (+0)        32839 (+0) 
...
           >227          incrementalInline_igvn 174587184 (+261984)   34935 (-18) 
          <4 (cont.)          incrementalInline 174587184 (+0)        34935 (+0) 
         <2 (cont.)                   optimizer 180612856 (+6025672)  34697 (-238) 
...
         <295 (cont.)                  regalloc 177447520 (+0)        83805 (+0) 
          >305                         buildIFG 221892144 (+44444624)  83398 (-407) 
...
         <295 (cont.)                  regalloc 299650328 (+0)        81331 (+0) 
          >318                    regAllocSplit 1073764680 (+774114352)  81331 (+0) 
    ---

#  Internal Error (.../src/hotspot/share/compiler/compilationMemoryStatistic.cpp:935), pid=1510298, tid=1510316
#  fatal error: c2 (1695) jdk/incubator/vector/ByteVector512$ByteShuffle512::intoMemorySegment((Ljava/lang/foreign/MemorySegment;JLjava/nio/ByteOrder;)V): Hit MemLimit - limit: 1073741824 now: 1073764680

Overall, there is a tradeoff in unconditionally inlining vector intrinsic fallbacks, since most of them are bulky and inlining them may impact inlining decisions within their calling context.

Do you think it's beneficial to limit the scope of inlining to only a few intrinsics initially, e.g.
https://github.com/jatin-bhateja/jdk/blob/46fcc9acc05bdef5fd01f4972ed9a66de5f07198/src/hotspot/share/opto/callGenerator.cpp#L463

Please let me know your views.

@openjdk

openjdk Bot commented May 8, 2026

@jatin-bhateja Please do not rebase or force-push to an active PR as it invalidates existing review comments. Note for future reference, the bots always squash all changes into a single commit automatically as part of the integration. See OpenJDK Developers’ Guide for more information.

@iwanowww
Contributor

iwanowww commented May 8, 2026

> compiler/vectorapi/TestVectorTest.java (w/ -XX:UseAVX=0)

> target agnostic fix is to guard these IR rules with -XX:-IncrementalInlineVector flag, but it will defeat the purpose of this test since IncrementalInlineVector is default on.

I don't see why it defeats the purpose of the test. It's an IR test and limiting possible IR shapes is fine.

> compiler/vectorapi/VectorMaskCompareNotTest.java

> With the patch, the vector intrinsic fallback inlining generates more code in the compilation unit, this affects inlining of other methods, e.g. AbstractMask::intoArray.

Do we miss @ForceInline on AbstractMask::intoArray? Any other methods not inlined?

> Do you think it's beneficial to limit the scope of inlining to only few intrinsics initially.

I think regular inlining heuristics should be applied to vector fallback implementations.
