
Wire calculatePartialSums to native SIMD via Panama FFI downcall#651

Open
r-devulap wants to merge 5 commits into main from use-native-calcpartialsum

Conversation

@r-devulap (Contributor) commented Apr 2, 2026

This change uses a native implementation of calculatePartialSums to accelerate PQ query scoring.
On ada002-100k with FUSED_PQ (numPQsubspaces / M = 96, JDK build 23.0.1+11-39), it delivers 2–3× higher QPS and 40–65% lower mean latency across common overquery settings. Index build time, disk usage, and heap usage show no meaningful regression. The optimization is isolated to the PQ path; non-PQ queries are unaffected.

Combined QPS and Latency Results (FUSED_PQ)

topK = 10

| Overquery | QPS (main) | QPS (native) | Speedup | Latency ms (main) | Latency ms (native) | Latency ↓ |
|---|---|---|---|---|---|---|
|  | 8,987 | 26,101 | 2.9× | 0.462 | 0.171 | −63% |
|  | 8,361 | 21,706 | 2.6× | 0.505 | 0.202 | −60% |
|  | 7,199 | 14,364 | 2.0× | 0.590 | 0.292 | −51% |
| 10× | 5,796 | 9,640 | 1.7× | 0.731 | 0.431 | −41% |

topK = 100

| Overquery | QPS (main) | QPS (native) | Speedup | Latency ms (main) | Latency ms (native) | Latency ↓ |
|---|---|---|---|---|---|---|
|  | 5,743 | 9,568 | 1.7× | 0.736 | 0.439 | −40% |
|  | 4,174 | 5,574 | 1.3× | 1.022 | 0.733 | −28% |

Summary of changes in this PR:

  • Wire calculatePartialSums in NativeVectorUtilSupport to a new Panama FFI downcall for the native calculate_partial_sums_f32_512 SIMD implementation.
  • Replace the icelake-server gcc target with skylake-avx512 in the build script (icelake-server isn't required to build our native code).
  • Remove global mutable state: eliminate the initialIndexRegister, indexIncrement, maskSeventhBit, and maskEighthBit globals and their constructor initializer; move the mask constants (maskSeventhBit, maskEighthBit) to local scope inside lookup_partial_sums.
  • Add shared reduce_add_128_ps and reduce_add_256_ps helper functions using proper horizontal-add sequences instead of store-to-array loops (a minimal sketch follows this list).
  • Remove redundant if (length >= N) guards in all SIMD kernels; the loop body already handles the zero-iteration case correctly.
  • Replace the store-to-aligned-array horizontal reduction pattern with the new helpers across all 128- and 256-bit dot product and euclidean distance functions.
  • Remove the preferred_size parameter from dot_product_f32 and euclidean_f32; always dispatch to AVX-512 when length >= 16.
  • Standardize inline annotations: replace __attribute__((always_inline)) inline with JV_FINLINE / JV_INLINE macros throughout.
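
As referenced in the list above, here is a minimal sketch of shuffle+add horizontal reductions of this shape. The helper names match the PR's, but the bodies are illustrative assumptions, not copied from jvector_simd.c (plain static inline stands in for the JV_FINLINE macro):

```c
#include <immintrin.h>

/* Sum the four floats of an SSE register using shuffle+add pairs. */
static inline float reduce_add_128_ps(__m128 v) {
    __m128 shuf = _mm_movehdup_ps(v);    /* (v1, v1, v3, v3) */
    __m128 sums = _mm_add_ps(v, shuf);   /* (v0+v1, -, v2+v3, -) */
    shuf = _mm_movehl_ps(shuf, sums);    /* bring (v2+v3) down to lane 0 */
    sums = _mm_add_ss(sums, shuf);       /* lane 0 = v0+v1+v2+v3 */
    return _mm_cvtss_f32(sums);
}

/* Fold the 256-bit register to 128 bits, then reuse the helper above. */
static inline float reduce_add_256_ps(__m256 v) {
    __m128 lo = _mm256_castps256_ps128(v);
    __m128 hi = _mm256_extractf128_ps(v, 1);
    return reduce_add_128_ps(_mm_add_ps(lo, hi));
}
```

Unlike the old store-to-aligned-array pattern, this keeps the whole reduction in registers.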

github-actions (bot) commented Apr 2, 2026

Before you submit for review:

  • Does your PR follow guidelines from CONTRIBUTIONS.md?
  • Did you summarize what this PR does clearly and concisely?
  • Did you include performance data for changes which may be performance impacting?
  • Did you include useful docs for any user-facing changes or features?
  • Did you include useful javadocs for developer oriented changes, explaining new concepts or key changes?
  • Did you trigger and review regression testing results against the base branch via Run Bench Main?
  • Did you adhere to the code formatting guidelines (TBD)?
  • Did you group your changes for easy review, providing meaningful descriptions for each commit?
  • Did you ensure that all files contain the correct copyright header?

If you did not complete any of these, then please explain below.

r-devulap force-pushed the use-native-calcpartialsum branch from 70cd2fb to de4ff79 on April 7, 2026 at 04:33
@jshook (Contributor) left a comment:

I would like to see much more coverage of these with numerical tests. Are there some already which aren't seen here?

@ashkrisk (Contributor) left a comment:

Looks like an excellent set of optimizations. Left a few comments.

+1 to @jshook's comment about numerical tests. This PR touches almost every single function in the native supporting library, and it would be good to have a set of tests accompanying it, perhaps also in C.

if [ "$(printf '%s\n' "$MIN_GCC_VERSION" "$CURRENT_GCC_VERSION" | sort -V | head -n1)" = "$MIN_GCC_VERSION" ]; then
rm -rf ../resources/libjvector.so
gcc -fPIC -O3 -march=icelake-server -c jvector_simd.c -o jvector_simd.o
gcc -fPIC -O3 -march=skylake-avx512 -c jvector_simd.c -o jvector_simd.o
Contributor:

Is there a strong reason to lower the target micro-architecture version?

r-devulap (Contributor Author):

This change addresses an actual bug in our build configuration. We currently compile targeting icelake-server, but at runtime we only check for skylake-avx512. This mismatch allows the compiler to emit Ice Lake–specific instructions that may be executed on a Skylake CPU, which can result in a SIGILL.

Targeting skylake-avx512 resolves this issue and is sufficient for the kernels we currently have; there’s no requirement for Ice Lake–specific features here.
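
To make the mismatch concrete, this is the kind of runtime guard that targeting skylake-avx512 pairs with; the function name here is hypothetical, and the repo's actual check may differ:

```c
#include <stdbool.h>

/* skylake-avx512 implies AVX512 F/CD/BW/DQ/VL; checking those GCC builtins
 * at runtime matches what the compiler may emit for that target. */
static bool cpu_supports_skylake_avx512(void) {
    return __builtin_cpu_supports("avx512f")
        && __builtin_cpu_supports("avx512cd")
        && __builtin_cpu_supports("avx512bw")
        && __builtin_cpu_supports("avx512dq")
        && __builtin_cpu_supports("avx512vl");
}
```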

```c
case 0:
    calculate_partial_sums_euclidean_f32_512(codebook, codebookIndex, size, clusterCount, query, queryOffset, partialSums);
    break;
case 1:
```
Contributor:

Can we use public enums here? Jextract should automatically make the enums available to the Java code as constants. Alternatively we could skip the parameter-based dispatch altogether and simply expose both versions of the function to Java code.

r-devulap (Contributor Author):

> Alternatively we could skip the parameter-based dispatch altogether and simply expose both versions of the function to Java code.

Agree. I have updated to use this approach.
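
For illustration, the adopted approach amounts to exporting both kernels in the public header so jextract generates a direct downcall for each; the parameter lists below are inferred from the call site above, with pointer and int types assumed rather than copied from the repo:

```c
/* jvector_simd.h (sketch): the similarity-kind flag and the switch
 * dispatch are gone; Java binds each function directly. */
void calculate_partial_sums_dot_f32_512(
        const float *codebook, int codebookIndex, int size, int clusterCount,
        const float *query, int queryOffset, float *partialSums);

void calculate_partial_sums_euclidean_f32_512(
        const float *codebook, int codebookIndex, int size, int clusterCount,
        const float *query, int queryOffset, float *partialSums);
```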

```c
__m512 vaMagnitude = _mm512_setzero_ps();
int i = 0;
int limit = baseOffsetsLength - (baseOffsetsLength % 16);
const __m512i initialIndexRegister = _mm512_setr_epi32(-16, -15, -14, -13, -12, -11, -10, -9, -8, -7, -6, -5, -4, -3, -2, -1);
```
Contributor:

It's good that this isn't a global variable anymore, but given that it's used in multiple places does it make sense to have it as a global constant?

r-devulap (Contributor Author):

I generally try to avoid global variables and prefer function‑local const values to keep dependencies explicit and contained, unless avoiding globals would cause significant duplication. In this case, it’s only used in two places, so the duplication is minimal.
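
For what it's worth, the −16 bias fits the pattern where the increment is added at the top of the loop, so the first iteration still sees lane indices 0..15. A sketch under that assumption (the gather and the function name are illustrative, not the repo's code):

```c
#include <immintrin.h>

static float sum_first_multiple_of_16(const float *base, int baseOffsetsLength) {
    __m512 acc = _mm512_setzero_ps();
    __m512i indices = _mm512_setr_epi32(-16, -15, -14, -13, -12, -11, -10, -9,
                                        -8, -7, -6, -5, -4, -3, -2, -1);
    const __m512i indexIncrement = _mm512_set1_epi32(16);
    for (int i = 0; i + 16 <= baseOffsetsLength; i += 16) {
        indices = _mm512_add_epi32(indices, indexIncrement); /* 0..15, then 16..31, ... */
        acc = _mm512_add_ps(acc, _mm512_i32gather_ps(indices, base, 4));
    }
    return _mm512_reduce_add_ps(acc);
}
```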

Contributor:

Looks like a lot of functions that are no longer in the public header are still declared here. Should fix itself on re-running jextract.

r-devulap (Contributor Author):

Good catch — these should indeed be removed. It looks like we check in NativeSimdOps.java with signatures that are supposed to mirror the public header. That also explains why I keep seeing a locally modified file every time I build this branch.

r-devulap (Contributor Author):

Fixed it.

@r-devulap (Contributor Author) commented:

> I would like to see much more coverage of these with numerical tests. Are there some already which aren't seen here?

@jshook @ashkrisk
Apologies for the delayed response. For the primary changes introduced in this patch, there is already solid test coverage exercising all the modified scenarios. These are covered in the existing tests here:

```java
private void testPQEncodings(int dimension, int codebooks) {
    // Generate a PQ for random vectors
    var vectors = createRandomVectors(512, dimension);
    var ravv = new ListRandomAccessVectorValues(vectors, dimension);
    var pq = ProductQuantization.compute(ravv, codebooks, 256, false);

    // Compress the vectors
    var cv = pq.encodeAll(ravv);

    // compare the encoded similarities to the raw
    for (var vsf : List.of(VectorSimilarityFunction.EUCLIDEAN, VectorSimilarityFunction.DOT_PRODUCT, VectorSimilarityFunction.COSINE)) {
        double delta = 0;
        for (int i = 0; i < 10; i++) {
            var q = TestUtil.randomVector(getRandom(), dimension);
            var f = cv.precomputedScoreFunctionFor(q, vsf);
            for (int j = 0; j < vectors.size(); j++) {
                delta += abs(f.similarityTo(j) - vsf.compare(q, vectors.get(j)));
            }
        }
        // https://chat.openai.com/share/7ced3fc8-275a-4134-978c-c822275c3e1f
        // is there a better way to check for within-expected bounds?
        var expectedDelta = vsf == VectorSimilarityFunction.EUCLIDEAN
                ? 96.98 * log(3.26 + dimension) / log(1.92 + codebooks) - 112.15
                : 152.69 * log(3.76 + dimension) / log(1.95 + codebooks) - 180.86;
        // expected is accurate to within about 10% *on average*. experimentally 25% is not quite enough
        // to avoid false positives, so we pad by 40%
        assert delta <= 1.4 * expectedDelta : String.format("%s > %s for %s with %d dimensions and %d codebooks", delta, expectedDelta, vsf, dimension, codebooks);
    }
}

@Test
public void testPQEncodings() {
    // start with i=2 (dimension 4) b/c dimension 2 is an outlier for our error prediction
    for (int i = 2; i <= 8; i++) {
        for (int M = 1; M <= i; M++) {
            testPQEncodings(2 * i, M);
        }
    }
}
```

Beyond that, there are only two additional native functions exposed to Java via FFI—assemble_and_sum_f32_512 and pq_decoded_cosine_similarity_f32_512—and neither of these is affected by this change. The remaining modifications are mechanical in nature (e.g., adding static and inline qualifiers) and do not alter behavior.

@r-devulap (Contributor Author) commented:

> +1 to @jshook's comment about numerical tests. This PR touches almost every single function in the native supporting library, and it would be good to have a set of tests accompanying it, perhaps also in C.

I think this is covered by my earlier comment, but to clarify: only three native functions are exposed to Java, and this patch modifies just one of them. That function already has strong numerical test coverage, which was actually helpful in catching bugs in an earlier version of the code.

r-devulap added 5 commits May 4, 2026 08:00
* Replace icelake-server gcc target with skylake-avx512 in build script

* Remove global mutable state: eliminate initialIndexRegister,
  indexIncrement, maskSeventhBit, maskEighthBit globals and their
  constructor initializer; move mask constants (maskSeventhBit,
  maskEighthBit) to local scope inside lookup_partial_sums

* Add shared reduce_add_128_ps and reduce_add_256_ps helper functions
  using proper horizontal-add sequences instead of store-to-array loops

* Remove redundant if (length >= N) guards in all SIMD kernels — the
  loop body already handles the zero-iteration case correctly

* Replace store-to-aligned-array horizontal reduction pattern with the
  new helpers across all 128- and 256-bit dot product and euclidean
  distance functions

* Remove preferred_size parameter from dot_product_f32 and
  euclidean_f32; always dispatch to AVX-512 when length >= 16

* Standardize inline annotations: replace __attribute__((always_inline))
  inline with JV_FINLINE / JV_INLINE macros throughout
Add SIMD fast paths for PQ subvector sizes 4, 8 & 16 on AVX-512

Add SIMD fast paths in calculate_partial_sums_dot_f32_512 and
calculate_partial_sums_euclidean_f32_512 for the most common PQ
subvector sizes:

- size == 4: broadcast a 128-bit query fragment across all four 128-bit
  lanes of a ZMM register, load four consecutive centroids at once, and
  reduce each lane independently using two shuffle+add pairs. Produces 4
  partial sums per loop iteration instead of 1.

- size == 8: broadcast a 256-bit query fragment across both 256-bit
  halves of a ZMM register, load two consecutive centroids at once, and
  reduce across 128-bit lanes followed by within-lane shuffles. Produces
  2 partial sums per loop iteration instead of 1.

- size == 16: the query and each centroid fit in a single ZMM register;
  load the query once, then loop over the centroids. Produces one
  partial sum per loop iteration, but avoids reloading the query on
  every iteration.

All three paths fall back to the default dot_product_f32 /
euclidean_f32 loop for any tail elements or unsupported sizes.
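
To illustrate the size == 4 dot-product path described above, here is a hedged sketch; the function name, the store-based lane extraction, and the tail handling are assumptions rather than the repo's exact code.

```c
#include <immintrin.h>

/* Scores four 4-float centroids per iteration against one query fragment. */
static void partial_sums_dot_size4_sketch(const float *centroids, int clusterCount,
                                          const float *query, float *partialSums) {
    /* Broadcast the 128-bit query fragment into all four 128-bit lanes. */
    const __m512 q = _mm512_broadcast_f32x4(_mm_loadu_ps(query));
    int c = 0;
    for (; c + 4 <= clusterCount; c += 4) {
        __m512 prod = _mm512_mul_ps(q, _mm512_loadu_ps(centroids + 4 * c));
        /* Two shuffle+add pairs reduce each 128-bit lane independently;
         * afterwards element 0 of every lane holds that lane's dot product. */
        prod = _mm512_add_ps(prod, _mm512_shuffle_ps(prod, prod, _MM_SHUFFLE(2, 3, 0, 1)));
        prod = _mm512_add_ps(prod, _mm512_shuffle_ps(prod, prod, _MM_SHUFFLE(1, 0, 3, 2)));
        float lanes[16];
        _mm512_storeu_ps(lanes, prod);
        partialSums[c]     = lanes[0];
        partialSums[c + 1] = lanes[4];
        partialSums[c + 2] = lanes[8];
        partialSums[c + 3] = lanes[12];
    }
    for (; c < clusterCount; c++) {   /* scalar tail, as the fallback path */
        const float *v = centroids + 4 * c;
        partialSums[c] = v[0] * query[0] + v[1] * query[1]
                       + v[2] * query[2] + v[3] * query[3];
    }
}
```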
r-devulap force-pushed the use-native-calcpartialsum branch from 159c21f to 0276bf1 on May 4, 2026 at 08:01