Port datastax/jvector#659: Streaming N:1 on-disk graph index compaction #6
Merged
Introduce OnDiskGraphIndexCompactor and PQRetrainer for streaming N:1 merging of on-disk HNSW indexes without full in-memory materialization. Supports deletion filtering via live-node bitsets, custom ordinal mapping, and PQ codebook retraining.
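As a rough illustration of how deletion filtering and ordinal remapping interact in an N:1 merge, the sketch below assigns dense new ordinals to the live nodes of each source in source order. The class and method names are hypothetical and are not the API introduced by this commit; dense sequential assignment is only one possible mapping, chosen here as an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.BitSet;
import java.util.List;

/** Illustrative only: hypothetical names, not the OnDiskGraphIndexCompactor API. */
final class OrdinalRemapSketch {
    /**
     * Assigns dense new ordinals to the live nodes of each source, in source order.
     * Returns one old-to-new array per source; deleted nodes map to -1.
     */
    static List<int[]> remap(int[] sourceSizes, BitSet[] liveNodes) {
        List<int[]> mappings = new ArrayList<>();
        int next = 0; // next free ordinal in the merged index
        for (int s = 0; s < sourceSizes.length; s++) {
            int[] oldToNew = new int[sourceSizes[s]];
            for (int ord = 0; ord < sourceSizes[s]; ord++) {
                // A cleared bit means the node was deleted and is dropped from the output.
                oldToNew[ord] = liveNodes[s].get(ord) ? next++ : -1;
            }
            mappings.add(oldToNew);
        }
        return mappings;
    }

    public static void main(String[] args) {
        BitSet a = new BitSet();
        a.set(0);
        a.set(2);      // node 1 of the first source is deleted
        BitSet b = new BitSet();
        b.set(0, 2);   // both nodes of the second source are live
        List<int[]> m = remap(new int[] {3, 2}, new BitSet[] {a, b});
        System.out.println(Arrays.toString(m.get(0)));
        System.out.println(Arrays.toString(m.get(1)));
    }
}
```

A caller-supplied mapping would replace the `next++` assignment; the filtering by live-node bitset would stay the same.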
Tests for OnDiskGraphIndexCompactor covering basic compaction, deletions, ordinal remapping, multi-source merging, and FusedPQ compaction scenarios.
Add JFR recording, system stats collection, JSONL logging, git info capture, thread allocation tracking, dataset partitioning, and cloud storage layout utilities used by CompactorBenchmark. Switch jvector-examples logging from logback to log4j2 for consistency with benchmarks-jmh and to avoid duplicate SLF4J bindings in the fat jar.
JMH-based benchmark with configurable workload modes (PARTITION_AND_COMPACT, PARTITION_ONLY, COMPACT_ONLY, BUILD_FROM_SCRATCH), recall measurement, JFR recording, and JSONL result logging. Includes BenchmarkParamCounter for progress tracking, EventLogAnalyzer for post-run analysis, GHA workflow, and exec-maven-plugin integration. Add forced vectorization provider property to VectorizationProvider for benchmark reproducibility.
Add result file patterns to .gitignore, update rat-excludes for the new compaction workflow and catalog cache files.
The benchmarks-jmh-*.jar glob matched the -javadoc jar first, which has no Main-Class. Select the shaded JMH jar explicitly by excluding -javadoc and -sources jars.
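The filter described above can be sketched in Java as follows. This is not the workflow's actual selection code (which is not shown here); the class name, method name, and the version strings in the usage example are hypothetical.

```java
import java.util.List;
import java.util.Optional;

/** Illustrative only: mirrors the "exclude -javadoc and -sources" rule from the commit message. */
final class JarSelectionSketch {
    /** Returns the first benchmarks-jmh jar that is neither a -javadoc nor a -sources jar. */
    static Optional<String> selectBenchmarkJar(List<String> fileNames) {
        return fileNames.stream()
                .filter(n -> n.startsWith("benchmarks-jmh-") && n.endsWith(".jar"))
                .filter(n -> !n.endsWith("-javadoc.jar") && !n.endsWith("-sources.jar"))
                .sorted() // deterministic choice if several candidates remain
                .findFirst();
    }

    public static void main(String[] args) {
        // Hypothetical file names; the real version string depends on the build.
        List<String> names = List.of(
                "benchmarks-jmh-1.0.0-javadoc.jar",
                "benchmarks-jmh-1.0.0-sources.jar",
                "benchmarks-jmh-1.0.0.jar");
        System.out.println(selectBenchmarkJar(names).orElse("<none>"));
    }
}
```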
Use -cp with CompactorBenchmark.main() instead of -jar with JMH Main to avoid BenchmarkList discovery issues in CI's shaded jar.
- Extract CompactWriter into its own file to reduce OnDiskGraphIndexCompactor size
- Rewrite SystemStatsCollector to read /proc files directly in Java instead of spawning bash
- Clarify recall section description in docs/compaction.md
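For readers unfamiliar with reading /proc from Java, a minimal sketch of the idea follows. It parses `/proc/meminfo`-style lines without spawning a shell. Which /proc files SystemStatsCollector actually reads is not stated here, so the choice of `/proc/meminfo` and all names below are assumptions.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: not the SystemStatsCollector implementation. */
final class ProcMeminfoSketch {
    /**
     * Parses lines such as "MemTotal:       16384256 kB" into key -> numeric value.
     * Values are returned as written (kB for most fields); malformed lines are skipped.
     */
    static Map<String, Long> parse(List<String> lines) {
        Map<String, Long> out = new LinkedHashMap<>();
        for (String line : lines) {
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue; // not a "key: value" line
            }
            String[] parts = line.substring(colon + 1).trim().split("\\s+");
            if (parts[0].isEmpty()) {
                continue; // key with no value
            }
            try {
                out.put(line.substring(0, colon).trim(), Long.parseLong(parts[0]));
            } catch (NumberFormatException ignored) {
                // Non-numeric value: skip rather than fail the whole read.
            }
        }
        return out;
    }

    /** Reads the file directly; only meaningful on Linux, where /proc exists. */
    static Map<String, Long> readMeminfo() throws IOException {
        return parse(Files.readAllLines(Path.of("/proc/meminfo")));
    }
}
```

Keeping the parser separate from the file read makes it testable on platforms without /proc.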
Use -cp instead of -jar in docs since the benchmarks-jmh-*.jar glob matches the -javadoc jar first. Change default dataset from glove-100-angular to ada002-100k. Note -Xmx should be adjusted to fit the dataset.
The benchmarks-jmh-*.jar glob expands to multiple jars (shaded + javadoc), causing -cp to misinterpret the second jar as the main class. Configure shade plugin outputFile to produce a fixed compactor-benchmark.jar name. Update docs and CI workflow.
Simplify WorkloadMode enum: PARTITION_ONLY/COMPACT_ONLY/COMPACT_AND_RECALL/BUILD_FROM_SCRATCH are collapsed into PARTITION/COMPACT/BUILD/PARTITION_AND_COMPACT plus a separate measureRecall flag. Fix buildFromScratch timing to include PQ computation and graph construction (previously it timed only the write step). Add fair comparison guidelines to CompactorBenchmark.md.
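One way to picture the refactor is an enum that only selects the work to do, with recall measurement as an independent flag. The constant names come from the commit message; the phase each mode runs is inferred from its name and is an assumption, as are the class and method names.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: not the WorkloadMode enum from CompactorBenchmark. */
final class WorkloadModeSketch {
    enum Mode {
        PARTITION(true, false, false),
        COMPACT(false, true, false),
        BUILD(false, false, true),
        PARTITION_AND_COMPACT(true, true, false);

        final boolean partitions;
        final boolean compacts;
        final boolean builds;

        Mode(boolean partitions, boolean compacts, boolean builds) {
            this.partitions = partitions;
            this.compacts = compacts;
            this.builds = builds;
        }
    }

    /** Lists the phases a run would execute; recall is orthogonal to the mode. */
    static List<String> phases(Mode mode, boolean measureRecall) {
        List<String> phases = new ArrayList<>();
        if (mode.partitions) phases.add("partition");
        if (mode.compacts) phases.add("compact");
        if (mode.builds) phases.add("build");
        if (measureRecall) phases.add("recall");
        return phases;
    }
}
```

With the flag separated, a combined mode such as the former COMPACT_AND_RECALL becomes `phases(Mode.COMPACT, true)` in this sketch.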
Support 10%/90% and 1%/99% partition splits for benchmarking compaction of a small new segment into a large existing index. Add split distribution reference table to CompactorBenchmark.md.
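The arithmetic behind such a split can be sketched as below. The function name, the rounding rule, and the small-partition-first ordering are assumptions; the actual distributions are documented in the reference table in CompactorBenchmark.md.

```java
/** Illustrative only: not the partitioning code used by CompactorBenchmark. */
final class SplitSketch {
    /**
     * Splits {@code total} vectors into a small and a large partition.
     * Returns {small, large}; the small side gets at least one vector.
     */
    static int[] tieredSplit(int total, int smallPercent) {
        if (total < 2 || smallPercent <= 0 || smallPercent >= 100) {
            throw new IllegalArgumentException("need total >= 2 and 0 < smallPercent < 100");
        }
        // Use long arithmetic so total * smallPercent cannot overflow int.
        int small = Math.max(1, (int) ((long) total * smallPercent / 100));
        return new int[] {small, total - small};
    }

    public static void main(String[] args) {
        int[] tenNinety = tieredSplit(100_000, 10);
        int[] oneNinetyNine = tieredSplit(100_000, 1);
        System.out.println(tenNinety[0] + " / " + tenNinety[1]);         // 10000 / 90000
        System.out.println(oneNinetyNine[0] + " / " + oneNinetyNine[1]); // 1000 / 99000
    }
}
```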
Summary
Ports datastax/jvector#659 (Streaming N:1 compaction) into this fork. All 13 upstream commits cherry-picked onto our `main` (which carries the four local performance commits) without conflicts.

Adds `OnDiskGraphIndexCompactor`, a streaming N:1 compaction algorithm for merging multiple on-disk HNSW graph indexes into a single compacted index, plus PQ codebook retraining (`PQRetrainer`), `CompactorBenchmark` (JMH), and supporting reporting/storage utilities.

See `docs/compaction.md` and `benchmarks-jmh/src/main/java/io/github/jbellis/jvector/bench/CompactorBenchmark.md` for the full algorithm description and benchmarking instructions.

Upstream commits (cherry-picked, in order)
- 7c6ccd99 Add on-disk graph index compaction algorithm
- 52e72171 Add compaction unit tests
- 475ee063 Add reporting and storage infrastructure for CompactorBenchmark
- ce40c754 Add CompactorBenchmark and tooling
- c75256af Update build config and project metadata for compaction
- 415f907b Fix JMH jar selection in run-compaction.yml
- 224a709a Fix CompactorBenchmark invocation in run-compaction.yml
- 191a40d2 Address PR review feedback (extracts `CompactWriter`, rewrites `SystemStatsCollector` in pure Java)
- 06fff177 Fix benchmark invocation in docs and default dataset
- 6178afa1 Fix jar selection: use fixed output name compactor-benchmark.jar
- 0ab1deaf Refactor workload modes and fix build-from-scratch timing
- 3127043f Add TIERED_10_90 and TIERED_1_99 split distributions
- 632bc76d fix for bug when fused pq is used with no hierarchy (datastax/jvector#664)

Verification
- `mvn -DskipTests -pl jvector-base,jvector-tests,jvector-examples,benchmarks-jmh -am compile` → SUCCESS
- `mvn -pl jvector-tests -am -Dtest=TestOnDiskGraphIndexCompactor -Dsurefire.failIfNoSpecifiedTests=false test` → 7 tests, 0 failures

Test plan
- `CompactorBenchmark` end-to-end on a representative dataset
- `BufferVectorFloat`, segmented `DenseIntMap`, and striped `SparseIntMap` paths under load