implement algorithm 5 for inplace repair and algorithm 6 to clean up … #648

Open
Kartikk1127 wants to merge 2 commits into datastax:main from Kartikk1127:improve_markNodeDeleted

Conversation

@Kartikk1127 Kartikk1127 commented Mar 19, 2026

Problem

The current markNodeDeleted + cleanup() workflow has O(N) cost per deletion:

  • markNodeDeleted only flips a bit in the deleted set
  • cleanup() scans every node in the graph via nodeStream to find in-neighbors
    of the deleted node, then rebuilds their neighbor lists

This means per-deletion cost grows linearly with graph size, and crucially it keeps
growing over time as more deletions accumulate.

Solution

The IP-DiskANN paper (arXiv:2502.13826) describes two algorithms that solve this:

Algorithm 5 — In-place deletion repair:
Instead of scanning all N nodes to find in-neighbors, run a GreedySearch toward
the deleted node's vector. Nodes the search visits are the approximate in-neighbors.
This reduces in-neighbor discovery from O(N) to O(DELETION_LD) where DELETION_LD
is the beam width of the search.

The sequence per deletion (see the code sketch after this list):

  1. Flip the deleted bit
  2. Update entry point if deleted node is the current entry point
  3. GreedySearch toward deleted node's vector with beam width DELETION_LD
  4. For each visited node z where z → deleted_node exists: repair z's neighbor
    list using the top-DELETION_LD search results as replacement candidates
  5. Physically remove the node
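
A minimal Java sketch of that sequence, assuming the greedy-search and repair primitives the index already has; every field and helper name below is an illustrative placeholder, not jvector's actual internals:

```java
import java.util.BitSet;
import java.util.List;

// Hedged sketch of Algorithm 5 from IP-DiskANN (arXiv:2502.13826).
// All fields and helper methods are illustrative placeholders.
abstract class InPlaceDeleteSketch {
    final BitSet deletedNodes = new BitSet();
    int entryPoint;
    static final int DELETION_LD = 100;  // beam width of the repair search (illustrative value)

    void markNodeDeleted(int node) {
        // 1. Flip the deleted bit
        deletedNodes.set(node);

        // 2. Move the entry point off the deleted node if necessary
        if (node == entryPoint) {
            entryPoint = pickNewEntryPoint();
        }

        // 3. GreedySearch toward the deleted node's vector; the nodes the search visits
        //    serve as approximate in-neighbors of the deleted node
        SearchResult result = greedySearch(vectorOf(node), DELETION_LD);

        // 4. Repair every visited node that still points at the deleted node, using the
        //    top-DELETION_LD search results as replacement candidates
        for (int z : result.visited()) {
            if (outNeighbors(z).contains(node)) {
                repairNeighbors(z, node, result.topK(DELETION_LD));
            }
        }

        // 5. Physically remove the node and its out-edges
        removeNode(node);
    }

    // Graph and search primitives the real index already provides.
    abstract float[] vectorOf(int node);
    abstract List<Integer> outNeighbors(int node);
    abstract SearchResult greedySearch(float[] query, int beamWidth);
    abstract void repairNeighbors(int node, int deletedNode, List<Integer> candidates);
    abstract void removeNode(int node);
    abstract int pickNewEntryPoint();

    interface SearchResult {
        List<Integer> visited();
        List<Integer> topK(int k);
    }
}
```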

Algorithm 6 — Dangling edge sweep:
Algorithm 5 repairs in-neighbors found via the search path, but greedy search
is approximate and may miss some. Algorithm 6 is a periodic O(N × M) sweep
(no distance calculations) that removes any remaining out-edges pointing to
absent nodes.
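
Continuing the same hypothetical sketch, the sweep only needs the adjacency lists and the deleted set (no distances), so a minimal version could look like this, with graphSize() as one more placeholder:

```java
// Hedged sketch of Algorithm 6: a single O(N × M) pass over every adjacency list that
// drops out-edges pointing at nodes no longer present. No distance computations are needed.
void consolidateDanglingEdges() {
    for (int node = 0; node < graphSize(); node++) {
        if (deletedNodes.get(node)) {
            continue;  // deleted nodes are being removed anyway
        }
        // Walk the node's ≤ M out-edges and drop any that target a deleted node
        outNeighbors(node).removeIf(deletedNodes::get);
    }
}
```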

Benchmark Results (SIFT-1M, M=16, efConstruction=200, efSearch=200)

100K deletions (10% of index), 1000 query vectors, topK=10:

  Batch   Deleted   Avg/delete   Recall@10   Degradation
  1/10     10,000    1.55 ms      0.9517      0.17%
  2/10     20,000    1.54 ms      0.9508      0.26%
  5/10     50,000    1.52 ms      0.9459      0.75%
  10/10   100,000    1.49 ms      0.9279      2.55%

Baseline recall: 0.9534 → Post-deletion recall: 0.9279 (2.55% degradation)

Key observations:

  • Per-deletion cost is flat across all 10 batches (1.49–1.55ms) — does not grow as deletions accumulate
  • Old O(N) nodeStream approach: ~23ms/delete initially, growing to ~39ms — and compounding
  • 100K deletions completed in 2.5 minutes vs projected ~60 minutes with old approach
  • Recall degradation stays well within acceptable bounds at 10% deletion rate

API changes

markNodeDeleted becomes self-contained — no cleanup() call needed after deletion.
cleanup() is still required before writing to disk.

consolidateDanglingEdges() is a new public method for Algorithm 6 execution.
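
A usage sketch of the new contract; the method names come from this PR's description, and the exact GraphIndexBuilder signatures may differ:

```java
// Per-deletion repair now happens inline; no cleanup() call after each delete.
builder.markNodeDeleted(nodeId);

// Optionally force a dangling-edge sweep (Algorithm 6) instead of waiting for the
// consolidation-threshold auto-trigger.
builder.consolidateDanglingEdges();

// cleanup() is still required once before the graph is written to disk.
builder.cleanup();
```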

Implementation

The PR implements both algorithms: Algorithm 5 runs inline inside markNodeDeleted, and Algorithm 6 is exposed as consolidateDanglingEdges().

References

IP-DiskANN paper: arXiv:2502.13826 (https://arxiv.org/abs/2502.13826)
@Kartikk1127 (Author)

Hey, is there any update on this PR? Happy to talk if it needs to be better aligned with PR policies or anything else.

@dian-lun-lin

Thanks for the PR! The implementation overall looks good to me. My main concern is latency — specifically, how the per-call cost of markNodeDeleted has changed and how Algorithm 6 affects tail latency.

The "1.49–1.55 ms, flat across batches" result is good news for Algorithm 5 in isolation, but I don't think it tells us about the impact of Algorithm 6. Two structural reasons it gets hidden:

  • The auto-trigger never fires in testRecallDegradation. With consolidationThreshold = 0.20 (default) and indexSize = 10,000, the trigger condition (total - lastAt) >= 0.20 × size(0) requires ≥ ~2,000 deletes. The test only deletes 1,000. So Algorithm 6 doesn't run at all — we're measuring Algorithm 5 alone. That's why per-deletion cost is flat: there's no sweep happening to spike it.
  • Batch averaging masks individual spikes. avgPerDelete = batchMs / 100 divides any single slow call by 100. A 50× spike on one delete out of 100 barely moves the displayed number.

I think the benchmark is missing two things:

  • Million-scale dataset. Algorithm 6 is O(N × M) — on 10K nodes the sweep seems to be invisible.
  • Enough deletes to actually cross the threshold. Either lower setConsolidationThreshold(0.05) or delete ≥ 20% of the graph. Otherwise the synchronous consolidateDanglingEdges() call inside markNodeDeleted simply never runs and we have no data on it.

I'd suggest a benchmark that builds a million-scale index and deletes 30% with per-call timing.
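
Roughly this shape, with per-call timing rather than batch averages; apart from markNodeDeleted and setConsolidationThreshold (both from this PR), everything here is sketch scaffolding:

```java
// Per-call timing so a single slow markNodeDeleted (e.g. one that triggers the synchronous
// Algorithm 6 sweep) shows up as a spike instead of being averaged away.
builder.setConsolidationThreshold(0.05);                // make sure the auto-trigger actually fires

List<Integer> nodesToDelete = pickNodesToDelete(0.30);  // hypothetical helper: 30% of a million-scale index
long maxNanos = 0;
long totalNanos = 0;
for (int node : nodesToDelete) {
    long start = System.nanoTime();
    builder.markNodeDeleted(node);
    long elapsed = System.nanoTime() - start;
    maxNanos = Math.max(maxNanos, elapsed);
    totalNanos += elapsed;
}
System.out.printf("avg %.3f ms, max %.3f ms per delete%n",
        totalNanos / 1e6 / nodesToDelete.size(), maxNanos / 1e6);
```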
