feat(data): coalesce position deletes into range inserts #645
Open
Baunsgaard wants to merge 2 commits into
0236bf7 to 3a182b2
Add ForEachPositionDelete (the C++ equivalent of Java's PositionDeleteRangeConsumer) and route DeleteLoader through it, replacing the per-position PositionDeleteIndex::Delete(pos) call. The function sniffs a 1024-position prefix and dispatches to either run coalescing (CRoaring addRange) or bulk addMany grouped by high-32-bit key.
Also rework DeleteLoader::LoadPositionDelete to read Arrow batches via nanoarrow's ArrowArrayView directly. When the delete file's referenced_data_file matches the target (V2 writer hint), positions are passed as a zero-copy span; otherwise a per-batch staging vector filters by path.
Local microbenchmarks: 2.2x-10.6x for ForEachPositionDelete and 2.1x-2.5x end-to-end through LoadPositionDeletes. Equivalent of apache/iceberg#16052.
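For readers following along, the run-coalescing branch described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `Run` and `CoalesceRuns` are hypothetical names, the input is assumed to be sorted ascending, and the prefix sniffing and dispatch are not shown. Each resulting run could then be handed to a single range insert instead of one insert per position.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical type: a half-open run [begin, end) of deleted positions.
struct Run {
  uint64_t begin;
  uint64_t end;
};

// Hypothetical helper: collapses ascending positions into contiguous runs.
// Duplicates are tolerated; unsorted input violates the precondition.
std::vector<Run> CoalesceRuns(const std::vector<uint64_t>& sorted_positions) {
  std::vector<Run> runs;
  for (uint64_t pos : sorted_positions) {
    if (!runs.empty()) {
      Run& last = runs.back();
      assert(pos + 1 >= last.end && "input must be sorted ascending");
      if (pos + 1 == last.end) continue;  // duplicate of the last position
      if (pos == last.end) {              // extends the current run by one
        ++last.end;
        continue;
      }
    }
    runs.push_back(Run{pos, pos + 1});    // gap found: start a new run
  }
  return runs;
}
```

With the input `{3, 4, 5, 5, 9, 10, 42}` this yields three runs, `[3, 6)`, `[9, 11)` and `[42, 43)`, so seven per-position inserts become three range inserts. The payoff depends on how contiguous the delete positions are, which is presumably why the PR sniffs a prefix before choosing a branch.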
3a182b2 to efb1db3
Adds an integration test that exercises the loader's referenced_data_file fast path with enough rows (128) to clear the consumer's 64-element sniff threshold, and an assertion on the existing mixed-paths test that locks in the filter-path routing. Documents which branch each test covers so a future refactor of PositionDeleteWriter or the loader can't silently take the wrong path.
wgtmac reviewed May 22, 2026
// Bulk-add positions sharing high-32-bit `key`. Internal hook for
// `PositionDeleteIndex::BulkAddForKey`; per-key grouping is the caller's
// job, keeping this a thin wrapper around CRoaring's `addMany`.
void AddManyForKey(int32_t key, const uint32_t* positions, size_t n);
Member
Suggested change
- void AddManyForKey(int32_t key, const uint32_t* positions, size_t n);
+ void AddManyForKey(int32_t key, std::span<const uint32_t> positions);
I'd prefer span here
// `addMany` (through `BulkAddForKey`). The thread-local buffer is
// reused across calls; nested invocations on the same thread would
// corrupt it -- see `\warning` on `ForEachPositionDelete`.
thread_local std::vector<uint32_t> bulk_key_positions;
Member
This looks hacky and limits its use. How about passing bulk_key_positions as a parameter so that the caller can take full control over it?
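To make the suggestion concrete, here is a rough sketch of what a caller-owned scratch buffer might look like. It is an illustration under assumptions, not the PR's implementation: `KeySink` and `BulkAddGrouped` are hypothetical names, and the sink stands in for a per-key bulk add such as `AddManyForKey`. Because the buffer is a parameter, nested or concurrent invocations can each bring their own vector, and a caller that loops over many batches still gets capacity reuse.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical sink standing in for a per-key bulk add.
using KeySink =
    std::function<void(int32_t key, const uint32_t* low_bits, size_t n)>;

// Hypothetical helper: walks `positions`, groups consecutive entries that
// share the same high 32 bits, and passes each group's low 32 bits to `sink`.
// `scratch` is owned by the caller; clear() keeps its capacity between groups.
// Unsorted input still works but may produce several calls for the same key.
void BulkAddGrouped(const std::vector<int64_t>& positions,
                    std::vector<uint32_t>& scratch, const KeySink& sink) {
  size_t i = 0;
  while (i < positions.size()) {
    assert(positions[i] >= 0 && "positions must be non-negative");
    const int32_t key = static_cast<int32_t>(positions[i] >> 32);
    scratch.clear();
    while (i < positions.size() &&
           static_cast<int32_t>(positions[i] >> 32) == key) {
      // Truncation keeps only the low 32 bits of the position.
      scratch.push_back(static_cast<uint32_t>(positions[i]));
      ++i;
    }
    sink(key, scratch.data(), scratch.size());
  }
}
```

A caller would declare one `std::vector<uint32_t>` next to its batch loop and pass it to every call. If the `std::span` signature suggested earlier in this review is adopted, the sink would take a `std::span<const uint32_t>` in place of the pointer and length pair.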