perf: optimize range tombstone lookup in table-skip and point-read#55
Conversation
- Separate KV/RT seqno tracking: new `seqno#kv_max` metadata field excludes range tombstone seqnos, enabling table-skip for covering RTs stored in the same table (previously impossible because `rt.seqno > get_highest_seqno()` was always false for co-located RTs) - Binary search on sorted RT list for table-skip: partition_point narrows candidates to RTs with start <= table_min, reducing O(tables × rt_count) to O(tables × (log(rt_count) + k)) - Binary search for point-read suppression: per-table RT lists are sorted on load, enabling partition_point to skip RTs with start > key - Backward-compatible: old tables without `seqno#kv_max` fall back to `seqno#max` (conservative but correct) Closes #27
|
Warning Rate limit exceeded
⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (6)
📝 WalkthroughWalkthroughThis change optimizes range-tombstone suppression/lookups by tracking a KV-only highest seqno, deterministically sorting decoded range tombstones by (start asc, seqno desc), and narrowing candidate tombstones with binary-search ( Changes
Sequence Diagram(s)sequenceDiagram
participant Writer as Table Writer
participant Meta as Table Metadata
participant Table as SST Table
participant TreeIter as TreeIterator
participant RangeIter as RangeIterator
Note over Writer,Meta: Write path
Writer->>Meta: update highest_kv_seqno (max KV seqnos)
Writer->>Meta: persist highest_kv_seqno & highest_seqno
Note over Table: Recover
Table->>Meta: load highest_kv_seqno
Table->>Table: sort decoded RTs by (start asc, seqno desc)
Note over RangeIter,Table: Range-iteration path
RangeIter->>Table: get_highest_kv_seqno()
Table-->>RangeIter: kv_only_seqno
RangeIter->>Table: partition_point(rt.start ≤ table_min)
RangeIter->>Table: evaluate visibility & coverage on prefix
Note over TreeIter,Table: Point-read path
TreeIter->>Table: partition_point(rt.start ≤ key)
TreeIter->>Table: evaluate rt.should_suppress(...) on prefix
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/range.rs`:
- Around line 242-268: The partition_point call on all_range_tombstones assumes
it's sorted by rt.start but the list is only sorted later by
all_range_tombstones.sort_by(...); fix by ensuring the slice is sorted before
using partition_point (move the sort_by earlier, before computing
table_min/table_max and candidate_end) or, alternatively, replace the
partition_point logic with a linear scan over all_range_tombstones that checks
rt.start, rt.visible_at(*cutoff), rt.fully_covers(table_min, table_max) and
rt.seqno > table_kv_seqno (using symbols: all_range_tombstones, partition_point,
sort_by, candidate_end, is_covered, table_min/table_max,
table.get_highest_kv_seqno, rt.visible_at, rt.fully_covers) so the decision to
skip a table is based on correct ordering or a full scan.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 77b32b1d-cfd1-4e31-a66a-547c1b0dc9ba
📒 Files selected for processing (7)
src/range.rssrc/table/meta.rssrc/table/mod.rssrc/table/writer/meta.rssrc/table/writer/mod.rssrc/tree/mod.rstests/range_tombstone.rs
There was a problem hiding this comment.
Pull request overview
Improves range tombstone (RT) suppression performance for table-skip and point reads by separating KV-only sequence number bounds and using binary search over sorted RT lists.
Changes:
- Add KV-only max seqno metadata (
seqno#kv_max) to enable skipping tables even when covering RTs are co-located in the same SST. - Use
partition_pointto reduce RT candidate scanning for table-skip and point reads. - Add targeted tests covering KV/RT seqno separation and binary-search boundary cases.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/range_tombstone.rs | Adds tests for KV-only seqno metadata and binary-search-based suppression behavior. |
| src/tree/mod.rs | Uses binary search over per-table sorted RTs for point-read suppression. |
| src/table/writer/mod.rs | Tracks and persists KV-only max seqno as new table metadata. |
| src/table/writer/meta.rs | Extends writer metadata with highest_kv_seqno. |
| src/table/mod.rs | Sorts decoded RTs to support binary-search suppression; exposes get_highest_kv_seqno(). |
| src/table/meta.rs | Parses optional seqno#kv_max field with fallback for older tables. |
| src/range.rs | Uses binary search over RT list to reduce table-skip candidate checks and relies on KV-only seqno bound. |
- Sort SST-sourced RTs before table-skip loop so partition_point operates on sorted data (was using unsorted list — incorrect results) - Propagate read_u64 error for seqno#kv_max instead of silent fallback to surface metadata corruption - Use should_suppress() in point-read path instead of inlined logic to avoid semantic drift from the canonical suppression check - Update sort comment to reference Ord implementation
…r binary search - Sort by start key explicitly instead of delegating to Ord, so the partition_point invariant is enforced locally and independent of future Ord changes - Replace [..candidate_end] slicing with .iter().take(candidate_end) to avoid clippy::indexing_slicing suppressions
- Sort per-table RTs by (start asc, seqno desc) instead of start-only, so suppression checks short-circuit on the highest-seqno RT first - Align sort order documentation with actual comparator
|
@coderabbitai re-review |
|
🧠 Learnings used✅ Actions performedFull review triggered. |
…riant - Switch to sort_unstable_by for RT sorts (no stability needed) - Clarify that highest_kv_seqno tracks all data-block item types (values, point tombstones, weak tombstones), not just values - Restore inclusive-max vs half-open documentation in table-skip - Expand backward-compat fallback rationale in ParsedMeta
- Reject tables where highest_kv_seqno exceeds overall max seqno, which indicates on-disk metadata corruption - Reword seqno-desc tiebreaker documentation to match actual behavior
… scope Test verifies logical suppression (which works regardless of whether table-skip fires), not the table-skip optimization specifically.
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
…riants - Tables without RTs have highest_kv_seqno == highest_seqno - RT-only tables have highest_kv_seqno == 0 (sentinel restored) - Assert highest_kv_seqno <= highest_seqno for all table types
|
@coderabbitai re-review |
|
🧠 Learnings used✅ Actions performedFull review triggered. |
…function Extract highest_kv_seqno corruption check into validated_kv_seqno() with unit tests for valid, equal, zero, and exceeds-max cases.
- Fix test postcondition: assert_eq!(Some(0), level_table_count(0)) instead of unwrap_or(0) — properly distinguishes "L0 exists and is empty" from unexpected None (Copilot #54, CodeRabbit #56) - Fix ZstdDictionary::new() doc: handle stores full 64-bit xxh3 hash internally; id() returns lower 32 bits on demand (CodeRabbit #55)
Summary
seqno#kv_maxtable metadata field tracks highest KV-only seqno, enabling table-skip for covering range tombstones co-located in the same SSTpartition_pointon sorted RT list narrows candidates from O(n) to O(log n + k) per tablepartition_pointskips RTs withstart > keyseqno#kv_maxfall back toseqno#maxseqno#kv_max <= seqno#maxon loadTest plan
kv_seqno_excludes_range_tombstone_seqno— verifies metadata separationkv_seqno_equals_overall_when_no_range_tombstones— KV-only table invariantkv_seqno_zero_for_rt_only_table— RT-only sentinel restorationcolocated_range_tombstone_suppresses_keys— co-located RT suppressiontable_skip_rt_start_equals_table_min— binary search boundary casepoint_read_binary_search_multiple_rts— multi-RT per tableCoverage note
Patch coverage 85.71% (6 lines uncovered in
src/table/meta.rs):seqno#kv_max > seqno#max. Cannot be triggered through normal table writer API — defensive guard for on-disk corruptionseqno#maxwhenseqno#kv_maxkey is absent. Only reachable with tables written before this PR — no way to create such tables via current APICloses #27