
feat: add range tombstones (delete_range / delete_prefix)#21

Merged
polaz merged 85 commits into main from feat/#16-feat-range-tombstones--deleterange--deleteprefix on Mar 20, 2026

Conversation


@polaz polaz commented Mar 16, 2026

Summary

  • Add native range tombstone support for efficient deletion of contiguous key ranges
  • remove_range(start, end, seqno) and remove_prefix(prefix, seqno) APIs on AbstractTree
  • MVCC-correct: tombstones respect snapshot visibility and do not suppress newer values
  • Full persistence through flush, compaction, and recovery
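
The MVCC rule above can be sketched in a few lines. This is an illustrative stand-in, not the crate's actual API (type and method names are assumptions): a range tombstone hides an entry only when the entry is older than the tombstone and the tombstone itself is visible at the read snapshot.

```rust
// Hedged sketch of the MVCC suppression rule (names are illustrative):
#[derive(Debug, Clone)]
struct RangeTombstone {
    start: Vec<u8>, // inclusive
    end: Vec<u8>,   // exclusive: half-open [start, end)
    seqno: u64,
}

impl RangeTombstone {
    fn covers_key(&self, key: &[u8]) -> bool {
        self.start.as_slice() <= key && key < self.end.as_slice()
    }

    fn suppresses(&self, key: &[u8], entry_seqno: u64, read_seqno: u64) -> bool {
        self.covers_key(key)
            && entry_seqno < self.seqno // newer values are NOT suppressed
            && self.seqno <= read_seqno // snapshot must see the tombstone
    }
}

fn main() {
    let rt = RangeTombstone { start: b"a".to_vec(), end: b"m".to_vec(), seqno: 10 };
    assert!(rt.suppresses(b"c", 5, 15)); // older entry, visible tombstone
    assert!(!rt.suppresses(b"c", 12, 15)); // value written after the delete survives
    assert!(!rt.suppresses(b"c", 5, 9)); // snapshot taken before the delete
    assert!(!rt.suppresses(b"m", 5, 15)); // end key is exclusive
}
```

The two negative seqno cases are exactly what "tombstones respect snapshot visibility and do not suppress newer values" means.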

Technical Details

Core types (ported algorithms from upstream PR fjall-rs#242, own SST format):

  • RangeTombstone — half-open [start, end) interval with seqno
  • ActiveTombstoneSet / ActiveTombstoneSetReverse — sweep-line trackers for iteration
  • IntervalTree — AVL-balanced interval tree in memtable for O(log n) queries
  • RangeTombstoneFilter — bidirectional wrapper that suppresses covered entries
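
The interval tree's core operation is a stabbing query: find the tombstones whose half-open range covers a probe key. A linear-scan stand-in shows the semantics (the real IntervalTree is AVL-balanced for O(log n); the names here are illustrative, not the crate's):

```rust
// Linear-scan stand-in for the memtable's interval-tree stabbing query:
// return the highest seqno among tombstones covering `key`.
struct Rt {
    start: Vec<u8>, // inclusive
    end: Vec<u8>,   // exclusive
    seqno: u64,
}

fn max_covering_seqno(rts: &[Rt], key: &[u8]) -> Option<u64> {
    rts.iter()
        .filter(|rt| rt.start.as_slice() <= key && key < rt.end.as_slice())
        .map(|rt| rt.seqno)
        .max()
}

fn main() {
    let rts = vec![
        Rt { start: b"a".to_vec(), end: b"f".to_vec(), seqno: 3 },
        Rt { start: b"c".to_vec(), end: b"k".to_vec(), seqno: 7 },
    ];
    assert_eq!(max_covering_seqno(&rts, b"d"), Some(7)); // overlapping: highest seqno wins
    assert_eq!(max_covering_seqno(&rts, b"a"), Some(3)); // start is inclusive
    assert_eq!(max_covering_seqno(&rts, b"k"), None); // end is exclusive
}
```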

Read path:

  • Point reads check range tombstone suppression across all layers (active memtable -> sealed -> SST)
  • Range/prefix iteration uses RangeTombstoneFilter with sweep-line activation/expiry
  • Table-skip optimization: range iteration skips entire SST tables when a range tombstone fully covers the table's key range with a higher seqno (avoids I/O)
  • Point reads prefilter RT-bearing SSTs by persisted key_range, then linearly scan only candidate RT blocks
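
The sweep-line activation/expiry can be sketched as a simplified forward pass. This is a guess at the shape of the technique, not the crate's RangeTombstoneFilter: tombstones sorted by start are activated when iteration reaches their start key and expired once the current key passes their exclusive end, so each entry is checked only against currently active tombstones.

```rust
struct Rt {
    start: Vec<u8>, // inclusive
    end: Vec<u8>,   // exclusive
    seqno: u64,
}

fn filter_forward(
    entries: &[(Vec<u8>, u64, &'static str)], // (key, seqno, value), key-ascending
    mut pending: Vec<Rt>,                     // sorted by start, ascending
    read_seqno: u64,
) -> Vec<&'static str> {
    pending.reverse(); // so pop() yields the lowest start first
    let mut active: Vec<Rt> = Vec::new();
    let mut out = Vec::new();
    for (key, seqno, value) in entries {
        // Activate tombstones whose range has started at or before `key`.
        while pending.last().map_or(false, |rt| rt.start <= *key) {
            active.push(pending.pop().unwrap());
        }
        // Expire tombstones whose half-open range ended at or before `key`.
        active.retain(|rt| rt.end > *key);
        let suppressed = active
            .iter()
            .any(|rt| *seqno < rt.seqno && rt.seqno <= read_seqno);
        if !suppressed {
            out.push(*value);
        }
    }
    out
}

fn main() {
    let entries = vec![
        (b"a".to_vec(), 1, "A"),
        (b"c".to_vec(), 1, "C"),  // covered by [b, e) @ seqno 5
        (b"c".to_vec(), 9, "C'"), // newer than the tombstone: survives
        (b"f".to_vec(), 1, "F"),  // past the tombstone's end: survives
    ];
    let rts = vec![Rt { start: b"b".to_vec(), end: b"e".to_vec(), seqno: 5 }];
    assert_eq!(filter_forward(&entries, rts, 10), vec!["A", "C'", "F"]);
}
```

The reverse pass uses the mirrored predicates (activate on end > key, expire on start > key), matching the ActiveTombstoneSetReverse tracker.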

Write path:

  • Flush: range tombstones written unclipped to SST — they must cover keys in older SSTs outside the memtable's key range. Table key_range widened to include RT coverage so compaction overlap detection includes these tables. MultiWriter uses clip_range_tombstones=false for flush
  • Flush/finish now finalize buffered KV blocks before attaching RT metadata, so widened key_range survives Writer::finish() / rotation and remains sound for future overlap checks and point-read prefiltering
  • Fresh flush outputs enter L0 as singleton runs first; optimize_runs() only fuses truly disjoint tables back together, so RT-widened sibling tables do not violate Run::get_for_key()'s non-overlap assumption
  • Compaction: range tombstones clipped to each output table's key range via intersect_opt(). MultiWriter uses clip_range_tombstones=true (via .use_clip_range_tombstones())
  • All RT collection points deduplicate (sort + dedup): flush, compaction input, and range iteration — prevents RT accumulation from MultiWriter rotation
  • RT-only tables (no KV data): synthetic WeakTombstone sentinel uses the start key of the lowest-seqno tombstone (with start as a deterministic tiebreaker), while meta.first_key/meta.last_key still cover min(rt.start)/max(rt.end) for compaction overlap
  • GC: bottom-level RT eviction is intentionally deferred until compaction becomes RT-aware enough to physically drop covered KVs
  • MultiWriter rotation: RTs distributed across all output tables
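
Compaction-time clipping amounts to intersecting each tombstone with the output table's key span and dropping empty results. This helper mirrors the role of the PR's intersect_opt(), though its signature is an assumption:

```rust
struct Rt {
    start: Vec<u8>, // inclusive
    end: Vec<u8>,   // exclusive
    seqno: u64,
}

// Clip a tombstone to an output table's span [table_lo, table_hi_excl);
// return None when the intersection is empty.
fn intersect_opt(rt: &Rt, table_lo: &[u8], table_hi_excl: &[u8]) -> Option<Rt> {
    let start = rt.start.as_slice().max(table_lo).to_vec();
    let end = rt.end.as_slice().min(table_hi_excl).to_vec();
    if start < end {
        Some(Rt { start, end, seqno: rt.seqno })
    } else {
        None // no overlap with this output table
    }
}

fn main() {
    let rt = Rt { start: b"b".to_vec(), end: b"z".to_vec(), seqno: 9 };
    let clipped = intersect_opt(&rt, b"d", b"m").unwrap();
    assert_eq!(clipped.start, b"d".to_vec());
    assert_eq!(clipped.end, b"m".to_vec());
    assert!(intersect_opt(&rt, b"z", b"zz").is_none()); // starts at rt.end: empty
}
```

Known limitation #32 follows directly from this picture: a fragment of the tombstone falling in a gap between two output tables intersects neither span, so it is dropped.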

Visibility / API:

  • AbstractTree remains public as a consumer-facing bound, but is now sealed to crate-owned tree types; internal RT-aware flush hooks stay hidden implementation details
  • Sealing AbstractTree is an intentional public-API break for external implementers in this fork; we are not preserving extension-trait semver compatibility for downstream custom implementations, and the longer-term facade/internal split is tracked in Refactor AbstractTree into a public facade and internal engine trait #29
  • All range tombstone helper APIs (is_key_suppressed_by_range_tombstone, range_tombstones_sorted, range_tombstones()) are pub(crate) — internal implementation detail, not part of public API
  • RangeTombstone type is pub(crate) (module not publicly exported)

SST format:

  • New BlockType::RangeTombstone = 4
  • Raw wire format: [start_len:u16_le][start][end_len:u16_le][end][seqno:u64_le] (repeated)
  • New "range_tombstones" section in table TOC
  • Block type validated on load; start < end invariant enforced during decoding
  • RT-only memtable flush: tombstones re-inserted into active memtable (SST requires KV data for valid index)
  • Key length validated via u16::try_from — oversized keys rejected with log::warn diagnostic
  • upper_bound_exclusive returns Option<UserKey> (None only for the lexicographically maximal encodable user key)
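
The wire format quoted above round-trips straightforwardly. This is a from-scratch sketch of that encoding, not the crate's actual encoder/decoder; it also enforces the start < end invariant at decode time, as the bullet describes:

```rust
// [start_len:u16_le][start][end_len:u16_le][end][seqno:u64_le], repeated.
fn encode(rts: &[(Vec<u8>, Vec<u8>, u64)]) -> Vec<u8> {
    let mut buf = Vec::new();
    for (start, end, seqno) in rts {
        buf.extend_from_slice(&(start.len() as u16).to_le_bytes());
        buf.extend_from_slice(start);
        buf.extend_from_slice(&(end.len() as u16).to_le_bytes());
        buf.extend_from_slice(end);
        buf.extend_from_slice(&seqno.to_le_bytes());
    }
    buf
}

fn decode(mut buf: &[u8]) -> Result<Vec<(Vec<u8>, Vec<u8>, u64)>, &'static str> {
    fn take<'a>(buf: &mut &'a [u8], n: usize) -> Result<&'a [u8], &'static str> {
        if buf.len() < n {
            return Err("truncated range tombstone block");
        }
        let (head, tail) = buf.split_at(n);
        *buf = tail;
        Ok(head)
    }
    let mut out = Vec::new();
    while !buf.is_empty() {
        let n = u16::from_le_bytes(take(&mut buf, 2)?.try_into().unwrap()) as usize;
        let start = take(&mut buf, n)?.to_vec();
        let n = u16::from_le_bytes(take(&mut buf, 2)?.try_into().unwrap()) as usize;
        let end = take(&mut buf, n)?.to_vec();
        let seqno = u64::from_le_bytes(take(&mut buf, 8)?.try_into().unwrap());
        if start >= end {
            return Err("invalid tombstone: start >= end"); // decode-time invariant
        }
        out.push((start, end, seqno));
    }
    Ok(out)
}

fn main() {
    let rts = vec![(b"a".to_vec(), b"m".to_vec(), 42u64)];
    let bytes = encode(&rts);
    assert_eq!(decode(&bytes).unwrap(), rts);
    // Truncated input is rejected rather than silently accepted.
    assert!(decode(&bytes[..bytes.len() - 1]).is_err());
}
```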

Disk Format Compatibility

This PR now targets disk format V4.

The format bump is intentional and covers the new persisted semantics introduced in the fork:

  • Range tombstones add a new SST range_tombstones section / BlockType::RangeTombstone
  • Merge operators introduce a persisted MergeOperand value kind in the internal value encoding on the merge-operator branch

Keeping these changes under V3 would be unsafe:

  • an older V3 reader could open a database containing range tombstones, ignore the new SST section, and return keys that are logically deleted
  • an older V3 reader could not safely interpret persisted merge operands, because it does not implement merge-resolution semantics in reads and compaction

Because both features change the meaning of on-disk data rather than only extending in-memory APIs, they must share a new compatibility gate.

V4 is therefore the correct boundary for this fork's new persisted semantics:

  • V3 databases remain readable by the new code
  • databases written with these new semantics must not be opened by older V3 binaries
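
A minimal sketch of that compatibility gate (the enum and function names are assumptions, not the crate's FormatVersion implementation): the new code accepts V3 and V4 on open, writes V4, and rejects everything else.

```rust
#[derive(Debug, PartialEq)]
enum FormatVersion {
    V3,
    V4,
}

// Called when opening a database; older V3 binaries lack the V4 arm,
// so they refuse databases written with the new persisted semantics.
fn check_on_open(manifest_version: u8) -> Result<FormatVersion, String> {
    match manifest_version {
        3 => Ok(FormatVersion::V3), // older databases remain readable
        4 => Ok(FormatVersion::V4),
        v => Err(format!("unsupported format version {v}")),
    }
}

fn main() {
    assert_eq!(check_on_open(3), Ok(FormatVersion::V3));
    assert_eq!(check_on_open(4), Ok(FormatVersion::V4));
    assert!(check_on_open(5).is_err()); // unknown versions rejected on open
}
```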

Known Limitations

  • SST block format is fork-specific (will need migration when upstream settles on format)
  • Point-read SST suppression still linearly scans RTs within candidate tables (table-local prefiltering is implemented; full indexed lookup is future work)
  • Bottom-level RT GC is intentionally deferred until compaction can physically remove RT-covered KVs
  • Compaction clip can drop RT fragments covering gaps between output tables (fix: compaction clip can drop RT covering gaps between output tables #32)
  • No WAL persistence (consistent with crate design — caller manages WAL)

Test Plan

  • 62 unit tests for core types (RangeTombstone, ActiveTombstoneSet, IntervalTree, RangeTombstoneFilter)
  • 22+ integration/regression tests covering:
    • Point read suppression, MVCC visibility, newer values not suppressed
    • Forward and reverse range iteration
    • Cross-layer suppression (memtable -> SST)
    • Sealed memtable tombstones
    • remove_prefix API, including empty/all-0xFF soft-reject edge cases
    • Overlapping tombstones
    • Persistence through flush, compaction, and recovery
    • RT-only memtable flush and sentinel regressions
    • Disjoint RT flush and compaction
    • Narrow range scans with disjoint RTs in memtables and after recovery
    • Buffered-writer regression: persisted SST key_range keeps RT coverage after flush/finish
    • Bottom-level compaction regression: RT-deleted keys stay hidden under high GC watermark
    • Manifest compatibility checks (V4 written by new code, safe V3 recovery still accepted, unsupported versions rejected on open)
    • Bulk-ingest point reads remain suppressible by later visible range tombstones
  • Full local suite passes: cargo test --all-features -- --nocapture
  • Fresh CI matrix is still running on the current PR head (lint already green; remaining Windows/cross jobs pending)
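
The remove_prefix edge cases above hinge on computing an exclusive upper bound for a prefix. A hypothetical reconstruction of that logic (the real upper_bound_exclusive may differ in detail): strip trailing 0xFF bytes and increment the last remaining byte; an empty or all-0xFF prefix has no representable exclusive bound, hence the soft reject.

```rust
// Exclusive upper bound of the key range matching `prefix`, or None
// when no such bound is representable (empty or all-0xFF prefix).
fn upper_bound_exclusive(prefix: &[u8]) -> Option<Vec<u8>> {
    let mut out = prefix.to_vec();
    while let Some(&last) = out.last() {
        if last == 0xFF {
            out.pop(); // 0xFF cannot be incremented; carry into previous byte
        } else {
            *out.last_mut().unwrap() = last + 1;
            return Some(out);
        }
    }
    None // empty or all-0xFF prefix: soft reject
}

fn main() {
    assert_eq!(upper_bound_exclusive(b"ab"), Some(b"ac".to_vec()));
    assert_eq!(upper_bound_exclusive(&[0x61, 0xFF]), Some(vec![0x62]));
    assert_eq!(upper_bound_exclusive(&[0xFF, 0xFF]), None);
    assert_eq!(upper_bound_exclusive(b""), None);
}
```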

Closes #16

Summary by CodeRabbit

  • New Features

    • Range-deletion APIs: remove_range and remove_prefix; iterators/filtering hide keys covered by visible range deletions.
  • Behavior

    • Range-deletion metadata is persisted and carried through flushes/compactions (including RT-only tables); table-skip and point/scan semantics respect MVCC-visible range deletions.
    • On-disk manifest format bumped to V4 (not readable by older V3-only binaries).
  • Tests

    • Extensive unit and integration tests covering RT lifecycle, compaction, persistence, and recovery.
  • Documentation

    • README and reviewer guidance updated with V4 notes.

- Add RangeTombstone, ActiveTombstoneSet, IntervalTree core types
  (ported from upstream PR fjall-rs#242 algorithms, own SST persistence)
- Add RangeTombstoneFilter for bidirectional iteration suppression
- Integrate into Memtable with interval tree for O(log n) queries
- Add remove_range(start, end, seqno) and remove_prefix(prefix, seqno)
  to AbstractTree trait, implemented for Tree and BlobTree
- Wire suppression into point reads (memtable + sealed + SST layers)
- Wire RangeTombstoneFilter into range/prefix iteration pipeline
- SST persistence: raw wire format in BlockType::RangeTombstone block
- Flush: collect range tombstones from sealed memtables, write to SSTs
- Compaction: propagate RTs from input to output tables with clipping
- GC: evict range tombstones below watermark at bottom level
- Table-skip: skip tables fully covered by a range tombstone
- MultiWriter: clip RTs to each output table's key range on rotation
- Handle RT-only tables (derive key range from tombstone bounds)
- 16 integration tests covering all paths

Closes #16
Copilot AI review requested due to automatic review settings March 16, 2026 19:59

coderabbitai Bot commented Mar 16, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough


Adds end-to-end range-tombstone support: new RangeTombstone primitives, memtable interval index and active trackers, RT-aware iterators and point-read suppression, on-disk RT persistence (format V4), RT propagation through flush/compaction/writers (including RT-only tables), AbstractTree sealing and delete APIs, plus tests and recovery/version changes.

Changes

Cohort / File(s) Summary
Core trait & sealing
src/abstract_tree.rs, src/any_tree.rs
Introduce crate-internal sealed::Sealed, require it for AbstractTree; add flush_to_tables_with_rt(...) and remove_range(...) APIs; default flush_to_tables(...); seal AnyTree/Tree/BlobTree.
Tree impls & flush flow
src/tree/mod.rs, src/blob_tree/mod.rs
Collect sealed-memtable RTs at flush, pass RTs into flush hook, ensure MultiWriter receives RTs before KV writes, handle RT-only flush reinsertion, add remove_range delegation.
Range-tombstone core
src/range_tombstone.rs
New RangeTombstone and CoveringRt types, containment/visibility/suppression rules, intersect/coverage helpers, ordering, and upper_bound_exclusive.
Memtable index & APIs
src/memtable/interval_tree.rs, src/memtable/mod.rs
Add AVL IntervalTree for RTs behind Mutex; memtable stores RTs, insertion/query/export/count APIs; is_empty() updated to include RTs.
Active trackers & filter
src/active_tombstone_set.rs, src/range_tombstone_filter.rs
Add forward/reverse ActiveTombstoneSet and RangeTombstoneFilter (bidirectional iterator) with activation/expiry/multiset accounting and tests.
Iterator & read-path
src/range.rs
Collect RTs from memtables/tables, dedupe, apply table-skip when covered, and wrap stream with RangeTombstoneFilter only if visible RTs exist; update point-read suppression checks.
On-disk table format & writer
src/table/writer/mod.rs, src/table/multi_writer.rs, src/table/mod.rs, src/table/inner.rs, src/table/regions.rs, src/table/block/type.rs, src/table/util.rs, src/table/iter.rs
Persist RTs in new range_tombstones block and meta count; add BlockType::RangeTombstone; writer collects/writes RTs (sentinel for RT-only tables); MultiWriter supports clipping and setting RTs; load RTs on open; metrics/lint updates.
Compaction
src/compaction/flavour.rs, src/compaction/worker.rs
Collect/sort/dedupe RTs from inputs; add CompactionFlavour::write_range_tombstones; ensure RTs are written to outputs before KV writes so RT metadata survives mid-loop rotations.
Core modules, manifest & versioning
src/lib.rs, src/format_version.rs, src/version/mod.rs
Register new crate-private RT modules; add FormatVersion::V4 and serialize manifest as V4; change L0 run construction to per-table runs.
High-level tests & docs
tests/*, .github/*, README.md
Large RT-focused test suite and recovery/version tests; reviewer guidance updates; README notes V4 compatibility boundary.
Small surface updates
src/table/inner.rs, src/table/util.rs, src/table/iter.rs
Minor fields, metric bucketing, and lint attribute gating updates.

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant Tree
    participant ActiveMemtable
    participant SealedMemtables
    participant Flusher
    participant SST

    Client->>Tree: remove_range(start,end,seqno)
    Tree->>ActiveMemtable: insert_range_tombstone(start,end,seqno)
    ActiveMemtable->>ActiveMemtable: store in IntervalTree

    Client->>Tree: get(key, read_seqno)
    Tree->>Tree: is_suppressed_by_range_tombstones(key, key_seqno, read_seqno)?
    alt suppressed
        Tree-->>Client: None
    else not suppressed
        Tree->>SealedMemtables: read candidate / fallback to SST
        Tree-->>Client: Value
    end

    ActiveMemtable->>SealedMemtables: seal/flush
    Tree->>Flusher: flush_to_tables_with_rt(stream, collected_tombstones)
    Flusher->>SST: write SST(s) (include RT block or RT-only table)
    SST->>Tree: recovered tables include RTs
sequenceDiagram
    participant Iter
    participant Filter
    participant ActiveF
    participant ActiveR
    participant Inner

    Iter->>Filter: next()
    Filter->>Inner: pull next item (key, seqno)
    Filter->>ActiveF: activate RTs with start <= key
    Filter->>ActiveF: expire RTs with end <= key
    Filter->>ActiveF: is_suppressed(key_seqno)?
    alt suppressed
        Filter->>Inner: skip, loop
    else not suppressed
        Filter-->>Iter: yield item
    end

    Iter->>Filter: next_back()
    Filter->>Inner: pull prev item (key, seqno)
    Filter->>ActiveR: activate RTs with end > key
    Filter->>ActiveR: expire RTs with start > key
    Filter->>ActiveR: is_suppressed(key_seqno)?
    alt suppressed
        Filter->>Inner: skip, loop
    else not suppressed
        Filter-->>Iter: yield item
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related issues

Possibly related PRs

Poem

🐰 I nibbled ranges at morning's brim,

Half-open tunnels, neat and slim,
Seqnos stacked like carrots, bright and prim,
Scans hop past shadows, compactions hum their hymn,
A burrow of data, safe and snug within.

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
  • Description Check: ✅ Passed (check skipped: CodeRabbit's high-level summary is enabled)
  • Title Check: ✅ Passed. The title "feat: add range tombstones (delete_range / delete_prefix)" clearly and concisely summarizes the main feature addition: native range tombstone support with two new deletion APIs.
  • Linked Issues Check: ✅ Passed. The PR fully implements all coding requirements from issue #16: remove_range/remove_prefix APIs, MVCC correctness across memtables/SSTs, full pipeline integration, disk format V4, and efficient bulk-deletion semantics.
  • Out of Scope Changes Check: ✅ Passed. All changes directly support the range-tombstone implementation (new types, storage, filtering, iteration, persistence). Two deferred scope items (memtable RwLock redesign, RT-only GC optimization) are intentionally tracked separately; no unrelated changes are present.
  • Docstring Coverage: ✅ Passed. Docstring coverage is 100.00%, above the required 80.00% threshold.





Copilot AI left a comment


Pull request overview

Adds first-class range tombstones to the LSM-tree to enable efficient range/prefix deletes (remove_range, remove_prefix) with MVCC-aware visibility, and persists tombstones through flush/compaction/recovery via a new SST block and table metadata.

Changes:

  • Introduces core range-tombstone types and in-memory structures (interval tree in memtables; sweep-line active sets + iterator filter).
  • Extends flush/compaction/table IO to write/read range tombstones from SSTs (BlockType::RangeTombstone, TOC region, metadata count).
  • Adds integration tests validating suppression semantics and persistence across flush/compaction/recovery.

Reviewed changes

Copilot reviewed 21 out of 21 changed files in this pull request and generated 8 comments.

Show a summary per file
File Description
tests/range_tombstone.rs Integration tests for range tombstone behavior and persistence.
src/abstract_tree.rs Adds remove_range API and remove_prefix helper; extends flush path to pass tombstones.
src/tree/mod.rs Tree integration: point-read suppression check + flush now accepts tombstones.
src/blob_tree/mod.rs BlobTree integration: passes tombstones through flush; exposes remove_range.
src/memtable/mod.rs Stores memtable range tombstones in a mutex-protected interval tree and exposes query/flush helpers.
src/memtable/interval_tree.rs AVL interval tree backing for memtable tombstone queries.
src/range_tombstone.rs Defines RangeTombstone, clipping/cover helpers, and key upper-bound helper.
src/active_tombstone_set.rs Sweep-line “active tombstone” trackers for forward/reverse iteration.
src/range_tombstone_filter.rs Bidirectional iterator wrapper to suppress entries covered by visible range tombstones.
src/range.rs Collects tombstones across layers, adds table-skip optimization, and wraps iterators with tombstone filter.
src/table/block/type.rs Adds BlockType::RangeTombstone = 4.
src/table/writer/mod.rs Writes range tombstone blocks + metadata; supports RT-only tables by deriving key range.
src/table/regions.rs Adds range_tombstones section handle in parsed table regions.
src/table/multi_writer.rs Distributes/clips tombstones per output table via set_range_tombstones.
src/table/inner.rs Stores loaded range tombstones in table inner state.
src/table/mod.rs Loads/decodes range tombstone block on table recovery; exposes range_tombstones().
src/compaction/flavour.rs Extends compaction flavour to accept output range tombstones.
src/compaction/worker.rs Collects tombstones from input tables and propagates them to compaction outputs (with last-level GC filtering).
src/lib.rs Registers new internal modules for range tombstone support.
.forge/16/known-gaps.md Documents remaining limitations and known gaps for Issue #16.
.forge/16/analysis.md Architecture/design analysis and rationale for the chosen approach.


Comment thread src/range.rs Outdated
Comment thread tests/range_tombstone.rs
Comment thread src/tree/mod.rs Outdated
Comment thread src/blob_tree/mod.rs Outdated
Comment thread src/compaction/worker.rs Outdated
Comment thread src/abstract_tree.rs Outdated
Comment thread src/table/mod.rs Outdated
Comment thread src/memtable/interval_tree.rs Outdated

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 9

🧹 Nitpick comments (2)
src/table/mod.rs (2)

570-577: Consider validating the block type for range tombstones.

The recovery path loads the range tombstone block with CompressionType::None but doesn't validate that the loaded block has BlockType::RangeTombstone. This differs from other block loading paths (e.g., filter block at lines 553-560) that verify the block type.

♻️ Proposed fix
         let range_tombstones = if let Some(rt_handle) = regions.range_tombstones {
             log::trace!("Loading range tombstone block, with rt_ptr={rt_handle:?}");
             let block = Block::from_file(&file, rt_handle, crate::CompressionType::None)?;
+            if block.header.block_type != BlockType::RangeTombstone {
+                return Err(crate::Error::InvalidTag((
+                    "BlockType",
+                    block.header.block_type.into(),
+                )));
+            }
             Self::decode_range_tombstones(block)?
         } else {
             Vec::new()
         };
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/table/mod.rs` around lines 570 - 577, The range tombstone loading path
doesn't validate the loaded block type; when regions.range_tombstones is Some
you call Block::from_file(..., crate::CompressionType::None) and then
Self::decode_range_tombstones(block) without checking block.block_type. Update
the code to verify the loaded block's type equals BlockType::RangeTombstone (or
return an error if not) before calling Self::decode_range_tombstones; reference
the regions.range_tombstones branch, the Block::from_file call,
crate::CompressionType::None, and Self::decode_range_tombstones to locate and
modify the logic.

620-660: Consider validating tombstone invariant start < end during decoding.

RangeTombstone::new has a debug_assert!(start < end) that only fires in debug builds. Corrupted on-disk data could produce invalid tombstones with start >= end in release builds, potentially causing incorrect behavior in the interval tree or filtering logic.

🛡️ Proposed validation
             tombstones.push(RangeTombstone::new(
                 UserKey::from(start_buf),
                 UserKey::from(end_buf),
                 seqno,
             ));
+            // Validate tombstone invariant
+            if let Some(rt) = tombstones.last() {
+                if rt.start >= rt.end {
+                    return Err(crate::Error::Unrecoverable);
+                }
+            }

Or validate before constructing:

+            let start_key = UserKey::from(start_buf);
+            let end_key = UserKey::from(end_buf);
+            if start_key >= end_key {
+                return Err(crate::Error::Unrecoverable);
+            }
             tombstones.push(RangeTombstone::new(
-                UserKey::from(start_buf),
-                UserKey::from(end_buf),
+                start_key,
+                end_key,
                 seqno,
             ));
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/table/mod.rs` around lines 620 - 660, The decoder
(decode_range_tombstones) must validate the invariant start < end after reading
start_buf and end_buf and before calling RangeTombstone::new; if the condition
fails, return an appropriate crate::Error (e.g., a corruption/unrecoverable
error) instead of constructing the tombstone so release builds don't accept
invalid on-disk data; update the logic around
UserKey::from(start_buf)/UserKey::from(end_buf) to compare the keys and
early-return Err(...) when start >= end.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.forge/16/analysis.md:
- Around line 85-95: MD040 is triggered because the two fenced diff-stat blocks
in .forge/16/analysis.md lack a language tag; update the opening fences for the
two blocks that start with "origin/main vs upstream/main:" and "origin/main vs
upstream-pr-242:" to use a language tag (e.g., change ``` to ```text) so the
blocks become ```text; keep the content of the blocks unchanged.

In `@src/abstract_tree.rs`:
- Around line 722-743: The current remove_prefix implementation incorrectly
treats Bound::Unbounded by synthesizing a truncated exclusive end via
range_tombstone::upper_bound_exclusive(prefix), which doesn't form a true
lexicographic upper bound (e.g., b"\xff" still leaves b"\xff\x01"); fix by
adding an explicit unbounded-path instead of synthesizing an end: when
prefix_to_range yields hi == Bound::Unbounded, call a true unbounded-range
removal API (e.g., implement/use self.remove_range_unbounded(start, seqno) or
self.remove_range_from(start, seqno)) to remove [start, +∞), or if an unbounded
removal is not supported, explicitly reject the operation (return 0 or propagate
an error) rather than creating a truncated tombstone; update remove_prefix, and
add/modify remove_range* helper(s) accordingly (referencing remove_prefix,
prefix_to_range, range_tombstone::upper_bound_exclusive, and remove_range).

In `@src/active_tombstone_set.rs`:
- Around line 110-123: The initialize_from method currently activates every
provided RangeTombstone via activate(&rt) even if it starts in the future
relative to the iterator's current_key; change initialize_from to only activate
truly overlapping tombstones by either (A) adding a current_key parameter (e.g.,
initialize_from(&mut self, current_key: &KeyType, tombstones: impl
IntoIterator<Item = RangeTombstone>)) and filtering each rt with rt.start <=
*current_key && *current_key < rt.end before calling activate, or (B) make
initialize_from private/internal and require callers to pre-filter by that same
overlap predicate; apply the same fix to the analogous method/block at the
second occurrence (the code referenced around lines 226–230) and ensure
expire_until usage still enforces the end bound after activation.

In `@src/memtable/interval_tree.rs`:
- Around line 266-269: The insert method incorrectly increments self.len
unconditionally; change insert_node (the helper called by insert) to return
whether it actually created a new node (e.g., return a (node, bool) or a Node
plus created flag) when handling RangeTombstone duplicates, then update insert
to set self.root from the returned node and only increment self.len when the
created flag is true; update any calling sites or signatures for insert_node and
keep references to RangeTombstone, insert_node, and insert consistent.

In `@src/memtable/mod.rs`:
- Around line 31-36: Memtables that only contain range tombstones are treated as
empty because Memtable::is_empty() only checks items; update logic so
Memtable::is_empty() also inspects range_tombstones by locking the Mutex and
calling the interval tree's is_empty (e.g., check
self.range_tombstones.lock().is_empty()), and propagate this change to callers
(Tree::rotate_memtable(), Tree::clear_active_memtable()) so they seal/flush
memtables and compute get_highest_seqno() correctly when only tombstones exist;
ensure get_highest_seqno() and any durability paths (remove_range/remove_prefix)
treat a memtable with non-empty range_tombstones as non-empty.
- Around line 183-190: The insert_range_tombstone method must reject empty or
reversed intervals before persisting: in insert_range_tombstone (and before
calling RangeTombstone::new or inserting into self.range_tombstones), validate
that start < end (using UserKey ordering) and if not, return 0 (do not
compute/record the tombstone). Move or gate the size calculation and the
lock+insert behind this check so reversed/empty ranges are never stored; keep
the rest of the method behavior unchanged.

In `@src/range_tombstone_filter.rs`:
- Around line 39-52: The constructor new() should ensure fwd_tombstones are
ordered by start ascending (and tie-break by seqno descending) before storing,
because fwd_activate_up_to() relies on forward tombstones being sorted by start;
add a sort on fwd_tombstones in new() such as sorting by (start asc, seqno desc)
prior to assigning Self.fwd_tombstones so out-of-order input Vecs can't skip
activation.

In `@tests/range_tombstone.rs`:
- Around line 330-352: Add a new test that writes a range tombstone then tampers
with the SST on-disk range_tombstones block and asserts recovery fails: use
get_tmp_folder(), open_tree(), remove_range(), and flush_active_memtable(0) to
create an SST with the new range tombstone format
([start_len][start][end_len][end][seqno]), then locate the produced SST file in
the folder, corrupt/truncate bytes from the range_tombstones payload (e.g.
remove bytes from the end of the block or zero out length fields), attempt to
reopen the tree with open_tree(folder.path()) and assert that reopening returns
an Err (i.e. recovery fails) rather than succeeding; keep the test focused on
both truncated and malformed length scenarios so it fails on invalid RT
encoding.
- Around line 168-176: The test currently relies on behavior that could pass
even if the tombstone never left the active memtable; modify the test around the
remove_range/rotate_memtable sequence (the block using tree.remove_range,
tree.rotate_memtable, tree.insert, and tree.get) to explicitly assert that the
tombstone was actually sealed/flushed by checking rotate_memtable() returned
Some (or that tree.sealed_memtable_count() > 0) immediately after rotation, and
similarly for the later analogous block (lines ~367-374) assert table_count()
increases or reopen the DB to force RT-only flush visibility; update the
assertions in both places so the test fails if the tombstone stayed in the
active memtable.

---

Nitpick comments:
In `@src/table/mod.rs`:
- Around line 570-577: The range tombstone loading path doesn't validate the
loaded block type; when regions.range_tombstones is Some you call
Block::from_file(..., crate::CompressionType::None) and then
Self::decode_range_tombstones(block) without checking block.block_type. Update
the code to verify the loaded block's type equals BlockType::RangeTombstone (or
return an error if not) before calling Self::decode_range_tombstones; reference
the regions.range_tombstones branch, the Block::from_file call,
crate::CompressionType::None, and Self::decode_range_tombstones to locate and
modify the logic.
- Around line 620-660: The decoder (decode_range_tombstones) must validate the
invariant start < end after reading start_buf and end_buf and before calling
RangeTombstone::new; if the condition fails, return an appropriate crate::Error
(e.g., a corruption/unrecoverable error) instead of constructing the tombstone
so release builds don't accept invalid on-disk data; update the logic around
UserKey::from(start_buf)/UserKey::from(end_buf) to compare the keys and
early-return Err(...) when start >= end.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 0a4b651a-d997-4ee5-9e1a-ba1e0f088648

📥 Commits

Reviewing files that changed from the base of the PR and between 5103c70 and a3bc6c9.

📒 Files selected for processing (21)
  • .forge/16/analysis.md
  • .forge/16/known-gaps.md
  • src/abstract_tree.rs
  • src/active_tombstone_set.rs
  • src/blob_tree/mod.rs
  • src/compaction/flavour.rs
  • src/compaction/worker.rs
  • src/lib.rs
  • src/memtable/interval_tree.rs
  • src/memtable/mod.rs
  • src/range.rs
  • src/range_tombstone.rs
  • src/range_tombstone_filter.rs
  • src/table/block/type.rs
  • src/table/inner.rs
  • src/table/mod.rs
  • src/table/multi_writer.rs
  • src/table/regions.rs
  • src/table/writer/mod.rs
  • src/tree/mod.rs
  • tests/range_tombstone.rs

polaz added 2 commits March 16, 2026 22:26
- Fix non-exhaustive BlockType match in table/util.rs (metrics feature)
- Add #[allow(dead_code)] for future-use API (CoveringRt, initialize_from)
- Add #[allow(clippy::unnecessary_box_returns)] for AVL tree functions
- Fix Option<&T> vs &Option<T> in interval_tree traversal functions
- Add missing # Panics and # Errors doc sections
- Add backticks in doc comments for identifiers
- Use let...else patterns where clippy suggests
- Fix redundant clone, pass-by-value, pub(crate) visibility
- Remove unfulfilled lint expectations (cfg-gated params)
- Validate start < end on SST decode (reject corrupted RT blocks)
- Validate start >= end in insert_range_tombstone (release-safe guard)
- Fix duplicate insert len accounting in IntervalTree
- Fix 0xFF prefix edge case in remove_prefix (return 0 for unrepresentable)
- Re-insert tombstones into active memtable when RT-only flush
  produces no valid SST (no index possible without KV items)
- Account for range tombstones in Memtable::is_empty
- Add cheap pre-check before collecting range tombstones in iteration
- Sort fwd_tombstones inside RangeTombstoneFilter::new for safety
- Harden test assertions (sealed count, table count)
Verify loaded block has BlockType::RangeTombstone before decoding,
consistent with filter block validation pattern.
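The 0xFF edge case fixed above ("return 0 for unrepresentable") comes from the usual way an exclusive upper bound is derived for a prefix delete; `prefix_upper_bound` is a hypothetical helper illustrating that technique, not the crate's actual function:

```rust
// Derive the exclusive upper bound for a prefix delete: bump the last
// non-0xFF byte and truncate. A prefix made entirely of 0xFF bytes has no
// representable upper bound, which is why that case becomes a no-op.
fn prefix_upper_bound(prefix: &[u8]) -> Option<Vec<u8>> {
    let mut end = prefix.to_vec();
    while let Some(last) = end.last_mut() {
        if *last < 0xFF {
            *last += 1;
            return Some(end);
        }
        end.pop(); // trailing 0xFF: carry into the previous byte
    }
    None // every byte was 0xFF: unrepresentable
}

fn main() {
    assert_eq!(prefix_upper_bound(b"abc"), Some(b"abd".to_vec()));
    assert_eq!(prefix_upper_bound(&[b'a', 0xFF]), Some(vec![b'b']));
    assert_eq!(prefix_upper_bound(&[0xFF, 0xFF]), None);
}
```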

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (4)
src/table/iter.rs (1)

122-122: Prefer #[expect(clippy::too_many_arguments)] over #[allow(...)].

Per coding guidelines, #[expect] should be used instead of #[allow] because it warns when the suppression becomes unnecessary (e.g., if future refactoring reduces the argument count).

♻️ Suggested fix
-    #[allow(clippy::too_many_arguments)]
+    #[expect(clippy::too_many_arguments)]

As per coding guidelines: "Prefer #[expect(lint)] over #[allow(lint)]#[expect] warns when suppression becomes unnecessary"

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/table/iter.rs` at line 122, Replace the
#[allow(clippy::too_many_arguments)] attribute on the function/impl annotated in
this diff with #[expect(clippy::too_many_arguments)] so the suppression still
applies but will warn if it becomes unnecessary; locate the exact attribute
instance (the attribute decorating the function that currently uses
#[allow(clippy::too_many_arguments)]) and change the attribute name only,
leaving the lint and surrounding code unchanged.
src/abstract_tree.rs (1)

239-245: Consider #[expect] instead of #[warn] for lint attribute.

The #[warn(clippy::type_complexity)] attribute is unusual here — it would emit warnings rather than suppress them. If the intent is to suppress the lint, use #[expect(clippy::type_complexity)] or #[allow(clippy::type_complexity)].

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/abstract_tree.rs` around lines 239 - 245, The attribute on function
flush_to_tables uses #[warn(clippy::type_complexity)] which will emit warnings
instead of suppressing them; change it to #[expect(clippy::type_complexity)] (or
#[allow(clippy::type_complexity)] if you prefer permanent suppression) placed on
the flush_to_tables function declaration so the clippy type_complexity lint is
treated as expected/allowed for the flush_to_tables / flush_to_tables_with_rt
code path.
src/range.rs (1)

101-101: Prefer #[expect(clippy::too_many_lines)] over #[allow(...)].

Per coding guidelines, use #[expect] so you get a warning when the suppression becomes unnecessary.

♻️ Suggested fix
-    #[allow(clippy::too_many_lines)]
+    #[expect(clippy::too_many_lines)]

As per coding guidelines: "Prefer #[expect(lint)] over #[allow(lint)]#[expect] warns when suppression becomes unnecessary"

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/range.rs` at line 101, Replace the attribute
#[allow(clippy::too_many_lines)] with #[expect(clippy::too_many_lines)] to
follow the coding guideline; locate the attribute (attached to the
function/module declared near the top of src/range.rs) and change the lint
suppression to an expectation so you receive a warning if the suppression
becomes unnecessary.
src/tree/mod.rs (1)

778-816: Performance concern: Linear scan of all SST tables for range tombstone suppression.

The current implementation iterates over every table in every level (lines 802-813) to check range tombstones. For point reads, this could be expensive with many tables.

Consider filtering tables by key range before checking their tombstones, similar to how get_for_key filters tables in get_internal_entry_from_tables. The PR notes this as a known limitation ("linear SST scan for point-read suppression").

💡 Potential optimization sketch
-        for table in super_version
-            .version
-            .iter_levels()
-            .flat_map(|lvl| lvl.iter())
-            .flat_map(|run| run.iter())
-        {
+        // Only check tables whose key range contains the key
+        for table in super_version
+            .version
+            .iter_levels()
+            .flat_map(|lvl| lvl.iter())
+            .filter_map(|run| run.get_for_key(key))
+        {
             for rt in table.range_tombstones() {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/tree/mod.rs` around lines 778 - 816, The
is_suppressed_by_range_tombstones function currently scans every SST table;
limit that by first filtering tables whose key range could contain the target
key (like get_for_key / get_internal_entry_from_tables does) before iterating
range_tombstones(): for each table from
super_version.version.iter_levels().flat_map(...).flat_map(...), check the
table's key range (e.g., smallest/largest key or a key_range() accessor on the
Table) and skip tables where key is outside that range, then only call
table.range_tombstones() and rt.should_suppress; keep the existing
active_memtable/sealed_memtables checks and ensure you use the same table-range
helpers used elsewhere to avoid duplicating range logic.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/table/util.rs`:
- Line 31: The attribute #[allow(clippy::too_many_arguments)] in
src/table/util.rs should be changed to #[expect(clippy::too_many_arguments)] to
comply with the repo's policy; locate the function or item annotated with
#[allow(clippy::too_many_arguments)] (the attribute on the item in util.rs) and
replace the allow attribute with #[expect(clippy::too_many_arguments)]
preserving any existing whitespace/comments and ensuring the attribute exactly
matches the expect form.

---

Nitpick comments:
In `@src/abstract_tree.rs`:
- Around line 239-245: The attribute on function flush_to_tables uses
#[warn(clippy::type_complexity)] which will emit warnings instead of suppressing
them; change it to #[expect(clippy::type_complexity)] (or
#[allow(clippy::type_complexity)] if you prefer permanent suppression) placed on
the flush_to_tables function declaration so the clippy type_complexity lint is
treated as expected/allowed for the flush_to_tables / flush_to_tables_with_rt
code path.

In `@src/range.rs`:
- Line 101: Replace the attribute #[allow(clippy::too_many_lines)] with
#[expect(clippy::too_many_lines)] to follow the coding guideline; locate the
attribute (attached to the function/module declared near the top of
src/range.rs) and change the lint suppression to an expectation so you receive a
warning if the suppression becomes unnecessary.

In `@src/table/iter.rs`:
- Line 122: Replace the #[allow(clippy::too_many_arguments)] attribute on the
function/impl annotated in this diff with #[expect(clippy::too_many_arguments)]
so the suppression still applies but will warn if it becomes unnecessary; locate
the exact attribute instance (the attribute decorating the function that
currently uses #[allow(clippy::too_many_arguments)]) and change the attribute
name only, leaving the lint and surrounding code unchanged.

In `@src/tree/mod.rs`:
- Around line 778-816: The is_suppressed_by_range_tombstones function currently
scans every SST table; limit that by first filtering tables whose key range
could contain the target key (like get_for_key / get_internal_entry_from_tables
does) before iterating range_tombstones(): for each table from
super_version.version.iter_levels().flat_map(...).flat_map(...), check the
table's key range (e.g., smallest/largest key or a key_range() accessor on the
Table) and skip tables where key is outside that range, then only call
table.range_tombstones() and rt.should_suppress; keep the existing
active_memtable/sealed_memtables checks and ensure you use the same table-range
helpers used elsewhere to avoid duplicating range logic.


Run ID: b5037095-c6a0-47c3-a880-07c3d65eaa33

📥 Commits

Reviewing files that changed from the base of the PR and between a3bc6c9 and 718f2ba.

📒 Files selected for processing (14)
  • .forge/16/analysis.md
  • src/abstract_tree.rs
  • src/active_tombstone_set.rs
  • src/memtable/interval_tree.rs
  • src/memtable/mod.rs
  • src/range.rs
  • src/range_tombstone.rs
  • src/range_tombstone_filter.rs
  • src/table/iter.rs
  • src/table/mod.rs
  • src/table/util.rs
  • src/table/writer/mod.rs
  • src/tree/mod.rs
  • tests/range_tombstone.rs
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/range_tombstone.rs


Copilot AI left a comment


Pull request overview

Adds MVCC-aware range tombstones to the LSM-tree, enabling efficient contiguous deletions via remove_range / remove_prefix, and persists/propagates tombstones through flush + compaction + recovery.

Changes:

  • Introduces core range-tombstone data structures (RangeTombstone, memtable IntervalTree, sweep-line ActiveTombstoneSet, and RangeTombstoneFilter).
  • Extends read/write paths to apply range-tombstone suppression (point reads, iterators, flush, compaction, and SST recovery).
  • Adds integration + regression tests for correctness across memtables/SSTs/compaction/recovery.
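The MVCC suppression rule these changes implement (covered entries are hidden; newer values and tombstones invisible to the snapshot are not) can be modeled in a few lines. The names and the exclusive snapshot boundary below are assumptions drawn from this PR's discussion, not the crate's exact signatures:

```rust
// Toy model: a tombstone over half-open [start, end) at seqno `s` hides an
// entry iff the tombstone is visible to the reading snapshot, the key lies
// inside the interval, and the entry is older than the tombstone.
struct RangeTombstone {
    start: Vec<u8>,
    end: Vec<u8>, // exclusive
    seqno: u64,
}

impl RangeTombstone {
    fn contains(&self, key: &[u8]) -> bool {
        self.start.as_slice() <= key && key < self.end.as_slice()
    }

    // Exclusive boundary: a tombstone written at the snapshot's own seqno
    // is not yet visible to that snapshot.
    fn visible_at(&self, read_seqno: u64) -> bool {
        self.seqno < read_seqno
    }

    fn should_suppress(&self, key: &[u8], entry_seqno: u64, read_seqno: u64) -> bool {
        self.visible_at(read_seqno) && self.contains(key) && entry_seqno < self.seqno
    }
}

fn main() {
    let rt = RangeTombstone { start: b"b".to_vec(), end: b"d".to_vec(), seqno: 10 };
    assert!(rt.should_suppress(b"c", 5, 20)); // older entry inside range: hidden
    assert!(!rt.should_suppress(b"c", 15, 20)); // newer value is not suppressed
    assert!(!rt.should_suppress(b"e", 5, 20)); // key outside [start, end)
    assert!(!rt.should_suppress(b"c", 5, 10)); // tombstone not visible at this snapshot
}
```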

Reviewed changes

Copilot reviewed 23 out of 23 changed files in this pull request and generated 8 comments.

Show a summary per file
File Description
tests/range_tombstone.rs Integration tests covering range tombstone behavior across layers and lifecycle events.
src/abstract_tree.rs Adds remove_range API and default remove_prefix; extends flush to pass RTs through.
src/tree/mod.rs Implements remove_range; applies RT suppression on point reads; flush now supports RT distribution.
src/blob_tree/mod.rs Forwards range deletes through BlobTree to underlying index tree; flush now supports RT distribution.
src/memtable/mod.rs Stores RTs in a mutex-protected interval tree; updates emptiness + adds RT query/insert APIs.
src/memtable/interval_tree.rs New AVL interval tree for efficient RT overlap/suppression queries in memtables.
src/active_tombstone_set.rs New forward/reverse active RT tracking for sweep-line iteration filtering.
src/range_tombstone.rs Defines the RT type, ordering, and helper utilities (intersection, coverage checks).
src/range_tombstone_filter.rs New bidirectional iterator wrapper that suppresses KV entries covered by RTs.
src/range.rs Collects RTs from all layers; wraps merged iterator with RangeTombstoneFilter; adds table-skip optimization.
src/table/block/type.rs Adds new BlockType::RangeTombstone = 4.
src/table/regions.rs Extends table TOC parsing to include optional range_tombstones region.
src/table/inner.rs Stores recovered table RTs in Inner.
src/table/mod.rs Loads/decodes RT blocks on table open; exposes range_tombstones() accessor.
src/table/writer/mod.rs Adds RT collection and writes RT block + metadata count during table finish.
src/table/multi_writer.rs Distributes/clips RTs across rotated output tables (flush/compaction output).
src/table/util.rs Treats RT blocks like data/meta blocks for metrics counters; changes lint suppression.
src/table/iter.rs Changes lint suppression for constructor with many args.
src/compaction/flavour.rs Adds write_range_tombstones hook so compaction can propagate RTs to outputs.
src/compaction/worker.rs Collects RTs from input tables and propagates them to compaction outputs (with GC at last level).
src/lib.rs Exposes new internal modules for RT support.
.forge/16/known-gaps.md Documents known limitations and optimization opportunities for RT feature.
.forge/16/analysis.md Design/architecture analysis and upstream context for RT implementation choices.



@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
src/table/mod.rs (1)

648-666: Consider preserving error context for easier debugging.

The map_err(|_| crate::Error::Unrecoverable) discards the original I/O error. While Unrecoverable is semantically appropriate for corrupted data, preserving context (e.g., which field failed, byte offset) could aid debugging in production.

💡 Optional: Add context to decode errors

If a more descriptive error variant exists or is added later (e.g., Error::RangeTombstoneCorrupted { offset, reason }), consider using it here to preserve the cursor position and failure reason.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/table/mod.rs` around lines 648 - 666, The current reads
(cursor.read_u16::<LE>(), cursor.read_exact(...), cursor.read_u64::<LE>()) map
all IO errors to crate::Error::Unrecoverable, discarding the original error and
offset; update these map_err calls to capture and wrap the original error and
context (e.g., which field: "start_len"/"start_buf"/"end_len"/"end_buf"/"seqno"
and the cursor position/byte offset) into a more informative Error variant or
into Unrecoverable with a message, so failures when reading start_len,
start_buf, end_len, end_buf, and seqno preserve the underlying io::Error and an
indication of the failing field/offset for easier debugging.
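One hedged sketch of that suggestion: wrap the underlying io::Error together with the failing field and the cursor offset. The DecodeError variant here is hypothetical; the crate's Error enum may not have such a shape yet:

```rust
use std::io::{self, Cursor, Read};

// Hypothetical error shape that preserves the failing field, the byte
// offset, and the underlying io::Error, instead of a bare `Unrecoverable`.
#[derive(Debug)]
enum DecodeError {
    Corrupted {
        field: &'static str,
        offset: u64,
        source: io::Error,
    },
}

fn read_seqno(cursor: &mut Cursor<&[u8]>) -> Result<u64, DecodeError> {
    let offset = cursor.position(); // capture before the read for diagnostics
    let mut buf = [0u8; 8];
    cursor
        .read_exact(&mut buf)
        .map_err(|source| DecodeError::Corrupted { field: "seqno", offset, source })?;
    Ok(u64::from_le_bytes(buf))
}

fn main() {
    let ok = [1u8, 0, 0, 0, 0, 0, 0, 0];
    assert_eq!(read_seqno(&mut Cursor::new(&ok[..])).unwrap(), 1);

    let truncated = [1u8, 2];
    match read_seqno(&mut Cursor::new(&truncated[..])) {
        Err(DecodeError::Corrupted { field, offset, .. }) => {
            assert_eq!(field, "seqno");
            assert_eq!(offset, 0);
        }
        other => panic!("expected corruption error, got {other:?}"),
    }
}
```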
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/table/mod.rs`:
- Around line 671-675: Add a unit test named
decode_range_tombstones_invalid_interval_returns_error() that builds a
serialized range tombstone blob matching the production format (little-endian
lengths and seqno fields), intentionally sets the interval so start >= end,
calls decode_range_tombstones(...) and asserts it returns
Err(crate::Error::Unrecoverable); reference the decode_range_tombstones function
and use the same byteorder::<LE> encoding for length/seqno fields so the
deserializer exercises the start >= end validation path.

---

Nitpick comments:
In `@src/table/mod.rs`:
- Around line 648-666: The current reads (cursor.read_u16::<LE>(),
cursor.read_exact(...), cursor.read_u64::<LE>()) map all IO errors to
crate::Error::Unrecoverable, discarding the original error and offset; update
these map_err calls to capture and wrap the original error and context (e.g.,
which field: "start_len"/"start_buf"/"end_len"/"end_buf"/"seqno" and the cursor
position/byte offset) into a more informative Error variant or into
Unrecoverable with a message, so failures when reading start_len, start_buf,
end_len, end_buf, and seqno preserve the underlying io::Error and an indication
of the failing field/offset for easier debugging.


Run ID: 2d0210b5-a114-4709-840a-169886e4144c

📥 Commits

Reviewing files that changed from the base of the PR and between 718f2ba and 343f31c.

📒 Files selected for processing (1)
  • src/table/mod.rs

- Fix visible_at to use exclusive boundary (seqno < read_seqno),
  consistent with seqno_filter and Memtable::get conventions
- Validate decoded field lengths against remaining block data before
  allocating buffers (prevent OOM on corrupted blocks)
- Skip SST tables by key_range in point read RT suppression check
- Add code comments explaining #[allow] vs #[expect] for cfg-gated params
- Use #[expect] with reason for too_many_lines on create_range
- Fix remove_range doc to match no-op behavior for invalid intervals
- Update PR body to reflect RT-only flush semantics
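The length-validation fix above follows a standard bounds-check-before-allocate pattern, sketched here with illustrative names:

```rust
use std::io::{Cursor, Read};

// Check a decoded length prefix against the bytes actually remaining in the
// block before allocating, so a corrupt u16 cannot trigger a huge allocation.
fn read_len_prefixed(cursor: &mut Cursor<&[u8]>) -> Result<Vec<u8>, &'static str> {
    let mut len_buf = [0u8; 2];
    cursor.read_exact(&mut len_buf).map_err(|_| "truncated length field")?;
    let len = u64::from(u16::from_le_bytes(len_buf));

    // Bytes left between the cursor position and the end of the block.
    let remaining = cursor.get_ref().len() as u64 - cursor.position();
    if len > remaining {
        return Err("length exceeds remaining block data");
    }

    let mut buf = vec![0u8; len as usize];
    cursor.read_exact(&mut buf).map_err(|_| "truncated payload")?;
    Ok(buf)
}

fn main() {
    let good = [3u8, 0, b'a', b'b', b'c'];
    assert_eq!(read_len_prefixed(&mut Cursor::new(&good[..])).unwrap(), b"abc");

    // Length claims 0xFFFF bytes but only one byte follows: rejected early.
    let corrupt = [0xFFu8, 0xFF, b'x'];
    assert!(read_len_prefixed(&mut Cursor::new(&corrupt[..])).is_err());
}
```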

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (1)
src/table/util.rs (1)

31-32: ⚠️ Potential issue | 🟡 Minor

Use cfg_attr(..., expect(...)) instead of a permanent #[allow].

The feature-gated parameter count is exactly what cfg_attr(feature = "metrics", expect(...)) is for. Leaving #[allow] here suppresses the lint in the one build where it should still fire.

Suggested patch
-// NOTE: #[allow] not #[expect] because arg count changes with cfg(feature = "metrics")
-#[allow(clippy::too_many_arguments)]
+#[cfg_attr(
+    feature = "metrics",
+    expect(
+        clippy::too_many_arguments,
+        reason = "metrics adds the extra parameter; without that feature this stays at the lint threshold"
+    )
+)]
 pub fn load_block(

As per coding guidelines, "Prefer #[expect(lint)] over #[allow(lint)]#[expect] warns when suppression becomes unnecessary."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/table/util.rs` around lines 31 - 32, Replace the permanent attribute
#[allow(clippy::too_many_arguments)] with a conditional expect so the lint still
fires when the metrics feature is disabled; specifically change the attribute to
#[cfg_attr(feature = "metrics", expect(clippy::too_many_arguments))] (or the
inverse depending on feature semantics) at the same item (the attribute applied
in src/table/util.rs) so the suppression only applies under the cfg where extra
arguments exist and the lint will warn in builds where it should.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/range.rs`:
- Around line 217-226: The table-cover check uses table.metadata.key_range.max()
which is an exclusive upper bound, but RangeTombstone::fully_covers() expects an
inclusive max; convert the exclusive upper bound to an inclusive one before
calling fully_covers (e.g., compute an inclusive_max from
table.metadata.key_range.max() using the appropriate predecessor/prev-key API or
method on the key type, handle empty ranges by skipping the check if min >
inclusive_max), and then call rt.fully_covers(table.metadata.key_range.min(),
inclusive_max) inside the all_range_tombstones.iter().any(...) closure while
keeping the visible_at and seqno comparison logic unchanged.
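The exclusive-vs-inclusive mismatch is easier to see in a small model; both checks below are illustrative, not the crate's fully_covers signature:

```rust
// Two covering checks against a half-open tombstone [start, end): one for an
// inclusive table max, one for an exclusive table max. Mixing them up is the
// bug described above.
struct RangeTombstone {
    start: Vec<u8>,
    end: Vec<u8>, // exclusive
}

impl RangeTombstone {
    // Covers every key in [min, max_incl] (inclusive upper bound).
    fn fully_covers_inclusive(&self, min: &[u8], max_incl: &[u8]) -> bool {
        self.start.as_slice() <= min && max_incl < self.end.as_slice()
    }

    // Covers every key in [min, max_excl) (exclusive upper bound).
    fn fully_covers_exclusive(&self, min: &[u8], max_excl: &[u8]) -> bool {
        self.start.as_slice() <= min && max_excl <= self.end.as_slice()
    }
}

fn main() {
    let rt = RangeTombstone { start: b"a".to_vec(), end: b"m".to_vec() };
    // Table holding keys "a" ..= "l": fully covered either way you express it.
    assert!(rt.fully_covers_inclusive(b"a", b"l"));
    assert!(rt.fully_covers_exclusive(b"a", b"m"));
    // Feeding the exclusive bound "m" into the inclusive check wrongly
    // reports "not covered", so the table-skip optimization is missed.
    assert!(!rt.fully_covers_inclusive(b"a", b"m"));
}
```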

---

Duplicate comments:
In `@src/table/util.rs`:
- Around line 31-32: Replace the permanent attribute
#[allow(clippy::too_many_arguments)] with a conditional expect so the lint still
fires when the metrics feature is disabled; specifically change the attribute to
#[cfg_attr(feature = "metrics", expect(clippy::too_many_arguments))] (or the
inverse depending on feature semantics) at the same item (the attribute applied
in src/table/util.rs) so the suppression only applies under the cfg where extra
arguments exist and the lint will warn in builds where it should.


Run ID: f4d3251b-ba68-46f2-8418-870e509cdcef

📥 Commits

Reviewing files that changed from the base of the PR and between 343f31c and 3e23f57.

📒 Files selected for processing (7)
  • src/abstract_tree.rs
  • src/range.rs
  • src/range_tombstone.rs
  • src/table/iter.rs
  • src/table/mod.rs
  • src/table/util.rs
  • src/tree/mod.rs
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/table/iter.rs

Replace #[allow(clippy::too_many_arguments)] with
#[cfg_attr(feature = "metrics", expect(...))] so the suppression
only activates when the extra metrics parameter is compiled in.

Copilot AI left a comment


Pull request overview

This PR introduces MVCC-correct range tombstones to support efficient delete_range / delete_prefix semantics across memtables and SSTs, including persistence through flush/compaction/recovery and iterator-time suppression.

Changes:

  • Add remove_range / remove_prefix to AbstractTree, with memtable storage via an AVL interval tree and iterator-time filtering via a sweep-line RangeTombstoneFilter.
  • Persist range tombstones into SSTs via a new BlockType::RangeTombstone and load/validate them on table recovery.
  • Propagate/GC range tombstones during compaction and add an integration test suite for end-to-end semantics.

Reviewed changes

Copilot reviewed 23 out of 23 changed files in this pull request and generated 10 comments.

Show a summary per file
File Description
tests/range_tombstone.rs Integration tests for point reads, iteration, flush/compaction/recovery, prefix deletes, and RT-only flush behavior
src/abstract_tree.rs Adds public remove_range/remove_prefix API + flush plumbing for range tombstones
src/tree/mod.rs Flush writes RTs to SSTs and point-read path checks suppression by RTs across layers
src/blob_tree/mod.rs RT support for blob-tree flush path + forwards remove_range
src/memtable/mod.rs Stores RTs in a mutex-protected interval tree; exposes query/sorted export helpers
src/memtable/interval_tree.rs New AVL interval tree implementation for efficient overlap/suppression queries
src/range.rs Wraps merged iterators with RangeTombstoneFilter and adds table-skip optimization
src/range_tombstone.rs Defines RangeTombstone semantics, ordering, clipping, and helpers
src/range_tombstone_filter.rs Bidirectional sweep-line iterator filter for RT suppression
src/active_tombstone_set.rs Active set structures used by the iterator filter (forward + reverse)
src/table/block/type.rs Adds new on-disk block tag RangeTombstone = 4
src/table/writer/mod.rs Writes an RT block into SSTs and records range_tombstone_count in metadata
src/table/regions.rs TOC parsing extended to locate the range_tombstones section
src/table/inner.rs Stores loaded RTs in table inner state
src/table/mod.rs Loads/decodes RT blocks on recovery and exposes range_tombstones()
src/table/multi_writer.rs Attempts to distribute/clamp RTs across rotated output tables
src/table/util.rs Treats RT blocks like data/meta for metrics bookkeeping; lint suppression change
src/table/iter.rs Lint suppression change for constructor arg count
src/compaction/worker.rs Collects RTs from input tables and attempts to propagate/GC them
src/compaction/flavour.rs Adds write_range_tombstones hook and wires it to the table writer
src/lib.rs Exposes new internal modules for range tombstone support
.forge/16/known-gaps.md Documents limitations/known gaps for the feature
.forge/16/analysis.md Architecture analysis and design trade-offs documentation



@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (2)
src/range.rs (2)

302-307: Strengthen the fast path to skip wrapping when no tombstone is visible.

A non-empty all_range_tombstones can still have zero entries visible at seqno, so this currently wraps with a no-op filter in those cases.

⚡ Proposed patch
-            // Fast path: skip filter wrapping when no range tombstones exist
-            if all_range_tombstones.is_empty() {
+            // Fast path: skip wrapping when no tombstone is visible for this read seqno
+            if !all_range_tombstones.iter().any(|rt| rt.visible_at(seqno)) {
                 Box::new(iter)
             } else {
                 Box::new(RangeTombstoneFilter::new(iter, all_range_tombstones, seqno))
             }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/range.rs` around lines 302 - 307, The fast-path should skip wrapping not
only when all_range_tombstones.is_empty() but also when none of the tombstones
are visible at the given seqno; update the conditional around Box::new(...) to
inspect all_range_tombstones for any entry visible at seqno (e.g. using the
visibility predicate/method used by RangeTombstoneFilter or by
filtering/map-checking the collection) and only construct
RangeTombstoneFilter::new(iter, all_range_tombstones, seqno) if at least one
tombstone is visible; otherwise return Box::new(iter) unchanged.

176-196: Remove redundant tombstone sorting in this path.

rts.sort() here is duplicate work: RangeTombstoneFilter::new already sorts the forward list (and builds a reverse-sorted copy). This adds avoidable O(n log n) cost per range open.

♻️ Proposed patch
             let all_range_tombstones = if rt_count > 0 {
                 let mut rts: Vec<RangeTombstone> = Vec::with_capacity(rt_count);

                 rts.extend(lock.version.active_memtable.range_tombstones_sorted());
                 for mt in lock.version.sealed_memtables.iter() {
                     rts.extend(mt.range_tombstones_sorted());
                 }
                 if let Some((mt, _)) = &lock.ephemeral {
                     rts.extend(mt.range_tombstones_sorted());
                 }
                 for table in lock
                     .version
                     .version
                     .iter_levels()
                     .flat_map(|lvl| lvl.iter())
                     .flat_map(|run| run.iter())
                 {
                     rts.extend(table.range_tombstones().iter().cloned());
                 }
-                rts.sort();
                 rts
             } else {
                 Vec::new()
             };
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/range.rs` around lines 176 - 196, The vector `rts` constructed for
`all_range_tombstones` is being sorted with `rts.sort()` redundantly;
`RangeTombstoneFilter::new` already sorts the forward list and produces the
reverse-sorted copy, so remove the `rts.sort()` call in the block that builds
`rts` (the code that extends from
`lock.version.active_memtable.range_tombstones_sorted()`, sealed memtables,
`lock.ephemeral`, and version runs) and leave the unsorted `rts` to be passed
into `RangeTombstoneFilter::new`; this avoids the extra O(n log n) work while
preserving correctness for `RangeTombstone` handling.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@src/range.rs`:
- Around line 302-307: The fast-path should skip wrapping not only when
all_range_tombstones.is_empty() but also when none of the tombstones are visible
at the given seqno; update the conditional around Box::new(...) to inspect
all_range_tombstones for any entry visible at seqno (e.g. using the visibility
predicate/method used by RangeTombstoneFilter or by filtering/map-checking the
collection) and only construct RangeTombstoneFilter::new(iter,
all_range_tombstones, seqno) if at least one tombstone is visible; otherwise
return Box::new(iter) unchanged.
- Around line 176-196: The vector `rts` constructed for `all_range_tombstones`
is being sorted with `rts.sort()` redundantly; `RangeTombstoneFilter::new`
already sorts the forward list and produces the reverse-sorted copy, so remove
the `rts.sort()` call in the block that builds `rts` (the code that extends from
`lock.version.active_memtable.range_tombstones_sorted()`, sealed memtables,
`lock.ephemeral`, and version runs) and leave the unsorted `rts` to be passed
into `RangeTombstoneFilter::new`; this avoids the extra O(n log n) work while
preserving correctness for `RangeTombstone` handling.


Run ID: 36192350-8de8-4946-a95f-d0e3da61329a

📥 Commits

Reviewing files that changed from the base of the PR and between 3e23f57 and 954c044.

📒 Files selected for processing (3)
  • src/range.rs
  • src/table/iter.rs
  • src/table/util.rs

@codecov

codecov Bot commented Mar 16, 2026

…unds

- Set range tombstones on MultiWriter before KV write loop in flush
  (Tree, BlobTree) so rotations carry RT metadata in earlier tables
- Propagate RTs before merge_iter loop in compaction worker
- Assert u16 bound on start/end key lengths in insert_range_tombstone
- Use exclusive seqno boundary in visible_at doc strings
- Strengthen iteration fast path: check visible_at not just is_empty
- Remove redundant sort (RangeTombstoneFilter::new sorts internally)
- Compaction with MultiWriter rotation preserves RTs across tables
- Table-skip optimization skips fully-covered tables
- BlobTree range tombstone insert, suppress, flush
- Invalid interval (start >= end) silently returns 0
- Multiple compaction rounds preserve range tombstones

Copilot AI left a comment


Pull request overview

Adds MVCC-aware range tombstones to enable efficient range/prefix deletion (remove_range / remove_prefix) with persistence across flush/compaction/recovery, plus iterator- and point-read suppression.

Changes:

  • Introduces core range-tombstone data structures (range tombstone model, sweep-line active sets, memtable interval tree, iterator filter).
  • Persists range tombstones in SSTs via a new BlockType::RangeTombstone and TOC section, propagating them through compaction.
  • Updates read path (point reads + range/prefix iteration) to suppress covered entries and adds comprehensive tests/docs.

Reviewed changes

Copilot reviewed 23 out of 23 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
tests/range_tombstone.rs New integration test suite for RT semantics across memtable/SST/compaction/recovery.
src/abstract_tree.rs Adds remove_range API + remove_prefix helper; flush now forwards RTs to table writers.
src/blob_tree/mod.rs Wires RT-aware flush through blob tree and forwards remove_range.
src/tree/mod.rs Implements remove_range for Tree, RT-aware flush-to-tables, and point-read suppression checks.
src/range.rs Collects RTs across layers for range/prefix iteration; adds RT filtering and table-skip optimization.
src/memtable/mod.rs Stores memtable RTs (mutex-protected interval tree) and exposes query/flush helpers.
src/memtable/interval_tree.rs New AVL interval tree for RT overlap and suppression queries in memtables.
src/active_tombstone_set.rs New forward/reverse sweep-line active tombstone sets for iteration.
src/range_tombstone.rs Defines RangeTombstone and helpers (visibility, suppression, clipping, coverage).
src/range_tombstone_filter.rs New bidirectional iterator wrapper that suppresses keys covered by visible RTs.
src/table/block/type.rs Adds BlockType::RangeTombstone = 4 to the SST block type enum.
src/table/writer/mod.rs Writes RT blocks + count metadata during SST creation; updates writer state to track RTs.
src/table/regions.rs Parses new "range_tombstones" TOC section into table regions.
src/table/mod.rs Loads/decodes RT blocks on table recovery and exposes Table::range_tombstones().
src/table/inner.rs Stores recovered table RTs in Table inner state.
src/table/multi_writer.rs Distributes/clips RTs across rotated output tables during flush/compaction.
src/table/util.rs Includes RT blocks in metrics accounting for block loads (cache + IO).
src/table/iter.rs Adjusts clippy lint gating for metrics-conditional argument count.
src/compaction/worker.rs Collects RTs from compaction inputs and propagates them to outputs (with GC eviction at last level).
src/compaction/flavour.rs Extends compaction flavour interface to pass RTs into table writers.
src/lib.rs Exposes new internal modules for RT support.
.forge/16/known-gaps.md Documents known limitations/optimization gaps for Issue #16 implementation.
.forge/16/analysis.md Architecture/design analysis and upstream context for range tombstones.
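The table-skip optimization mentioned for src/range.rs reduces to a predicate over a table's persisted key range; a sketch under assumed names (the real check lives in the range iteration path):

```rust
/// A table can be skipped entirely when some tombstone visible to the
/// snapshot covers the table's whole key range with a seqno higher than
/// anything stored in the table.
fn can_skip_table(
    table_first: &[u8],
    table_last: &[u8],               // inclusive last key
    table_max_seqno: u64,
    rts: &[(Vec<u8>, Vec<u8>, u64)], // (start, end-exclusive, seqno)
    read_seqno: u64,
) -> bool {
    rts.iter().any(|(start, end, seqno)| {
        *seqno < read_seqno             // tombstone visible to the snapshot
            && *seqno > table_max_seqno // newer than everything in the table
            && start.as_slice() <= table_first
            && table_last < end.as_slice() // end exclusive, so strictly less
    })
}

fn main() {
    // Table [c..=k] with newest internal seqno 5; RT [a, z) at seqno 8.
    let rts = vec![(b"a".to_vec(), b"z".to_vec(), 8)];
    assert!(can_skip_table(b"c", b"k", 5, &rts, 20));
    // Not skippable if the table holds anything newer than the tombstone.
    assert!(!can_skip_table(b"c", b"k", 9, &rts, 20));
    // Not skippable if the tombstone is invisible to the snapshot.
    assert!(!can_skip_table(b"c", b"k", 5, &rts, 8));
}
```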


Comment thread tests/range_tombstone.rs
Comment thread src/table/writer/mod.rs Outdated
Comment thread src/memtable/interval_tree.rs Outdated
Comment thread src/memtable/interval_tree.rs Outdated
Comment thread src/range_tombstone.rs
Comment thread src/active_tombstone_set.rs
polaz added 2 commits March 17, 2026 00:52
- Write synthetic weak tombstone in RT-only tables to produce valid index
- Fix interval_tree pruning: use >= for exclusive seqno boundary
- Replace #[allow(dead_code)] with #[expect(dead_code, reason)] throughout
- Remove key_range filter from SST RT suppression (RT range may exceed table key range)
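The memtable interval tree touched by the pruning fix above supports stabbing queries (which tombstones cover a given key?). A simplified, unbalanced sketch with the standard max-end augmentation — the crate's version is AVL-balanced, and these names are illustrative only:

```rust
// Node ordered by start key; max_end lets whole subtrees be skipped.
struct Node {
    start: Vec<u8>,
    end: Vec<u8>,     // exclusive
    seqno: u64,
    max_end: Vec<u8>, // max end key over this subtree
    left: Option<Box<Node>>,
    right: Option<Box<Node>>,
}

fn insert(node: &mut Option<Box<Node>>, start: Vec<u8>, end: Vec<u8>, seqno: u64) {
    match node {
        None => {
            *node = Some(Box::new(Node {
                max_end: end.clone(), start, end, seqno, left: None, right: None,
            }));
        }
        Some(n) => {
            if n.max_end < end { n.max_end = end.clone(); }
            if start < n.start { insert(&mut n.left, start, end, seqno) }
            else { insert(&mut n.right, start, end, seqno) }
        }
    }
}

/// Collect seqnos of all tombstones whose interval contains `key`.
fn stab(node: Option<&Node>, key: &[u8], out: &mut Vec<u64>) {
    let Some(n) = node else { return };
    // If nothing in this subtree ends after `key`, no interval covers it.
    if key >= n.max_end.as_slice() { return; }
    stab(n.left.as_deref(), key, out);
    if n.start.as_slice() <= key && key < n.end.as_slice() {
        out.push(n.seqno);
    }
    // The right subtree starts at or after n.start; if even n.start is
    // past `key`, it cannot cover the key.
    if key >= n.start.as_slice() {
        stab(n.right.as_deref(), key, out);
    }
}

fn main() {
    let mut root: Option<Box<Node>> = None;
    insert(&mut root, b"b".to_vec(), b"d".to_vec(), 5);
    insert(&mut root, b"c".to_vec(), b"f".to_vec(), 7);
    insert(&mut root, b"g".to_vec(), b"h".to_vec(), 9);

    let mut hits = Vec::new();
    stab(root.as_deref(), b"c", &mut hits);
    assert_eq!(hits, vec![5, 7]); // both [b,d) and [c,f) cover "c"

    hits.clear();
    stab(root.as_deref(), b"f", &mut hits);
    assert!(hits.is_empty()); // end keys are exclusive
}
```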

Copilot AI left a comment


Pull request overview

Copilot reviewed 28 out of 28 changed files in this pull request and generated 2 comments.

Comment thread tests/tree_recovery_versions.rs
Comment thread tests/tree_recovery_versions.rs
@polaz
Member Author

polaz commented Mar 20, 2026

@coderabbitai re-review

@coderabbitai

coderabbitai Bot commented Mar 20, 2026

@polaz Sure! I'll kick off a full re-review of the PR now.


🧠 Learnings used
Learnt from: polaz
Repo: structured-world/lsm-tree PR: 0
File: :0-0
Timestamp: 2026-03-20T08:17:06.289Z
Learning: In `structured-world/lsm-tree`, the switch from `Mutex<IntervalTree>` to `RwLock<IntervalTree>` (or a freeze-on-seal immutable structure) for memtable range-tombstone storage in `src/memtable/mod.rs` is intentionally deferred from PR `#21` to a separate issue (`#30`). Do not flag the existing `Mutex<IntervalTree>` usage in `src/memtable/mod.rs` as a concurrency concern in the context of PR `#21`; it is tracked in issue `#30`.

Learnt from: polaz
Repo: structured-world/lsm-tree PR: 0
File: :0-0
Timestamp: 2026-03-20T08:17:39.835Z
Learning: In `structured-world/lsm-tree`, the `Memtable` range-tombstone storage uses `Mutex<IntervalTree>` (introduced in PR `#21`). Switching to `RwLock<IntervalTree>` or a freeze-on-seal immutable structure is a deliberate future improvement tracked in a separate issue — do not flag the `Mutex` usage as a blocking concern in PR `#21` or future PRs that reference this design.

Learnt from: polaz
Repo: structured-world/lsm-tree PR: 0
File: :0-0
Timestamp: 2026-03-20T08:17:39.835Z
Learning: In `structured-world/lsm-tree`, the `Memtable` range-tombstone storage uses `Mutex<IntervalTree>` (introduced in PR `#21`). Switching to `RwLock<IntervalTree>` or a freeze-on-seal immutable structure is a deliberate future improvement tracked in issue `#31` — do not flag the `Mutex` usage as a blocking concern in PR `#21` or future PRs that reference this design.

Learnt from: polaz
Repo: structured-world/lsm-tree PR: 21
File: src/table/multi_writer.rs:129-206
Timestamp: 2026-03-20T11:14:02.461Z
Learning: In `structured-world/lsm-tree`, the `write_rts_to_writer` function in `src/table/multi_writer.rs` has a known limitation when `clip_range_tombstones=true` (compaction mode): RT fragments that don't intersect any output writer's KV span are silently dropped, which can lose delete-only gaps (e.g. an RT covering `[m,p)` while all output tables cover `[a,l]` and `[q,z]`). This limitation is intentionally deferred to issue `#32` and is documented as a Known Limitation in PR `#21`. Do not flag this as a blocking concern in PR `#21` or any PR until `#32` is addressed.

Learnt from: polaz
Repo: structured-world/lsm-tree PR: 21
File: src/table/writer/mod.rs:394-463
Timestamp: 2026-03-20T20:05:18.309Z
Learning: In `structured-world/lsm-tree` (`src/table/writer/mod.rs`), the `key#min`/`key#max` widening for RT bounds in `Writer::finish()` is intentionally limited to the `item_count == 0` (RT-only) branch. For mixed KV+RT tables, `MultiWriter::write_rts_to_writer` (lines 167–199 of `src/table/multi_writer.rs`) already widens `meta.first_key`/`meta.last_key` conservatively in flush mode (`clip=false`). In compaction mode (`clip=true`), RTs are intersected with the KV key range so no widening is needed. Do not flag the absence of unconditional widening in `Writer::finish()` as a bug in PR `#21` or future PRs.

Learnt from: polaz
Repo: structured-world/lsm-tree PR: 21
File: src/table/writer/mod.rs:437-457
Timestamp: 2026-03-20T12:29:43.719Z
Learning: In `structured-world/lsm-tree` (`src/table/writer/mod.rs`), the synthetic weak-tombstone sentinel written for RT-only SSTs (when `item_count == 0` but `range_tombstones` is non-empty) intentionally increments `meta.tombstone_count`, `meta.weak_tombstone_count`, `meta.item_count`, and `meta.key_count`. All on-disk metadata counters must match actual block contents for recovery validation consistency (established in commit 7452427a for item_count/key_count; same reasoning applies to tombstone counters). The resulting +1 inflation on GC heuristics is harmless — it overestimates reclaimable space and triggers slightly earlier checks. Do not flag this as a metadata accuracy bug in PR `#21` or future PRs.

Learnt from: polaz
Repo: structured-world/lsm-tree PR: 12
File: src/table/block/mod.rs:109-119
Timestamp: 2026-03-15T16:10:30.764Z
Learning: In `structured-world/lsm-tree`, pre-allocation bounds checks on `uncompressed_length`, `data_length`, `handle.size()`, and `real_val_len` (256 MiB cap, returning `DecompressedSizeTooLarge`) are intentionally deferred to PR `#7` (branch `feat/#258-security-validate-uncompressedlength-before-decomp`, closes issues `#258` and `#257`). Do not flag missing size-cap guards in `Block::from_reader`, `Block::from_file`, or `blob_file::Reader::get` as unaddressed — they are handled in that separate security PR.

Learnt from: polaz
Repo: structured-world/lsm-tree PR: 23
File: .github/instructions/rust.instructions.md:0-0
Timestamp: 2026-03-17T00:34:54.145Z
Learning: In `structured-world/lsm-tree`, the lint suppression policy distinguishes between legacy and new code:
- Existing/upstream `#[allow(clippy::...)]` in **unchanged** code: treat as legacy — do **not** flag it in reviews.
- New code introduced in a PR **must** use `#[expect(clippy::...)]` with a justification reason.
This avoids churning upstream code while enforcing the `#[expect]` standard for all new contributions.

Learnt from: CR
Repo: structured-world/lsm-tree PR: 0
File: .github/instructions/rust.instructions.md:0-0
Timestamp: 2026-03-17T01:09:51.997Z
Learning: When reviewing Rust code in forks with multiple feature branches in parallel, check the PR description for cross-references as hardening in one branch may already exist in another

Learnt from: polaz
Repo: structured-world/lsm-tree PR: 0
File: :0-0
Timestamp: 2026-03-15T13:24:43.905Z
Learning: In the lsm-tree repository, the `unsafe { Slice::builder_unzeroed }` pattern used in lz4 decompression is pre-existing upstream code (not introduced by any feature PR). It is tracked for a separate fix in issue `#2`. The zstd decompression path uses the safe `Vec`-based approach instead.

Learnt from: CR
Repo: structured-world/lsm-tree PR: 0
File: .github/instructions/rust.instructions.md:0-0
Timestamp: 2026-03-17T01:09:51.997Z
Learning: Review ONLY code within the PR's diff in Rust code reviews; for issues outside the diff, suggest opening a separate issue

Learnt from: CR
Repo: structured-world/lsm-tree PR: 0
File: .github/instructions/rust.instructions.md:0-0
Timestamp: 2026-03-17T01:09:51.997Z
Learning: Applies to **/*.rs : Focus Rust code review effort on real bugs in Tier 1 (Logic Bugs and Correctness) such as data corruption, off-by-one errors, CRC/checksum mismatches, TOCTOU issues, incorrect merge semantics, missing validation, resource leaks, concurrency bugs, error swallowing, and integer overflow/truncation on sizes, offsets, or block counts

Learnt from: CR
Repo: structured-world/lsm-tree PR: 0
File: .github/instructions/rust.instructions.md:0-0
Timestamp: 2026-03-17T01:09:51.997Z
Learning: Applies to **/*.rs : Flag Tier 2 (Safety and Crash Recovery) issues in Rust code: `unsafe` blocks without `// SAFETY:` invariant explanation, `unwrap()`/`expect()` on disk I/O or deserialization, crash safety violations with write ordering, partial write exposure to readers, incorrect fsync ordering for metadata durability, and hardcoded secrets/credentials/private URLs

Learnt from: polaz
Repo: structured-world/lsm-tree PR: 21
File: src/tree/mod.rs:775-776
Timestamp: 2026-03-20T00:36:36.202Z
Learning: In `structured-world/lsm-tree`, `Table::get` translates the caller-provided snapshot seqno into table-local seqno space for internal reads, then rehydrates returned `entry.key.seqno` back into global seqno space before returning the item. Therefore, `entry.key.seqno` is already in global seqno space at the call site in `Tree::get_internal_entry_from_version`, and the call to `is_suppressed_by_range_tombstones(super_version, key, entry.key.seqno, seqno)` is comparing values in the same (global) seqno space even for bulk-ingested tables with a non-zero `global_seqno`. Do not flag this comparison as a seqno-space mismatch.

Learnt from: polaz
Repo: structured-world/lsm-tree PR: 21
File: src/tree/mod.rs:747-749
Timestamp: 2026-03-20T20:05:24.184Z
Learning: In `structured-world/lsm-tree` (`src/tree/mod.rs`), `get_internal_entry_from_version` searches newest-first (active memtable → sealed memtables → SST tables). Each source's `get(key, seqno)` already returns the most-recent entry visible at that seqno within that source. Therefore, a point tombstone returned from the active memtable IS the authoritative latest version for that key — returning `Ok(None)` immediately is correct and does NOT hide a higher-seqno value from a lower source. Do not flag this early-return pattern as hiding newer entries.
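The clip-mode limitation recorded in the learnings above (RT fragments outside every output table's KV span being dropped, deferred to issue #32) can be illustrated with a toy clip function — hypothetical code, not the crate's `write_rts_to_writer`:

```rust
/// Clip a range tombstone (start, end-exclusive, seqno) to an output
/// table's inclusive KV span. Returns None when the fragment does not
/// intersect the span — which is how delete-only gaps can be lost.
fn clip_to_span(
    rt: (&[u8], &[u8], u64),
    span: (&[u8], &[u8]), // (first_key, last_key), both inclusive
) -> Option<(Vec<u8>, Vec<u8>, u64)> {
    let (start, end, seqno) = rt;
    let clipped_start = start.max(span.0);
    // Overlap requires start <= last_key and first_key < end (end exclusive).
    if clipped_start <= span.1 && span.0 < end {
        let mut upper = span.1.to_vec();
        upper.push(0); // smallest key strictly greater than last_key
        let clipped_end = end.min(upper.as_slice()).to_vec();
        Some((clipped_start.to_vec(), clipped_end, seqno))
    } else {
        None
    }
}

fn main() {
    // RT [b, m) overlaps span [d, f]: a clipped fragment survives.
    assert!(clip_to_span((b"b".as_slice(), b"m".as_slice(), 9),
                         (b"d".as_slice(), b"f".as_slice())).is_some());
    // RT [m, p) falls entirely between spans [a, l] and [q, z]:
    // both clips return None, so the tombstone is silently dropped.
    assert!(clip_to_span((b"m".as_slice(), b"p".as_slice(), 9),
                         (b"a".as_slice(), b"l".as_slice())).is_none());
    assert!(clip_to_span((b"m".as_slice(), b"p".as_slice(), 9),
                         (b"q".as_slice(), b"z".as_slice())).is_none());
}
```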
✅ Actions performed

Full review triggered.


Copilot AI left a comment


Pull request overview

Copilot reviewed 28 out of 28 changed files in this pull request and generated 2 comments.

Comment thread tests/tree_recovery_versions.rs Outdated
Comment thread tests/range_tombstone.rs

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

♻️ Duplicate comments (1)
.github/copilot-instructions.md (1)

65-74: ⚠️ Potential issue | 🟠 Major

Narrow this rule to PR-visible call sites and preserve unconditional Tier 1/2 reporting.

Line 69 (“check every call site”) conflicts with the Line 9 diff-only review scope, and Line 73’s “not a bug” wording is too absolute. Please scope this to call sites visible in the PR diff and explicitly keep Tier 1/Tier 2 findings reportable even when a local protocol exists.

Suggested wording tweak
- **Before flagging any potential issue, trace the FULL call chain:**
+ **Before flagging any potential issue, trace the call chain visible in the PR diff:**
...
- **Only flag an issue if it survives all three checks.** A finding that looks like a bug in one function but is handled by its caller is not a bug — it is a protocol between the two functions.
+ **Only suppress Tier 3/Tier 4 findings if they fail these checks.** If the pattern still indicates Tier 1 (correctness) or Tier 2 (safety/crash-recovery) risk, flag it even when caller/callee protocols are documented.

Based on learnings: “Review ONLY code within the PR's diff” and “Tier 1/Tier 2 issues must be flagged regardless of documentation.”

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/copilot-instructions.md around lines 65 - 74, Update the "Design
Decision Analysis (CRITICAL)" guidance to limit the "check every call site"
requirement to only call sites that are visible within the PR diff and to
explicitly preserve unconditional reporting for Tier 1 and Tier 2 findings; edit
the sentences that mention "check every call site" and "Only flag an issue if it
survives all three checks" so they read that reviewers should trace the full
call chain only among PR-visible call sites and that documented local protocols
do not suppress Tier 1/Tier 2 reports.
🧹 Nitpick comments (1)
src/range_tombstone_filter.rs (1)

151-261: Rename the new unit tests to the repo’s standard pattern.

Several of these names (no_tombstones, basic_suppression, half_open_end_exclusive, reverse_half_open, …) are outside the repository’s what_condition_expected convention, which makes the new test surface less consistent than the rest of the crate.

As per coding guidelines, "Use test naming convention: fn <what>_<condition>_<expected>()."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/range_tombstone_filter.rs` around lines 151 - 261, Rename the tests to
follow the <what>_<condition>_<expected> pattern: no_tombstones ->
items_no_tombstones_return_all, basic_suppression ->
items_with_range_tombstone_suppress_intermediate_keys,
tombstone_does_not_suppress_newer_kv ->
tombstone_older_than_kv_does_not_suppress, half_open_end_exclusive ->
range_end_exclusive_suppresses_middle_keys, multiple_overlapping_tombstones ->
overlapping_tombstones_keep_newer_kv, tombstone_not_visible_at_read_seqno ->
tombstone_newer_than_read_seqno_not_visible, reverse_basic_suppression ->
rev_items_with_range_tombstone_suppress_middle_keys, reverse_half_open ->
rev_range_end_exclusive_suppresses_between; update the fn identifiers
accordingly (e.g., fn items_no_tombstones_return_all()) and keep the test bodies
unchanged so references to RangeTombstoneFilter, rt, kv, SeqNo::MAX remain
valid.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/range_tombstone.rs`:
- Around line 193-196: The module-level clippy suppression currently only lists
clippy::unwrap_used (the attribute on the module near the top of
src/range_tombstone.rs); add clippy::expect_used to that same attribute so calls
to expect(...) (e.g., the expect used around line ~385) are also silenced during
clippy runs. Update the existing #[expect(...)] attribute to include both
clippy::unwrap_used and clippy::expect_used.

In `@tests/range_tombstone.rs`:
- Around line 877-902: The test
compaction_clip_preserves_rt_covering_gap_between_tables currently uses
major_compact(64_000_000, 0) with tiny inputs so it compacts into a single
output table and never exercises the described gap case; either (A)
rename/reword the test and its comments to state it is a major-compaction
regression (remove claims about a gap between output tables) referencing the
test fn compaction_clip_preserves_rt_covering_gap_between_tables and calls to
major_compact/remove_range/flush_active_memtable, or (B) change the test inputs
to force multiple output tables (e.g., insert more keys or lower the compaction
target so major_compact produces two outputs, or invoke the leveled compaction
path if available) so the RT [m,p) actually falls in a gap between output tables
and the assertions exercise the intended case.

---

Duplicate comments:
In @.github/copilot-instructions.md:
- Around line 65-74: Update the "Design Decision Analysis (CRITICAL)" guidance
to limit the "check every call site" requirement to only call sites that are
visible within the PR diff and to explicitly preserve unconditional reporting
for Tier 1 and Tier 2 findings; edit the sentences that mention "check every
call site" and "Only flag an issue if it survives all three checks" so they read
that reviewers should trace the full call chain only among PR-visible call sites
and that documented local protocols do not suppress Tier 1/Tier 2 reports.

---

Nitpick comments:
In `@src/range_tombstone_filter.rs`:
- Around line 151-261: Rename the tests to follow the
<what>_<condition>_<expected> pattern: no_tombstones ->
items_no_tombstones_return_all, basic_suppression ->
items_with_range_tombstone_suppress_intermediate_keys,
tombstone_does_not_suppress_newer_kv ->
tombstone_older_than_kv_does_not_suppress, half_open_end_exclusive ->
range_end_exclusive_suppresses_middle_keys, multiple_overlapping_tombstones ->
overlapping_tombstones_keep_newer_kv, tombstone_not_visible_at_read_seqno ->
tombstone_newer_than_read_seqno_not_visible, reverse_basic_suppression ->
rev_items_with_range_tombstone_suppress_middle_keys, reverse_half_open ->
rev_range_end_exclusive_suppresses_between; update the fn identifiers
accordingly (e.g., fn items_no_tombstones_return_all()) and keep the test bodies
unchanged so references to RangeTombstoneFilter, rt, kv, SeqNo::MAX remain
valid.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 31c0a672-6e79-4242-ad4e-bc89cd0d3d5b

📥 Commits

Reviewing files that changed from the base of the PR and between ffa24b7 and c46710e.

📒 Files selected for processing (28)
  • .github/copilot-instructions.md
  • .github/instructions/rust.instructions.md
  • README.md
  • src/abstract_tree.rs
  • src/active_tombstone_set.rs
  • src/any_tree.rs
  • src/blob_tree/mod.rs
  • src/compaction/flavour.rs
  • src/compaction/worker.rs
  • src/format_version.rs
  • src/lib.rs
  • src/memtable/interval_tree.rs
  • src/memtable/mod.rs
  • src/range.rs
  • src/range_tombstone.rs
  • src/range_tombstone_filter.rs
  • src/table/block/type.rs
  • src/table/inner.rs
  • src/table/iter.rs
  • src/table/mod.rs
  • src/table/multi_writer.rs
  • src/table/regions.rs
  • src/table/util.rs
  • src/table/writer/mod.rs
  • src/tree/mod.rs
  • src/version/mod.rs
  • tests/range_tombstone.rs
  • tests/tree_recovery_versions.rs

Comment thread src/range_tombstone.rs
Comment thread tests/range_tombstone.rs
polaz added 2 commits March 20, 2026 22:41
…TODO, ignore heavy test

- Add clippy::expect_used to range_tombstone test module expect list
- Add range_tombstones section to ASCII table layout diagram
- Annotate RT block metrics folded into data_block counters
- Mark 68 MiB multi-table flush test as #[ignore]
- Assert InvalidVersion error type in manifest version test

Copilot AI left a comment


Pull request overview

Copilot reviewed 28 out of 28 changed files in this pull request and generated no new comments.

@polaz
Member Author

polaz commented Mar 20, 2026

@coderabbitai re-review

@coderabbitai

coderabbitai Bot commented Mar 20, 2026

@polaz Sure! I'll kick off a full re-review of the PR now.


✅ Actions performed

Full review triggered.


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (1)
.github/copilot-instructions.md (1)

73-73: Clarify suppression wording to avoid reviewer misinterpretation

The sentence can be read both ways. Consider rewording to explicitly say Tier 3/4 findings may be suppressed only when they do not survive the 3 checks.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/copilot-instructions.md at line 73, The sentence "Only suppress Tier
3/Tier 4 findings if they fail these checks." is ambiguous; update the wording
in .github/copilot-instructions.md to explicitly state that Tier 3/Tier 4
findings may be suppressed only when they do not survive the three checks (e.g.,
"Only suppress Tier 3/Tier 4 findings if they do not survive the three checks
below: ..."), making clear that any finding that still indicates Tier 1 or Tier
2 must be flagged regardless of suppression rationale.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/range_tombstone.rs`:
- Around line 396-409: The test currently calls tree.major_compact(64_000_000,
0)? but never asserts that compaction actually ran; immediately after calling
tree.major_compact(...) add a structural assertion that the number of SSTables
collapsed (e.g., assert_eq!(1, tree.table_count());) so the test fails if
compaction short-circuits — reference the tree.major_compact and
tree.table_count symbols when making the change.
- Around line 641-652: After calling tree.major_compact(1024, 0) add an
assertion that the compaction produced multiple output tables to ensure the
MultiWriter rotation path was exercised; e.g. immediately after the compact call
assert that the number of output tables is >= 2 by querying the tree's API
(replace with the real accessor if different) such as
assert!(tree.out_tables().len() >= 2) or assert!(tree.num_tables() >= 2) (or use
tree.manifest().live_tables().len() / appropriate method) before the existing
tree.get(...) checks.

---

Nitpick comments:
In @.github/copilot-instructions.md:
- Line 73: The sentence "Only suppress Tier 3/Tier 4 findings if they fail these
checks." is ambiguous; update the wording in .github/copilot-instructions.md to
explicitly state that Tier 3/Tier 4 findings may be suppressed only when they do
not survive the three checks (e.g., "Only suppress Tier 3/Tier 4 findings if
they do not survive the three checks below: ..."), making clear that any finding
that still indicates Tier 1 or Tier 2 must be flagged regardless of suppression
rationale.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 6c6916bd-814f-424b-8d8f-082c1a50a5a8

📥 Commits

Reviewing files that changed from the base of the PR and between ffa24b7 and 9b1aab8.


Comment thread tests/range_tombstone.rs
Comment thread tests/range_tombstone.rs


Development

Successfully merging this pull request may close these issues.

feat: range tombstones / delete_range / delete_prefix

2 participants