forked from fjall-rs/lsm-tree
-
Notifications
You must be signed in to change notification settings - Fork 1
feat: add range tombstones (delete_range / delete_prefix) #21
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
polaz
merged 85 commits into
main
from
feat/#16-feat-range-tombstones--deleterange--deleteprefix
Mar 20, 2026
Merged
Changes from 1 commit
Commits
Show all changes
85 commits
Select commit
Hold shift + click to select a range
a3bc6c9
feat: add range tombstones (delete_range / delete_prefix)
polaz c3e80fc
Merge branch 'main' into feat/#16-feat-range-tombstones--deleterange-…
polaz ed272cf
fix: resolve all clippy warnings for range tombstone code
polaz 718f2ba
fix(range-tombstone): validate bounds, fix RT-only flush and edge cases
polaz 343f31c
fix(table): validate BlockType on range tombstone block load
polaz 3e23f57
fix(range-tombstone): seqno visibility, decode hardening, lint attrs
polaz 954c044
fix(lint): use cfg_attr(feature, expect) for metrics-gated arg count
polaz d53ecfe
fix(range-tombstone): propagate RTs before write loop, enforce u16 bo…
polaz cb60d63
test(range-tombstone): rotation, blob tree, table-skip, invalid interval
polaz e1db06d
fix(range-tombstone): RT-only SST persistence, pruning, lint attrs
polaz f0c90ea
style: fix rustfmt formatting in interval_tree
polaz 1f87efa
Merge branch 'main' into feat/#16-feat-range-tombstones--deleterange-…
polaz 334890c
fix(range-tombstone): preserve sentinel seqno bounds, soft-reject ove…
polaz 5077e58
chore: remove .forge from git tracking
polaz 6841260
chore: add .claude to .gitignore
polaz 9522778
chore: merge main into feature branch
polaz df7c0f4
fix(range-tombstone): use #[expect] lints, optimize query_suppression
polaz 9a67ff2
fix(interval-tree): remove unfired unnecessary_box_returns expects
polaz 51c3429
fix(range-tombstone): preserve sentinel seqno bounds, soft-reject ove…
polaz 86a985f
docs(test): clarify Guard import is a trait dependency for .key()
polaz e525c10
docs(test): clarify Vec<Vec<u8>> PartialEq coercion for assertions
polaz 12a23c6
fix(range-tombstone): correct flush clipping and RT-only metadata range
polaz 2dbe5c6
fix(range-tombstone): write all RTs in flush mode without overlap filter
polaz 28d419f
test(range-tombstone): add disjoint RT flush and compaction tests
polaz e59eb29
fix(range-tombstone): use max RT seqno for sentinel to avoid seqno=0 …
polaz 21249f2
docs(range-tombstone): clarify RT-only flush and sentinel design deci…
polaz 6371dac
docs(ci): add design pattern exclusions to instruction files
polaz 1e381fc
docs(ci): generalize design decision rules in instruction files
polaz 317b9d8
fix(range-tombstone): dedup compaction RTs and warn on oversized keys
polaz a40eb7d
refactor(range-tombstone): tighten visibility and remove dead code
polaz d486a98
docs(range-tombstone): hide flush_to_tables_with_rt from public docs
polaz 0123988
fix(range-tombstone): correct reverse sort comparator for sweep-line
polaz 5a7308d
fix(range-tombstone): remove unused Reverse import
polaz 6860597
fix(range-tombstone): use unique sentinel seqno in RT-only tables
polaz 3462b39
docs(range-tombstone): annotate table-skip scan complexity
polaz dc8f68d
fix(range-tombstone): use MAX_SEQNO for sentinel to prevent cross-tab…
polaz 8874e46
fix(range-tombstone): widen flush table key_range to cover RT span
polaz 0d34769
docs(range-tombstone): clarify sentinel visibility at SeqNo::MAX
polaz ddda2c2
docs(range-tombstone): clarify compaction fallback for max-length keys
polaz a51facd
fix(range-tombstone): remove dead max_rt_seqno and add #[must_use]
polaz 979d126
fix(range-tombstone): restore item counts after sentinel and widen co…
polaz bc508a0
fix(range-tombstone): use SeqNo::MAX for sentinel and short-circuit t…
polaz 4371b2b
fix(range-tombstone): revert sentinel to MAX_SEQNO and tighten writer…
polaz 8b720cc
fix(range-tombstone): remove key_range widening to preserve inclusive…
polaz 7ec78e1
fix(range-tombstone): restore flush key_range widening for disjoint R…
polaz b383dcc
test(range-tombstone): add regression test for disjoint RT multi-comp…
polaz 66b044b
fix(range-tombstone): use seqno 0 for sentinel to prevent merge domin…
polaz 87a06f8
test(range-tombstone): add sentinel masking regression test
polaz 0b9a096
fix(range-tombstone): use lowest RT seqno for sentinel instead of 0
polaz 61c5268
docs(range-tombstone): document internal trait rationale for required…
polaz 7452427
fix(range-tombstone): keep sentinel in item counts for on-disk consis…
polaz 755e241
Potential fix for pull request finding
polaz b483a8b
Potential fix for pull request finding
polaz 97e64b1
fix(range-tombstone): align rt-only sentinel key with lowest seqno rt
polaz d9ed474
refactor(api): seal AbstractTree internals
polaz edd92f5
fix(format): bump fork disk semantics to V4
polaz 34ce53f
perf(range): remove redundant RT pre-scan
polaz a065720
test(range): harden range tombstone safety coverage
polaz c81c251
perf(range): filter memtable RTs by query overlap
polaz ece7a2a
style(test): format safety regression assertions
polaz e8d1c4a
fix(range): harden RT metadata and compaction safety
polaz e6c457b
fix(flush): keep RT-widened tables in disjoint L0 runs
polaz 9ee4dcd
fix(range): harden max-key clipping and review regressions
polaz bc72a94
chore(lints): add reason to interval-tree test expect
polaz eb61d8f
docs(readme): document fork disk format V4 boundary
polaz 2c0b836
fix(api): properly seal AbstractTree implementations
polaz a6a1436
test(range): add RT tamper coverage and memtable must_use
polaz 61cfda2
fix(range-tombstone): dedup flush range tombstones before persistence
polaz 210b69f
docs(ci): clarify call-site scope in review instructions
polaz 642148d
test(range-tombstone): add compaction clip gap regression test
polaz 7710935
test(range-tombstone): add RT-only table sentinel recovery regression…
polaz 56addb8
test(range-tombstone): assert specific error type in RT block tamper …
polaz 00b5b9a
test(range-tombstone): disable compression in multi-table flush test
polaz 8b083ad
fix(range-tombstone): disable compression in flush rotation test and …
polaz 42d4726
docs(ci): scope design-decision exclusion to Tier 3-4 only
polaz 261bbf9
fix(range-tombstone): pin compression in compaction rotation test and…
polaz c863073
perf(range-tombstone): dedup collected RTs before range iteration filter
polaz 1065361
docs(range-tombstone): document newest-first search order for point t…
polaz c46710e
docs(test): clarify owned File auto-borrow for read_u64 call
polaz 5d7ab2d
test(recovery): assert InvalidVersion error type in manifest version …
polaz 67e6a27
fix(range-tombstone): add test expect_used lint, RT diagram, metrics …
polaz 1564cdd
docs(range-tombstone): link RT metrics TODO to issue #34
polaz 0ed9bf0
docs(ci): scope design analysis to PR-visible call chain and preserve…
polaz 9b1aab8
style(range-tombstone): rename filter tests to what_condition_expecte…
polaz 387617f
test(range-tombstone): add structural assertions for compaction and r…
polaz File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Some comments aren't visible on the classic Files Changed page.
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,272 @@ | ||
| # Issue #16: Range Tombstones / delete_range / delete_prefix | ||
|
|
||
| ## Deep Architecture Analysis | ||
|
|
||
| **Date:** 2026-03-16 | ||
| **Branch:** `feat/#16-feat-range-tombstones--deleterange--deleteprefix` | ||
| **Upstream reference:** fjall-rs/lsm-tree#2 (epic), fjall-rs/lsm-tree#242 (PR, OPEN but stalled) | ||
|
|
||
| --- | ||
|
|
||
| ## Context | ||
|
|
||
| ### What exists today | ||
|
|
||
| The codebase has **point tombstones** (`ValueType::Tombstone`, `ValueType::WeakTombstone`) and a limited `drop_range` compaction strategy that drops SST files **fully contained** within a key range. There is no true range tombstone — a single marker that logically deletes all keys in `[start, end)`. | ||
|
|
||
| **Current deletion model:** | ||
| - `remove(key, seqno)` → writes a `Tombstone` entry to memtable | ||
| - `remove_weak(key, seqno)` → writes a `WeakTombstone` (single-delete) | ||
| - `drop_range(range)` → compaction strategy that physically drops fully-contained tables (no MVCC, no partial overlap handling) | ||
| - No `delete_range` or `delete_prefix` API | ||
|
|
||
| ### Why CoordiNode needs this | ||
|
|
||
| Graph operations like "drop all edges of node X" or "drop label Y" generate O(N) individual tombstones for N edges/properties. With range tombstones: | ||
| - **Single marker** replaces millions of point tombstones | ||
| - **Compaction efficiency** — one range tombstone suppresses entire key ranges without per-key processing | ||
| - **Label drop** becomes O(1) write instead of O(N) | ||
|
|
||
| --- | ||
|
|
||
| ## Upstream PR #242 — Deep Analysis | ||
|
|
||
| ### Status: STALLED with unresolved architectural disagreement | ||
|
|
||
| **Timeline:** | ||
| - 2026-02-06: PR opened by `temporaryfix` (4049 additions, 15 commits) | ||
| - 2026-02-06: Author: "Will do benchmarks when I get the chance" — **benchmarks never done** | ||
| - 2026-02-10: Maintainer `marvin-j97` reviews — raises **principal concern about SST block format** | ||
| - 2026-02-10: Extended discussion about block format design | ||
| - 2026-02-12: Author shares GLORAN paper, suggesting the whole approach may be wrong | ||
| - 2026-02-12: Maintainer is "open to alternative implementations" (GLORAN-style) | ||
| - 2026-03-15: Author: "life has gotten in the way" — **work stalled** | ||
|
|
||
| ### Maintainer's principal concern | ||
|
|
||
| **marvin-j97 objects to the dual block format** (`RangeTombstoneStart` + `RangeTombstoneEnd`): | ||
|
|
||
| > "The first thing that seems awkward to me is the duplication of range tombstones for start and end inside tables. Ideally you would just store a table's range tombstones using a DataBlock." | ||
|
|
||
| His reasoning: | ||
| 1. DataBlocks are "tried and tested" — adding new block types is maintenance burden | ||
| 2. RocksDB stores range tombstones "in data blocks without binary seek index" (cites blog post) | ||
| 3. The 2D problem (range + seqno) is "pretty unavoidable" — range tombstones need in-memory deserialization on table open regardless (like RocksDB's fragmentation) | ||
| 4. Suggests looking at **Pebble** for alternative approach | ||
|
|
||
| Author's defense: | ||
| - DataBlocks are optimized for key→value lookups, not overlap/cover checks | ||
| - Reverse MVCC needs streaming on `(end > current_key)` — impossible with only start-sorted data | ||
| - ByStart + ByEndDesc enables per-window `max_end` pruning and streaming reverse activation | ||
| - RocksDB also uses dedicated blocks, not regular data blocks | ||
|
|
||
| **Unresolved:** Maintainer did not accept the defense. The conversation shifted to GLORAN paper, and then stalled. | ||
|
|
||
| ### GLORAN paper (arxiv 2511.06061) | ||
|
|
||
| Both maintainer and author reference this paper as potentially superior: | ||
| - **Problem:** Per-level range tombstones cause 30% point lookup degradation with just 1% range deletes | ||
| - **Solution:** Global Range Record Index (LSM-DRtree) — central immutable index instead of per-level storage | ||
| - **Results:** 10.6× faster point lookups, 2.7× higher throughput | ||
| - **Maintainer's take:** "Such auxiliary indexes must be immutable/log-structured" — possible to add as new field in Version files (like value log) | ||
|
|
||
| ### What this means for us | ||
|
|
||
| 1. **PR #242 will NOT merge as-is** — maintainer wants different SST format | ||
| 2. **No timeline for resolution** — author is busy, maintainer is open to alternatives but hasn't specified one | ||
| 3. **Cherry-picking is high-risk** — we'd inherit a contested design that upstream may reject/replace | ||
| 4. **The PR base is stale** — based on pre-3.1.0 upstream; upstream has diverged significantly since | ||
|
|
||
| --- | ||
|
|
||
| ## Fork Divergence Analysis | ||
|
|
||
| ### Our fork vs upstream main | ||
| ``` | ||
| origin/main vs upstream/main: | ||
| 45 files changed, 224 insertions(+), 3,789 deletions(-) | ||
| ``` | ||
| Our fork has ~100+ commits with: zstd compression, intra-L0 compaction, size cap enforcement, verify_integrity, SequenceNumberGenerator trait, multi_get, contains_prefix, seqno-aware seek, copilot CI, release-plz. | ||
|
|
||
| ### Our fork vs PR #242 branch | ||
| ``` | ||
| origin/main vs upstream-pr-242: | ||
| 97 files changed, 4,568 insertions(+), 6,096 deletions(-) | ||
| ``` | ||
| Massive diff because: | ||
| 1. PR #242 is based on OLD upstream (pre-3.1.0) | ||
| 2. Our fork has features upstream doesn't have | ||
| 3. Upstream main has moved past the PR base | ||
|
|
||
| **New files in PR #242:** | ||
| - `src/active_tombstone_set.rs` (403 lines) | ||
| - `src/memtable/interval_tree.rs` (513 lines) | ||
| - `src/range_tombstone.rs` (343 lines) | ||
| - `src/range_tombstone_filter.rs` (281 lines) | ||
| - `src/table/range_tombstone_block_by_end.rs` (393 lines) | ||
| - `src/table/range_tombstone_block_by_start.rs` (664 lines) | ||
| - `src/table/range_tombstone_encoder.rs` (365 lines) | ||
| - `tests/range_tombstone.rs` (447 lines) | ||
|
|
||
| **Heavily modified files (conflict-prone):** | ||
| - `src/abstract_tree.rs` (renamed to `abstract.rs` in PR!) | ||
| - `src/tree/mod.rs` — both we and PR modify heavily | ||
| - `src/compaction/stream.rs` — PR strips our `StreamFilter` | ||
| - `src/compaction/worker.rs` — both modify | ||
| - `src/memtable/mod.rs` — both modify | ||
| - `src/table/mod.rs` — both modify | ||
| - `src/blob_tree/mod.rs` — both modify | ||
| - `src/config/mod.rs` — PR removes features we added | ||
| - `src/compression.rs` — PR removes our zstd support | ||
| - `src/seqno.rs` — PR removes our SequenceNumberGenerator | ||
|
|
||
| **Critical:** The PR diff REMOVES many of our fork features because it's based on older upstream. Cherry-picking individual commits would require resolving conflicts in virtually every modified file. | ||
|
|
||
| --- | ||
|
|
||
| ## Revised Implementation Approaches | ||
|
|
||
| ### Approach A: Selective port of PR #242 core logic — **RECOMMENDED** | ||
|
|
||
| Extract the **algorithmic core** from PR #242 (types, interval tree, active set, filter) and integrate into our fork manually. Skip the contested SST block format — use DataBlock-based storage as the maintainer prefers. This positions us for eventual upstream convergence regardless of which SST format upstream chooses. | ||
|
|
||
| **What to port (new files, minimal conflicts):** | ||
| 1. `RangeTombstone` struct + `ActiveTombstoneSet` sweep-line tracker (~750 LOC) | ||
| 2. `IntervalTree` for memtable range tombstones (~500 LOC) | ||
| 3. `RangeTombstoneFilter` for range/prefix iteration (~280 LOC) | ||
|
|
||
| **What to write ourselves:** | ||
| 4. SST persistence — use existing `DataBlock` with `key=start, value=end|seqno` (follows maintainer's direction) | ||
| 5. Integration into our fork's `tree/mod.rs`, `abstract_tree.rs`, `compaction/` (our code differs too much to cherry-pick) | ||
| 6. `delete_range` / `delete_prefix` API on `AbstractTree` | ||
|
|
||
| **Pros:** | ||
| - Reuses proven algorithms (interval tree, sweep-line) without inheriting contested SST format | ||
| - Follows maintainer's preferred direction (DataBlock reuse) — better upstream merge path | ||
| - New files (types, interval tree, active set) can be ported with zero conflicts | ||
| - Integration code written against OUR fork, not upstream's old base | ||
| - Benchmarks can be done before committing to SST format | ||
|
|
||
| **Cons:** | ||
| - More manual work than pure cherry-pick (~3-4 days vs 2-3) | ||
| - SST format may still diverge from whatever upstream eventually picks | ||
| - Need to validate the ported algorithms ourselves | ||
|
|
||
| **Estimate:** 3-4 days | ||
|
|
||
| **Upstream-compatible:** Partially — core algorithms match; SST format aligned with maintainer preference but final upstream choice unknown. | ||
|
|
||
| --- | ||
|
|
||
| ### Approach B: Memtable-only + DataBlock persistence (minimal viable) | ||
|
|
||
| Implement range tombstones in memtable with a simple sorted structure. Persist to SSTs using existing DataBlock (no new block types). Skip sweep-line optimization — use simpler linear scan for iteration filtering. Good enough for CoordiNode's initial needs, easy to upgrade later. | ||
|
|
||
| **How it works:** | ||
| 1. `RangeTombstone` struct with `[start, end)` + seqno | ||
| 2. `BTreeMap<(start, Reverse(seqno)), RangeTombstone>` in Memtable (simpler than full interval tree) | ||
| 3. Point reads: linear scan memtable tombstones for suppression (O(T) where T = tombstone count, typically small) | ||
| 4. Range iteration: collect tombstones, suppress matching keys inline | ||
| 5. SST: store tombstones in DataBlock at end of table (key=start, value=end|seqno) | ||
| 6. Compaction: read tombstone DataBlock, clip to table range, write to output | ||
| 7. GC: evict tombstones below watermark at bottom level | ||
|
|
||
| **Pros:** | ||
| - Simpler implementation (~1.5-2K LOC) | ||
| - No new block types — uses proven DataBlock format | ||
| - Good enough for CoordiNode's workload (tens of tombstones, not millions) | ||
| - Easy to upgrade to interval tree / sweep-line later if needed | ||
| - Aligns with maintainer's DataBlock preference | ||
|
|
||
| **Cons:** | ||
| - Linear scan for suppression = O(T) per point read (fine for small T, bad for thousands of tombstones) | ||
| - No sweep-line = reverse iteration is naive | ||
| - Need to write everything from scratch (no code reuse from PR) | ||
| - No table-skip optimization initially | ||
|
|
||
| **Estimate:** 2-3 days | ||
|
|
||
| **Upstream-compatible:** Partially — SST format aligned with maintainer, but algorithms differ. | ||
|
|
||
| --- | ||
|
|
||
| ### Approach C: Wait for upstream resolution | ||
|
|
||
| Don't implement now. Monitor PR #242 discussion. Use `drop_range` + point tombstones as workaround. | ||
|
|
||
| **Pros:** | ||
| - Zero fork divergence | ||
| - Zero effort now | ||
|
|
||
| **Cons:** | ||
| - Blocks CoordiNode's efficient bulk deletion (P0 requirement) | ||
| - Upstream resolution could take months (author stalled, no consensus on design) | ||
| - `drop_range` is not MVCC-safe — unusable for transactional graph operations | ||
|
|
||
| **Estimate:** 0 days now, unknown later | ||
|
|
||
| **Upstream-compatible:** Perfect — but at the cost of blocking our product. | ||
|
|
||
| --- | ||
|
|
||
| ### Approach D: Full custom (GLORAN-inspired) | ||
|
|
||
| Implement range tombstones using GLORAN's global index approach. Independent of upstream. | ||
|
|
||
| **Pros:** | ||
| - Potentially best performance (10x point lookups per paper) | ||
| - Novel approach both maintainer and PR author seem interested in | ||
|
|
||
| **Cons:** | ||
| - Research-grade, no production implementation exists | ||
| - Massive effort (5-8 days) with high uncertainty | ||
| - Permanent fork divergence | ||
| - Paper is from 2025, may have undiscovered issues | ||
|
|
||
| **Estimate:** 5-8 days | ||
|
|
||
| **Upstream-compatible:** No. | ||
|
|
||
| --- | ||
|
|
||
| ## Comparison Matrix | ||
|
|
||
| | Criteria | A: Selective Port | B: Minimal Viable | C: Wait | D: GLORAN | | ||
| |---|---|---|---|---| | ||
| | **Completeness** | Full | Adequate | None | Full | | ||
| | **MVCC correctness** | Yes | Yes | No | Yes | | ||
| | **SST persistence** | DataBlock | DataBlock | No | Custom | | ||
| | **Point read overhead** | O(log T) | O(T) | None | O(log² N) | | ||
| | **Sweep-line iteration** | Yes | No | N/A | Yes | | ||
| | **Table skip optimization** | Yes | No | No | Yes | | ||
| | **Upstream alignment** | Good | Moderate | Perfect | None | | ||
| | **Implementation effort** | 3-4d | 2-3d | 0d | 5-8d | | ||
| | **Fork divergence risk** | Low-Medium | Low | None | Very High | | ||
| | **CoordiNode P0 unblock** | Yes | Yes | No | Yes | | ||
|
|
||
| --- | ||
|
|
||
| ## Recommendation: Approach A — Selective Port of PR #242 Core | ||
|
|
||
| **Why A over the original "cherry-pick everything":** | ||
|
|
||
| 1. **PR #242 will not merge as-is** — maintainer rejected the SST format. Cherry-picking inherits a dead design | ||
| 2. **PR base is stale** — based on pre-3.1.0 upstream, conflict resolution across 25+ files is a week of work, not 2-3 days | ||
| 3. **Core algorithms are solid** — `IntervalTree`, `ActiveTombstoneSet`, `RangeTombstoneFilter` are well-designed and can be ported as standalone files with zero conflicts | ||
| 4. **DataBlock SST format** follows maintainer's direction — better upstream merge path than the dual-block approach | ||
| 5. **We control integration** — writing our own glue code against our fork's actual state avoids the massive conflict resolution | ||
|
|
||
| **Execution plan:** | ||
| 1. Port `RangeTombstone` + `ActiveTombstoneSet` types (new files, no conflicts) | ||
| 2. Port `IntervalTree` for memtable (new file, no conflicts) | ||
| 3. Port `RangeTombstoneFilter` for iteration (new file, no conflicts) | ||
| 4. Design DataBlock-based SST persistence (new block type value = `5`, or reuse DataBlock with sentinel key prefix) | ||
| 5. Integrate into memtable, point reads, range iteration, flush, compaction | ||
| 6. Add `remove_range` / `remove_prefix` to `AbstractTree` | ||
| 7. Write tests (port + adapt from PR #242's test suite) | ||
| 8. Benchmark point read regression | ||
|
|
||
| **Key risks:** | ||
| - The ported algorithms may have bugs we don't catch (mitigate: thorough testing) | ||
| - Upstream may pick a different SST format than DataBlock (mitigate: encapsulate persistence behind trait/module boundary) | ||
| - Performance regression on point reads (mitigate: fast path when zero range tombstones exist, as PR #242 already does) | ||
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,27 @@ | ||
| # Issue #16: Range Tombstones — Known Gaps & Limitations | ||
|
|
||
| **Date:** 2026-03-16 | ||
| **Status:** All critical bugs fixed. Remaining items are optimization opportunities. | ||
|
|
||
| ## Fixed Bugs (this session) | ||
|
|
||
| 1. **RT-only flush** — Writer deleted empty table before writing RTs. Fixed: derive key range from tombstones when item_count==0. | ||
| 2. **MultiWriter rotation** — RTs only written to last output table. Fixed: clip RTs to each table's key range via `intersect_opt()` during rotation and finish. | ||
| 3. **Compaction clipping** — RTs propagated unclipped. Fixed: MultiWriter now clips via `set_range_tombstones()`. | ||
|
|
||
| ## Remaining Known Limitations | ||
|
|
||
| ### 1. No WAL persistence for range tombstones | ||
| Range tombstones in the active memtable are lost on crash (before flush). This is consistent with the crate's design — it does not ship a WAL. The caller (fjall) manages WAL-level durability. | ||
|
|
||
| ### 2. No compaction-level tombstone clipping for multi-run tables | ||
| When a run has multiple tables, the `RunReader` path (in `range.rs`) does not apply the table-skip optimization. Only single-table runs get the `is_covered` check. This is a performance optimization gap, not a correctness issue — the `RangeTombstoneFilter` still correctly filters suppressed items. | ||
|
|
||
| ### 3. Linear scan for SST range tombstone suppression in point reads | ||
| `is_suppressed_by_range_tombstones` iterates ALL SST tables and ALL their range tombstones linearly. For workloads with many tables and many range tombstones, this could degrade point read latency. Consider: building an in-memory interval tree from SST tombstones on version change (similar to GLORAN paper's approach). | ||
|
|
||
| ### 4. Range tombstone block format is not upstream-compatible | ||
| We use a raw wire format (`[start_len:u16][start][end_len:u16][end][seqno:u64]`) with `BlockType::RangeTombstone = 4`. This is a fork-only format. When upstream settles on a format, we'll need a migration. | ||
|
|
||
| ### 5. No range tombstone metrics | ||
| Unlike point tombstones (`tombstone_count`, `weak_tombstone_count`), range tombstones are tracked in table metadata (`range_tombstone_count`) but not surfaced to the `Metrics` system (behind `#[cfg(feature = "metrics")]`). |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Uh oh!
There was an error while loading. Please reload this page.