Merged

Commits (85 commits)
a3bc6c9
feat: add range tombstones (delete_range / delete_prefix)
polaz Mar 16, 2026
c3e80fc
Merge branch 'main' into feat/#16-feat-range-tombstones--deleterange-…
polaz Mar 16, 2026
ed272cf
fix: resolve all clippy warnings for range tombstone code
polaz Mar 16, 2026
718f2ba
fix(range-tombstone): validate bounds, fix RT-only flush and edge cases
polaz Mar 16, 2026
343f31c
fix(table): validate BlockType on range tombstone block load
polaz Mar 16, 2026
3e23f57
fix(range-tombstone): seqno visibility, decode hardening, lint attrs
polaz Mar 16, 2026
954c044
fix(lint): use cfg_attr(feature, expect) for metrics-gated arg count
polaz Mar 16, 2026
d53ecfe
fix(range-tombstone): propagate RTs before write loop, enforce u16 bo…
polaz Mar 16, 2026
cb60d63
test(range-tombstone): rotation, blob tree, table-skip, invalid interval
polaz Mar 16, 2026
e1db06d
fix(range-tombstone): RT-only SST persistence, pruning, lint attrs
polaz Mar 16, 2026
f0c90ea
style: fix rustfmt formatting in interval_tree
polaz Mar 17, 2026
1f87efa
Merge branch 'main' into feat/#16-feat-range-tombstones--deleterange-…
polaz Mar 17, 2026
334890c
fix(range-tombstone): preserve sentinel seqno bounds, soft-reject ove…
polaz Mar 17, 2026
5077e58
chore: remove .forge from git tracking
polaz Mar 17, 2026
6841260
chore: add .claude to .gitignore
polaz Mar 17, 2026
9522778
chore: merge main into feature branch
polaz Mar 17, 2026
df7c0f4
fix(range-tombstone): use #[expect] lints, optimize query_suppression
polaz Mar 17, 2026
9a67ff2
fix(interval-tree): remove unfired unnecessary_box_returns expects
polaz Mar 17, 2026
51c3429
fix(range-tombstone): preserve sentinel seqno bounds, soft-reject ove…
polaz Mar 17, 2026
86a985f
docs(test): clarify Guard import is a trait dependency for .key()
polaz Mar 17, 2026
e525c10
docs(test): clarify Vec<Vec<u8>> PartialEq coercion for assertions
polaz Mar 17, 2026
12a23c6
fix(range-tombstone): correct flush clipping and RT-only metadata range
polaz Mar 17, 2026
2dbe5c6
fix(range-tombstone): write all RTs in flush mode without overlap filter
polaz Mar 17, 2026
28d419f
test(range-tombstone): add disjoint RT flush and compaction tests
polaz Mar 17, 2026
e59eb29
fix(range-tombstone): use max RT seqno for sentinel to avoid seqno=0 …
polaz Mar 17, 2026
21249f2
docs(range-tombstone): clarify RT-only flush and sentinel design deci…
polaz Mar 18, 2026
6371dac
docs(ci): add design pattern exclusions to instruction files
polaz Mar 18, 2026
1e381fc
docs(ci): generalize design decision rules in instruction files
polaz Mar 18, 2026
317b9d8
fix(range-tombstone): dedup compaction RTs and warn on oversized keys
polaz Mar 18, 2026
a40eb7d
refactor(range-tombstone): tighten visibility and remove dead code
polaz Mar 18, 2026
d486a98
docs(range-tombstone): hide flush_to_tables_with_rt from public docs
polaz Mar 18, 2026
0123988
fix(range-tombstone): correct reverse sort comparator for sweep-line
polaz Mar 18, 2026
5a7308d
fix(range-tombstone): remove unused Reverse import
polaz Mar 18, 2026
6860597
fix(range-tombstone): use unique sentinel seqno in RT-only tables
polaz Mar 18, 2026
3462b39
docs(range-tombstone): annotate table-skip scan complexity
polaz Mar 18, 2026
dc8f68d
fix(range-tombstone): use MAX_SEQNO for sentinel to prevent cross-tab…
polaz Mar 18, 2026
8874e46
fix(range-tombstone): widen flush table key_range to cover RT span
polaz Mar 18, 2026
0d34769
docs(range-tombstone): clarify sentinel visibility at SeqNo::MAX
polaz Mar 18, 2026
ddda2c2
docs(range-tombstone): clarify compaction fallback for max-length keys
polaz Mar 18, 2026
a51facd
fix(range-tombstone): remove dead max_rt_seqno and add #[must_use]
polaz Mar 18, 2026
979d126
fix(range-tombstone): restore item counts after sentinel and widen co…
polaz Mar 18, 2026
bc508a0
fix(range-tombstone): use SeqNo::MAX for sentinel and short-circuit t…
polaz Mar 18, 2026
4371b2b
fix(range-tombstone): revert sentinel to MAX_SEQNO and tighten writer…
polaz Mar 18, 2026
8b720cc
fix(range-tombstone): remove key_range widening to preserve inclusive…
polaz Mar 18, 2026
7ec78e1
fix(range-tombstone): restore flush key_range widening for disjoint R…
polaz Mar 18, 2026
b383dcc
test(range-tombstone): add regression test for disjoint RT multi-comp…
polaz Mar 18, 2026
66b044b
fix(range-tombstone): use seqno 0 for sentinel to prevent merge domin…
polaz Mar 18, 2026
87a06f8
test(range-tombstone): add sentinel masking regression test
polaz Mar 18, 2026
0b9a096
fix(range-tombstone): use lowest RT seqno for sentinel instead of 0
polaz Mar 18, 2026
61c5268
docs(range-tombstone): document internal trait rationale for required…
polaz Mar 19, 2026
7452427
fix(range-tombstone): keep sentinel in item counts for on-disk consis…
polaz Mar 19, 2026
755e241
Potential fix for pull request finding
polaz Mar 19, 2026
b483a8b
Potential fix for pull request finding
polaz Mar 19, 2026
97e64b1
fix(range-tombstone): align rt-only sentinel key with lowest seqno rt
polaz Mar 19, 2026
d9ed474
refactor(api): seal AbstractTree internals
polaz Mar 19, 2026
edd92f5
fix(format): bump fork disk semantics to V4
polaz Mar 19, 2026
34ce53f
perf(range): remove redundant RT pre-scan
polaz Mar 19, 2026
a065720
test(range): harden range tombstone safety coverage
polaz Mar 19, 2026
c81c251
perf(range): filter memtable RTs by query overlap
polaz Mar 19, 2026
ece7a2a
style(test): format safety regression assertions
polaz Mar 19, 2026
e8d1c4a
fix(range): harden RT metadata and compaction safety
polaz Mar 19, 2026
e6c457b
fix(flush): keep RT-widened tables in disjoint L0 runs
polaz Mar 19, 2026
9ee4dcd
fix(range): harden max-key clipping and review regressions
polaz Mar 20, 2026
bc72a94
chore(lints): add reason to interval-tree test expect
polaz Mar 20, 2026
eb61d8f
docs(readme): document fork disk format V4 boundary
polaz Mar 20, 2026
2c0b836
fix(api): properly seal AbstractTree implementations
polaz Mar 20, 2026
a6a1436
test(range): add RT tamper coverage and memtable must_use
polaz Mar 20, 2026
61cfda2
fix(range-tombstone): dedup flush range tombstones before persistence
polaz Mar 20, 2026
210b69f
docs(ci): clarify call-site scope in review instructions
polaz Mar 20, 2026
642148d
test(range-tombstone): add compaction clip gap regression test
polaz Mar 20, 2026
7710935
test(range-tombstone): add RT-only table sentinel recovery regression…
polaz Mar 20, 2026
56addb8
test(range-tombstone): assert specific error type in RT block tamper …
polaz Mar 20, 2026
00b5b9a
test(range-tombstone): disable compression in multi-table flush test
polaz Mar 20, 2026
8b083ad
fix(range-tombstone): disable compression in flush rotation test and …
polaz Mar 20, 2026
42d4726
docs(ci): scope design-decision exclusion to Tier 3-4 only
polaz Mar 20, 2026
261bbf9
fix(range-tombstone): pin compression in compaction rotation test and…
polaz Mar 20, 2026
c863073
perf(range-tombstone): dedup collected RTs before range iteration filter
polaz Mar 20, 2026
1065361
docs(range-tombstone): document newest-first search order for point t…
polaz Mar 20, 2026
c46710e
docs(test): clarify owned File auto-borrow for read_u64 call
polaz Mar 20, 2026
5d7ab2d
test(recovery): assert InvalidVersion error type in manifest version …
polaz Mar 20, 2026
67e6a27
fix(range-tombstone): add test expect_used lint, RT diagram, metrics …
polaz Mar 20, 2026
1564cdd
docs(range-tombstone): link RT metrics TODO to issue #34
polaz Mar 20, 2026
0ed9bf0
docs(ci): scope design analysis to PR-visible call chain and preserve…
polaz Mar 20, 2026
9b1aab8
style(range-tombstone): rename filter tests to what_condition_expecte…
polaz Mar 20, 2026
387617f
test(range-tombstone): add structural assertions for compaction and r…
polaz Mar 20, 2026
272 changes: 272 additions & 0 deletions .forge/16/analysis.md
# Issue #16: Range Tombstones / delete_range / delete_prefix

## Deep Architecture Analysis

**Date:** 2026-03-16
**Branch:** `feat/#16-feat-range-tombstones--deleterange--deleteprefix`
**Upstream reference:** fjall-rs/lsm-tree#2 (epic), fjall-rs/lsm-tree#242 (PR, OPEN but stalled)

---

## Context

### What exists today

The codebase has **point tombstones** (`ValueType::Tombstone`, `ValueType::WeakTombstone`) and a limited `drop_range` compaction strategy that drops SST files **fully contained** within a key range. There is no true range tombstone — a single marker that logically deletes all keys in `[start, end)`.

**Current deletion model:**
- `remove(key, seqno)` → writes a `Tombstone` entry to memtable
- `remove_weak(key, seqno)` → writes a `WeakTombstone` (single-delete)
- `drop_range(range)` → compaction strategy that physically drops fully-contained tables (no MVCC, no partial overlap handling)
- No `delete_range` or `delete_prefix` API
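
For orientation, the semantics a true range tombstone adds on top of this model can be sketched as follows. These are hypothetical illustration types, not this crate's API: a marker over `[start, end)` at a given seqno logically deletes every entry in that range written at a lower seqno.

```rust
// Hypothetical sketch of range-tombstone visibility: a marker over
// [start, end) suppresses point entries inside the range that were
// written at a lower seqno (newer writes survive, per MVCC).

#[derive(Debug, Clone)]
struct RangeTombstone {
    start: Vec<u8>, // inclusive
    end: Vec<u8>,   // exclusive
    seqno: u64,
}

impl RangeTombstone {
    /// True if the point entry (key, entry_seqno) is logically deleted.
    fn suppresses(&self, key: &[u8], entry_seqno: u64) -> bool {
        key >= self.start.as_slice()
            && key < self.end.as_slice()
            && entry_seqno < self.seqno
    }
}

fn main() {
    let rt = RangeTombstone {
        start: b"edge/x/".to_vec(),
        end: b"edge/y/".to_vec(),
        seqno: 100,
    };
    assert!(rt.suppresses(b"edge/x/42", 50)); // older write inside range: deleted
    assert!(!rt.suppresses(b"edge/x/42", 150)); // newer write survives
    assert!(!rt.suppresses(b"edge/z/1", 50)); // outside [start, end)
    println!("ok");
}
```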

### Why CoordiNode needs this

Graph operations like "drop all edges of node X" or "drop label Y" generate O(N) individual tombstones for N edges/properties. With range tombstones:
- **Single marker** replaces millions of point tombstones
- **Compaction efficiency** — one range tombstone suppresses entire key ranges without per-key processing
- **Label drop** becomes O(1) write instead of O(N)

---

## Upstream PR #242 — Deep Analysis

### Status: STALLED with unresolved architectural disagreement

**Timeline:**
- 2026-02-06: PR opened by `temporaryfix` (4049 additions, 15 commits)
- 2026-02-06: Author: "Will do benchmarks when I get the chance" — **benchmarks never done**
- 2026-02-10: Maintainer `marvin-j97` reviews — raises **principal concern about SST block format**
- 2026-02-10: Extended discussion about block format design
- 2026-02-12: Author shares GLORAN paper, suggesting the whole approach may be wrong
- 2026-02-12: Maintainer is "open to alternative implementations" (GLORAN-style)
- 2026-03-15: Author: "life has gotten in the way" — **work stalled**

### Maintainer's principal concern

**marvin-j97 objects to the dual block format** (`RangeTombstoneStart` + `RangeTombstoneEnd`):

> "The first thing that seems awkward to me is the duplication of range tombstones for start and end inside tables. Ideally you would just store a table's range tombstones using a DataBlock."

His reasoning:
1. DataBlocks are "tried and tested" — adding new block types is maintenance burden
2. RocksDB stores range tombstones "in data blocks without binary seek index" (cites blog post)
3. The 2D problem (range + seqno) is "pretty unavoidable" — range tombstones need in-memory deserialization on table open regardless (like RocksDB's fragmentation)
4. Suggests looking at **Pebble** for alternative approach

Author's defense:
- DataBlocks are optimized for key→value lookups, not overlap/cover checks
- Reverse MVCC needs streaming on `(end > current_key)` — impossible with only start-sorted data
- ByStart + ByEndDesc enables per-window `max_end` pruning and streaming reverse activation
- RocksDB also uses dedicated blocks, not regular data blocks
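
The per-window `max_end` pruning the author describes can be sketched roughly like this (hypothetical types; the PR's actual ByStart/ByEndDesc blocks are more involved): tombstones are grouped into start-sorted windows, each annotated with the maximum `end` it contains, so a point lookup can skip whole windows that cannot possibly cover the key.

```rust
// Per-window max_end pruning (sketch): a window whose max_end <= key
// cannot contain any tombstone covering key, so it is skipped wholesale
// before any per-tombstone check runs.

struct Window {
    max_end: Vec<u8>,             // max exclusive end in this window
    rts: Vec<(Vec<u8>, Vec<u8>)>, // (start, end) pairs, start-sorted
}

fn covering<'a>(windows: &'a [Window], key: &[u8]) -> Vec<&'a (Vec<u8>, Vec<u8>)> {
    windows
        .iter()
        .filter(|w| key < w.max_end.as_slice()) // prune whole window
        .flat_map(|w| w.rts.iter())
        .filter(|(s, e)| key >= s.as_slice() && key < e.as_slice())
        .collect()
}

fn main() {
    let windows = vec![
        Window {
            max_end: b"f".to_vec(),
            rts: vec![(b"a".to_vec(), b"c".to_vec()), (b"b".to_vec(), b"f".to_vec())],
        },
        Window {
            max_end: b"z".to_vec(),
            rts: vec![(b"g".to_vec(), b"z".to_vec())],
        },
    ];
    assert_eq!(covering(&windows, b"b").len(), 2); // both tombstones in window 0 cover "b"
    assert_eq!(covering(&windows, b"h").len(), 1); // window 0 pruned entirely
    println!("ok");
}
```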

**Unresolved:** the maintainer did not accept the defense; the conversation shifted to the GLORAN paper and then stalled.

### GLORAN paper (arxiv 2511.06061)

Both the maintainer and the author reference this paper as potentially superior:
- **Problem:** Per-level range tombstones cause 30% point lookup degradation with just 1% range deletes
- **Solution:** Global Range Record Index (LSM-DRtree) — central immutable index instead of per-level storage
- **Results:** 10.6× faster point lookups, 2.7× higher throughput
- **Maintainer's take:** "Such auxiliary indexes must be immutable/log-structured" — possible to add as new field in Version files (like value log)

### What this means for us

1. **PR #242 will NOT merge as-is** — maintainer wants different SST format
2. **No timeline for resolution** — author is busy, maintainer is open to alternatives but hasn't specified one
3. **Cherry-picking is high-risk** — we'd inherit a contested design that upstream may reject/replace
4. **The PR base is stale** — based on pre-3.1.0 upstream; upstream has diverged significantly since

---

## Fork Divergence Analysis

### Our fork vs upstream main
```text
origin/main vs upstream/main:
45 files changed, 224 insertions(+), 3,789 deletions(-)
```
Our fork carries 100+ commits adding: zstd compression, intra-L0 compaction, size cap enforcement, verify_integrity, SequenceNumberGenerator trait, multi_get, contains_prefix, seqno-aware seek, copilot CI, release-plz.

### Our fork vs PR #242 branch
```text
origin/main vs upstream-pr-242:
97 files changed, 4,568 insertions(+), 6,096 deletions(-)
```
Massive diff because:
1. PR #242 is based on OLD upstream (pre-3.1.0)
2. Our fork has features upstream doesn't have
3. Upstream main has moved past the PR base

**New files in PR #242:**
- `src/active_tombstone_set.rs` (403 lines)
- `src/memtable/interval_tree.rs` (513 lines)
- `src/range_tombstone.rs` (343 lines)
- `src/range_tombstone_filter.rs` (281 lines)
- `src/table/range_tombstone_block_by_end.rs` (393 lines)
- `src/table/range_tombstone_block_by_start.rs` (664 lines)
- `src/table/range_tombstone_encoder.rs` (365 lines)
- `tests/range_tombstone.rs` (447 lines)

**Heavily modified files (conflict-prone):**
- `src/abstract_tree.rs` (renamed to `abstract.rs` in PR!)
- `src/tree/mod.rs` — both we and PR modify heavily
- `src/compaction/stream.rs` — PR strips our `StreamFilter`
- `src/compaction/worker.rs` — both modify
- `src/memtable/mod.rs` — both modify
- `src/table/mod.rs` — both modify
- `src/blob_tree/mod.rs` — both modify
- `src/config/mod.rs` — PR removes features we added
- `src/compression.rs` — PR removes our zstd support
- `src/seqno.rs` — PR removes our SequenceNumberGenerator

**Critical:** The PR diff REMOVES many of our fork features because it's based on older upstream. Cherry-picking individual commits would require resolving conflicts in virtually every modified file.

---

## Revised Implementation Approaches

### Approach A: Selective port of PR #242 core logic — **RECOMMENDED**

Extract the **algorithmic core** from PR #242 (types, interval tree, active set, filter) and integrate into our fork manually. Skip the contested SST block format — use DataBlock-based storage as the maintainer prefers. This positions us for eventual upstream convergence regardless of which SST format upstream chooses.

**What to port (new files, minimal conflicts):**
1. `RangeTombstone` struct + `ActiveTombstoneSet` sweep-line tracker (~750 LOC)
2. `IntervalTree` for memtable range tombstones (~500 LOC)
3. `RangeTombstoneFilter` for range/prefix iteration (~280 LOC)

**What to write ourselves:**
4. SST persistence — use existing `DataBlock` with `key=start, value=end|seqno` (follows maintainer's direction)
5. Integration into our fork's `tree/mod.rs`, `abstract_tree.rs`, `compaction/` (our code differs too much to cherry-pick)
6. `delete_range` / `delete_prefix` API on `AbstractTree`
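
Step 4's DataBlock layout can be sketched as follows. The encoding details (u16 length prefix, little-endian seqno) are assumptions for illustration, not a committed format:

```rust
// Hypothetical sketch of step 4: persist each range tombstone as a
// DataBlock entry with key = start and value = end ++ seqno, so the
// block inherits DataBlock's start-sorted key order for free.

fn to_entry(start: &[u8], end: &[u8], seqno: u64) -> (Vec<u8>, Vec<u8>) {
    let mut value = Vec::with_capacity(2 + end.len() + 8);
    value.extend_from_slice(&(end.len() as u16).to_le_bytes());
    value.extend_from_slice(end);
    value.extend_from_slice(&seqno.to_le_bytes());
    (start.to_vec(), value)
}

fn from_entry(key: &[u8], value: &[u8]) -> Option<(Vec<u8>, Vec<u8>, u64)> {
    // Bounds-checked decode: any truncated value yields None.
    let end_len = u16::from_le_bytes(value.get(0..2)?.try_into().ok()?) as usize;
    let end = value.get(2..2 + end_len)?.to_vec();
    let seqno = u64::from_le_bytes(value.get(2 + end_len..2 + end_len + 8)?.try_into().ok()?);
    Some((key.to_vec(), end, seqno))
}

fn main() {
    let (k, v) = to_entry(b"edge/x/", b"edge/y/", 7);
    let (start, end, seqno) = from_entry(&k, &v).unwrap();
    assert_eq!(start, b"edge/x/");
    assert_eq!(end, b"edge/y/");
    assert_eq!(seqno, 7);
    println!("ok");
}
```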

**Pros:**
- Reuses proven algorithms (interval tree, sweep-line) without inheriting contested SST format
- Follows maintainer's preferred direction (DataBlock reuse) — better upstream merge path
- New files (types, interval tree, active set) can be ported with zero conflicts
- Integration code written against OUR fork, not upstream's old base
- Benchmarks can be done before committing to SST format

**Cons:**
- More manual work than pure cherry-pick (~3-4 days vs 2-3)
- SST format may still diverge from whatever upstream eventually picks
- Need to validate the ported algorithms ourselves

**Estimate:** 3-4 days

**Upstream-compatible:** Partially — core algorithms match; SST format aligned with maintainer preference but final upstream choice unknown.

---

### Approach B: Memtable-only + DataBlock persistence (minimal viable)

Implement range tombstones in memtable with a simple sorted structure. Persist to SSTs using existing DataBlock (no new block types). Skip sweep-line optimization — use simpler linear scan for iteration filtering. Good enough for CoordiNode's initial needs, easy to upgrade later.

**How it works:**
1. `RangeTombstone` struct with `[start, end)` + seqno
2. `BTreeMap<(start, Reverse(seqno)), RangeTombstone>` in Memtable (simpler than full interval tree)
3. Point reads: linear scan memtable tombstones for suppression (O(T) where T = tombstone count, typically small)
4. Range iteration: collect tombstones, suppress matching keys inline
5. SST: store tombstones in DataBlock at end of table (key=start, value=end|seqno)
6. Compaction: read tombstone DataBlock, clip to table range, write to output
7. GC: evict tombstones below watermark at bottom level
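
Steps 2-3 can be sketched as follows (hypothetical types, not the crate's code): the `Reverse(seqno)` key component makes same-start markers iterate newest-first, and point reads do the plain O(T) scan the approach accepts.

```rust
use std::cmp::Reverse;
use std::collections::BTreeMap;

// Approach B's memtable structure, sketched: tombstones keyed by
// (start, Reverse(seqno)); suppression is a linear scan over all T
// tombstones, which is fine while T stays small.

#[derive(Debug)]
struct RangeTombstone {
    start: Vec<u8>, // inclusive
    end: Vec<u8>,   // exclusive
    seqno: u64,
}

#[derive(Default)]
struct MemtableRts {
    map: BTreeMap<(Vec<u8>, Reverse<u64>), RangeTombstone>,
}

impl MemtableRts {
    fn insert(&mut self, rt: RangeTombstone) {
        self.map.insert((rt.start.clone(), Reverse(rt.seqno)), rt);
    }

    /// O(T) scan: is (key, entry_seqno) deleted by any tombstone?
    fn is_suppressed(&self, key: &[u8], entry_seqno: u64) -> bool {
        self.map.values().any(|rt| {
            key >= rt.start.as_slice() && key < rt.end.as_slice() && entry_seqno < rt.seqno
        })
    }
}

fn main() {
    let mut rts = MemtableRts::default();
    rts.insert(RangeTombstone { start: b"a".to_vec(), end: b"m".to_vec(), seqno: 10 });
    assert!(rts.is_suppressed(b"c", 5));
    assert!(!rts.is_suppressed(b"c", 20)); // newer than the tombstone
    assert!(!rts.is_suppressed(b"z", 5)); // outside [a, m)
    println!("ok");
}
```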

**Pros:**
- Simpler implementation (~1.5-2K LOC)
- No new block types — uses proven DataBlock format
- Good enough for CoordiNode's workload (tens of tombstones, not millions)
- Easy to upgrade to interval tree / sweep-line later if needed
- Aligns with maintainer's DataBlock preference

**Cons:**
- Linear scan for suppression = O(T) per point read (fine for small T, bad for thousands of tombstones)
- No sweep-line = reverse iteration is naive
- Need to write everything from scratch (no code reuse from PR)
- No table-skip optimization initially

**Estimate:** 2-3 days

**Upstream-compatible:** Partially — SST format aligned with maintainer, but algorithms differ.

---

### Approach C: Wait for upstream resolution

Don't implement now. Monitor PR #242 discussion. Use `drop_range` + point tombstones as workaround.

**Pros:**
- Zero fork divergence
- Zero effort now

**Cons:**
- Blocks CoordiNode's efficient bulk deletion (P0 requirement)
- Upstream resolution could take months (author stalled, no consensus on design)
- `drop_range` is not MVCC-safe — unusable for transactional graph operations

**Estimate:** 0 days now, unknown later

**Upstream-compatible:** Perfect — but at the cost of blocking our product.

---

### Approach D: Full custom (GLORAN-inspired)

Implement range tombstones using GLORAN's global index approach. Independent of upstream.

**Pros:**
- Potentially best performance (10x point lookups per paper)
- Novel approach both maintainer and PR author seem interested in

**Cons:**
- Research-grade, no production implementation exists
- Massive effort (5-8 days) with high uncertainty
- Permanent fork divergence
- Paper is from 2025, may have undiscovered issues

**Estimate:** 5-8 days

**Upstream-compatible:** No.

---

## Comparison Matrix

| Criteria | A: Selective Port | B: Minimal Viable | C: Wait | D: GLORAN |
|---|---|---|---|---|
| **Completeness** | Full | Adequate | None | Full |
| **MVCC correctness** | Yes | Yes | No | Yes |
| **SST persistence** | DataBlock | DataBlock | No | Custom |
| **Point read overhead** | O(log T) | O(T) | None | O(log² N) |
| **Sweep-line iteration** | Yes | No | N/A | Yes |
| **Table skip optimization** | Yes | No | No | Yes |
| **Upstream alignment** | Good | Moderate | Perfect | None |
| **Implementation effort** | 3-4d | 2-3d | 0d | 5-8d |
| **Fork divergence risk** | Low-Medium | Low | None | Very High |
| **CoordiNode P0 unblock** | Yes | Yes | No | Yes |

---

## Recommendation: Approach A — Selective Port of PR #242 Core

**Why A over the original "cherry-pick everything":**

1. **PR #242 will not merge as-is** — maintainer rejected the SST format. Cherry-picking inherits a dead design
2. **PR base is stale** — based on pre-3.1.0 upstream, conflict resolution across 25+ files is a week of work, not 2-3 days
3. **Core algorithms are solid** — `IntervalTree`, `ActiveTombstoneSet`, `RangeTombstoneFilter` are well-designed and can be ported as standalone files with zero conflicts
4. **DataBlock SST format** follows maintainer's direction — better upstream merge path than the dual-block approach
5. **We control integration** — writing our own glue code against our fork's actual state avoids the massive conflict resolution

**Execution plan:**
1. Port `RangeTombstone` + `ActiveTombstoneSet` types (new files, no conflicts)
2. Port `IntervalTree` for memtable (new file, no conflicts)
3. Port `RangeTombstoneFilter` for iteration (new file, no conflicts)
4. Design DataBlock-based SST persistence (new block type value = `5`, or reuse DataBlock with sentinel key prefix)
5. Integrate into memtable, point reads, range iteration, flush, compaction
6. Add `remove_range` / `remove_prefix` to `AbstractTree`
7. Write tests (port + adapt from PR #242's test suite)
8. Benchmark point read regression
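
Step 6's `remove_prefix` can be lowered onto `remove_range` with the standard trick of computing the smallest key greater than every key sharing the prefix (increment the last non-`0xFF` byte). A sketch with a hypothetical helper name:

```rust
// Hypothetical helper: compute the exclusive upper bound for a prefix
// delete, so remove_prefix(p) == remove_range(p..upper_bound(p)).
// Returns None when no finite bound exists (all bytes are 0xFF).

fn prefix_upper_bound(prefix: &[u8]) -> Option<Vec<u8>> {
    let mut end = prefix.to_vec();
    while let Some(last) = end.last_mut() {
        if *last < 0xFF {
            *last += 1;
            return Some(end);
        }
        end.pop(); // 0xFF rolls over; carry into the previous byte
    }
    None // prefix was empty or all 0xFF: range is unbounded above
}

fn main() {
    assert_eq!(prefix_upper_bound(b"abc"), Some(b"abd".to_vec()));
    assert_eq!(prefix_upper_bound(&[0x61, 0xFF]), Some(vec![0x62]));
    assert_eq!(prefix_upper_bound(&[0xFF]), None);
    println!("ok");
}
```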

**Key risks:**
- The ported algorithms may have bugs we don't catch (mitigate: thorough testing)
- Upstream may pick a different SST format than DataBlock (mitigate: encapsulate persistence behind trait/module boundary)
- Performance regression on point reads (mitigate: fast path when zero range tombstones exist, as PR #242 already does)
27 changes: 27 additions & 0 deletions .forge/16/known-gaps.md
# Issue #16: Range Tombstones — Known Gaps & Limitations

**Date:** 2026-03-16
**Status:** All critical bugs fixed. Remaining items are optimization opportunities.

## Fixed Bugs (this session)

1. **RT-only flush** — Writer deleted empty table before writing RTs. Fixed: derive key range from tombstones when item_count==0.
2. **MultiWriter rotation** — RTs only written to last output table. Fixed: clip RTs to each table's key range via `intersect_opt()` during rotation and finish.
3. **Compaction clipping** — RTs propagated unclipped. Fixed: MultiWriter now clips via `set_range_tombstones()`.
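
The clipping in items 2-3 amounts to intersecting each tombstone with the output table's key span. A minimal stand-in for `intersect_opt()` (treating both ranges as half-open; the real code also has to handle the table range's inclusive upper bound):

```rust
// Intersect a tombstone's [start, end) with a table span [lo, hi),
// returning None when they are disjoint. Simplified stand-in for
// intersect_opt(): both ranges are treated as half-open here.

fn intersect_opt(rt: (&[u8], &[u8]), span: (&[u8], &[u8])) -> Option<(Vec<u8>, Vec<u8>)> {
    let start = rt.0.max(span.0);
    let end = rt.1.min(span.1);
    if start < end {
        Some((start.to_vec(), end.to_vec()))
    } else {
        None // tombstone lies entirely outside this table
    }
}

fn main() {
    assert_eq!(
        intersect_opt((b"a", b"z"), (b"m", b"q")),
        Some((b"m".to_vec(), b"q".to_vec()))
    );
    assert_eq!(intersect_opt((b"a", b"c"), (b"m", b"q")), None);
    println!("ok");
}
```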

## Remaining Known Limitations

### 1. No WAL persistence for range tombstones
Range tombstones in the active memtable are lost on crash (before flush). This is consistent with the crate's design — it does not ship a WAL. The caller (fjall) manages WAL-level durability.

### 2. No table-skip optimization for multi-table runs
When a run has multiple tables, the `RunReader` path (in `range.rs`) does not apply the table-skip optimization. Only single-table runs get the `is_covered` check. This is a performance optimization gap, not a correctness issue — the `RangeTombstoneFilter` still correctly filters suppressed items.

### 3. Linear scan for SST range tombstone suppression in point reads
`is_suppressed_by_range_tombstones` iterates ALL SST tables and ALL their range tombstones linearly. For workloads with many tables and many range tombstones, this could degrade point read latency. Consider: building an in-memory interval tree from SST tombstones on version change (similar to GLORAN paper's approach).

### 4. Range tombstone block format is not upstream-compatible
We use a raw wire format (`[start_len:u16][start][end_len:u16][end][seqno:u64]`) with `BlockType::RangeTombstone = 4`. This is a fork-only format. When upstream settles on a format, we'll need a migration.
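
A minimal encoder/decoder for one record of this wire format (little-endian integers assumed for illustration; the real implementation lives behind `BlockType::RangeTombstone` and is not reproduced here):

```rust
// One record of the fork-only format:
// [start_len:u16][start][end_len:u16][end][seqno:u64]
// Decoding is bounds-checked so a truncated block yields None.

fn encode_rt(start: &[u8], end: &[u8], seqno: u64) -> Vec<u8> {
    let mut out = Vec::new();
    out.extend_from_slice(&(start.len() as u16).to_le_bytes());
    out.extend_from_slice(start);
    out.extend_from_slice(&(end.len() as u16).to_le_bytes());
    out.extend_from_slice(end);
    out.extend_from_slice(&seqno.to_le_bytes());
    out
}

fn decode_rt(buf: &[u8]) -> Option<(Vec<u8>, Vec<u8>, u64)> {
    let mut pos = 0usize;
    let start_len = u16::from_le_bytes(buf.get(pos..pos + 2)?.try_into().ok()?) as usize;
    pos += 2;
    let start = buf.get(pos..pos + start_len)?.to_vec();
    pos += start_len;
    let end_len = u16::from_le_bytes(buf.get(pos..pos + 2)?.try_into().ok()?) as usize;
    pos += 2;
    let end = buf.get(pos..pos + end_len)?.to_vec();
    pos += end_len;
    let seqno = u64::from_le_bytes(buf.get(pos..pos + 8)?.try_into().ok()?);
    Some((start, end, seqno))
}

fn main() {
    let rec = encode_rt(b"a", b"b", 9);
    assert_eq!(decode_rt(&rec), Some((b"a".to_vec(), b"b".to_vec(), 9)));
    assert_eq!(decode_rt(&rec[..rec.len() - 1]), None); // truncated
    println!("ok");
}
```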

### 5. No range tombstone metrics
Unlike point tombstones (`tombstone_count`, `weak_tombstone_count`), range tombstones are tracked in table metadata (`range_tombstone_count`) but not surfaced to the `Metrics` system (behind `#[cfg(feature = "metrics")]`).