perf: batch multi_get + PinnableSlice + WriteBatch #214
Conversation
Phase 1 — Batch multi_get optimization:
- Pre-compute bloom filter hashes once for all keys
- Sort keys by comparator order for sequential I/O
- Check memtables first (active + sealed), batch remaining
- Batch table lookups through L0-L6 with sorted key access
- Small batches (≤2 keys) use simple per-key path

Phase 2 — PinnableSlice zero-copy reads:
- New PinnableSlice enum: Pinned (block cache) | Owned (memtable)
- New get_pinned() API on AbstractTree trait
- Pinned variant holds Arc<Block> reference preventing eviction
- Values from ByteView::slice() are already zero-copy sub-views
- Default trait impl wraps get() as Owned for BlobTree compat
- Tree overrides with full pinned path through tables

Phase 3 — WriteBatch atomic writes:
- New WriteBatch struct: insert/remove/remove_weak/merge ops
- apply_batch() on AbstractTree: single seqno for atomicity
- Memtable::insert_batch(): single fetch_add + single fetch_max
- Single version-history lock acquisition for entire batch

Dependencies:
- Update structured-zstd 0.0.1 → 0.0.6

Closes #143
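As orientation, here is a hedged usage sketch of the three additions. The method names (`WriteBatch`, `apply_batch`, `get_pinned`, `multi_get`) come from this PR, but the exact signatures, seqno handling, and the `multi_get` call shape are assumptions, not the crate's confirmed API:

```rust
use lsm_tree::{AbstractTree, Config, WriteBatch};

fn main() -> lsm_tree::Result<()> {
    let folder = tempfile::tempdir().expect("tempdir");
    let tree = Config::new(&folder).open()?;

    // Phase 3: stage operations, then apply them all under one shared seqno.
    let mut batch = WriteBatch::new();
    batch.insert("a", "1");
    batch.insert("b", "2");
    batch.remove("c");
    let (_bytes_added, _memtable_size) = tree.apply_batch(batch, 0)?;

    // Phase 2: `Pinned` shares the decompressed block buffer, so no value copy.
    if let Some(slice) = tree.get_pinned("a", 1)? {
        assert_eq!(slice.as_ref(), b"1");
    }

    // Phase 1: one batched call shares hashing, sorting, and memtable checks.
    let _values = tree.multi_get(["a", "b", "c"], 1)?;

    Ok(())
}
```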
No actionable comments were generated in the recent review. 🎉
📝 Walkthrough

Adds batched multi_get with bloom-batched SST probes, a PinnableSlice zero-copy read type and get_pinned API, a WriteBatch bulk-write API plus memtable bulk insert, table bloom/point-read refactors exposing block-backed reads, new tests and Criterion benchmarks, a README tweak, and bumps the optional structured-zstd dependency.
Sequence Diagram(s)

sequenceDiagram
participant Client as Client
participant Tree as Tree
participant Memtable as Memtable
participant SST as Table/SST
participant BlockCache as BlockCache
Client->>Tree: get_pinned(key, seqno)
Tree->>Memtable: probe active & sealed memtables
alt Found in memtable
Memtable-->>Tree: InternalValue (owned)
Tree-->>Client: PinnableSlice::Owned
else Not in memtable
Tree->>SST: batched bloom checks & table probe (get_with_block)
SST->>BlockCache: load & decompress block
BlockCache-->>Tree: (InternalValue, Block)
Tree-->>Client: PinnableSlice::Pinned
end
sequenceDiagram
participant Client as Client
participant Batch as WriteBatch
participant Tree as Tree
participant MemtableInst as Memtable
Client->>Batch: build batch (insert/remove/merge)
Client->>Tree: apply_batch(batch, seqno)
Tree->>Batch: materialize(seqno) -> Vec<InternalValue>
Tree->>MemtableInst: insert_batch(items)
MemtableInst-->>Tree: (bytes_added, new_size)
Tree-->>Client: (bytes_added, new_size)
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks: ✅ Passed checks (5 passed)
Codecov Report: ❌ Patch coverage is
Pull request overview
This PR introduces three major read/write-path optimizations for the LSM tree: (1) a batch-optimized multi_get() to reduce bloom/hash work and improve I/O locality, (2) a get_pinned() API via PinnableSlice to support zero-copy-ish value access for table reads, and (3) an atomic WriteBatch API to apply multiple writes under a shared seqno with reduced locking/accounting overhead.
Changes:
- Implement batch `multi_get()` with key sorting, precomputed bloom hashes, memtable-first filtering, and batched SST lookups (see the sketch below).
- Add `PinnableSlice` + `AbstractTree::get_pinned()` and table APIs to return a value along with its containing block.
- Add `WriteBatch` + `AbstractTree::apply_batch()` and memtable bulk insert (`insert_batch`) for more efficient batched writes; bump `structured-zstd` to 0.0.6.
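As a concrete illustration of the flow in the first bullet, a stand-in sketch: a `BTreeMap` plays the memtable, and the real bloom hashing and batched SST probe are elided, so this is illustrative only, not the crate's internals.

```rust
use std::collections::BTreeMap;

fn multi_get(
    memtable: &BTreeMap<Vec<u8>, Vec<u8>>,
    keys: Vec<Vec<u8>>,
) -> Vec<(Vec<u8>, Option<Vec<u8>>)> {
    // Precompute one hash per key (stand-in for the bloom filter hash).
    let mut work: Vec<(Vec<u8>, u64)> = keys
        .into_iter()
        .map(|k| {
            let h = k
                .iter()
                .fold(0u64, |a, &b| a.wrapping_mul(31).wrapping_add(u64::from(b)));
            (k, h)
        })
        .collect();
    // Sort in comparator order so the table phase can read blocks sequentially.
    work.sort_by(|a, b| a.0.cmp(&b.0));

    let mut out = Vec::with_capacity(work.len());
    let mut remaining = Vec::new();
    for (k, h) in work {
        match memtable.get(&k) {
            Some(v) => out.push((k, Some(v.clone()))), // memtable hit: resolved early
            None => remaining.push((k, h)),            // forwarded to the batched SST phase
        }
    }
    for (k, _hash) in remaining {
        out.push((k, None)); // batched table probe elided in this sketch
    }
    out
}
```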
Reviewed changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| tests/write_batch.rs | Adds coverage for batch insert/remove semantics and flush survivability. |
| tests/pinnable_slice.rs | Adds coverage for get_pinned() behavior across memtable vs disk and basic PinnableSlice APIs. |
| src/write_batch.rs | Implements WriteBatch builder + materialization into InternalValues. |
| src/tree/mod.rs | Implements get_pinned(), rewrites multi_get() to a batched path, and adds append_batch(). |
| src/table/mod.rs | Adds get_with_block() / point_read_with_block() to return the block alongside point-read results. |
| src/pinnable_slice.rs | Introduces PinnableSlice enum and its public API surface. |
| src/memtable/mod.rs | Adds Memtable::insert_batch() with single-pass size/seqno accounting. |
| src/lib.rs | Wires new modules and re-exports PinnableSlice and WriteBatch. |
| src/blob_tree/mod.rs | Implements apply_batch() for BlobTree by delegating to the index tree. |
| src/abstract_tree.rs | Extends the trait with get_pinned() (default impl) and required apply_batch(). |
| README.md | Updates feature list to mention batch multi_get, PinnableSlice, and WriteBatch. |
| Cargo.toml | Updates structured-zstd dependency version to 0.0.6. |
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@Cargo.toml`:
- Line 38: The Cargo.toml dependency line for structured-zstd references a
non-existent version (0.0.6); update the structured-zstd dependency entry to use
a published version (e.g., set version = "0.0.3") so the crate resolves
correctly—look for the structured-zstd dependency declaration in Cargo.toml and
change its version value accordingly.
In `@src/memtable/mod.rs`:
- Around line 230-233: The loop currently consumes items with `for item in
items` which prevents reuse; change it to iterate by reference (e.g., `for item
in &items` or `for item in items.iter()`) in the block where you build
`InternalKey` and call `self.items.insert(&key, &item.value)` so you don't move
out of `items`; adjust any field accesses (e.g., `item.key` -> `&item.key` or
use references) or clone only the minimal fields if needed to satisfy ownership
for `InternalKey::new` and `self.items.insert` while keeping `items` intact for
later use.
In `@src/table/mod.rs`:
- Around line 325-409: get_with_block duplicates the filter-loading and
bloom-check logic present in get; extract that shared logic into a private
helper (suggest name check_bloom_filter or load_filter_block) which performs the
pinned/index/unpinned resolution using load_block + FilterBlock::new and returns
Option<Cow<'_, FilterBlock>> or a bool signalling whether to proceed; replace
the duplicated block in both get() and get_with_block() with a call to this
helper and ensure metrics updates and maybe_contains_hash calls are moved into
the helper while get_with_block continues to call
point_read_with_block/point_read as before.
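For the `insert_batch` finding above, a minimal runnable sketch of the borrow-instead-of-move loop; `Item` and the vector-backed store are stand-ins for the real `InternalValue` and skiplist types:

```rust
// `Item` is a stand-in for the crate's InternalValue; names are assumptions.
struct Item {
    key: Vec<u8>,
    value: Vec<u8>,
}

// Iterate by reference so `items` stays intact for use after the loop.
fn insert_batch(store: &mut Vec<(Vec<u8>, Vec<u8>)>, items: Vec<Item>) -> (usize, Vec<Item>) {
    let mut bytes_added = 0usize;
    for item in &items {
        // Borrowing (`&items`) instead of `for item in items` avoids moving the batch.
        bytes_added = bytes_added.saturating_add(item.key.len() + item.value.len());
        store.push((item.key.clone(), item.value.clone())); // clone only what the store keeps
    }
    (bytes_added, items) // the original batch is still available to the caller
}
```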
WriteBatch:
- blob tree KV separation roundtrip
- merge operand concatenation
- remove_weak tombstone
- multi_get after batch spanning disk + memtable

PinnableSlice:
- blob tree returns Owned (not Pinned)
- range tombstone suppression
- after compaction returns Pinned (L1+ data)
- sealed memtable returns Owned

multi_get:
- range tombstone suppression
- multi-level (memtable + L0 + L1) with shadowing
- large batch (500 keys) reverse order from disk
- Extract check_bloom() in Table to DRY filter-loading logic shared between get() and get_with_block()
- Use saturating_add for batch size accumulation in insert_batch
- Drop version_history read lock before insert_batch to reduce lock contention during large batches
- Reword PinnableSlice docs: Block keeps buffer alive via refcounted Slice, does not prevent cache eviction
- Add integration tests: WriteBatch with blob tree, merge ops, remove_weak; PinnableSlice with blob tree, range tombstones, compaction, sealed memtable; multi_get with range tombstones, multi-level shadowing, 500-key reverse-order batch
⚠️ Performance Alert ⚠️
Possible performance regression was detected for benchmark 'lsm-tree db_bench'.
Benchmark result of this commit is worse than the previous benchmark result exceeding threshold 1.15.
| Benchmark suite | Current: a2c1f56 | Previous: cf5d5bd | Ratio |
|---|---|---|---|
| fillseq | 839387.4601232084 ops/sec (normalized) | 1148537.5894635925 ops/sec (normalized) | 1.37 |
| fillrandom | 463627.8850158264 ops/sec (normalized) | 663695.9309610558 ops/sec (normalized) | 1.43 |
| readrandom | 239327.4587837826 ops/sec (normalized) | 311914.1436192461 ops/sec (normalized) | 1.30 |
| seekrandom | 170547.86670630827 ops/sec (normalized) | 212174.6412340675 ops/sec (normalized) | 1.24 |
| overwrite | 481982.997678611 ops/sec (normalized) | 687074.7425792207 ops/sec (normalized) | 1.43 |
| mergerandom | 355798.78013906063 ops/sec (normalized) | 414073.5132640846 ops/sec (normalized) | 1.16 |
| readwhilewriting | 196864.7817348204 ops/sec (normalized) | 283433.1104358807 ops/sec (normalized) | 1.44 |
This comment was automatically generated by workflow using github-action-benchmark.
CC: @polaz
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/table/mod.rs (1)
245-260: ⚠️ Potential issue | 🟡 Minor

Remove the seqno parameter inconsistency or document it.

Line 247 passes `iter.seek(key, seqno)` while line 1068 uses `iter.seek(key, crate::seqno::MAX_SEQNO)`. However, both paths are functionally equivalent because all filter partitions are written with hardcoded `seqno=0`, making the seqno parameter irrelevant to partition selection. For clarity, unify these calls to use `MAX_SEQNO` consistently (or document why snapshot semantics apply here).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/table/mod.rs` around lines 245 - 260, The call to iter.seek in the pinned filter path uses a dynamic seqno (iter.seek(key, seqno)) while elsewhere the code uses crate::seqno::MAX_SEQNO; unify these to avoid confusion by changing the pinned_filter_index path to call iter.seek(key, crate::seqno::MAX_SEQNO) (or, if you prefer not to change behavior, add a comment in the pinned_filter_index block explaining that filter partitions are written with seqno=0 and seqno is irrelevant so MAX_SEQNO is used for snapshot semantics); update references in the pinned_filter_index handling around iter.seek, filter_idx.iter, and related materialize/seek logic to reflect the unified usage.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/tree/mod.rs`:
- Around line 1733-1745: append_batch currently clones active_memtable and drops
the version_history read guard before calling insert_batch, allowing
rotate_memtable to seal the memtable mid-batch; change append_batch to hold the
version_history read guard for the duration of the insert by keeping the guard
value (the RwLockReadGuard from self.version_history.read()) alive while calling
insert_batch on the cloned active_memtable (mirror the pattern used in
append_entry), so that the read lock prevents rotate_memtable from moving the
memtable into sealed_memtables until the batch insert finishes.
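A minimal sketch of the guard-holding pattern this comment asks for, using `std::sync::RwLock` and placeholder types; the crate's actual fields and types differ:

```rust
use std::sync::{Arc, RwLock};

// Placeholder types: the point is only the lock-ordering pattern.
struct Tree {
    version_history: Arc<RwLock<u64>>, // stand-in for the real version state
    active_memtable: Arc<RwLock<Vec<(Vec<u8>, Vec<u8>)>>>,
}

impl Tree {
    fn append_batch(&self, items: Vec<(Vec<u8>, Vec<u8>)>) {
        // Keep the read guard alive for the whole insert: rotate_memtable needs
        // the write lock, so it cannot seal the active memtable mid-batch.
        let _guard = self.version_history.read().expect("poisoned");
        let mut memtable = self.active_memtable.write().expect("poisoned");
        memtable.extend(items);
        // `_guard` drops here, after the batch is fully inserted.
    }
}
```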
---
Outside diff comments:
In `@src/table/mod.rs`:
- Around line 245-260: The call to iter.seek in the pinned filter path uses a
dynamic seqno (iter.seek(key, seqno)) while elsewhere the code uses
crate::seqno::MAX_SEQNO; unify these to avoid confusion by changing the
pinned_filter_index path to call iter.seek(key, crate::seqno::MAX_SEQNO) (or, if
you prefer not to change behavior, add a comment in the pinned_filter_index
block explaining that filter partitions are written with seqno=0 and seqno is
irrelevant so MAX_SEQNO is used for snapshot semantics); update references in
the pinned_filter_index handling around iter.seek, filter_idx.iter, and related
materialize/seek logic to reflect the unified usage.
- Hold version_history read guard during append_batch to prevent rotate_memtable from sealing mid-batch (data loss race)
- Reword PinnableSlice field/method docs: block keeps buffer alive via refcount, does not prevent cache eviction
- Fix get_pinned/apply_batch doctests: pass &folder not folder
- Unify filter partition seek to MAX_SEQNO (seqno is irrelevant for partition selection, all partitions written with seqno=0)
- Add code comment explaining L0 seqno fast-path in pinned path
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/pinnable_slice.rs`:
- Around line 24-33: The docs currently describe the `Pinned` variant as a
"block cache" signal which implies a cache hit; update the comments for
`PinnableSlice` (including the `Pinned` variant description and any doc blocks
around `is_pinned()` and `PinnableSlice::as_ref`) to instead explain that
`Pinned` means the slice shares/pins the underlying decompressed block buffer
(holds a `Block` clone with a refcounted `Slice`) and therefore keeps that
buffer alive for the `PinnableSlice` lifetime; mention `Tree::get_pinned()` and
`Table::get_with_block()` semantics so callers understand `Pinned` can be
created even when the block was just loaded from disk, and ensure the wording
focuses on sharing/pinning and lifetime/refcount behavior rather than cache-hit
semantics.
In `@src/table/mod.rs`:
- Around line 245-265: The code path that checks pinned_filter_index currently
falls through when iter.next() returns None, causing an unnecessary point lookup
and clearing has_filter; instead treat a missing filter partition as a definite
miss: inside bloom_may_contain_key (the function containing the
pinned_filter_index handling), after iter.seek(...) if iter.next() is None, set
has_filter true (so filter metrics are recorded) and immediately return
Ok(false) to short-circuit the lookup; apply the same change to the second
identical pinned_filter_index branch elsewhere in this file.
In `@src/tree/mod.rs`:
- Around line 1730-1749: append_batch currently calls Memtable::insert_batch
which inserts entries one-by-one and can be observed partially by concurrent
get(seqno) readers; fix by making the batch atomic: in append_batch /
Memtable::insert_batch add a publish barrier (e.g., write all items into a
private buffer or mark them as uncommitted, then atomically flip a committed
flag or advance a visible_seqno) so readers (get) only see the new entries after
the commit flip; locate append_batch, Memtable::insert_batch, get, and any
seqno/visible_seqno state and implement the two-phase insert+commit (or document
that callers must call apply_batch/publish after append_batch) so a racing get
cannot observe a prefix of the batch.
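A rough sketch of the two-phase insert-then-publish idea, using a hypothetical `visible_seqno` watermark; `AtomicU64::fetch_max` is stable std, but everything else here is illustrative, not the crate's implementation:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

struct SeqnoState {
    next_seqno: AtomicU64,
    visible_seqno: AtomicU64,
}

impl SeqnoState {
    fn apply_batch(&self, n_items: u64) {
        let seqno = self.next_seqno.fetch_add(1, Ordering::SeqCst);
        // ... insert all n_items entries tagged with `seqno` here ...
        let _ = n_items;
        // Publish: readers snapshotting visible_seqno now see the batch atomically.
        self.visible_seqno.fetch_max(seqno + 1, Ordering::SeqCst);
    }

    fn read_snapshot(&self) -> u64 {
        // Readers clamp to the published watermark, never observing a partial batch.
        self.visible_seqno.load(Ordering::SeqCst)
    }
}
```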
Pull request overview
Copilot reviewed 13 out of 13 changed files in this pull request and generated no new comments.
Comments suppressed due to low confidence (1)
src/table/mod.rs:266
- In `check_bloom`, when `pinned_filter_index` is present and `iter.next()` returns `None` (key is beyond all partition end-keys), the key cannot exist in this table (same logic as `bloom_may_contain_key`). Currently this falls through with `filter_block = None`, causing `get()`/`get_with_block()` to proceed to `point_read` and also report `has_filter = false`. Consider short-circuiting to `(should_proceed = false, has_filter = true)` in this case, and computing `has_filter` based on filter availability (index/full filter), not only on whether a filter block was loaded.
fn check_bloom(&self, key: &[u8], key_hash: u64) -> crate::Result<(bool, bool)> {
// Returns (should_proceed, has_filter)
let filter_block = if let Some(block) = &self.pinned_filter_block {
Some(Cow::Borrowed(block))
} else if let Some(filter_idx) = &self.pinned_filter_index {
let mut iter = filter_idx.iter(self.comparator.clone());
// Filter partitions are written with seqno=0, making the seqno
// parameter irrelevant to partition selection. Use MAX_SEQNO
// consistently to match the index-block seek in Table::range().
iter.seek(key, crate::seqno::MAX_SEQNO);
if let Some(filter_block_handle) = iter.next() {
let filter_block_handle = filter_block_handle.materialize(filter_idx.as_slice());
let block = self.load_block(
&filter_block_handle.into_inner(),
BlockType::Filter,
CompressionType::None,
#[cfg(zstd_any)]
None,
)?;
Some(Cow::Owned(FilterBlock::new(block)))
} else {
None
}
} else if let Some(_filter_tli_handle) = &self.regions.filter_tli {
- Remove all "block cache" wording from PinnableSlice — Pinned means sharing decompressed block buffer, not a cache-hit indicator
- Return (false, true) when filter partition seek yields None — key sorts past last partition, definite miss instead of fallthrough
- Clarify WriteBatch visibility contract: entries visible one-at-a-time, caller must publish seqno after apply_batch for atomic visibility
Actionable comments posted: 1
♻️ Duplicate comments (1)
src/abstract_tree.rs (1)
559-563: ⚠️ Potential issue | 🟡 Minor

Clarify `Pinned` docs to avoid cache-hit implication.

Line 559 still frames `PinnableSlice::Pinned` as tied to "found in the block cache," which can be read as a cache-hit signal. Prefer wording that focuses on shared backing-memory lifetime.

✏️ Suggested doc tweak

- /// When the value is found in the block cache, the returned
- /// [`PinnableSlice::Pinned`] variant holds a reference to the cached
- /// block, avoiding a data copy. Memtable and blob-resolved values use
+ /// When the value is backed by an on-disk data block, the returned
+ /// [`PinnableSlice::Pinned`] variant holds a reference to that block's
+ /// decompressed buffer, avoiding a data copy. Memtable and blob-resolved values use
  /// the [`PinnableSlice::Owned`] variant.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/abstract_tree.rs` around lines 559 - 563, The docs for PinnableSlice::Pinned currently imply it signals a block-cache hit; update the comment to remove any cache-hit implication and instead describe that PinnableSlice::Pinned indicates the returned slice shares backing memory whose lifetime is tied to the pinned handle (i.e., no data copy), whereas PinnableSlice::Owned contains an owned copy (used for memtable/blob-resolved values). Edit the comment surrounding PinnableSlice::Pinned in abstract_tree.rs to emphasize shared backing-memory lifetime and ownership semantics rather than referencing a cache hit.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/write_batch.rs`:
- Line 40: The documentation for WriteBatch incorrectly says batches are
“applied atomically”; update the doc comment on the WriteBatch type (the
WriteBatch struct/docs) to state that entries are applied with a shared seqno
but become visible one-by-one unless the caller publishes visibility (e.g., via
visible_seqno.fetch_max(seqno + 1)) and that atomic visibility must be performed
by the caller after apply_batch/Tree::append_batch. Mention
Tree::append_batch/apply_batch as the locations that intentionally do not
provide internal visibility gating.
---
Duplicate comments:
In `@src/abstract_tree.rs`:
- Around line 559-563: The docs for PinnableSlice::Pinned currently imply it
signals a block-cache hit; update the comment to remove any cache-hit
implication and instead describe that PinnableSlice::Pinned indicates the
returned slice shares backing memory whose lifetime is tied to the pinned handle
(i.e., no data copy), whereas PinnableSlice::Owned contains an owned copy (used
for memtable/blob-resolved values). Edit the comment surrounding
PinnableSlice::Pinned in abstract_tree.rs to emphasize shared backing-memory
lifetime and ownership semantics rather than referencing a cache hit.
refactor(table): extract BloomResult enum from check_bloom

- PinnableSlice Pinned variant: len, is_empty, value, as_ref, clone, into_value, From impl, PartialEq, Debug — all exercised on disk values
- PinnableSlice: empty value, tombstone on disk, RT suppression of disk values, merge operands in memtable/sealed/disk paths
- multi_get: merge operand resolution (memtable + disk), L0 tombstones
- BloomResult enum replaces (bool, bool) tuple — cleanly gates has_filter behind cfg(metrics) without underscore prefix hacks
…ract

- WriteBatch: remove "atomically" from struct doc (visibility is per-entry, not atomic without caller seqno publish)
- get_pinned: replace "block cache" with "on-disk data block" in AbstractTree trait doc
find_in_tables (monomorphized get/get_pinned path) continued scanning L1+ runs after the covering run returned None. L1+ runs have non-overlapping key ranges — break immediately after covering table miss.
Actionable comments posted: 1
♻️ Duplicate comments (1)
src/tree/mod.rs (1)
1535-1541: ⚠️ Potential issue | 🟠 Major

Stop after the covering L1+ run misses.

Once `run.get_for_key_cmp()` returns a table, no later run in the same level can contain `key`. Continuing after `T::lookup(...)` returns `None` makes both `get()` and `get_pinned()` pay redundant same-level bloom/index probes, even though `batch_get_from_tables()` already has the covered-miss fast path.

♻️ Minimal fix

  } else {
      for run in level.iter() {
-         if let Some(table) = run.get_for_key_cmp(key, comparator)
-             && let Some(item) = T::lookup(table, key, seqno, key_hash)?
-         {
-             return Ok(item.filter_tombstone());
+         if let Some(table) = run.get_for_key_cmp(key, comparator) {
+             if let Some(item) = T::lookup(table, key, seqno, key_hash)? {
+                 return Ok(item.filter_tombstone());
+             }
+             break;
          }
      }
  }

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/tree/mod.rs` around lines 1535 - 1541, The loop currently continues probing later runs after run.get_for_key_cmp(key, comparator) returns Some(table) but T::lookup(table, key, seqno, key_hash) returns None, causing redundant same-level probes; change the logic so that when get_for_key_cmp yields a table and T::lookup returns None you immediately short-circuit the search for that key in this level (return Ok(None) or otherwise stop iterating this level) instead of continuing to later runs—adjust the code around run.get_for_key_cmp and T::lookup to return the covered-miss result, mirroring the fast-path behavior used by batch_get_from_tables(), and keep get() and get_pinned() calling this updated path.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/tree/mod.rs`:
- Around line 1456-1479: The merge step currently does
still_remaining.extend(covered_miss) which can break the comparator sort order
and ruin sequential SST access; instead merge the two sorted partitions
(still_remaining and covered_miss) by comparing keys via keys[idx].as_ref() with
comparator so the resulting vector preserves the comparator order before
descending to lower levels—replace the extend with a proper two-way merge (or
re-sort using the comparator on the combined entries) so functions like
run.get_for_key_cmp, level.iter(), and the vectors still_remaining/covered_miss
maintain the required order for subsequent probes.
---
Duplicate comments:
In `@src/tree/mod.rs`:
- Around line 1535-1541: The loop currently continues probing later runs after
run.get_for_key_cmp(key, comparator) returns Some(table) but T::lookup(table,
key, seqno, key_hash) returns None, causing redundant same-level probes; change
the logic so that when get_for_key_cmp yields a table and T::lookup returns None
you immediately short-circuit the search for that key in this level (return
Ok(None) or otherwise stop iterating this level) instead of continuing to later
runs—adjust the code around run.get_for_key_cmp and T::lookup to return the
covered-miss result, mirroring the fast-path behavior used by
batch_get_from_tables(), and keep get() and get_pinned() calling this updated
path.
Actionable comments posted: 2
♻️ Duplicate comments (3)
Cargo.toml (1)
38-38: ⚠️ Potential issue | 🔴 Critical

`structured-zstd` version may be non-resolvable (build blocker).

At Line 38, please verify that `structured-zstd` version 0.0.6 exists on crates.io; if it does not, dependency resolution will fail and block CI/build.

#!/bin/bash
# Read-only verification: list published versions for structured-zstd
curl -s https://crates.io/api/v1/crates/structured-zstd/versions \
  | jq -r '.versions[].num' \
  | head -50

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@Cargo.toml` at line 38, The Cargo.toml lists the dependency structured-zstd = "0.0.6" which may not exist on crates.io; verify whether version 0.0.6 is published and, if it is missing, update the dependency entry (the structured-zstd dependency in Cargo.toml) to a published version (e.g. bump to an existing semver) or point it to a valid git/path source; ensure the dependency string for structured-zstd matches a resolvable crate version or valid alternative source so CI/build dependency resolution succeeds.

src/tree/mod.rs (2)
1535-1540: ⚠️ Potential issue | 🟠 Major

Stop the L1+ run scan after the covering run misses.

Once `run.get_for_key_cmp(...)` returns a table for this key, later runs in the same level cannot contain it. If `T::lookup(...)` then returns `None`, continuing the loop turns `get` and `get_pinned` back into an `O(runs_in_level)` walk on sparse levels.

Fix sketch

- for run in level.iter() {
-     if let Some(table) = run.get_for_key_cmp(key, comparator)
-         && let Some(item) = T::lookup(table, key, seqno, key_hash)?
-     {
-         return Ok(item.filter_tombstone());
-     }
- }
+ for run in level.iter() {
+     if let Some(table) = run.get_for_key_cmp(key, comparator) {
+         if let Some(item) = T::lookup(table, key, seqno, key_hash)? {
+             return Ok(item.filter_tombstone());
+         }
+         break;
+     }
+ }

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/tree/mod.rs` around lines 1535 - 1540, When scanning runs in a level you must stop the L1+ run scan once the covering run has been checked and T::lookup(...) returns None to avoid turning get/get_pinned into an O(runs_in_level) walk; modify the loop that iterates runs (the code calling run.get_for_key_cmp(...) and then T::lookup(...)) to track whether a covering run/table was observed for the key and, if so, break out of the loop when lookup returns None so later runs in the same level are not scanned. Ensure the change affects both get and get_pinned code paths and uses the existing run.get_for_key_cmp, T::lookup, and runs_in_level logic to decide when to stop scanning.
1456-1479: ⚠️ Potential issue | 🟠 Major

Preserve comparator order when re-adding `covered_miss`.

Line 1479 concatenates two individually sorted partitions. Once a level has both uncovered keys and earlier covered misses, the next level sees keys out of comparator order, which throws away the sequential SST access pattern this path is trying to preserve.

Fix sketch

- still_remaining.extend(covered_miss);
+ still_remaining.extend(covered_miss);
+ still_remaining.sort_by(|&(a, _), &(b, _)| {
+     comparator.compare(keys[a].as_ref(), keys[b].as_ref())
+ });

At minimum, re-sort here; a two-way merge is even better if you want to avoid the extra sort.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/tree/mod.rs` around lines 1456 - 1479, The code currently concatenates the already-sorted `covered_miss` partition with the other sorted partition causing comparator order to be broken downstream; instead, preserve comparator order by either performing a two-way merge of the two sorted slices (preferred) or by re-sorting the concatenated vector with the same comparator used elsewhere (e.g., the comparator/`cmp`/`compare` function) before returning or passing the combined keys; locate the concatenation site referencing `covered_miss` and the other partition and replace the simple concat with a stable merge routine that compares elements with the same comparator (or call sort_by on the combined collection) so the final sequence is globally ordered.
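Following up on the two-way-merge suggestion: a linear merge along these lines avoids the extra sort. The index pairs, the `keys` slice, and the plain byte-wise comparison are illustrative stand-ins for the crate's comparator:

```rust
// Both inputs are already sorted by key order, so one linear pass suffices
// instead of an O(n log n) re-sort after concatenation.
fn merge_sorted(
    a: Vec<(usize, u64)>, // (key index, bloom hash) pairs, sorted
    b: Vec<(usize, u64)>,
    keys: &[Vec<u8>],
) -> Vec<(usize, u64)> {
    let mut out = Vec::with_capacity(a.len() + b.len());
    let (mut i, mut j) = (0, 0);
    while i < a.len() && j < b.len() {
        // Compare by user key, mirroring the comparator used for the level scan.
        if keys[a[i].0].as_slice() <= keys[b[j].0].as_slice() {
            out.push(a[i]);
            i += 1;
        } else {
            out.push(b[j]);
            j += 1;
        }
    }
    out.extend_from_slice(&a[i..]);
    out.extend_from_slice(&b[j..]);
    out
}
```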
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tests/pinnable_slice.rs`:
- Around line 10-15: Tests repeat the Config::new(...).open()? pattern; create a
shared helper (e.g., a private test helper function like init_tree or
make_config_in_tests) that constructs the Config, calls open(), and returns the
opened object or Result so tests call that helper instead of duplicating
Config::new(...).open() throughout; update all test usages in pinnable_slice.rs
to call the helper and adjust imports/return types accordingly so future
constructor/config changes are centralized.
- Around line 417-431: Multiple tests re-declare the same ConcatMerge fixture;
hoist a single ConcatMerge definition to module scope and have each test
reference that one to avoid duplication. Create a single ConcatMerge definition
(matching the existing local definitions) at the top of the test module (make it
pub(crate) or non-private as needed), remove the three local redefinitions
currently inside the tests at lines around 417/458/500, and update the tests
that previously declared ConcatMerge to simply use the module-scoped ConcatMerge
symbol (ensuring any helper impls or trait impls such as Display/From/Default
stay with the single definition). Ensure the module-scoped fixture compiles with
all tests that previously used the duplicated versions.
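A sketch of the hoisted fixture; the crate's real merge-operator interface isn't shown in this thread, so the `MergeOperator` trait below is a hypothetical stand-in:

```rust
// Hypothetical trait, illustrative only.
trait MergeOperator {
    fn merge(&self, existing: Option<&[u8]>, operand: &[u8]) -> Vec<u8>;
}

// Module-scoped fixture: defined once, shared by every test in the file.
struct ConcatMerge;

impl MergeOperator for ConcatMerge {
    fn merge(&self, existing: Option<&[u8]>, operand: &[u8]) -> Vec<u8> {
        let mut out = existing.map(<[u8]>::to_vec).unwrap_or_default();
        out.extend_from_slice(operand); // concatenate operands in arrival order
        out
    }
}

#[test]
fn concat_merge_appends() {
    let m = ConcatMerge;
    assert_eq!(m.merge(Some(b"D".as_slice()), b"E"), b"DE".to_vec());
}
```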
---
Duplicate comments:
In `@Cargo.toml`:
- Line 38: The Cargo.toml lists the dependency structured-zstd = "0.0.6" which
may not exist on crates.io; verify whether version 0.0.6 is published and, if it
is missing, update the dependency entry (the structured-zstd dependency in
Cargo.toml) to a published version (e.g. bump to an existing semver) or point it
to a valid git/path source; ensure the dependency string for structured-zstd
matches a resolvable crate version or valid alternative source so CI/build
dependency resolution succeeds.
In `@src/tree/mod.rs`:
- Around line 1535-1540: When scanning runs in a level you must stop the L1+ run
scan once the covering run has been checked and T::lookup(...) returns None to
avoid turning get/get_pinned into an O(runs_in_level) walk; modify the loop that
iterates runs (the code calling run.get_for_key_cmp(...) and then
T::lookup(...)) to track whether a covering run/table was observed for the key
and, if so, break out of the loop when lookup returns None so later runs in the
same level are not scanned. Ensure the change affects both get and get_pinned
code paths and uses the existing run.get_for_key_cmp, T::lookup, and
runs_in_level logic to decide when to stop scanning.
- Around line 1456-1479: The code currently concatenates the already-sorted
`covered_miss` partition with the other sorted partition causing comparator
order to be broken downstream; instead, preserve comparator order by either
performing a two-way merge of the two sorted slices (preferred) or by re-sorting
the concatenated vector with the same comparator used elsewhere (e.g., the
comparator/`cmp`/`compare` function) before returning or passing the combined
keys; locate the concatenation site referencing `covered_miss` and the other
partition and replace the simple concat with a stable merge routine that
compares elements with the same comparator (or call sort_by on the combined
collection) so the final sequence is globally ordered.
- Re-sort still_remaining after extending with covered_miss to preserve comparator order for next level's sequential scan
- Extract setup_tree() and module-scope ConcatMerge in pinnable_slice tests (was duplicated 20+ / 3 times respectively)
- Bump structured-zstd 0.0.6 → 0.0.7
@coderabbitai Re: structured-zstd version duplicate — updated to 0.0.7 (latest). Version 0.0.6 was always valid (cargo search confirmed), 0.0.7 released since. Re: L1+ covering miss duplicates — both fixed: find_in_tables breaks after covering table miss, batch path re-sorts after covered_miss merge. |
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tests/pinnable_slice.rs`:
- Around line 425-427: Test currently flushes after the merge so both the base
insert and merge operand end up in the same SST; move the flush to occur
immediately after tree.insert("k", "D", 0) to force the base value into a
different SST, then perform tree.merge("k", "E", 1) (and optionally flush again)
so merge resolution exercises the cross‑SST disk lookup path—adjust calls to
tree.insert, tree.flush_active_memtable(0) and tree.merge accordingly and
preserve the ? error propagation.
- Around line 8-16: The helper setup_tree currently calls .open().unwrap(),
causing panics on I/O errors; change its signature to return
lsm_tree::Result<(AnyTree, tempfile::TempDir)> and replace the unwrap with the ?
operator so the error from Config::new(...).open() is propagated to callers;
update any callers in tests to handle Result (using ? in test functions or
map_err) so failures surface as Results instead of panics.
- Around line 346-348: The test currently applies a range tombstone via
tree.remove_range("a", "c", 2) but never flushes it, so it only exercises
memtable RT suppression; after calling tree.remove_range(...) add a flush of the
memtable so the range tombstone is written to an SST (e.g., call tree.flush() or
the project's memtable flush helper) so get_pinned() must handle SST-vs-SST
suppression; ensure the flush happens before subsequent reads/assertions that
rely on disk-resident RT behavior.
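For the `setup_tree` finding above, a minimal fallible helper, assuming `lsm_tree::Error: From<std::io::Error>`; the `Config` and `AnyTree` names come from the review text and may differ from the crate:

```rust
use lsm_tree::{AnyTree, Config};

// Sketch: propagate open()/tempdir errors with `?` instead of panicking.
// Assumes lsm_tree::Error: From<std::io::Error>; use map_err otherwise.
fn setup_tree() -> lsm_tree::Result<(AnyTree, tempfile::TempDir)> {
    let dir = tempfile::tempdir()?;
    let tree = Config::new(dir.path()).open()?;
    Ok((tree, dir))
}
```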
- setup_tree() returns Result instead of unwrap on I/O path
- RT suppression test flushes tombstone to disk (SST-vs-SST path)
- merge-on-disk test splits base and operand across separate SSTs
- RT suppresses blob-resolved values in batch path
- Merge operand resolution with and without merge operator
- Memtable hits bypass SST phase for blob tree
- Merge without operator returns raw operand value
InternalKey ordering ignores value_type, making equal-key entries with different op types ambiguous to reads/merges — not skiplist overwrites.
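To illustrate why, a self-contained sketch of an ordering that ignores `value_type`; field names are illustrative, not the crate's exact layout:

```rust
use std::cmp::Ordering;

struct InternalKey {
    user_key: Vec<u8>,
    seqno: u64,
    value_type: u8, // deliberately NOT part of the ordering
}

impl Ord for InternalKey {
    fn cmp(&self, other: &Self) -> Ordering {
        self.user_key
            .cmp(&other.user_key)
            // Newer entries (higher seqno) sort first for snapshot reads.
            .then_with(|| other.seqno.cmp(&self.seqno))
    }
}

impl PartialOrd for InternalKey {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}

// Equality must agree with Ord, so it also ignores value_type; this is
// exactly why an Insert and a Remove at the same (key, seqno) are ambiguous.
impl PartialEq for InternalKey {
    fn eq(&self, other: &Self) -> bool {
        self.cmp(other) == Ordering::Equal
    }
}
impl Eq for InternalKey {}
```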
## 🤖 New release

* `coordinode-lsm-tree`: 4.3.0 -> 4.3.1

Changelog:

## [4.3.1](v4.3.0...v4.3.1) - 2026-04-06

### Performance

- *(compression)* use numeric zstd levels in pure Rust backend ([#226](#226))
- batch multi_get + PinnableSlice + WriteBatch ([#214](#214))

---

This PR was generated with [release-plz](https://github.com/release-plz/release-plz/).

Co-authored-by: sw-release-bot[bot] <255865126+sw-release-bot[bot]@users.noreply.github.com>
Summary
- `get_pinned()` API — `Pinned` variant keeps decompressed block buffer alive via refcounted `Slice`/`ByteView`; `Owned` for memtable/merge/blob values
- `apply_batch(batch, seqno) -> Result<(u64, u64)>` — shared seqno, single lock; `Error::MixedOperationBatch` rejects mixed-op duplicates unconditionally

Technical Details
Batch multi_get
- `find_in_tables<T: TablePointLookup>` — zero-cost generic for `get` (no Block overhead) and `get_pinned` (retains Block)

PinnableSlice
- `Pinned { _block: Block, value: Slice }` — block buffer kept alive via refcount; does NOT prevent cache eviction
- `resolve_pinned_entry` helper — single source of truth for tombstone/RT/merge resolution, used by both `get_pinned` and `multi_get`
- `Table::point_read_inner` — shared block-index walk for `point_read` and `point_read_with_block`
- `BloomResult` enum — DRY filter-loading with cfg(metrics) gating

WriteBatch
- `apply_batch` returns `Result` — `Error::MixedOperationBatch` if the same user key has differing op types (insert + remove); `merge()` on the same key is safe (same value_type)
- `Memtable::insert_batch` — single `saturating_add` for total size
- version_history read guard held across the batch to prevent the `rotate_memtable` race
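A hedged example of the mixed-op rejection; the `WriteBatch` methods and the `Error::MixedOperationBatch` variant come from this PR's text, while the exact signatures and variant shape are assumptions:

```rust
use lsm_tree::{AbstractTree, Config, Error, WriteBatch};

fn mixed_ops_are_rejected() -> lsm_tree::Result<()> {
    let folder = tempfile::tempdir().expect("tempdir");
    let tree = Config::new(&folder).open()?;

    let mut batch = WriteBatch::new();
    batch.insert("k", "v");
    batch.remove("k"); // same user key, different op type

    // The batch is rejected unconditionally, before anything is applied.
    match tree.apply_batch(batch, 0) {
        Err(Error::MixedOperationBatch) => {}
        other => panic!("expected MixedOperationBatch, got {other:?}"),
    }
    Ok(())
}
```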
Table::get global seqno

- `Table::get` and `get_with_block` now add `global_seqno` back to the returned `InternalValue` seqno
- `global_seqno` check: `checked_add(1) == Some(seqno)` (was dead code with `== seqno`)
Test Plan

- `cargo nextest run --all-features`
- `cargo test --doc`
- `cargo clippy --all-targets --all-features -- -D warnings`
Closes #143
Summary by CodeRabbit
- New Features
- Performance Improvements
- Bug Fixes
- Documentation
- Tests / Benchmarks
- Chores