feat(checkpoint)!: hard-link snapshot for PITR backup (V5 storage) by polaz · Pull Request #276 · structured-world/coordinode-lsm-tree

polaz · 2026-05-19T21:26:47Z

Summary

Implements #210 — the consistent hard-link checkpoint primitive that unblocks point-in-time recovery (PITR) backup for CoordiNode and any other consumer that needs an O(1)-per-file snapshot of an open LSM-tree.

New trait method Fs::hard_link(src, dst) with a default Unsupported impl (non-breaking for downstream backends). StdFs performs a real hard_link(2) with an EXDEV→copy fallback; MemFs produces an independent copy; IoUringFs delegates to StdFs since linking is a metadata-only syscall.
New trait method Fs::backend_id() -> Option<u64> — explicit shared-namespace capability check that gates cross-Fs hard-link attempts. StdFs/IoUringFs return a common KERNEL_BACKEND_ID; each MemFs::new() allocates a unique per-instance id; default impl returns None for safe-by-default third-party backends.
DeletionPause — a tree-wide refcount-gated deferred-delete queue installed into each Table::Inner / BlobFile::Inner. While at least one Pause guard is alive, both Drop impls queue removals instead of executing them; the queue drains when the last guard is dropped. Mirrors RocksDB's DisableFileDeletions/EnableFileDeletions pattern. Synchronisation uses spin::Mutex so the type is no_std-compatible.
AbstractTree::create_checkpoint(target) -> CheckpointInfo — implemented on Tree, BlobTree, and (via enum_dispatch) AnyTree. The driver captures visible_seqno first, flushes the active memtable (eviction seqno 0 — never expands MVCC GC beyond what would have happened anyway), snapshots the current version, hard-links every live SST + blob into target/{tables,blobs}/, copies manifest / v<id> / current, and fsyncs the target root.
level_routes (tiered storage) is honoured automatically — each Table::Inner already carries its own routed Fs, so the link is performed through the right backend. Cross-Fs situations (e.g. MemFs source vs. StdFs target) transparently fall back to a streamed byte copy after the backend_id namespace check declines the hard-link path.
Tree::open now rejects a directory that has version artifacts (tables/, blobs/, vN) but no current pointer — the on-disk signature of a half-written checkpoint. Previously the code silently called create_new, which would have overwritten the partial state with an empty tree.

Test plan

cargo nextest run --all-features — 1327/1327 passed (6 environment-skipped), including 14 integration scenarios in tests/checkpoint_pitr.rs:

Plus unit coverage for:

DeletionPause — defer / inactive-passthrough / nested-pauses / generation-race-under-lock
StdFs::is_cross_device — raw EXDEV + ErrorKind variants
Fs::hard_link — round-trip + duplicate / missing-source rejection (both StdFs and MemFs)
copy_fallback — independence + create-new semantics
cargo clippy --lib --tests clean
cargo test --doc passes for all touched modules

Closes #210

Summary by CodeRabbit

New Features
- Point-in-time checkpoint snapshots (optional blob inclusion) reporting file counts, total bytes, version id and visibility seqno; checkpoints can be opened as independent trees.
Reliability / Bug Fixes
- Atomic, durable checkpoint creation with hard-link-first and streamed-copy fallback (debug-logged); best-effort cleanup on failure; opening rejects half-written or corrupt checkpoints with clearer diagnostics.
Utilities
- Checkpoint-aware gate to defer physical file removals during snapshot lifetimes.
Filesystem compatibility
- Improved create-dir/hard-link handling with cross-filesystem copy fallback and namespace signaling.
Tests / Documentation
- Extensive end-to-end, isolation and crash-safety checkpoint suite; README updated to document PITR checkpoints.

BREAKING CHANGE: on-disk format V5. V3/V4 databases are NOT readable by this version and vice versa. V5 ships the BuRR filter wire-format change (replaces standard bloom, #269) plus the PITR hard-link checkpoint primitive (this PR, #210). The major bump goes through release-plz on squash.

- Add `Fs::hard_link(src, dst)` with default `Unsupported` impl; concrete impls on `StdFs` (with EXDEV→copy fallback), `MemFs` (copy-as-link semantics) and `IoUringFs` (delegates to `StdFs`). - Introduce `DeletionPause` — refcount-gated deferred-delete queue that holds back `Table` / `BlobFile` removals while a checkpoint is in flight, mirroring RocksDB's `DisableFileDeletions`/`EnableFileDeletions`. - Add `AbstractTree::create_checkpoint(target) -> CheckpointInfo`, implemented on `Tree`, `BlobTree`, and (via enum_dispatch) `AnyTree`. The driver flushes the active memtable, snapshots the current version, hard-links every live SST + blob into `target/{tables,blobs}/`, copies manifest / version / current-pointer, and fsyncs the target root. - `level_routes` tiers are honoured by linking through each table's own `Fs` handle; cross-`Fs` situations fall back to a streamed byte copy. - Tests: 11 integration scenarios in `tests/checkpoint_pitr.rs` (round-trip, concurrent writes, BlobTree, level_routes, isolation, crash-safety, deferred-delete vs concurrent major_compact, empty tree, byte-accuracy, double-target rejection, cross-Fs copy) plus unit coverage for `is_cross_device` and `DeletionPause`. Closes #210

coderabbitai · 2026-05-19T21:26:59Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 090e446d-7305-4e10-9522-78a314e9e63b

📥 Commits

Reviewing files that changed from the base of the PR and between 9d839cc and d767602.

📒 Files selected for processing (2)

Cargo.toml
src/compression/mod.rs

📝 Walkthrough

Walkthrough

Creates PITR checkpoints: adds Fs mkdir/hard-link primitives, a refcounted DeletionPause to defer deletions, a checkpoint driver that hard-links or copies SST/blob files and metadata, wires pause into Tree/BlobTree lifecycles, and adds end-to-end PITR tests.

Changes

Checkpoint PITR Feature

Layer / File(s)	Summary
Filesystem hard-link trait and implementations `src/fs/mod.rs`, `src/fs/std_fs.rs`, `src/fs/io_uring_fs.rs`	`Fs` trait adds `create_dir`, `hard_link`, and `backend_id`. `StdFs` attempts `std::fs::hard_link` and falls back to buffered copy on cross-device errors with detection and copy-fallback logic; `IoUringFs` delegates to `StdFs`.
MemFs namespace & tests `src/fs/mem_fs.rs`	Adds `namespace_id`, `backend_id`, `create_dir`, and `hard_link` emulation with tests validating independent-copy semantics, `AlreadyExists`, and `NotFound`.
Deletion pause mechanism `src/deletion_pause.rs`	`DeletionPause` is a ref-counted gate that queues deletions while active; `acquire()` returns a `Pause` guard and the last-holder drop drains queued deletions with generation-race protection. Includes unit and concurrency tests.
Checkpoint core orchestration `src/checkpoint.rs`	Implements checkpoint utilities: `prepare_target` creates atomic target layout; `link_or_copy_cross_fs` hard-links or stream-copies across Fs backends; `link_tables` and `link_blob_files` capture live files counting files/bytes; `copy_metadata` replicates manifest/v/CURRENT (rewriting CURRENT atomically); `run_checkpoint` flushes memtable, acquires deletion pause, links/copies files, fsyncs child dirs then root, and returns `CheckpointInfo`. RAII guard removes partial checkpoints on failure.
Tree wiring and pause installation `src/tree/inner.rs`, `src/tree/mod.rs`, `src/table/inner.rs`, `src/table/mod.rs`	`TreeInner` owns a shared `DeletionPause` created at construction. Recovery and `register_tables` install the pause into recovered/newly registered tables/blobs. `Tree::create_checkpoint` delegates to `checkpoint::run_checkpoint` with `include_blobs: false`. Table drops defer physical file removal when the pause enqueues deletions.
BlobTree and blob-file deletion pause `src/blob_tree/mod.rs`, `src/vlog/blob_file/mod.rs`, `src/vlog/blob_file/multi_writer.rs`, `src/vlog/blob_file/reader.rs`, `src/vlog/mod.rs`	`BlobTree::create_checkpoint` delegates to `run_checkpoint` with `include_blobs: true`. `BlobFileInner` stores the `Fs` handle and a `OnceBox<DeletionPause>`; `BlobFile::install_deletion_pause` installs the pause idempotently. Blob writer construction and recovery initialize these fields; blob Drop defers or performs removal in a checkpoint-aware manner.
Public API: CheckpointInfo and trait method `src/abstract_tree.rs`	Adds `CheckpointInfo` (sst_files, blob_files, total_bytes, version_id, seqno) and extends `AbstractTree` with `create_checkpoint(&Path) -> Result<CheckpointInfo>` documenting hard-link semantics, cross-FS fallback logging, concurrency/deferred deletion behavior, and failure/cleanup characteristics.
Crate-level modules & deps `src/lib.rs`, `Cargo.toml`	Declares crate-visible `deletion_pause` and `checkpoint` modules, re-exports `CheckpointInfo` at the crate root, and updates Cargo features/dependencies (`spin`, `once_cell`) to support no-std/alloc changes.
Blob merge heap change (alloc/no-std) `src/vlog/blob_file/merge.rs`	Replaces `interval_heap::IntervalHeap` with `alloc::collections::BinaryHeap<Reverse<...>>`, updates ordering and heap push/pop sites to use `Reverse`, and switches imports to `alloc`/`core` equivalents.
Compression OnceBox migration `src/compression/mod.rs`	Replaces zstd `OnceCell` cache with `once_cell::race::OnceBox` and updates initialization and lazy-init code accordingly.
End-to-end integration tests `tests/checkpoint_pitr.rs`	Integration tests cover round-trip checkpoint/reopen, concurrent writer/compaction safety, BlobTree blob preservation, tiered-storage flattening, MVCC/regression checks, divergence/isolation, compaction-vs-checkpoint races, empty-tree/bytes accounting, duplicate-target rejection, cross-FS fallback, and crash/failure safety with retries.
MemFs regression tests `tests/mem_fs_tree.rs`	Adds tests for relative single-component and `./`-prefixed relative checkpoint target handling on MemFs.
Misc: README and docs `README.md`, `.github/instructions/rust.instructions.md`	Documents Tree::create_checkpoint in README and adds a no-std+alloc guidance section to review docs.

Sequence Diagram(s)

sequenceDiagram
  participant Client
  participant Tree
  participant Checkpoint
  participant Fs
  participant DeletionPause
  Client->>Tree: create_checkpoint(target_path)
  Tree->>DeletionPause: acquire() (Pause guard)
  Tree->>Checkpoint: run_checkpoint(params)
  Checkpoint->>Tree: flush_active_memtable()
  Checkpoint->>Tree: capture current_version + visible_seqno
  Checkpoint->>Fs: create_dir(target), create tables/, optionally blobs/
  Checkpoint->>Fs: hard_link or copy SST/blob files
  Checkpoint->>Fs: copy manifest, write v<id>, write CURRENT
  Checkpoint->>Fs: fsync child dirs then root (and parent if present)
  Checkpoint->>DeletionPause: release() (drop Pause)
  Checkpoint->>Tree: return CheckpointInfo
  Tree->>Client: CheckpointInfo

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

structured-world/coordinode-lsm-tree#273: Overlaps on zstd prepared-cache migration (OnceCell → OnceBox) in src/compression/mod.rs.
structured-world/coordinode-lsm-tree#212: Related changes to Tree open/recovery logic and CURRENT/version handling touched in this PR.
structured-world/coordinode-lsm-tree#227: Also modifies zstd dictionary prepared caching and lazy-init behavior; code-level relation.

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title 'feat(checkpoint)!: hard-link snapshot for PITR backup (V5 storage)' clearly and specifically describes the main change: implementing checkpoint functionality for point-in-time recovery backups using hard-link snapshots, with a breaking change indicator.
Linked Issues check	✅ Passed	All coding requirements from issue `#210` are met: Fs::hard_link with EXDEV fallback, DeletionPause mechanism for deferred deletions, create_checkpoint API with CheckpointInfo, level_routes support, cross-filesystem handling, and comprehensive test coverage including concurrent compaction, crash-safety, and reopenability.
Out of Scope Changes check	✅ Passed	All code changes are directly scoped to checkpoint implementation: Fs trait extensions (hard_link, create_dir, backend_id), DeletionPause mechanism, checkpoint module, Tree/BlobTree implementations, and supporting infrastructure like deletion_pause integration into Table/BlobFile. No unrelated refactoring or feature additions detected.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch feat/#210-pitr-checkpoint

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 12

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/abstract_tree.rs`:
- Around line 89-95: Update the checkpoint docstring to mention that while
checkpoints attempt O(1) hard-links (so they consume no extra space) for all
SST/blob files, the implementation falls back to streamed/copy-on-write behavior
when hard-linking fails across filesystems/backends (e.g., EXDEV), which incurs
linear IO and additional storage for those files; keep references to
Config::open and BlobTree and state that callers using level routing or separate
backup volumes may observe this fallback and its performance/storage
implications.

In `@src/checkpoint.rs`:
- Around line 220-229: The code currently copies the live CURRENT file via
copy_metadata_file(CURRENT_VERSION_FILE) which can race with a concurrent
publish; modify run_checkpoint so that instead of cloning the source CURRENT you
atomically write a new CURRENT in the checkpoint containing the captured version
id (use the captured version.id() / version_id) as the content and perform this
write as the final metadata step; replace the
copy_metadata_file(CURRENT_VERSION_FILE) call with an atomic write of the
captured version string (use the same metadata-write helper/atomic-temp-rename
pattern used elsewhere) so CURRENT always points to the snapshot's version and
not to a concurrently-published one.
- Around line 268-292: After prepare_target succeeds you must install a
best-effort cleanup guard that removes the partially-built checkpoint on any
early return/error: create a scoped guard (or use a Drop-based RAII guard)
immediately after prepare_target(target_root, include_blobs, &**target_fs)? that
captures target_root and target_fs and on drop (if not disarmed) deletes the
target (e.g., remove_dir_all or equivalent), then disarm the guard only after
the final fsync_directory completes successfully. Ensure the guard covers
failures from deletion_pause.acquire(), tree.flush_active_memtable(SeqNo::MAX)?,
link_tables(...), link_blob_files(...), copy_metadata(...), and
fsync_directory(...), and reference/disarm it right after the successful
completion of fsync_directory so retries won’t hit AlreadyExists.
- Around line 280-282: CheckpointInfo::seqno can race ahead of the captured
snapshot because visible_seqno.get() is read after tree.current_version();
change the code to capture the seqno at the same logical snapshot as
current_version() by obtaining visible_seqno as-of that version (or by using the
tree snapshot API that returns version+seqno together) instead of calling
visible_seqno.get() separately; update the code around the calls to
tree.current_version() and visible_seqno.get() so you either call a
visible_seqno.as_of(version) (or equivalent API) or take the snapshot first
(e.g., snapshot = tree.capture_snapshot()) and read snapshot.seqno, ensuring
CheckpointInfo::seqno reflects the same snapshot point as
tree.current_version().
- Around line 53-65: prepare_target currently races by calling
target_fs.exists(target) then create_dir_all, allowing another process to create
the same checkpoint; change it to claim the checkpoint root atomically (either
by using an atomic create-new API on the filesystem or by creating a create-new
sentinel file/dir inside target and failing if it already exists) before
creating TABLES_FOLDER and optional blobs; specifically, update prepare_target
to attempt an atomic create for target (or create_new for a sentinel like
target.join(".lock" or ".claim")), return AlreadyExists if that atomic create
fails, and only after successful claim call
target_fs.create_dir_all(&target.join(TABLES_FOLDER)) and, when include_blobs is
true, create the blobs folder — ensure you reference prepare_target and
TABLES_FOLDER and preserve error mapping to std::io::ErrorKind::AlreadyExists.
- Around line 283-292: The code currently calls link_tables and link_blob_files
to populate target/tables and target/blobs, then writes metadata and calls
fsync_directory(target_root, target_fs) only on the root; to fix, after
link_tables and (if include_blobs) link_blob_files succeed and before the final
fsync_directory(target_root, target_fs) you must call fsync_directory on each
populated child directory (e.g., the tables directory and, if created, the blobs
directory) using the same target_fs handle so those directory entry changes are
durable; keep the final fsync_directory(target_root, target_fs) after these
child-directory syncs and ensure you perform these child fsyncs regardless of
copy_metadata succeeding to preserve correct fsync ordering for
link_tables/link_blob_files, referencing link_tables, link_blob_files,
fsync_directory, target_root and target_fs in your changes.

In `@src/deletion_pause.rs`:
- Around line 150-151: Replace the test-level lint suppression attribute so it
uses expect instead of allow: change the attribute on the test module/section
that currently reads #[allow(clippy::unwrap_used, reason = "...")] (the
attribute immediately following #[cfg(test)]) to #[expect(clippy::unwrap_used,
reason = "...")] so the suppression will emit a warning if it becomes
unnecessary; keep the same reason string and retain #[cfg(test)] unchanged.
- Around line 121-144: The drain is racing with a concurrently-created new pause
because after fetch_sub() transitions active 1->0 a new acquire() can create a
fresh pause and enqueue items that this branch will wrongly take; fix by making
draining keyed or serialized with acquire()/try_enqueue(): add and bump a
monotonic generation counter (e.g., inner.generation) when acquire() creates a
new pause and have drop() capture the generation it observed, then while holding
inner.queue.lock() verify the queue's generation matches that captured
generation before taking it (or only drain items from the matching generation);
alternatively, ensure acquire()/try_enqueue() and the drain both synchronize on
the same mutex so the drain only swaps the queue when no new pause has been
created. Ensure you update acquire(), try_enqueue(), and Drop for DeletionPause
to consult the generation or shared lock and reference the existing fields
active, inner.queue, acquire(), try_enqueue(), and the prev == 1 branch when
applying the change.

In `@src/fs/mem_fs.rs`:
- Around line 1285-1312: The test hard_link_creates_independent_copy should
verify independent backing buffers by mutating one path before removal and
asserting the other is unchanged: after creating src and dst via
fs.hard_link(src, dst), open one (e.g., src) with
FsOpenOptions::new().write(true), write different bytes via write_all and
drop/close it, then remove_file(src) and assert exists(src) is false and
exists(dst) is true, and finally open dst and read_to_string to confirm its
contents remain the original value (not the mutated bytes); update assertions
accordingly to reference hard_link, open, write_all, remove_file,
read_to_string, and exists.

In `@tests/checkpoint_pitr.rs`:
- Around line 179-181: The test is silently swallowing I/O errors by using
filter_map(Result::ok) on std::fs::read_dir(&dst_tables) and by calling
e.metadata().ok(); instead collect the directory entries as a Result<Vec<_>>
(e.g. collect::<std::io::Result<Vec<_>>>()), propagate any read_dir error with ?
and replace e.metadata().ok() with e.metadata()? so filesystem errors bubble up
instead of being dropped; apply the same changes to the other occurrences around
the entries/count logic referenced (the blocks around lines 260-263 and 338-341)
so any unreadable checkpoint entries cause the test to fail with the original
I/O error.
- Around line 154-185: The test never verifies that any SST was actually flushed
into the routed hot tier, so add a pre-check that at least one flushed SST
exists under the hot tier path (hot_dir) before calling
tree.create_checkpoint(&dst_path); e.g., inspect hot_dir.path()/tables (or the
same tables layout used later) with std::fs::read_dir, count entries, and assert
entries >= 1 with a clear message so the checkpoint validation proves the routed
path (level_routes / LevelRoute) was honored.
- Around line 415-454: Change the test to exercise a failure that occurs after
prepare_target rather than before it: do not pre-create the full destination (so
prepare_target runs), then induce a failure after prepare_target but before
checkpoint completion (for example by making the destination unwritable or
creating a conflicting file inside the freshly prepared target to break the
subsequent link/copy steps) when calling tree.create_checkpoint; assert that
create_checkpoint returns an error, that the partially-created destination
directory is removed/cleaned up, and that the source tree (accessed via tree.get
and reopen via open_tree) still contains all data. Ensure the new assertions
reference create_checkpoint and the prepared target behavior so the test
verifies cleanup-on-failure semantics.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: c21dbfd7-893d-4247-b5b0-97e6bb93d0ad

📥 Commits

Reviewing files that changed from the base of the PR and between 413a296 and 580e44a.

📒 Files selected for processing (18)

src/abstract_tree.rs
src/blob_tree/mod.rs
src/checkpoint.rs
src/deletion_pause.rs
src/fs/io_uring_fs.rs
src/fs/mem_fs.rs
src/fs/mod.rs
src/fs/std_fs.rs
src/lib.rs
src/table/inner.rs
src/table/mod.rs
src/tree/inner.rs
src/tree/mod.rs
src/vlog/blob_file/mod.rs
src/vlog/blob_file/multi_writer.rs
src/vlog/blob_file/reader.rs
src/vlog/mod.rs
tests/checkpoint_pitr.rs

Copilot

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds a hard-link based checkpoint primitive to enable O(1)-per-file point-in-time recovery (PITR) snapshots of an open LSM tree, including correctness around concurrent compaction via deferred deletions.

Changes:

Introduces AbstractTree::create_checkpoint(...) -> CheckpointInfo and shared checkpoint driver (src/checkpoint.rs).
Adds Fs::hard_link(src, dst) with backend implementations (StdFs EXDEV→copy fallback, MemFs copy semantics, IoUringFs delegation).
Implements DeletionPause and wires it into Table/BlobFile drops to defer deletions during checkpoints; adds end-to-end PITR tests.

Reviewed changes

Copilot reviewed 18 out of 18 changed files in this pull request and generated 7 comments.

Show a summary per file

File	Description
tests/checkpoint_pitr.rs	New end-to-end checkpoint/PITR integration tests including concurrent write/compaction scenarios.
src/vlog/mod.rs	Ensures recovered `BlobFile::Inner` carries `fs` + deletion pause state.
src/vlog/blob_file/reader.rs	Updates blob-file reader test setup to include new `Inner` fields.
src/vlog/blob_file/multi_writer.rs	Ensures newly created blob files carry `fs` + deletion pause state.
src/vlog/blob_file/mod.rs	Adds `DeletionPause` plumbing + `Fs` to blob-file inner; defers deletes in `Drop`.
src/tree/mod.rs	Implements `create_checkpoint` for `Tree`; installs deletion pause on recovered/new tables/blobs.
src/tree/inner.rs	Adds tree-wide `deletion_pause` to `TreeInner`.
src/table/mod.rs	Adds `install_deletion_pause` + `OnceLock` storage for the pause in tables.
src/table/inner.rs	Defers table file deletion in `Drop` when checkpoint pause is active.
src/lib.rs	Exposes `CheckpointInfo`; adds internal `checkpoint` + `deletion_pause` modules.
src/fs/std_fs.rs	Adds `hard_link` implementation with EXDEV detection + copy fallback; tests.
src/fs/mod.rs	Adds `Fs::hard_link` default (Unsupported) + documentation.
src/fs/mem_fs.rs	Implements `hard_link` as an independent byte copy for MemFs; tests.
src/fs/io_uring_fs.rs	Delegates `hard_link` to StdFs.
src/deletion_pause.rs	New refcounted deletion gate with queue draining on last guard drop; unit tests.
src/checkpoint.rs	New checkpoint implementation: prepare target, pause deletions, flush, link/copy files, copy metadata, fsync.
src/blob_tree/mod.rs	Implements `create_checkpoint` for `BlobTree` (includes blob files).
src/abstract_tree.rs	Adds `CheckpointInfo` type + `create_checkpoint` API docs on `AbstractTree`.

codecov · 2026-05-19T21:38:13Z

Codecov Report

❌ Patch coverage is 84.00000% with 108 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
src/checkpoint.rs	87.45%	33 Missing ⚠️
src/fs/mod.rs	0.00%	17 Missing ⚠️
src/vlog/blob_file/mod.rs	48.27%	15 Missing ⚠️
src/deletion_pause.rs	92.76%	11 Missing ⚠️
src/fs/io_uring_fs.rs	0.00%	9 Missing ⚠️
src/fs/mem_fs.rs	84.00%	8 Missing ⚠️
src/fs/std_fs.rs	85.71%	6 Missing ⚠️
src/tree/mod.rs	92.64%	5 Missing ⚠️
src/table/inner.rs	50.00%	3 Missing ⚠️
src/vlog/blob_file/merge.rs	85.71%	1 Missing ⚠️

📢 Thoughts on this report? Let us know!

Switches `Arc`, `AtomicU32`, `fmt::Debug` and `vec!` imports in the new checkpoint + deletion-pause modules to their `core::*` / `alloc::*` forms. `Mutex`, `PathBuf` and `std::io::{Read, Write}` stay on `std::*` because they have no `core` / `alloc` equivalent — these dependencies are inherited from the `Fs` trait that already lives in `std`. No behaviour change; the no-std-check error count stays flat (1206) but the new code stops contributing trivially-portable usages to the migration debt.

Concurrency / correctness fixes: - prepare_target uses atomic Fs::create_dir (new trait method) instead of exists()+create_dir_all, closing the TOCTOU window where two concurrent creators could race and both proceed thinking they claimed the directory. - run_checkpoint now writes a fresh `current` pointer for the captured version_id rather than copying the live source file. A concurrent flush/compaction can publish v<N+1> after current_version() snapshot, and the old code would copy that newer CURRENT into the checkpoint — leaving the snapshot pointing at a version it never linked. - Reorders visible_seqno.get() BEFORE current_version() so the recorded CheckpointInfo.seqno is a lower bound on the snapshot's contents. Concurrent writers can no longer advance the cutoff past data that did not make it into the checkpoint. - fsync_directory is now called on tables/ and blobs/ (when populated) before the root, so directory entries for the hard-linked files survive a power loss instead of relying on the root fsync alone. - PartialCheckpointGuard removes the target directory on any early return from run_checkpoint, keeping retries against the same path idempotent. - link_or_copy_cross_fs retries on EINTR in the read loop, matching StdFs::copy_fallback and avoiding flaky failures under signal interruption. DeletionPause: - Drop for Pause re-checks `active` under the queue lock before draining, so a new pause that acquires between fetch_sub→0 and the lock cannot have its freshly-queued items stolen by the previous drop. Includes a regression test for the generation race. - Mutex poisoning in Drop and try_enqueue now recovers via PoisonError::into_inner instead of leaking deferred deletions permanently — the whole point of the type is to keep the filesystem tidy, so leaving a queue stranded on poison defeats the purpose. - Switches test-level #[allow(unwrap_used)] to #[expect(...)] so the suppression warns if it ever becomes unnecessary. Docs / API: - Adds Fs::create_dir (atomic single-leaf mkdir) with Unsupported default + concrete impls on StdFs, MemFs, IoUringFs. - abstract_tree.rs docstring spells out the cross-filesystem copy fall-back so callers using level_routes or detached backup volumes know to expect linear-time / extra-space behaviour in those cases. - std_fs::is_cross_device docstring now explains explicitly why we do NOT classify PermissionDenied as a cross-device signal. - Adds manual Debug impl on vlog::blob_file::Inner now that the Arc<dyn Fs> + OnceLock fields prevent #[derive(Debug)]. Tests: - MemFs hard_link test now mutates one path and asserts the other is unchanged, proving independent backing buffers (the old assertion on remove_file would have passed even with Arc<Mutex<Vec<u8>>> aliasing). - checkpoint_pitr.rs replaces filter_map(Result::ok) with collected Result<Vec<_>> in three places so I/O failures during verification surface instead of being silently dropped. - checkpoint_flattens_level_routes asserts the routed tier received an SST BEFORE checkpointing, so a regression that ignored level_routes would no longer pass silently. - compaction_during_checkpoint_preserves_source_ssts wraps the compactor thread in a RAII guard so an early-returning ? cleans up the worker instead of detaching it. - Removes a tautological assert_eq!(info.sst_files, info.sst_files) and replaces it with a meaningful sst_files >= 1 invariant. - cross_fs_link_or_copy_streams_through_trait now mutates the destination (not a different file) and verifies the StdFs source is unchanged AND the MemFs destination reflects the write. - checkpoint_failure_leaves_source_intact splits into early-reject and retry-after-cleanup paths.

sw-release-bot

⚠️ Performance Alert ⚠️

Possible performance regression was detected for benchmark 'lsm-tree db_bench'.
Benchmark result of this commit is worse than the previous benchmark result exceeding threshold 1.15.

Benchmark suite	Current: `9808f74`	Previous: `413a296`	Ratio
`mergerandom`	`382152.66922789026` ops/sec (normalized)	`470343.5799479985` ops/sec (normalized)	`1.23`

This comment was automatically generated by workflow using github-action-benchmark.

CC: @polaz

coderabbitai

Actionable comments posted: 6

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/abstract_tree.rs`:
- Around line 33-36: Update the documentation for CheckpointInfo::seqno to state
it is a lower-bound watermark rather than the definitive maximum visible seqno:
change the doc comment on the CheckpointInfo struct (and the other doc block
around the second occurrence) to say something like "A lower-bound watermark:
this seqno is guaranteed to be included in the checkpoint; later seqnos may also
be present due to sampling before flush/snapshot, so callers must treat it as an
inclusive minimum rather than the absolute maximum." Ensure you update the
comment attached to the field named seqno (CheckpointInfo::seqno) and any
repeated explanatory doc blocks in the same file to use this wording.

In `@src/checkpoint.rs`:
- Around line 364-370: The partial-checkpoint cleanup guard is created only
after prepare_target(...) returns which can leak target_root if prepare_target
creates the root but then fails creating child dirs; move the
PartialCheckpointGuard::new(target_root, target_fs) to immediately after the
root directory is claimed (i.e. as soon as prepare_target acquires/creates
target_root) or modify prepare_target(...) to install its own RAII guard for the
subdirectory-creation phase so any early return cleans up the partial root;
ensure the guard is still disarmed by commit() after fsync_directory succeeds
and keep references to target_root and target_fs consistent.
- Around line 201-208: Replace the exists-then-open TOCTOU pattern by opening
the source file directly: instead of calling src_fs.exists(&src) then
src_fs.open(&src, ...), call src_fs.open(&src, &FsOpenOptions::new().read(true))
first and match its Result; if it Errs with a NotFound/NotFound-equivalent from
the filesystem API, return Ok(()) (treat missing source as ignorable), otherwise
propagate the error and proceed to open/create the destination (dst) and
continue; remove the prior src_fs.exists call and ensure error handling checks
the filesystem's NotFound variant when handling the open of src (use the same
src, src_root, file_name, src_fs, FsOpenOptions, and src_file identifiers to
locate the change).

In `@tests/checkpoint_pitr.rs`:
- Around line 292-299: The compactor thread errors are currently being
swallowed: calls to major_compact(...) and the thread join in
CompactorGuard::drop discard results; change those points (references:
major_compact, CompactorGuard, Drop, handle, join, stop) to propagate or surface
failures instead of ignoring them — e.g., have major_compact return Result and
unwrap/expect or send the Err back to the test harness, and replace "let _ =
h.join()" with h.join().expect("compactor thread panicked") (or propagate the
Box<Any> to fail the test); apply the same treatment to the other join site
noted around the 306-309 region so any compaction error or panic fails the test
rather than being hidden.
- Around line 96-109: Capture the CheckpointInfo returned by
tree.create_checkpoint (don't discard it into _info), then use info.seqno to
bound the assertions: iterate keys whose seqno is <= info.seqno and assert each
is present in restored (use restored.get(..., lsm_tree::SeqNo::MAX) to read the
snapshot), while making no assertions about keys with seqno > info.seqno; update
the variable name from _info to info and replace the fixed 0..50 loop with a
loop that uses info.seqno as the upper bound (casting types as needed).

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: f9ce0ddf-0862-4fac-b9b9-33d3421f9c86

📥 Commits

Reviewing files that changed from the base of the PR and between 580e44a and dd23588.

📒 Files selected for processing (9)

src/abstract_tree.rs
src/checkpoint.rs
src/deletion_pause.rs
src/fs/io_uring_fs.rs
src/fs/mem_fs.rs
src/fs/mod.rs
src/fs/std_fs.rs
src/vlog/blob_file/mod.rs
tests/checkpoint_pitr.rs

Copilot

Pull request overview

Copilot reviewed 18 out of 18 changed files in this pull request and generated 5 comments.

Correctness / safety: - prepare_target now arms a local RAII cleanup guard immediately after the atomic create_dir(target) succeeds. If creating tables/ or blobs/ subsequently fails, the freshly-claimed root is removed so retries against the same path don't hit AlreadyExists. - link_or_copy_cross_fs replaces the Arc::ptr_eq fast-path with a "try hard_link first, fall back on NotFound/Unsupported" strategy. Two Arc<dyn Fs> handles backed by the same kernel filesystem (e.g. an Arc::new(StdFs) created independently in level_routes versus the primary config.fs) used to be pessimised into an O(bytes) byte copy; they now hard-link in O(1) as intended. - copy_metadata_file opens the source directly instead of exists() + open() (closes the TOCTOU window where the file disappears between the two calls). It also distinguishes Required vs Optional metadata: v<id> is now Required so a concurrent version-GC that removes the captured v<id> fails the checkpoint loudly instead of producing an unopenable snapshot. Docs: - CheckpointInfo.seqno docstring rewritten as a lower-bound watermark ("guaranteed-included, may also contain later seqnos"). The previous "maximum visible seqno" wording could mislead PITR consumers into treating it as a replay ceiling. Tests: - DeletionPause race test now uses two mpsc channels to pin the invariant check at the exact moment Thread B holds an active pause and has enqueued. Without the fix, A's drain would have already swept B's file before the check runs; the buggy implementation now fails the assertion mid-flight instead of producing the same final state as the fixed one. - compaction_during_checkpoint_preserves_source_ssts: CompactorGuard now propagates panics via resume_unwind and surfaces Err returns as test failures (with thread::panicking() guards to avoid double- panic). The compactor loop itself uses ? on major_compact instead of swallowing errors with `let _ = ...`. - checkpoint_survives_concurrent_writes captures CheckpointInfo and asserts the lower-bound watermark contract: every key with seqno <= info.seqno MUST be present in the restored checkpoint (keys above the watermark make no claim either way). - checkpoint_failure_leaves_source_intact (Unix path) now genuinely exercises the post-prepare failure scenario by chmod-ing the source tables/ directory to 0o000 between flush and create_checkpoint. link(2) requires search (x) permission on every path component, so prepare_target succeeds and link_tables then fails with EACCES, exercising PartialCheckpointGuard's cleanup. The non-Unix branch falls back to the previous success-then-cleanup-then-retry chain.

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/abstract_tree.rs`:
- Around line 37-39: The doc comment for the watermark around
AbstractTree::current_version contains contradictory wording: it claims the
watermark's seqno is "≤ every seqno reflected in the checkpointed SSTs" which
conflicts with the lower-bound semantics used elsewhere; update the comment on
the watermark (the captured seqno described just before
AbstractTree::current_version) to either state it is a lower bound (i.e., ≤ the
eventual maximum reflected seqno) or remove the misleading sentence and keep
only the explicit lower-bound guarantee — ensure references to
AbstractTree::current_version and "watermark"/"seqno" are adjusted so the
comment clearly says the value is captured before current_version and is a
lower-bound on the maximum seqno reflected in checkpointed SSTs.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 7e645e32-fb47-48d2-9add-be8da3c7ebcc

📥 Commits

Reviewing files that changed from the base of the PR and between dd23588 and 8fde911.

📒 Files selected for processing (4)

src/abstract_tree.rs
src/checkpoint.rs
src/deletion_pause.rs
tests/checkpoint_pitr.rs

- `is_cross_device_detects_exdev_and_kind_variants` now gates the raw `from_raw_os_error(18)` assertion behind `#[cfg(unix)]`. errno 18 is EXDEV on Linux/macOS/BSDs but means ERROR_NO_MORE_FILES on Windows, and the `#[cfg(unix)]` branch in `is_cross_device` is not compiled for non-Unix targets, so the assertion was guaranteed to fail on Windows test runners. The ErrorKind variant assertions stay cross-platform. - `CheckpointInfo::seqno` docstring previously said the watermark is "≤ every seqno reflected" which conflicts with the lower-bound semantics described in the next paragraph. Rewritten as "≤ the *maximum* seqno reflected" so PITR consumers can't misread it as a per-record ceiling.

Copilot

Pull request overview

Copilot reviewed 18 out of 18 changed files in this pull request and generated 5 comments.

…noise - `crate::checkpoint` is now `pub(crate)` instead of `pub` + `#[doc(hidden)]`. The cross-Fs `link_or_copy_cross_fs` integration test is moved into inline unit tests inside `src/checkpoint.rs`, so the helpers no longer leak into the stable public API surface. - `MemFs::create_dir` now delegates to `ensure_parent_dir`, so a parent that's actually a regular file produces a "parent is not a directory" error (matching POSIX `ENOTDIR`) instead of an ambiguous `NotFound`. - `DeletionPause::new()` returns `Self` (plain ownership); a new `new_shared() -> Arc<Self>` convenience covers the install-into-tree case. Brings the constructor in line with the rest of the crate's ownership-explicit style. - `StdFs::hard_link` now logs the cross-device fallback at `debug!` rather than `warn!`. A tier-misconfigured checkpoint can hit this path once per SST and once per blob — per-file warnings would drown real signal. Per-checkpoint summary remains the responsibility of the driver. - `is_cross_device` replaces the bare literal `18` with a named `EXDEV` constant + explanatory comment (matches the existing `flock` pattern that also declares POSIX values inline rather than pulling in `libc`).

coderabbitai

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/checkpoint.rs`:
- Around line 513-518: The code currently fsyncs tables, blobs (when
include_blobs), and target_root via fsync_directory but never fsyncs the parent
directory of target_root, so the checkpoint directory entry can be lost; after
the existing fsync_directory(target_root, &**target_fs)? call, add a final
fsync_directory on target_root.parent() (handle the Option safely—return an
error if None or propagate) using the same target_fs to ensure the directory
entry for the checkpoint is durable; reference the existing fsync_directory
function and the target_root/TABLES_FOLDER/BLOBS_FOLDER logic so the new
parent-directory fsync occurs as the last step before returning success.

In `@src/fs/std_fs.rs`:
- Around line 258-281: In copy_fallback, ensure we remove any partially-written
destination if a post-create error occurs: after
OpenOptions::new().write(true).create_new(true).open(dst) succeeds, if any
subsequent operation (read/write error, io::ErrorKind::Interrupted aside, or
dst_file.sync_all()) returns Err, best-effort remove the dst file (e.g.,
std::fs::remove_file(dst)) before returning the error; alternatively implement
writing to a temporary file (e.g., dst with a unique suffix) and only rename
(std::fs::rename) to dst after sync_all() succeeds, and ensure the temporary
file is removed on any failure — update error paths around src_file.read,
dst_file.write_all, and dst_file.sync_all to perform the cleanup.

In `@tests/checkpoint_pitr.rs`:
- Around line 340-364: The compactor thread may not have begun calling
major_compact before create_checkpoint runs; add a one-shot handshake so the
test waits until the worker has entered its loop (and attempted at least one
major_compact) before calling create_checkpoint. Modify the thread spawn block
that calls major_compact (the closure moved into thread::spawn referencing
compactor_tree, compactor_stop, major_compact) to take a
std::sync::mpsc::Sender<()> (or a oneshot channel) and send a single "ready"
signal from inside the loop before the first sleep/iteration; in the main test
code, create the Receiver and block on recv() (or try_recv/timeouts as
appropriate) right after spawning the thread and before calling
tree.create_checkpoint(&dst_path) so the checkpoint is gated until the compactor
loop has actually started. Ensure CompactorGuard construction and error paths
still behave the same.
- Around line 85-97: The test races: the writer thread may not perform any
post-50 writes before create_checkpoint runs; add a one-shot synchronization so
the main thread waits for the writer to perform its first insert/flush before
calling tree.create_checkpoint. Concretely, create a std::sync::mpsc::channel
(or oneshot) and move the Sender into the writer thread; after the writer does
its first insert or flush (e.g. right after the first writer_tree.insert(...) or
after the first writer_tree.flush_active_memtable(0)?), call sender.send(()) to
signal readiness; in the test thread call receiver.recv().unwrap() (or try_recv
with timeout) before invoking tree.create_checkpoint(&dst_path)?; keep existing
writer, writer_tree, insert, flush_active_memtable, create_checkpoint and
writer.join usage unchanged otherwise.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: caf4b740-58fa-485b-bcc3-ba55f9421a8d

📥 Commits

Reviewing files that changed from the base of the PR and between 5c5db06 and efa793b.

📒 Files selected for processing (8)

src/checkpoint.rs
src/deletion_pause.rs
src/fs/mem_fs.rs
src/fs/std_fs.rs
src/lib.rs
src/tree/inner.rs
src/tree/mod.rs
tests/checkpoint_pitr.rs

…hakes - run_checkpoint now fsync's the directory CONTAINING target_root after the root sync. Without this the checkpoint directory entry can still vanish after a power loss even though `create_checkpoint()` returned Ok and the children are durable. - copy_fallback wraps its read/write/sync_all sequence in an inner closure and best-effort `remove_file(dst)` on any post-`create_new` error. Previously an ENOSPC/EIO mid-copy left a truncated file behind, and the next retry would see AlreadyExists and surface a corrupt file from an operation that already reported failure. - checkpoint_survives_concurrent_writes adds an mpsc handshake: the writer thread sends `()` after its first post-flush insert and the main thread blocks on `recv()` before calling create_checkpoint. A fast scheduler could otherwise finish the snapshot before the writer touched the tree, collapsing the test into a single-threaded smoke test. - compaction_during_checkpoint_preserves_source_ssts adds the same handshake for the compactor — the worker sends `()` from inside its loop (after the announced flag flips) so we know major_compact has actually started before the checkpoint races it.

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/checkpoint.rs`:
- Around line 482-496: The checkpoint seqno can advance after
flush_active_memtable but before current_version, causing CheckpointInfo::seqno
to exceed the actual persisted checkpoint contents; fix by computing the
checkpoint cutoff from the persisted snapshot rather than the live
visible_seqno: capture a lower-bound seqno before mutating visible_seqno (or
read it prior to calling tree.flush_active_memtable), or obtain the seqno from
the version returned by tree.current_version() and use that for
CheckpointInfo::seqno so the recorded seqno never outruns the snapshot; update
the code that sets captured_seqno and the use of tree.current_version()
accordingly (references: visible_seqno, tree.flush_active_memtable,
tree.current_version, CheckpointInfo::seqno).

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: c7cb5d1f-89b7-4339-a07e-4e4c10c725c1

📥 Commits

Reviewing files that changed from the base of the PR and between 5c5db06 and efa793b.

📒 Files selected for processing (8)

src/checkpoint.rs
src/deletion_pause.rs
src/fs/mem_fs.rs
src/fs/std_fs.rs
src/lib.rs
src/tree/inner.rs
src/tree/mod.rs
tests/checkpoint_pitr.rs

Copilot

Pull request overview

Copilot reviewed 18 out of 18 changed files in this pull request and generated 4 comments.

coderabbitai

♻️ Duplicate comments (1)

src/checkpoint.rs (1)

482-496: ⚠️ Potential issue | 🟠 Major

Capture the PITR cutoff before the checkpoint flush.

A writer that lands between Line 486 and Line 495 can advance visible_seqno, while current_version() at Line 496 still snapshots only persisted files. That lets CheckpointInfo::seqno outrun the actual checkpoint contents, which breaks the watermark contract for PITR consumers.

Suggested fix

-    tree.flush_active_memtable(SeqNo::MAX)?;
-
-    // Capture the seqno BEFORE the version snapshot. Reading
-    // `visible_seqno` after `current_version()` would let concurrent
-    // writers advance the counter past data that never made it into the
-    // checkpoint, so PITR consumers could move the recovery cutoff past
-    // records they still need from WAL / replication. The "before"
-    // ordering means `CheckpointInfo::seqno` is a lower bound on the
-    // seqno reflected in the version snapshot — safe for cutoffs.
-    let captured_seqno = visible_seqno.get();
+    // Capture the cutoff BEFORE the checkpoint-triggered flush. Any
+    // later write can still race into the new active memtable or even a
+    // later flush, but this value remains a safe lower bound for what
+    // the checkpoint is guaranteed to contain.
+    let captured_seqno = visible_seqno.get();
+    tree.flush_active_memtable(SeqNo::MAX)?;
     let version = tree.current_version();

As per coding guidelines, Flag concurrency bugs: data races on shared segment lists, lock ordering violations, missing synchronization on manifest updates (Tier 1 — MUST flag).

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/checkpoint.rs` around lines 482 - 496, The code captures visible_seqno
after flushing the active memtable and taking the version snapshot which allows
a concurrent writer to advance visible_seqno and violate the PITR watermark;
move the read of visible_seqno.get() (the captured_seqno used for
CheckpointInfo::seqno) to before calling tree.flush_active_memtable(SeqNo::MAX)
and before tree.current_version() so the checkpoint's seqno is a true lower
bound on the snapshot; ensure CheckpointInfo::seqno is assigned from that
earlier-captured_seqno and not from a later read.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Duplicate comments:
In `@src/checkpoint.rs`:
- Around line 482-496: The code captures visible_seqno after flushing the active
memtable and taking the version snapshot which allows a concurrent writer to
advance visible_seqno and violate the PITR watermark; move the read of
visible_seqno.get() (the captured_seqno used for CheckpointInfo::seqno) to
before calling tree.flush_active_memtable(SeqNo::MAX) and before
tree.current_version() so the checkpoint's seqno is a true lower bound on the
snapshot; ensure CheckpointInfo::seqno is assigned from that
earlier-captured_seqno and not from a later read.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 1ecb5c09-8bae-4027-a180-59053efda983

📥 Commits

Reviewing files that changed from the base of the PR and between efa793b and 5faae35.

📒 Files selected for processing (3)

src/checkpoint.rs
src/fs/std_fs.rs
tests/checkpoint_pitr.rs

- run_checkpoint now samples visible_seqno BEFORE flush_active_memtable instead of between flush and current_version(). Previous ordering left a window where a concurrent writer could land in the freshly-rotated active memtable, advance the counter past data that the SST snapshot doesn't contain, and push CheckpointInfo.seqno above the actual captured contents. The "before flush" ordering keeps the captured seqno a strict lower bound on what's in the snapshot regardless of what writers do during the flush+snapshot window. - AbstractTree::create_checkpoint, Fs::hard_link, and link_or_copy_cross_fs docstrings now say "logged via log::debug" instead of "log::warn", matching StdFs::hard_link which was downgraded earlier to avoid per-file log spam on tier-misconfigured checkpoints. Adds a note that operator visibility for unexpected copies is the checkpoint driver's job (CheckpointInfo.total_bytes ≫ inode link cost is the actionable signal).

coderabbitai · 2026-05-20T15:51:53Z

Tip

For best results, initiate chat on the files or code changes.

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
REF="feat/%23210-pitr-checkpoint"

echo "=== Commit dda582a1 summary ==="
gh api repos/structured-world/coordinode-lsm-tree/commits/dda582a1 --jq '.commit.message'
echo ""
echo "=== Files changed ==="
gh api repos/structured-world/coordinode-lsm-tree/commits/dda582a1/files --jq '.[] | {filename: .filename, additions: .additions, deletions: .deletions}'

echo ""
echo "=== tests/checkpoint_pitr.rs lines 110-165 ==="
gh api "repos/structured-world/coordinode-lsm-tree/contents/tests/checkpoint_pitr.rs?ref=$REF" \
  --jq '.content' | base64 -d | sed -n '110,165p'

Length of output: 4631

@polaz — confirmed against the live branch.

The assertion at the check site now reads u64::from(i) < info.seqno (strict <), and the surrounding comment block precisely explains why:

visible_seqno is "lowest excluded" — record seqno = N is committed iff visible_seqno ≥ N + 1, which is exactly seqno < visible_seqno.
The previous <= form was racy: insert(seqno = i) and fetch_max(i + 1) are two separate steps, so the checkpoint thread could observe visible_seqno = i between them, capture i as the watermark, and then demand record-i be present — even though the flush hadn't necessarily materialized it yet. Strict < removes that window entirely.

The "no claim about i >= info.seqno" note correctly bounds what the test asserts, and the reasoning matches the fetch_max(i + 1) semantic throughout. Fix is correct and the cross-target flakes on aarch64/riscv64/musl/i686/ppc64 should be gone. ✅

(¯·.¸¸.·´¯·🐇·´¯`·.¸¸.·´¯)

Copilot

Pull request overview

Copilot reviewed 23 out of 23 changed files in this pull request and generated 1 comment.

CheckpointInfo::seqno is captured from `visible_seqno.get()` BEFORE the version snapshot. visible_seqno is the standard 'lowest-excluded' watermark — `visible_seqno = N` means records with seqno < N are visible; seqno == N may or may not be visible yet. The doc previously claimed 'every record with seqno <= info.seqno is included', which contradicted the captured semantic and the (now-correct) PITR test assertion (`i < info.seqno` in tests/checkpoint_pitr.rs:152). A PITR consumer reading the doc literally could move a recovery cutoff past data still needed from WAL or replication — the field is a strict lower-exclusive watermark, not a max-included ceiling. Rewrite the field doc to: - explain it's a lowest-excluded watermark - spell out the inclusion gate as `seqno < info.seqno` - note seqno == info.seqno and seqno > info.seqno are undetermined - warn against using `<=` as the inclusion gate

Copilot

Pull request overview

Copilot reviewed 23 out of 23 changed files in this pull request and generated 3 comments.

…ross-Fs copy failure Three CodeRabbit findings, all in PR files: 1. src/vlog/blob_file/mod.rs:60, src/table/inner.rs (deletion_pause) New `std::sync::OnceLock<Arc<DeletionPause>>` fields regressed the crate's no-std direction (OnceLock has no alloc-only equivalent without the std feature). Switch to `once_cell::sync::OnceCell`, which is already pulled in for the zstd path and is one of the approved no-std-friendly primitives. The same swap applies to all construction sites (table/mod.rs, vlog/mod.rs, vlog/blob_file/multi_writer.rs, vlog/blob_file/reader.rs). Cargo.toml: once_cell is no longer feature-gated under zstd — it's an unconditional dep with default-features = false. 2. src/checkpoint.rs (link_or_copy_cross_fs streamed-copy) The cross-Fs streamed-copy path didn't clean up the partial `dst` on mid-copy failure (ENOSPC/EIO/EINTR-not-recoverable in read/write/flush/fsync). A subsequent retry would hit create_new's AlreadyExists check and surface a wholly different error, masking the real cause. Wrap the copy in an inner closure so any failure path goes through a best-effort `dst_fs.remove_file(dst)` before propagating the original error. PartialCheckpointGuard handles target-dir cleanup on the normal failure path, but link_or_copy_cross_fs is also reachable from cross-Fs tests and future direct callers — the local cleanup is the safer invariant. 3. src/table/inner.rs:134, src/vlog/blob_file/mod.rs (try_enqueue short-circuit) The Drop impls cloned `Arc<dyn Fs>` + PathBuf unconditionally even when the pause was installed but inactive — the common compaction-cleanup case. `try_enqueue` re-checks is_active() internally, but the clones were already paid. Add an outer `pause.is_active()` short-circuit so the common path skips both allocations. The under-lock recheck in try_enqueue stays — the outer check is a perf gate, not a correctness one. No regression test for #2 (cleanup-on-error) — requires a custom failing-write Fs wrapper that the existing test infrastructure doesn't provide. The cleanup logic is a straightforward best-effort unlink in an error path; correctness is verifiable by reading the new match arm.

coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

Cargo.toml (1)
91-99: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Keep once_cell gated to the zstd feature to prevent no_std regression.

The Cargo.toml comment states once_cell is "Optional + gated by the zstd feature," but the actual dependency is unconditional with features = ["std"], which pulls a std-bound dependency into --no-default-features --features alloc builds. The new deletion_pause fields use only new(), get(), and set() — all available on std::sync::OnceLock in Rust 1.92 — so they don't require once_cell's get_or_try_init. Only the zstd code in src/compression/mod.rs needs get_or_try_init, which is already feature-gated. Align the dependency to match the comment: gate once_cell to the zstd feature and use OnceLock (or a cfg-gated alias) for the new deletion-pause fields to avoid breaking the no_std CI constraint.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@Cargo.toml` around lines 91 - 99, The Cargo.toml currently pulls once_cell
unconditionally which breaks no_std builds; change the dependency to be optional
and feature-gated and switch the new deletion-pause code to use
std::sync::OnceLock (or a cfg-gated alias) so only zstd uses once_cell.
Concretely: make once_cell optional (once_cell = { version = "1",
default-features = false, features = ["std"], optional = true }) and add it to
the zstd feature (features.zstd should include "once_cell"), and update code
references so ZstdDictionary::prepared_handle and src/compression/mod.rs keep
using once_cell::OnceCell/get_or_try_init under #[cfg(feature = "zstd")] while
the new deletion_pause fields use std::sync::OnceLock (or a small cfg alias
type) under non-zstd/no_std builds.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@Cargo.toml`:
- Around line 91-99: The Cargo.toml currently pulls once_cell unconditionally
which breaks no_std builds; change the dependency to be optional and
feature-gated and switch the new deletion-pause code to use std::sync::OnceLock
(or a cfg-gated alias) so only zstd uses once_cell. Concretely: make once_cell
optional (once_cell = { version = "1", default-features = false, features =
["std"], optional = true }) and add it to the zstd feature (features.zstd should
include "once_cell"), and update code references so
ZstdDictionary::prepared_handle and src/compression/mod.rs keep using
once_cell::OnceCell/get_or_try_init under #[cfg(feature = "zstd")] while the new
deletion_pause fields use std::sync::OnceLock (or a small cfg alias type) under
non-zstd/no_std builds.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 31bdeea2-8c29-4478-b66c-7b180af3f190

📥 Commits

Reviewing files that changed from the base of the PR and between 9808f74 and 4be0ddb.

📒 Files selected for processing (8)

Cargo.toml
src/checkpoint.rs
src/table/inner.rs
src/table/mod.rs
src/vlog/blob_file/mod.rs
src/vlog/blob_file/multi_writer.rs
src/vlog/blob_file/reader.rs
src/vlog/mod.rs

coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

Cargo.toml (1)
91-99: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Gate once_cell behind your crate's std feature or use a no-std-capable primitive.

The dependency is declared unconditionally with features = ["std"] (no optional = true), making std a transitive dependency for all builds including --no-default-features --features alloc. This contradicts the manifest's feature contract and documented no-std migration strategy. The deletion_pause field appears unconditionally across multiple modules (vlog, table, tree, checkpoint), so either:

Make once_cell optional and gate it (e.g., once_cell = { version = "1", optional = true, default-features = false, features = ["std"] } + add to std feature), or

Switch to once_cell::race::OnceBox (already noted as your no-std migration target) if no std dependency is acceptable.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@Cargo.toml` around lines 91 - 99, The Cargo.toml currently unconditionally
pulls once_cell with std via the dependency line `once_cell = { version = "1",
default-features = false, features = ["std"] }`, which breaks the crate's no-std
feature contract; to fix, make the dependency optional and gate it behind your
crate-level `std` feature (e.g., mark once_cell optional and add it to the `std`
feature) or replace usage with a no-std primitive such as
`once_cell::race::OnceBox`; update any code paths that rely on `once_cell`
(notably `ZstdDictionary::prepared_handle` and modules referencing the
`deletion_pause` field in vlog/table/tree/checkpoint) to be conditional on the
`std` feature or to use the no-std alternative so that `--no-default-features
--features alloc` builds do not pull in `std`.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@Cargo.toml`:
- Around line 91-99: The Cargo.toml currently unconditionally pulls once_cell
with std via the dependency line `once_cell = { version = "1", default-features
= false, features = ["std"] }`, which breaks the crate's no-std feature
contract; to fix, make the dependency optional and gate it behind your
crate-level `std` feature (e.g., mark once_cell optional and add it to the `std`
feature) or replace usage with a no-std primitive such as
`once_cell::race::OnceBox`; update any code paths that rely on `once_cell`
(notably `ZstdDictionary::prepared_handle` and modules referencing the
`deletion_pause` field in vlog/table/tree/checkpoint) to be conditional on the
`std` feature or to use the no-std alternative so that `--no-default-features
--features alloc` builds do not pull in `std`.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: baacb669-ca46-4d4f-83b0-03d205a93a18

📥 Commits

Reviewing files that changed from the base of the PR and between 9808f74 and 4be0ddb.

📒 Files selected for processing (8)

Cargo.toml
src/checkpoint.rs
src/table/inner.rs
src/table/mod.rs
src/vlog/blob_file/mod.rs
src/vlog/blob_file/multi_writer.rs
src/vlog/blob_file/reader.rs
src/vlog/mod.rs

Copilot

Pull request overview

Copilot reviewed 23 out of 23 changed files in this pull request and generated 2 comments.

…iendly one-shot slots CodeRabbit Cargo.toml 91-93 thread: the previous `once_cell` config (`features = ["std"]`, unconditional) contradicted the dep comment ("optional + gated by zstd") AND pulled a std-bound dep into `--no-default-features --features alloc` builds, breaking the no-std-check invariant the rules want us to defend. Resolution: switch every one-shot slot in this crate to `once_cell::race::OnceBox` — the no-std + alloc variant — and pull `once_cell` unconditionally with `default-features = false, features = ["race"]`. This keeps the canonical `get_or_try_init`-with-retry-on-failure semantics that `std::sync::OnceLock` lacks below MSRV 1.86, without breaking `cargo check --no-default-features`. Sites touched: - Cargo.toml: `once_cell` is now unconditional (race-only feature), removed from the `zstd` feature set, comment rewritten to match. - src/compression/mod.rs (`ZstdDictionary::prepared_handle`): the zstd dictionary cache uses `OnceBox<DictionaryHandle>`. The closure returns `Box::new(handle)` per OnceBox API; the surrounding retry-on-failure contract is unchanged (the rejection-path test in the bench harness still pins it). - src/table/inner.rs + src/vlog/blob_file/mod.rs: the `deletion_pause` field is `OnceBox<Arc<DeletionPause>>`. Construction sites updated in table/mod.rs, vlog/mod.rs, blob_file/multi_writer.rs, blob_file/reader.rs. `install_deletion_pause` now passes `Box::new(pause)` to `OnceBox::set`. - `cached_blob_bytes: OnceLock<u64>` stays untouched — pre-existing pre-PR slot, not in scope for this thread. structured-zstd is already a no_std-capable dep (we just opt-in to its `std` feature via the `zstd` feature), so this change removes the last std-only reason once_cell had to be conditional on zstd. Verified: cargo check (default + --all-features), cargo nextest run (1330/1330 passed), cargo clippy --lib --tests --all-features (clean).

Copilot src/vlog/blob_file/merge.rs:37: the previous PartialEq impl compared only `scan_entry.key` while Ord ordered by `(key, Reverse(seqno))`. That violated the Rust standard-library `Ord` contract (a == b ⇔ a.cmp(b) == Equal): two entries with the same key + different seqno would test equal but cmp() would order them, which is the kind of latent footgun any future HashSet / BTreeSet / dedup() pass over IteratorValue would silently rely on. BinaryHeap (the active consumer since the IntervalHeap → BinaryHeap swap in dda582a) uses Ord only, so no observable behaviour changes today — but the contract is now actually upheld, and the comment calls out why so the next reader doesn't regress it. No regression test: the bug is a contract violation with no current caller that exercises Eq on IteratorValue. A test would assert the contract itself (`a == b iff a.cmp(b) == Equal`), which is what the new impl directly encodes — no executable behaviour to drive.

Copilot

Pull request overview

Copilot reviewed 24 out of 24 changed files in this pull request and generated 2 comments.

Two Copilot doc nits flagged after the OnceLock → OnceBox swap: 1. src/compression/mod.rs:84-91 — the doc block on `prepared` still described `OnceCell::get_or_try_init` semantics ("losers wait on the internal one-shot synchronisation primitive") even though the field is now `OnceBox`. Rewrite to describe the actual race mechanic: a single CAS on the slot pointer, winners publish their `Box<DictionaryHandle>`, losers drop their unused Box and read the winner's value next iteration. No mutex, no internal park. 2. Cargo.toml:91-93 — the comment said `std::sync::OnceLock` lacks the needed API "until 1.86, above our 1.92 MSRV", which is inverted: 1.86 is *below* 1.92, so OnceLock actually HAS `get_or_try_init` and `set` on our MSRV. The real reason for once_cell is no-std: OnceLock is std-only and cannot compile under `--no-default-features --features alloc`. Rewrite the comment so it states the actual reason and stops asserting an MSRV claim that contradicts itself.

Copilot

Pull request overview

Copilot reviewed 24 out of 24 changed files in this pull request and generated no new comments.

Copilot AI review requested due to automatic review settings May 19, 2026 21:26

coderabbitai Bot reviewed May 19, 2026

View reviewed changes

Copilot AI reviewed May 19, 2026

View reviewed changes

Copilot started reviewing on behalf of polaz May 19, 2026 21:35 View session

polaz added 2 commits May 20, 2026 00:46

sw-release-bot Bot reviewed May 19, 2026

View reviewed changes

polaz requested a review from Copilot May 19, 2026 22:26

Copilot started reviewing on behalf of polaz May 19, 2026 22:27 View session

coderabbitai Bot reviewed May 19, 2026

View reviewed changes

Comment thread src/abstract_tree.rs

Comment thread src/checkpoint.rs

Comment thread src/checkpoint.rs

Comment thread src/deletion_pause.rs

Comment thread tests/checkpoint_pitr.rs Outdated

Comment thread tests/checkpoint_pitr.rs

Copilot AI reviewed May 19, 2026

View reviewed changes

Comment thread src/checkpoint.rs

Comment thread src/checkpoint.rs Outdated

Comment thread src/checkpoint.rs Outdated

Comment thread tests/checkpoint_pitr.rs Outdated

Comment thread tests/checkpoint_pitr.rs Outdated

coderabbitai Bot reviewed May 19, 2026

View reviewed changes

Comment thread src/abstract_tree.rs Outdated

Copilot AI review requested due to automatic review settings May 19, 2026 22:55

Copilot started reviewing on behalf of polaz May 19, 2026 23:05 View session

Copilot AI reviewed May 19, 2026

View reviewed changes

Comment thread src/lib.rs Outdated

Comment thread src/fs/mem_fs.rs Outdated

Comment thread src/deletion_pause.rs Outdated

Comment thread src/fs/std_fs.rs

Comment thread src/fs/std_fs.rs

coderabbitai Bot reviewed May 19, 2026

View reviewed changes

Comment thread src/checkpoint.rs

Comment thread src/fs/std_fs.rs Outdated

Comment thread tests/checkpoint_pitr.rs

Comment thread tests/checkpoint_pitr.rs

Copilot AI review requested due to automatic review settings May 20, 2026 00:40

Copilot started reviewing on behalf of polaz May 20, 2026 00:41 View session

coderabbitai Bot reviewed May 20, 2026

View reviewed changes

Comment thread src/checkpoint.rs

Copilot AI reviewed May 20, 2026

View reviewed changes

Comment thread src/fs/mod.rs Outdated

Comment thread src/checkpoint.rs

Comment thread src/checkpoint.rs

Comment thread src/abstract_tree.rs

coderabbitai Bot reviewed May 20, 2026

View reviewed changes

polaz requested a review from Copilot May 20, 2026 01:46

polaz requested a review from Copilot May 20, 2026 15:52

Copilot started reviewing on behalf of polaz May 20, 2026 15:53 View session

Copilot AI reviewed May 20, 2026

View reviewed changes

Comment thread src/abstract_tree.rs Outdated

polaz requested a review from Copilot May 20, 2026 16:06

Copilot AI reviewed May 20, 2026

View reviewed changes

Comment thread src/vlog/blob_file/mod.rs Outdated

Comment thread src/checkpoint.rs Outdated

Comment thread src/table/inner.rs

Copilot started reviewing on behalf of polaz May 20, 2026 16:17 View session

coderabbitai Bot reviewed May 20, 2026

View reviewed changes

polaz requested a review from Copilot May 20, 2026 17:24

Copilot started reviewing on behalf of polaz May 20, 2026 17:29 View session

coderabbitai Bot reviewed May 20, 2026

View reviewed changes

Copilot AI reviewed May 20, 2026

View reviewed changes

Comment thread Cargo.toml Outdated

Comment thread src/vlog/blob_file/merge.rs

polaz added 2 commits May 20, 2026 21:04

polaz requested a review from Copilot May 20, 2026 18:22

Copilot started reviewing on behalf of polaz May 20, 2026 18:23 View session

Copilot AI reviewed May 20, 2026

View reviewed changes

Comment thread src/compression/mod.rs Outdated

Comment thread Cargo.toml Outdated

polaz requested a review from Copilot May 20, 2026 18:46

Copilot started reviewing on behalf of polaz May 20, 2026 18:47 View session

Copilot AI reviewed May 20, 2026

View reviewed changes

polaz merged commit 9dc90bf into main May 20, 2026
25 of 27 checks passed

polaz deleted the feat/#210-pitr-checkpoint branch May 20, 2026 18:56

sw-release-bot Bot mentioned this pull request May 19, 2026

chore: release v5.0.0 #272

Open

Conversation

polaz commented May 19, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Test plan

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented May 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

codecov Bot commented May 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

sw-release-bot Bot left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

⚠️ Performance Alert ⚠️

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

polaz commented May 19, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented May 19, 2026 •

edited

Loading

codecov Bot commented May 19, 2026 •

edited

Loading

sw-release-bot Bot left a comment •

edited

Loading