Skip to content

feat(checkpoint)!: hard-link snapshot for PITR backup (V5 storage)#276

Merged
polaz merged 42 commits into
mainfrom
feat/#210-pitr-checkpoint
May 20, 2026
Merged

feat(checkpoint)!: hard-link snapshot for PITR backup (V5 storage)#276
polaz merged 42 commits into
mainfrom
feat/#210-pitr-checkpoint

Conversation

@polaz
Copy link
Copy Markdown
Member

@polaz polaz commented May 19, 2026

Summary

Implements #210 — the consistent hard-link checkpoint primitive that unblocks point-in-time recovery (PITR) backup for CoordiNode and any other consumer that needs an O(1)-per-file snapshot of an open LSM-tree.

  • New trait method Fs::hard_link(src, dst) with a default Unsupported impl (non-breaking for downstream backends). StdFs performs a real hard_link(2) with an EXDEV→copy fallback; MemFs produces an independent copy; IoUringFs delegates to StdFs since linking is a metadata-only syscall.
  • New trait method Fs::backend_id() -> Option<u64> — explicit shared-namespace capability check that gates cross-Fs hard-link attempts. StdFs/IoUringFs return a common KERNEL_BACKEND_ID; each MemFs::new() allocates a unique per-instance id; default impl returns None for safe-by-default third-party backends.
  • DeletionPause — a tree-wide refcount-gated deferred-delete queue installed into each Table::Inner / BlobFile::Inner. While at least one Pause guard is alive, both Drop impls queue removals instead of executing them; the queue drains when the last guard is dropped. Mirrors RocksDB's DisableFileDeletions/EnableFileDeletions pattern. Synchronisation uses spin::Mutex so the type is no_std-compatible.
  • AbstractTree::create_checkpoint(target) -> CheckpointInfo — implemented on Tree, BlobTree, and (via enum_dispatch) AnyTree. The driver captures visible_seqno first, flushes the active memtable (eviction seqno 0 — never expands MVCC GC beyond what would have happened anyway), snapshots the current version, hard-links every live SST + blob into target/{tables,blobs}/, copies manifest / v<id> / current, and fsyncs the target root.
  • level_routes (tiered storage) is honoured automatically — each Table::Inner already carries its own routed Fs, so the link is performed through the right backend. Cross-Fs situations (e.g. MemFs source vs. StdFs target) transparently fall back to a streamed byte copy after the backend_id namespace check declines the hard-link path.
  • Tree::open now rejects a directory that has version artifacts (tables/, blobs/, vN) but no current pointer — the on-disk signature of a half-written checkpoint. Previously the code silently called create_new, which would have overwritten the partial state with an empty tree.

Test plan

cargo nextest run --all-features1327/1327 passed (6 environment-skipped), including 14 integration scenarios in tests/checkpoint_pitr.rs:

  • Round-trip — checkpoint + reopen as standalone tree reads every key back
  • Concurrent writes during checkpoint do not corrupt the snapshot
  • Deferred-delete invariant: concurrent major_compact cannot remove an SST captured by an in-flight checkpoint
  • BlobTree checkpoint captures both index SSTs and blob files
  • level_routes SSTs land in the flattened target/tables/ directory
  • Source vs. checkpoint isolation — writes on either side do not bleed across
  • Crash-safety — failed checkpoint leaves the source tree fully reopenable (chmod-driven link_tables failure)
  • Empty tree checkpoint succeeds with sst_files == 0
  • CheckpointInfo.total_bytes matches the on-disk sum
  • Re-running a checkpoint into the same target is rejected with AlreadyExists
  • MVCC regression — checkpoint-triggered flush must NOT drop history needed by source-tree snapshot readers (eviction seqno = 0, not SeqNo::MAX)
  • Half-written-checkpoint detection — missing current pointer rejected by Tree::open
  • Half-written-checkpoint detection — corrupt current pointer rejected by Tree::open
  • Cross-Fs (StdFsMemFs) link-or-copy streams through both trait objects (inline in src/checkpoint.rs)

Plus unit coverage for:

  • DeletionPause — defer / inactive-passthrough / nested-pauses / generation-race-under-lock

  • StdFs::is_cross_device — raw EXDEV + ErrorKind variants

  • Fs::hard_link — round-trip + duplicate / missing-source rejection (both StdFs and MemFs)

  • copy_fallback — independence + create-new semantics

  • cargo clippy --lib --tests clean

  • cargo test --doc passes for all touched modules

Closes #210

Summary by CodeRabbit

  • New Features

    • Point-in-time checkpoint snapshots (optional blob inclusion) reporting file counts, total bytes, version id and visibility seqno; checkpoints can be opened as independent trees.
  • Reliability / Bug Fixes

    • Atomic, durable checkpoint creation with hard-link-first and streamed-copy fallback (debug-logged); best-effort cleanup on failure; opening rejects half-written or corrupt checkpoints with clearer diagnostics.
  • Utilities

    • Checkpoint-aware gate to defer physical file removals during snapshot lifetimes.
  • Filesystem compatibility

    • Improved create-dir/hard-link handling with cross-filesystem copy fallback and namespace signaling.
  • Tests / Documentation

    • Extensive end-to-end, isolation and crash-safety checkpoint suite; README updated to document PITR checkpoints.

Review Change Stack


BREAKING CHANGE: on-disk format V5. V3/V4 databases are NOT readable by this version and vice versa. V5 ships the BuRR filter wire-format change (replaces standard bloom, #269) plus the PITR hard-link checkpoint primitive (this PR, #210). The major bump goes through release-plz on squash.

- Add `Fs::hard_link(src, dst)` with default `Unsupported` impl; concrete
  impls on `StdFs` (with EXDEV→copy fallback), `MemFs` (copy-as-link
  semantics) and `IoUringFs` (delegates to `StdFs`).
- Introduce `DeletionPause` — refcount-gated deferred-delete queue that
  holds back `Table` / `BlobFile` removals while a checkpoint is in
  flight, mirroring RocksDB's `DisableFileDeletions`/`EnableFileDeletions`.
- Add `AbstractTree::create_checkpoint(target) -> CheckpointInfo`,
  implemented on `Tree`, `BlobTree`, and (via enum_dispatch) `AnyTree`.
  The driver flushes the active memtable, snapshots the current version,
  hard-links every live SST + blob into `target/{tables,blobs}/`, copies
  manifest / version / current-pointer, and fsyncs the target root.
- `level_routes` tiers are honoured by linking through each table's own
  `Fs` handle; cross-`Fs` situations fall back to a streamed byte copy.
- Tests: 11 integration scenarios in `tests/checkpoint_pitr.rs` (round-trip,
  concurrent writes, BlobTree, level_routes, isolation, crash-safety,
  deferred-delete vs concurrent major_compact, empty tree, byte-accuracy,
  double-target rejection, cross-Fs copy) plus unit coverage for
  `is_cross_device` and `DeletionPause`.

Closes #210
Copilot AI review requested due to automatic review settings May 19, 2026 21:26
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 19, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 090e446d-7305-4e10-9522-78a314e9e63b

📥 Commits

Reviewing files that changed from the base of the PR and between 9d839cc and d767602.

📒 Files selected for processing (2)
  • Cargo.toml
  • src/compression/mod.rs

📝 Walkthrough

Walkthrough

Creates PITR checkpoints: adds Fs mkdir/hard-link primitives, a refcounted DeletionPause to defer deletions, a checkpoint driver that hard-links or copies SST/blob files and metadata, wires pause into Tree/BlobTree lifecycles, and adds end-to-end PITR tests.

Changes

Checkpoint PITR Feature

Layer / File(s) Summary
Filesystem hard-link trait and implementations
src/fs/mod.rs, src/fs/std_fs.rs, src/fs/io_uring_fs.rs
Fs trait adds create_dir, hard_link, and backend_id. StdFs attempts std::fs::hard_link and falls back to buffered copy on cross-device errors with detection and copy-fallback logic; IoUringFs delegates to StdFs.
MemFs namespace & tests
src/fs/mem_fs.rs
Adds namespace_id, backend_id, create_dir, and hard_link emulation with tests validating independent-copy semantics, AlreadyExists, and NotFound.
Deletion pause mechanism
src/deletion_pause.rs
DeletionPause is a ref-counted gate that queues deletions while active; acquire() returns a Pause guard and the last-holder drop drains queued deletions with generation-race protection. Includes unit and concurrency tests.
Checkpoint core orchestration
src/checkpoint.rs
Implements checkpoint utilities: prepare_target creates atomic target layout; link_or_copy_cross_fs hard-links or stream-copies across Fs backends; link_tables and link_blob_files capture live files counting files/bytes; copy_metadata replicates manifest/v/CURRENT (rewriting CURRENT atomically); run_checkpoint flushes memtable, acquires deletion pause, links/copies files, fsyncs child dirs then root, and returns CheckpointInfo. RAII guard removes partial checkpoints on failure.
Tree wiring and pause installation
src/tree/inner.rs, src/tree/mod.rs, src/table/inner.rs, src/table/mod.rs
TreeInner owns a shared DeletionPause created at construction. Recovery and register_tables install the pause into recovered/newly registered tables/blobs. Tree::create_checkpoint delegates to checkpoint::run_checkpoint with include_blobs: false. Table drops defer physical file removal when the pause enqueues deletions.
BlobTree and blob-file deletion pause
src/blob_tree/mod.rs, src/vlog/blob_file/mod.rs, src/vlog/blob_file/multi_writer.rs, src/vlog/blob_file/reader.rs, src/vlog/mod.rs
BlobTree::create_checkpoint delegates to run_checkpoint with include_blobs: true. BlobFileInner stores the Fs handle and a OnceBox<DeletionPause>; BlobFile::install_deletion_pause installs the pause idempotently. Blob writer construction and recovery initialize these fields; blob Drop defers or performs removal in a checkpoint-aware manner.
Public API: CheckpointInfo and trait method
src/abstract_tree.rs
Adds CheckpointInfo (sst_files, blob_files, total_bytes, version_id, seqno) and extends AbstractTree with create_checkpoint(&Path) -> Result<CheckpointInfo> documenting hard-link semantics, cross-FS fallback logging, concurrency/deferred deletion behavior, and failure/cleanup characteristics.
Crate-level modules & deps
src/lib.rs, Cargo.toml
Declares crate-visible deletion_pause and checkpoint modules, re-exports CheckpointInfo at the crate root, and updates Cargo features/dependencies (spin, once_cell) to support no-std/alloc changes.
Blob merge heap change (alloc/no-std)
src/vlog/blob_file/merge.rs
Replaces interval_heap::IntervalHeap with alloc::collections::BinaryHeap<Reverse<...>>, updates ordering and heap push/pop sites to use Reverse, and switches imports to alloc/core equivalents.
Compression OnceBox migration
src/compression/mod.rs
Replaces zstd OnceCell cache with once_cell::race::OnceBox and updates initialization and lazy-init code accordingly.
End-to-end integration tests
tests/checkpoint_pitr.rs
Integration tests cover round-trip checkpoint/reopen, concurrent writer/compaction safety, BlobTree blob preservation, tiered-storage flattening, MVCC/regression checks, divergence/isolation, compaction-vs-checkpoint races, empty-tree/bytes accounting, duplicate-target rejection, cross-FS fallback, and crash/failure safety with retries.
MemFs regression tests
tests/mem_fs_tree.rs
Adds tests for relative single-component and ./-prefixed relative checkpoint target handling on MemFs.
Misc: README and docs
README.md, .github/instructions/rust.instructions.md
Documents Tree::create_checkpoint in README and adds a no-std+alloc guidance section to review docs.

Sequence Diagram(s)

sequenceDiagram
  participant Client
  participant Tree
  participant Checkpoint
  participant Fs
  participant DeletionPause
  Client->>Tree: create_checkpoint(target_path)
  Tree->>DeletionPause: acquire() (Pause guard)
  Tree->>Checkpoint: run_checkpoint(params)
  Checkpoint->>Tree: flush_active_memtable()
  Checkpoint->>Tree: capture current_version + visible_seqno
  Checkpoint->>Fs: create_dir(target), create tables/, optionally blobs/
  Checkpoint->>Fs: hard_link or copy SST/blob files
  Checkpoint->>Fs: copy manifest, write v<id>, write CURRENT
  Checkpoint->>Fs: fsync child dirs then root (and parent if present)
  Checkpoint->>DeletionPause: release() (drop Pause)
  Checkpoint->>Tree: return CheckpointInfo
  Tree->>Client: CheckpointInfo
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'feat(checkpoint)!: hard-link snapshot for PITR backup (V5 storage)' clearly and specifically describes the main change: implementing checkpoint functionality for point-in-time recovery backups using hard-link snapshots, with a breaking change indicator.
Linked Issues check ✅ Passed All coding requirements from issue #210 are met: Fs::hard_link with EXDEV fallback, DeletionPause mechanism for deferred deletions, create_checkpoint API with CheckpointInfo, level_routes support, cross-filesystem handling, and comprehensive test coverage including concurrent compaction, crash-safety, and reopenability.
Out of Scope Changes check ✅ Passed All code changes are directly scoped to checkpoint implementation: Fs trait extensions (hard_link, create_dir, backend_id), DeletionPause mechanism, checkpoint module, Tree/BlobTree implementations, and supporting infrastructure like deletion_pause integration into Table/BlobFile. No unrelated refactoring or feature additions detected.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/#210-pitr-checkpoint

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 12

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/abstract_tree.rs`:
- Around line 89-95: Update the checkpoint docstring to mention that while
checkpoints attempt O(1) hard-links (so they consume no extra space) for all
SST/blob files, the implementation falls back to streamed/copy-on-write behavior
when hard-linking fails across filesystems/backends (e.g., EXDEV), which incurs
linear IO and additional storage for those files; keep references to
Config::open and BlobTree and state that callers using level routing or separate
backup volumes may observe this fallback and its performance/storage
implications.

In `@src/checkpoint.rs`:
- Around line 220-229: The code currently copies the live CURRENT file via
copy_metadata_file(CURRENT_VERSION_FILE) which can race with a concurrent
publish; modify run_checkpoint so that instead of cloning the source CURRENT you
atomically write a new CURRENT in the checkpoint containing the captured version
id (use the captured version.id() / version_id) as the content and perform this
write as the final metadata step; replace the
copy_metadata_file(CURRENT_VERSION_FILE) call with an atomic write of the
captured version string (use the same metadata-write helper/atomic-temp-rename
pattern used elsewhere) so CURRENT always points to the snapshot's version and
not to a concurrently-published one.
- Around line 268-292: After prepare_target succeeds you must install a
best-effort cleanup guard that removes the partially-built checkpoint on any
early return/error: create a scoped guard (or use a Drop-based RAII guard)
immediately after prepare_target(target_root, include_blobs, &**target_fs)? that
captures target_root and target_fs and on drop (if not disarmed) deletes the
target (e.g., remove_dir_all or equivalent), then disarm the guard only after
the final fsync_directory completes successfully. Ensure the guard covers
failures from deletion_pause.acquire(), tree.flush_active_memtable(SeqNo::MAX)?,
link_tables(...), link_blob_files(...), copy_metadata(...), and
fsync_directory(...), and reference/disarm it right after the successful
completion of fsync_directory so retries won’t hit AlreadyExists.
- Around line 280-282: CheckpointInfo::seqno can race ahead of the captured
snapshot because visible_seqno.get() is read after tree.current_version();
change the code to capture the seqno at the same logical snapshot as
current_version() by obtaining visible_seqno as-of that version (or by using the
tree snapshot API that returns version+seqno together) instead of calling
visible_seqno.get() separately; update the code around the calls to
tree.current_version() and visible_seqno.get() so you either call a
visible_seqno.as_of(version) (or equivalent API) or take the snapshot first
(e.g., snapshot = tree.capture_snapshot()) and read snapshot.seqno, ensuring
CheckpointInfo::seqno reflects the same snapshot point as
tree.current_version().
- Around line 53-65: prepare_target currently races by calling
target_fs.exists(target) then create_dir_all, allowing another process to create
the same checkpoint; change it to claim the checkpoint root atomically (either
by using an atomic create-new API on the filesystem or by creating a create-new
sentinel file/dir inside target and failing if it already exists) before
creating TABLES_FOLDER and optional blobs; specifically, update prepare_target
to attempt an atomic create for target (or create_new for a sentinel like
target.join(".lock" or ".claim")), return AlreadyExists if that atomic create
fails, and only after successful claim call
target_fs.create_dir_all(&target.join(TABLES_FOLDER)) and, when include_blobs is
true, create the blobs folder — ensure you reference prepare_target and
TABLES_FOLDER and preserve error mapping to std::io::ErrorKind::AlreadyExists.
- Around line 283-292: The code currently calls link_tables and link_blob_files
to populate target/tables and target/blobs, then writes metadata and calls
fsync_directory(target_root, target_fs) only on the root; to fix, after
link_tables and (if include_blobs) link_blob_files succeed and before the final
fsync_directory(target_root, target_fs) you must call fsync_directory on each
populated child directory (e.g., the tables directory and, if created, the blobs
directory) using the same target_fs handle so those directory entry changes are
durable; keep the final fsync_directory(target_root, target_fs) after these
child-directory syncs and ensure you perform these child fsyncs regardless of
copy_metadata succeeding to preserve correct fsync ordering for
link_tables/link_blob_files, referencing link_tables, link_blob_files,
fsync_directory, target_root and target_fs in your changes.

In `@src/deletion_pause.rs`:
- Around line 150-151: Replace the test-level lint suppression attribute so it
uses expect instead of allow: change the attribute on the test module/section
that currently reads #[allow(clippy::unwrap_used, reason = "...")] (the
attribute immediately following #[cfg(test)]) to #[expect(clippy::unwrap_used,
reason = "...")] so the suppression will emit a warning if it becomes
unnecessary; keep the same reason string and retain #[cfg(test)] unchanged.
- Around line 121-144: The drain is racing with a concurrently-created new pause
because after fetch_sub() transitions active 1->0 a new acquire() can create a
fresh pause and enqueue items that this branch will wrongly take; fix by making
draining keyed or serialized with acquire()/try_enqueue(): add and bump a
monotonic generation counter (e.g., inner.generation) when acquire() creates a
new pause and have drop() capture the generation it observed, then while holding
inner.queue.lock() verify the queue's generation matches that captured
generation before taking it (or only drain items from the matching generation);
alternatively, ensure acquire()/try_enqueue() and the drain both synchronize on
the same mutex so the drain only swaps the queue when no new pause has been
created. Ensure you update acquire(), try_enqueue(), and Drop for DeletionPause
to consult the generation or shared lock and reference the existing fields
active, inner.queue, acquire(), try_enqueue(), and the prev == 1 branch when
applying the change.

In `@src/fs/mem_fs.rs`:
- Around line 1285-1312: The test hard_link_creates_independent_copy should
verify independent backing buffers by mutating one path before removal and
asserting the other is unchanged: after creating src and dst via
fs.hard_link(src, dst), open one (e.g., src) with
FsOpenOptions::new().write(true), write different bytes via write_all and
drop/close it, then remove_file(src) and assert exists(src) is false and
exists(dst) is true, and finally open dst and read_to_string to confirm its
contents remain the original value (not the mutated bytes); update assertions
accordingly to reference hard_link, open, write_all, remove_file,
read_to_string, and exists.

In `@tests/checkpoint_pitr.rs`:
- Around line 179-181: The test is silently swallowing I/O errors by using
filter_map(Result::ok) on std::fs::read_dir(&dst_tables) and by calling
e.metadata().ok(); instead collect the directory entries as a Result<Vec<_>>
(e.g. collect::<std::io::Result<Vec<_>>>()), propagate any read_dir error with ?
and replace e.metadata().ok() with e.metadata()? so filesystem errors bubble up
instead of being dropped; apply the same changes to the other occurrences around
the entries/count logic referenced (the blocks around lines 260-263 and 338-341)
so any unreadable checkpoint entries cause the test to fail with the original
I/O error.
- Around line 154-185: The test never verifies that any SST was actually flushed
into the routed hot tier, so add a pre-check that at least one flushed SST
exists under the hot tier path (hot_dir) before calling
tree.create_checkpoint(&dst_path); e.g., inspect hot_dir.path()/tables (or the
same tables layout used later) with std::fs::read_dir, count entries, and assert
entries >= 1 with a clear message so the checkpoint validation proves the routed
path (level_routes / LevelRoute) was honored.
- Around line 415-454: Change the test to exercise a failure that occurs after
prepare_target rather than before it: do not pre-create the full destination (so
prepare_target runs), then induce a failure after prepare_target but before
checkpoint completion (for example by making the destination unwritable or
creating a conflicting file inside the freshly prepared target to break the
subsequent link/copy steps) when calling tree.create_checkpoint; assert that
create_checkpoint returns an error, that the partially-created destination
directory is removed/cleaned up, and that the source tree (accessed via tree.get
and reopen via open_tree) still contains all data. Ensure the new assertions
reference create_checkpoint and the prepared target behavior so the test
verifies cleanup-on-failure semantics.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: c21dbfd7-893d-4247-b5b0-97e6bb93d0ad

📥 Commits

Reviewing files that changed from the base of the PR and between 413a296 and 580e44a.

📒 Files selected for processing (18)
  • src/abstract_tree.rs
  • src/blob_tree/mod.rs
  • src/checkpoint.rs
  • src/deletion_pause.rs
  • src/fs/io_uring_fs.rs
  • src/fs/mem_fs.rs
  • src/fs/mod.rs
  • src/fs/std_fs.rs
  • src/lib.rs
  • src/table/inner.rs
  • src/table/mod.rs
  • src/tree/inner.rs
  • src/tree/mod.rs
  • src/vlog/blob_file/mod.rs
  • src/vlog/blob_file/multi_writer.rs
  • src/vlog/blob_file/reader.rs
  • src/vlog/mod.rs
  • tests/checkpoint_pitr.rs

Comment thread src/abstract_tree.rs
Comment thread src/checkpoint.rs Outdated
Comment thread src/checkpoint.rs Outdated
Comment thread src/checkpoint.rs
Comment thread src/checkpoint.rs
Comment thread src/deletion_pause.rs Outdated
Comment thread src/fs/mem_fs.rs
Comment thread tests/checkpoint_pitr.rs
Comment thread tests/checkpoint_pitr.rs Outdated
Comment thread tests/checkpoint_pitr.rs
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds a hard-link based checkpoint primitive to enable O(1)-per-file point-in-time recovery (PITR) snapshots of an open LSM tree, including correctness around concurrent compaction via deferred deletions.

Changes:

  • Introduces AbstractTree::create_checkpoint(...) -> CheckpointInfo and shared checkpoint driver (src/checkpoint.rs).
  • Adds Fs::hard_link(src, dst) with backend implementations (StdFs EXDEV→copy fallback, MemFs copy semantics, IoUringFs delegation).
  • Implements DeletionPause and wires it into Table/BlobFile drops to defer deletions during checkpoints; adds end-to-end PITR tests.

Reviewed changes

Copilot reviewed 18 out of 18 changed files in this pull request and generated 7 comments.

Show a summary per file
File Description
tests/checkpoint_pitr.rs New end-to-end checkpoint/PITR integration tests including concurrent write/compaction scenarios.
src/vlog/mod.rs Ensures recovered BlobFile::Inner carries fs + deletion pause state.
src/vlog/blob_file/reader.rs Updates blob-file reader test setup to include new Inner fields.
src/vlog/blob_file/multi_writer.rs Ensures newly created blob files carry fs + deletion pause state.
src/vlog/blob_file/mod.rs Adds DeletionPause plumbing + Fs to blob-file inner; defers deletes in Drop.
src/tree/mod.rs Implements create_checkpoint for Tree; installs deletion pause on recovered/new tables/blobs.
src/tree/inner.rs Adds tree-wide deletion_pause to TreeInner.
src/table/mod.rs Adds install_deletion_pause + OnceLock storage for the pause in tables.
src/table/inner.rs Defers table file deletion in Drop when checkpoint pause is active.
src/lib.rs Exposes CheckpointInfo; adds internal checkpoint + deletion_pause modules.
src/fs/std_fs.rs Adds hard_link implementation with EXDEV detection + copy fallback; tests.
src/fs/mod.rs Adds Fs::hard_link default (Unsupported) + documentation.
src/fs/mem_fs.rs Implements hard_link as an independent byte copy for MemFs; tests.
src/fs/io_uring_fs.rs Delegates hard_link to StdFs.
src/deletion_pause.rs New refcounted deletion gate with queue draining on last guard drop; unit tests.
src/checkpoint.rs New checkpoint implementation: prepare target, pause deletions, flush, link/copy files, copy metadata, fsync.
src/blob_tree/mod.rs Implements create_checkpoint for BlobTree (includes blob files).
src/abstract_tree.rs Adds CheckpointInfo type + create_checkpoint API docs on AbstractTree.

Comment thread tests/checkpoint_pitr.rs Outdated
Comment thread tests/checkpoint_pitr.rs Outdated
Comment thread tests/checkpoint_pitr.rs Outdated
Comment thread src/checkpoint.rs Outdated
Comment thread src/fs/std_fs.rs Outdated
Comment thread src/vlog/blob_file/mod.rs
Comment thread src/deletion_pause.rs Outdated
@codecov
Copy link
Copy Markdown

codecov Bot commented May 19, 2026

polaz added 2 commits May 20, 2026 00:46
Switches `Arc`, `AtomicU32`, `fmt::Debug` and `vec!` imports in the new
checkpoint + deletion-pause modules to their `core::*` / `alloc::*`
forms. `Mutex`, `PathBuf` and `std::io::{Read, Write}` stay on `std::*`
because they have no `core` / `alloc` equivalent — these dependencies
are inherited from the `Fs` trait that already lives in `std`.

No behaviour change; the no-std-check error count stays flat (1206) but
the new code stops contributing trivially-portable usages to the
migration debt.
Concurrency / correctness fixes:

- prepare_target uses atomic Fs::create_dir (new trait method) instead
  of exists()+create_dir_all, closing the TOCTOU window where two
  concurrent creators could race and both proceed thinking they claimed
  the directory.
- run_checkpoint now writes a fresh `current` pointer for the captured
  version_id rather than copying the live source file. A concurrent
  flush/compaction can publish v<N+1> after current_version() snapshot,
  and the old code would copy that newer CURRENT into the checkpoint —
  leaving the snapshot pointing at a version it never linked.
- Reorders visible_seqno.get() BEFORE current_version() so the recorded
  CheckpointInfo.seqno is a lower bound on the snapshot's contents.
  Concurrent writers can no longer advance the cutoff past data that
  did not make it into the checkpoint.
- fsync_directory is now called on tables/ and blobs/ (when populated)
  before the root, so directory entries for the hard-linked files
  survive a power loss instead of relying on the root fsync alone.
- PartialCheckpointGuard removes the target directory on any early
  return from run_checkpoint, keeping retries against the same path
  idempotent.
- link_or_copy_cross_fs retries on EINTR in the read loop, matching
  StdFs::copy_fallback and avoiding flaky failures under signal
  interruption.

DeletionPause:

- Drop for Pause re-checks `active` under the queue lock before
  draining, so a new pause that acquires between fetch_sub→0 and the
  lock cannot have its freshly-queued items stolen by the previous
  drop. Includes a regression test for the generation race.
- Mutex poisoning in Drop and try_enqueue now recovers via
  PoisonError::into_inner instead of leaking deferred deletions
  permanently — the whole point of the type is to keep the filesystem
  tidy, so leaving a queue stranded on poison defeats the purpose.
- Switches test-level #[allow(unwrap_used)] to #[expect(...)] so the
  suppression warns if it ever becomes unnecessary.

Docs / API:

- Adds Fs::create_dir (atomic single-leaf mkdir) with Unsupported
  default + concrete impls on StdFs, MemFs, IoUringFs.
- abstract_tree.rs docstring spells out the cross-filesystem copy
  fall-back so callers using level_routes or detached backup volumes
  know to expect linear-time / extra-space behaviour in those cases.
- std_fs::is_cross_device docstring now explains explicitly why we do
  NOT classify PermissionDenied as a cross-device signal.
- Adds manual Debug impl on vlog::blob_file::Inner now that the
  Arc<dyn Fs> + OnceLock fields prevent #[derive(Debug)].

Tests:

- MemFs hard_link test now mutates one path and asserts the other is
  unchanged, proving independent backing buffers (the old assertion
  on remove_file would have passed even with Arc<Mutex<Vec<u8>>>
  aliasing).
- checkpoint_pitr.rs replaces filter_map(Result::ok) with collected
  Result<Vec<_>> in three places so I/O failures during verification
  surface instead of being silently dropped.
- checkpoint_flattens_level_routes asserts the routed tier received an
  SST BEFORE checkpointing, so a regression that ignored level_routes
  would no longer pass silently.
- compaction_during_checkpoint_preserves_source_ssts wraps the
  compactor thread in a RAII guard so an early-returning ? cleans up
  the worker instead of detaching it.
- Removes a tautological assert_eq!(info.sst_files, info.sst_files)
  and replaces it with a meaningful sst_files >= 1 invariant.
- cross_fs_link_or_copy_streams_through_trait now mutates the
  destination (not a different file) and verifies the StdFs source is
  unchanged AND the MemFs destination reflects the write.
- checkpoint_failure_leaves_source_intact splits into early-reject and
  retry-after-cleanup paths.
Copy link
Copy Markdown

@sw-release-bot sw-release-bot Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Performance Alert ⚠️

Possible performance regression was detected for benchmark 'lsm-tree db_bench'.
Benchmark result of this commit is worse than the previous benchmark result exceeding threshold 1.15.

Benchmark suite Current: 9808f74 Previous: 413a296 Ratio
mergerandom 382152.66922789026 ops/sec (normalized) 470343.5799479985 ops/sec (normalized) 1.23

This comment was automatically generated by workflow using github-action-benchmark.

CC: @polaz

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 6

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/abstract_tree.rs`:
- Around line 33-36: Update the documentation for CheckpointInfo::seqno to state
it is a lower-bound watermark rather than the definitive maximum visible seqno:
change the doc comment on the CheckpointInfo struct (and the other doc block
around the second occurrence) to say something like "A lower-bound watermark:
this seqno is guaranteed to be included in the checkpoint; later seqnos may also
be present due to sampling before flush/snapshot, so callers must treat it as an
inclusive minimum rather than the absolute maximum." Ensure you update the
comment attached to the field named seqno (CheckpointInfo::seqno) and any
repeated explanatory doc blocks in the same file to use this wording.

In `@src/checkpoint.rs`:
- Around line 364-370: The partial-checkpoint cleanup guard is created only
after prepare_target(...) returns which can leak target_root if prepare_target
creates the root but then fails creating child dirs; move the
PartialCheckpointGuard::new(target_root, target_fs) to immediately after the
root directory is claimed (i.e. as soon as prepare_target acquires/creates
target_root) or modify prepare_target(...) to install its own RAII guard for the
subdirectory-creation phase so any early return cleans up the partial root;
ensure the guard is still disarmed by commit() after fsync_directory succeeds
and keep references to target_root and target_fs consistent.
- Around line 201-208: Replace the exists-then-open TOCTOU pattern by opening
the source file directly: instead of calling src_fs.exists(&src) then
src_fs.open(&src, ...), call src_fs.open(&src, &FsOpenOptions::new().read(true))
first and match its Result; if it Errs with a NotFound/NotFound-equivalent from
the filesystem API, return Ok(()) (treat missing source as ignorable), otherwise
propagate the error and proceed to open/create the destination (dst) and
continue; remove the prior src_fs.exists call and ensure error handling checks
the filesystem's NotFound variant when handling the open of src (use the same
src, src_root, file_name, src_fs, FsOpenOptions, and src_file identifiers to
locate the change).

In `@tests/checkpoint_pitr.rs`:
- Around line 292-299: The compactor thread errors are currently being
swallowed: calls to major_compact(...) and the thread join in
CompactorGuard::drop discard results; change those points (references:
major_compact, CompactorGuard, Drop, handle, join, stop) to propagate or surface
failures instead of ignoring them — e.g., have major_compact return Result and
unwrap/expect or send the Err back to the test harness, and replace "let _ =
h.join()" with h.join().expect("compactor thread panicked") (or propagate the
Box<Any> to fail the test); apply the same treatment to the other join site
noted around the 306-309 region so any compaction error or panic fails the test
rather than being hidden.
- Around line 96-109: Capture the CheckpointInfo returned by
tree.create_checkpoint (don't discard it into _info), then use info.seqno to
bound the assertions: iterate keys whose seqno is <= info.seqno and assert each
is present in restored (use restored.get(..., lsm_tree::SeqNo::MAX) to read the
snapshot), while making no assertions about keys with seqno > info.seqno; update
the variable name from _info to info and replace the fixed 0..50 loop with a
loop that uses info.seqno as the upper bound (casting types as needed).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: f9ce0ddf-0862-4fac-b9b9-33d3421f9c86

📥 Commits

Reviewing files that changed from the base of the PR and between 580e44a and dd23588.

📒 Files selected for processing (9)
  • src/abstract_tree.rs
  • src/checkpoint.rs
  • src/deletion_pause.rs
  • src/fs/io_uring_fs.rs
  • src/fs/mem_fs.rs
  • src/fs/mod.rs
  • src/fs/std_fs.rs
  • src/vlog/blob_file/mod.rs
  • tests/checkpoint_pitr.rs

Comment thread src/abstract_tree.rs
Comment thread src/checkpoint.rs
Comment thread src/checkpoint.rs
Comment thread src/deletion_pause.rs
Comment thread tests/checkpoint_pitr.rs Outdated
Comment thread tests/checkpoint_pitr.rs
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 18 out of 18 changed files in this pull request and generated 5 comments.

Comment thread src/checkpoint.rs
Comment thread src/checkpoint.rs Outdated
Comment thread src/checkpoint.rs Outdated
Comment thread tests/checkpoint_pitr.rs Outdated
Comment thread tests/checkpoint_pitr.rs Outdated
Correctness / safety:

- prepare_target now arms a local RAII cleanup guard immediately after
  the atomic create_dir(target) succeeds. If creating tables/ or blobs/
  subsequently fails, the freshly-claimed root is removed so retries
  against the same path don't hit AlreadyExists.
- link_or_copy_cross_fs replaces the Arc::ptr_eq fast-path with a
  "try hard_link first, fall back on NotFound/Unsupported" strategy.
  Two Arc<dyn Fs> handles backed by the same kernel filesystem (e.g.
  an Arc::new(StdFs) created independently in level_routes versus the
  primary config.fs) used to be pessimised into an O(bytes) byte copy;
  they now hard-link in O(1) as intended.
- copy_metadata_file opens the source directly instead of exists() +
  open() (closes the TOCTOU window where the file disappears between
  the two calls). It also distinguishes Required vs Optional metadata:
  v<id> is now Required so a concurrent version-GC that removes the
  captured v<id> fails the checkpoint loudly instead of producing an
  unopenable snapshot.

Docs:

- CheckpointInfo.seqno docstring rewritten as a lower-bound watermark
  ("guaranteed-included, may also contain later seqnos"). The previous
  "maximum visible seqno" wording could mislead PITR consumers into
  treating it as a replay ceiling.

Tests:

- DeletionPause race test now uses two mpsc channels to pin the
  invariant check at the exact moment Thread B holds an active pause
  and has enqueued. Without the fix, A's drain would have already
  swept B's file before the check runs; the buggy implementation now
  fails the assertion mid-flight instead of producing the same final
  state as the fixed one.
- compaction_during_checkpoint_preserves_source_ssts: CompactorGuard
  now propagates panics via resume_unwind and surfaces Err returns as
  test failures (with thread::panicking() guards to avoid double-
  panic). The compactor loop itself uses ? on major_compact instead
  of swallowing errors with `let _ = ...`.
- checkpoint_survives_concurrent_writes captures CheckpointInfo and
  asserts the lower-bound watermark contract: every key with
  seqno <= info.seqno MUST be present in the restored checkpoint
  (keys above the watermark make no claim either way).
- checkpoint_failure_leaves_source_intact (Unix path) now genuinely
  exercises the post-prepare failure scenario by chmod-ing the source
  tables/ directory to 0o000 between flush and create_checkpoint.
  link(2) requires search (x) permission on every path component, so
  prepare_target succeeds and link_tables then fails with EACCES,
  exercising PartialCheckpointGuard's cleanup. The non-Unix branch
  falls back to the previous success-then-cleanup-then-retry chain.
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/abstract_tree.rs`:
- Around line 37-39: The doc comment for the watermark around
AbstractTree::current_version contains contradictory wording: it claims the
watermark's seqno is "≤ every seqno reflected in the checkpointed SSTs" which
conflicts with the lower-bound semantics used elsewhere; update the comment on
the watermark (the captured seqno described just before
AbstractTree::current_version) to either state it is a lower bound (i.e., ≤ the
eventual maximum reflected seqno) or remove the misleading sentence and keep
only the explicit lower-bound guarantee — ensure references to
AbstractTree::current_version and "watermark"/"seqno" are adjusted so the
comment clearly says the value is captured before current_version and is a
lower-bound on the maximum seqno reflected in checkpointed SSTs.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 7e645e32-fb47-48d2-9add-be8da3c7ebcc

📥 Commits

Reviewing files that changed from the base of the PR and between dd23588 and 8fde911.

📒 Files selected for processing (4)
  • src/abstract_tree.rs
  • src/checkpoint.rs
  • src/deletion_pause.rs
  • tests/checkpoint_pitr.rs

Comment thread src/abstract_tree.rs Outdated
- `is_cross_device_detects_exdev_and_kind_variants` now gates the raw
  `from_raw_os_error(18)` assertion behind `#[cfg(unix)]`. errno 18 is
  EXDEV on Linux/macOS/BSDs but means ERROR_NO_MORE_FILES on Windows,
  and the `#[cfg(unix)]` branch in `is_cross_device` is not compiled
  for non-Unix targets, so the assertion was guaranteed to fail on
  Windows test runners. The ErrorKind variant assertions stay
  cross-platform.
- `CheckpointInfo::seqno` docstring previously said the watermark is
  "≤ every seqno reflected" which conflicts with the lower-bound
  semantics described in the next paragraph. Rewritten as "≤ the
  *maximum* seqno reflected" so PITR consumers can't misread it as a
  per-record ceiling.
Copilot AI review requested due to automatic review settings May 19, 2026 22:55
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 18 out of 18 changed files in this pull request and generated 5 comments.

Comment thread src/lib.rs Outdated
Comment thread src/fs/mem_fs.rs Outdated
Comment thread src/deletion_pause.rs Outdated
Comment thread src/fs/std_fs.rs
Comment thread src/fs/std_fs.rs
…noise

- `crate::checkpoint` is now `pub(crate)` instead of `pub` + `#[doc(hidden)]`.
  The cross-Fs `link_or_copy_cross_fs` integration test is moved into
  inline unit tests inside `src/checkpoint.rs`, so the helpers no longer
  leak into the stable public API surface.
- `MemFs::create_dir` now delegates to `ensure_parent_dir`, so a parent
  that's actually a regular file produces a "parent is not a directory"
  error (matching POSIX `ENOTDIR`) instead of an ambiguous `NotFound`.
- `DeletionPause::new()` returns `Self` (plain ownership); a new
  `new_shared() -> Arc<Self>` convenience covers the install-into-tree
  case. Brings the constructor in line with the rest of the crate's
  ownership-explicit style.
- `StdFs::hard_link` now logs the cross-device fallback at `debug!` rather
  than `warn!`. A tier-misconfigured checkpoint can hit this path once
  per SST and once per blob — per-file warnings would drown real signal.
  Per-checkpoint summary remains the responsibility of the driver.
- `is_cross_device` replaces the bare literal `18` with a named `EXDEV`
  constant + explanatory comment (matches the existing `flock` pattern
  that also declares POSIX values inline rather than pulling in `libc`).
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/checkpoint.rs`:
- Around line 513-518: The code currently fsyncs tables, blobs (when
include_blobs), and target_root via fsync_directory but never fsyncs the parent
directory of target_root, so the checkpoint directory entry can be lost; after
the existing fsync_directory(target_root, &**target_fs)? call, add a final
fsync_directory on target_root.parent() (handle the Option safely—return an
error if None or propagate) using the same target_fs to ensure the directory
entry for the checkpoint is durable; reference the existing fsync_directory
function and the target_root/TABLES_FOLDER/BLOBS_FOLDER logic so the new
parent-directory fsync occurs as the last step before returning success.

In `@src/fs/std_fs.rs`:
- Around line 258-281: In copy_fallback, ensure we remove any partially-written
destination if a post-create error occurs: after
OpenOptions::new().write(true).create_new(true).open(dst) succeeds, if any
subsequent operation (read/write error, io::ErrorKind::Interrupted aside, or
dst_file.sync_all()) returns Err, best-effort remove the dst file (e.g.,
std::fs::remove_file(dst)) before returning the error; alternatively implement
writing to a temporary file (e.g., dst with a unique suffix) and only rename
(std::fs::rename) to dst after sync_all() succeeds, and ensure the temporary
file is removed on any failure — update error paths around src_file.read,
dst_file.write_all, and dst_file.sync_all to perform the cleanup.

In `@tests/checkpoint_pitr.rs`:
- Around line 340-364: The compactor thread may not have begun calling
major_compact before create_checkpoint runs; add a one-shot handshake so the
test waits until the worker has entered its loop (and attempted at least one
major_compact) before calling create_checkpoint. Modify the thread spawn block
that calls major_compact (the closure moved into thread::spawn referencing
compactor_tree, compactor_stop, major_compact) to take a
std::sync::mpsc::Sender<()> (or a oneshot channel) and send a single "ready"
signal from inside the loop before the first sleep/iteration; in the main test
code, create the Receiver and block on recv() (or try_recv/timeouts as
appropriate) right after spawning the thread and before calling
tree.create_checkpoint(&dst_path) so the checkpoint is gated until the compactor
loop has actually started. Ensure CompactorGuard construction and error paths
still behave the same.
- Around line 85-97: The test races: the writer thread may not perform any
post-50 writes before create_checkpoint runs; add a one-shot synchronization so
the main thread waits for the writer to perform its first insert/flush before
calling tree.create_checkpoint. Concretely, create a std::sync::mpsc::channel
(or oneshot) and move the Sender into the writer thread; after the writer does
its first insert or flush (e.g. right after the first writer_tree.insert(...) or
after the first writer_tree.flush_active_memtable(0)?), call sender.send(()) to
signal readiness; in the test thread call receiver.recv().unwrap() (or try_recv
with timeout) before invoking tree.create_checkpoint(&dst_path)?; keep existing
writer, writer_tree, insert, flush_active_memtable, create_checkpoint and
writer.join usage unchanged otherwise.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: caf4b740-58fa-485b-bcc3-ba55f9421a8d

📥 Commits

Reviewing files that changed from the base of the PR and between 5c5db06 and efa793b.

📒 Files selected for processing (8)
  • src/checkpoint.rs
  • src/deletion_pause.rs
  • src/fs/mem_fs.rs
  • src/fs/std_fs.rs
  • src/lib.rs
  • src/tree/inner.rs
  • src/tree/mod.rs
  • tests/checkpoint_pitr.rs

Comment thread src/checkpoint.rs
Comment thread src/fs/std_fs.rs Outdated
Comment thread tests/checkpoint_pitr.rs
Comment thread tests/checkpoint_pitr.rs
…hakes

- run_checkpoint now fsync's the directory CONTAINING target_root after
  the root sync. Without this the checkpoint directory entry can still
  vanish after a power loss even though `create_checkpoint()` returned
  Ok and the children are durable.
- copy_fallback wraps its read/write/sync_all sequence in an inner
  closure and best-effort `remove_file(dst)` on any post-`create_new`
  error. Previously an ENOSPC/EIO mid-copy left a truncated file behind,
  and the next retry would see AlreadyExists and surface a corrupt file
  from an operation that already reported failure.
- checkpoint_survives_concurrent_writes adds an mpsc handshake: the
  writer thread sends `()` after its first post-flush insert and the
  main thread blocks on `recv()` before calling create_checkpoint. A
  fast scheduler could otherwise finish the snapshot before the writer
  touched the tree, collapsing the test into a single-threaded smoke
  test.
- compaction_during_checkpoint_preserves_source_ssts adds the same
  handshake for the compactor — the worker sends `()` from inside its
  loop (after the announced flag flips) so we know major_compact has
  actually started before the checkpoint races it.
Copilot AI review requested due to automatic review settings May 20, 2026 00:40
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/checkpoint.rs`:
- Around line 482-496: The checkpoint seqno can advance after
flush_active_memtable but before current_version, causing CheckpointInfo::seqno
to exceed the actual persisted checkpoint contents; fix by computing the
checkpoint cutoff from the persisted snapshot rather than the live
visible_seqno: capture a lower-bound seqno before mutating visible_seqno (or
read it prior to calling tree.flush_active_memtable), or obtain the seqno from
the version returned by tree.current_version() and use that for
CheckpointInfo::seqno so the recorded seqno never outruns the snapshot; update
the code that sets captured_seqno and the use of tree.current_version()
accordingly (references: visible_seqno, tree.flush_active_memtable,
tree.current_version, CheckpointInfo::seqno).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: c7cb5d1f-89b7-4339-a07e-4e4c10c725c1

📥 Commits

Reviewing files that changed from the base of the PR and between 5c5db06 and efa793b.

📒 Files selected for processing (8)
  • src/checkpoint.rs
  • src/deletion_pause.rs
  • src/fs/mem_fs.rs
  • src/fs/std_fs.rs
  • src/lib.rs
  • src/tree/inner.rs
  • src/tree/mod.rs
  • tests/checkpoint_pitr.rs

Comment thread src/checkpoint.rs
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 18 out of 18 changed files in this pull request and generated 4 comments.

Comment thread src/fs/mod.rs Outdated
Comment thread src/checkpoint.rs
Comment thread src/checkpoint.rs
Comment thread src/abstract_tree.rs
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (1)
src/checkpoint.rs (1)

482-496: ⚠️ Potential issue | 🟠 Major

Capture the PITR cutoff before the checkpoint flush.

A writer that lands between Line 486 and Line 495 can advance visible_seqno, while current_version() at Line 496 still snapshots only persisted files. That lets CheckpointInfo::seqno outrun the actual checkpoint contents, which breaks the watermark contract for PITR consumers.

Suggested fix
-    tree.flush_active_memtable(SeqNo::MAX)?;
-
-    // Capture the seqno BEFORE the version snapshot. Reading
-    // `visible_seqno` after `current_version()` would let concurrent
-    // writers advance the counter past data that never made it into the
-    // checkpoint, so PITR consumers could move the recovery cutoff past
-    // records they still need from WAL / replication. The "before"
-    // ordering means `CheckpointInfo::seqno` is a lower bound on the
-    // seqno reflected in the version snapshot — safe for cutoffs.
-    let captured_seqno = visible_seqno.get();
+    // Capture the cutoff BEFORE the checkpoint-triggered flush. Any
+    // later write can still race into the new active memtable or even a
+    // later flush, but this value remains a safe lower bound for what
+    // the checkpoint is guaranteed to contain.
+    let captured_seqno = visible_seqno.get();
+    tree.flush_active_memtable(SeqNo::MAX)?;
     let version = tree.current_version();

As per coding guidelines, Flag concurrency bugs: data races on shared segment lists, lock ordering violations, missing synchronization on manifest updates (Tier 1 — MUST flag).

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/checkpoint.rs` around lines 482 - 496, The code captures visible_seqno
after flushing the active memtable and taking the version snapshot which allows
a concurrent writer to advance visible_seqno and violate the PITR watermark;
move the read of visible_seqno.get() (the captured_seqno used for
CheckpointInfo::seqno) to before calling tree.flush_active_memtable(SeqNo::MAX)
and before tree.current_version() so the checkpoint's seqno is a true lower
bound on the snapshot; ensure CheckpointInfo::seqno is assigned from that
earlier-captured_seqno and not from a later read.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Duplicate comments:
In `@src/checkpoint.rs`:
- Around line 482-496: The code captures visible_seqno after flushing the active
memtable and taking the version snapshot which allows a concurrent writer to
advance visible_seqno and violate the PITR watermark; move the read of
visible_seqno.get() (the captured_seqno used for CheckpointInfo::seqno) to
before calling tree.flush_active_memtable(SeqNo::MAX) and before
tree.current_version() so the checkpoint's seqno is a true lower bound on the
snapshot; ensure CheckpointInfo::seqno is assigned from that
earlier-captured_seqno and not from a later read.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 1ecb5c09-8bae-4027-a180-59053efda983

📥 Commits

Reviewing files that changed from the base of the PR and between efa793b and 5faae35.

📒 Files selected for processing (3)
  • src/checkpoint.rs
  • src/fs/std_fs.rs
  • tests/checkpoint_pitr.rs

- run_checkpoint now samples visible_seqno BEFORE flush_active_memtable
  instead of between flush and current_version(). Previous ordering left
  a window where a concurrent writer could land in the freshly-rotated
  active memtable, advance the counter past data that the SST snapshot
  doesn't contain, and push CheckpointInfo.seqno above the actual
  captured contents. The "before flush" ordering keeps the captured
  seqno a strict lower bound on what's in the snapshot regardless of
  what writers do during the flush+snapshot window.
- AbstractTree::create_checkpoint, Fs::hard_link, and
  link_or_copy_cross_fs docstrings now say "logged via log::debug"
  instead of "log::warn", matching StdFs::hard_link which was downgraded
  earlier to avoid per-file log spam on tier-misconfigured checkpoints.
  Adds a note that operator visibility for unexpected copies is the
  checkpoint driver's job (CheckpointInfo.total_bytes ≫ inode link cost
  is the actionable signal).
@polaz polaz requested a review from Copilot May 20, 2026 01:46
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 20, 2026

Tip

For best results, initiate chat on the files or code changes.

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
REF="feat/%23210-pitr-checkpoint"

echo "=== Commit dda582a1 summary ==="
gh api repos/structured-world/coordinode-lsm-tree/commits/dda582a1 --jq '.commit.message'
echo ""
echo "=== Files changed ==="
gh api repos/structured-world/coordinode-lsm-tree/commits/dda582a1/files --jq '.[] | {filename: .filename, additions: .additions, deletions: .deletions}'

echo ""
echo "=== tests/checkpoint_pitr.rs lines 110-165 ==="
gh api "repos/structured-world/coordinode-lsm-tree/contents/tests/checkpoint_pitr.rs?ref=$REF" \
  --jq '.content' | base64 -d | sed -n '110,165p'

Length of output: 4631


@polaz — confirmed against the live branch.

The assertion at the check site now reads u64::from(i) < info.seqno (strict <), and the surrounding comment block precisely explains why:

  • visible_seqno is "lowest excluded" — record seqno = N is committed iff visible_seqno ≥ N + 1, which is exactly seqno < visible_seqno.
  • The previous <= form was racy: insert(seqno = i) and fetch_max(i + 1) are two separate steps, so the checkpoint thread could observe visible_seqno = i between them, capture i as the watermark, and then demand record-i be present — even though the flush hadn't necessarily materialized it yet. Strict < removes that window entirely.

The "no claim about i >= info.seqno" note correctly bounds what the test asserts, and the reasoning matches the fetch_max(i + 1) semantic throughout. Fix is correct and the cross-target flakes on aarch64/riscv64/musl/i686/ppc64 should be gone. ✅

·.¸¸.·´¯·🐇·´¯`·.¸¸.·´¯)

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 23 out of 23 changed files in this pull request and generated 1 comment.

Comment thread src/abstract_tree.rs Outdated
CheckpointInfo::seqno is captured from `visible_seqno.get()` BEFORE
the version snapshot. visible_seqno is the standard 'lowest-excluded'
watermark — `visible_seqno = N` means records with seqno < N are
visible; seqno == N may or may not be visible yet. The doc previously
claimed 'every record with seqno <= info.seqno is included', which
contradicted the captured semantic and the (now-correct) PITR test
assertion (`i < info.seqno` in tests/checkpoint_pitr.rs:152).

A PITR consumer reading the doc literally could move a recovery
cutoff past data still needed from WAL or replication — the field is
a strict lower-exclusive watermark, not a max-included ceiling.

Rewrite the field doc to:
- explain it's a lowest-excluded watermark
- spell out the inclusion gate as `seqno < info.seqno`
- note seqno == info.seqno and seqno > info.seqno are undetermined
- warn against using `<=` as the inclusion gate
@polaz polaz requested a review from Copilot May 20, 2026 16:06
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 23 out of 23 changed files in this pull request and generated 3 comments.

Comment thread src/vlog/blob_file/mod.rs Outdated
Comment thread src/checkpoint.rs Outdated
Comment thread src/table/inner.rs
…ross-Fs copy failure

Three CodeRabbit findings, all in PR files:

1. src/vlog/blob_file/mod.rs:60, src/table/inner.rs (deletion_pause)
   New `std::sync::OnceLock<Arc<DeletionPause>>` fields regressed the
   crate's no-std direction (OnceLock has no alloc-only equivalent
   without the std feature). Switch to `once_cell::sync::OnceCell`,
   which is already pulled in for the zstd path and is one of the
   approved no-std-friendly primitives. The same swap applies to all
   construction sites (table/mod.rs, vlog/mod.rs,
   vlog/blob_file/multi_writer.rs, vlog/blob_file/reader.rs).
   Cargo.toml: once_cell is no longer feature-gated under zstd —
   it's an unconditional dep with default-features = false.

2. src/checkpoint.rs (link_or_copy_cross_fs streamed-copy)
   The cross-Fs streamed-copy path didn't clean up the partial `dst`
   on mid-copy failure (ENOSPC/EIO/EINTR-not-recoverable in
   read/write/flush/fsync). A subsequent retry would hit
   create_new's AlreadyExists check and surface a wholly different
   error, masking the real cause. Wrap the copy in an inner closure
   so any failure path goes through a best-effort
   `dst_fs.remove_file(dst)` before propagating the original error.
   PartialCheckpointGuard handles target-dir cleanup on the normal
   failure path, but link_or_copy_cross_fs is also reachable from
   cross-Fs tests and future direct callers — the local cleanup is
   the safer invariant.

3. src/table/inner.rs:134, src/vlog/blob_file/mod.rs (try_enqueue
   short-circuit)
   The Drop impls cloned `Arc<dyn Fs>` + PathBuf unconditionally
   even when the pause was installed but inactive — the common
   compaction-cleanup case. `try_enqueue` re-checks is_active()
   internally, but the clones were already paid. Add an outer
   `pause.is_active()` short-circuit so the common path skips both
   allocations. The under-lock recheck in try_enqueue stays — the
   outer check is a perf gate, not a correctness one.

No regression test for #2 (cleanup-on-error) — requires a custom
failing-write Fs wrapper that the existing test infrastructure
doesn't provide. The cleanup logic is a straightforward best-effort
unlink in an error path; correctness is verifiable by reading the
new match arm.
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
Cargo.toml (1)

91-99: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Keep once_cell gated to the zstd feature to prevent no_std regression.

The Cargo.toml comment states once_cell is "Optional + gated by the zstd feature," but the actual dependency is unconditional with features = ["std"], which pulls a std-bound dependency into --no-default-features --features alloc builds. The new deletion_pause fields use only new(), get(), and set() — all available on std::sync::OnceLock in Rust 1.92 — so they don't require once_cell's get_or_try_init. Only the zstd code in src/compression/mod.rs needs get_or_try_init, which is already feature-gated. Align the dependency to match the comment: gate once_cell to the zstd feature and use OnceLock (or a cfg-gated alias) for the new deletion-pause fields to avoid breaking the no_std CI constraint.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@Cargo.toml` around lines 91 - 99, The Cargo.toml currently pulls once_cell
unconditionally which breaks no_std builds; change the dependency to be optional
and feature-gated and switch the new deletion-pause code to use
std::sync::OnceLock (or a cfg-gated alias) so only zstd uses once_cell.
Concretely: make once_cell optional (once_cell = { version = "1",
default-features = false, features = ["std"], optional = true }) and add it to
the zstd feature (features.zstd should include "once_cell"), and update code
references so ZstdDictionary::prepared_handle and src/compression/mod.rs keep
using once_cell::OnceCell/get_or_try_init under #[cfg(feature = "zstd")] while
the new deletion_pause fields use std::sync::OnceLock (or a small cfg alias
type) under non-zstd/no_std builds.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@Cargo.toml`:
- Around line 91-99: The Cargo.toml currently pulls once_cell unconditionally
which breaks no_std builds; change the dependency to be optional and
feature-gated and switch the new deletion-pause code to use std::sync::OnceLock
(or a cfg-gated alias) so only zstd uses once_cell. Concretely: make once_cell
optional (once_cell = { version = "1", default-features = false, features =
["std"], optional = true }) and add it to the zstd feature (features.zstd should
include "once_cell"), and update code references so
ZstdDictionary::prepared_handle and src/compression/mod.rs keep using
once_cell::OnceCell/get_or_try_init under #[cfg(feature = "zstd")] while the new
deletion_pause fields use std::sync::OnceLock (or a small cfg alias type) under
non-zstd/no_std builds.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 31bdeea2-8c29-4478-b66c-7b180af3f190

📥 Commits

Reviewing files that changed from the base of the PR and between 9808f74 and 4be0ddb.

📒 Files selected for processing (8)
  • Cargo.toml
  • src/checkpoint.rs
  • src/table/inner.rs
  • src/table/mod.rs
  • src/vlog/blob_file/mod.rs
  • src/vlog/blob_file/multi_writer.rs
  • src/vlog/blob_file/reader.rs
  • src/vlog/mod.rs

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
Cargo.toml (1)

91-99: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Gate once_cell behind your crate's std feature or use a no-std-capable primitive.

The dependency is declared unconditionally with features = ["std"] (no optional = true), making std a transitive dependency for all builds including --no-default-features --features alloc. This contradicts the manifest's feature contract and documented no-std migration strategy. The deletion_pause field appears unconditionally across multiple modules (vlog, table, tree, checkpoint), so either:

  1. Make once_cell optional and gate it (e.g., once_cell = { version = "1", optional = true, default-features = false, features = ["std"] } + add to std feature), or
  2. Switch to once_cell::race::OnceBox (already noted as your no-std migration target) if no std dependency is acceptable.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@Cargo.toml` around lines 91 - 99, The Cargo.toml currently unconditionally
pulls once_cell with std via the dependency line `once_cell = { version = "1",
default-features = false, features = ["std"] }`, which breaks the crate's no-std
feature contract; to fix, make the dependency optional and gate it behind your
crate-level `std` feature (e.g., mark once_cell optional and add it to the `std`
feature) or replace usage with a no-std primitive such as
`once_cell::race::OnceBox`; update any code paths that rely on `once_cell`
(notably `ZstdDictionary::prepared_handle` and modules referencing the
`deletion_pause` field in vlog/table/tree/checkpoint) to be conditional on the
`std` feature or to use the no-std alternative so that `--no-default-features
--features alloc` builds do not pull in `std`.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@Cargo.toml`:
- Around line 91-99: The Cargo.toml currently unconditionally pulls once_cell
with std via the dependency line `once_cell = { version = "1", default-features
= false, features = ["std"] }`, which breaks the crate's no-std feature
contract; to fix, make the dependency optional and gate it behind your
crate-level `std` feature (e.g., mark once_cell optional and add it to the `std`
feature) or replace usage with a no-std primitive such as
`once_cell::race::OnceBox`; update any code paths that rely on `once_cell`
(notably `ZstdDictionary::prepared_handle` and modules referencing the
`deletion_pause` field in vlog/table/tree/checkpoint) to be conditional on the
`std` feature or to use the no-std alternative so that `--no-default-features
--features alloc` builds do not pull in `std`.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: baacb669-ca46-4d4f-83b0-03d205a93a18

📥 Commits

Reviewing files that changed from the base of the PR and between 9808f74 and 4be0ddb.

📒 Files selected for processing (8)
  • Cargo.toml
  • src/checkpoint.rs
  • src/table/inner.rs
  • src/table/mod.rs
  • src/vlog/blob_file/mod.rs
  • src/vlog/blob_file/multi_writer.rs
  • src/vlog/blob_file/reader.rs
  • src/vlog/mod.rs

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 23 out of 23 changed files in this pull request and generated 2 comments.

Comment thread Cargo.toml Outdated
Comment thread src/vlog/blob_file/merge.rs
polaz added 2 commits May 20, 2026 21:04
…iendly one-shot slots

CodeRabbit Cargo.toml 91-93 thread: the previous `once_cell` config
(`features = ["std"]`, unconditional) contradicted the dep comment
("optional + gated by zstd") AND pulled a std-bound dep into
`--no-default-features --features alloc` builds, breaking the
no-std-check invariant the rules want us to defend.

Resolution: switch every one-shot slot in this crate to
`once_cell::race::OnceBox` — the no-std + alloc variant — and pull
`once_cell` unconditionally with `default-features = false,
features = ["race"]`. This keeps the canonical
`get_or_try_init`-with-retry-on-failure semantics that
`std::sync::OnceLock` lacks below MSRV 1.86, without breaking
`cargo check --no-default-features`.

Sites touched:

- Cargo.toml: `once_cell` is now unconditional (race-only feature),
  removed from the `zstd` feature set, comment rewritten to match.
- src/compression/mod.rs (`ZstdDictionary::prepared_handle`): the
  zstd dictionary cache uses `OnceBox<DictionaryHandle>`. The closure
  returns `Box::new(handle)` per OnceBox API; the surrounding
  retry-on-failure contract is unchanged (the rejection-path test
  in the bench harness still pins it).
- src/table/inner.rs + src/vlog/blob_file/mod.rs: the `deletion_pause`
  field is `OnceBox<Arc<DeletionPause>>`. Construction sites updated
  in table/mod.rs, vlog/mod.rs, blob_file/multi_writer.rs,
  blob_file/reader.rs. `install_deletion_pause` now passes
  `Box::new(pause)` to `OnceBox::set`.
- `cached_blob_bytes: OnceLock<u64>` stays untouched — pre-existing
  pre-PR slot, not in scope for this thread.

structured-zstd is already a no_std-capable dep (we just opt-in to
its `std` feature via the `zstd` feature), so this change removes
the last std-only reason once_cell had to be conditional on zstd.

Verified: cargo check (default + --all-features), cargo nextest run
(1330/1330 passed), cargo clippy --lib --tests --all-features (clean).
Copilot src/vlog/blob_file/merge.rs:37: the previous PartialEq impl
compared only `scan_entry.key` while Ord ordered by
`(key, Reverse(seqno))`. That violated the Rust standard-library
`Ord` contract (a == b ⇔ a.cmp(b) == Equal): two entries with the
same key + different seqno would test equal but cmp() would order
them, which is the kind of latent footgun any future HashSet /
BTreeSet / dedup() pass over IteratorValue would silently rely on.

BinaryHeap (the active consumer since the IntervalHeap → BinaryHeap
swap in dda582a) uses Ord only, so no observable behaviour changes
today — but the contract is now actually upheld, and the comment
calls out why so the next reader doesn't regress it.

No regression test: the bug is a contract violation with no current
caller that exercises Eq on IteratorValue. A test would assert the
contract itself (`a == b iff a.cmp(b) == Equal`), which is what the
new impl directly encodes — no executable behaviour to drive.
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 24 out of 24 changed files in this pull request and generated 2 comments.

Comment thread src/compression/mod.rs Outdated
Comment thread Cargo.toml Outdated
Two Copilot doc nits flagged after the OnceLock → OnceBox swap:

1. src/compression/mod.rs:84-91 — the doc block on `prepared` still
   described `OnceCell::get_or_try_init` semantics ("losers wait on
   the internal one-shot synchronisation primitive") even though the
   field is now `OnceBox`. Rewrite to describe the actual race
   mechanic: a single CAS on the slot pointer, winners publish their
   `Box<DictionaryHandle>`, losers drop their unused Box and read the
   winner's value next iteration. No mutex, no internal park.

2. Cargo.toml:91-93 — the comment said `std::sync::OnceLock` lacks
   the needed API "until 1.86, above our 1.92 MSRV", which is
   inverted: 1.86 is *below* 1.92, so OnceLock actually HAS
   `get_or_try_init` and `set` on our MSRV. The real reason for
   once_cell is no-std: OnceLock is std-only and cannot compile under
   `--no-default-features --features alloc`. Rewrite the comment so
   it states the actual reason and stops asserting an MSRV claim that
   contradicts itself.
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 24 out of 24 changed files in this pull request and generated no new comments.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feat: create_checkpoint() — consistent hard-link snapshot for PITR backup

2 participants