feat(checkpoint)!: hard-link snapshot for PITR backup (V5 storage)#276
Conversation
- Add `Fs::hard_link(src, dst)` with default `Unsupported` impl; concrete
impls on `StdFs` (with EXDEV→copy fallback), `MemFs` (copy-as-link
semantics) and `IoUringFs` (delegates to `StdFs`).
- Introduce `DeletionPause` — refcount-gated deferred-delete queue that
holds back `Table` / `BlobFile` removals while a checkpoint is in
flight, mirroring RocksDB's `DisableFileDeletions`/`EnableFileDeletions`.
- Add `AbstractTree::create_checkpoint(target) -> CheckpointInfo`,
implemented on `Tree`, `BlobTree`, and (via enum_dispatch) `AnyTree`.
The driver flushes the active memtable, snapshots the current version,
hard-links every live SST + blob into `target/{tables,blobs}/`, copies
manifest / version / current-pointer, and fsyncs the target root.
- `level_routes` tiers are honoured by linking through each table's own
`Fs` handle; cross-`Fs` situations fall back to a streamed byte copy.
- Tests: 11 integration scenarios in `tests/checkpoint_pitr.rs` (round-trip,
concurrent writes, BlobTree, level_routes, isolation, crash-safety,
deferred-delete vs concurrent major_compact, empty tree, byte-accuracy,
double-target rejection, cross-Fs copy) plus unit coverage for
`is_cross_device` and `DeletionPause`.
Closes #210
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: ASSERTIVE Plan: Pro Run ID: 📒 Files selected for processing (2)
📝 WalkthroughWalkthroughCreates PITR checkpoints: adds Fs mkdir/hard-link primitives, a refcounted DeletionPause to defer deletions, a checkpoint driver that hard-links or copies SST/blob files and metadata, wires pause into Tree/BlobTree lifecycles, and adds end-to-end PITR tests. ChangesCheckpoint PITR Feature
Sequence Diagram(s)sequenceDiagram
participant Client
participant Tree
participant Checkpoint
participant Fs
participant DeletionPause
Client->>Tree: create_checkpoint(target_path)
Tree->>DeletionPause: acquire() (Pause guard)
Tree->>Checkpoint: run_checkpoint(params)
Checkpoint->>Tree: flush_active_memtable()
Checkpoint->>Tree: capture current_version + visible_seqno
Checkpoint->>Fs: create_dir(target), create tables/, optionally blobs/
Checkpoint->>Fs: hard_link or copy SST/blob files
Checkpoint->>Fs: copy manifest, write v<id>, write CURRENT
Checkpoint->>Fs: fsync child dirs then root (and parent if present)
Checkpoint->>DeletionPause: release() (drop Pause)
Checkpoint->>Tree: return CheckpointInfo
Tree->>Client: CheckpointInfo
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 12
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/abstract_tree.rs`:
- Around line 89-95: Update the checkpoint docstring to mention that while
checkpoints attempt O(1) hard-links (so they consume no extra space) for all
SST/blob files, the implementation falls back to streamed/copy-on-write behavior
when hard-linking fails across filesystems/backends (e.g., EXDEV), which incurs
linear IO and additional storage for those files; keep references to
Config::open and BlobTree and state that callers using level routing or separate
backup volumes may observe this fallback and its performance/storage
implications.
In `@src/checkpoint.rs`:
- Around line 220-229: The code currently copies the live CURRENT file via
copy_metadata_file(CURRENT_VERSION_FILE) which can race with a concurrent
publish; modify run_checkpoint so that instead of cloning the source CURRENT you
atomically write a new CURRENT in the checkpoint containing the captured version
id (use the captured version.id() / version_id) as the content and perform this
write as the final metadata step; replace the
copy_metadata_file(CURRENT_VERSION_FILE) call with an atomic write of the
captured version string (use the same metadata-write helper/atomic-temp-rename
pattern used elsewhere) so CURRENT always points to the snapshot's version and
not to a concurrently-published one.
- Around line 268-292: After prepare_target succeeds you must install a
best-effort cleanup guard that removes the partially-built checkpoint on any
early return/error: create a scoped guard (or use a Drop-based RAII guard)
immediately after prepare_target(target_root, include_blobs, &**target_fs)? that
captures target_root and target_fs and on drop (if not disarmed) deletes the
target (e.g., remove_dir_all or equivalent), then disarm the guard only after
the final fsync_directory completes successfully. Ensure the guard covers
failures from deletion_pause.acquire(), tree.flush_active_memtable(SeqNo::MAX)?,
link_tables(...), link_blob_files(...), copy_metadata(...), and
fsync_directory(...), and reference/disarm it right after the successful
completion of fsync_directory so retries won’t hit AlreadyExists.
- Around line 280-282: CheckpointInfo::seqno can race ahead of the captured
snapshot because visible_seqno.get() is read after tree.current_version();
change the code to capture the seqno at the same logical snapshot as
current_version() by obtaining visible_seqno as-of that version (or by using the
tree snapshot API that returns version+seqno together) instead of calling
visible_seqno.get() separately; update the code around the calls to
tree.current_version() and visible_seqno.get() so you either call a
visible_seqno.as_of(version) (or equivalent API) or take the snapshot first
(e.g., snapshot = tree.capture_snapshot()) and read snapshot.seqno, ensuring
CheckpointInfo::seqno reflects the same snapshot point as
tree.current_version().
- Around line 53-65: prepare_target currently races by calling
target_fs.exists(target) then create_dir_all, allowing another process to create
the same checkpoint; change it to claim the checkpoint root atomically (either
by using an atomic create-new API on the filesystem or by creating a create-new
sentinel file/dir inside target and failing if it already exists) before
creating TABLES_FOLDER and optional blobs; specifically, update prepare_target
to attempt an atomic create for target (or create_new for a sentinel like
target.join(".lock" or ".claim")), return AlreadyExists if that atomic create
fails, and only after successful claim call
target_fs.create_dir_all(&target.join(TABLES_FOLDER)) and, when include_blobs is
true, create the blobs folder — ensure you reference prepare_target and
TABLES_FOLDER and preserve error mapping to std::io::ErrorKind::AlreadyExists.
- Around line 283-292: The code currently calls link_tables and link_blob_files
to populate target/tables and target/blobs, then writes metadata and calls
fsync_directory(target_root, target_fs) only on the root; to fix, after
link_tables and (if include_blobs) link_blob_files succeed and before the final
fsync_directory(target_root, target_fs) you must call fsync_directory on each
populated child directory (e.g., the tables directory and, if created, the blobs
directory) using the same target_fs handle so those directory entry changes are
durable; keep the final fsync_directory(target_root, target_fs) after these
child-directory syncs and ensure you perform these child fsyncs regardless of
copy_metadata succeeding to preserve correct fsync ordering for
link_tables/link_blob_files, referencing link_tables, link_blob_files,
fsync_directory, target_root and target_fs in your changes.
In `@src/deletion_pause.rs`:
- Around line 150-151: Replace the test-level lint suppression attribute so it
uses expect instead of allow: change the attribute on the test module/section
that currently reads #[allow(clippy::unwrap_used, reason = "...")] (the
attribute immediately following #[cfg(test)]) to #[expect(clippy::unwrap_used,
reason = "...")] so the suppression will emit a warning if it becomes
unnecessary; keep the same reason string and retain #[cfg(test)] unchanged.
- Around line 121-144: The drain is racing with a concurrently-created new pause
because after fetch_sub() transitions active 1->0 a new acquire() can create a
fresh pause and enqueue items that this branch will wrongly take; fix by making
draining keyed or serialized with acquire()/try_enqueue(): add and bump a
monotonic generation counter (e.g., inner.generation) when acquire() creates a
new pause and have drop() capture the generation it observed, then while holding
inner.queue.lock() verify the queue's generation matches that captured
generation before taking it (or only drain items from the matching generation);
alternatively, ensure acquire()/try_enqueue() and the drain both synchronize on
the same mutex so the drain only swaps the queue when no new pause has been
created. Ensure you update acquire(), try_enqueue(), and Drop for DeletionPause
to consult the generation or shared lock and reference the existing fields
active, inner.queue, acquire(), try_enqueue(), and the prev == 1 branch when
applying the change.
In `@src/fs/mem_fs.rs`:
- Around line 1285-1312: The test hard_link_creates_independent_copy should
verify independent backing buffers by mutating one path before removal and
asserting the other is unchanged: after creating src and dst via
fs.hard_link(src, dst), open one (e.g., src) with
FsOpenOptions::new().write(true), write different bytes via write_all and
drop/close it, then remove_file(src) and assert exists(src) is false and
exists(dst) is true, and finally open dst and read_to_string to confirm its
contents remain the original value (not the mutated bytes); update assertions
accordingly to reference hard_link, open, write_all, remove_file,
read_to_string, and exists.
In `@tests/checkpoint_pitr.rs`:
- Around line 179-181: The test is silently swallowing I/O errors by using
filter_map(Result::ok) on std::fs::read_dir(&dst_tables) and by calling
e.metadata().ok(); instead collect the directory entries as a Result<Vec<_>>
(e.g. collect::<std::io::Result<Vec<_>>>()), propagate any read_dir error with ?
and replace e.metadata().ok() with e.metadata()? so filesystem errors bubble up
instead of being dropped; apply the same changes to the other occurrences around
the entries/count logic referenced (the blocks around lines 260-263 and 338-341)
so any unreadable checkpoint entries cause the test to fail with the original
I/O error.
- Around line 154-185: The test never verifies that any SST was actually flushed
into the routed hot tier, so add a pre-check that at least one flushed SST
exists under the hot tier path (hot_dir) before calling
tree.create_checkpoint(&dst_path); e.g., inspect hot_dir.path()/tables (or the
same tables layout used later) with std::fs::read_dir, count entries, and assert
entries >= 1 with a clear message so the checkpoint validation proves the routed
path (level_routes / LevelRoute) was honored.
- Around line 415-454: Change the test to exercise a failure that occurs after
prepare_target rather than before it: do not pre-create the full destination (so
prepare_target runs), then induce a failure after prepare_target but before
checkpoint completion (for example by making the destination unwritable or
creating a conflicting file inside the freshly prepared target to break the
subsequent link/copy steps) when calling tree.create_checkpoint; assert that
create_checkpoint returns an error, that the partially-created destination
directory is removed/cleaned up, and that the source tree (accessed via tree.get
and reopen via open_tree) still contains all data. Ensure the new assertions
reference create_checkpoint and the prepared target behavior so the test
verifies cleanup-on-failure semantics.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: c21dbfd7-893d-4247-b5b0-97e6bb93d0ad
📒 Files selected for processing (18)
src/abstract_tree.rssrc/blob_tree/mod.rssrc/checkpoint.rssrc/deletion_pause.rssrc/fs/io_uring_fs.rssrc/fs/mem_fs.rssrc/fs/mod.rssrc/fs/std_fs.rssrc/lib.rssrc/table/inner.rssrc/table/mod.rssrc/tree/inner.rssrc/tree/mod.rssrc/vlog/blob_file/mod.rssrc/vlog/blob_file/multi_writer.rssrc/vlog/blob_file/reader.rssrc/vlog/mod.rstests/checkpoint_pitr.rs
There was a problem hiding this comment.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Adds a hard-link based checkpoint primitive to enable O(1)-per-file point-in-time recovery (PITR) snapshots of an open LSM tree, including correctness around concurrent compaction via deferred deletions.
Changes:
- Introduces
AbstractTree::create_checkpoint(...) -> CheckpointInfoand shared checkpoint driver (src/checkpoint.rs). - Adds
Fs::hard_link(src, dst)with backend implementations (StdFs EXDEV→copy fallback, MemFs copy semantics, IoUringFs delegation). - Implements
DeletionPauseand wires it intoTable/BlobFiledrops to defer deletions during checkpoints; adds end-to-end PITR tests.
Reviewed changes
Copilot reviewed 18 out of 18 changed files in this pull request and generated 7 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/checkpoint_pitr.rs | New end-to-end checkpoint/PITR integration tests including concurrent write/compaction scenarios. |
| src/vlog/mod.rs | Ensures recovered BlobFile::Inner carries fs + deletion pause state. |
| src/vlog/blob_file/reader.rs | Updates blob-file reader test setup to include new Inner fields. |
| src/vlog/blob_file/multi_writer.rs | Ensures newly created blob files carry fs + deletion pause state. |
| src/vlog/blob_file/mod.rs | Adds DeletionPause plumbing + Fs to blob-file inner; defers deletes in Drop. |
| src/tree/mod.rs | Implements create_checkpoint for Tree; installs deletion pause on recovered/new tables/blobs. |
| src/tree/inner.rs | Adds tree-wide deletion_pause to TreeInner. |
| src/table/mod.rs | Adds install_deletion_pause + OnceLock storage for the pause in tables. |
| src/table/inner.rs | Defers table file deletion in Drop when checkpoint pause is active. |
| src/lib.rs | Exposes CheckpointInfo; adds internal checkpoint + deletion_pause modules. |
| src/fs/std_fs.rs | Adds hard_link implementation with EXDEV detection + copy fallback; tests. |
| src/fs/mod.rs | Adds Fs::hard_link default (Unsupported) + documentation. |
| src/fs/mem_fs.rs | Implements hard_link as an independent byte copy for MemFs; tests. |
| src/fs/io_uring_fs.rs | Delegates hard_link to StdFs. |
| src/deletion_pause.rs | New refcounted deletion gate with queue draining on last guard drop; unit tests. |
| src/checkpoint.rs | New checkpoint implementation: prepare target, pause deletions, flush, link/copy files, copy metadata, fsync. |
| src/blob_tree/mod.rs | Implements create_checkpoint for BlobTree (includes blob files). |
| src/abstract_tree.rs | Adds CheckpointInfo type + create_checkpoint API docs on AbstractTree. |
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
Switches `Arc`, `AtomicU32`, `fmt::Debug` and `vec!` imports in the new
checkpoint + deletion-pause modules to their `core::*` / `alloc::*`
forms. `Mutex`, `PathBuf` and `std::io::{Read, Write}` stay on `std::*`
because they have no `core` / `alloc` equivalent — these dependencies
are inherited from the `Fs` trait that already lives in `std`.
No behaviour change; the no-std-check error count stays flat (1206) but
the new code stops contributing trivially-portable usages to the
migration debt.
Concurrency / correctness fixes: - prepare_target uses atomic Fs::create_dir (new trait method) instead of exists()+create_dir_all, closing the TOCTOU window where two concurrent creators could race and both proceed thinking they claimed the directory. - run_checkpoint now writes a fresh `current` pointer for the captured version_id rather than copying the live source file. A concurrent flush/compaction can publish v<N+1> after current_version() snapshot, and the old code would copy that newer CURRENT into the checkpoint — leaving the snapshot pointing at a version it never linked. - Reorders visible_seqno.get() BEFORE current_version() so the recorded CheckpointInfo.seqno is a lower bound on the snapshot's contents. Concurrent writers can no longer advance the cutoff past data that did not make it into the checkpoint. - fsync_directory is now called on tables/ and blobs/ (when populated) before the root, so directory entries for the hard-linked files survive a power loss instead of relying on the root fsync alone. - PartialCheckpointGuard removes the target directory on any early return from run_checkpoint, keeping retries against the same path idempotent. - link_or_copy_cross_fs retries on EINTR in the read loop, matching StdFs::copy_fallback and avoiding flaky failures under signal interruption. DeletionPause: - Drop for Pause re-checks `active` under the queue lock before draining, so a new pause that acquires between fetch_sub→0 and the lock cannot have its freshly-queued items stolen by the previous drop. Includes a regression test for the generation race. - Mutex poisoning in Drop and try_enqueue now recovers via PoisonError::into_inner instead of leaking deferred deletions permanently — the whole point of the type is to keep the filesystem tidy, so leaving a queue stranded on poison defeats the purpose. - Switches test-level #[allow(unwrap_used)] to #[expect(...)] so the suppression warns if it ever becomes unnecessary. Docs / API: - Adds Fs::create_dir (atomic single-leaf mkdir) with Unsupported default + concrete impls on StdFs, MemFs, IoUringFs. - abstract_tree.rs docstring spells out the cross-filesystem copy fall-back so callers using level_routes or detached backup volumes know to expect linear-time / extra-space behaviour in those cases. - std_fs::is_cross_device docstring now explains explicitly why we do NOT classify PermissionDenied as a cross-device signal. - Adds manual Debug impl on vlog::blob_file::Inner now that the Arc<dyn Fs> + OnceLock fields prevent #[derive(Debug)]. Tests: - MemFs hard_link test now mutates one path and asserts the other is unchanged, proving independent backing buffers (the old assertion on remove_file would have passed even with Arc<Mutex<Vec<u8>>> aliasing). - checkpoint_pitr.rs replaces filter_map(Result::ok) with collected Result<Vec<_>> in three places so I/O failures during verification surface instead of being silently dropped. - checkpoint_flattens_level_routes asserts the routed tier received an SST BEFORE checkpointing, so a regression that ignored level_routes would no longer pass silently. - compaction_during_checkpoint_preserves_source_ssts wraps the compactor thread in a RAII guard so an early-returning ? cleans up the worker instead of detaching it. - Removes a tautological assert_eq!(info.sst_files, info.sst_files) and replaces it with a meaningful sst_files >= 1 invariant. - cross_fs_link_or_copy_streams_through_trait now mutates the destination (not a different file) and verifies the StdFs source is unchanged AND the MemFs destination reflects the write. - checkpoint_failure_leaves_source_intact splits into early-reject and retry-after-cleanup paths.
There was a problem hiding this comment.
⚠️ Performance Alert ⚠️
Possible performance regression was detected for benchmark 'lsm-tree db_bench'.
Benchmark result of this commit is worse than the previous benchmark result exceeding threshold 1.15.
| Benchmark suite | Current: 9808f74 | Previous: 413a296 | Ratio |
|---|---|---|---|
mergerandom |
382152.66922789026 ops/sec (normalized) |
470343.5799479985 ops/sec (normalized) |
1.23 |
This comment was automatically generated by workflow using github-action-benchmark.
CC: @polaz
There was a problem hiding this comment.
Actionable comments posted: 6
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/abstract_tree.rs`:
- Around line 33-36: Update the documentation for CheckpointInfo::seqno to state
it is a lower-bound watermark rather than the definitive maximum visible seqno:
change the doc comment on the CheckpointInfo struct (and the other doc block
around the second occurrence) to say something like "A lower-bound watermark:
this seqno is guaranteed to be included in the checkpoint; later seqnos may also
be present due to sampling before flush/snapshot, so callers must treat it as an
inclusive minimum rather than the absolute maximum." Ensure you update the
comment attached to the field named seqno (CheckpointInfo::seqno) and any
repeated explanatory doc blocks in the same file to use this wording.
In `@src/checkpoint.rs`:
- Around line 364-370: The partial-checkpoint cleanup guard is created only
after prepare_target(...) returns which can leak target_root if prepare_target
creates the root but then fails creating child dirs; move the
PartialCheckpointGuard::new(target_root, target_fs) to immediately after the
root directory is claimed (i.e. as soon as prepare_target acquires/creates
target_root) or modify prepare_target(...) to install its own RAII guard for the
subdirectory-creation phase so any early return cleans up the partial root;
ensure the guard is still disarmed by commit() after fsync_directory succeeds
and keep references to target_root and target_fs consistent.
- Around line 201-208: Replace the exists-then-open TOCTOU pattern by opening
the source file directly: instead of calling src_fs.exists(&src) then
src_fs.open(&src, ...), call src_fs.open(&src, &FsOpenOptions::new().read(true))
first and match its Result; if it Errs with a NotFound/NotFound-equivalent from
the filesystem API, return Ok(()) (treat missing source as ignorable), otherwise
propagate the error and proceed to open/create the destination (dst) and
continue; remove the prior src_fs.exists call and ensure error handling checks
the filesystem's NotFound variant when handling the open of src (use the same
src, src_root, file_name, src_fs, FsOpenOptions, and src_file identifiers to
locate the change).
In `@tests/checkpoint_pitr.rs`:
- Around line 292-299: The compactor thread errors are currently being
swallowed: calls to major_compact(...) and the thread join in
CompactorGuard::drop discard results; change those points (references:
major_compact, CompactorGuard, Drop, handle, join, stop) to propagate or surface
failures instead of ignoring them — e.g., have major_compact return Result and
unwrap/expect or send the Err back to the test harness, and replace "let _ =
h.join()" with h.join().expect("compactor thread panicked") (or propagate the
Box<Any> to fail the test); apply the same treatment to the other join site
noted around the 306-309 region so any compaction error or panic fails the test
rather than being hidden.
- Around line 96-109: Capture the CheckpointInfo returned by
tree.create_checkpoint (don't discard it into _info), then use info.seqno to
bound the assertions: iterate keys whose seqno is <= info.seqno and assert each
is present in restored (use restored.get(..., lsm_tree::SeqNo::MAX) to read the
snapshot), while making no assertions about keys with seqno > info.seqno; update
the variable name from _info to info and replace the fixed 0..50 loop with a
loop that uses info.seqno as the upper bound (casting types as needed).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: f9ce0ddf-0862-4fac-b9b9-33d3421f9c86
📒 Files selected for processing (9)
src/abstract_tree.rssrc/checkpoint.rssrc/deletion_pause.rssrc/fs/io_uring_fs.rssrc/fs/mem_fs.rssrc/fs/mod.rssrc/fs/std_fs.rssrc/vlog/blob_file/mod.rstests/checkpoint_pitr.rs
Correctness / safety:
- prepare_target now arms a local RAII cleanup guard immediately after
the atomic create_dir(target) succeeds. If creating tables/ or blobs/
subsequently fails, the freshly-claimed root is removed so retries
against the same path don't hit AlreadyExists.
- link_or_copy_cross_fs replaces the Arc::ptr_eq fast-path with a
"try hard_link first, fall back on NotFound/Unsupported" strategy.
Two Arc<dyn Fs> handles backed by the same kernel filesystem (e.g.
an Arc::new(StdFs) created independently in level_routes versus the
primary config.fs) used to be pessimised into an O(bytes) byte copy;
they now hard-link in O(1) as intended.
- copy_metadata_file opens the source directly instead of exists() +
open() (closes the TOCTOU window where the file disappears between
the two calls). It also distinguishes Required vs Optional metadata:
v<id> is now Required so a concurrent version-GC that removes the
captured v<id> fails the checkpoint loudly instead of producing an
unopenable snapshot.
Docs:
- CheckpointInfo.seqno docstring rewritten as a lower-bound watermark
("guaranteed-included, may also contain later seqnos"). The previous
"maximum visible seqno" wording could mislead PITR consumers into
treating it as a replay ceiling.
Tests:
- DeletionPause race test now uses two mpsc channels to pin the
invariant check at the exact moment Thread B holds an active pause
and has enqueued. Without the fix, A's drain would have already
swept B's file before the check runs; the buggy implementation now
fails the assertion mid-flight instead of producing the same final
state as the fixed one.
- compaction_during_checkpoint_preserves_source_ssts: CompactorGuard
now propagates panics via resume_unwind and surfaces Err returns as
test failures (with thread::panicking() guards to avoid double-
panic). The compactor loop itself uses ? on major_compact instead
of swallowing errors with `let _ = ...`.
- checkpoint_survives_concurrent_writes captures CheckpointInfo and
asserts the lower-bound watermark contract: every key with
seqno <= info.seqno MUST be present in the restored checkpoint
(keys above the watermark make no claim either way).
- checkpoint_failure_leaves_source_intact (Unix path) now genuinely
exercises the post-prepare failure scenario by chmod-ing the source
tables/ directory to 0o000 between flush and create_checkpoint.
link(2) requires search (x) permission on every path component, so
prepare_target succeeds and link_tables then fails with EACCES,
exercising PartialCheckpointGuard's cleanup. The non-Unix branch
falls back to the previous success-then-cleanup-then-retry chain.
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/abstract_tree.rs`:
- Around line 37-39: The doc comment for the watermark around
AbstractTree::current_version contains contradictory wording: it claims the
watermark's seqno is "≤ every seqno reflected in the checkpointed SSTs" which
conflicts with the lower-bound semantics used elsewhere; update the comment on
the watermark (the captured seqno described just before
AbstractTree::current_version) to either state it is a lower bound (i.e., ≤ the
eventual maximum reflected seqno) or remove the misleading sentence and keep
only the explicit lower-bound guarantee — ensure references to
AbstractTree::current_version and "watermark"/"seqno" are adjusted so the
comment clearly says the value is captured before current_version and is a
lower-bound on the maximum seqno reflected in checkpointed SSTs.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 7e645e32-fb47-48d2-9add-be8da3c7ebcc
📒 Files selected for processing (4)
src/abstract_tree.rssrc/checkpoint.rssrc/deletion_pause.rstests/checkpoint_pitr.rs
- `is_cross_device_detects_exdev_and_kind_variants` now gates the raw `from_raw_os_error(18)` assertion behind `#[cfg(unix)]`. errno 18 is EXDEV on Linux/macOS/BSDs but means ERROR_NO_MORE_FILES on Windows, and the `#[cfg(unix)]` branch in `is_cross_device` is not compiled for non-Unix targets, so the assertion was guaranteed to fail on Windows test runners. The ErrorKind variant assertions stay cross-platform. - `CheckpointInfo::seqno` docstring previously said the watermark is "≤ every seqno reflected" which conflicts with the lower-bound semantics described in the next paragraph. Rewritten as "≤ the *maximum* seqno reflected" so PITR consumers can't misread it as a per-record ceiling.
…noise - `crate::checkpoint` is now `pub(crate)` instead of `pub` + `#[doc(hidden)]`. The cross-Fs `link_or_copy_cross_fs` integration test is moved into inline unit tests inside `src/checkpoint.rs`, so the helpers no longer leak into the stable public API surface. - `MemFs::create_dir` now delegates to `ensure_parent_dir`, so a parent that's actually a regular file produces a "parent is not a directory" error (matching POSIX `ENOTDIR`) instead of an ambiguous `NotFound`. - `DeletionPause::new()` returns `Self` (plain ownership); a new `new_shared() -> Arc<Self>` convenience covers the install-into-tree case. Brings the constructor in line with the rest of the crate's ownership-explicit style. - `StdFs::hard_link` now logs the cross-device fallback at `debug!` rather than `warn!`. A tier-misconfigured checkpoint can hit this path once per SST and once per blob — per-file warnings would drown real signal. Per-checkpoint summary remains the responsibility of the driver. - `is_cross_device` replaces the bare literal `18` with a named `EXDEV` constant + explanatory comment (matches the existing `flock` pattern that also declares POSIX values inline rather than pulling in `libc`).
There was a problem hiding this comment.
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/checkpoint.rs`:
- Around line 513-518: The code currently fsyncs tables, blobs (when
include_blobs), and target_root via fsync_directory but never fsyncs the parent
directory of target_root, so the checkpoint directory entry can be lost; after
the existing fsync_directory(target_root, &**target_fs)? call, add a final
fsync_directory on target_root.parent() (handle the Option safely—return an
error if None or propagate) using the same target_fs to ensure the directory
entry for the checkpoint is durable; reference the existing fsync_directory
function and the target_root/TABLES_FOLDER/BLOBS_FOLDER logic so the new
parent-directory fsync occurs as the last step before returning success.
In `@src/fs/std_fs.rs`:
- Around line 258-281: In copy_fallback, ensure we remove any partially-written
destination if a post-create error occurs: after
OpenOptions::new().write(true).create_new(true).open(dst) succeeds, if any
subsequent operation (read/write error, io::ErrorKind::Interrupted aside, or
dst_file.sync_all()) returns Err, best-effort remove the dst file (e.g.,
std::fs::remove_file(dst)) before returning the error; alternatively implement
writing to a temporary file (e.g., dst with a unique suffix) and only rename
(std::fs::rename) to dst after sync_all() succeeds, and ensure the temporary
file is removed on any failure — update error paths around src_file.read,
dst_file.write_all, and dst_file.sync_all to perform the cleanup.
In `@tests/checkpoint_pitr.rs`:
- Around line 340-364: The compactor thread may not have begun calling
major_compact before create_checkpoint runs; add a one-shot handshake so the
test waits until the worker has entered its loop (and attempted at least one
major_compact) before calling create_checkpoint. Modify the thread spawn block
that calls major_compact (the closure moved into thread::spawn referencing
compactor_tree, compactor_stop, major_compact) to take a
std::sync::mpsc::Sender<()> (or a oneshot channel) and send a single "ready"
signal from inside the loop before the first sleep/iteration; in the main test
code, create the Receiver and block on recv() (or try_recv/timeouts as
appropriate) right after spawning the thread and before calling
tree.create_checkpoint(&dst_path) so the checkpoint is gated until the compactor
loop has actually started. Ensure CompactorGuard construction and error paths
still behave the same.
- Around line 85-97: The test races: the writer thread may not perform any
post-50 writes before create_checkpoint runs; add a one-shot synchronization so
the main thread waits for the writer to perform its first insert/flush before
calling tree.create_checkpoint. Concretely, create a std::sync::mpsc::channel
(or oneshot) and move the Sender into the writer thread; after the writer does
its first insert or flush (e.g. right after the first writer_tree.insert(...) or
after the first writer_tree.flush_active_memtable(0)?), call sender.send(()) to
signal readiness; in the test thread call receiver.recv().unwrap() (or try_recv
with timeout) before invoking tree.create_checkpoint(&dst_path)?; keep existing
writer, writer_tree, insert, flush_active_memtable, create_checkpoint and
writer.join usage unchanged otherwise.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: caf4b740-58fa-485b-bcc3-ba55f9421a8d
📒 Files selected for processing (8)
src/checkpoint.rssrc/deletion_pause.rssrc/fs/mem_fs.rssrc/fs/std_fs.rssrc/lib.rssrc/tree/inner.rssrc/tree/mod.rstests/checkpoint_pitr.rs
…hakes - run_checkpoint now fsync's the directory CONTAINING target_root after the root sync. Without this the checkpoint directory entry can still vanish after a power loss even though `create_checkpoint()` returned Ok and the children are durable. - copy_fallback wraps its read/write/sync_all sequence in an inner closure and best-effort `remove_file(dst)` on any post-`create_new` error. Previously an ENOSPC/EIO mid-copy left a truncated file behind, and the next retry would see AlreadyExists and surface a corrupt file from an operation that already reported failure. - checkpoint_survives_concurrent_writes adds an mpsc handshake: the writer thread sends `()` after its first post-flush insert and the main thread blocks on `recv()` before calling create_checkpoint. A fast scheduler could otherwise finish the snapshot before the writer touched the tree, collapsing the test into a single-threaded smoke test. - compaction_during_checkpoint_preserves_source_ssts adds the same handshake for the compactor — the worker sends `()` from inside its loop (after the announced flag flips) so we know major_compact has actually started before the checkpoint races it.
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/checkpoint.rs`:
- Around line 482-496: The checkpoint seqno can advance after
flush_active_memtable but before current_version, causing CheckpointInfo::seqno
to exceed the actual persisted checkpoint contents; fix by computing the
checkpoint cutoff from the persisted snapshot rather than the live
visible_seqno: capture a lower-bound seqno before mutating visible_seqno (or
read it prior to calling tree.flush_active_memtable), or obtain the seqno from
the version returned by tree.current_version() and use that for
CheckpointInfo::seqno so the recorded seqno never outruns the snapshot; update
the code that sets captured_seqno and the use of tree.current_version()
accordingly (references: visible_seqno, tree.flush_active_memtable,
tree.current_version, CheckpointInfo::seqno).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: c7cb5d1f-89b7-4339-a07e-4e4c10c725c1
📒 Files selected for processing (8)
src/checkpoint.rssrc/deletion_pause.rssrc/fs/mem_fs.rssrc/fs/std_fs.rssrc/lib.rssrc/tree/inner.rssrc/tree/mod.rstests/checkpoint_pitr.rs
There was a problem hiding this comment.
♻️ Duplicate comments (1)
src/checkpoint.rs (1)
482-496:⚠️ Potential issue | 🟠 MajorCapture the PITR cutoff before the checkpoint flush.
A writer that lands between Line 486 and Line 495 can advance
visible_seqno, whilecurrent_version()at Line 496 still snapshots only persisted files. That letsCheckpointInfo::seqnooutrun the actual checkpoint contents, which breaks the watermark contract for PITR consumers.Suggested fix
- tree.flush_active_memtable(SeqNo::MAX)?; - - // Capture the seqno BEFORE the version snapshot. Reading - // `visible_seqno` after `current_version()` would let concurrent - // writers advance the counter past data that never made it into the - // checkpoint, so PITR consumers could move the recovery cutoff past - // records they still need from WAL / replication. The "before" - // ordering means `CheckpointInfo::seqno` is a lower bound on the - // seqno reflected in the version snapshot — safe for cutoffs. - let captured_seqno = visible_seqno.get(); + // Capture the cutoff BEFORE the checkpoint-triggered flush. Any + // later write can still race into the new active memtable or even a + // later flush, but this value remains a safe lower bound for what + // the checkpoint is guaranteed to contain. + let captured_seqno = visible_seqno.get(); + tree.flush_active_memtable(SeqNo::MAX)?; let version = tree.current_version();As per coding guidelines,
Flag concurrency bugs: data races on shared segment lists, lock ordering violations, missing synchronization on manifest updates (Tier 1 — MUST flag).🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/checkpoint.rs` around lines 482 - 496, The code captures visible_seqno after flushing the active memtable and taking the version snapshot which allows a concurrent writer to advance visible_seqno and violate the PITR watermark; move the read of visible_seqno.get() (the captured_seqno used for CheckpointInfo::seqno) to before calling tree.flush_active_memtable(SeqNo::MAX) and before tree.current_version() so the checkpoint's seqno is a true lower bound on the snapshot; ensure CheckpointInfo::seqno is assigned from that earlier-captured_seqno and not from a later read.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Duplicate comments:
In `@src/checkpoint.rs`:
- Around line 482-496: The code captures visible_seqno after flushing the active
memtable and taking the version snapshot which allows a concurrent writer to
advance visible_seqno and violate the PITR watermark; move the read of
visible_seqno.get() (the captured_seqno used for CheckpointInfo::seqno) to
before calling tree.flush_active_memtable(SeqNo::MAX) and before
tree.current_version() so the checkpoint's seqno is a true lower bound on the
snapshot; ensure CheckpointInfo::seqno is assigned from that
earlier-captured_seqno and not from a later read.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 1ecb5c09-8bae-4027-a180-59053efda983
📒 Files selected for processing (3)
src/checkpoint.rssrc/fs/std_fs.rstests/checkpoint_pitr.rs
- run_checkpoint now samples visible_seqno BEFORE flush_active_memtable instead of between flush and current_version(). Previous ordering left a window where a concurrent writer could land in the freshly-rotated active memtable, advance the counter past data that the SST snapshot doesn't contain, and push CheckpointInfo.seqno above the actual captured contents. The "before flush" ordering keeps the captured seqno a strict lower bound on what's in the snapshot regardless of what writers do during the flush+snapshot window. - AbstractTree::create_checkpoint, Fs::hard_link, and link_or_copy_cross_fs docstrings now say "logged via log::debug" instead of "log::warn", matching StdFs::hard_link which was downgraded earlier to avoid per-file log spam on tier-misconfigured checkpoints. Adds a note that operator visibility for unexpected copies is the checkpoint driver's job (CheckpointInfo.total_bytes ≫ inode link cost is the actionable signal).
|
Tip For best results, initiate chat on the files or code changes. 🧩 Analysis chain🏁 Script executed: Length of output: 4631
The assertion at the check site now reads
The "no claim about (¯ |
CheckpointInfo::seqno is captured from `visible_seqno.get()` BEFORE the version snapshot. visible_seqno is the standard 'lowest-excluded' watermark — `visible_seqno = N` means records with seqno < N are visible; seqno == N may or may not be visible yet. The doc previously claimed 'every record with seqno <= info.seqno is included', which contradicted the captured semantic and the (now-correct) PITR test assertion (`i < info.seqno` in tests/checkpoint_pitr.rs:152). A PITR consumer reading the doc literally could move a recovery cutoff past data still needed from WAL or replication — the field is a strict lower-exclusive watermark, not a max-included ceiling. Rewrite the field doc to: - explain it's a lowest-excluded watermark - spell out the inclusion gate as `seqno < info.seqno` - note seqno == info.seqno and seqno > info.seqno are undetermined - warn against using `<=` as the inclusion gate
…ross-Fs copy failure Three CodeRabbit findings, all in PR files: 1. src/vlog/blob_file/mod.rs:60, src/table/inner.rs (deletion_pause) New `std::sync::OnceLock<Arc<DeletionPause>>` fields regressed the crate's no-std direction (OnceLock has no alloc-only equivalent without the std feature). Switch to `once_cell::sync::OnceCell`, which is already pulled in for the zstd path and is one of the approved no-std-friendly primitives. The same swap applies to all construction sites (table/mod.rs, vlog/mod.rs, vlog/blob_file/multi_writer.rs, vlog/blob_file/reader.rs). Cargo.toml: once_cell is no longer feature-gated under zstd — it's an unconditional dep with default-features = false. 2. src/checkpoint.rs (link_or_copy_cross_fs streamed-copy) The cross-Fs streamed-copy path didn't clean up the partial `dst` on mid-copy failure (ENOSPC/EIO/EINTR-not-recoverable in read/write/flush/fsync). A subsequent retry would hit create_new's AlreadyExists check and surface a wholly different error, masking the real cause. Wrap the copy in an inner closure so any failure path goes through a best-effort `dst_fs.remove_file(dst)` before propagating the original error. PartialCheckpointGuard handles target-dir cleanup on the normal failure path, but link_or_copy_cross_fs is also reachable from cross-Fs tests and future direct callers — the local cleanup is the safer invariant. 3. src/table/inner.rs:134, src/vlog/blob_file/mod.rs (try_enqueue short-circuit) The Drop impls cloned `Arc<dyn Fs>` + PathBuf unconditionally even when the pause was installed but inactive — the common compaction-cleanup case. `try_enqueue` re-checks is_active() internally, but the clones were already paid. Add an outer `pause.is_active()` short-circuit so the common path skips both allocations. The under-lock recheck in try_enqueue stays — the outer check is a perf gate, not a correctness one. No regression test for #2 (cleanup-on-error) — requires a custom failing-write Fs wrapper that the existing test infrastructure doesn't provide. The cleanup logic is a straightforward best-effort unlink in an error path; correctness is verifiable by reading the new match arm.
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
Cargo.toml (1)
91-99:⚠️ Potential issue | 🟠 Major | ⚡ Quick winKeep
once_cellgated to thezstdfeature to prevent no_std regression.The Cargo.toml comment states
once_cellis "Optional + gated by thezstdfeature," but the actual dependency is unconditional withfeatures = ["std"], which pulls a std-bound dependency into--no-default-features --features allocbuilds. The newdeletion_pausefields use onlynew(),get(), andset()— all available onstd::sync::OnceLockin Rust 1.92 — so they don't requireonce_cell'sget_or_try_init. Only the zstd code insrc/compression/mod.rsneedsget_or_try_init, which is already feature-gated. Align the dependency to match the comment: gateonce_cellto thezstdfeature and useOnceLock(or a cfg-gated alias) for the new deletion-pause fields to avoid breaking the no_std CI constraint.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@Cargo.toml` around lines 91 - 99, The Cargo.toml currently pulls once_cell unconditionally which breaks no_std builds; change the dependency to be optional and feature-gated and switch the new deletion-pause code to use std::sync::OnceLock (or a cfg-gated alias) so only zstd uses once_cell. Concretely: make once_cell optional (once_cell = { version = "1", default-features = false, features = ["std"], optional = true }) and add it to the zstd feature (features.zstd should include "once_cell"), and update code references so ZstdDictionary::prepared_handle and src/compression/mod.rs keep using once_cell::OnceCell/get_or_try_init under #[cfg(feature = "zstd")] while the new deletion_pause fields use std::sync::OnceLock (or a small cfg alias type) under non-zstd/no_std builds.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@Cargo.toml`:
- Around line 91-99: The Cargo.toml currently pulls once_cell unconditionally
which breaks no_std builds; change the dependency to be optional and
feature-gated and switch the new deletion-pause code to use std::sync::OnceLock
(or a cfg-gated alias) so only zstd uses once_cell. Concretely: make once_cell
optional (once_cell = { version = "1", default-features = false, features =
["std"], optional = true }) and add it to the zstd feature (features.zstd should
include "once_cell"), and update code references so
ZstdDictionary::prepared_handle and src/compression/mod.rs keep using
once_cell::OnceCell/get_or_try_init under #[cfg(feature = "zstd")] while the new
deletion_pause fields use std::sync::OnceLock (or a small cfg alias type) under
non-zstd/no_std builds.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 31bdeea2-8c29-4478-b66c-7b180af3f190
📒 Files selected for processing (8)
Cargo.tomlsrc/checkpoint.rssrc/table/inner.rssrc/table/mod.rssrc/vlog/blob_file/mod.rssrc/vlog/blob_file/multi_writer.rssrc/vlog/blob_file/reader.rssrc/vlog/mod.rs
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
Cargo.toml (1)
91-99:⚠️ Potential issue | 🟠 Major | 🏗️ Heavy liftGate
once_cellbehind your crate'sstdfeature or use a no-std-capable primitive.The dependency is declared unconditionally with
features = ["std"](nooptional = true), makingstda transitive dependency for all builds including--no-default-features --features alloc. This contradicts the manifest's feature contract and documented no-std migration strategy. Thedeletion_pausefield appears unconditionally across multiple modules (vlog, table, tree, checkpoint), so either:
- Make
once_celloptional and gate it (e.g.,once_cell = { version = "1", optional = true, default-features = false, features = ["std"] }+ add tostdfeature), or- Switch to
once_cell::race::OnceBox(already noted as your no-std migration target) if nostddependency is acceptable.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@Cargo.toml` around lines 91 - 99, The Cargo.toml currently unconditionally pulls once_cell with std via the dependency line `once_cell = { version = "1", default-features = false, features = ["std"] }`, which breaks the crate's no-std feature contract; to fix, make the dependency optional and gate it behind your crate-level `std` feature (e.g., mark once_cell optional and add it to the `std` feature) or replace usage with a no-std primitive such as `once_cell::race::OnceBox`; update any code paths that rely on `once_cell` (notably `ZstdDictionary::prepared_handle` and modules referencing the `deletion_pause` field in vlog/table/tree/checkpoint) to be conditional on the `std` feature or to use the no-std alternative so that `--no-default-features --features alloc` builds do not pull in `std`.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@Cargo.toml`:
- Around line 91-99: The Cargo.toml currently unconditionally pulls once_cell
with std via the dependency line `once_cell = { version = "1", default-features
= false, features = ["std"] }`, which breaks the crate's no-std feature
contract; to fix, make the dependency optional and gate it behind your
crate-level `std` feature (e.g., mark once_cell optional and add it to the `std`
feature) or replace usage with a no-std primitive such as
`once_cell::race::OnceBox`; update any code paths that rely on `once_cell`
(notably `ZstdDictionary::prepared_handle` and modules referencing the
`deletion_pause` field in vlog/table/tree/checkpoint) to be conditional on the
`std` feature or to use the no-std alternative so that `--no-default-features
--features alloc` builds do not pull in `std`.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: baacb669-ca46-4d4f-83b0-03d205a93a18
📒 Files selected for processing (8)
Cargo.tomlsrc/checkpoint.rssrc/table/inner.rssrc/table/mod.rssrc/vlog/blob_file/mod.rssrc/vlog/blob_file/multi_writer.rssrc/vlog/blob_file/reader.rssrc/vlog/mod.rs
…iendly one-shot slots
CodeRabbit Cargo.toml 91-93 thread: the previous `once_cell` config
(`features = ["std"]`, unconditional) contradicted the dep comment
("optional + gated by zstd") AND pulled a std-bound dep into
`--no-default-features --features alloc` builds, breaking the
no-std-check invariant the rules want us to defend.
Resolution: switch every one-shot slot in this crate to
`once_cell::race::OnceBox` — the no-std + alloc variant — and pull
`once_cell` unconditionally with `default-features = false,
features = ["race"]`. This keeps the canonical
`get_or_try_init`-with-retry-on-failure semantics that
`std::sync::OnceLock` lacks below MSRV 1.86, without breaking
`cargo check --no-default-features`.
Sites touched:
- Cargo.toml: `once_cell` is now unconditional (race-only feature),
removed from the `zstd` feature set, comment rewritten to match.
- src/compression/mod.rs (`ZstdDictionary::prepared_handle`): the
zstd dictionary cache uses `OnceBox<DictionaryHandle>`. The closure
returns `Box::new(handle)` per OnceBox API; the surrounding
retry-on-failure contract is unchanged (the rejection-path test
in the bench harness still pins it).
- src/table/inner.rs + src/vlog/blob_file/mod.rs: the `deletion_pause`
field is `OnceBox<Arc<DeletionPause>>`. Construction sites updated
in table/mod.rs, vlog/mod.rs, blob_file/multi_writer.rs,
blob_file/reader.rs. `install_deletion_pause` now passes
`Box::new(pause)` to `OnceBox::set`.
- `cached_blob_bytes: OnceLock<u64>` stays untouched — pre-existing
pre-PR slot, not in scope for this thread.
structured-zstd is already a no_std-capable dep (we just opt-in to
its `std` feature via the `zstd` feature), so this change removes
the last std-only reason once_cell had to be conditional on zstd.
Verified: cargo check (default + --all-features), cargo nextest run
(1330/1330 passed), cargo clippy --lib --tests --all-features (clean).
Copilot src/vlog/blob_file/merge.rs:37: the previous PartialEq impl compared only `scan_entry.key` while Ord ordered by `(key, Reverse(seqno))`. That violated the Rust standard-library `Ord` contract (a == b ⇔ a.cmp(b) == Equal): two entries with the same key + different seqno would test equal but cmp() would order them, which is the kind of latent footgun any future HashSet / BTreeSet / dedup() pass over IteratorValue would silently rely on. BinaryHeap (the active consumer since the IntervalHeap → BinaryHeap swap in dda582a) uses Ord only, so no observable behaviour changes today — but the contract is now actually upheld, and the comment calls out why so the next reader doesn't regress it. No regression test: the bug is a contract violation with no current caller that exercises Eq on IteratorValue. A test would assert the contract itself (`a == b iff a.cmp(b) == Equal`), which is what the new impl directly encodes — no executable behaviour to drive.
Two Copilot doc nits flagged after the OnceLock → OnceBox swap:
1. src/compression/mod.rs:84-91 — the doc block on `prepared` still
described `OnceCell::get_or_try_init` semantics ("losers wait on
the internal one-shot synchronisation primitive") even though the
field is now `OnceBox`. Rewrite to describe the actual race
mechanic: a single CAS on the slot pointer, winners publish their
`Box<DictionaryHandle>`, losers drop their unused Box and read the
winner's value next iteration. No mutex, no internal park.
2. Cargo.toml:91-93 — the comment said `std::sync::OnceLock` lacks
the needed API "until 1.86, above our 1.92 MSRV", which is
inverted: 1.86 is *below* 1.92, so OnceLock actually HAS
`get_or_try_init` and `set` on our MSRV. The real reason for
once_cell is no-std: OnceLock is std-only and cannot compile under
`--no-default-features --features alloc`. Rewrite the comment so
it states the actual reason and stops asserting an MSRV claim that
contradicts itself.
Summary
Implements #210 — the consistent hard-link checkpoint primitive that unblocks point-in-time recovery (PITR) backup for CoordiNode and any other consumer that needs an O(1)-per-file snapshot of an open LSM-tree.
Fs::hard_link(src, dst)with a defaultUnsupportedimpl (non-breaking for downstream backends).StdFsperforms a realhard_link(2)with an EXDEV→copy fallback;MemFsproduces an independent copy;IoUringFsdelegates toStdFssince linking is a metadata-only syscall.Fs::backend_id() -> Option<u64>— explicit shared-namespace capability check that gates cross-Fshard-link attempts.StdFs/IoUringFsreturn a commonKERNEL_BACKEND_ID; eachMemFs::new()allocates a unique per-instance id; default impl returnsNonefor safe-by-default third-party backends.DeletionPause— a tree-wide refcount-gated deferred-delete queue installed into eachTable::Inner/BlobFile::Inner. While at least onePauseguard is alive, bothDropimpls queue removals instead of executing them; the queue drains when the last guard is dropped. Mirrors RocksDB'sDisableFileDeletions/EnableFileDeletionspattern. Synchronisation usesspin::Mutexso the type isno_std-compatible.AbstractTree::create_checkpoint(target) -> CheckpointInfo— implemented onTree,BlobTree, and (viaenum_dispatch)AnyTree. The driver capturesvisible_seqnofirst, flushes the active memtable (eviction seqno0— never expands MVCC GC beyond what would have happened anyway), snapshots the current version, hard-links every live SST + blob intotarget/{tables,blobs}/, copies manifest /v<id>/current, and fsyncs the target root.level_routes(tiered storage) is honoured automatically — eachTable::Inneralready carries its own routedFs, so the link is performed through the right backend. Cross-Fssituations (e.g.MemFssource vs.StdFstarget) transparently fall back to a streamed byte copy after thebackend_idnamespace check declines the hard-link path.Tree::opennow rejects a directory that has version artifacts (tables/,blobs/,vN) but nocurrentpointer — the on-disk signature of a half-written checkpoint. Previously the code silently calledcreate_new, which would have overwritten the partial state with an empty tree.Test plan
cargo nextest run --all-features— 1327/1327 passed (6 environment-skipped), including 14 integration scenarios intests/checkpoint_pitr.rs:major_compactcannot remove an SST captured by an in-flight checkpointlevel_routesSSTs land in the flattenedtarget/tables/directorylink_tablesfailure)sst_files == 0CheckpointInfo.total_bytesmatches the on-disk sumAlreadyExistsSeqNo::MAX)currentpointer rejected byTree::opencurrentpointer rejected byTree::openFs(StdFs↔MemFs) link-or-copy streams through both trait objects (inline insrc/checkpoint.rs)Plus unit coverage for:
DeletionPause— defer / inactive-passthrough / nested-pauses / generation-race-under-lockStdFs::is_cross_device— raw EXDEV +ErrorKindvariantsFs::hard_link— round-trip + duplicate / missing-source rejection (bothStdFsandMemFs)copy_fallback— independence + create-new semanticscargo clippy --lib --testscleancargo test --docpasses for all touched modulesCloses #210
Summary by CodeRabbit
New Features
Reliability / Bug Fixes
Utilities
Filesystem compatibility
Tests / Documentation
BREAKING CHANGE: on-disk format V5. V3/V4 databases are NOT readable by this version and vice versa. V5 ships the BuRR filter wire-format change (replaces standard bloom, #269) plus the PITR hard-link checkpoint primitive (this PR, #210). The major bump goes through release-plz on squash.