Merged
42 commits
580e44a
feat(checkpoint): hard-link snapshot for PITR backup
polaz May 19, 2026
fcbab5c
refactor(checkpoint): prefer core/alloc imports for no-std friendliness
polaz May 19, 2026
dd23588
fix(checkpoint): address race conditions, cleanup, and test hardening
polaz May 19, 2026
8fde911
fix(checkpoint): tighten cleanup, TOCTOU, hard-link strategy, and tests
polaz May 19, 2026
5c5db06
fix(fs,docs): gate raw EXDEV test to Unix + clarify watermark docstring
polaz May 19, 2026
efa793b
refactor(checkpoint,fs,deletion-pause): tighten public surface + log …
polaz May 19, 2026
5faae35
fix(checkpoint,fs,tests): parent fsync, copy cleanup, race-test hands…
polaz May 20, 2026
16e2118
fix(checkpoint): seqno watermark race + warn→debug doc alignment
polaz May 20, 2026
0abd9ea
fix(checkpoint,fs,tests): align hard_link docs/behavior + log cross-b…
polaz May 20, 2026
3e1e880
test(checkpoint): regression for MVCC GC leak through flush
polaz May 20, 2026
27c1b2d
fix(checkpoint): pass 0 as flush GC threshold, not SeqNo::MAX
polaz May 20, 2026
4471fef
docs(checkpoint): correct seqno in MVCC regression test docstring
polaz May 20, 2026
8378b2f
docs(checkpoint): correct crash-recovery comment for missing CURRENT
polaz May 20, 2026
b42bf79
docs(checkpoint): explain why link_or_copy_cross_fs re-stats dst
polaz May 20, 2026
4d4d5b2
feat(fs): add Fs::backend_id namespace capability check
polaz May 20, 2026
8ec05fe
test(checkpoint): regression for missing/corrupt CURRENT pointer
polaz May 20, 2026
7d513eb
fix(tree): reject half-written checkpoint when CURRENT is missing
polaz May 20, 2026
33914fd
fix(tree): treat missing directory as 'no state' in version-state probe
polaz May 20, 2026
70904ba
feat(tree): log open failure when CURRENT is missing with stale state
polaz May 20, 2026
b2a4558
refactor(deletion_pause): swap std::sync::Mutex for spin::Mutex
polaz May 20, 2026
6f2f7b5
fix(checkpoint): fsync parent dir for relative target paths + name CU…
polaz May 20, 2026
3dfec59
test(checkpoint): assert ErrorKind::AlreadyExists structurally
polaz May 20, 2026
3ef4e07
test(deletion_pause): replace timing-based race reproducer with hands…
polaz May 20, 2026
21d031a
docs(readme): mention create_checkpoint in the Concurrency & API section
polaz May 20, 2026
0b8ba45
feat(tree): return structured Io(InvalidData) for half-written checkp…
polaz May 20, 2026
8f00597
test(checkpoint): structural ErrorKind assert in early-reject failure…
polaz May 20, 2026
f3bd055
fix(vlog): close blob file accessor before unlinking on drop
polaz May 20, 2026
e369bf0
test(checkpoint): regression for manifest-GC race deleting captured vN
polaz May 20, 2026
80f731e
fix(checkpoint): serialise captured Version into target instead of co…
polaz May 20, 2026
6aaeba8
test(checkpoint): tighten tamper test + use authoritative version id
polaz May 20, 2026
40f0b3f
test(checkpoint): regression for parent-fsync . path on non-StdFs
polaz May 20, 2026
54c202e
fix(checkpoint): skip parent fsync when target.parent() is empty
polaz May 20, 2026
8a2c200
test(checkpoint): regression for ./checkpoint relative target on MemFs
polaz May 20, 2026
a6e8afc
fix(checkpoint): normalise target_root by stripping CurDir components
polaz May 20, 2026
79a0ad8
test(checkpoint): make concurrent_writes watermark assertion meaningful
polaz May 20, 2026
dda582a
feat(filter)!: V5 storage format breaks V3/V4 compatibility
polaz May 20, 2026
a2d5d81
test(deletion_pause): make race test actually interleave A's drop and…
polaz May 20, 2026
9808f74
docs(checkpoint): correct CheckpointInfo::seqno watermark semantics
polaz May 20, 2026
4be0ddb
refactor(checkpoint): no-std-friendlier deletion pause + cleanup on c…
polaz May 20, 2026
240019d
refactor(deletion-pause,compression): switch to OnceBox for no-std-fr…
polaz May 20, 2026
9d839cc
fix(merge): align IteratorValue Eq with Ord (key + Reverse(seqno))
polaz May 20, 2026
d767602
docs(compression,cargo): correct OnceBox wording + once_cell rationale
polaz May 20, 2026
34 changes: 34 additions & 0 deletions .github/instructions/rust.instructions.md
@@ -84,3 +84,37 @@ These are not actionable review findings. Do not raise them:
- Integration tests that require specific disk layout or large data use `#[ignore = "reason"]`
- Prefer `assert_eq!` with message over bare `assert!` for better failure output
- Hardcoded values in tests are fine when accompanied by explanatory comments or assertion messages
## no_std + alloc Compatibility (Direction, Not Hard Mandate)

This crate is moving toward a `no_std + alloc` build. The CI `no-std-check` job builds against `thumbv7em-none-eabihf` with `--no-default-features --features alloc` and is **the direction we're heading**, not a hard mandate yet. Per-crate tier tables are NOT a required artifact — the rules below describe what reviewers should SUGGEST (not block) for changes that move in the right direction, and what they should ACTIVELY FLAG when changes regress no-std readiness.

When in doubt, suggest — don't gatekeep.

### Direction rules (suggest)

1. **No-std capability is the target.** Every library crate SHOULD in principle support `no_std + alloc` builds. Existing code that uses `std` is grandfathered until someone migrates it.
2. **Primitive selection order for NEW code**: `core::*` → `alloc::*` → external `no_std + alloc` crate → `std::*` behind `#[cfg(feature = "std")]` → unconditional `std::*` (last resort). Reviewers SHOULD suggest higher-tier alternatives.
3. **Cargo.toml shape**: a crate that already declares `default = ["std"]`, `std = []`, `alloc = []` should keep them coherent. `src/lib.rs` should open with `#![cfg_attr(not(feature = "std"), no_std)]` + `extern crate alloc;` once the migration is underway.
5. **Public API surface** SHOULD prefer `core` / `alloc` types where the crate's tier permits. Suggest, don't block.
13. **Prefer `core::*` over `std::*` re-exports** whenever the type is available there (e.g. `core::sync::atomic::AtomicU64`, `core::cmp::Ordering`). One-line suggestion — `std::*` re-exports of `core` types break the build under `no_std` even when binary-identical for `std`.
14. **The std-vs-no_std choice is per-primitive, NOT per-crate.** A crate tiered `std-only` is still encouraged to prefer the faster no_std-ready primitive when one exists (e.g. `hashbrown`, `parking_lot`, `smallvec`, `rustc_hash`, `bytes`) — these are normally faster than their `std::*` counterparts on hot paths.
18. **Tier reclassification suggestions** (e.g. `alloc` → `std-bound, leaf-isolated`) are welcome in the PR description. Reviewers MUST NOT block on a missing tier table — soft expectation only.

### Active flags (reject if introduced)

4. **CI gate**: the `no-std-check` job MUST run against a no-std-only target (e.g. `thumbv7em-none-eabihf`) with `--no-default-features --features alloc`. Host targets with available `std` MUST NOT be used — they silently pull `std` in via transitive features and hide real failures. PRs disabling or weakening this job's target/flags are rejected.
6. **NEW `std::collections::HashMap` / `HashSet` uses in alloc-tier modules** — prefer `hashbrown::HashMap` / `HashSet` (no_std + alloc), or `rustc_hash::FxHashMap` for internal-ID keys. Suggest in std-tier modules; reject in alloc-tier.
7. **`std::sync::Mutex` / `RwLock` in NEW code on hot paths.** Prefer `parking_lot::Mutex` / `RwLock`. `spin::Mutex` only in genuinely no-std contexts and only for very short critical sections. Existing usage is grandfathered.
8. **NEW `std::sync::OnceLock` for fallible init.** Use `once_cell::sync::OnceCell::get_or_try_init` or `once_cell::race::OnceBox`.
9. **NEW `thread_local!` in alloc-tier modules.** Replace with caller-managed scratch parameters or atomic-pointer patterns.
10. **NEW `std::io::Error` in public APIs of alloc-tier modules.** Define a crate-local error enum; `From<std::io::Error>` impls live behind `#[cfg(feature = "std")]`. Tolerate in `std-only` tier.
11. **NEW `std::time::Instant` / `std::time::SystemTime` in public APIs of alloc-tier modules.** Use a caller-provided clock trait or a `#[cfg(feature = "std")]`-gated convenience wrapper. Tolerate in `std-only` tier.
12. **NEW `std::thread::*` in alloc-tier modules.** Threading must be hoisted to a higher-tier crate.
15. **Adding `use std::*` to an alloc-tier module that was previously no-std-clean — without justification — is a regression.** Suggest a no_std alternative first; only reject if the PR's stated direction is no-std cleanup and this addition undoes that progress. No per-crate tier table is required to make this judgement.
16. **`no-std-check` compile-error count MUST NOT increase per PR.** While a crate is in transition, the job MAY run `continue-on-error: true` and the count tracked as a metric — but it MUST decrease or stay equal, never increase.
20. **Adding a transitive dependency that pulls `std` into an otherwise no-std-clean module — without justification — regresses no-std readiness.** Suggest alternatives first; reject only if the PR's stated direction is no-std cleanup and the addition undoes it.

### Always-applies

17. **Test code (`#[cfg(test)]`), benches (`benches/`), and binaries (`src/bin/`) are NOT subject to no-std rules** — they MAY use `std::*` freely. Only library code in `src/lib.rs` and its submodules is governed.
19. **Doc comments and rustdoc `# Examples` blocks** on no-std-capable APIs SHOULD NOT depend on `std::*` types if the API itself does not. Doctest examples requiring `std` should be gated `#[cfg(feature = "std")]`. Reject only when the API is documented as alloc-only and the doctest contradicts that.
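
Rule 11's caller-provided clock could look roughly like the sketch below. It is an illustration only, not code from this crate: `Clock`, `ManualClock`, and `is_expired` are hypothetical names, and the example deliberately uses nothing outside `core` so that it would also build without `std`.

```rust
use core::cell::Cell;

/// Hypothetical clock abstraction: the library asks the caller for the
/// current time instead of calling `std::time::Instant::now()` itself.
pub trait Clock {
    /// Milliseconds since an arbitrary, caller-defined epoch.
    fn now_millis(&self) -> u64;
}

/// Deterministic clock for tests. `Cell` keeps it `core`-only.
pub struct ManualClock(Cell<u64>);

impl ManualClock {
    pub const fn new(start_ms: u64) -> Self {
        Self(Cell::new(start_ms))
    }

    /// Moves the clock forward by `delta_ms` milliseconds.
    pub fn advance(&self, delta_ms: u64) {
        self.0.set(self.0.get() + delta_ms);
    }
}

impl Clock for ManualClock {
    fn now_millis(&self) -> u64 {
        self.0.get()
    }
}

/// Library-side logic depends only on the trait, never on `std::time`.
pub fn is_expired<C: Clock>(clock: &C, created_at_ms: u64, ttl_ms: u64) -> bool {
    clock.now_millis().saturating_sub(created_at_ms) >= ttl_ms
}
```

A `std` build could then offer a `#[cfg(feature = "std")]`-gated implementation of the same trait backed by the system clock, which keeps `std::time` out of the alloc-tier public API.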
29 changes: 18 additions & 11 deletions Cargo.toml
@@ -63,7 +63,7 @@ lz4 = ["dep:lz4_flex", "std"]
# conventional `!` markers on this PR's breaking commits (BuRR filter
# wire format, V5 manifest gate); release-plz raises the crate's major
# version on the next release tag accordingly.
-zstd = ["dep:structured-zstd", "dep:once_cell", "std"]
+zstd = ["dep:structured-zstd", "std"]
encryption = ["dep:aes-gcm", "dep:rand_chacha", "std"]
bytes_1 = ["dep:bytes"]
metrics = []
@@ -81,19 +81,26 @@ bytes = { version = "1", optional = true }
byteorder = { package = "byteorder-lite", version = "0.1.0" }
byteview = "~0.10.1"
enum_dispatch = "0.3.13"
interval-heap = "0.0.5"
log = "0.4.27"
# `spin = "0.9"` provides a no_std-compatible Mutex. Used by
# `deletion_pause` so that module's only std footprint comes from the
# `Fs` trait it interacts with, not from its own synchronisation.
spin = { version = "0.9", default-features = false, features = ["mutex", "spin_mutex"] }
lz4_flex = { version = "0.13.0", optional = true, default-features = false }
structured-zstd = { version = "0.0.21", optional = true, default-features = false, features = ["std"] }
# `once_cell::sync::OnceCell` has stable `get_or_try_init`; `std::sync::OnceLock`
# does not until 1.86+ (still unstable on our 1.92 MSRV path via `once_cell_try`).
# Use the external crate for the canonical single-parse-across-racers primitive
# without an auxiliary `Mutex`. The `race` module from this same crate provides
# the no-std + alloc variant (`OnceBox`) we'll swap to during the no-std
# migration. Optional + gated by the `zstd` feature — only the zstd
# `ZstdDictionary::prepared_handle` path uses it today, so non-zstd consumers
# do not pull this crate.
once_cell = { version = "1", optional = true }
# `once_cell::race::OnceBox` — the no-std + alloc one-shot primitive.
# We pick it over `std::sync::OnceLock` (which has both `set` and the
# stabilised `get_or_try_init` on our 1.92 MSRV) because OnceLock is
# std-only: it can't compile on the `cargo check
# --no-default-features --features alloc` CI target, and the crate's
# direction is no-std-friendly. Unconditional dep with
# `default-features = false, features = ["race"]` so the crate stays
# no-std-friendly even with `--no-default-features`. Used by:
# - zstd `ZstdDictionary::prepared_handle` (canonical single-parse
# across racers, no auxiliary `Mutex`)
# - `deletion_pause` slot on tables / blob files (one-shot install of
# the shared `Arc<DeletionPause>` after recovery / compaction).
once_cell = { version = "1", default-features = false, features = ["race"] }
quick_cache = { version = "0.6.16", default-features = false, features = [] }
rustc-hash = "2.1.1"
self_cell = "1.2.0"
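
The "race" style of one-shot initialisation that the `once_cell` comment above refers to can be sketched with an atomic pointer and a `Box`. This is a simplified illustration of the technique, not the `once_cell` implementation and not code from this PR: `RaceBox` is a hypothetical name, and in a `no_std + alloc` crate `Box` would come from `alloc::boxed` rather than the `std` prelude.

```rust
use core::marker::PhantomData;
use core::ptr;
use core::sync::atomic::{AtomicPtr, Ordering};

/// Hypothetical one-shot cell: racing initialisers may each build a value,
/// but only the first published value is kept.
pub struct RaceBox<T> {
    ptr: AtomicPtr<T>,
    _owns: PhantomData<Option<Box<T>>>,
}

// SAFETY: the value may be created on one thread and read or dropped on
// another, so sharing the cell requires `T: Send + Sync`.
unsafe impl<T: Send + Sync> Sync for RaceBox<T> {}

impl<T> RaceBox<T> {
    pub const fn new() -> Self {
        Self { ptr: AtomicPtr::new(ptr::null_mut()), _owns: PhantomData }
    }

    pub fn get(&self) -> Option<&T> {
        let p = self.ptr.load(Ordering::Acquire);
        // SAFETY: a non-null pointer came from `Box::into_raw` and is only
        // freed in `Drop`, which requires exclusive access.
        unsafe { p.as_ref() }
    }

    /// Fallible init: an `Err` leaves the cell empty so a later call can retry.
    pub fn get_or_try_init<E>(
        &self,
        init: impl FnOnce() -> Result<Box<T>, E>,
    ) -> Result<&T, E> {
        if let Some(value) = self.get() {
            return Ok(value);
        }
        let candidate = Box::into_raw(init()?);
        let winner = match self.ptr.compare_exchange(
            ptr::null_mut(),
            candidate,
            Ordering::AcqRel,
            Ordering::Acquire,
        ) {
            Ok(_) => candidate,
            Err(existing) => {
                // SAFETY: `candidate` lost the race and was never published.
                drop(unsafe { Box::from_raw(candidate) });
                existing
            }
        };
        // SAFETY: `winner` is non-null and owned by the cell until `Drop`.
        Ok(unsafe { &*winner })
    }
}

impl<T> Drop for RaceBox<T> {
    fn drop(&mut self) {
        let p = *self.ptr.get_mut();
        if !p.is_null() {
            // SAFETY: `p` came from `Box::into_raw` and is dropped once.
            drop(unsafe { Box::from_raw(p) });
        }
    }
}
```

The trade-off of this design is that a losing racer does redundant work and throws its value away, in exchange for needing no lock and no `std`.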
5 changes: 5 additions & 0 deletions README.md
@@ -58,6 +58,11 @@ On-disk format version **V5**. V5 introduces a wire-format break for filter bloc
- `SequenceNumberGenerator` trait — pluggable seqno source.
- Custom `UserComparator` for non-lexicographic ordering.
- MVCC: snapshot reads at a chosen `SeqNo`.
- Point-in-time recovery snapshots via `Tree::create_checkpoint` — hard-link
every live SST + blob file into a fresh directory in O(1) per file, zero
extra disk until the source files compact away. Compaction continues
during the checkpoint (deletions are deferred), and the resulting
directory opens as an independent tree.

### Internals

93 changes: 93 additions & 0 deletions src/abstract_tree.rs
@@ -15,6 +15,44 @@ pub type RangeItem = crate::Result<KvPair>;

type FlushToTablesResult = (Vec<Table>, Option<Vec<BlobFile>>);

/// Summary of a checkpoint produced by
/// [`AbstractTree::create_checkpoint`].
///
/// All byte counts are *logical* file sizes — hard links share the
/// underlying inode storage, so a checkpoint's marginal disk usage is
/// typically zero until the original files are compacted away.
#[derive(Debug, Clone, Copy)]
pub struct CheckpointInfo {
    /// Number of SST files captured.
    pub sst_files: usize,
    /// Number of blob (value-log) files captured. Always `0` for a
    /// standard [`Tree`].
    pub blob_files: usize,
    /// Sum of the logical file sizes of every captured SST + blob.
    pub total_bytes: u64,
    /// The version ID embedded in the checkpoint's `current` pointer.
    pub version_id: u64,
    /// Lower-bound visible-seqno watermark for the snapshot.
    ///
    /// Captured from the tree's `visible_seqno` generator BEFORE
    /// [`AbstractTree::current_version`]. Following the standard
    /// "lowest-excluded" watermark convention, `info.seqno = N` means
    /// every record with `seqno < N` was committed at sample time and
    /// is therefore guaranteed to be present in the snapshot. Records
    /// with `seqno == N` may or may not be included (writers can hold
    /// a record in the memtable for an instant before publishing the
    /// next watermark); records with `seqno > N` may also be present
    /// (writers can advance the counter between sample and version
    /// snapshot, and those keys still land in the captured memtable).
    ///
    /// PITR consumers MUST use `seqno < info.seqno` as the inclusion
    /// gate. Using `<=` (treating this as a max-included ceiling)
    /// could move a recovery cutoff past data still needed from WAL
    /// or replication; the field is a strict lower-exclusive watermark,
    /// not a max-included ceiling.
    pub seqno: SeqNo,
}
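
The inclusion gate described in the `seqno` docstring can be made concrete with a small consumer-side sketch. Everything here is hypothetical: `SeqNo` is assumed to be a plain `u64`, and `guaranteed_in_checkpoint` and `entries_to_replay` are illustrative helpers, not part of the crate.

```rust
/// Stand-in for the crate's `SeqNo` (assumed to be an unsigned integer).
pub type SeqNo = u64;

/// True when a record is guaranteed to be present in the checkpoint.
/// Note the strict `<`: a record at exactly the watermark is NOT guaranteed.
pub fn guaranteed_in_checkpoint(record_seqno: SeqNo, watermark: SeqNo) -> bool {
    record_seqno < watermark
}

/// Selects the WAL entries that must still be replayed on top of a
/// checkpoint taken with the given watermark.
pub fn entries_to_replay<T: Clone>(wal: &[(SeqNo, T)], watermark: SeqNo) -> Vec<(SeqNo, T)> {
    wal.iter()
        .filter(|(seqno, _)| !guaranteed_in_checkpoint(*seqno, watermark))
        .cloned()
        .collect()
}
```

Because records at or above the watermark may already be in the snapshot, a replay built this way should be idempotent.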

// Sealed on purpose: this trait is still public as a consumer-side bound
// (`&impl AbstractTree`), but external implementations are no longer part of
// the supported extension surface. Internal flush/version hooks keep evolving
@@ -62,6 +100,61 @@ pub trait AbstractTree: sealed::Sealed {
#[doc(hidden)]
fn get_version_history_lock(&self) -> RwLockWriteGuard<'_, crate::version::SuperVersions>;

/// Creates a hard-linked checkpoint of the tree's on-disk state in
/// `target_path` for point-in-time recovery (PITR) backup.
///
/// The checkpoint is a fully functional tree that can be opened
/// independently via [`Config::open`](crate::Config::open). For the
/// common single-filesystem case all SST files (and blob files, for
/// [`BlobTree`]) are hard-linked rather than copied, so the operation
/// is O(1) per file and consumes zero additional disk space until the
/// original files are compacted away — at which point the inode is
/// kept alive by the checkpoint link.
///
/// # Cross-filesystem / cross-backend fall-back
///
/// When a source file lives on a different filesystem than the
/// checkpoint target — e.g. an SST routed to a hot tier via
/// [`level_routes`](crate::Config::level_routes) on a separate volume,
/// or a backup directory on a foreign mount — the hard link cannot
/// be created (Unix `EXDEV`). In that case the checkpoint silently
/// falls back to a streamed byte copy, which:
///
/// - takes time linear in the file size instead of O(1), and
/// - consumes disk space equal to the copied bytes on the target
/// volume (no inode sharing across filesystems).
///
/// Each fall-back call emits one [`log::debug`] line (deliberately not
/// `warn`: a misconfigured tier could trigger this path once per SST
/// and per blob — thousands of times per snapshot — and per-file
/// warnings would drown real signal). Operators wanting hard-visibility
/// of unexpected full copies should enable debug logging on the `fs`
/// module or watch the `CheckpointInfo.total_bytes` figure (≫ inode
/// link cost means the fallback fired). The same `debug` policy applies
/// when source and target use entirely different [`Fs`](crate::fs::Fs)
/// backends (e.g. [`MemFs`](crate::fs::MemFs) → [`StdFs`](crate::fs::StdFs)
/// in tests).
///
/// # Concurrency
///
/// While the checkpoint is being built, compaction continues normally
/// but the physical removal of obsolete files is deferred until the
/// checkpoint hard-links are in place. This is implemented by an
/// internal reference-counted deletion gate; callers do not have to
/// pause compaction themselves.
///
/// # Errors
///
/// Returns an error if:
/// - the active memtable could not be flushed,
/// - `target_path` already exists (to prevent accidental overwrites),
/// - a hard link / copy fall-back could not be created, or
/// - the manifest / version pointer files could not be replicated.
///
/// On error any partial checkpoint files are removed automatically
/// (best-effort) so callers can safely retry against the same path.
fn create_checkpoint(&self, target_path: &std::path::Path) -> crate::Result<CheckpointInfo>;

/// Seals the active memtable and flushes to table(s).
///
/// If there are already other sealed memtables lined up, those will be flushed as well.
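
The hard-link-then-copy fall-back described in the `create_checkpoint` docs can be sketched with `std::fs` alone. This is a simplified stand-in, not the PR's `link_or_copy_cross_fs`: the names `link_or_copy` and `Captured` are hypothetical, and where the docs describe a fall-back specifically on `EXDEV`, this sketch falls back on any link error other than "already exists" or "not found".

```rust
use std::fs;
use std::io;
use std::path::Path;

/// How a file ended up in the checkpoint directory (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
pub enum Captured {
    /// O(1): shares the inode with the source file.
    HardLinked,
    /// O(size): streamed copy, carrying the number of bytes written.
    Copied(u64),
}

/// Tries a hard link first and falls back to a byte copy.
pub fn link_or_copy(src: &Path, dst: &Path) -> io::Result<Captured> {
    match fs::hard_link(src, dst) {
        Ok(()) => Ok(Captured::HardLinked),
        // An existing destination or a missing source is a real error,
        // not a reason to copy.
        Err(e)
            if e.kind() == io::ErrorKind::AlreadyExists
                || e.kind() == io::ErrorKind::NotFound =>
        {
            Err(e)
        }
        // Simplification: treat every other failure as "cannot link here".
        Err(_) => {
            let bytes = fs::copy(src, dst)?;
            Ok(Captured::Copied(bytes))
        }
    }
}
```

Returning which path was taken lets a caller count copies per checkpoint, which is one way to notice an unexpected cross-filesystem target without relying on per-file log lines.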
18 changes: 18 additions & 0 deletions src/blob_tree/mod.rs
@@ -225,6 +225,24 @@ impl BlobTree {
impl crate::abstract_tree::sealed::Sealed for BlobTree {}

impl AbstractTree for BlobTree {
    fn create_checkpoint(
        &self,
        target_path: &std::path::Path,
    ) -> crate::Result<crate::CheckpointInfo> {
        crate::checkpoint::run_checkpoint(
            self,
            &crate::checkpoint::CheckpointParams {
                target_root: target_path,
                target_fs: &self.index.config.fs,
                src_root: &self.index.config.path,
                src_fs: &self.index.config.fs,
                deletion_pause: &self.index.deletion_pause,
                visible_seqno: &self.index.config.visible_seqno,
                include_blobs: true,
            },
        )
    }

    fn print_trace(&self, key: &[u8]) -> crate::Result<()> {
        self.index.print_trace(key)
    }
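
The `deletion_pause` handle passed to `run_checkpoint` is described in the trait docs as a reference-counted deletion gate. A minimal sketch of that idea follows. It is not the crate's `DeletionPause`: `DeletionGate`, `PauseGuard`, and `remove_or_defer` are hypothetical names, `std::sync::Mutex` is used only to keep the example dependency-free, and removals are recorded in memory instead of touching the disk.

```rust
use std::path::PathBuf;
use std::sync::{Arc, Mutex};

#[derive(Default)]
struct GateState {
    /// Number of checkpoints currently holding the gate.
    active_pauses: usize,
    /// Obsolete files whose removal is postponed.
    deferred: Vec<PathBuf>,
    /// Files that have been (notionally) unlinked.
    removed: Vec<PathBuf>,
}

/// Hypothetical reference-counted deletion gate.
#[derive(Default)]
pub struct DeletionGate {
    state: Mutex<GateState>,
}

/// RAII guard: deletions stay deferred until every guard is dropped.
pub struct PauseGuard {
    gate: Arc<DeletionGate>,
}

impl DeletionGate {
    /// Called by a checkpoint before it starts linking files.
    pub fn pause(gate: &Arc<Self>) -> PauseGuard {
        gate.state.lock().unwrap().active_pauses += 1;
        PauseGuard { gate: Arc::clone(gate) }
    }

    /// Called by compaction for a file that is no longer referenced.
    pub fn remove_or_defer(&self, path: PathBuf) {
        let mut state = self.state.lock().unwrap();
        if state.active_pauses > 0 {
            state.deferred.push(path);
        } else {
            state.removed.push(path);
        }
    }

    /// Snapshot of the files removed so far.
    pub fn removed(&self) -> Vec<PathBuf> {
        self.state.lock().unwrap().removed.clone()
    }
}

impl Drop for PauseGuard {
    fn drop(&mut self) {
        let mut state = self.gate.state.lock().unwrap();
        state.active_pauses -= 1;
        if state.active_pauses == 0 {
            // Last checkpoint finished: flush the deferred removals.
            let pending = std::mem::take(&mut state.deferred);
            state.removed.extend(pending);
        }
    }
}
```

Tying the release to `Drop` means an early return or error inside the checkpoint still reopens the gate, so compaction does not need to know whether the checkpoint succeeded.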