Merged
Changes from 46 commits
50 commits
ded0a6f
feat(filter): vendor ribbon-filter v0.2.0 in-tree as ribbon module fo…
polaz May 19, 2026
94e68e7
feat(filter): scaffold BuRR module on top of vendored Ribbon primitives
polaz May 19, 2026
c002a3b
feat(filter/burr): per-block threshold computation + partition helpers
polaz May 19, 2026
91368f6
feat(filter/burr): wire per-block thresholds into builder + probe path
polaz May 19, 2026
d6312be
feat(filter/burr): wire-format encode + decode + reader probe
polaz May 19, 2026
6c62c98
feat(filter/burr): hash-based build + probe API for LSM integration
polaz May 19, 2026
31e9d6c
feat(filter): replace standard bloom with BuRR
polaz May 19, 2026
478a31f
test(filter/burr): edge cases + end-to-end table reopen
polaz May 19, 2026
4083b91
chore(filter/burr): polish docs + CHANGELOG entry
polaz May 19, 2026
4053a6c
revert(changelog): release-plz manages CHANGELOG.md
polaz May 19, 2026
2519476
revert(docs): drop copilot-instructions changelog rule
polaz May 19, 2026
0be3b9f
fix(filter/burr): harden wire decode + remove probe-path allocations
polaz May 19, 2026
a1dc317
docs(filter/ribbon): note ribbon-serde feature wiring
polaz May 19, 2026
4f0d60e
docs(filter/ribbon): clarify why vendored block uses #[allow]
polaz May 19, 2026
ab031a3
feat(filter)!: BuRR replaces standard bloom — wire-format break
polaz May 19, 2026
009979b
perf(filter/burr): inline hot path + pre-decode band words
polaz May 19, 2026
f22618b
perf(filter/burr): revert eager z_words decode — keep wire-borrowed
polaz May 19, 2026
868a59d
fix(filter): align is_active+empty-write paths, expect over allow
polaz May 19, 2026
af26f2a
docs(filter/ribbon): note allow-propagation into burr submodule
polaz May 19, 2026
81a6a4a
fix(filter): unblock CI — feature wiring + decode fail-closed
polaz May 19, 2026
f32ba27
fix(filter): partitioned empty-fail-closed, doc + estimator alignment
polaz May 19, 2026
4145129
fix(filter): bitvec without atomic for 32-bit cross-arch compatibility
polaz May 19, 2026
bc36ce7
fix(filter): drop bitvec dep, harden BuRR builder + wire
polaz May 19, 2026
03e4bec
fix(filter/ribbon): bump default retry_limit to 8 for endian portability
polaz May 19, 2026
8544d47
feat(format)!: bump disk version to V5, drop zstd-pure alias
polaz May 19, 2026
eafd77b
docs(readme): bump on-disk version to V5
polaz May 19, 2026
af55c86
docs(readme): add Credits section with original-author attribution
polaz May 19, 2026
c8b9588
docs(readme): expand feature-flag rationale per flag
polaz May 19, 2026
9808284
docs(format): clarify V5 incompatibility mechanism, expect over allow
polaz May 19, 2026
38e87be
fix(filter): checked arithmetic, hard scratch checks, specific error …
polaz May 19, 2026
4fae47e
test(filter/burr): cover wire single-pass entry point + error paths
polaz May 19, 2026
b9f88d1
fix(filter/burr): symmetric probe-pairing docs + owned-input build va…
polaz May 19, 2026
dcdab54
docs(cargo): explain zstd-pure alias removal in features block
polaz May 19, 2026
ddc3a61
fix(filter): harden builder/wire against malformed input
polaz May 19, 2026
3de592a
docs(cargo): clarify zstd-pure feature removal rationale
polaz May 19, 2026
f998647
chore(license): migrate per-file headers to SPDX + dual copyright
polaz May 19, 2026
11d2fb4
chore(changelog): restore 4.5.0 section from main after rebase
polaz May 19, 2026
656ed27
docs(license,filter): clarify SPDX header convention + probe-mode rat…
polaz May 19, 2026
6758e09
test(filter/block): add regression for empty-payload sentinel
polaz May 19, 2026
d170963
fix(filter/block): honour empty-payload sentinel in maybe_contains_hash
polaz May 19, 2026
4bbfc31
fix(filter): checked layer-header offset + preserve spill buffer capa…
polaz May 19, 2026
16d2a55
fix(filter/block): drop unfulfilled clippy::unwrap_used expect on tes…
polaz May 19, 2026
73be152
test(filter/ribbon): expand error-Display coverage for every variant
polaz May 19, 2026
5806aca
fix(filter,docs): enforce b >= w invariant + accurate licensing/docs
polaz May 19, 2026
0fd208b
docs(filter/params): note layer_m floor invariant + builder gate
polaz May 19, 2026
931dbf7
test(filter/ribbon): expand coverage for params + BurrBuilder validation
polaz May 19, 2026
9d50308
test(tree/recovery): cover both V3 and V4 manifest rejection
polaz May 19, 2026
40c3122
test(filter/ribbon): direct coverage for verbatim-seed builder variants
polaz May 19, 2026
0bdcecf
fix(filter,docs): correct key/value limits + harden layer_m + tighten…
polaz May 19, 2026
10c5c23
docs(lib): range tombstones came in V4, not V5
polaz May 19, 2026
40 changes: 36 additions & 4 deletions Cargo.toml
@@ -1,6 +1,6 @@
[package]
name = "coordinode-lsm-tree"
description = "A K.I.S.S. implementation of log-structured merge trees (LSM-trees/LSMTs) — CoordiNode fork"
description = "Embedded LSM-tree storage engine: BuRR filters, zstd dictionary compression, MVCC, range tombstones, merge operators, K/V separation, AES-256-GCM at rest."
license = "Apache-2.0"
version = "4.5.0"
edition = "2024"
@@ -9,8 +9,19 @@ readme = "README.md"
include = ["src/**/*", "build.rs", "LICENSE-APACHE", "README.md", "CHANGELOG.md"]
repository = "https://github.com/structured-world/coordinode-lsm-tree"
homepage = "https://github.com/structured-world/coordinode-lsm-tree"
keywords = ["lsm-tree", "storage", "database", "coordinode", "key-value"]
categories = ["data-structures", "database-implementations"]
documentation = "https://docs.rs/coordinode-lsm-tree"
keywords = ["lsm-tree", "storage", "database", "embedded", "key-value"]
categories = ["data-structures", "database-implementations", "filesystem", "compression"]

[package.metadata.docs.rs]
# Build the docs.rs page with every feature enabled so the rendered
# crate page exposes the full public API surface (zstd dictionary
# compression, encryption, io-uring, bytes integration, metrics,
# ribbon-serde). cfg(docsrs) is set so #[cfg_attr(docsrs, ...)] items
# render their feature/availability badges.
all-features = true
rustdoc-args = ["--cfg", "docsrs"]
targets = ["x86_64-unknown-linux-gnu"]

[lib]
name = "lsm_tree"
@@ -20,11 +31,26 @@ path = "src/lib.rs"
default = []
io-uring = ["dep:io-uring"]
lz4 = ["dep:lz4_flex"]
# The previous `zstd-pure = ["zstd"]` alias was removed. It was
# documented as deprecated when there were two candidate zstd backends
# on the roadmap; only structured-zstd remains, so the alias serves no
# purpose and is dropped per the standard deprecation lifecycle. The
# removal is signalled to release tooling as a breaking change via the
# conventional `!` markers on this PR's breaking commits (BuRR filter
# wire format, V5 manifest gate); release-plz raises the crate's major
# version on the next release tag accordingly.
zstd = ["dep:structured-zstd"]
zstd-pure = ["zstd"]
encryption = ["dep:aes-gcm", "dep:rand_chacha"]
bytes_1 = ["dep:bytes"]
metrics = []
# Vendored Ribbon filter retains its `#[cfg(feature = "ribbon-serde")]`
# guards (renamed from upstream's bare `serde` feature to avoid clashing
# with any future top-level serde feature in this crate). We do not
# consume the serde repr from inside this crate — the BuRR on-disk
# format is byteorder-encoded — but the feature wires `serde` as an
# optional dep so `--all-features` builds and a future extraction back
# into a standalone crate compile cleanly.
ribbon-serde = ["dep:serde"]

[dependencies]
bytes = { version = "1", optional = true }
@@ -38,6 +64,9 @@ structured-zstd = { version = "0.0.21", optional = true, default-features = fals
quick_cache = { version = "0.6.16", default-features = false, features = [] }
rustc-hash = "2.1.1"
self_cell = "1.2.0"
# Optional — only pulled in by the vendored Ribbon filter under the
# `ribbon-serde` feature flag (off by default).
serde = { version = "1", optional = true, features = ["derive"] }
sfa = "~1.0.0"
tempfile = "3.20.0"
varint-rs = "2.2.0"
@@ -65,6 +94,9 @@ fs_extra = "1.3.0"
nanoid = "0.5.0"
proptest = "1"
rand = "0.10.1"
# Used by the vendored ribbon-filter's #[cfg(feature = "ribbon-serde")]
# round-trip tests. Dev-only; production code does not depend on it.
serde_json = "1"
strum = { version = "0.28.0", features = ["derive"] }
test-log = "0.2.18"
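
The `ribbon-serde` comment in the features block above describes serde support that is compiled in only when the flag is on. A minimal Rust sketch of one common way such a guard is written follows. `ExampleFilterRepr`, its fields, and `serde_enabled` are hypothetical names for illustration; the vendored type's real definition is not shown in this diff.

```rust
/// Hypothetical stand-in for a serializable filter representation.
/// With the `ribbon-serde` feature off, the `cfg_attr` line expands to
/// nothing, so this file compiles without `serde` being present at all.
#[cfg_attr(
    feature = "ribbon-serde",
    derive(serde::Serialize, serde::Deserialize)
)]
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ExampleFilterRepr {
    /// Illustrative field: hash seed used at build time.
    pub seed: u64,
    /// Illustrative field: packed solution words.
    pub words: Vec<u64>,
}

/// Reports whether the serde derives were compiled in.
pub fn serde_enabled() -> bool {
    cfg!(feature = "ribbon-serde")
}

fn main() {
    let repr = ExampleFilterRepr {
        seed: 7,
        words: vec![1, 2, 3],
    };
    println!("serde derives compiled in: {}", serde_enabled());
    println!("{repr:?}");
}
```

Because the derive sits behind `cfg_attr`, a default build never resolves the `serde` paths, which matches the comment's point that the optional dependency exists mainly so `--all-features` builds compile.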

141 changes: 71 additions & 70 deletions README.md
@@ -1,6 +1,4 @@
<p align="center">
<img src="/logo.png" height="160">
</p>
# coordinode-lsm-tree

[![CI](https://github.com/structured-world/coordinode-lsm-tree/actions/workflows/coordinode-ci.yml/badge.svg)](https://github.com/structured-world/coordinode-lsm-tree/actions/workflows/coordinode-ci.yml)
[![codecov](https://codecov.io/gh/structured-world/coordinode-lsm-tree/graph/badge.svg)](https://codecov.io/gh/structured-world/coordinode-lsm-tree)
@@ -11,107 +9,106 @@
[![dependency status](https://deps.rs/repo/github/structured-world/coordinode-lsm-tree/status.svg)](https://deps.rs/repo/github/structured-world/coordinode-lsm-tree)
[![License](https://img.shields.io/badge/license-Apache--2.0-blue)](#license)

> LSM-tree engine for [CoordiNode](https://github.com/structured-world/coordinode), maintained by [Structured World Foundation](https://sw.foundation).
> Derivative work of [fjall-rs/lsm-tree](https://github.com/fjall-rs/lsm-tree), developed independently with diverging features: zstd dictionary compression, custom sequence number generators, multi_get (batch-optimized), PinnableSlice zero-copy reads, WriteBatch seqno-grouped batch writes with caller-controlled atomic visibility, intra-L0 compaction, and security hardening.
LSM-tree storage engine in Rust. Embedded library; provides keyed point reads, prefix and range scans, MVCC snapshots, compaction, and a block cache. No write-ahead log — durability is the caller's responsibility. Built for [CoordiNode](https://github.com/structured-world/coordinode); usable standalone.

> [!IMPORTANT]
> This fork now introduces a fork-specific **disk format V4** compatibility boundary.
> `V4` is a breaking on-disk change relative to `V3` because the fork persists new semantics such as range tombstones and merge operands.
> New code may continue reading supported `V3` databases, but databases written with these `V4` semantics must not be opened by older `V3` binaries.
## Status

A K.I.S.S. implementation of log-structured merge trees (LSM-trees/LSMTs) in Rust.
On-disk format version **V5**. V5 introduces a wire-format break for filter blocks (BuRR replaces Bloom); V3 and V4 databases are not readable by this version and vice versa. Versioning is single-monotonic — every breaking format change bumps to the next version with explicit migration notes.

> [!NOTE]
> This crate only provides a primitive LSM-tree, not a full storage engine.
> For example, it does not ship with a write-ahead log.
> You probably want to use https://github.com/fjall-rs/fjall instead.
## Features

## About
### Read path

This is the most feature-rich LSM-tree implementation in Rust! It features:
- Point reads via `get` / `multi_get` (batch-optimized).
- `PinnableSlice` for zero-copy reads.
- `BurrFilter` AMQ filter (Bumped Ribbon Retrieval, Walzer & Dillinger 2022): ~1% memory overhead vs the information-theoretic minimum — ~30% smaller filter blocks than a same-FPR Bloom filter, or ~10× tighter FPR at the same memory budget. Used for both per-key and per-prefix membership checks.
- Forward and reverse range / prefix iteration.
- Block cache with size cap.
- File-descriptor cache to bound `fopen` syscalls.
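
The size figures in the `BurrFilter` bullet can be sanity-checked with textbook formulas. The sketch below assumes an idealized, optimally tuned Bloom filter (`log2(1/fpr) / ln 2` bits per key) and models BuRR as the information-theoretic minimum plus the ~1% overhead quoted above. The function names are illustrative and are not part of this crate's API, and the numbers are not measurements of the implementation.

```rust
use std::f64::consts::LN_2;

/// Information-theoretic minimum bits per key for a false-positive rate `fpr`.
fn min_bits_per_key(fpr: f64) -> f64 {
    (1.0 / fpr).log2()
}

/// Idealized Bloom filter with an optimal number of hash functions.
fn bloom_bits_per_key(fpr: f64) -> f64 {
    min_bits_per_key(fpr) / LN_2
}

/// Filter with a fractional space overhead over the minimum
/// (0.01 models the ~1% figure quoted for BuRR).
fn overhead_bits_per_key(fpr: f64, overhead: f64) -> f64 {
    min_bits_per_key(fpr) * (1.0 + overhead)
}

/// False-positive rate of the idealized Bloom filter at a bits-per-key budget.
fn bloom_fpr(bits: f64) -> f64 {
    2f64.powf(-bits * LN_2)
}

/// False-positive rate of the overhead-model filter at the same budget.
fn overhead_fpr(bits: f64, overhead: f64) -> f64 {
    2f64.powf(-bits / (1.0 + overhead))
}

fn main() {
    let fpr = 0.01;
    let bloom = bloom_bits_per_key(fpr);
    let burr = overhead_bits_per_key(fpr, 0.01);
    println!("bits/key at 1% FPR: bloom {bloom:.2}, 1%-overhead model {burr:.2}");
    println!("relative saving: {:.1}%", (1.0 - burr / bloom) * 100.0);

    for bits in [8.0, 10.0, 12.0] {
        let ratio = bloom_fpr(bits) / overhead_fpr(bits, 0.01);
        println!("{bits} bits/key: FPR ratio bloom/model = {ratio:.1}");
    }
}
```

Under this model the space saving is `1 - 1.01 * ln 2`, about 30%, independent of the target FPR. The FPR advantage at a fixed budget is not constant: it is roughly 8× at 10 bits per key and grows as the budget increases, so the "~10×" figure should be read as an order-of-magnitude statement.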

- Thread-safe `BTreeMap`-like API
- Mostly [safe](./UNSAFE.md) & 100% stable Rust
- Block-based tables with compression support & prefix truncation
- Optional block hash indexes in data blocks for faster point lookups [[3]](#footnotes)
- Per-level filter/index block pinning configuration
- Range & prefix searching with forward and reverse iteration
- Block caching to keep hot data in memory
- File descriptor caching with upper bound to reduce `fopen` syscalls
- *AMQ* filters (currently Bloom filters) to improve point lookup performance
- Multi-versioning of KVs, enabling snapshot reads
- Optionally partitioned block index & filters for better cache efficiency [[1]](#footnotes)
- Leveled and FIFO compaction
- Optional key-value separation for large value workloads [[2]](#footnotes), with automatic garbage collection
- Single deletion tombstones ("weak" deletion)
- Optional compaction filters to run custom logic during compactions
### Write path

Keys are limited to 65536 bytes, values are limited to 2^32 bytes.
As is normal with any kind of storage engine, larger keys and values have a bigger performance impact.
- `WriteBatch` with seqno-grouped batch writes — caller-controlled atomic visibility.
- Single deletion tombstones (`remove_weak`).
- Range tombstones (`delete_range` / `delete_prefix`).
- Merge operators for commutative LSM operations.
- Optional key-value separation (BlobTree) for large-value workloads with automatic garbage collection.

## Feature flags

### lz4
### Compaction

Allows using `LZ4` compression, powered by [`lz4_flex`](https://github.com/PSeitz/lz4_flex).
- Leveled, size-tiered, dynamic-leveled, and FIFO strategies.
- Intra-L0 compaction for overlapping runs.
- Major compaction (full force flush + merge).
- Optional compaction filters for custom logic during compactions.
- Merge-aware compaction resolves operands lazily.

*Disabled by default.*
### Storage & encoding

### zstd
- Block-based tables with optional compression (none / LZ4 / Zstd) and prefix truncation.
- Per-table data block size policy and per-table compression policy.
- Optional **zstd dictionary compression** — trained per-table or per-column for small (4-64 KiB) blocks and blob files.
- Optional **block-level encryption at rest** — AES-256-GCM, key supplied by caller.
- Optional per-table block hash indexes for faster point lookups [[3]](#footnotes).
- Optional partitioned block index & filters for better cache efficiency [[1]](#footnotes).
- Per-level filter/index block pinning configuration.

Allows using `Zstd` compression via a pure Rust implementation, powered by
[`structured-zstd`](https://github.com/structured-world/structured-zstd) (managed fork of ruzstd).
Requires no C compiler or system libraries — compiles with `cargo build` alone.
Supports both regular zstd (`CompressionType::Zstd`) and dictionary compression
(`CompressionType::ZstdDict`) for improved ratios on small table blocks (4–64 KiB)
and blob files.
### Concurrency & API

**Current limitations:**
- Decompression throughput is ~2–3.5× slower than the C reference implementation
- Thread-safe `BTreeMap`-like API.
- `SequenceNumberGenerator` trait — pluggable seqno source.
- Custom `UserComparator` for non-lexicographic ordering.
- MVCC: snapshot reads at a chosen `SeqNo`.

*Disabled by default.*
### Internals

### zstd-pure
- 100% stable Rust, MSRV 1.92.
- No FFI: zstd via [`structured-zstd`](https://github.com/structured-world/structured-zstd) (pure-Rust), LZ4 via `lz4_flex`, AES via `aes-gcm`.
- Pluggable `Fs` trait — back the engine on the standard filesystem, on `io_uring`, on an in-memory `MemFs`, or on a custom implementation.
- Pluggable `CompressionProvider` for third-party codecs.

Deprecated alias for `zstd`. Enabling `zstd-pure` is equivalent to enabling `zstd`
and will be removed in a future release.
## Limits

*Disabled by default.*
- Keys: up to 65 536 bytes.
- Values: up to 2³² bytes.
- Larger keys and values carry a proportional performance cost.

### bytes
## Feature flags

Uses [`bytes`](https://github.com/tokio-rs/bytes) as the underlying `Slice` type.
All optional, all off by default. The default build is the minimal core (no compression, no encryption, std filesystem). Every flag below is gated because it pulls in extra dependencies or runtime overhead.

*Disabled by default.*
| Flag | Pulls in | Enable when |
|---|---|---|
| `lz4` | [`lz4_flex`](https://github.com/PSeitz/lz4_flex) | Block compression wanted, decompression latency matters more than ratio. |
| `zstd` | [`structured-zstd`](https://github.com/structured-world/structured-zstd) (pure-Rust, no FFI) | Block compression wanted, ratio matters more than absolute decompression speed. Supports `CompressionType::Zstd` and dictionary-mode `CompressionType::ZstdDict`. Decompression is ~2-3.5× slower than C reference. |
| `encryption` | `aes-gcm`, `rand_chacha` | AES-256-GCM block encryption at rest. Keys are caller-managed. |
| `io-uring` (linux only) | [`io-uring`](https://github.com/tokio-rs/io-uring) | I/O-bound workload on a modern Linux kernel — adds an `io_uring` `Fs` backend. |
| `bytes_1` | [`bytes`](https://github.com/tokio-rs/bytes) | Consumer already speaks `bytes::Bytes` (tokio/hyper/tonic stack) and wants zero-copy interop with engine slices. |
| `metrics` | — | Production observability or profiling. Compiles in atomic counters around block I/O, filter probes, compaction, and cache hit rates (`tree.metrics()`). Small but non-zero hot-path cost. |
| `ribbon-serde` | `serde` | Snapshotting the internal `RibbonFilterRepr` for debugging or out-of-band transport. Not used by the on-disk format. |

## Benchmarks

CI runs [`db_bench`](tools/db_bench) on every push to `main` and on pull requests.
Results from `main` are published to the
[benchmark dashboard](https://structured-world.github.io/coordinode-lsm-tree/dev/bench/).
PRs that regress performance by >15% trigger an alert; >25% regression fails CI.
CI runs [`db_bench`](tools/db_bench) on every push to `main` and on pull requests. Results from `main` are published to the [benchmark dashboard](https://structured-world.github.io/coordinode-lsm-tree/dev/bench/). PRs regressing performance by more than 15% trigger an alert; more than 25% fails CI.

Flamegraphs are generated on every merge to `main` using instrumented `db_bench` runs
and published under `flamegraphs/<commit-sha>/flamegraph.svg` on
[gh-pages](https://structured-world.github.io/coordinode-lsm-tree/).
Flamegraphs are generated on every merge to `main` from instrumented `db_bench` runs and published under `flamegraphs/<commit-sha>/flamegraph.svg` on [gh-pages](https://structured-world.github.io/coordinode-lsm-tree/).

To run Criterion microbenchmarks locally:
Local Criterion microbenchmarks:

```bash
cargo bench --features lz4
```

To generate flamegraphs locally (requires the `flamegraph` feature):
Local flamegraphs:

```bash
cd tools/db_bench
cargo run --release --features flamegraph -- \
--benchmark all --num 100000 --flamegraph --skip-calibration
# Folded stacks written to target/flamegraphs/all.folded
# Render with: cargo install inferno && inferno-flamegraph target/flamegraphs/all.folded > flame.svg
# Folded stacks: target/flamegraphs/all.folded
# Render: cargo install inferno && inferno-flamegraph target/flamegraphs/all.folded > flame.svg
```

## Support the Project
## Support the project

<div align="center">

@@ -121,13 +118,17 @@ USDT (TRC-20): `TFDsezHa1cBkoeZT5q2T49Wp66K8t2DmdA`

</div>

## Credits

Originally created by Marvin Blum as part of [fjall-rs/lsm-tree](https://github.com/fjall-rs/lsm-tree); this codebase carries the original copyright (`Copyright (c) 2024-present, fjall-rs`). The vendored Ribbon filter (`src/table/filter/ribbon/`) is by [William Rågstad](https://github.com/WilliamRagstad) — see [`src/table/filter/ribbon/_vendored/`](src/table/filter/ribbon/_vendored/) for the upstream license texts.

## License

All source code is licensed under Apache-2.0.
All source code is licensed under [Apache-2.0](LICENSE-APACHE). Each first-party `.rs` file carries an `SPDX-License-Identifier: Apache-2.0` header alongside the original-author copyright and the maintainer copyright (Structured World Foundation). Contributions are accepted under the same license.

All contributions are to be licensed as Apache-2.0.
The vendored Ribbon filter (`src/table/filter/ribbon/`) keeps its upstream layout — it carries William Rågstad's per-module licensing commentary rather than per-file SPDX headers, plus the original `LICENSE-APACHE` and `LICENSE-MIT` preserved verbatim in `src/table/filter/ribbon/_vendored/`. The upstream crate is dual-licensed (`MIT OR Apache-2.0`); we redistribute the vendored copy only under the Apache-2.0 arm per Apache-2.0 §4.

Originally derived from [fjall-rs/lsm-tree](https://github.com/fjall-rs/lsm-tree). Independently maintained by [Structured World Foundation](https://sw.foundation).
Maintained by [Structured World Foundation](https://sw.foundation).

## Footnotes
