feat(compression): enable dictionary compression in pure Rust backend #229
Conversation
- Implement `compress_with_dict()` in `ZstdPureProvider` using `FrameCompressor::set_dictionary()` from `structured-zstd` v0.0.7, which added dictionary encoder support in structured-world/structured-zstd#25
- Support both finalized zstd dictionaries (magic 0x37A430EC + entropy tables, parsed via `decode_dict`) and raw content dictionaries (bare byte sequences, parsed via `from_raw_content` with xxh3-derived ID), matching the transparent handling in the C FFI backend
- Remove the config-time guard that rejected `ZstdDict` compression policies under the pure backend — the feature is now fully supported
- Update `decompress_with_dict` to apply the same format-detection logic so raw content dicts round-trip correctly end-to-end

Unit tests (src/compression/zstd_pure.rs):
- compress_with_dict_roundtrip_pure_to_pure
- compress_with_dict_produces_zstd_magic
- compress_with_dict_roundtrip_all_levels (levels 1, 3, 9, 19)
- compress_with_dict_empty_dict_returns_error
- compress_with_dict_raw_content_dict_works
- compress_with_dict_empty_plaintext_roundtrips

Integration tests (tests/zstd_dict_roundtrip.rs, zstd_pure_dict mod):
- pure_tree_write_flush_read_zstd_dict (full Tree write→flush→read)
- pure_tree_range_scan_with_zstd_dict
- pure_tree_reopen_with_dict_reads_back_correctly
- pure_zstd_dict_missing_returns_error

Closes #218
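As a rough, self-contained illustration of the finalized-vs-raw handling described above (a sketch, not the crate's code; it assumes the `xxhash-rust` crate for `xxh3_64`), the dispatch boils down to a magic-prefix check plus an xxh3-derived synthetic ID for the raw-content case:

```rust
use xxhash_rust::xxh3::xxh3_64;

/// First four bytes of a finalized zstd dictionary (byte sequence 37 A4 30 EC).
const DICT_MAGIC: [u8; 4] = [0x37, 0xA4, 0x30, 0xEC];

/// The two dictionary shapes both backends accept.
enum DictFormat {
    /// Magic header + entropy tables + content (e.g. produced by `zstd --train`);
    /// parsed with the library's finalized-dictionary decoder (`decode_dict`).
    Finalized,
    /// Bare byte sequence; handed to `from_raw_content` with a synthetic non-zero ID.
    RawContent { synthetic_id: u32 },
}

fn classify_dict(dict_raw: &[u8]) -> DictFormat {
    if dict_raw.starts_with(&DICT_MAGIC) {
        DictFormat::Finalized
    } else {
        // zstd frame headers reserve dict ID 0, hence the clamp to at least 1.
        DictFormat::RawContent {
            synthetic_id: (xxh3_64(dict_raw) as u32).max(1),
        }
    }
}
```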
No actionable comments were generated in the recent review. 🎉
📝 Walkthrough
Replaced the C-FFI zstd backend with a single pure-Rust structured-zstd backend; implemented dictionary compression/decompression (finalized vs raw-content) with TLS caching and a compressor cache; simplified the ZstdDictionary API; removed zstd_ffi; updated Cargo/features, the build script, CI, README, benchmarks, and tests.
Sequence Diagram(s)
sequenceDiagram
participant Caller
participant ZstdPure as ZstdPureProvider
participant TLSCache
participant Compressor as FrameCompressor
participant Decoder as FrameDecoder
rect rgba(200,200,255,0.5)
Caller->>ZstdPure: compress_with_dict(input, dict_raw, level)
ZstdPure->>TLSCache: lookup (xxh3_64(dict_raw), level)
TLSCache-->>ZstdPure: compressor (create if missing)
ZstdPure->>Compressor: compress(input, dict)
Compressor-->>ZstdPure: compressed_frame
ZstdPure->>ZstdPure: if raw-content dict → strip_dict_id(frame)
ZstdPure-->>Caller: compressed_frame
end
rect rgba(200,255,200,0.5)
Caller->>ZstdPure: decompress_with_dict(frame, dict_raw, capacity)
ZstdPure->>ZstdPure: detect dict format (DICT_MAGIC / raw)
alt raw-content path
ZstdPure->>Decoder: init with force_dict + bounded decode
Decoder-->>ZstdPure: decompressed_bytes / size_error
else finalized-dict path
ZstdPure->>Decoder: decode_all_to_vec(finalized)
Decoder-->>ZstdPure: decompressed_bytes
end
ZstdPure-->>Caller: decompressed_bytes / error
end
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ Passed checks (5 passed)
Codecov Report: ❌ Patch coverage is
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/compression/mod.rs`:
- Around line 38-43: The documentation for the compression function is too
strict: update the doc comment that currently says "must be a finalized zstd
dictionary" to state that dict_raw accepts both finalized zstd dictionaries
(magic 0x37A430EC header, entropy tables, content) and bare content dictionaries
as produced/returned by ZstdDictionary::new() / ZstdDictionary::raw(); mention
that both the C FFI backend and the pure Rust backend accept both formats so
callers know either form is valid.
In `@src/compression/zstd_pure.rs`:
- Around line 111-118: The frame encoder currently clamps the xxh3 low-32-bit
hash to 1 locally (let id = { xxh3_64(dict_raw) as u32; h.max(1) }) causing
mismatch with ZstdDictionary::id() and CompressionType::ZstdDict; instead
compute a single normalized dict id from the raw hash in the shared
dictionary-ID path (e.g., produce raw_id = (xxh3_64(dict_raw) as u32) and
normalized_id = raw_id.max(1)) and use normalized_id everywhere metadata,
ZstdDictionary::id(), CompressionType::ZstdDict, and frame encoding read/write
IDs; remove the local h.max(1) clamping in the encoder and apply the same change
for the other occurrence around lines 179-183 so both metadata and frame headers
observe the identical normalized id.
In `@tests/zstd_dict_roundtrip.rs`:
- Around line 270-271: The zstd_pure_dict module is gated with #[cfg(all(feature
= "zstd-pure", not(feature = "zstd")))] and will be skipped by cargo test
--all-features, so add a dedicated CI job that runs tests with the pure backend
enabled (e.g., run cargo test --no-default-features --features zstd-pure or
cargo test --features zstd-pure while ensuring the "zstd" feature is not
enabled) so the zstd_pure_dict module is exercised; update your CI workflow to
include this new job (name it clearly, e.g., "test: zstd-pure") and ensure it
uses the same test matrix/platforms as the other test jobs.
- Around line 278-291: The current make_test_dictionary() only wraps raw sample
bytes with ZstdDictionary::new(), so tests never exercise the
finalized-dictionary path; change or add a test fixture that actually
trains/finalizes a dictionary (use the library's training/finalize API instead
of ZstdDictionary::new()), then use that finalized dictionary in an end-to-end
encode/decode test to exercise Dictionary::decode_dict and
FrameCompressor::set_dictionary_from_bytes; locate make_test_dictionary and
replace or add a make_finalized_test_dictionary that invokes the training
function (or calls the dictionary finalization method) on the same samples and
return the finalized ZstdDictionary for the new test.
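Picking up the last comment above: a fixture along these lines could train a genuinely finalized dictionary. This is only a sketch under assumptions; it leans on the C-backend `zstd` crate's `zstd::dict::from_samples` trainer, and the `ZstdDictionary::new` signature is guessed from the surrounding discussion.

```rust
// Hypothetical fixture (names and signatures assumed, not taken from the crate):
// trains a finalized dictionary (magic bytes 37 A4 30 EC) from the same samples,
// so tests exercise Dictionary::decode_dict instead of the raw-content path.
#[cfg(feature = "zstd")]
fn make_finalized_test_dictionary(samples: &[Vec<u8>]) -> ZstdDictionary {
    let finalized = zstd::dict::from_samples(samples, 16 * 1024)
        .expect("dictionary training should succeed on the test corpus");
    ZstdDictionary::new(finalized)
}
```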
Pull request overview
This PR enables zstd dictionary compression for the zstd-pure backend (structured-zstd), removing the previous config-time rejection and adding unit + integration tests to exercise dictionary-compressed Tree roundtrips under the pure backend.
Changes:
- Implement `compress_with_dict()` for `ZstdPureProvider`, including finalized-vs-raw dictionary format detection.
- Remove the `Config::open()` guard that previously rejected `CompressionType::ZstdDict` under `zstd-pure`.
- Add new unit tests (`src/compression/zstd_pure.rs`) and integration tests (`tests/zstd_dict_roundtrip.rs`) for pure-backend dictionary compression.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| src/compression/zstd_pure.rs | Implements pure-backend dictionary compression and expands unit test coverage. |
| src/config/mod.rs | Removes the pure-backend dictionary-compression rejection during config validation. |
| src/compression/mod.rs | Updates trait-level docs to reflect dictionary support in the pure backend. |
| tests/zstd_dict_roundtrip.rs | Adds integration tests for dict compression via the pure backend through the Tree API. |
| README.md | Updates documentation by removing the old “dict compression unsupported” limitation for zstd-pure. |
…-zero
- Move .max(1) into ZstdDictionary::id() so config validation, stored metadata, and zstd frame headers all observe the same dict id (id=0 is invalid in the zstd frame format; edge case when xxh3 lower 32 bits are zero)
- Remove redundant .max(1) from decompress_with_dict raw-content path
- Update compress_with_dict trait doc to reflect that both finalized and raw-content dictionaries are accepted by both backends
- Add dedicated test-zstd-pure CI job to exercise the pure backend independently (--all-features enables C zstd and skips the not(feature = "zstd") gate)
- Add pure_finalized_dict_roundtrip integration test to cover the Dictionary::decode_dict path in the pure Rust backend
- Correct make_test_dictionary() doc: builds a raw-content dict, not a finalized-dictionary fixture
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.github/workflows/coordinode-ci.yml:
- Around line 62-80: Add a new step in the existing test-zstd-pure job to run
clippy under the pure-only feature combo: insert a step before the "Run tests"
step that runs `cargo clippy --no-default-features --features zstd-pure,lz4
--all-targets -- -D warnings`; this ensures code behind #[cfg(all(feature =
"zstd-pure", not(feature = "zstd")))] (e.g. src/compression/zstd_pure.rs and
tests/zstd_dict_roundtrip.rs) is linted and won't introduce warnings.
In `@src/compression/mod.rs`:
- Around line 158-176: The current normalization in ZstdDictionary::id() forces
id=0 to 1 and breaks read compatibility: revert or limit that normalization so
existing stored metadata with dict_id == 0 can still be recognized;
specifically, keep ZstdDictionary::id64() unchanged, change ZstdDictionary::id()
to return the raw lower 32 bits (including 0) for read/validation paths and
perform the clamp-to-1 only in write/encoding code paths that emit new frame
headers (i.e., where CompressionType::ZstdDict is serialized), or alternatively
make validation logic that compares dictionaries accept both raw 0 and
normalized 1 as equivalent for the same dictionary; update the code paths that
currently call ZstdDictionary::id() for config validation and block/table
open-time checks to use the raw value or the tolerant comparison, and add a
regression test that writes metadata with dict_id == 0 and verifies it can be
read back without triggering ZstdDictMismatch.
- Add TLS FrameCompressor cache to compress_with_dict in pure Rust backend, matching the existing FrameDecoder TLS cache in decompress_with_dict
- Cache key: (xxh3_64(dict_raw), level); dictionary is parsed at most once per thread per distinct (dict, level) pair
- Use owned Cursor<Vec<u8>> source and Vec<u8> drain for 'static TLS compat
- Revert ZstdDictionary::id() to return raw lower-32-bits of xxh3 (no clamping); backends that embed dict_id in the zstd frame header clamp to .max(1) themselves to preserve backward compat with on-disk dict_id=0 (theoretical)
- Update id() doc to clarify that clamping is the backend's responsibility
- Restore dict.id().max(1) in decompress_with_dict raw-content path
- Add cargo clippy step for pure backend in test-zstd-pure CI job
- Update benchmark.yml actions/checkout from v4 to v6
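A minimal sketch of the per-thread memoizer this commit describes; the cached entry is simplified to its key here (the real entry also owns the `FrameCompressor` with its `Cursor<Vec<u8>>` source and `Vec<u8>` drain), so treat the shape as illustrative only:

```rust
use std::cell::RefCell;

// Simplified stand-in: the real entry holds the parsed dictionary and a
// FrameCompressor built for this (dict, level) pair.
struct CachedCompressor {
    dict_hash: u64,
    level: i32,
}

thread_local! {
    // Single entry per thread: the dictionary is re-parsed only when the
    // (xxh3_64(dict_raw), level) key changes on this thread.
    static TLS_COMPRESSOR: RefCell<Option<CachedCompressor>> = RefCell::new(None);
}

fn with_cached_compressor<R>(dict_hash: u64, level: i32, f: impl FnOnce(&mut CachedCompressor) -> R) -> R {
    TLS_COMPRESSOR.with(|cell| {
        let mut slot = cell.borrow_mut();
        let reuse = matches!(&*slot, Some(c) if c.dict_hash == dict_hash && c.level == level);
        if !reuse {
            // Parse the dictionary and rebuild the compressor here (omitted).
            *slot = Some(CachedCompressor { dict_hash, level });
        }
        f(slot.as_mut().expect("slot populated above"))
    })
}
```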
Pull request overview
Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.
Comments suppressed due to low confidence (1)
src/compression/mod.rs:178
- PR description says `ZstdDictionary::id()` is normalized to always return ≥1, but the current implementation returns the raw lower 32 bits of xxh3 and the docs explicitly note it may be 0. Since id=0 is reserved/invalid in zstd frame headers, leaving `id()` able to return 0 means the on-disk `CompressionType::ZstdDict { dict_id }` can be 0 while the pure backend clamps the frame header dict ID to ≥1 for raw-content dictionaries. Please either implement the promised normalization (and adjust docs/tests accordingly) or update the PR description to match the intended semantics.
/// Returns a 32-bit dictionary fingerprint (lower 32 bits of xxh3).
///
/// Intended for config validation (matching a `CompressionType::ZstdDict`
/// `dict_id` against the supplied `ZstdDictionary`) and external interop.
///
/// The value is the raw lower 32 bits of xxh3 and may theoretically be `0`
/// (probability ≈ 1/2³²). Backends that embed a dict ID in the zstd frame
/// header (where id=0 is reserved) are responsible for clamping to at
/// least 1 themselves. Config validation is unaffected: both sides derive
/// the ID from the same bytes and therefore agree even in the zero case.
///
/// For internal cache keying use [`id64`](ZstdDictionary::id64) to avoid
/// hash collisions.
#[must_use]
#[expect(
clippy::cast_possible_truncation,
reason = "intentional: public API returns 32-bit fingerprint"
)]
pub fn id(&self) -> u32 {
self.id as u32
}
- Fix module-level doc: remove reference to set_dictionary_from_bytes (implementation uses set_dictionary with a parsed Dictionary)
- Reuse the exhausted Cursor<Vec<u8>> source buffer across compress() calls via take_source().map_or_else(...) to avoid a per-block O(data.len()) allocation after the first call on a given thread; drain ownership is transferred to the caller so its capacity cannot be recovered without an extra copy
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/compression/zstd_pure.rs`:
- Line 130: Extract the duplicated DICT_MAGIC array into a single module-level
constant and use it wherever the inline definition appears: add a top-level
const DICT_MAGIC: [u8; 4] = [0x37, 0xA4, 0x30, 0xEC] (with doc comment) near the
file imports, then remove the inline definitions in the functions (e.g., the
current local DICT_MAGIC in decompress_with_dict) and replace their uses with
the new module-level DICT_MAGIC constant so the magic value is defined once.
@coderabbitai full review
✅ Actions performed: Full review triggered.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/compression/zstd_pure.rs`:
- Around line 146-159: The test suite is missing a cross-backend regression test
for raw-content dictionaries: add a test that creates a raw-content dictionary
(use Dictionary::from_raw_content with the synthetic id logic based on dict_key
and DICT_MAGIC/Dictionary::decode_dict semantics), compresses data using the
pure backend, then attempts to decompress the resulting frame using the FFI
backend to assert successful round-trip; ensure the test exercises the path
where dict_raw does not start with DICT_MAGIC so the synthetic id branch (the id
computed from dict_key) is used, and verify decompression succeeds despite the
differing dictID treatment between backends.
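A hedged sketch of the requested cross-backend test; the provider names (`ZstdPureProvider`, plus an FFI counterpart called `ZstdFfiProvider` here) and the exact `compress_with_dict` / `decompress_with_dict` signatures are assumed from the discussion above rather than copied from the crate:

```rust
#[cfg(all(test, feature = "zstd", feature = "zstd-pure"))]
mod cross_backend_sketch {
    use super::*;

    #[test]
    fn pure_compress_ffi_decompress_raw_content_dict_roundtrip() {
        // Raw-content dictionary: no DICT_MAGIC prefix, so the synthetic-id branch is taken.
        let dict_raw: Vec<u8> = b"shared raw-content dictionary sample bytes".repeat(16);
        let data: Vec<u8> = b"payload that benefits from the shared dictionary ".repeat(32);

        let frame = ZstdPureProvider
            .compress_with_dict(&data, &dict_raw, 3)
            .expect("pure-backend compression");
        let out = ZstdFfiProvider
            .decompress_with_dict(&frame, &dict_raw, data.len())
            .expect("FFI-backend decompression despite differing dictID treatment");
        assert_eq!(data, out);
    }
}
```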
The pure Rust backend (structured-zstd) rejects dictID=0 for
Dictionary::from_raw_content, so it internally assigns a synthetic
non-zero ID derived from xxh3_64(dict_raw). The C zstd library always
records dictID=0 (absent) for raw-content dicts, causing two failures:
pure → FFI: pure embeds synthetic dictID in frame header;
libzstd ZSTD_DDict has id=0; decompressor checks
dctx->dictID(0) != fParams.dictID(synthetic) → "Dictionary mismatch".
FFI → pure: FFI records dictID=0 in frame; structured-zstd
treats dictID=0 as "no dict required" and skips dict lookup,
then fails decompression with NotEnoughBytesInDictionary.
Fix the pure backend in both directions:
compress_with_dict (raw-content): after compression, strip the
Dict_ID field from the zstd frame header via strip_dict_id().
The synthetic ID is an internal detail — removing it aligns the
output with the C convention (dictID absent/0), so the C FFI
decompressor skips the ID check and accepts the frame.
decompress_with_dict (raw-content): replace decode_all_to_vec
(which calls init() internally and skips dict loading when
dictID=0) with the manual init() → force_dict() → decode_blocks()
→ collect() flow. force_dict() loads the raw-content dict
regardless of the frame's dictID field, handling all three cases:
- Frame from C FFI backend (dictID absent): force_dict loads dict.
- Frame from new pure backend (dictID stripped): same.
- Frame from old pure backend (dictID=synthetic): force_dict
reloads same dict (idempotent).
Add cross-backend interoperability tests gated on
#[cfg(all(test, feature = "zstd", feature = "zstd-pure"))]:
pure_compress_ffi_decompress_raw_content_dict_roundtrip
ffi_compress_pure_decompress_raw_content_dict_roundtrip
mod zstd_pure is now compiled under #[cfg(feature = "zstd-pure")]
(previously not(feature = "zstd")) so it is visible to the
cross-backend test module when --all-features is used.
Closes #218
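Roughly, the raw-content decode flow this commit describes looks like the sketch below; the `FrameDecoder` method names (`init`, `force_dict`, `decode_blocks`, `collect`) come from the commit message itself, but their signatures, argument shapes, and error types are assumptions:

```rust
// Illustrative only: structured-zstd types and call signatures are assumed.
fn decode_raw_content(frame: &[u8], dict: &Dictionary, capacity: usize) -> Result<Vec<u8>, String> {
    let mut src = frame;
    let mut decoder = FrameDecoder::new();
    // Parse the frame header; for raw-content dicts the dictID field may be absent or 0.
    decoder.init(&mut src).map_err(|e| e.to_string())?;
    // Load the raw-content dict regardless of what the header's dictID says.
    decoder.force_dict(dict).map_err(|e| e.to_string())?;
    decoder
        .decode_blocks(&mut src, BlockDecodingStrategy::All)
        .map_err(|e| e.to_string())?;
    let out = decoder.collect().unwrap_or_default();
    // Post-decode size check as a fallback bound on the output.
    if out.len() > capacity {
        return Err(format!("decoded {} bytes exceeds capacity limit {capacity}", out.len()));
    }
    Ok(out)
}
```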
⚠️ Performance Alert ⚠️
Possible performance regression was detected for benchmark 'lsm-tree db_bench'.
Benchmark result of this commit is worse than the previous benchmark result exceeding threshold 1.15.
| Benchmark suite | Current: ffd9512 | Previous: ab61d33 | Ratio |
|---|---|---|---|
| mergerandom | 261870.7610778383 ops/sec (normalized) | 380260.6261186645 ops/sec (normalized) | 1.45 |
This comment was automatically generated by workflow using github-action-benchmark.
CC: @polaz
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/compression/zstd_pure.rs`:
- Around line 151-174: The TLS variables (e.g., TLS_COMPRESSOR holding
Option<(u64, i32, CachedCompressor)>) implement a single-entry memoizer contrary
to the comment; replace the Option with a small keyed cache (HashMap or tiny
LRU) keyed by (dict_hash: u64, level: i32) so a thread can store multiple
CachedCompressor entries and avoid re-parsing when switching
dictionaries/levels; update the thread_local! RefCell<Option<...>> declarations
to RefCell<HashMap<(u64, i32), CachedCompressor>> or a bounded LRU map and apply
the same change to the analogous TLS cells referenced in the diff (the decoder
TLS instances), ensuring lookups insert/evict correctly and keys use
dict.id64()/level as described in the comments.
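A sketch of the keyed per-thread cache this comment asks for; the entry type is a stand-in (the real one would own the parsed compressor):

```rust
use std::cell::RefCell;
use std::collections::HashMap;

struct CachedCompressor; // stand-in; the real entry owns a parsed FrameCompressor

thread_local! {
    // Keyed per-thread cache: one entry per (dict.id64(), level), so switching
    // dictionaries or levels on a thread no longer discards the previous compressor.
    static TLS_COMPRESSORS: RefCell<HashMap<(u64, i32), CachedCompressor>> =
        RefCell::new(HashMap::new());
}

fn with_compressor<R>(key: (u64, i32), f: impl FnOnce(&mut CachedCompressor) -> R) -> R {
    TLS_COMPRESSORS.with(|cell| {
        let mut map = cell.borrow_mut();
        let entry = map.entry(key).or_insert_with(|| {
            // Parse the dictionary and build the compressor once per key (omitted).
            CachedCompressor
        });
        f(entry)
    })
}
```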
Actionable comments posted: 3
♻️ Duplicate comments (1)
src/compression/zstd_pure.rs (1)
344-345: ⚠️ Potential issue | 🟡 Minor: Keep the finalized-dictionary magic wording consistent.
These comments still spell the header as `0x37A430EC`, which reintroduces the endian confusion you already fixed for `DICT_MAGIC`. Reword them to reference the byte sequence (37 A4 30 EC) or the corrected little-endian value instead.
Also applies to: 499-503
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/compression/zstd_pure.rs`:
- Around lines 344-345: Update the finalized-dictionary comment wording so it does not use the big-endian hex literal 0x37A430EC (which causes endian confusion); instead reference the byte sequence "37 A4 30 EC" or the corrected little-endian numeric value wherever the header is described (e.g., the comments near DICT_MAGIC and the finalized-dictionary descriptions in src/compression/zstd_pure.rs, including the other occurrence around the block that spans lines ~499-503). Ensure both places use the same phrasing ("byte sequence 37 A4 30 EC" or the corrected little-endian value) so the comment is unambiguous across platforms and matches the DICT_MAGIC constant.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.github/workflows/coordinode-ci.yml:
- Around line 122-124: The coverage comment on the workflow step that runs
`cargo +nightly llvm-cov --no-report nextest --no-default-features --features
zstd,lz4` is misleading because `--all-features` now includes the `zstd` backend
(since `zstd-pure` is an alias), so update the comment to state that this extra
run validates a narrower feature combination (zstd and lz4 without default
features) rather than exercising dictionary-compression paths missed by
`--all-features`; keep the note that the extra run is intentional for validating
that specific feature set and remove the incorrect claim about exercising
dictionary paths.
In `@src/compression/mod.rs`:
- Around line 28-34: Update the doc comment for the compression function to
avoid implying ZstdDictionary::raw() only yields a finalized dictionary:
rephrase the sentence under "Compress `data` using a zstd dictionary." so it
states that `dict_raw` may be either a finalized zstd dictionary (magic
0x37A430EC, entropy tables, content — produced by `zstd --train`) or raw content
bytes, and that callers may pass bytes produced by `zstd --train` or bytes
obtained from `ZstdDictionary::raw()`; make clear the zstd backend accepts
either representation. Target the doc block surrounding the function that begins
with "Compress `data` using a zstd dictionary." and adjust the wording
accordingly.
In `@src/compression/zstd_pure.rs`:
- Around line 179-183: Replace the misleading "// SAFETY:" comment above the
slice access with "// Invariant:" because this is documenting a safe
slice-bounds invariant (not reasoning for an unsafe block); update the comment
that precedes the call to output.get_mut(prev_len..) / unwrap_or_else(...) so it
reads "// Invariant: ..." and explains that output was resized to new_len and
prev_len.. is a valid slice (referencing variables output, prev_len, new_len,
and dest to locate the code).
---
Duplicate comments:
In `@src/compression/zstd_pure.rs`:
- Around line 344-345: Update the finalized-dictionary comment wording so it
does not use the big-endian hex literal 0x37A430EC (which causes endian
confusion); instead reference the byte sequence "37 A4 30 EC" or the corrected
little-endian numeric value wherever the header is described (e.g., the comments
near DICT_MAGIC and the finalized-dictionary descriptions in
src/compression/zstd_pure.rs, including the other occurrence around the block
that spans lines ~499-503). Ensure both places use the same phrasing ("byte
sequence 37 A4 30 EC" or the corrected little-endian value) so the comment is
unambiguous across platforms and matches the DICT_MAGIC constant.
…issues
- Standardize '0x37A430EC' → bytes '37 A4 30 EC' in all comments (mod.rs compress_with_dict, zstd_pure.rs compress/decompress paths)
- mod.rs compress_with_dict: clarify ZstdDictionary::raw() is an accessor, not a dict producer; mention it's accessible for persistence/interop
- zstd_pure.rs decode_raw_content_bounded: rename '// SAFETY:' → '// Invariant:' (not an unsafe block, documents safe slice-bounds invariant)
- zstd_dict_roundtrip.rs: fix assertion message 'L0 SSTs' → 'tables' (table_count() returns total across all levels, not L0-only)
- coordinode-ci.yml: correct coverage comment — zstd-pure is now an alias for zstd, so the extra run validates a narrower feature set, not dict paths missed by --all-features
@coderabbitai re-review
✅ Actions performed: Review triggered.

@coderabbitai re-review
✅ Actions performed: Review triggered.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tests/zstd_dict_roundtrip.rs`:
- Around line 264-294: Add an explicit postcondition after calling
tree.major_compact(...) to assert that compaction actually reduced L0 tables (or
total SST count): query the tree for its table/level state (e.g., use whatever
API your tree exposes such as a method that returns number of tables or tables
per level, e.g., tree.num_tables_in_level(0) or
tree.table_count()/tree.levels()) and assert that L0 is zero or the total table
count decreased compared to pre-compaction; place this check immediately after
the major_compact call (before the read/scan assertions) and fail the test if
compaction did not occur.
…t test
Without this check, if major_compact() ever regresses to a no-op that still returns Ok(()), the test would pass against the original L0 tables without exercising the dict compress/decompress compaction hot path.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/compression/mod.rs`:
- Around line 110-111: The doc comment for the handle is incorrect: `new()`
stores the full 64-bit xxh3 fingerprint in `self.id` (see `id64()`), while
`id()` truncates to the lower 32 bits; update the comment to state that the
handle stores the full 64‑bit xxh3 hash of `raw` and that `id()` returns the
lower 32 bits on demand (referencing `ZstdDictionary::id` and `id64()` for
clarity).
In `@tests/zstd_dict_roundtrip.rs`:
- Around line 269-272: The test currently masks failures by calling
tree.level_table_count(0).unwrap_or(0); instead, call level_table_count(0) and
assert on its success and result explicitly: retrieve the Result via
.expect(...) or .unwrap() with a clear message (e.g., on the call to
level_table_count), bind it to a variable (e.g., count) and then assert_eq!(0,
count, "L0 must be empty after major_compact — compaction may not have run");
this ensures failures to fetch the L0 count fail the test rather than being
coerced to zero.
- Fix test postcondition: assert_eq!(Some(0), level_table_count(0)) instead of unwrap_or(0) — properly distinguishes "L0 exists and is empty" from unexpected None (Copilot #54, CodeRabbit #56)
- Fix ZstdDictionary::new() doc: handle stores full 64-bit xxh3 hash internally; id() returns lower 32 bits on demand (CodeRabbit #55)
… in decode_raw_content_bounded
## 🤖 New release
* `coordinode-lsm-tree`: 4.3.1 -> 4.4.0

Changelog:
## [4.4.0](v4.3.1...v4.4.0) - 2026-04-09
### Added
- *(compression)* enable dictionary compression in pure Rust backend ([#229](#229))
### Performance
- *(compression)* cache pre-compiled Dictionary across block decompress calls ([#227](#227))

---
This PR was generated with [release-plz](https://github.com/release-plz/release-plz/).
Co-authored-by: sw-release-bot[bot] <255865126+sw-release-bot[bot]@users.noreply.github.com>
Summary
Technical Details
Format detection: `compress_with_dict` and `decompress_with_dict` check for the zstd dictionary magic prefix (bytes `37 A4 30 EC`, little-endian `0xEC30A437`):
Raw-content dict ID stripping: `compress_with_dict` strips the synthetic dictID from the frame header after compression. This matches the zstd standard convention where `dictID=0` (absent) means "raw-content dict, id unknown, accept any", preventing decompressors from requiring the specific synthetic id.
Decompression-bomb guard: The raw-content dict decompress path calls `decoder.content_size()` after `init()` and rejects frames whose declared size exceeds the caller's capacity limit before allocating the output buffer. Frames without the FCS field fall back to the post-decode check.
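A self-contained sketch of that guard, assuming the decoder reports the declared content size as an `Option` that is `None` when the FCS field is absent (an assumption for illustration, not the crate's actual signature):

```rust
/// Illustrative bound check (not the crate's actual code): reject frames whose
/// declared decompressed size exceeds the caller's capacity before allocating.
fn check_declared_size(declared: Option<u64>, capacity: usize) -> Result<(), String> {
    match declared {
        Some(size) if size > capacity as u64 => Err(format!(
            "declared decompressed size {size} exceeds capacity limit {capacity}"
        )),
        // No FCS field: fall back to the post-decode length check.
        _ => Ok(()),
    }
}
```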
`ZstdDictionary::id()`: Returns the raw lower 32 bits of xxh3 (may theoretically be 0). Config validation paths compare `dict.id()` against on-disk `dict_id` — both sides derive the same value, so validation is unaffected. The `.max(1)` clamp is only applied inside the backend when embedding an id in a zstd frame header.
Blocker resolved: structured-world/structured-zstd#25 (FastCOVER + dictionary finalization) merged; `structured-zstd` bumped to v0.0.11 (encoding performance improvements: row-based match finder, HC positions rebase, streaming scratch buffer reuse, FSE decoder packing, and HC table improvements).
Known Limitations
Test Plan
Closes #218
Summary by CodeRabbit
New Features
Documentation
Tests
Chores