feat: [Do not merge] blob info checkpoint creation#3460
Draft
shuowang12 wants to merge 8 commits into
Warning: This PR modifies one of the example config files. Please consider the
Add a versioned, deterministic serialization format for the blob info tables that are identical across nodes after GC phase 1 at the epoch boundary: per_object_blob_info, per_object_pooled_blob_info, and storage_pool_info.

The aggregate_blob_info table is excluded by design: it contains node-local state (is_metadata_stored) and entries whose deletion timing depends on background GC phase 2, so it is not deterministic across nodes and is instead reconstructed during recovery.

The format is self-delimiting (single-pass write and read), carries the epoch and exact event-stream cursor in its header, reserves chunking fields for snapshots exceeding the maximum blob size, and ends with an xxhash64 checksum. A golden-byte test pins the v1 serialization so that any byte-level change (which is consensus-critical, since all nodes must produce bit-identical snapshots) fails CI until the format version is bumped.

This is the first step towards storage node recovery for storage pools (WAL-1185): pool membership cannot be recovered through event replay, so recovering nodes will bootstrap from a quorum-certified snapshot blob and replay events forward.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
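The layout described in this commit message can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the PR's actual wire format: the field widths, the little-endian encoding, the one-byte entry/end tags, and the names `write_snapshot` and `read_snapshot` are hypothetical. FNV-1a 64 stands in for xxhash64 only because the standard library has no xxhash64.

```rust
const FORMAT_VERSION: u32 = 1;

// Stand-in checksum (assumption): the PR uses xxhash64, which is not in std.
fn fnv1a64(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf29ce484222325;
    for &b in bytes {
        hash ^= b as u64;
        hash = hash.wrapping_mul(0x100000001b3);
    }
    hash
}

/// Serializes entries in the order given; callers must pass them in a
/// deterministic (for example, key-sorted) order to get identical bytes.
pub fn write_snapshot(epoch: u64, cursor: u64, entries: &[(Vec<u8>, Vec<u8>)]) -> Vec<u8> {
    let mut out = Vec::new();
    out.extend_from_slice(&FORMAT_VERSION.to_le_bytes());
    out.extend_from_slice(&epoch.to_le_bytes());
    out.extend_from_slice(&cursor.to_le_bytes());
    out.extend_from_slice(&0u32.to_le_bytes()); // reserved: chunk index
    out.extend_from_slice(&1u32.to_le_bytes()); // reserved: chunk count
    for (key, value) in entries {
        out.push(1); // tag: an entry follows
        for part in [key, value] {
            out.extend_from_slice(&(part.len() as u32).to_le_bytes());
            out.extend_from_slice(part);
        }
    }
    out.push(0); // tag: end of entries, which makes the stream self-delimiting
    let checksum = fnv1a64(&out);
    out.extend_from_slice(&checksum.to_le_bytes());
    out
}

#[derive(Debug, PartialEq)]
pub struct Snapshot {
    pub epoch: u64,
    pub cursor: u64,
    pub entries: Vec<(Vec<u8>, Vec<u8>)>,
}

// Splits `n` bytes off the front of `buf`, or returns None if too short.
fn take<'a>(buf: &mut &'a [u8], n: usize) -> Option<&'a [u8]> {
    let slice: &'a [u8] = *buf;
    if slice.len() < n {
        return None;
    }
    let (head, tail) = slice.split_at(n);
    *buf = tail;
    Some(head)
}

fn take_u32(buf: &mut &[u8]) -> Option<u32> {
    let mut raw = [0u8; 4];
    raw.copy_from_slice(take(buf, 4)?);
    Some(u32::from_le_bytes(raw))
}

fn take_u64(buf: &mut &[u8]) -> Option<u64> {
    let mut raw = [0u8; 8];
    raw.copy_from_slice(take(buf, 8)?);
    Some(u64::from_le_bytes(raw))
}

/// Single-pass read; returns None on a version, framing, or checksum error.
pub fn read_snapshot(bytes: &[u8]) -> Option<Snapshot> {
    let mut buf = bytes;
    if take_u32(&mut buf)? != FORMAT_VERSION {
        return None;
    }
    let epoch = take_u64(&mut buf)?;
    let cursor = take_u64(&mut buf)?;
    let _chunk_index = take_u32(&mut buf)?;
    let _chunk_count = take_u32(&mut buf)?;
    let mut entries = Vec::new();
    loop {
        match take(&mut buf, 1)?[0] {
            0 => break,
            1 => {
                let key_len = take_u32(&mut buf)? as usize;
                let key = take(&mut buf, key_len)?.to_vec();
                let value_len = take_u32(&mut buf)? as usize;
                let value = take(&mut buf, value_len)?.to_vec();
                entries.push((key, value));
            }
            _ => return None,
        }
    }
    let body_len = bytes.len() - buf.len();
    let checksum = take_u64(&mut buf)?;
    if !buf.is_empty() || checksum != fnv1a64(&bytes[..body_len]) {
        return None;
    }
    Some(Snapshot { epoch, cursor, entries })
}
```

A golden-byte test in this style would compare the output of `write_snapshot` for a fixed input against a byte array committed to the repository, so that any change to the encoding fails until `FORMAT_VERSION` is bumped.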
Add `walrus-node db-tool bench-blob-info-snapshot`, which serializes the per_object_blob_info, per_object_pooled_blob_info, and storage_pool_info column families of an existing node database into a snapshot file and reports:

- entry counts
- snapshot size and bytes per entry
- serialize + write + sync duration
- read + deserialize duration
- bulk-load duration into a scratch database (using the same key/value encodings as a real node)
- zstd compression ratios at configurable levels

The database is opened read-only with only the three relevant column families, so the benchmark can run against a stopped node's database or a copy of a running node's database.

This provides production-scale measurements for the snapshot design (size, serialization cost, compression benefit) before any node deployment.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The xxhash64 trailer is a deterministic fingerprint of the snapshotted table contents, so printing it lets operators compare snapshots taken at the same epoch boundary across nodes with a single line of output.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
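The cross-node comparison this commit enables could be scripted roughly as follows. This is a sketch under assumptions: the digests are taken to have already been collected from each node's tool output as `u64` values for the same epoch, and `group_by_digest` and `all_match` are hypothetical helper names, not part of the db-tool.

```rust
use std::collections::BTreeMap;

/// Groups node names by the digest they reported for one epoch.
pub fn group_by_digest<'a>(reports: &[(&'a str, u64)]) -> BTreeMap<u64, Vec<&'a str>> {
    let mut groups: BTreeMap<u64, Vec<&'a str>> = BTreeMap::new();
    for &(node, digest) in reports {
        groups.entry(digest).or_default().push(node);
    }
    groups
}

/// True when every node reported the same digest (or no reports were given).
pub fn all_match(reports: &[(&str, u64)]) -> bool {
    group_by_digest(reports).len() <= 1
}
```

When `all_match` returns false, the groups show which nodes agree with each other, which narrows down where the snapshots diverged.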
When enabled (default off), the node creates a RocksDB checkpoint of its database at the epoch boundary, directly after GC phase 1 has settled the blob info tables and before any further events are processed, and removes older checkpoints so that at most one exists at a time. Startup finishes any cleanup that a crash interrupted, so a stale checkpoint cannot pin deleted SST files' disk space for a full epoch.

The node does not serialize anything itself: operators verify snapshot determinism offline by running `walrus-node db-tool bench-blob-info-snapshot --db-path <checkpoint>` on each node and comparing the reported digests for the same epoch.

Checkpoint creation duration is reported as a metric; failures increment an error counter and never fail epoch processing.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
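The "at most one checkpoint" retention rule can be sketched with the standard library alone. This is an illustration under assumptions, not the node's code: it assumes checkpoints live as `checkpoint_epoch_<N>` directories under a common root and that the newest epoch is the one to keep; `prune_old_checkpoints` is a hypothetical name. Calling the same function at startup would finish a cleanup that a crash interrupted.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

const PREFIX: &str = "checkpoint_epoch_";

/// Deletes all `checkpoint_epoch_<N>` directories under `root` except the one
/// with the highest epoch. Returns the removed epochs in ascending order.
pub fn prune_old_checkpoints(root: &Path) -> io::Result<Vec<u64>> {
    let mut found: Vec<(u64, PathBuf)> = Vec::new();
    for entry in fs::read_dir(root)? {
        let entry = entry?;
        if !entry.file_type()?.is_dir() {
            continue;
        }
        let name = entry.file_name();
        // Ignore anything that does not follow the assumed naming scheme.
        let epoch = name
            .to_str()
            .and_then(|n| n.strip_prefix(PREFIX))
            .and_then(|digits| digits.parse::<u64>().ok());
        if let Some(epoch) = epoch {
            found.push((epoch, entry.path()));
        }
    }
    found.sort_by_key(|(epoch, _)| *epoch);
    found.pop(); // keep the newest checkpoint, if any

    let mut removed = Vec::new();
    for (epoch, path) in found {
        fs::remove_dir_all(&path)?;
        removed.push(epoch);
    }
    Ok(removed)
}
```

Because the function is idempotent, running it again after a partial failure removes whatever was left behind without touching the newest checkpoint.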
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…tency check

The checkpoint takes seconds (memtable flush plus hard links) while the consistency check's background scan reads the whole table for minutes or longer. Running the checkpoint first keeps the scan's disk traffic from stretching the inline checkpoint duration; determinism is unaffected since both capture their state while event processing is blocked.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Parse the epoch out of a checkpoint_epoch_<N> input directory and embed it in the snapshot header, so digests only match across snapshots taken at the same epoch boundary. The event cursor deliberately stays at its default: the cursor stored in the database is not deterministic at the checkpoint instant because event completion is marked by a background task.

Also compute the event-reprocessing guard before execute_epoch_change spawns the finisher task, which marks the event complete in the background and could otherwise misclassify normal processing as reprocessing, skipping the checkpoint and consistency check.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
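Parsing the epoch from the directory name might look like the sketch below. The function name `parse_checkpoint_epoch` and the `u64` epoch type are assumptions for illustration; only the `checkpoint_epoch_<N>` naming comes from the commit message.

```rust
use std::path::Path;

/// Extracts `<N>` from a final path component named `checkpoint_epoch_<N>`.
/// Returns None for any other name, so the caller can fall back to a default.
pub fn parse_checkpoint_epoch(path: &Path) -> Option<u64> {
    let name = path.file_name()?.to_str()?;
    let digits = name.strip_prefix("checkpoint_epoch_")?;
    // Reject empty suffixes and signs such as "+7", which `str::parse::<u64>`
    // would otherwise accept.
    if digits.is_empty() || !digits.bytes().all(|b| b.is_ascii_digit()) {
        return None;
    }
    digits.parse().ok()
}
```

Returning `None` rather than an error keeps the tool usable on databases that are not checkpoint directories, at the cost of a header epoch that no longer distinguishes epoch boundaries.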
d05dd43 to ba872e5
Description
#deploying a custom branch to walrus-private-testnet
This draft PR is for deployment to the PTN storage nodes as part of the blob info snapshot. The branch is main + one isolated, config-gated feature: when enabled, a node creates a RocksDB checkpoint of its database at the epoch boundary (post-GC, the deterministic point) and deletes the previous one. I'll then run an offline db-tool against each node's checkpoint to verify the serialized blob info snapshots are byte-identical across nodes.
Rollout plan, in stages:
Expected impact:
Test plan
How did you test the new or updated feature?
Release notes
Check each box that your changes affect. If none of the boxes relate to your changes, release notes aren't required.
For each box you select, include information after the relevant heading that describes the impact of your changes that
a user might notice and any actions they must take to implement updates. (Add release notes after the colon for each item)