feat: [Do not merge] blob info checkpoint creation#3460

Draft
shuowang12 wants to merge 8 commits into main from shuo/blob-info-checkpoint-minimal
shuowang12 commented Jun 11, 2026

Collaborator

Description

#deploying a custom branch to walrus-private-testnet

This draft PR exists to deploy the branch to the private testnet (PTN) storage nodes as part of the blob info snapshot work. The branch is main plus one isolated, config-gated feature: when enabled, a node creates a RocksDB checkpoint of its database at the epoch boundary (post-GC, the deterministic point) and deletes the previous one. I'll then run an offline db-tool against each node's checkpoint to verify that the serialized blob info snapshots are byte-identical across nodes.

Rollout plan, in stages:

  1. Binary to all nodes via the standard PTN deploy workflow (no wipe). The feature is off by default, so this step changes no behavior.
  2. Enable on 1–2 nodes first via the PTN config-update workflow (blob_info_snapshot.enabled: true with a host limit), and watch them across an epoch boundary or two.
  3. Enable fleet-wide once the canary nodes look clean.

Expected impact:

  • No changes to event processing, blob lifecycle, or APIs. Checkpoint errors are log-and-count only and cannot fail epoch processing.
  • When enabled: event processing pauses for the duration of the checkpoint (expected to be seconds) once per 2-hour epoch, and each node retains one checkpoint (hard links, so the incremental disk usage is small).
  • Verification runs are read-only against the checkpoint directory, never the live DB.

Test plan

How did you test the new or updated feature?


Release notes

Check each box that your changes affect. If none of the boxes relate to your changes, release notes aren't required.
For each box you select, include information after the relevant heading that describes the impact of your changes that
a user might notice and any actions they must take to implement updates. (Add release notes after the colon for each item)

  • Storage node:
  • Aggregator:
  • Publisher:
  • CLI:

@github-actions

Contributor

Warning: This PR modifies one of the example config files. Please consider the
following:

  • Make sure the changes are backwards compatible with the current configuration.
  • Make sure any added parameters follow the conventions of the existing parameters; in
    particular, durations should take seconds or milliseconds using the naming convention
    _secs or _millis, respectively.
  • If there are added optional parameter sections, it should be possible to specify them
    partially. A useful pattern there is to implement Default for the struct and derive
    #[serde(default)] on it, see BlobRecoveryConfig as an example.
  • You may need to update the documentation to reflect the changes.

@shuowang12 shuowang12 changed the title [Do not merge] feat:blob info checkpoint creation feat: [Do not merge] blob info checkpoint creation Jun 11, 2026
shuowang12 and others added 8 commits June 11, 2026 20:39
Add a versioned, deterministic serialization format for the blob info
tables that are identical across nodes after GC phase 1 at the epoch
boundary: per_object_blob_info, per_object_pooled_blob_info, and
storage_pool_info. The aggregate_blob_info table is excluded by design:
it contains node-local state (is_metadata_stored) and entries whose
deletion timing depends on background GC phase 2, so it is not
deterministic across nodes and is instead reconstructed during recovery.

The format is self-delimiting (single-pass write and read), carries the
epoch and exact event-stream cursor in its header, reserves chunking
fields for snapshots exceeding the maximum blob size, and ends with an
xxhash64 checksum. A golden-byte test pins the v1 serialization so that
any byte-level change (which is consensus-critical, since all nodes must
produce bit-identical snapshots) fails CI until the format version is
bumped.
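
To make the shape of such a format concrete, here is a minimal sketch of a self-delimiting snapshot with a header, length-prefixed records, and a checksum trailer. It is not the PR's v1 format: the magic bytes, field order, field widths, and function names are all assumptions, the chunking fields and event cursor are omitted, and FNV-1a stands in for xxhash64 so that the example needs only the standard library.

```rust
use std::collections::BTreeMap;

// Illustrative only: NOT the PR's v1 layout. Magic bytes, field order, and
// widths are assumptions made for this sketch.
const MAGIC: &[u8; 4] = b"SNAP";
const VERSION: u32 = 1;

/// FNV-1a 64-bit hash. A stand-in for xxhash64 to avoid an external crate.
fn fnv1a64(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf2_9ce4_8422_2325;
    for &b in bytes {
        hash ^= u64::from(b);
        hash = hash.wrapping_mul(0x0000_0100_0000_01b3);
    }
    hash
}

/// Writes header, length-prefixed records in key order, then a checksum trailer.
/// A `BTreeMap` iterates in sorted key order, so equal contents give equal bytes.
fn serialize_snapshot(epoch: u32, entries: &BTreeMap<Vec<u8>, Vec<u8>>) -> Vec<u8> {
    let mut out = Vec::new();
    out.extend_from_slice(MAGIC);
    out.extend_from_slice(&VERSION.to_le_bytes());
    out.extend_from_slice(&epoch.to_le_bytes());
    out.extend_from_slice(&(entries.len() as u64).to_le_bytes());
    for (key, value) in entries {
        out.extend_from_slice(&(key.len() as u32).to_le_bytes());
        out.extend_from_slice(key);
        out.extend_from_slice(&(value.len() as u32).to_le_bytes());
        out.extend_from_slice(value);
    }
    let checksum = fnv1a64(&out);
    out.extend_from_slice(&checksum.to_le_bytes());
    out
}

/// Returns true if the 8-byte trailer matches a hash of everything before it.
fn verify_snapshot(bytes: &[u8]) -> bool {
    if bytes.len() < 8 {
        return false;
    }
    let (body, trailer) = bytes.split_at(bytes.len() - 8);
    let mut stored = [0u8; 8];
    stored.copy_from_slice(trailer);
    fnv1a64(body) == u64::from_le_bytes(stored)
}
```

A golden-byte test in this style would compare the output of `serialize_snapshot` for a fixed input against a byte array checked into the repository, so that any change to the layout fails until the expected bytes and the version are updated together.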

This is the first step towards storage node recovery for storage pools
(WAL-1185): pool membership cannot be recovered through event replay, so
recovering nodes will bootstrap from a quorum-certified snapshot blob and
replay events forward.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Add `walrus-node db-tool bench-blob-info-snapshot`, which serializes the
per_object_blob_info, per_object_pooled_blob_info, and storage_pool_info
column families of an existing node database into a snapshot file and
reports: entry counts, snapshot size and bytes per entry, serialize +
write + sync duration, read + deserialize duration, bulk-load duration
into a scratch database (using the same key/value encodings as a real
node), and zstd compression ratios at configurable levels.

The database is opened read-only with only the three relevant column
families, so the benchmark can run against a stopped node's database or
a copy of a running node's database. This provides production-scale
measurements for the snapshot design (size, serialization cost,
compression benefit) before any node deployment.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The xxhash64 trailer is a deterministic fingerprint of the snapshotted
table contents, so printing it lets operators compare snapshots taken at
the same epoch boundary across nodes with a single line of output.
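
As a sketch of that comparison, the following groups reported digests by epoch and flags any epoch where nodes disagree. The `(node, epoch, digest)` tuple shape and the function name are assumptions for illustration; the PR does not specify how operators collect the tool's output.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Returns, in ascending order, the epochs for which nodes reported more than
/// one distinct digest. Each report is a hypothetical `(node, epoch, digest)`
/// tuple collected from the tool's output on each node.
fn mismatched_epochs(reports: &[(&str, u64, u64)]) -> Vec<u64> {
    let mut digests_by_epoch: BTreeMap<u64, BTreeSet<u64>> = BTreeMap::new();
    for &(_node, epoch, digest) in reports {
        digests_by_epoch.entry(epoch).or_default().insert(digest);
    }
    digests_by_epoch
        .into_iter()
        .filter(|(_, digests)| digests.len() > 1)
        .map(|(epoch, _)| epoch)
        .collect()
}
```

An empty result means every epoch with reports had a single digest across the nodes that reported it.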

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
When enabled (default off), the node creates a RocksDB checkpoint of its
database at the epoch boundary, directly after GC phase 1 has settled
the blob info tables and before any further events are processed, and
removes older checkpoints so that at most one exists at a time. Startup
finishes any cleanup that a crash interrupted, so a stale checkpoint
cannot pin deleted SST files' disk space for a full epoch.
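
The "at most one checkpoint" rule can be sketched as an idempotent pruning pass that is safe to run both after creating a new checkpoint and at startup. This is a sketch under stated assumptions, not the PR's implementation: the function name and the flat directory layout are hypothetical, and the `checkpoint_epoch_<N>` prefix is taken from a later commit message in this PR.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

// Directory naming assumed from the PR text: `checkpoint_epoch_<N>`.
const PREFIX: &str = "checkpoint_epoch_";

/// Removes every checkpoint directory directly under `root` except the one
/// for `keep_epoch`, and returns the removed paths in sorted order.
/// Running it again after a crash simply finishes the remaining deletions.
fn prune_checkpoints(root: &Path, keep_epoch: u64) -> io::Result<Vec<PathBuf>> {
    let keep_name = format!("{}{}", PREFIX, keep_epoch);
    let mut removed = Vec::new();
    for entry in fs::read_dir(root)? {
        let entry = entry?;
        let file_name = entry.file_name();
        let name = file_name.to_string_lossy();
        let is_stale = name.starts_with(PREFIX) && name.as_ref() != keep_name.as_str();
        if is_stale && entry.file_type()?.is_dir() {
            fs::remove_dir_all(entry.path())?;
            removed.push(entry.path());
        }
    }
    removed.sort();
    Ok(removed)
}
```

Entries that do not carry the prefix, such as the live database directory, are never touched.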

The node does not serialize anything itself: operators verify snapshot
determinism offline by running
`walrus-node db-tool bench-blob-info-snapshot --db-path <checkpoint>`
on each node and comparing the reported digests for the same epoch.
Checkpoint creation duration is reported as a metric; failures increment
an error counter and never fail epoch processing.
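
The log-and-count behaviour might look roughly like the wrapper below. The function name, the `String` error type, and the plain atomic counter are assumptions; the actual node presumably reports through its own metrics and logging facilities.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::time::{Duration, Instant};

/// Runs `create_checkpoint` and never propagates its error to the caller.
/// On success it returns the elapsed time (to be reported as a metric);
/// on failure it increments `errors`, logs, and returns `None`.
fn checkpoint_best_effort<F>(errors: &AtomicU64, create_checkpoint: F) -> Option<Duration>
where
    F: FnOnce() -> Result<(), String>,
{
    let start = Instant::now();
    match create_checkpoint() {
        Ok(()) => Some(start.elapsed()),
        Err(err) => {
            errors.fetch_add(1, Ordering::Relaxed);
            eprintln!("checkpoint creation failed, continuing epoch change: {err}");
            None
        }
    }
}
```

Because the return type carries no error, the epoch-change code path calling it has nothing to propagate.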

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…tency check

The checkpoint takes seconds (memtable flush plus hard links) while the
consistency check's background scan reads the whole table for minutes or
longer. Running the checkpoint first keeps the scan's disk traffic from
stretching the inline checkpoint duration; determinism is unaffected
since both capture their state while event processing is blocked.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Parse the epoch out of a checkpoint_epoch_<N> input directory and embed
it in the snapshot header, so digests only match across snapshots taken
at the same epoch boundary. The event cursor deliberately stays at its
default: the cursor stored in the database is not deterministic at the
checkpoint instant because event completion is marked by a background
task.
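
Parsing the epoch from the directory name could be done along these lines. The function name is hypothetical, and rejecting signs or non-digit suffixes is a design choice of this sketch rather than something the PR states.

```rust
use std::path::Path;

/// Extracts `N` from a path whose final component is `checkpoint_epoch_<N>`.
/// Returns `None` for any other name, for non-digit suffixes, and on overflow.
fn parse_checkpoint_epoch(dir: &Path) -> Option<u64> {
    let name = dir.file_name()?.to_str()?;
    let digits = name.strip_prefix("checkpoint_epoch_")?;
    // `str::parse::<u64>` accepts a leading '+', so require plain ASCII digits.
    if digits.is_empty() || !digits.bytes().all(|b| b.is_ascii_digit()) {
        return None;
    }
    digits.parse().ok()
}
```

Returning `Option` leaves it to the caller to decide what to do when the input directory is not named like a checkpoint.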

Also compute the event-reprocessing guard before execute_epoch_change
spawns the finisher task, which marks the event complete in the
background and could otherwise misclassify normal processing as
reprocessing, skipping the checkpoint and consistency check.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@shuowang12 shuowang12 force-pushed the shuo/blob-info-checkpoint-minimal branch from d05dd43 to ba872e5 Compare June 12, 2026 03:39