fix: do not cancel in-progress blob metadata recovery when gaining new shards by halfprice · Pull Request #3462 · MystenLabs/walrus

halfprice · 2026-06-12T06:55:37Z

Description

Found this issue while working on node recovery across epoch.

When a node joins the committee, it enters RecoverMetadata status and starts a sync-shards task that first recovers all certified blob metadata before starting the individual shard syncs. If metadata recovery takes longer than an epoch and the node gains another shard in a subsequent epoch, the new start_sync_shards call aborts the in-flight task. Since the node is no longer "newly joining", the replacement task:

skipped metadata recovery entirely,
only started syncs for the newly gained shards (the earlier epoch's shards never reached record_start_shard_sync, so they stayed at status None), and
incorrectly flipped the node status from RecoverMetadata to Active.

As a result, metadata recovery was silently lost and the shards gained in the earlier epoch were orphaned, with no repair on restart because restart_syncs only resumes shards persisted as ActiveSync/ActiveRecover and the node status was already Active.

Fix

sync_shards_task no longer takes recover_metadata from the caller and instead derives it from the persisted node status:

When the task observes RecoverMetadata, it runs the metadata recovery and then starts syncs for all shards that the node owns in the current committee and stores locally (already-active shards are skipped in start_new_shard_sync), so a task that aborts its predecessor adopts the predecessor's unstarted work. This mirrors what restart_syncs already does in the RecoverMetadata branch, and that branch now reuses the same code path.
Stored shards the node does not own — for example, shards locked for transfer to another node in the same epoch change — are filtered out, so the sync task cannot clobber a LockedToMove status back to ActiveSync and start syncing a shard the node no longer owns. (This exposure previously existed in the restart_syncs RecoverMetadata branch as well, which synced all existing shard storages unconditionally.)
The RecoverMetadata -> Active transition is now only performed by the task that actually executed the metadata recovery, so a concurrent task can no longer prematurely mark the node Active.

This makes aborting the previous sync-shards task safe, because the replacement task reconstructs its work from persisted state. Note that aborting the orchestrator task never cancelled the already-running per-shard sync tasks (they are tracked separately in shard_sync_in_progress), so long-running shard transfers are unaffected.
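The selection rule described above can be sketched as a pure function over persisted state. This is a minimal model, not the code in this PR: `shards_to_sync`, the simplified `NodeStatus` and `ShardStatus` enums, and the `ShardId` alias are hypothetical stand-ins, and treating both `ActiveSync` and `Active` as "already active" is an assumption.

```rust
use std::collections::{BTreeMap, BTreeSet};

// Hypothetical alias; the real shard index type may differ.
type ShardId = u16;

/// Simplified stand-in for the persisted node status (the real enum has more variants).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum NodeStatus {
    RecoverMetadata,
    Active,
}

/// Simplified stand-in for the persisted per-shard status.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum ShardStatus {
    None,
    ActiveSync,
    Active,
    LockedToMove,
}

/// Decide which shards a sync task should start, using only persisted state.
///
/// In `RecoverMetadata`, the `requested` argument is ignored and the work is
/// rebuilt from the shards that are both stored locally and owned in the
/// current committee. Otherwise only the requested shards are started.
fn shards_to_sync(
    status: NodeStatus,
    requested: &[ShardId],
    owned: &BTreeSet<ShardId>,
    stored: &BTreeMap<ShardId, ShardStatus>,
) -> Vec<ShardId> {
    match status {
        NodeStatus::RecoverMetadata => stored
            .iter()
            // Ownership filter: never touch shards the node no longer owns,
            // such as a shard that is `LockedToMove`.
            .filter(|(id, _)| owned.contains(*id))
            // Assumption: shards that are syncing or fully active are skipped.
            .filter(|(_, st)| !matches!(**st, ShardStatus::ActiveSync | ShardStatus::Active))
            .map(|(id, _)| *id)
            .collect(),
        NodeStatus::Active => requested.to_vec(),
    }
}
```

With shards 1 and 2 stored at `None`, shard 3 at `LockedToMove` and not owned, and shard 4 owned and `Active`, the function returns shards 1 and 2 in `RecoverMetadata` regardless of `requested`. That is the sense in which a replacement task adopts its predecessor's unstarted work.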

Test plan

Two new simtests, using a new pause fail point in sync_certified_blob_metadata:

sync_shard_new_epoch_does_not_cancel_metadata_recovery: parks metadata recovery in flight, gains another shard via a second start_sync_shards call, releases the pause, and asserts that the node reaches Active with both shards fully synced and an unowned locked shard untouched. Fails on main (the first shard remains at status None); passes with this fix.
sync_shard_recover_metadata_skips_unowned_shards: node in RecoverMetadata holds storages for shards it no longer owns in the current committee (one locked for transfer); asserts owned shards sync to Active while the locked shard keeps its LockedToMove status. Fails without the ownership filter (the locked shard is clobbered to ActiveSync and retries syncing a shard the source nodes reject); passes with this fix.
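The way a pause fail point lets a test interleave work with a parked task can be sketched with standard-library primitives. This is only an illustration of the idea: `PauseGate` and `run_with_pause` are hypothetical names, and the simtests in this PR use the project's own fail-point mechanism rather than a `Condvar`.

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

/// A one-shot gate: `wait` blocks until `release` has been called.
#[derive(Default)]
struct PauseGate {
    released: Mutex<bool>,
    cv: Condvar,
}

impl PauseGate {
    fn wait(&self) {
        let mut released = self.released.lock().unwrap();
        // Loop to tolerate spurious wakeups.
        while !*released {
            released = self.cv.wait(released).unwrap();
        }
    }

    fn release(&self) {
        *self.released.lock().unwrap() = true;
        self.cv.notify_all();
    }
}

/// Park a worker at the gate, run `during_pause`, then release the worker.
/// Returns the ordered log of events.
fn run_with_pause(
    during_pause: impl FnOnce(&Mutex<Vec<&'static str>>),
) -> Vec<&'static str> {
    let gate = Arc::new(PauseGate::default());
    let log: Arc<Mutex<Vec<&'static str>>> = Arc::new(Mutex::new(Vec::new()));

    let worker = {
        let (gate, log) = (Arc::clone(&gate), Arc::clone(&log));
        thread::spawn(move || {
            // Stand-in for the pause fail point inside metadata recovery.
            gate.wait();
            log.lock().unwrap().push("metadata recovered");
        })
    };

    // The worker cannot pass the gate yet, so this always runs first.
    during_pause(&*log);
    gate.release();
    worker.join().unwrap();

    let events = log.lock().unwrap().clone();
    events
}
```

Because the worker cannot log anything until `release` is called, whatever `during_pause` records (for example, a second `start_sync_shards` call) is guaranteed to precede the worker's entry. A paused recovery gives the simtests the same kind of deterministic ordering.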

Also ran:

cargo simtest failure_injection_tests — all 41 shard-sync failure-injection simtests pass
cargo nextest run -p walrus-service sync_shard shard_sync — 30 tests pass

…w shards When a node joins the committee, it enters RecoverMetadata status and starts a sync-shards task that first recovers all certified blob metadata before starting the individual shard syncs. If metadata recovery takes longer than an epoch and the node gains another shard in a subsequent epoch, the new start_sync_shards call aborts the in-flight task. Since the node is no longer "newly joining", the replacement task skipped metadata recovery, only started syncs for the newly gained shards, and incorrectly flipped the node status from RecoverMetadata to Active. As a result, metadata recovery was silently lost and the shards gained in the earlier epoch were orphaned at status None, with no repair on restart because the node status was already Active. Fix: sync_shards_task no longer takes recover_metadata from the caller and instead derives it from the persisted node status. When it observes RecoverMetadata, it runs the metadata recovery and then starts syncs for all shards that the node owns in the current committee and stores locally (already-active shards are skipped), so a task that aborts its predecessor adopts the predecessor's unstarted work. Stored shards the node does not own (for example, shards locked for transfer to another node in the same epoch change) are filtered out so their status is not clobbered. The RecoverMetadata -> Active transition now only happens in the task that actually performed the metadata recovery. Also adds a pause fail point in sync_certified_blob_metadata and simtests that reproduce both scenarios.

…overy Metadata recovery can run for a long time, during which a concurrent path (for example, entering recovery or dropping out of the committee at an epoch change) may move the node out of RecoverMetadata. Reusing the node status read at the start of sync_shards_task could therefore clobber such a transition back to Active. Re-read the status from the db immediately before the compare-and-set so we only flip to Active when the node is still recovering metadata. Adds a simtest that pauses metadata recovery, changes the node status concurrently, and asserts the change is not clobbered.
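The re-read before the compare-and-set can be illustrated with an in-memory stand-in for the persisted status. This is a sketch under assumptions: `StatusStore`, `finish_metadata_recovery`, and the `Standby` variant are hypothetical names used for illustration, and a `Mutex` replaces the db.

```rust
use std::sync::Mutex;

/// Simplified stand-in for the persisted node status.
/// `Standby` is a hypothetical example of "moved out of `RecoverMetadata`".
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum NodeStatus {
    RecoverMetadata,
    Standby,
    Active,
}

/// In-memory stand-in for the node status stored in the db.
struct StatusStore {
    status: Mutex<NodeStatus>,
}

impl StatusStore {
    fn new(initial: NodeStatus) -> Self {
        Self { status: Mutex::new(initial) }
    }

    fn get(&self) -> NodeStatus {
        *self.status.lock().unwrap()
    }

    fn set(&self, status: NodeStatus) {
        *self.status.lock().unwrap() = status;
    }

    /// Flip `RecoverMetadata -> Active` only if the status read *now* is still
    /// `RecoverMetadata`. Returns whether the flip happened.
    fn finish_metadata_recovery(&self) -> bool {
        let mut current = self.status.lock().unwrap();
        if *current == NodeStatus::RecoverMetadata {
            *current = NodeStatus::Active;
            true
        } else {
            // A concurrent path changed the status during recovery; keep it.
            false
        }
    }
}
```

If a task reads `RecoverMetadata` at its start and another path sets `Standby` while recovery runs, `finish_metadata_recovery` returns `false` and leaves `Standby` in place. Reusing the value read at the start would instead have written `Active` over the concurrent change.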

sadhansood

Thanks @halfprice, looks great. Thank you for your effort on this fix. I also like the new simtest. Just a small nit, not a blocker.

sadhansood · 2026-06-26T16:57:10Z

-                }));
+            // The task observes the `RecoverMetadata` status and derives the shards to sync from
+            // the existing shard storages.
+            self.start_sync_shards(vec![]).await?;


nit: vec![] reads a little off as the function ignores the argument when in metadata sync. Maybe a specific helper restart_metadata_sync might read better vs passing an empty vector.

shuowang12

LGTM. Thanks for the fix.

halfprice force-pushed the zhewu/start_sync_shards_bug branch 2 times, most recently from 44f418c to d42d2a3 Compare June 12, 2026 07:34

halfprice requested review from sadhansood and shuowang12 June 12, 2026 07:42

sadhansood reviewed Jun 15, 2026

Comment thread crates/walrus-service/src/node/shard_sync.rs

halfprice requested a review from sadhansood June 19, 2026 05:23

halfprice added 2 commits June 24, 2026 21:34

halfprice force-pushed the zhewu/start_sync_shards_bug branch from 2ac4702 to 40ecd7f Compare June 25, 2026 04:34

sadhansood approved these changes Jun 26, 2026

shuowang12 approved these changes Jun 26, 2026

halfprice wants to merge 2 commits into main from zhewu/start_sync_shards_bug
