fix: do not cancel in-progress blob metadata recovery when gaining new shards #3462
Open
halfprice wants to merge 2 commits into
sadhansood
reviewed
Jun 15, 2026
…w shards

When a node joins the committee, it enters RecoverMetadata status and starts a sync-shards task that first recovers all certified blob metadata before starting the individual shard syncs. If metadata recovery takes longer than an epoch and the node gains another shard in a subsequent epoch, the new start_sync_shards call aborts the in-flight task. Since the node is no longer "newly joining", the replacement task skipped metadata recovery, only started syncs for the newly gained shards, and incorrectly flipped the node status from RecoverMetadata to Active.

As a result, metadata recovery was silently lost and the shards gained in the earlier epoch were orphaned at status None, with no repair on restart because the node status was already Active.

Fix: sync_shards_task no longer takes recover_metadata from the caller and instead derives it from the persisted node status. When it observes RecoverMetadata, it runs the metadata recovery and then starts syncs for all shards that the node owns in the current committee and stores locally (already-active shards are skipped), so a task that aborts its predecessor adopts the predecessor's unstarted work. Stored shards the node does not own (for example, shards locked for transfer to another node in the same epoch change) are filtered out so their status is not clobbered. The RecoverMetadata -> Active transition now only happens in the task that actually performed the metadata recovery.

Also adds a pause fail point in sync_certified_blob_metadata and simtests that reproduce both scenarios.
…overy

Metadata recovery can run for a long time, during which a concurrent path (for example, entering recovery or dropping out of the committee at an epoch change) may move the node out of RecoverMetadata. Reusing the node status read at the start of sync_shards_task could therefore clobber such a transition back to Active.

Re-read the status from the db immediately before the compare-and-set so we only flip to Active when the node is still recovering metadata.

Adds a simtest that pauses metadata recovery, changes the node status concurrently, and asserts the change is not clobbered.
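The re-read-then-compare-and-set idea from this commit can be sketched as follows. This is a minimal illustration, not the code in this PR: `StatusStore`, `compare_and_set`, `mark_active_if_still_recovering`, and the `RecoveryInProgress` variant are hypothetical names, and an in-memory `Mutex` stands in for the node's database.

```rust
use std::sync::Mutex;

/// Hypothetical, simplified node status; the real enum has more variants.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum NodeStatus {
    RecoverMetadata,
    RecoveryInProgress,
    Active,
}

/// Stand-in for the persisted node status (assumption: the real node keeps this in its db).
struct StatusStore {
    status: Mutex<NodeStatus>,
}

impl StatusStore {
    fn new(status: NodeStatus) -> Self {
        Self { status: Mutex::new(status) }
    }

    fn get(&self) -> NodeStatus {
        *self.status.lock().unwrap()
    }

    fn set(&self, status: NodeStatus) {
        *self.status.lock().unwrap() = status;
    }

    /// Writes `new` only if the stored status currently equals `expected`.
    /// The read and the write happen under one lock, so no stale snapshot is used.
    fn compare_and_set(&self, expected: NodeStatus, new: NodeStatus) -> bool {
        let mut guard = self.status.lock().unwrap();
        if *guard == expected {
            *guard = new;
            true
        } else {
            false
        }
    }
}

/// Called once metadata recovery has finished. Returns whether the node was flipped
/// to `Active`; a concurrent transition out of `RecoverMetadata` is left untouched.
fn mark_active_if_still_recovering(store: &StatusStore) -> bool {
    store.compare_and_set(NodeStatus::RecoverMetadata, NodeStatus::Active)
}
```

If another path calls `store.set(NodeStatus::RecoveryInProgress)` while recovery is paused, `mark_active_if_still_recovering` returns `false` and the status stays as that path left it, which is the behavior the new simtest asserts.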
sadhansood
approved these changes
Jun 26, 2026
sadhansood
left a comment
Contributor
Thanks @halfprice, this looks great. Thank you for your effort on this fix. I also like the new simtest. Just a small nit, not a blocker.
    }));
    // The task observes the `RecoverMetadata` status and derives the shards to sync from
    // the existing shard storages.
    self.start_sync_shards(vec![]).await?;
Contributor
nit: `vec![]` reads a little off, since the function ignores the argument during metadata sync. A dedicated helper such as `restart_metadata_sync` might read better than passing an empty vector.
shuowang12
approved these changes
Jun 26, 2026
shuowang12
left a comment
Collaborator
LGTM. Thanks for the fix.
Description

Found this issue while working on node recovery across epochs.

When a node joins the committee, it enters `RecoverMetadata` status and starts a sync-shards task that first recovers all certified blob metadata before starting the individual shard syncs. If metadata recovery takes longer than an epoch and the node gains another shard in a subsequent epoch, the new `start_sync_shards` call aborts the in-flight task. Since the node is no longer "newly joining", the replacement task:

- skipped metadata recovery,
- only started syncs for the newly gained shards (the shards gained in the earlier epoch never went through `record_start_shard_sync`, so they stayed at status `None`), and
- incorrectly flipped the node status from `RecoverMetadata` to `Active`.

As a result, metadata recovery was silently lost and the shards gained in the earlier epoch were orphaned, with no repair on restart because `restart_syncs` only resumes shards persisted as `ActiveSync`/`ActiveRecover` and the node status was already `Active`.

Fix

`sync_shards_task` no longer takes `recover_metadata` from the caller and instead derives it from the persisted node status:

- When it observes `RecoverMetadata`, it runs the metadata recovery and then starts syncs for all shards that the node owns in the current committee and stores locally (already-active shards are skipped in `start_new_shard_sync`), so a task that aborts its predecessor adopts the predecessor's unstarted work. This mirrors what `restart_syncs` already does in the `RecoverMetadata` branch, and that branch now reuses the same code path.
- Stored shards that the node does not own (for example, shards locked for transfer to another node in the same epoch change) are filtered out. Otherwise the task would flip their `LockedToMove` status back to `ActiveSync` and start syncing a shard the node no longer owns. (This exposure previously existed in the `restart_syncs` `RecoverMetadata` branch as well, which synced all existing shard storages unconditionally.)
- The `RecoverMetadata -> Active` transition is now only performed by the task that actually executed the metadata recovery, so a concurrent task can no longer prematurely mark the node `Active`.

This makes aborting the previous sync-shards task safe, because the replacement task reconstructs its work from persisted state. Note that aborting the orchestrator task never cancelled the already-running per-shard sync tasks (they are tracked separately in `shard_sync_in_progress`), so long-running shard transfers are unaffected.

Test plan

Two new simtests, using a new pause fail point in `sync_certified_blob_metadata`:

- `sync_shard_new_epoch_does_not_cancel_metadata_recovery`: parks metadata recovery in flight, gains another shard via a second `start_sync_shards` call, releases the pause, and asserts that the node reaches `Active` with both shards fully synced and an unowned locked shard untouched. Fails on `main` (the first shard remains at status `None`); passes with this fix.
- `sync_shard_recover_metadata_skips_unowned_shards`: a node in `RecoverMetadata` holds storages for shards it no longer owns in the current committee (one locked for transfer); asserts that owned shards sync to `Active` while the locked shard keeps its `LockedToMove` status. Fails without the ownership filter (the locked shard is clobbered to `ActiveSync` and retries syncing a shard the source nodes reject); passes with this fix.

Also ran:

- `cargo simtest failure_injection_tests`: all 41 shard-sync failure-injection simtests pass
- `cargo nextest run -p walrus-service sync_shard shard_sync`: 30 tests pass
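To make the shard-selection rule from the Fix section concrete, here is a rough sketch of how a task could derive its work from persisted state. It is not the code in this PR: the function `shards_to_sync`, the simplified enums, and the use of `u16` shard ids are assumptions for illustration only.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical, simplified shard status; the real enum has more variants.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum ShardStatus {
    None,
    Active,
    LockedToMove,
}

/// Hypothetical, simplified node status.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum NodeStatus {
    RecoverMetadata,
    Active,
}

/// Picks the shards for which a sync should be started.
///
/// While the node is in `RecoverMetadata`, the caller-supplied list is ignored and the
/// work is rebuilt from local storage: every stored shard that the node owns in the
/// current committee and that is not already `Active`. Shards the node does not own
/// (for example, ones in `LockedToMove`) are skipped so their status is left alone.
/// Otherwise only the newly gained shards are returned.
fn shards_to_sync(
    node_status: NodeStatus,
    newly_gained: &[u16],
    stored: &BTreeMap<u16, ShardStatus>,
    owned: &BTreeSet<u16>,
) -> Vec<u16> {
    match node_status {
        NodeStatus::RecoverMetadata => {
            let mut shards = Vec::new();
            for (id, status) in stored {
                if owned.contains(id) && *status != ShardStatus::Active {
                    shards.push(*id);
                }
            }
            shards
        }
        NodeStatus::Active => newly_gained.to_vec(),
    }
}
```

With stored shards 1 and 2 at `None`, shard 3 at `LockedToMove`, and shard 4 at `Active`, and with the node owning shards 1, 2, and 4, a call in `RecoverMetadata` selects shards 1 and 2 even if only shard 2 was passed in as newly gained. That is the "adopt the predecessor's unstarted work" behavior, with the unowned locked shard left untouched.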