fix(sync/code): clean up to-fetch markers regardless of inFlight state #5356
Open
powerslider wants to merge 11 commits into
8551b3e to 968b197
Contributor
Pull request overview
Fixes a race where code-to-fetch markers could be orphaned when duplicate work items were skipped due to an existing inFlight entry, leading to rare state sync test flakes.
Changes:
- Reorders code sync worker logic so marker cleanup for already-on-disk code is not gated by `inFlight`.
- Updates `inFlight` documentation to reflect its narrower role (deduping network fetches).
- Adds regression/stress tests to ensure markers are always cleaned up under duplicate enqueue scenarios.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| graft/evm/sync/code/syncer.go | Moves marker deletion for on-disk code ahead of `inFlight` dedupe so cleanup always occurs. |
| graft/evm/sync/code/syncer_test.go | Adds deterministic regression test + producer-concurrency stress test for “no marker leaks” invariant. |
968b197 to 5ecd4ad
Bug:
- A worker still listed in `inFlight` after deleting its on-disk marker could orphan a marker that `AddCode` rewrote in that window: sibling workers found `inFlight=loaded` for the duplicate and skipped.
- Surfaced as `TestFirewoodSync/10,000_keys_from_empty` rarely failing with "expected no remaining code-to-fetch markers after successful sync".

Fix:
- Run the marker-delete before the `inFlight` check. The delete is idempotent and never needed to be gated.
- `inFlight` now only guards concurrent network fetches. Any rewritten marker is re-cleaned on its next dequeue.

Regression Tests:
- `TestCodeSyncerCleansMarkerWhenCodeOnDiskAndInFlightHeld` is deterministic and fails against the original ordering.
- `TestCodeSyncerDuplicateAddCodeNoMarkerLeak`: end-to-end stress, two variants (code pre-written vs. fetched during sync).

resolves #5353

Signed-off-by: Tsvetan Dimitrov (tsvetan.dimitrov@avalabs.org)
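The ordering change described in this commit message can be illustrated with a small model. This is a sketch under stated assumptions, not the code in `syncer.go`: the `store` type, the two `handle...` functions, and the use of `sync.Map` for `inFlight` are hypothetical stand-ins chosen to mirror the `inFlight=loaded` wording above.

```go
package main

import (
	"fmt"
	"sync"
)

// hash is a stand-in for a code hash (hypothetical; the real type differs).
type hash string

// store is a hypothetical in-memory model of the client DB: which code blobs
// are on disk and which code-to-fetch markers are still present.
type store struct {
	mu      sync.Mutex
	code    map[hash]bool
	markers map[hash]bool
}

func (s *store) hasCode(h hash) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.code[h]
}

func (s *store) hasMarker(h hash) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.markers[h]
}

// deleteMarker is idempotent: deleting a missing marker is a no-op.
func (s *store) deleteMarker(h hash) {
	s.mu.Lock()
	defer s.mu.Unlock()
	delete(s.markers, h)
}

// handleGatedByInFlight models the original ordering: a duplicate is skipped
// on the inFlight check before any marker cleanup can run.
func handleGatedByInFlight(s *store, inFlight *sync.Map, h hash) (fetch bool) {
	if _, loaded := inFlight.LoadOrStore(h, struct{}{}); loaded {
		return false
	}
	if s.hasCode(h) {
		s.deleteMarker(h)
		inFlight.Delete(h)
		return false
	}
	return true
}

// handleCleanupFirst models the fixed ordering: marker cleanup for on-disk
// code runs first, and inFlight only dedupes network fetches.
func handleCleanupFirst(s *store, inFlight *sync.Map, h hash) (fetch bool) {
	if s.hasCode(h) {
		s.deleteMarker(h)
		return false
	}
	_, loaded := inFlight.LoadOrStore(h, struct{}{})
	return !loaded
}

// scenario reproduces the window from the bug report: the code is on disk,
// the marker has been rewritten, and a sibling worker is still in inFlight.
// It reports whether the marker is left behind after the duplicate is handled.
func scenario(handle func(*store, *sync.Map, hash) bool) (markerLeft bool) {
	s := &store{
		code:    map[hash]bool{"h": true},
		markers: map[hash]bool{"h": true},
	}
	var inFlight sync.Map
	inFlight.Store(hash("h"), struct{}{})
	handle(s, &inFlight, "h")
	return s.hasMarker("h")
}

func main() {
	fmt.Println("gated by inFlight, marker left:", scenario(handleGatedByInFlight))
	fmt.Println("cleanup first, marker left:", scenario(handleCleanupFirst))
}
```

In this model the first ordering leaves the rewritten marker behind, while the second removes it no matter who holds the `inFlight` entry.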
5ecd4ad to 2b7e78a
alarso16
reviewed
May 7, 2026
alarso16
approved these changes
May 18, 2026
Contributor
alarso16
left a comment
I vote to delete TestCodeSyncerCleansMarkerWhenCodeOnDiskAndInFlightHeld, but won't block on it
- Replace `inFlight` seeding with a wrapped DB that pauses the first `Batch.Write`
- `probeHash` send acts as a synchronization barrier for the sibling worker
alarso16
approved these changes
May 18, 2026
JonathanOppenheimer
approved these changes
May 18, 2026
Contributor
Austin explained the motivation for this PR to me in the office.
Why this should be merged
Check #5353
How this works
Bug:
- A worker still listed in `inFlight` after deleting its on-disk marker could orphan a marker that `AddCode` rewrote in that window: sibling workers found `inFlight=loaded` for the duplicate and skipped.
- Surfaced as `TestFirewoodSync/10,000_keys_from_empty` rarely failing with "expected no remaining code-to-fetch markers after successful sync".
Fix:
- Run the marker-delete before the `inFlight` check. The delete is idempotent and never needed to be gated.
- `inFlight` now only guards concurrent network fetches. Any rewritten marker is re-cleaned on its next dequeue.
How this was tested
Regression tests:
- `TestCodeSyncerCleansMarkerRewrittenMidCleanup` is deterministic and fails against the original ordering. The client DB is wrapped so the first `Batch.Write` pauses after committing the delete. The test then rewrites the marker and enqueues a duplicate, while a second hash sent through the same channel acts as a synchronization barrier so the sibling worker's decision is committed before the paused worker is released. No syncer internals are touched.
- `TestCodeSyncerDuplicateAddCodeNoMarkerLeak`: end-to-end stress, two variants (code pre-written vs. fetched during sync).
Need to be documented in RELEASES.md?
no
resolves #5353
Signed-off-by: Tsvetan Dimitrov (tsvetan.dimitrov@avalabs.org)
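The "pause the first `Batch.Write` after committing" technique from the regression test description can be sketched roughly as follows. This is an illustration under assumptions, not the test in `syncer_test.go`: the `writer` interface, `pauseGate`, `pausingBatch`, and `run` are hypothetical names, and the probe-hash barrier is left out to keep the example small.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// writer is a hypothetical one-method stand-in for a DB batch.
type writer interface{ Write() error }

// funcWriter adapts a function to the writer interface.
type funcWriter func() error

func (f funcWriter) Write() error { return f() }

// pauseGate coordinates the pause between the wrapped batch and the test.
type pauseGate struct {
	first   atomic.Bool   // set by the first Write to pass through
	paused  chan struct{} // closed once the first Write has committed
	release chan struct{} // closed by the test to resume the paused Write
}

func newPauseGate() *pauseGate {
	return &pauseGate{paused: make(chan struct{}), release: make(chan struct{})}
}

// pausingBatch commits through the inner writer, then blocks only the first
// caller until the gate is released. Later writes pass through unpaused.
type pausingBatch struct {
	inner writer
	gate  *pauseGate
}

func (b pausingBatch) Write() error {
	err := b.inner.Write() // commit before pausing
	if b.gate.first.CompareAndSwap(false, true) {
		close(b.gate.paused)
		<-b.gate.release
	}
	return err
}

// run drives one paused write and one interleaved write, returning the
// order in which events were observed.
func run() []string {
	var mu sync.Mutex
	var events []string
	note := func(s string) {
		mu.Lock()
		defer mu.Unlock()
		events = append(events, s)
	}
	record := func(s string) writer {
		return funcWriter(func() error { note(s); return nil })
	}

	gate := newPauseGate()
	done := make(chan struct{})
	go func() {
		defer close(done)
		// Models the worker whose marker delete commits and then stalls.
		_ = pausingBatch{inner: record("delete marker"), gate: gate}.Write()
		note("first worker resumed")
	}()

	<-gate.paused // the delete is committed and the worker is parked
	// Models the test rewriting the marker inside the window.
	_ = pausingBatch{inner: record("rewrite marker"), gate: gate}.Write()
	close(gate.release)
	<-done
	return events
}

func main() {
	fmt.Println(run())
}
```

An atomic compare-and-swap is used here rather than `sync.Once` because concurrent callers of `Once.Do` block until the first call returns, which would also stall the second write and defeat the purpose of the pause.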