Skip to content

fix(sync/code): clean up to-fetch markers regardless of inFlight state#5356

Open
powerslider wants to merge 11 commits into
masterfrom
powerslider/5353-test-firewood-sync-flake
Open

fix(sync/code): clean up to-fetch markers regardless of inFlight state#5356
powerslider wants to merge 11 commits into
masterfrom
powerslider/5353-test-firewood-sync-flake

Conversation

@powerslider
Copy link
Copy Markdown
Contributor

@powerslider powerslider commented May 7, 2026

Why this should be merged

Check #5353

How this works

Bug:

  • A worker still listed in inFlight after deleting its on-disk marker could orphan a marker that AddCode rewrote in that window: sibling workers found inFlight=loaded for the duplicate and skipped.
  • Surfaced as TestFirewoodSync/10,000_keys_from_empty rarely failing with "expected no remaining code-to-fetch markers after successful sync".

Fix:

  • Run the marker-delete before the inFlight check. The delete is idempotent and never needed to be gated.
  • inFlight now only guards concurrent network fetches. Any rewritten marker is re-cleaned on its next dequeue.

How this was tested

Regression tests:

  • TestCodeSyncerCleansMarkerRewrittenMidCleanup is deterministic and fails against the original ordering. The client DB is wrapped so the first Batch.Write pauses after committing the delete. The test then rewrites the marker and enqueues a duplicate, while a second hash sent through the same channel acts as a synchronization barrier so the sibling worker's decision is committed before the paused worker is released. No syncer internals are touched.
  • TestCodeSyncerDuplicateAddCodeNoMarkerLeak: end-to-end stress, two variants (code pre-written vs. fetched during sync).

Need to be documented in RELEASES.md?

no

resolves #5353

Signed-off-by: Tsvetan Dimitrov (tsvetan.dimitrov@avalabs.org)

@powerslider powerslider self-assigned this May 7, 2026
Copilot AI review requested due to automatic review settings May 7, 2026 21:14
@powerslider powerslider requested a review from a team as a code owner May 7, 2026 21:14
@powerslider powerslider force-pushed the powerslider/5353-test-firewood-sync-flake branch from 8551b3e to 968b197 Compare May 7, 2026 21:17
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes a race where code-to-fetch markers could be orphaned when duplicate work items were skipped due to an existing inFlight entry, leading to rare state sync test flakes.

Changes:

  • Reorders code sync worker logic so marker cleanup for already-on-disk code is not gated by inFlight.
  • Updates inFlight documentation to reflect its narrower role (deduping network fetches).
  • Adds regression/stress tests to ensure markers are always cleaned up under duplicate enqueue scenarios.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File Description
graft/evm/sync/code/syncer.go Moves marker deletion for on-disk code ahead of inFlight dedupe so cleanup always occurs.
graft/evm/sync/code/syncer_test.go Adds deterministic regression test + producer-concurrency stress test for “no marker leaks” invariant.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread graft/evm/sync/code/syncer_test.go Outdated
Comment thread graft/evm/sync/code/syncer_test.go Outdated
@powerslider powerslider force-pushed the powerslider/5353-test-firewood-sync-flake branch from 968b197 to 5ecd4ad Compare May 7, 2026 21:23
Bug:
- A worker still listed in `inFlight` after deleting its on-disk marker
  could orphan a marker that `AddCode` rewrote in that window: sibling
  workers found `inFlight=loaded` for the duplicate and skipped.
- Surfaced as `TestFirewoodSync/10,000_keys_from_empty` rarely failing
  with "expected no remaining code-to-fetch markers after successful
  sync".

Fix:
- Run the marker-delete before the `inFlight` check. The delete is
  idempotent and never needed to be gated.
- inFlight now only guards concurrent network fetches. Any rewritten
  marker is re-cleaned on its next dequeue.

Regression Tests:
- `TestCodeSyncerCleansMarkerWhenCodeOnDiskAndInFlightHeld` is
  deterministic and fails against the original ordering.
- `TestCodeSyncerDuplicateAddCodeNoMarkerLeak`: end-to-end stress, two
  variants (code pre-written vs. fetched during sync).

resolves #5353

Signed-off-by: Tsvetan Dimitrov (tsvetan.dimitrov@avalabs.org)
@powerslider powerslider force-pushed the powerslider/5353-test-firewood-sync-flake branch from 5ecd4ad to 2b7e78a Compare May 7, 2026 21:24
Comment thread graft/evm/sync/code/syncer_test.go Outdated
Comment thread graft/evm/sync/code/syncer_test.go
Comment thread graft/evm/sync/code/syncer_test.go Outdated
Comment thread graft/evm/sync/code/syncer_test.go Outdated
Copy link
Copy Markdown
Contributor

@alarso16 alarso16 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I vote to delete TestCodeSyncerCleansMarkerWhenCodeOnDiskAndInFlightHeld, but won't block on it

- Replace inFlight seeding with a wrapped DB that pauses the first Batch.Write
- probeHash send acts as a synchronization barrier for the sibling worker
Comment thread graft/evm/sync/code/syncer_test.go Outdated
Comment thread graft/evm/sync/code/syncer_test.go
@JonathanOppenheimer
Copy link
Copy Markdown
Contributor

Austin explained the motivation for this PR to me in the office.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: No status

Development

Successfully merging this pull request may close these issues.

Fix flaky TestFirewoodSync

4 participants