fix: avoid DAG scan livelock on bad FS entries #406
Open
frrist wants to merge 1 commit into
An unreadable file (e.g. a broken /etc/alternatives symlink under /bin) produced a `types.BadFSEntryError` that left its `dag_scans` row with `cid IS NULL`. The parent directory's scan then returned `(cid.Undef, nil)`, meaning "no error, deferred on incomplete children", and the outer loop in `ExecuteDagScansForUpload` counted that as progress. As a result the `executions == 0 → return BadFSEntriesError` guard never fired and the upload pipeline appeared stuck.

`executeDAGScan` now returns a `(completed, error)` pair so the outer loop can distinguish "scan got a CID" from "scan was deferred". Only completions count as progress, so a pass containing nothing but bad files and their blocked ancestors now exits with `BadFSEntriesError`, which the existing handler in `uploads.go` cleans up and the `--retry` path in `cmd/upload` can then resume.

`badFsEntryErrs` is reset per pass to avoid duplicate entries in the returned error, and the stray in-loop `defer span.End()` is replaced with explicit calls on each exit path.

Adds a regression test under a bounded deadline so any future reintroduction of the livelock fails fast instead of hanging.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
alanshaw approved these changes on Apr 17, 2026