fix: avoid DAG scan livelock on bad FS entries #406

Open. frrist wants to merge 1 commit into main from fix/dag-scan-livelock

frrist commented Apr 17, 2026

An unreadable file (e.g. a broken /etc/alternatives symlink under /bin)
produced a `types.BadFSEntryError` that left its `dag_scans` row with
`cid IS NULL`. The parent directory's scan then returned
`(cid.Undef, nil)` — "no error, deferred on incomplete children" — and
the outer loop in `ExecuteDagScansForUpload` counted that as progress,
so the `executions == 0 → return BadFSEntriesError` guard never fired
and the upload pipeline appeared stuck.

`executeDAGScan` now returns a `(completed, error)` pair so the outer
loop can distinguish "scan got a CID" from "scan was deferred". Only
completions count as progress, so a pass containing nothing but bad
files and their blocked ancestors now exits with `BadFSEntriesError`,
which the existing handler in `uploads.go` cleans up and the `--retry`
path in `cmd/upload` can then resume. `badFsEntryErrs` is reset per
pass to avoid duplicate entries in the returned error, and the stray
in-loop `defer span.End()` is replaced with explicit calls on each
exit path.
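
For reviewers, the progress accounting described above can be sketched
roughly as follows. This is a minimal, self-contained illustration and
not the code in this PR: `scan`, `executeScan`, `runScans`,
`badEntryError`, and `errBadFSEntries` are hypothetical stand-ins for
the `dag_scans` rows, `executeDAGScan`, `ExecuteDagScansForUpload`,
`types.BadFSEntryError`, and `BadFSEntriesError`.

```go
package main

import (
	"errors"
	"fmt"
)

// scan is a hypothetical stand-in for one dag_scans row.
type scan struct {
	name     string
	bad      bool    // simulates an unreadable filesystem entry
	children []*scan // empty for plain files
	done     bool    // true once the scan has produced a CID
}

// badEntryError and errBadFSEntries are illustrative stand-ins for
// types.BadFSEntryError and BadFSEntriesError.
type badEntryError struct{ name string }

func (e badEntryError) Error() string { return "bad FS entry: " + e.name }

var errBadFSEntries = errors.New("bad FS entries")

// executeScan reports (true, nil) when the scan completed, (false, nil)
// when it was deferred on incomplete children, and (false, err) on failure.
func executeScan(s *scan) (bool, error) {
	if s.bad {
		return false, badEntryError{name: s.name}
	}
	for _, c := range s.children {
		if !c.done {
			return false, nil // deferred, not an error
		}
	}
	s.done = true
	return true, nil
}

// runScans repeats passes until every scan is done or a pass completes nothing.
func runScans(scans []*scan) error {
	for {
		pending, completed := 0, 0
		var badEntries []error // reset on every pass, so no duplicates
		for _, s := range scans {
			if s.done {
				continue
			}
			pending++
			ok, err := executeScan(s)
			var bad badEntryError
			switch {
			case errors.As(err, &bad):
				badEntries = append(badEntries, err)
			case err != nil:
				return err
			case ok:
				completed++ // deferred scans are deliberately not counted
			}
		}
		if pending == 0 {
			return nil
		}
		if completed == 0 {
			// Only bad entries and their blocked ancestors remain.
			return fmt.Errorf("%w: %d unreadable, %d blocked",
				errBadFSEntries, len(badEntries), pending-len(badEntries))
		}
	}
}

func main() {
	broken := &scan{name: "bin/broken-link", bad: true}
	ok := &scan{name: "bin/ls"}
	dir := &scan{name: "bin", children: []*scan{broken, ok}}

	err := runScans([]*scan{dir, broken, ok})
	fmt.Println(err)
	fmt.Println(errors.Is(err, errBadFSEntries), ok.done, dir.done)
}
```

In this sketch, counting the deferred `dir` scan as progress would keep
`completed` above zero on every pass, which is the livelock shape
described above.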

Adds a regression test under a bounded deadline so any future
reintroduction of the livelock fails fast instead of hanging.
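
One way to bound such a test, sketched here under assumptions rather
than copied from the PR, is to run the call in a goroutine and race it
against a context deadline. `runWithDeadline`, `returnsAtOnce`, and
`spinsForever` are hypothetical names; the actual regression test may
be structured differently.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// runWithDeadline runs fn in a goroutine and returns its error, or the
// context error if fn has not returned within limit.
func runWithDeadline(limit time.Duration, fn func(ctx context.Context) error) error {
	ctx, cancel := context.WithTimeout(context.Background(), limit)
	defer cancel()

	done := make(chan error, 1) // buffered so the goroutine never blocks on send
	go func() { done <- fn(ctx) }()

	select {
	case err := <-done:
		return err
	case <-ctx.Done():
		return ctx.Err()
	}
}

// returnsAtOnce models a scan loop that terminates.
func returnsAtOnce(context.Context) error { return nil }

// spinsForever models the livelock: it loops without ever observing ctx.
func spinsForever(context.Context) error {
	for {
		time.Sleep(time.Millisecond)
	}
}

func main() {
	fmt.Println("terminating loop:", runWithDeadline(time.Second, returnsAtOnce))

	err := runWithDeadline(50*time.Millisecond, spinsForever)
	fmt.Println("livelocked loop timed out:", errors.Is(err, context.DeadlineExceeded))
}
```

The spinning goroutine is leaked in this sketch, which is acceptable
for a short-lived test process but not for long-running code.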

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Development

Successfully merging this pull request may close these issues.

guppy upload hangs indefinitely when a source directory contains an unreadable file
