sweepbatcher: harden AddSweep against ctx closure (#1130)
Conversation
Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request addresses a race condition in the sweepbatcher's AddSweep path.
Code Review
This pull request enhances the AddSweep method in the Batcher to ensure that context cancellation or batcher shutdown errors are prioritized when internal operations fail. It introduces a helper method addSweepExitErr to unify this logic and adds a unit test to verify the behavior. Feedback suggests simplifying the repetitive error handling blocks by re-assigning the error variable and streamlining the ctx.Done() case, as the helper function is guaranteed to return an error in that scenario.
@claude review this
Code Review: sweepbatcher: harden AddSweep against ctx closure

Overview

This PR fixes a real shutdown race in AddSweep.

Issues

1. Dead code in the final <-ctx.Done() case:

```go
case <-ctx.Done():
	err := b.addSweepExitErr(ctx)
	if err != nil {
		return err
	}
	return ctx.Err() // ← unreachable
```

Inside the <-ctx.Done() case, addSweepExitErr is guaranteed to return a non-nil error, so the final return ctx.Err() is unreachable.

2. The fix covers the main error sites, but not the presigned branch:

```go
if sweep.presigned && !fullyConfirmed {
	err := ensurePresigned(ctx, sweeps, b.presignedHelper, ...)
	if err != nil {
		// no addSweepExitErr check here
		return fmt.Errorf(...)
	}
}
```

The presigned path is less common, but it carries the same potential to leak backend errors during shutdown.

3. Regression test doesn't run against the real Postgres backend

The new test does not go through the suite's shared runner:

```go
func runTests(t *testing.T, testFn func(t *testing.T, store testStore,
	batcherStore testBatcherStore)) {

	t.Run("mocks", ...)
	t.Run("loopdb", ...) // real Postgres
}
```

Since the original failure was triggered specifically by Postgres returning driver: bad connection, running the regression test through runTests would exercise the real backend as well.

4. The mock fetcher's behavior is undocumented:

```go
func (f *cancelingSweepFetcher) FetchSweep(context.Context, lntypes.Hash,
	wire.OutPoint) (*SweepInfo, error) {
```

The context is unnamed and discarded. A one-line comment explaining that the mock simulates Postgres returning an error after the caller cancels would improve readability for future maintainers.

Summary

The fix is correct and addresses the root cause. Three items are worth addressing before merge.
@claude review this again
Code Review: sweepbatcher: harden AddSweep against ctx closure

Overview

This PR correctly fixes a real shutdown race: when a caller cancels their context during AddSweep setup, backend-specific errors could previously escape instead of a normal cancellation error.

Issues

1. Misleading error wrapping when substituting the exit error (minor-to-medium)

The current pattern for every error site is:

```go
if err != nil {
	exitErr := b.addSweepExitErr(ctx)
	if exitErr != nil {
		err = exitErr
	}
	return fmt.Errorf("someOperation failed: %w", err)
}
```

When the exit error is substituted, it still gets wrapped in the "someOperation failed" message, which misattributes the cancellation to the failed operation.

The cleaner pattern (which also removes the need to re-assign err) is:

```go
if err != nil {
	if exitErr := b.addSweepExitErr(ctx); exitErr != nil {
		return exitErr
	}
	return fmt.Errorf("someOperation failed: %w", err)
}
```

This is also what the final <-ctx.Done() case effectively does.
@claude review this again
Code Review: sweepbatcher: harden AddSweep against ctx closure

Overview

This PR fixes a real, observable shutdown race in AddSweep.

Correctness

The coverage of error sites is complete.

Priority order in addSweepExitErrIfAny is sensible.

Minor observation:

```go
case <-ctx.Done():
	return b.addSweepExitErrIfAny(ctx)
```

Because addSweepExitErrIfAny is guaranteed to return a non-nil error once the context is done, this case no longer needs a separate fallback return.

Nit: unchecked race between successful operations

The guards are only installed after failing calls. If a context cancellation happens between two successful calls, the next context-sensitive call can still fail with a backend-specific error before any guard runs.

Test Coverage

The new test uses a fetcher that cancels the caller's context before failing:

```go
func (f *cancelingSweepFetcher) FetchSweep(...) (*SweepInfo, error) {
	f.cancel()
	return nil, driver.ErrBadConn
}
```

This is clean and correctly asserts both that AddSweep returns context.Canceled and that driver.ErrBadConn does not escape.

The test follows project conventions: a lowercase test helper run through the shared test runner.

One improvement worth considering: the other four guarded error paths (…) lack equivalent dedicated regression coverage.

Summary

The fix is correct, well-scoped, and directly addresses the production-observed bug. The test provides meaningful regression coverage. No blocking issues. The one optional improvement is adding a comment to the test about which code path it covers.

Approved with the optional suggestion above.
Use the test context in the fetch-cancellation regression test instead of context.Background.
Document the fetch-cancellation test helper and its regression test. These comments keep the intent of the cancellation race fixture local to the symbols that implement it.
Rename addSweepExitErrIfAny to shutdownOrCancelErrIfAny. The helper is now shared by caller-facing batcher methods, so the old AddSweep-specific name no longer describes its role. This keeps the rename separate from the presign cancellation handling.
Log the original AddSweep setup error before returning a shutdown or cancellation error. The shutdown helper intentionally masks backend-specific errors during normal shutdown, but the discarded error can still be useful when debugging why the shutdown branch was taken.
starius left a comment:
LGTM!
Thanks for fixing sweepbatcher!
I pushed fixup commits fixing some issues found in the code, as well as other commits fixing similar issues in Run and in the Presign flow, and added a forgotten check in TestSweepBatcherCloseDuringAdding.
Could you check the added commits, please? If you accept the changes, please squash the fixup! commits into your commit.
type cancelingSweepFetcher struct {
func testAddSweepReturnsContextErrorOnFetchCancellation(t *testing.T,
```go
sweeps, err := b.fetchSweeps(ctx, *sweepReq)
if err != nil {
	if exitErr := b.addSweepExitErrIfAny(ctx); exitErr != nil {
		return exitErr
```
I propose to add an Info log here and in similar cases, so the original error is not thrown away completely, in case it is important for debugging.
AddSweep now masks cancellation races during its caller-side setup, but a request can still be accepted by the batcher run loop before the run context cancellation is selected. If handleSweeps then performs a context-sensitive store call while shutdown is in progress, a backend-specific error such as driver.ErrBadConn can still bubble out through Batcher.Run. Prefer the run context's terminal error when startup, handleSweeps, or asynchronous errChan failures happen after the run context has been canceled. This keeps normal shutdown reporting consistent and avoids surfacing backend driver errors from the accepted-request and async error paths. Log the original run-loop error before returning the context error so normal shutdown remains debuggable without changing the returned error. Add regression tests for both covered run-loop races. One lets AddSweep successfully hand a sweep request to Run, then cancels the run context from the second GetSweepStatus call while returning driver.ErrBadConn. The other queues an errChan error from the event loop while canceling the run context. Both tests assert that Run returns context.Canceled and does not wrap the driver error.
PresignSweepsGroup uses context-sensitive wallet and presigned-helper calls, but it previously returned their raw wrapped errors even when the caller context or batcher shutdown state had already become terminal. That leaves backend/helper errors visible during normal cancellation. Check for shutdown/cancellation before presigning and after fee lookup or presigning failures, preferring context.Canceled or ErrBatcherShuttingDown over lower-level errors. Log the original presign-path error before returning the shutdown or cancellation error so normal shutdown remains debuggable without changing the returned error. Add a regression test with a presigned helper that cancels the caller context while returning driver.ErrBadConn from SignTx. The test asserts PresignSweepsGroup reports context.Canceled and does not wrap the driver error, and runs against both mock and SQL-backed stores.
TestSweepBatcherCloseDuringAdding previously started Batcher.Run in a detached goroutine and called test assertions from that goroutine. The test only waited for the add/cancel workers, so a Run-side shutdown error could be missed or reported unreliably. Route the Run result through a channel and wait for it in the main test goroutine. While waiting, keep draining spend registrations so shutdown cannot deadlock on mock notifier traffic. This makes the existing shutdown-race test cover both sides of the race: AddSweep callers may exit with cancellation, and Batcher.Run must also terminate with an expected shutdown error.
This PR fixes a shutdown race in Batcher.AddSweep where caller cancellation during setup could surface backend-specific errors, such as Postgres returning driver: bad connection, instead of a normal cancellation/shutdown error.
AddSweep now re-checks the batcher/caller shutdown state when setup calls fail and returns context.Canceled or ErrBatcherShuttingDown when appropriate. A regression test covers cancellation during sweep fetching so driver-level errors do not escape during normal shutdown.