sweepbatcher: harden AddSweep against ctx closure (#1130)
Conversation
Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request addresses a race condition in the sweepbatcher's AddSweep path.
Code Review
This pull request enhances the AddSweep method in the Batcher to ensure that context cancellation or batcher shutdown errors are prioritized when internal operations fail. It introduces a helper method addSweepExitErr to unify this logic and adds a unit test to verify the behavior. Feedback suggests simplifying the repetitive error handling blocks by re-assigning the error variable and streamlining the ctx.Done() case, as the helper function is guaranteed to return an error in that scenario.
@claude review this
Code Review: sweepbatcher: harden AddSweep against ctx closure

Overview

This PR fixes a real shutdown race in AddSweep.

Issues

1. Dead code in the final <-ctx.Done() case:

```go
case <-ctx.Done():
	err := b.addSweepExitErr(ctx)
	if err != nil {
		return err
	}
	return ctx.Err() // ← unreachable
```

Inside the <-ctx.Done() case, addSweepExitErr is guaranteed to return a non-nil error, so the final return ctx.Err() is unreachable.

2. The fix covers the main error sites, but not the presigned branch:

```go
if sweep.presigned && !fullyConfirmed {
	err := ensurePresigned(ctx, sweeps, b.presignedHelper, ...)
	if err != nil {
		// no addSweepExitErr check here
		return fmt.Errorf(...)
	}
}
```

The presigned path is less common, but it carries the same potential to leak backend errors during shutdown.

3. Regression test doesn't run against the real Postgres backend

The new test does not go through the suite's shared runner:

```go
func runTests(t *testing.T, testFn func(t *testing.T, store testStore,
	batcherStore testBatcherStore)) {

	t.Run("mocks", ...)
	t.Run("loopdb", ...) // real Postgres
}
```

Since the original failure was triggered specifically by Postgres returning driver: bad connection, running the regression test through runTests would exercise the real backend as well.

4. The mock fetcher's behavior is undocumented:

```go
func (f *cancelingSweepFetcher) FetchSweep(context.Context, lntypes.Hash,
	wire.OutPoint) (*SweepInfo, error) {
```

The context is unnamed and discarded. A one-line comment explaining that the mock simulates Postgres returning an error after the caller cancels would improve readability for future maintainers.

Summary

The fix is correct and addresses the root cause. Three items are worth addressing before merge.
@claude review this again
Code Review: sweepbatcher: harden AddSweep against ctx closure

Overview

This PR correctly fixes a real shutdown race: when a caller cancels their context during AddSweep setup, backend-specific errors could previously escape instead of a normal cancellation error.

Issues

1. Misleading error wrapping when substituting the exit error (minor-to-medium)

The current pattern for every error site is:

```go
if err != nil {
	exitErr := b.addSweepExitErr(ctx)
	if exitErr != nil {
		err = exitErr
	}
	return fmt.Errorf("someOperation failed: %w", err)
}
```

When the exit error is substituted, it still gets wrapped in the "someOperation failed" message, which misattributes the cancellation to the failed operation.

The cleaner pattern (which also removes the need to re-assign err) is:

```go
if err != nil {
	if exitErr := b.addSweepExitErr(ctx); exitErr != nil {
		return exitErr
	}
	return fmt.Errorf("someOperation failed: %w", err)
}
```

This is also what the final <-ctx.Done() case effectively does.
@claude review this again
Code Review: sweepbatcher: harden AddSweep against ctx closure

Overview

This PR fixes a real, observable shutdown race in AddSweep.

Correctness

The coverage of error sites is complete.

Priority order in addSweepExitErrIfAny is sensible.

Minor observation:

```go
case <-ctx.Done():
	return b.addSweepExitErrIfAny(ctx)
```

Because addSweepExitErrIfAny is guaranteed to return a non-nil error once the context is done, this case no longer needs a separate fallback return.

Nit: unchecked race between successful operations

The guards are only installed after failing calls. If a context cancellation happens between two successful calls, the next context-sensitive call can still fail with a backend-specific error before any guard runs.

Test Coverage

The new test uses a fetcher that cancels the caller's context before failing:

```go
func (f *cancelingSweepFetcher) FetchSweep(...) (*SweepInfo, error) {
	f.cancel()
	return nil, driver.ErrBadConn
}
```

This is clean and correctly asserts both that AddSweep returns context.Canceled and that driver.ErrBadConn does not escape.

The test follows project conventions: a lowercase test helper run through the shared test runner.

One improvement worth considering: the other four guarded error paths (…) lack equivalent dedicated regression coverage.

Summary

The fix is correct, well-scoped, and directly addresses the production-observed bug. The test provides meaningful regression coverage. No blocking issues. The one optional improvement is adding a comment to the test about which code path it covers.

Approved with the optional suggestion above.
Use the test context in the fetch-cancellation regression test instead of context.Background.
Document the fetch-cancellation test helper and its regression test. These comments keep the intent of the cancellation race fixture local to the symbols that implement it.
Rename addSweepExitErrIfAny to shutdownOrCancelErrIfAny. The helper is now shared by caller-facing batcher methods, so the old AddSweep-specific name no longer describes its role. This keeps the rename separate from the presign cancellation handling.
Log the original AddSweep setup error before returning a shutdown or cancellation error. The shutdown helper intentionally masks backend-specific errors during normal shutdown, but the discarded error can still be useful when debugging why the shutdown branch was taken.
starius left a comment:
LGTM!
Thanks for fixing sweepbatcher!
I pushed fixup commits fixing some issues found in the code, as well as other commits fixing similar issues in Run and in the Presign flow, and added a forgotten check in TestSweepBatcherCloseDuringAdding.
Could you check the added commits, please? If you accept the changes, please squash the fixup! commits into your commit.
type cancelingSweepFetcher struct {
func testAddSweepReturnsContextErrorOnFetchCancellation(t *testing.T,
```go
sweeps, err := b.fetchSweeps(ctx, *sweepReq)
if err != nil {
	if exitErr := b.addSweepExitErrIfAny(ctx); exitErr != nil {
		return exitErr
```
I propose to add an Info log here and in similar cases, so the original error is not thrown away completely, in case it is important for debugging.
AddSweep now masks cancellation races during its caller-side setup, but a request can still be accepted by the batcher run loop before the run context cancellation is selected. If handleSweeps then performs a context-sensitive store call while shutdown is in progress, a backend-specific error such as driver.ErrBadConn can still bubble out through Batcher.Run. Prefer the run context's terminal error when startup, handleSweeps, or asynchronous errChan failures happen after the run context has been canceled. This keeps normal shutdown reporting consistent and avoids surfacing backend driver errors from the accepted-request and async error paths. Log the original run-loop error before returning the context error so normal shutdown remains debuggable without changing the returned error. Add regression tests for both covered run-loop races. One lets AddSweep successfully hand a sweep request to Run, then cancels the run context from the second GetSweepStatus call while returning driver.ErrBadConn. The other queues an errChan error from the event loop while canceling the run context. Both tests assert that Run returns context.Canceled and does not wrap the driver error.
PresignSweepsGroup uses context-sensitive wallet and presigned-helper calls, but it previously returned their raw wrapped errors even when the caller context or batcher shutdown state had already become terminal. That leaves backend/helper errors visible during normal cancellation. Check for shutdown/cancellation before presigning and after fee lookup or presigning failures, preferring context.Canceled or ErrBatcherShuttingDown over lower-level errors. Log the original presign-path error before returning the shutdown or cancellation error so normal shutdown remains debuggable without changing the returned error. Add a regression test with a presigned helper that cancels the caller context while returning driver.ErrBadConn from SignTx. The test asserts PresignSweepsGroup reports context.Canceled and does not wrap the driver error, and runs against both mock and SQL-backed stores.
TestSweepBatcherCloseDuringAdding previously started Batcher.Run in a detached goroutine and called test assertions from that goroutine. The test only waited for the add/cancel workers, so a Run-side shutdown error could be missed or reported unreliably. Route the Run result through a channel and wait for it in the main test goroutine. While waiting, keep draining spend registrations so shutdown cannot deadlock on mock notifier traffic. This makes the existing shutdown-race test cover both sides of the race: AddSweep callers may exit with cancellation, and Batcher.Run must also terminate with an expected shutdown error.
This PR fixes a shutdown race in Batcher.AddSweep where caller cancellation during setup could surface backend-specific errors, such as Postgres returning driver: bad connection, instead of a normal cancellation/shutdown error.
AddSweep now re-checks the batcher/caller shutdown state when setup calls fail and returns context.Canceled or ErrBatcherShuttingDown when appropriate. A regression test covers cancellation during sweep fetching so driver-level errors do not escape during normal shutdown.