auctioneer: reconnect on EOF instead of silently leaving the subscribe stream dead #518

Merged: Roasbeef merged 2 commits into lightninglabs:master from djkazic:fix-eof-no-reconnect on May 20, 2026
@djkazic (Contributor) commented May 20, 2026

Summary

Treat io.EOF on the SubscribeBatchAuction stream the same as any other transport-level stream error: surface it as ErrServerErrored so the rpcserver consumer triggers a reconnect via HandleServerShutdown. Previously EOF was emitted as a separate ErrServerShutdown sentinel that the consumer silently ignored, on the (incorrect) assumption that the client had already scheduled its own reconnect.

The result was a permanently dead subscription stream after any clean stream close, with the auctioneer filtering the trader's orders out of matching as "offline" until the process was restarted.

Background

auctioneer.Client.readIncomingStream distinguishes two stream-failure cases:

  • Non-EOF errors (transport is closing, Unavailable, etc.) emit ErrServerErrored to StreamErrChan. The consumer in rpcserver.go sees this and calls HandleServerShutdown, which closes the dead stream, re-dials, and re-runs StartAccountSubscription for every previously subscribed account.
  • EOF emitted ErrServerShutdown, which the consumer treated as a no-op with the comment "the client has already scheduled a restart."

The comment was accurate for only one code path: when the auctioneer explicitly sends a SubscribeError_SERVER_SHUTDOWN application message, the read loop handles it inline (it calls HandleServerShutdown(nil) directly and returns). For any other transport-level EOF, whether from a proxy or load-balancer idle timeout, gRPC MaxConnectionAge, the server's stream handler returning for some other reason, or a server crash without graceful shutdown, no reconnect was ever scheduled.
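The two paths described above can be sketched in a few lines. This is a simplified model, not the code in auctioneer/client.go or rpcserver.go: mapRecvErr, consume, and reconnectsOnEOF are hypothetical helpers, the `fixed` flag only exists to compare old and new behaviour side by side, and the message of ErrServerErrored is made up (only the "server shutting down" text appears in this PR).

```go
package main

import (
	"errors"
	"fmt"
	"io"
)

// Sentinel names follow the PR description. The ErrServerErrored message is
// an assumption made for this sketch.
var (
	ErrServerShutdown = errors.New("server shutting down")
	ErrServerErrored  = errors.New("server errored")
)

// mapRecvErr is a hypothetical stand-in for the read loop's error handling.
// With fixed == false, EOF maps to ErrServerShutdown (the old behaviour).
// With fixed == true, EOF is treated like any other stream error.
func mapRecvErr(err error, fixed bool) error {
	if errors.Is(err, io.EOF) && !fixed {
		return ErrServerShutdown
	}
	return ErrServerErrored
}

// consume is a hypothetical stand-in for the consumer of StreamErrChan.
// Only ErrServerErrored triggers the reconnect callback.
func consume(sentinel error, reconnect func()) {
	switch {
	case errors.Is(sentinel, ErrServerErrored):
		reconnect()
	case errors.Is(sentinel, ErrServerShutdown):
		// No-op: assumes the client already scheduled a restart.
	}
}

// reconnectsOnEOF pushes an EOF through both stages and reports whether the
// reconnect callback ran.
func reconnectsOnEOF(fixed bool) bool {
	reconnected := false
	consume(mapRecvErr(io.EOF, fixed), func() { reconnected = true })
	return reconnected
}

func main() {
	fmt.Println("pre-fix reconnects on EOF: ", reconnectsOnEOF(false))
	fmt.Println("post-fix reconnects on EOF:", reconnectsOnEOF(true))
}
```

In this model the pre-fix mapping never reaches the reconnect callback on EOF, which matches the dead-stream symptom described in the summary.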

Compounding this, closeStream was also never called on the EOF path, so any subsequent subscribe attempt from elsewhere short-circuited at the "already subscribed" guard in connectAndAuthenticate without ever sending a fresh Commit message on a new stream.
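A toy model of that guard, assuming it is essentially a "stream handle is non-nil" check, shows why a stale handle blocks every later subscribe attempt. The toyClient type and its connect method are hypothetical simplifications; only the closeStream name is taken from the description above.

```go
package main

import "fmt"

// toyStream stands in for the gRPC subscription stream.
type toyStream struct{ id int }

// toyClient is a hypothetical, heavily simplified client that only tracks
// its stream handle and how many times it has dialed.
type toyClient struct {
	stream *toyStream
	dials  int
}

// connect models the "already subscribed" guard: if a stream handle is
// still set, it returns false without dialing a new stream.
func (c *toyClient) connect() bool {
	if c.stream != nil {
		return false
	}
	c.dials++
	c.stream = &toyStream{id: c.dials}
	return true
}

// closeStream clears the handle so the next connect call dials again.
func (c *toyClient) closeStream() {
	c.stream = nil
}

func main() {
	c := &toyClient{}
	c.connect() // initial subscription

	// The stream dies with EOF, but the handle is never cleared.
	fmt.Println("redial without closeStream:", c.connect())

	// Clearing the handle first lets the guard pass.
	c.closeStream()
	fmt.Println("redial after closeStream:  ", c.connect())
}
```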

In production this manifested as follows: the trader's persistent stream died, the client process kept running and continued serving unary RPCs (so account / order CLI calls worked normally), but every batch attempt flagged that account's trader as offline until the poold process was restarted.

Tests

Adds auctioneer/client_test.go with three table-style tests against readIncomingStream driven by a fake stream:

  • TestReadIncomingStreamEOFTriggersReconnect — the regression case; fails on pre-fix code with expected ErrServerErrored on EOF, got: server shutting down.
  • TestReadIncomingStreamTransportErrorTriggersReconnect — locks in the pre-existing behaviour for non-EOF transport errors so the two paths stay unified.
  • TestReadIncomingStreamContextCanceledDoesNotReconnect — asserts client-initiated cancels don't double-trigger reconnect.
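The fake-stream approach can be illustrated with a scripted Recv and a small case table. This is a standalone sketch rather than the contents of auctioneer/client_test.go: fakeStream, readLoop, runCase, and the case table are hypothetical, messages are modelled as plain strings, and cancellation is modelled as context.Canceled (a real gRPC stream would more likely surface a status error).

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"io"
)

// ErrServerErrored mirrors the sentinel named in the PR. The message is an
// assumption made for this sketch.
var ErrServerErrored = errors.New("server errored")

// fakeStream replays a script: a nil entry yields a message, a non-nil
// entry is returned as the Recv error. An exhausted script yields io.EOF.
type fakeStream struct{ script []error }

func (f *fakeStream) Recv() (string, error) {
	if len(f.script) == 0 {
		return "", io.EOF
	}
	err := f.script[0]
	f.script = f.script[1:]
	if err != nil {
		return "", err
	}
	return "msg", nil
}

// readLoop is a simplified stand-in for readIncomingStream: it reports
// ErrServerErrored for every stream error except a client-side cancel.
func readLoop(s *fakeStream, errChan chan<- error) {
	for {
		_, err := s.Recv()
		switch {
		case err == nil:
			continue
		case errors.Is(err, context.Canceled):
			return
		default:
			errChan <- ErrServerErrored
			return
		}
	}
}

// runCase reports whether the loop signalled a reconnect for the script.
func runCase(script []error) bool {
	errChan := make(chan error, 1)
	readLoop(&fakeStream{script: script}, errChan)
	select {
	case err := <-errChan:
		return errors.Is(err, ErrServerErrored)
	default:
		return false
	}
}

// cases mirrors the three scenarios listed above.
var cases = []struct {
	name          string
	script        []error
	wantReconnect bool
}{
	{"EOF after two messages", []error{nil, nil, io.EOF}, true},
	{"transport error", []error{errors.New("transport is closing")}, true},
	{"context canceled", []error{context.Canceled}, false},
}

func main() {
	for _, tc := range cases {
		got := runCase(tc.script)
		fmt.Printf("%-24s reconnect=%v want=%v\n", tc.name, got, tc.wantReconnect)
	}
}
```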

@djkazic (Contributor, Author) commented May 20, 2026

Adding jitter for reconnects to avoid thundering herds (the motive of the original change that introduced this behavior).
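For readers unfamiliar with the idea, jitter on top of exponential backoff might look roughly like the sketch below. It is not the code added in this PR: jitteredBackoff, newRNG, baseDelay, and maxDelay are hypothetical names and values chosen for illustration, and the "half fixed, half random" split is just one common variant.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// Illustrative values only; the real client's settings are not shown here.
const (
	baseDelay = time.Second
	maxDelay  = time.Minute
)

// newRNG returns a seeded generator so callers can make runs repeatable.
func newRNG(seed int64) *rand.Rand {
	return rand.New(rand.NewSource(seed))
}

// jitteredBackoff doubles base once per attempt, caps the result at max,
// and then returns a random duration in [d/2, d]. Spreading the wait this
// way keeps many clients from reconnecting at the same instant.
func jitteredBackoff(attempt int, base, max time.Duration, rng *rand.Rand) time.Duration {
	d := base
	for i := 0; i < attempt && d < max; i++ {
		d *= 2
	}
	if d > max {
		d = max
	}
	half := d / 2
	return half + time.Duration(rng.Int63n(int64(half)+1))
}

func main() {
	rng := newRNG(time.Now().UnixNano())
	for attempt := 0; attempt < 8; attempt++ {
		wait := jitteredBackoff(attempt, baseDelay, maxDelay, rng)
		fmt.Printf("attempt %d: wait %v\n", attempt, wait)
	}
}
```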

Comment thread auctioneer/client.go
@Roasbeef (Member) left a comment

Exp backoff already in place

LGTM 🕊️

@Roasbeef Roasbeef merged commit 2dd0a29 into lightninglabs:master May 20, 2026
6 checks passed
@djkazic djkazic deleted the fix-eof-no-reconnect branch May 20, 2026 22:29