auctioneer: reconnect on EOF instead of silently leaving the subscribe stream dead#518
Merged
Merged
Conversation
Contributor
Author
|
Adding |
Roasbeef
reviewed
May 20, 2026
Roasbeef
approved these changes
May 20, 2026
Member
Roasbeef
left a comment
There was a problem hiding this comment.
Exp backoff already in place
LGTM 🕊️
This was referenced May 20, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Treat
io.EOFon theSubscribeBatchAuctionstream the same as any other transport-level stream error: surface it asErrServerErroredso the rpcserver consumer triggers a reconnect viaHandleServerShutdown. Previously EOF was emitted as a separateErrServerShutdownsentinel that the consumer silently ignored, on the (incorrect) assumption that the client had already scheduled its own reconnect.The result was a permanently dead subscription stream after any clean stream close, with the auctioneer filtering the trader's orders out of matching as "offline" until the process was restarted.
Background
auctioneer.Client.readIncomingStreamdistinguishes two stream-failure cases:transport is closing,Unavailable, etc.) emitErrServerErroredtoStreamErrChan. The consumer inrpcserver.gosees this and callsHandleServerShutdown, which closes the dead stream, re-dials, and re-runsStartAccountSubscriptionfor every previously subscribed account.ErrServerShutdown, which the consumer treated as a no-op with the comment "the client has already scheduled a restart."The comment was only accurate for one specific code path: when the auctioneer explicitly sends a
SubscribeError_SERVER_SHUTDOWNapplication message, the read loop handles it inline (callsHandleServerShutdown(nil)directly and returns). For any transport-level EOF / proxy / load-balancer idle timeout, gRPCMaxConnectionAge, server stream handler returning for some other reason, server crash without graceful shutdown etc., no reconnect was ever scheduled.Compounding this,
closeStreamwas also never called on the EOF path, so any subsequent subscribe attempt from elsewhere short-circuited at the "already subscribed" guard inconnectAndAuthenticatewithout ever sending a fresh Commit message on a new stream.In production this manifested as: the trader's persistent stream died, the client process kept running and continued serving unary RPCs (so account / order CLI calls worked normally), but every batch attempt flagged
offline traderagainst that account until thepooldprocess was restarted.Tests
Adds
auctioneer/client_test.gowith three table-style tests againstreadIncomingStreamdriven by a fake stream:TestReadIncomingStreamEOFTriggersReconnect— the regression case; fails on pre-fix code withexpected ErrServerErrored on EOF, got: server shutting down.TestReadIncomingStreamTransportErrorTriggersReconnect— locks in the pre-existing behaviour for non-EOF transport errors so the two paths stay unified.TestReadIncomingStreamContextCanceledDoesNotReconnect— asserts client-initiated cancels don't double-trigger reconnect.