Summary
When a send/recv operation times out in the TCP transport, the resulting IoException doesn't say which ranks were involved. In large distributed jobs this makes it very hard to tell which peer is actually stuck. We should include the local rank and the remote rank(s) being waited on in the timeout message.
Current behavior
A recv timeout currently surfaces as:
[trainer22|6]:RuntimeError: [gloo/transport/tcp/unbound_buffer.cc:81] Timed out waiting 1800000ms for recv operation to complete
The [trainer22|6] prefix is added by the launcher (e.g. PyTorch / torchrun), not by Gloo, and the [gloo/.../unbound_buffer.cc:81] prefix comes from the GLOO_ERROR_MSG macro. Gloo's own message — Timed out waiting 1800000ms for recv operation to complete — carries no rank context. With thousands of ranks and only partial stderr, you can't tell which peer this rank was waiting on.
The two messages live in gloo/transport/tcp/unbound_buffer.cc:
- UnboundBuffer::waitRecv() timeout
- UnboundBuffer::waitSend() timeout
Proposed behavior
Include the local rank and the remote rank(s) being waited on, e.g.:
Rank 6 timed out after 1800000ms waiting for recv from rank 22 (slot 7)
and for recv-from-any (multiple eligible sources):
Rank 6 timed out after 1800000ms waiting for recv from any of ranks [4, 5, 22] (slot 7)
(send is symmetric.) Even without the launcher's [host|rank] prefix, the message alone then identifies both ends of the stuck transfer.
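To make the proposed format concrete, here is a minimal sketch of a formatter that produces the two example strings above. The name formatRankTimeout and its signature are hypothetical (nothing like it exists in Gloo today), and the "send to rank N" wording for the send case is an assumption, since only the recv wording is spelled out here.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper (not an existing Gloo function): renders the proposed
// timeout text for a pending send or recv.
std::string formatRankTimeout(
    int localRank,
    std::chrono::milliseconds timeout,
    bool isRecv,
    const std::vector<int>& peers,
    std::uint64_t slot) {
  assert(!peers.empty());
  std::ostringstream ss;
  ss << "Rank " << localRank << " timed out after " << timeout.count()
     << "ms waiting for " << (isRecv ? "recv from " : "send to ");
  if (peers.size() == 1) {
    ss << "rank " << peers.front();
  } else {
    // Recv-from-any: there is no single peer, so list every candidate.
    ss << "any of ranks [";
    for (std::size_t i = 0; i < peers.size(); ++i) {
      ss << (i == 0 ? "" : ", ") << peers[i];
    }
    ss << "]";
  }
  ss << " (slot " << slot << ")";
  return ss.str();
}
```

Taking the peers as a vector lets one code path cover both the single-peer and recv-from-any cases; in the actual change the text would presumably be built inline in the GLOO_ERROR_MSG(...) calls rather than in a separate function.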
Implementation sketch
All the data needed is already reachable from tcp::UnboundBuffer, so the change is confined to gloo/transport/tcp/unbound_buffer.{h,cc} and needs no cross-transport interface change.
Local rank — free. UnboundBuffer already holds a std::shared_ptr<tcp::Context> context_, and the base transport::Context stores const int rank / const int size. So context_->rank can be dropped into both messages directly, with no plumbing.
Remote rank — small plumb. The remote rank is known when the op is issued, but it isn't recorded on the buffer before the wait — recvRank_ / sendRank_ stay -1 until completion (they're only set in handleRecvCompletion / handleSendCompletion, which never fire on a timeout). Fix: record the target rank(s) on the buffer when the op starts:
- UnboundBuffer::send(int dstRank, ...) → stash dstRank
- UnboundBuffer::recv(int srcRank, ...) → stash srcRank
- UnboundBuffer::recv(std::vector<int> srcRanks, ...) → stash the vector (recv-from-any has no single peer, so print the candidate set)
Then include the stashed rank(s), context_->rank, and the slot in the two timeout GLOO_ERROR_MSG(...) calls. The slot is especially useful for correlating the two sides of a hang.
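The stash-then-report idea can be sketched on a simplified stand-in for the buffer. Everything below is illustrative rather than Gloo code: MockUnboundBuffer, pendingRecvRanks_ and pendingRecvSlot_ are hypothetical names, localRank_ stands in for context_->rank, std::runtime_error stands in for the IoException built with GLOO_ERROR_MSG(...), and only the recv side is shown.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <sstream>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Simplified stand-in for tcp::UnboundBuffer; this is not the real Gloo class.
class MockUnboundBuffer {
 public:
  explicit MockUnboundBuffer(int localRank) : localRank_(localRank) {}

  // Single-source recv: stash the one peer when the op is issued.
  void recv(int srcRank, std::uint64_t slot) {
    recv(std::vector<int>{srcRank}, slot);
  }

  // Recv-from-any: stash the whole candidate set.
  void recv(std::vector<int> srcRanks, std::uint64_t slot) {
    assert(!srcRanks.empty());
    std::lock_guard<std::mutex> guard(m_);
    pendingRecvRanks_ = std::move(srcRanks);
    pendingRecvSlot_ = slot;
  }

  // Completion callback; on a timeout it never runs, so recvRank_ stays -1.
  void handleRecvCompletion(int rank) {
    {
      std::lock_guard<std::mutex> guard(m_);
      recvRank_ = rank;
      ++recvCompletions_;
    }
    cv_.notify_one();
  }

  // Reports the completing peer through `rank`; throws on timeout.
  void waitRecv(int* rank, std::chrono::milliseconds timeout) {
    std::unique_lock<std::mutex> lock(m_);
    const bool done =
        cv_.wait_for(lock, timeout, [this] { return recvCompletions_ > 0; });
    if (!done) {
      // Stand-in for throwing IoException with GLOO_ERROR_MSG(...).
      throw std::runtime_error(recvTimeoutMessage(timeout));
    }
    --recvCompletions_;
    if (rank != nullptr) {
      *rank = recvRank_;
    }
  }

 private:
  // Builds the message from the stashed state; caller holds m_.
  std::string recvTimeoutMessage(std::chrono::milliseconds timeout) const {
    std::ostringstream ss;
    ss << "Rank " << localRank_ << " timed out after " << timeout.count()
       << "ms waiting for recv from ";
    if (pendingRecvRanks_.size() == 1) {
      ss << "rank " << pendingRecvRanks_.front();
    } else {
      ss << "any of ranks [";
      for (std::size_t i = 0; i < pendingRecvRanks_.size(); ++i) {
        ss << (i == 0 ? "" : ", ") << pendingRecvRanks_[i];
      }
      ss << "]";
    }
    ss << " (slot " << pendingRecvSlot_ << ")";
    return ss.str();
  }

  const int localRank_;                // stands in for context_->rank
  std::mutex m_;
  std::condition_variable cv_;
  std::vector<int> pendingRecvRanks_;  // hypothetical new field
  std::uint64_t pendingRecvSlot_ = 0;  // hypothetical new field
  int recvRank_ = -1;
  int recvCompletions_ = 0;
};
```

The stash is written under the same mutex the wait uses, so the timeout path reads a consistent rank set and slot. The real buffer can have several operations in flight, so an actual implementation would need to decide which pending op the message describes; this sketch tracks only the most recent recv.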
This is consistent with the existing convention: waitRecv / waitSend already report the peer rank via the int* rank out-param on success — we'd just also surface it on the timeout path. There's already precedent for formatting ranks (Rank N, comma-joined peer lists) in tcp::Context::printConnectivityInfo().
Optional follow-ups
- Add a getRemoteRank() accessor on transport::Pair (symmetric with the existing getLocalRank() / setLocalRank()); the TCP Pair already stores const int rank_. That would let the address-based timeout/error messages in pair.cc, which today identify the peer only by peer_.str() (an ip:port, not a rank), also include the rank.
- Apply the same treatment to other transports (e.g. ibverbs) for consistency.