TCP transport: include rank information in send/recv timeout error messages #511

Description

@d4l3k

Summary

When a send/recv operation times out in the TCP transport, the resulting IoException doesn't say which ranks were involved. In large distributed jobs this makes it very hard to tell which peer is actually stuck. We should include the local rank and the remote rank(s) being waited on in the timeout message.

Current behavior

A recv timeout currently surfaces as:

[trainer22|6]:RuntimeError: [gloo/transport/tcp/unbound_buffer.cc:81] Timed out waiting 1800000ms for recv operation to complete

The [trainer22|6] prefix is added by the launcher (e.g. PyTorch / torchrun), not by Gloo, and the [gloo/.../unbound_buffer.cc:81] prefix comes from the GLOO_ERROR_MSG macro. Gloo's own message — Timed out waiting 1800000ms for recv operation to complete — carries no rank context. With thousands of ranks and only partial stderr, you can't tell which peer this rank was waiting on.

Both timeout messages (one for send, one for recv) are emitted from gloo/transport/tcp/unbound_buffer.cc.

Proposed behavior

Include the local rank and the remote rank(s) being waited on, e.g.:

Rank 6 timed out after 1800000ms waiting for recv from rank 22 (slot 7)

and for recv-from-any (multiple eligible sources):

Rank 6 timed out after 1800000ms waiting for recv from any of ranks [4, 5, 22] (slot 7)

(send is symmetric.) Even without the launcher's [host|rank] prefix, the message alone then identifies both ends of the stuck transfer.
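To make the proposed wording concrete, here is a minimal sketch of a formatter that produces the two messages above. `formatTimeoutMessage` is a hypothetical helper, not an existing Gloo function, and the "send to rank N" wording for the send side is an assumption, since the issue only says send is symmetric.

```cpp
#include <cassert>
#include <chrono>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper (not part of Gloo): builds the proposed timeout text.
// `op` is "recv" or "send"; `peers` holds the remote rank(s) being waited on.
std::string formatTimeoutMessage(
    const std::string& op,
    int localRank,
    const std::vector<int>& peers,
    std::chrono::milliseconds timeout,
    unsigned long long slot) {
  assert(!peers.empty());  // an op always has at least one candidate peer
  std::ostringstream ss;
  ss << "Rank " << localRank << " timed out after " << timeout.count()
     << "ms waiting for " << op
     // Assumed wording: "recv from ..." vs. "send to ...".
     << (op == "recv" ? " from " : " to ");
  if (peers.size() == 1) {
    ss << "rank " << peers.front();
  } else {
    // recv-from-any: there is no single peer, so print the candidate set.
    ss << "any of ranks [";
    const char* sep = "";
    for (int r : peers) {
      ss << sep << r;
      sep = ", ";
    }
    ss << "]";
  }
  ss << " (slot " << slot << ")";
  return ss.str();
}
```

Calling `formatTimeoutMessage("recv", 6, {4, 5, 22}, std::chrono::milliseconds(1800000), 7)` yields the recv-from-any message shown above.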

Implementation sketch

All the data needed is already reachable from tcp::UnboundBuffer, so the change is confined to gloo/transport/tcp/unbound_buffer.{h,cc} and needs no cross-transport interface change.

Local rank — free. UnboundBuffer already holds a std::shared_ptr<tcp::Context> context_, and the base transport::Context stores const int rank / const int size. So context_->rank can be dropped into both messages directly, with no plumbing.

Remote rank — small plumb. The remote rank is known when the op is issued, but it isn't recorded on the buffer before the wait — recvRank_ / sendRank_ stay -1 until completion (they're only set in handleRecvCompletion / handleSendCompletion, which never fire on a timeout). Fix: record the target rank(s) on the buffer when the op starts:

  • UnboundBuffer::send(int dstRank, ...) → stash dstRank
  • UnboundBuffer::recv(int srcRank, ...) → stash srcRank
  • UnboundBuffer::recv(std::vector<int> srcRanks, ...) → stash the vector (recv-from-any has no single peer, so print the candidate set)

Then include the stashed rank(s), context_->rank, and the slot in the two timeout GLOO_ERROR_MSG(...) calls. The slot is especially useful for correlating the two sides of a hang.
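The recv side of this plumbing could look roughly like the standalone toy below. It is a sketch, not the real class: `UnboundBufferSketch`, `pendingRecvRanks_`, `pendingRecvSlot_`, and `recvTimeoutMessage` are hypothetical names, the signatures are simplified compared with Gloo's, and it throws `std::runtime_error` instead of going through `GLOO_ERROR_MSG`.

```cpp
#include <chrono>
#include <condition_variable>
#include <cstdint>
#include <mutex>
#include <sstream>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Toy stand-in for tcp::UnboundBuffer; all names here are illustrative.
class UnboundBufferSketch {
 public:
  explicit UnboundBufferSketch(int localRank) : localRank_(localRank) {}

  // Stash the candidate source ranks and slot when the recv is issued,
  // so they are still known if the wait later times out.
  void recv(std::vector<int> srcRanks, uint64_t slot) {
    std::lock_guard<std::mutex> lock(m_);
    pendingRecvRanks_ = std::move(srcRanks);
    pendingRecvSlot_ = slot;
  }

  // Single-source recv funnels into the vector form.
  void recv(int srcRank, uint64_t slot) {
    recv(std::vector<int>{srcRank}, slot);
  }

  // Would be called by the (omitted) transport when data arrives.
  void handleRecvCompletion(int rank) {
    {
      std::lock_guard<std::mutex> lock(m_);
      recvRank_ = rank;
      recvCompletions_++;
    }
    cv_.notify_one();
  }

  // On success reports the peer via `rank`; on timeout throws with context.
  void waitRecv(int* rank, std::chrono::milliseconds timeout) {
    std::unique_lock<std::mutex> lock(m_);
    const bool done =
        cv_.wait_for(lock, timeout, [&] { return recvCompletions_ > 0; });
    if (!done) {
      throw std::runtime_error(recvTimeoutMessage(timeout));
    }
    recvCompletions_--;
    if (rank != nullptr) {
      *rank = recvRank_;
    }
  }

 private:
  // Caller must hold m_.
  std::string recvTimeoutMessage(std::chrono::milliseconds timeout) const {
    std::ostringstream ss;
    ss << "Rank " << localRank_ << " timed out after " << timeout.count()
       << "ms waiting for recv from ";
    if (pendingRecvRanks_.size() == 1) {
      ss << "rank " << pendingRecvRanks_.front();
    } else {
      ss << "any of ranks [";
      const char* sep = "";
      for (int r : pendingRecvRanks_) {
        ss << sep << r;
        sep = ", ";
      }
      ss << "]";
    }
    ss << " (slot " << pendingRecvSlot_ << ")";
    return ss.str();
  }

  const int localRank_;
  std::mutex m_;
  std::condition_variable cv_;
  int recvCompletions_{0};
  int recvRank_{-1};  // only set on completion, as in the issue's description
  std::vector<int> pendingRecvRanks_;  // hypothetical: set when recv is issued
  uint64_t pendingRecvSlot_{0};        // hypothetical: set when recv is issued
};
```

The send path would mirror this with a single stashed destination rank. Keeping the stashed ranks separate from `recvRank_` leaves the success-path semantics of the out-param unchanged.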

This is consistent with the existing convention: waitRecv / waitSend already report the peer rank via the int* rank out-param on success — we'd just also surface it on the timeout path. There's already precedent for formatting ranks (Rank N, comma-joined peer lists) in tcp::Context::printConnectivityInfo().

Optional follow-ups

  • Add a getRemoteRank() accessor on transport::Pair (symmetric with the existing getLocalRank() / setLocalRank()); the TCP Pair already stores const int rank_. That would let the address-based timeout/error messages in pair.cc — which today identify the peer only by peer_.str() (an ip:port, not a rank) — also include the rank.
  • Apply the same treatment to other transports (e.g. ibverbs) for consistency.
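As a rough illustration of the first follow-up, the accessor and a rank-aware peer description might look like the sketch below. `PairSketch` and `describePeer` are hypothetical, the address string stands in for whatever `peer_.str()` returns, and the example address is made up.

```cpp
#include <sstream>
#include <string>
#include <utility>

// Hypothetical stand-in for the TCP Pair, reduced to what this sketch needs.
class PairSketch {
 public:
  PairSketch(int remoteRank, std::string peerAddress)
      : rank_(remoteRank), peerAddress_(std::move(peerAddress)) {}

  // Proposed accessor: exposes the remote rank the pair already stores.
  int getRemoteRank() const {
    return rank_;
  }

  // Peer description for error messages: rank first, then ip:port.
  std::string describePeer() const {
    std::ostringstream ss;
    ss << "rank " << getRemoteRank() << " (" << peerAddress_ << ")";
    return ss.str();
  }

 private:
  const int rank_;
  const std::string peerAddress_;  // stand-in for peer_.str()
};
```

With this, `PairSketch(22, "192.0.2.10:29500").describePeer()` returns `rank 22 (192.0.2.10:29500)`, so a message in pair.cc could name the peer by rank as well as by address.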
