TCP transport: include rank information in send/recv timeout error messages

## Summary

When a send/recv operation times out in the TCP transport, the resulting `IoException` doesn't say *which* ranks were involved. In large distributed jobs this makes it very hard to tell which peer is actually stuck. We should include the **local rank** and the **remote rank(s)** being waited on in the timeout message.

## Current behavior

A recv timeout currently surfaces as:

```
[trainer22|6]:RuntimeError: [gloo/transport/tcp/unbound_buffer.cc:81] Timed out waiting 1800000ms for recv operation to complete
```

The `[trainer22|6]` prefix is added by the launcher (e.g. PyTorch / torchrun), not by Gloo, and the `[gloo/.../unbound_buffer.cc:81]` prefix comes from the `GLOO_ERROR_MSG` macro. Gloo's own message — `Timed out waiting 1800000ms for recv operation to complete` — carries **no rank context**. With thousands of ranks and only partial stderr, you can't tell which peer this rank was waiting on.

The two messages live in [`gloo/transport/tcp/unbound_buffer.cc`](https://github.com/pytorch/gloo/blob/main/gloo/transport/tcp/unbound_buffer.cc):

- [`UnboundBuffer::waitRecv()` timeout](https://github.com/pytorch/gloo/blob/main/gloo/transport/tcp/unbound_buffer.cc#L78-L82)
- [`UnboundBuffer::waitSend()` timeout](https://github.com/pytorch/gloo/blob/main/gloo/transport/tcp/unbound_buffer.cc#L129-L133)

## Proposed behavior

Include the local rank and the remote rank(s) being waited on, e.g.:

```
Rank 6 timed out after 1800000ms waiting for recv from rank 22 (slot 7)
```

and for recv-from-any (multiple eligible sources):

```
Rank 6 timed out after 1800000ms waiting for recv from any of ranks [4, 5, 22] (slot 7)
```

The send message is symmetric. Even without the launcher's `[host|rank]` prefix, the message alone then identifies both ends of the stuck transfer.
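To make the two formats above concrete, here is a minimal sketch of a formatter that produces them. `formatTimeoutMessage` is a hypothetical helper, not an existing Gloo function, and the `send to rank N` wording for the send side is an assumption (the proposal only says send is symmetric).

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper (not part of Gloo): builds the proposed timeout text.
// `op` is "send" or "recv"; `peers` holds the rank(s) being waited on.
std::string formatTimeoutMessage(
    const std::string& op,
    int localRank,
    long timeoutMs,
    const std::vector<int>& peers,
    unsigned long slot) {
  std::ostringstream ss;
  // Assumption: the send side reads "send to", the recv side "recv from".
  ss << "Rank " << localRank << " timed out after " << timeoutMs
     << "ms waiting for " << op << (op == "send" ? " to " : " from ");
  if (peers.size() == 1) {
    // Single known peer.
    ss << "rank " << peers.front();
  } else {
    // recv-from-any: no single peer, so print the candidate set.
    ss << "any of ranks [";
    for (std::size_t i = 0; i < peers.size(); ++i) {
      if (i > 0) {
        ss << ", ";
      }
      ss << peers[i];
    }
    ss << "]";
  }
  ss << " (slot " << slot << ")";
  return ss.str();
}
```

Calling `formatTimeoutMessage("recv", 6, 1800000, {22}, 7)` yields the first example message, and passing `{4, 5, 22}` yields the recv-from-any form. In Gloo itself the same pieces would more likely be passed straight to `GLOO_ERROR_MSG(...)` rather than through a separate helper.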

## Implementation sketch

All the data needed is already reachable from `tcp::UnboundBuffer`, so the change is confined to [`gloo/transport/tcp/unbound_buffer.{h,cc}`](https://github.com/pytorch/gloo/blob/main/gloo/transport/tcp/unbound_buffer.cc) and needs **no cross-transport interface change**.

**Local rank — free.** `UnboundBuffer` already holds a `std::shared_ptr<tcp::Context> context_`, and the base `transport::Context` stores `const int rank` / `const int size`. So `context_->rank` can be dropped into both messages directly, with no plumbing.

**Remote rank — small plumb.** The remote rank is known when the op is *issued*, but it isn't recorded on the buffer before the wait — `recvRank_` / `sendRank_` stay `-1` until *completion* (they're only set in `handleRecvCompletion` / `handleSendCompletion`, which never fire on a timeout). Fix: record the target rank(s) on the buffer when the op starts:

- `UnboundBuffer::send(int dstRank, ...)` → stash `dstRank`
- `UnboundBuffer::recv(int srcRank, ...)` → stash `srcRank`
- `UnboundBuffer::recv(std::vector<int> srcRanks, ...)` → stash the vector (recv-from-any has no single peer, so print the candidate set)

Then include the stashed rank(s), `context_->rank`, and the `slot` in the two timeout `GLOO_ERROR_MSG(...)` calls. The `slot` is especially useful for correlating the two sides of a hang.
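The bookkeeping described above can be sketched with a toy class, shown for the recv side only. This is not the real `tcp::UnboundBuffer`: `ToyBuffer`, `pendingRecvRanks_`, and `pendingRecvSlot_` are hypothetical names, and `std::runtime_error` stands in for Gloo's `IoException` so the example compiles on its own. The point is only the ordering: the candidate ranks and slot are recorded when the op is issued, so they are still available when the wait times out and no completion handler has run.

```cpp
#include <chrono>
#include <condition_variable>
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <sstream>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Toy stand-in for tcp::UnboundBuffer; only the rank bookkeeping is modeled.
class ToyBuffer {
 public:
  explicit ToyBuffer(int localRank) : localRank_(localRank) {}

  // Issue a recv: stash the candidate source ranks and the slot up front.
  void recv(std::vector<int> srcRanks, std::uint64_t slot) {
    std::lock_guard<std::mutex> lock(m_);
    pendingRecvRanks_ = std::move(srcRanks);
    pendingRecvSlot_ = slot;
  }

  // Stand-in for the completion callback; records the actual sender.
  void handleRecvCompletion(int rank) {
    {
      std::lock_guard<std::mutex> lock(m_);
      recvRank_ = rank;
      ++recvCompletions_;
    }
    cv_.notify_one();
  }

  // Returns the sender's rank, or throws with rank context on timeout.
  int waitRecv(std::chrono::milliseconds timeout) {
    std::unique_lock<std::mutex> lock(m_);
    const bool done =
        cv_.wait_for(lock, timeout, [this] { return recvCompletions_ > 0; });
    if (!done) {
      throw std::runtime_error(timeoutMessage(timeout));
    }
    --recvCompletions_;
    return recvRank_;
  }

 private:
  // Caller must hold m_. Uses the ranks stashed at issue time.
  std::string timeoutMessage(std::chrono::milliseconds timeout) const {
    std::ostringstream ss;
    ss << "Rank " << localRank_ << " timed out after " << timeout.count()
       << "ms waiting for recv from ";
    if (pendingRecvRanks_.size() == 1) {
      ss << "rank " << pendingRecvRanks_.front();
    } else {
      ss << "any of ranks [";
      for (std::size_t i = 0; i < pendingRecvRanks_.size(); ++i) {
        ss << (i ? ", " : "") << pendingRecvRanks_[i];
      }
      ss << "]";
    }
    ss << " (slot " << pendingRecvSlot_ << ")";
    return ss.str();
  }

  const int localRank_;
  std::mutex m_;
  std::condition_variable cv_;
  std::vector<int> pendingRecvRanks_;  // set at issue time, not completion
  std::uint64_t pendingRecvSlot_ = 0;
  int recvCompletions_ = 0;
  int recvRank_ = -1;  // stays -1 until a completion arrives
};
```

A real change would also need to decide how to handle several recvs outstanding on one buffer at once; this sketch simply keeps the most recently issued set.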

This is consistent with the existing convention: `waitRecv` / `waitSend` already report the peer rank via the `int* rank` out-param on **success** — we'd just also surface it on the **timeout** path. There's already precedent for formatting ranks (`Rank N`, comma-joined peer lists) in `tcp::Context::printConnectivityInfo()`.

### Optional follow-ups

- Add a `getRemoteRank()` accessor on `transport::Pair` (symmetric with the existing `getLocalRank()` / `setLocalRank()`); the TCP `Pair` already stores `const int rank_`. That would let the address-based timeout/error messages in `pair.cc` — which today identify the peer only by `peer_.str()` (an `ip:port`, **not** a rank) — also include the rank.
- Apply the same treatment to other transports (e.g. `ibverbs`) for consistency.
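As a rough illustration of the first follow-up, the sketch below shows what a `getRemoteRank()` accessor might look like and how an address-based message could use it. `ToyPair`, `ToyTcpPair`, `peerAddress()`, and `describePeer()` are all hypothetical stand-ins, not Gloo's actual `transport::Pair` interface, and the `rank N (ip:port)` wording is an assumption.

```cpp
#include <string>
#include <utility>

// Hypothetical stand-in for transport::Pair with the proposed accessor.
class ToyPair {
 public:
  virtual ~ToyPair() = default;
  virtual int getRemoteRank() const = 0;
  virtual std::string peerAddress() const = 0;
};

// Hypothetical stand-in for the TCP pair, which already stores the peer rank.
class ToyTcpPair : public ToyPair {
 public:
  ToyTcpPair(int rank, std::string addr)
      : rank_(rank), addr_(std::move(addr)) {}
  int getRemoteRank() const override { return rank_; }
  std::string peerAddress() const override { return addr_; }

 private:
  const int rank_;
  const std::string addr_;
};

// Combines rank and address so a message names the peer both ways.
std::string describePeer(const ToyPair& pair) {
  return "rank " + std::to_string(pair.getRemoteRank()) + " (" +
      pair.peerAddress() + ")";
}
```

Keeping the `ip:port` alongside the rank preserves what the current `pair.cc` messages already report while adding the rank.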

