Improve docstring of Elastic dataclass in flytekit-kf-pytorch #3419
dmholtz wants to merge 4 commits into
Conversation
This PR highlights additional settings for multi-node training on K8s through a remark in the Elastic docstring, to avoid TCPStore timeouts if the zero worker starts slower than any other worker.

Signed-off-by: David Holtz <56723830+dmholtz@users.noreply.github.com>
> be assigned to a running node which might have the image in its cache while other workers might require a node scale up and image pull.
> When using the default `torch.distributed.elastic.rendezvous.c10d_rendezvous_backend.C10dRendezvousBackend`, consider also increasing
> the TCPStore `read_timeout`, e.g., `{"timeout": 900, "join_timeout": 900, "read_timeout": 900}`, as its default value of 60 seconds
> might be too tight if the zero-worker starts slower than any other worker.
fg91: Nit: This is mostly relevant when not using gang scheduling; we could mention this here.
dmholtz: @fg91 Thanks for this hint, I added an additional remark about its relevance.
With "zero-worker" you mean rank 0, right?
With "zero-worker", I mean the pod with the name {flyte-task-id}-worker-0. Not the rank 0 process of torch DDP.
Signed-off-by: David Holtz <56723830+dmholtz@users.noreply.github.com>
Force-pushed from f458f49 to 611c3f7.
fg91 left a comment: Two nits but LG apart from that, thanks!
Co-authored-by: Fabio M. Graetz, Ph.D. <fabiograetz@googlemail.com>
Signed-off-by: David Holtz <56723830+dmholtz@users.noreply.github.com>
Force-pushed from d92acc6 to 6bb5ca1.
@fg91 just checking in on my PR. Is there any further work needed from my side?
fg91 left a comment:
Tiny nit:

> as provided by e.g. coscheduling or volcano

I'd personally prefer if we removed these examples and talked about gang scheduling in general terms, as the field is moving quite fast in this area and these recommendations might become outdated quickly. I for instance encountered stability issues with coscheduling and personally wouldn't recommend it anymore.

After that LGTM, thank you!
This PR highlights additional settings for configuring multi-node training with flytekit-kf-pytorch.
Why are the changes needed?
The `read_timeout` of the default C10dRendezvousBackend in multi-node training with flytekit-kf-pytorch is 60 seconds, which might be too tight if the zero worker starts slower than any other worker.

To avoid users of flytekit-kf-pytorch being confused by obscure timeout errors during startup of such elastic PyTorch tasks, we add additional hints for configuration that avoid such errors.
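For context, these rendezvous config keys are handed through to torch's elastic launcher; below is a hedged sketch of the plain-torch equivalent using `torch.distributed.launcher.api` (parameter names per recent torch releases, and the endpoint host is hypothetical — worth double-checking against the installed torch version):

```python
# Hedged sketch: shows how rdzv config keys (including read_timeout)
# reach the c10d rendezvous backend; names may differ across torch versions.
from torch.distributed.launcher.api import LaunchConfig, elastic_launch

def train():
    ...  # training entrypoint executed on every worker process

config = LaunchConfig(
    min_nodes=2,
    max_nodes=2,
    nproc_per_node=4,
    rdzv_backend="c10d",
    rdzv_endpoint="worker-0-host:29400",  # hypothetical host name
    # Without read_timeout here, the c10d backend's TCPStore reads
    # time out after 60 seconds.
    rdzv_configs={"join_timeout": 900, "read_timeout": 900},
)

elastic_launch(config, train)()
```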
What changes were proposed in this pull request?
This PR adds a remark to explicitly increase the `read_timeout` of the TCPStore used by the C10dRendezvousBackend, which is the default for multi-node training with flytekit-kf-pytorch.

I decided not to change any of the defaults in the `rdzv_config` dictionary to remain agnostic of the chosen backend.

How was this patch tested?
This PR touches only documentation, so no tests are required.