Improve docstring of Elastic dataclass in flytekit-kf-pytorch #3419
dmholtz wants to merge 4 commits into
Conversation
This PR highlights additional settings for multi-node training on K8s through a remark in the Elastic docstring, to avoid TCPStore timeouts if the zero worker starts slower than any other worker.

Signed-off-by: David Holtz <56723830+dmholtz@users.noreply.github.com>
> be assigned to a running node which might have the image in its cache while other workers might require a node scale up and image pull.
> When using the default `torch.distributed.elastic.rendezvous.c10d_rendezvous_backend.C10dRendezvousBackend`, consider also increasing
> the TCPStore `read_timeout`, e.g., `{"timeout": 900, "join_timeout": 900, "read_timeout": 900}`, as its default value of 60 seconds
> might be too tight if the zero-worker starts slower than any other worker.
fg91: Nit: This is mostly relevant when not using gang scheduling; we could mention this here.
dmholtz: @fg91 Thanks for this hint, I added an additional remark about its relevance.
With "zero-worker" you mean rank 0, right?
With "zero-worker", I mean the pod with the name {flyte-task-id}-worker-0. Not the rank 0 process of torch DDP.
Signed-off-by: David Holtz <56723830+dmholtz@users.noreply.github.com>
Force-pushed from f458f49 to 611c3f7.
fg91 left a comment: Two nits but LG apart from that, thanks!
Co-authored-by: Fabio M. Graetz, Ph.D. <fabiograetz@googlemail.com>
Signed-off-by: David Holtz <56723830+dmholtz@users.noreply.github.com>
Force-pushed from d92acc6 to 6bb5ca1.
@fg91 just checking in on my PR. Is there any further work needed from my side?
fg91 left a comment:
Tiny nit:

> as provided by e.g. coscheduling or volcano

I'd personally prefer if we removed these examples and talked about gang scheduling in general terms, as the field is moving quite fast in this area and these recommendations might become outdated quickly. I for instance encountered stability issues with coscheduling and personally wouldn't recommend it anymore.

After that LGTM, thank you!
This PR highlights additional settings for configuring multi-node training with flytekit-kf-pytorch.
Why are the changes needed?
The `read_timeout` of the default C10dRendezvousBackend in multi-node training with flytekit-kf-pytorch is 60 seconds, which might be too tight if the zero worker starts slower than any other worker.

To avoid users of flytekit-kf-pytorch being confused by obscure timeout errors during startup of such elastic PyTorch tasks, we add additional hints for configuration that avoid such errors.
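For context, these rendezvous config keys are handed through to torch's elastic launcher; below is a hedged sketch of the plain-torch equivalent using `torch.distributed.launcher.api` (parameter names per recent torch releases, and the endpoint host is hypothetical — worth double-checking against the installed torch version):

```python
# Hedged sketch: shows how rdzv config keys (including read_timeout)
# reach the c10d rendezvous backend; names may differ across torch versions.
from torch.distributed.launcher.api import LaunchConfig, elastic_launch

def train():
    ...  # training entrypoint executed on every worker process

config = LaunchConfig(
    min_nodes=2,
    max_nodes=2,
    nproc_per_node=4,
    rdzv_backend="c10d",
    rdzv_endpoint="worker-0-host:29400",  # hypothetical host name
    # Without read_timeout here, the c10d backend's TCPStore reads
    # time out after 60 seconds.
    rdzv_configs={"join_timeout": 900, "read_timeout": 900},
)

elastic_launch(config, train)()
```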
What changes were proposed in this pull request?
This PR adds a remark to explicitly increase the `read_timeout` of the TCPStore used by the C10dRendezvousBackend, which is the default for multi-node training with flytekit-kf-pytorch.

I decided not to change any of the defaults in the `rdzv_config` dictionary to remain agnostic of the chosen backend.

How was this patch tested?
This PR touches only documentation, so no tests are required.