Add multigpu training entrypoint for easier use #5661
Open: StafaH wants to merge 7 commits into isaac-sim:develop from StafaH:mh/multigpu_command
ea57cad Add clean multigpu command (StafaH)
c0bf9e6 Merge branch 'develop' into mh/multigpu_command (StafaH)
9ace04e Add ctr-c handling for multigpu (StafaH)
a91b768 Remove multigpu test (StafaH)
db1d900 Ruff (StafaH)
d28b82e Merge branch 'develop' into mh/multigpu_command (StafaH)
550f9f4 Merge branch 'develop' into mh/multigpu_command (StafaH)
@@ -0,0 +1,150 @@

```python
# Copyright (c) 2022-2026, The Isaac Lab Project Developers (https://github.com/isaac-sim/IsaacLab/blob/main/CONTRIBUTORS.md).
# All rights reserved.
#
# SPDX-License-Identifier: BSD-3-Clause

"""Multi-GPU training entrypoint for Isaac Lab reinforcement learning workflows."""

from __future__ import annotations

import argparse
import shlex
import subprocess
import sys
from pathlib import Path

SCRIPT_DIR = Path(__file__).resolve().parent
TRAIN_SCRIPT = SCRIPT_DIR / "train.py"

DISTRIBUTED_LIBRARIES = ("rl_games", "rsl_rl", "skrl")


def _parse_args(argv: list[str]) -> tuple[argparse.Namespace, list[str]]:
    """Parse multi-GPU launcher arguments and return forwarded training arguments."""
    parser = argparse.ArgumentParser(
        description="Launch multi-GPU RL training with torch.distributed.run.",
        formatter_class=argparse.RawDescriptionHelpFormatter,
        allow_abbrev=False,
        epilog=(
            "Examples:\n"
            "  train_multigpu --num_gpus 4 --task Isaac-Cartpole-v0 --headless\n"
            "  train_multigpu --rl_library skrl --num_gpus 2 --task Isaac-Cartpole-v0 --headless\n"
            "\n"
            "All unrecognized arguments are forwarded to the selected training library."
        ),
    )
    parser.add_argument(
        "--rl_library",
        choices=DISTRIBUTED_LIBRARIES,
        default="rsl_rl",
        help="Distributed-capable training library to use. Defaults to rsl_rl.",
    )
    parser.add_argument(
        "--num_gpus",
        "--nproc_per_node",
        dest="nproc_per_node",
        default="gpu",
        help=(
            "Number of trainer processes to launch on each node. Accepts an integer or torchrun values "
            "'gpu', 'cpu', and 'auto'. Defaults to 'gpu'."
        ),
    )
    parser.add_argument("--nnodes", default=None, help="Number of nodes to use for distributed training.")
    parser.add_argument("--node_rank", default=None, help="Rank of this node in a multi-node job.")
    parser.add_argument("--master_addr", default=None, help="Master node address for static rendezvous.")
    parser.add_argument("--master_port", default=None, help="Master node port for static rendezvous.")
    parser.add_argument("--rdzv_backend", default=None, help="Rendezvous backend used by torchrun.")
    parser.add_argument("--rdzv_endpoint", default=None, help="Rendezvous endpoint used by torchrun.")
    parser.add_argument("--rdzv_id", default=None, help="User-defined rendezvous id used by torchrun.")
    parser.add_argument("--max_restarts", default=None, help="Maximum worker group restarts before failing.")
    parser.add_argument("--monitor_interval", default=None, help="Worker monitor interval [s].")
    parser.add_argument(
        "--start_method",
        choices=("spawn", "fork", "forkserver"),
        default=None,
        help="Multiprocessing start method used by torchrun.",
    )
    parser.add_argument("--role", default=None, help="User-defined worker role used by torchrun.")
    parser.add_argument("--tee", default=None, help="Tee selected worker stdout/stderr streams.")
    parser.add_argument("--redirects", default=None, help="Redirect selected worker stdout/stderr streams.")
    parser.add_argument("--local_ranks_filter", default=None, help="Only show logs from the listed local ranks.")
    parser.add_argument("--log_dir", default=None, help="Directory used by torchrun for worker logs.")
    parser.add_argument("--dry_run", action="store_true", help="Print the torchrun command without launching it.")

    args_cli, train_args = parser.parse_known_args(argv)
    if train_args[:1] == ["--"]:
        train_args = train_args[1:]
    return args_cli, train_args


def _append_optional_torchrun_arg(command: list[str], args_cli: argparse.Namespace, name: str) -> None:
    """Append a torchrun argument when it was provided."""
    value = getattr(args_cli, name)
    if value is not None:
        command.extend([f"--{name}", str(value)])


def _with_distributed_arg(train_args: list[str]) -> list[str]:
    """Ensure the selected training library receives the distributed flag."""
    if "--distributed" in train_args:
        return train_args
    return ["--distributed", *train_args]


def _build_torchrun_command(args_cli: argparse.Namespace, train_args: list[str]) -> list[str]:
    """Build the torchrun command for multi-GPU training."""
    command = [
        sys.executable,
        "-m",
        "torch.distributed.run",
        "--nproc_per_node",
        str(args_cli.nproc_per_node),
    ]
    for name in (
        "nnodes",
        "node_rank",
        "master_addr",
        "master_port",
        "rdzv_backend",
        "rdzv_endpoint",
        "rdzv_id",
        "max_restarts",
        "monitor_interval",
        "start_method",
        "role",
        "tee",
        "redirects",
        "local_ranks_filter",
        "log_dir",
    ):
        _append_optional_torchrun_arg(command, args_cli, name)

    command.extend(
        [
            str(TRAIN_SCRIPT),
            "--rl_library",
            args_cli.rl_library,
            *_with_distributed_arg(train_args),
        ]
    )
    return command


def main(argv: list[str] | None = None) -> int:
    """Launch multi-GPU training with ``torch.distributed.run``."""
    if argv is None:
        argv = sys.argv[1:]

    args_cli, train_args = _parse_args(argv)
    command = _build_torchrun_command(args_cli, train_args)

    if args_cli.dry_run:
        print(shlex.join(command))
        return 0

    print(f"[INFO] Launching distributed training with: {shlex.join(command)}")
    return subprocess.run(command, check=False).returncode


if __name__ == "__main__":
    raise SystemExit(main())
```
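The launcher relies on `argparse.ArgumentParser.parse_known_args` to separate its own options from everything meant for the training script. The following is a minimal standalone sketch of that mechanism, not part of the PR: `split_launcher_args` is a hypothetical helper that registers only two of the launcher's options for brevity.

```python
import argparse


def split_launcher_args(argv: list[str]) -> tuple[argparse.Namespace, list[str]]:
    """Split launcher options from arguments meant for the training script.

    Hypothetical helper for illustration; it mirrors only two launcher options.
    """
    # allow_abbrev=False keeps a forwarded flag that is a prefix of a launcher
    # flag from being silently consumed by the launcher.
    parser = argparse.ArgumentParser(allow_abbrev=False)
    parser.add_argument("--num_gpus", "--nproc_per_node", dest="nproc_per_node", default="gpu")
    parser.add_argument("--master_port", default=None)
    # parse_known_args returns unrecognized arguments instead of raising an error.
    known, forwarded = parser.parse_known_args(argv)
    return known, forwarded


known, forwarded = split_launcher_args(
    ["--num_gpus", "4", "--task", "Isaac-Cartpole-v0", "--headless", "--master_port", "29504"]
)
print(known.nproc_per_node, known.master_port)  # launcher options, kept as strings
print(forwarded)  # everything else, in its original order
```

Launcher options may appear before, between, or after the forwarded arguments; the forwarded list keeps the relative order of whatever the parser did not recognize. One consequence worth keeping in mind is that any training argument sharing a name with a launcher option would be consumed by the launcher rather than forwarded.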
@@ -0,0 +1,5 @@

```rst
Added
^^^^^

* Added the ``train_multigpu`` entry point for launching distributed RL training with
  ``torch.distributed.run``.
```
@@ -0,0 +1,133 @@

```python
# Copyright (c) 2022-2026, The Isaac Lab Project Developers (https://github.com/isaac-sim/IsaacLab/blob/main/CONTRIBUTORS.md).
# All rights reserved.
#
# SPDX-License-Identifier: BSD-3-Clause

"""Tests for the multi-GPU training launcher."""

import importlib.util
import subprocess
import sys
from pathlib import Path
from unittest import mock

from isaaclab.cli.utils import ISAACLAB_ROOT


def _load_train_multigpu_module():
    """Load the train_multigpu script as a test module."""
    module_path = ISAACLAB_ROOT / "scripts" / "reinforcement_learning" / "train_multigpu.py"
    spec = importlib.util.spec_from_file_location("isaaclab_test_train_multigpu", module_path)
    assert spec is not None
    assert spec.loader is not None

    module = importlib.util.module_from_spec(spec)
    sys.modules[spec.name] = module
    spec.loader.exec_module(module)
    return module


TRAIN_MULTIGPU = _load_train_multigpu_module()


def test_builds_single_node_rsl_rl_torchrun_command():
    """Multi-GPU launcher should preserve training args and inject distributed mode."""
    args_cli, train_args = TRAIN_MULTIGPU._parse_args(
        [
            "--num_gpus",
            "4",
            "--master_port",
            "29504",
            "--task=Isaac-Dexsuite-Kuka-Allegro-Reorient-v0",
            "--headless",
            "--num_envs=4096",
            "--max_iterations=100",
            "--run_name=gpu4_vis",
            "presets=newton",
        ]
    )

    command = TRAIN_MULTIGPU._build_torchrun_command(args_cli, train_args)
    train_script_index = command.index(str(TRAIN_MULTIGPU.TRAIN_SCRIPT))

    assert command[:5] == [sys.executable, "-m", "torch.distributed.run", "--nproc_per_node", "4"]
    assert command[5:7] == ["--master_port", "29504"]
    assert command[train_script_index + 1 : train_script_index + 4] == ["--rl_library", "rsl_rl", "--distributed"]
    assert command[-5:] == [
        "--headless",
        "--num_envs=4096",
        "--max_iterations=100",
        "--run_name=gpu4_vis",
        "presets=newton",
    ]


def test_builds_multi_node_skrl_torchrun_command():
    """Multi-node torchrun settings should be forwarded before the training script."""
    args_cli, train_args = TRAIN_MULTIGPU._parse_args(
        [
            "--rl_library",
            "skrl",
            "--nproc_per_node",
            "2",
            "--nnodes",
            "2",
            "--node_rank",
            "1",
            "--rdzv_backend",
            "c10d",
            "--rdzv_endpoint",
            "host.example.com:5555",
            "--rdzv_id",
            "job-1",
            "--task",
            "Isaac-Cartpole-v0",
            "--distributed",
        ]
    )

    command = TRAIN_MULTIGPU._build_torchrun_command(args_cli, train_args)
    train_script_index = command.index(str(TRAIN_MULTIGPU.TRAIN_SCRIPT))

    assert command[:5] == [sys.executable, "-m", "torch.distributed.run", "--nproc_per_node", "2"]
    assert command[5:train_script_index] == [
        "--nnodes",
        "2",
        "--node_rank",
        "1",
        "--rdzv_backend",
        "c10d",
        "--rdzv_endpoint",
        "host.example.com:5555",
        "--rdzv_id",
        "job-1",
    ]
    assert command[train_script_index + 1 : train_script_index + 3] == ["--rl_library", "skrl"]
    assert command.count("--distributed") == 1


def test_dry_run_prints_command_without_launching(capsys):
    """Dry-run mode should not start torchrun."""
    with mock.patch.object(subprocess, "run") as mock_run:
        result = TRAIN_MULTIGPU.main(["--dry_run", "--num_gpus", "2", "--task", "Isaac-Cartpole-v0"])

    assert result == 0
    mock_run.assert_not_called()
    output = capsys.readouterr().out
    assert "torch.distributed.run" in output
    assert "--nproc_per_node 2" in output
    assert "--distributed --task Isaac-Cartpole-v0" in output


def test_cli_helper_runs_multigpu_script():
    """The isaaclab CLI helper should dispatch to the multi-GPU training script."""
    from isaaclab import cli

    with mock.patch("isaaclab.cli.run_python_command") as mock_run:
        cli.train_multigpu(["--dry_run"])

    mock_run.assert_called_once_with(
        Path(ISAACLAB_ROOT) / "scripts" / "reinforcement_learning" / "train_multigpu.py",
        ["--dry_run"],
        check=True,
    )
```
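The tests shown here do not exercise the explicit `--` separator that `_parse_args` strips from the front of the forwarded arguments. A possible additional check is sketched below as a standalone snippet, not as part of the PR: `parse_with_separator` is a hypothetical stand-in that repeats only the separator logic, so it runs without importing the launcher or Isaac Lab.

```python
import argparse


def parse_with_separator(argv: list[str]) -> tuple[argparse.Namespace, list[str]]:
    """Hypothetical stand-in for the launcher's parsing, reduced to one option."""
    parser = argparse.ArgumentParser(allow_abbrev=False)
    parser.add_argument("--num_gpus", dest="nproc_per_node", default="gpu")
    known, forwarded = parser.parse_known_args(argv)
    # Same rule as the launcher: drop a leading "--" if argparse left it in place.
    if forwarded[:1] == ["--"]:
        forwarded = forwarded[1:]
    return known, forwarded


def test_separator_is_not_forwarded():
    """An explicit "--" should mark the split without reaching the training script."""
    known, forwarded = parse_with_separator(["--num_gpus", "2", "--", "--task", "Isaac-Cartpole-v0"])
    assert known.nproc_per_node == "2"
    assert forwarded == ["--task", "Isaac-Cartpole-v0"]


test_separator_is_not_forwarded()
```

In the real test module the same assertion could presumably be written against `TRAIN_MULTIGPU._parse_args` directly.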
What if I want to run multi-GPU training with JAX (https://isaac-sim.github.io/IsaacLab/main/source/features/multi_gpu.html#jax-implementation)?
Also, in the JAX multi-GPU setup, parameters such as `rdzv_backend`, `rdzv_endpoint` and `rdzv_id` do not exist.
Hey @Toni-SM, good question! There are a couple of options we can try. We could detect from the arguments that the user has selected skrl + JAX and modify the command for them. We could also add a simple dedicated flag for JAX (--jax) to be used in combination with --rl_library skrl. We can add argument validation as well. That's what makes this new entry point script really strong: we can do quick, early parsing, make sure the correct args are there, and error out early if they aren't.
Which option would you prefer?
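To make the second option concrete, here is a rough sketch of what the early validation could look like. Everything in it is an assumption for discussion: `--jax` is only the flag proposed above and does not exist in the PR, `validate_launcher_args` and `TORCHRUN_ONLY_OPTIONS` are hypothetical names, and the list of rejected options contains only the three parameters named in the review comment. How the JAX launch command itself would be built is left out.

```python
from __future__ import annotations

# Torchrun options that the review comment says do not exist in the JAX setup.
# Hypothetical constant; other torchrun-specific options may belong here too.
TORCHRUN_ONLY_OPTIONS = ("rdzv_backend", "rdzv_endpoint", "rdzv_id")


def validate_launcher_args(rl_library: str, use_jax: bool, provided: dict[str, str | None]) -> list[str]:
    """Return human-readable errors; an empty list means the combination is accepted.

    ``use_jax`` stands for the proposed (not yet existing) ``--jax`` flag and
    ``provided`` maps launcher option names to the values the user passed.
    """
    errors: list[str] = []
    if not use_jax:
        return errors
    if rl_library != "skrl":
        errors.append(f"--jax requires --rl_library skrl, got {rl_library!r}")
    for name in TORCHRUN_ONLY_OPTIONS:
        if provided.get(name) is not None:
            errors.append(f"--{name} is a torchrun option and is not valid with --jax")
    return errors


# Example: a rendezvous id combined with the proposed JAX flag is rejected early.
for message in validate_launcher_args("skrl", True, {"rdzv_id": "job-1", "rdzv_backend": None}):
    print(f"[ERROR] {message}")
```

Returning a list instead of raising on the first problem would let the launcher report every conflicting option in one pass before exiting.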