This repository provides a comprehensive tutorial for performing distributed Hyperparameter Optimization (HPO) using Ray Tune on the Jean Zay supercomputer. It demonstrates how to orchestrate a multi-node Ray cluster within a SLURM environment to parallelize machine learning / deep learning experiments efficiently.
The project consists of four main components designed to work together on an HPC cluster:
- `requirements.py`: Contains the list of Python dependencies, including `torch`, `ray[tune]`, and `optuna`.
- `multi_node/config.yaml`: The central configuration file for managing the search space and Ray resources.
- `multi_node/ray_tune.py`: The execution script that defines the data loading, the model, the training loop, and the Ray Tune experiment logic.
- `multi_node/launch_ray_tune.slurm`: The SLURM submission script that initializes the Ray cluster across multiple nodes.
The sections below explain each component and its variables in detail.
This file lists the essential libraries needed to run the experiment:
- Deep Learning: `torch` and `tensorboard`.
- Optimization: `ray[tune]` and `optuna`.
- Data Processing: `numpy`, `pandas`, and `scikit-learn`.
- Utilities: `PyYAML` and `pydantic`.
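Assuming unpinned dependencies (the repository may pin specific versions), `requirements.py` would look like a standard pip requirements file:

```
torch
tensorboard
ray[tune]
optuna
numpy
pandas
scikit-learn
PyYAML
pydantic
```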
These are the minimum requirements to run a Ray Tune experiment. If you want to change the data loading or the model, or add other features, you will probably have to add specific packages.
This YAML file sets up the parameters needed for the Ray Tune experiment.
The file is divided into three main sections:
This section manages the logistics of the experiment and how resources are allocated on the supercomputer.
- `experiment_name` (`"ray_tune_multinode_jeanzay"`): The name of your experiment. Ray Tune uses this to create a dedicated output directory containing all logs, results, and model checkpoints.
- `num_samples` (`20`): The total number of trials Ray Tune will execute. Each trial represents a complete training run using a unique combination of hyperparameters.
Warning: For large models, a high number of samples will require massive compute time.
- `resources_per_trial`: Defines the computational resources allocated to a single trial.
  - `cpu` (`10`): Number of CPU cores dedicated to data loading and preprocessing for this specific trial.
  - `gpu` (`1`): Number of GPUs dedicated to this trial.
- `max_concurrent_trials`: Caps the number of trials running in parallel to prevent overloading your SLURM allocation.
- `num_nodes` (`2`): The total number of compute nodes you requested from Jean Zay.
- `num_gpu_per_node` (`4`): The number of GPUs available on each node (e.g., H100 or A100 quad-GPU nodes).
Note: With this setup (2 nodes x 4 GPUs), Ray Tune will run a maximum of 8 trials concurrently.
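The maximum of 8 concurrent trials follows directly from the resource figures above. A minimal sketch of the arithmetic (this helper function is illustrative, not code from the repository):

```python
def max_concurrent_trials(num_nodes: int, num_gpu_per_node: int, gpu_per_trial: int) -> int:
    """Upper bound on trials running at once, limited by the total GPU supply."""
    return (num_nodes * num_gpu_per_node) // gpu_per_trial

# 2 nodes x 4 GPUs each, 1 GPU per trial -> at most 8 concurrent trials
print(max_concurrent_trials(2, 4, 1))
```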
This section defines the ranges and values the optimization algorithm will explore.
- `lr_min` (`1e-4`) and `lr_max` (`1e-1`): The lower and upper bounds for the learning rate. The Python script is configured to sample this value on a logarithmic scale (log-uniform) between these two boundaries.
- `batch_size` (`[16, 32, 64]`): A list of discrete (categorical) values. For each trial, Ray Tune will select one of these batch sizes.
- `epochs` (`50`): The number of training epochs. Here it is a fixed value for all trials, but you can easily modify the Python script to treat this as a variable search space as well.
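Conceptually, sampling on a logarithmic scale (what `tune.loguniform(lr_min, lr_max)` provides) means drawing uniformly in log space. A minimal pure-Python sketch of the idea:

```python
import math
import random

def sample_loguniform(low: float, high: float) -> float:
    """Draw a value between low and high, uniform on a log scale:
    each decade (1e-4..1e-3, 1e-3..1e-2, ...) is equally likely."""
    return math.exp(random.uniform(math.log(low), math.log(high)))

lr = sample_loguniform(1e-4, 1e-1)
```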
This section instructs Ray Tune and the underlying Optuna algorithm on how to evaluate model performance.
- `metric` (`"accuracy"`): The name of the metric your Python training loop (in `ray_tune.py`) reports back to Ray Tune via `tune.report()`. This acts as the compass guiding the optimization algorithm.
- `mode` (`"max"`): The optimization objective. Since our metric is "accuracy", we want the algorithm to maximize it (`"max"`). If your metric were a loss function (e.g., "loss" or "MSE"), you would change this parameter to `"min"`.
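Putting the three sections together, `config.yaml` could look like the following (a reconstruction from the values described above; the key names and nesting may differ in the actual repository file):

```yaml
experiment_name: "ray_tune_multinode_jeanzay"
num_samples: 20
resources_per_trial:
  cpu: 10
  gpu: 1
max_concurrent_trials: 8
num_nodes: 2
num_gpu_per_node: 4

# Search space
lr_min: 1.0e-4
lr_max: 1.0e-1
batch_size: [16, 32, 64]
epochs: 50

# Optimization objective
metric: "accuracy"
mode: "max"
```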
This script handles the machine learning lifecycle and Ray integration:
- Data & Model: Uses a toy Breast Cancer dataset and a simple Multi-Layer Perceptron (`SimpleMLP`) for demonstration. Disclaimer: The `SimpleMLP` model and the dataset loading logic provided in this tutorial were generated by AI and serve only as placeholders to demonstrate the distributed infrastructure.
- Trial Execution: The `train_and_evaluate` function is executed for every hyperparameter combination, reporting metrics back via `tune.report`.
- Search & Scheduling: Implements `OptunaSearch` for intelligent parameter sampling and `ASHAScheduler` for early stopping of underperforming trials. These two components can be changed depending on your needs: see tune.search to select another of Tune's Search Algorithms, and tune.schedulers to select another of Tune's Trial Schedulers.
- Fault Tolerance: Includes logic to detect and restore unfinished experiments from a storage path via `Tuner.restore`, allowing jobs to resume if interrupted by SLURM time limits.
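To build intuition for what `ASHAScheduler` does, here is a toy, synchronous sketch of the underlying successive-halving idea (real ASHA is asynchronous and integrated with Ray Tune; this standalone function is only illustrative):

```python
def successive_halving(configs, score_fn, reduction_factor=4, min_budget=1, max_budget=64):
    """Score every config on a small training budget, keep the best
    1/reduction_factor of them, multiply the budget, and repeat until a
    single survivor (or max_budget) remains.
    score_fn(config, budget) returns the metric to maximize (e.g. accuracy)."""
    survivors = list(configs)
    budget = min_budget
    while budget <= max_budget and len(survivors) > 1:
        ranked = sorted(survivors, key=lambda c: score_fn(c, budget), reverse=True)
        survivors = ranked[:max(1, len(ranked) // reduction_factor)]
        budget *= reduction_factor
    return survivors

# Toy example: 16 candidate "configs" whose score equals their index,
# so the best config (15) should survive all rungs.
best = successive_halving(range(16), lambda cfg, budget: cfg)
```

Underperforming configurations are eliminated after cheap, short runs, so most of the compute budget is spent on the promising ones.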
This Bash script acts as the bridge between the Jean Zay supercomputer infrastructure and your Python code. It handles resource allocation and the initialization of the distributed Ray cluster.
For all the GPU- and CPU-related directives below, make sure that your project allocation allows you to request this hardware.
These directives request the necessary hardware resources from Jean Zay:
- `--nodes=2`: Requests the allocation of 2 distinct compute nodes.
- `-C h100` and `--gres=gpu:4`: Specifies the use of nodes equipped with 4 H100 GPUs each, for a total of 8 GPUs across the cluster. You can also ask for `a100` or `v100`, for example.
- `--cpus-per-task=96`: Allocates 96 CPU cores per node to massively parallelize data loading and processing.
- `--hint=nomultithread`: Disables hyperthreading (SMT) to guarantee more predictable compute performance. It is a good practice on Jean Zay.
- `--exclusive`: Guarantees exclusive use of the allocated nodes, preventing other users from impacting your performance.
- `--time=00:05:00`: Sets the execution time limit (walltime) of the experiment to 5 minutes. It is deliberately very low for this tutorial; you can go up to `20:00:00` (20 hours).
- `--signal=B:SIGUSR1@120`: Sets the number of seconds before the time limit at which a `SIGUSR1` signal is sent to the SLURM script. It is mandatory since it triggers a graceful shutdown of Ray (see section 4.4 for more explanations).
Before submitting your job via sbatch, you must update these lines with your own information:
- SLURM Project Account (`-A name-project@h100`): Replace `name-project@h100` with your own Jean Zay compute-hour allocation ID, e.g. `-A abc@h100` for requesting H100 GPUs or `-A abc@a100` for A100 GPUs. Just be careful to stay aligned with what you request in the `-C` directive.
- Log Files (`--output` and `--error`): Modify the hardcoded paths `/lustre/fsn1/projects/rech/name-project/jean-zay-id/path/to/logs/` to point to your project's own storage space.
- Virtual Environment: Update the activation commands (`cd $SCRATCH/hpo_raytune` and `source venv_hpo_raytune/bin/activate`) to match the exact path where you installed your Python dependencies.
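Assembled from the directives above, the top of `launch_ray_tune.slurm` would look roughly like this (a sketch only: the job name and log paths are placeholders, and the actual file may order or name things differently):

```bash
#!/bin/bash
#SBATCH --job-name=hpo_raytune
#SBATCH -A name-project@h100
#SBATCH -C h100
#SBATCH --nodes=2
#SBATCH --gres=gpu:4
#SBATCH --cpus-per-task=96
#SBATCH --hint=nomultithread
#SBATCH --exclusive
#SBATCH --time=00:05:00
#SBATCH --signal=B:SIGUSR1@120
#SBATCH --output=/lustre/fsn1/projects/rech/name-project/jean-zay-id/path/to/logs/%x_%j.out
#SBATCH --error=/lustre/fsn1/projects/rech/name-project/jean-zay-id/path/to/logs/%x_%j.err
```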
The script automates the setup of a Master/Slave (Head/Worker) architecture, which is essential for Ray Tune to distribute training across multiple physical machines:
- The Head Node (Ray Head):

  - The script first identifies the first node in the list allocated by SLURM (`HEAD_NODE`) and extracts its IP address (`HEAD_IP`).

    ```bash
    nodes=($(scontrol show hostnames $SLURM_JOB_NODELIST))
    HEAD_NODE=${nodes[0]}
    HEAD_IP=$(getent hosts $HEAD_NODE | awk '{print $1}')
    ```

  - It then launches the Ray master process (`ray start --head`) on this specific node, opening the communication port `6379`.

    ```bash
    srun --nodes=1 --ntasks=1 -w $HEAD_NODE \
        ray start --head \
        --node-ip-address=$HEAD_IP \
        --port=6379 \
        --num-cpus=$SLURM_CPUS_PER_TASK \
        --disable-usage-stats \
        --block &
    ```

  - This "Head" node acts as the orchestrator: it will execute the `ray_tune.py` Python script and will be responsible for scheduling and distributing the different trials (hyperparameter combinations) to the worker nodes.

    ```bash
    python -u ray_tune.py --config config.yaml &
    ```
- The Worker Nodes (Ray Workers):

  - Once the head node is active, the script uses a `for` loop to iterate through the rest of the nodes allocated by Jean Zay.

    ```bash
    WORKER_PIDS=()
    for node in "${nodes[@]:1}"; do
        WORKER_IP=$(getent hosts $node | awk '{ print $1 }')
        echo "Starting Ray worker on $node ($WORKER_IP)"
        [..]
    done
    ```

  - On each remaining node, it launches a worker process (`ray start`).

    ```bash
    WORKER_PIDS=()
    for node in "${nodes[@]:1}"; do
        WORKER_IP=$(getent hosts $node | awk '{ print $1 }')
        echo "Starting Ray worker on $node ($WORKER_IP)"
        srun --nodes=1 --ntasks=1 -w $node \
            ray start \
            --address=$HEAD_IP:6379 \
            --num-cpus=$SLURM_CPUS_PER_TASK \
            --disable-usage-stats \
            --block &
        WORKER_PIDS+=($!)
        sleep 5
    done
    ```

  - The crucial argument here is `--address=$HEAD_IP:6379`: it tells the worker nodes how to connect to the head node to form a unified cluster. These workers are the ones that will concretely execute the model training loops.
Once the Ray cluster is formed and all connections are established, the Bash script launches the Python optimization in the background and waits for its completion using `wait $PYTHON_PID`. Once the Python script is finished (regardless of its exit code `RC`), it cleanly shuts down all processes in the distributed cluster using `ray stop`.
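This wait-then-cleanup pattern can be demonstrated in isolation; in this sketch a short `sleep` stands in for the real `python -u ray_tune.py` process, and an `echo` stands in for `ray stop`:

```shell
# Stand-in for: python -u ray_tune.py --config config.yaml &
(sleep 1; exit 0) &
PYTHON_PID=$!

# Block until the background "Python" process ends, then capture its exit code.
wait $PYTHON_PID
RC=$?

# Stand-in for: ray stop (tears down the head and worker processes).
echo "Python finished with RC=$RC, shutting down the Ray cluster"
```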
As your experiments can be very long, your Ray Tune experiment should end properly even if the time limit is exceeded. To that end, the function `propagate_signal` catches the `SIGUSR1` signal sent by SLURM and forwards it to the Python process as a `SIGINT` (Ray Tune handles this signal and shuts down gracefully).
```bash
function propagate_signal() {
    echo "SIGUSR1 received at T-120s. Sending SIGINT to Python..." | tee -a $RAY_LOG_FILE
    kill -SIGINT $PYTHON_PID || true
}
trap propagate_signal SIGUSR1
```

If you want your SLURM script to relaunch the job automatically when not all the trials have been performed yet, go to the bottom of the script and uncomment the line:
```bash
if [ $RC -eq 12 ]; then
    echo "Ray Tune incomplete (exit code 12) → should automatically requeue Slurm job to complete remaining trials!"
    # scontrol requeue $SLURM_JOB_ID
    exit 0
fi
```

This feature is very powerful since it allows you to automate the experiment.
WARNING: If you set a huge number of `num_samples`, the job will be requeued again and again, which can lead to high compute resource consumption.
To use this tutorial, follow these steps:
- Prepare the Environment: Create a virtual environment and install the dependencies from `requirements.py`. Example using Python venv:

  ```bash
  python -m venv venv_hpo_raytune
  source ./venv_hpo_raytune/bin/activate
  pip install -r requirements.py
  deactivate
  ```
- Configure Paths: Update the `STORAGE_PATH` in `ray_tune.py` and the `#SBATCH` output paths in `launch_ray_tune.slurm` to point to your specific project directories on `/lustre` or `$SCRATCH`.
- Submit the Job:

  ```bash
  sbatch launch_ray_tune.slurm
  ```
- Monitor Results: Ray Tune will log results to the directory specified in the configuration, which can be viewed via TensorBoard. You can also monitor your job using `watch squeue -u your-jean-zay-id`, but be careful not to overload the Jean Zay front-end node's resources.
- If your experiment hasn't finished before the time limit `--time` set in `launch_ray_tune.slurm`, you just have to resubmit your SLURM script with the same configuration, since the Python script has a restoring feature.
- If you want to perform an experiment but would like to pursue it later, provide a huge number of `num_samples` in the `config.yaml`. In that case, even if the time limit `--time` is reached, you will have the possibility to relaunch it later (again with the same parameters and configuration).