neurospin/hpo_raytune

Distributed Hyperparameter Optimization with Ray Tune on Jean Zay

This repository provides a comprehensive tutorial for performing distributed Hyperparameter Optimization (HPO) using Ray Tune on the Jean Zay supercomputer. It demonstrates how to orchestrate a multi-node Ray cluster within a SLURM environment to parallelize machine learning / deep learning experiments efficiently.


📂 Project Structure

The project consists of four main components designed to work together on an HPC cluster:

  • requirements.py: Contains the list of Python dependencies, including torch, ray[tune], and optuna.
  • multi_node/config.yaml: The central configuration file for managing the search space and Ray resources.
  • multi_node/ray_tune.py: The execution script that defines the data loading, the model, the training loop, and the Ray Tune experiment logic.
  • multi_node/launch_ray_tune.slurm: The SLURM submission script that initializes the Ray cluster across multiple nodes.

Each component and its key variables are explained in detail below.


🛠️ Detailed Component Breakdown

1. Environment Configuration (requirements.py)

This file lists the essential libraries needed to run the experiment:

  • Deep Learning: torch and tensorboard.
  • Optimization: ray[tune] and optuna.
  • Data Processing: numpy, pandas, and scikit-learn.
  • Utilities: PyYAML and pydantic.

These are the minimum requirements to run a Ray Tune experiment. If you change the data loading or the model, or add other features, you will probably need to install additional packages.


2. Experiment Settings (config.yaml)

This YAML file sets up the parameters needed for the Ray Tune experiment.

The file is divided into three main sections:

2.1. Ray Tune Configuration (tune)

This section manages the logistics of the experiment and how resources are allocated on the supercomputer.

  • experiment_name ("ray_tune_multinode_jeanzay"): The name of your experiment. Ray Tune uses this to create a dedicated output directory containing all logs, results, and model checkpoints.
  • num_samples (20): The total number of trials Ray Tune will execute. Each trial represents a complete training run using a unique combination of hyperparameters.

Warning: For large models, a high number of samples will require massive compute time.

  • resources_per_trial: Defines the computational resources allocated to a single trial.
    • cpu (10): Number of CPU cores dedicated to data loading and preprocessing for this specific trial.
    • gpu (1): Number of GPUs dedicated to this trial.
  • max_concurrent_trials: Caps the number of trials running in parallel to prevent overloading your SLURM allocation.
    • num_nodes (2): The total number of compute nodes you requested from Jean Zay.
    • num_gpu_per_node (4): The number of GPUs available on each node (e.g., H100 or A100 quad-GPU nodes).

Note: With this setup (2 nodes x 4 GPUs), Ray Tune will run a maximum of 8 trials concurrently.
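The cap is simply the total number of GPUs divided by the GPUs per trial. A minimal sketch (the helper name is hypothetical, not from the repository) of how the concurrency limit follows from the config values:

```python
# Hypothetical helper: derive the concurrency cap from the config values.
def max_concurrent_trials(num_nodes, num_gpu_per_node, gpu_per_trial=1):
    total_gpus = num_nodes * num_gpu_per_node
    # Each trial pins gpu_per_trial GPUs, so this many trials fit at once.
    return total_gpus // gpu_per_trial

print(max_concurrent_trials(num_nodes=2, num_gpu_per_node=4))  # 8 with the tutorial's setup
```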

2.2. Hyperparameter Search Space (hyperparameters)

This section defines the ranges and values the optimization algorithm will explore.

  • lr_min (1e-4) and lr_max (1e-1): The lower and upper bounds for the learning rate. The Python script is configured to sample this value on a logarithmic scale (log-uniform) between these two boundaries.
  • batch_size ([16, 32, 64]): A list of discrete (categorical) values. For each trial, Ray Tune will select one of these batch sizes.
  • epochs (50): The number of training epochs. Here it is a fixed value for all trials, but you can easily modify the Python script to treat this as a variable search space as well.
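For intuition, log-uniform sampling between lr_min and lr_max can be emulated with the standard library alone (in the actual script this is done by tune.loguniform, and batch_size uses tune.choice):

```python
import math
import random

# Standard-library emulation of tune.loguniform(lr_min, lr_max):
# sample uniformly in log10-space, then exponentiate.
def sample_log_uniform(low, high, rng=random):
    return 10 ** rng.uniform(math.log10(low), math.log10(high))

lr = sample_log_uniform(1e-4, 1e-1)        # learning rate for one trial
batch_size = random.choice([16, 32, 64])   # equivalent of tune.choice([...])
assert 1e-4 <= lr <= 1e-1 and batch_size in (16, 32, 64)
```

Sampling in log-space is what lets the search spend as much effort between 1e-4 and 1e-3 as between 1e-2 and 1e-1.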

2.3. Search Algorithm Parameters (search)

This section instructs Ray Tune and the underlying Optuna algorithm on how to evaluate model performance.

  • metric ("accuracy"): The name of the metric your Python training loop (in ray_tune.py) reports back to Ray Tune via tune.report(). This acts as the compass guiding the optimization algorithm.
  • mode ("max"): The optimization objective. Since our metric is "accuracy", we want the algorithm to maximize it ("max"). If your metric was a loss function (e.g., "loss" or "MSE"), you would change this parameter to "min".
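In other words, metric names the key to read from each trial's reported results, and mode says which direction is better. A toy illustration (trial names and values invented for the example):

```python
# Toy illustration of metric/mode: each trial reports {"accuracy": ...},
# and the searcher ranks trials according to mode.
results = {"trial_1": {"accuracy": 0.91},
           "trial_2": {"accuracy": 0.95},
           "trial_3": {"accuracy": 0.88}}

metric, mode = "accuracy", "max"
pick = max if mode == "max" else min          # mode="min" for losses such as MSE
best_trial = pick(results, key=lambda t: results[t][metric])
print(best_trial)  # trial_2
```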

3. The Training Logic (ray_tune.py)

This script handles the machine learning lifecycle and Ray integration:

  • Data & Model: Uses a toy Breast Cancer dataset and a simple Multi-Layer Perceptron (SimpleMLP) for demonstration.

Disclaimer: The SimpleMLP model and the dataset loading logic provided in this tutorial were generated by AI and serve only as placeholders to demonstrate the distributed infrastructure.

  • Trial Execution: The train_and_evaluate function is executed for every hyperparameter combination, reporting metrics back via tune.report.
  • Search & Scheduling: Implements OptunaSearch for intelligent parameter sampling and ASHAScheduler for early stopping of underperforming trials. Both can be swapped depending on your needs: see tune.search to select another of Tune's Search Algorithms, and tune.schedulers to select another of Tune's Trial Schedulers.
  • Fault Tolerance: Includes logic to detect and restore unfinished experiments from a storage path via Tuner.restore, allowing jobs to resume if interrupted by SLURM time limits.
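The trial lifecycle described above can be sketched as follows. All names are hypothetical, and tune.report is stubbed with a plain function so the sketch runs without Ray installed; in the real script, each report() call is what OptunaSearch and ASHAScheduler observe:

```python
# Hedged sketch of the trial lifecycle in ray_tune.py (hypothetical names).
reported = []

def report(metrics):
    reported.append(metrics)      # stand-in for ray.tune.report(metrics)

def train_and_evaluate(config):
    # One trial = one full training run for one hyperparameter combination.
    for epoch in range(config["epochs"]):
        accuracy = 1 - 1 / (epoch + 2)   # placeholder for real validation accuracy
        report({"accuracy": accuracy})   # ASHA may stop the trial early here

train_and_evaluate({"lr": 1e-3, "batch_size": 32, "epochs": 3})
assert len(reported) == 3 and reported[-1]["accuracy"] > reported[0]["accuracy"]
```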

4. SLURM Orchestration (launch_ray_tune.slurm)

This Bash script acts as the bridge between the Jean Zay supercomputer infrastructure and your Python code. It handles resource allocation and the initialization of the distributed Ray cluster.

For all GPU- and CPU-related directives, make sure that your project allocation allows you to request this hardware.

4.1. SBATCH Allocation Parameters

These directives request the necessary hardware resources from Jean Zay:

  • --nodes=2: Requests the allocation of 2 distinct compute nodes.
  • -C h100 and --gres=gpu:4: Specifies the use of nodes equipped with 4 H100 GPUs each, for a total of 8 GPUs across the cluster. You can also request a100 or v100 nodes, for example.
  • --cpus-per-task=96: Allocates 96 CPU cores per node to handle data loading and preprocessing at scale.
  • --hint=nomultithread: Disables hyperthreading (SMT) to guarantee more predictable compute performance. It is a good practice on Jean Zay.
  • --exclusive: Guarantees exclusive use of the allocated nodes, preventing other users from impacting your performance.
  • --time=00:05:00: Sets the execution time limit (walltime) of the experiment to 5 minutes. This is deliberately very low for the purposes of this tutorial; on Jean Zay you can request up to 20:00:00 (20 hours).
  • --signal=B:SIGUSR1@120: Tells SLURM to send the SIGUSR1 signal to the batch script 120 seconds before the time limit. This directive is mandatory, since it triggers a graceful shutdown of Ray (see section 4.4 for details).

4.2. ⚠️ Mandatory Modifications

Before submitting your job via sbatch, you must update these lines with your own information:

  • SLURM Project Account (-A name-project@h100): Replace name-project@h100 with your own Jean Zay compute-hour allocation ID. Example: -A abc@h100 for requesting H100 GPUs, or -A abc@a100 for A100 GPUs. Just make sure it is consistent with what you request in the -C directive.
  • Log Files (--output and --error): Modify the hardcoded paths /lustre/fsn1/projects/rech/name-project/jean-zay-id/path/to/logs/ to point to your project's own storage space.
  • Virtual Environment: Update the activation commands (cd $SCRATCH/hpo_raytune and source venv_hpo_raytune/bin/activate) to match the exact path where you installed your Python dependencies.

4.3. Ray Architecture: Head and Workers

The script automates the setup of a Master/Slave (Head/Worker) architecture, which is essential for Ray Tune to distribute training across multiple physical machines:

  • The Head Node (Ray Head):

    • The script first identifies the first node in the list allocated by SLURM (HEAD_NODE) and extracts its IP address (HEAD_IP).
      nodes=($(scontrol show hostnames $SLURM_JOB_NODELIST))
      HEAD_NODE=${nodes[0]}
      HEAD_IP=$(getent hosts $HEAD_NODE | awk '{print $1}')
    • It then launches the Ray master process (ray start --head) on this specific node, opening the communication port 6379.
      srun --nodes=1 --ntasks=1 -w $HEAD_NODE \
          ray start --head \
          --node-ip-address=$HEAD_IP \
          --port=6379 \
          --num-cpus=$SLURM_CPUS_PER_TASK \
          --disable-usage-stats \
          --block &
    • This "Head" node acts as the orchestrator: it will execute the ray_tune.py Python script and will be responsible for scheduling and distributing the different trials (hyperparameter combinations) to the worker nodes.
    python -u ray_tune.py --config config.yaml &
  • The Worker Nodes (Ray Workers):

    • Once the head node is active, the script uses a for loop to iterate through the rest of the nodes allocated by Jean Zay.
      WORKER_PIDS=()
      for node in "${nodes[@]:1}"; do
          WORKER_IP=$(getent hosts $node | awk '{ print $1 }')
          echo "Starting Ray worker on $node ($WORKER_IP)"
          [..]
      done
    • On each remaining node, it launches a worker process (ray start).
      WORKER_PIDS=()
      for node in "${nodes[@]:1}"; do
          WORKER_IP=$(getent hosts $node | awk '{ print $1 }')
          echo "Starting Ray worker on $node ($WORKER_IP)"
    
          srun --nodes=1 --ntasks=1 -w $node \
              ray start \
              --address=$HEAD_IP:6379 \
              --num-cpus=$SLURM_CPUS_PER_TASK \
              --disable-usage-stats \
              --block &
          
          WORKER_PIDS+=($!)
          sleep 5
      done
    • The crucial argument here is --address=$HEAD_IP:6379: it tells the worker nodes how to connect to the head node to form a unified cluster. These workers are the ones that will concretely execute the model training loops.

Once the Ray cluster is formed and all connections are established, the Bash script launches the Python optimization in the background and waits for its completion with wait $PYTHON_PID. Once the Python script has finished (regardless of its exit code RC), the script cleanly shuts down all processes in the distributed cluster with ray stop.

4.4. Graceful Ray Shutdown

As your experiments can be very long, your Ray Tune experiment should terminate properly even if the time limit is exceeded. To this end, the function propagate_signal catches the SIGUSR1 signal sent by SLURM and forwards a SIGINT to the Python process, which Ray Tune handles by shutting down the experiment gracefully.

function propagate_signal() {
    echo "SIGUSR1 signal received at T-120s. Sending SIGINT to Python..." | tee -a $RAY_LOG_FILE
    kill -SIGINT $PYTHON_PID || true
}
trap propagate_signal SIGUSR1
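On the Python side, the same idea can be illustrated with the standard library. This is not the actual Ray handler, only a minimal stdlib sketch of registering a handler and reacting to the signal:

```python
import os
import signal

caught = []

def handler(signum, frame):
    # In ray_tune.py, this is the point where Ray Tune checkpoints and exits cleanly.
    caught.append(signum)

signal.signal(signal.SIGUSR1, handler)   # register, like `trap ... SIGUSR1` in bash
os.kill(os.getpid(), signal.SIGUSR1)     # simulate SLURM sending the signal
assert caught == [signal.SIGUSR1]
```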

If you want your SLURM script to relaunch the job automatically when not all trials have been completed, go to the bottom of the script and uncomment the line:

if [ $RC -eq 12 ]; then
    echo "Ray Tune incomplete (exit code 12) → should automatically requeue Slurm job to complete remaining trials!"
    #scontrol requeue $SLURM_JOB_ID
    exit 0
fi

This feature is very powerful, since it allows you to fully automate the experiment.

WARNING: If you set a huge number of num_samples, the job will be requeued again and again, which can lead to high compute resource consumption.


🚀 Getting Started on Jean Zay

To use this tutorial, follow these steps:

  1. Prepare the Environment: Create a virtual environment and install the dependencies from requirements.py. Example using python venv:

    python -m venv venv_hpo_raytune
    source ./venv_hpo_raytune/bin/activate
    pip install -r requirements.py
    deactivate
  2. Configure Paths: Update the STORAGE_PATH in ray_tune.py and the #SBATCH output paths in launch_ray_tune.slurm to point to your specific project directories on /lustre or $SCRATCH.

  3. Submit the Job:

    sbatch launch_ray_tune.slurm
  4. Monitor Results: Ray Tune will log results to the directory specified in the configuration, which can be viewed via TensorBoard.

    You can also monitor the job queue with watch squeue -u your-jean-zay-id, but be careful not to monopolize the Jean Zay front-end (login node) resources.


💡 Some tips

  • If your experiment hasn't finished before the time limit --time set in launch_ray_tune.slurm, simply resubmit your SLURM script with the same configuration: the Python script has a restore feature that resumes unfinished experiments.
  • If you want to start an experiment now but pursue it later, set a very large num_samples in config.yaml. Even once the time limit --time is reached, you will be able to relaunch it later (again with the same parameters and configuration).
