
Automated Hyperparameter Optimization (HPO) with Ray Tune#

Ray is an open-source framework designed to scale Python and ML/DL workloads seamlessly across clusters and cloud environments. Ray Tune, part of the Ray ecosystem, enables efficient hyperparameter tuning at any scale. It integrates with popular machine learning frameworks such as PyTorch, TensorFlow, Keras, and XGBoost, leveraging proven search algorithms to identify optimal hyperparameter combinations for objectives like minimizing loss or maximizing accuracy. Additionally, Ray Tune provides flexible scheduling and resource management to optimize computational efficiency and reduce experiment time.

Introduction#

This guide focuses on Hyperparameter Optimization (HPO) for fine-tuning large language models.

You will experiment with both manual and automated approaches to explore how different hyperparameters affect model performance and training cost.

  1. Baseline manual experiment

  2. Automated HPO with Ray Tune using ASHA scheduler

  3. Additional experiments using Population-Based Training (PBT) and Bayesian Optimization

Note

This guide focuses on running the experiments step by step. For deeper insights into the scripts, internal configurations, and implementation details, please see the accompanying repository Kaust-rccl/HPO with Ray

Key Concepts#

Before we proceed, let’s review some essential concepts:

What is Hyperparameter Optimization (HPO)?#

  • It is the process of finding the set of hyperparameters (such as learning rate, batch size, and weight decay) that leads to the best model performance.

  • Proper tuning can significantly reduce training time and GPU usage, and improve evaluation metrics.

| Hyperparameter | Objective | Details |
|----------------|-----------|---------|
| Learning Rate (lr) | Controls how much the model updates weights after each training step. | Too high → unstable training. Too low → very slow convergence. |
| Weight Decay (wd) | A form of regularization that prevents over-fitting by penalizing large weights. | |
| Batch Size (bs) | Number of samples processed before updating model weights. | Larger batches can speed up training but need more GPU memory. |

Evaluation Metrics#

| Metric | Details |
|--------|---------|
| Evaluation Loss (eval_loss) | Measures how well the model predicts on the validation dataset. Lower is better (indicates better generalization). |
| GPU Hours | Total amount of GPU time consumed. 1 GPU hour = 1 GPU used for 1 hour (e.g., 4 GPUs × 30 minutes = 2 GPU hours). Useful for comparing cost-efficiency of different methods. |
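
To make the GPU-hours definition concrete, here is a tiny helper (not part of the workshop scripts) that converts a run's GPU count and wall-clock time into GPU hours; it reproduces the 4 GPUs × 30 minutes example above.

def gpu_hours(num_gpus: int, runtime_seconds: float) -> float:
    """Convert a run's wall-clock time into GPU hours."""
    return num_gpus * runtime_seconds / 3600.0

# 4 GPUs for 30 minutes -> 2.0 GPU hours, matching the example above.
print(gpu_hours(4, 30 * 60))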

Introduction to Ray Tune (Automated HPO Framework)#

It is a Python library for distributed hyperparameter optimization that automates running multiple experiments in parallel and selecting the best configurations. It saves time and GPU resources by intelligently stopping poor-performing trials early and focusing on promising ones.
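
The overall pattern looks roughly like the following minimal sketch. The objective function and search-space bounds here are illustrative toys, not the workshop's training script, which fine-tunes a real model instead.

from ray import tune

def objective(config):
    # Stand-in for a real fine-tuning loop: pretend the loss depends on lr and wd.
    loss = (config["lr"] - 1e-4) ** 2 + config["wd"]
    return {"eval_loss": loss}  # function trainables may return their final metrics as a dict

tuner = tune.Tuner(
    objective,
    param_space={
        "lr": tune.loguniform(1e-6, 1e-3),
        "wd": tune.choice([0.0, 0.01]),
    },
    tune_config=tune.TuneConfig(metric="eval_loss", mode="min", num_samples=8),
)
results = tuner.fit()
print(results.get_best_result().config)  # hyperparameters of the best trial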

Ray Tune Schedulers Overview#

Schedulers decide how trials are run, paused, or stopped:

| Scheduler | Details |
|-----------|---------|
| ASHA (Asynchronous Successive Halving Algorithm) | Starts many trials with different hyperparameter combinations, then periodically stops the worst-performing trials early to free resources for better ones. Best used for fast, exploratory HPO where you want to test many configurations quickly. |
| PBT (Population-Based Training) | Starts with a “population” of trials, then periodically copies the weights and hyperparameters from top-performing trials to worse ones. Also mutates (perturbs) hyperparameters dynamically during training. Best used for long-running training where hyperparameters might need to change over time. The “best hyperparameters” at the end are only the final phase’s values; the best weights depend on the entire sequence of changes. |
| Bayesian (Bayesian Optimization with HyperBand, BOHB) | Combines Bayesian Optimization (learns from past trials to suggest better hyperparameters) with HyperBand (an efficient early-stopping strategy). Starts with many short trials and promotes the best ones for longer training, refining them with probabilistic guidance. Best used when you have a limited GPU budget and want to balance smart search with efficient resource use. Unlike ASHA, BOHB does not just explore randomly: it builds a model of the search space and chooses configurations based on predicted performance. |
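
For reference, the three schedulers in the table above could be constructed along these lines. This is a hedged sketch: the exact arguments used in the workshop scripts live in the repository, and BOHB additionally requires the ConfigSpace/HpBandSter extras to be installed.

from ray import tune
from ray.tune.schedulers import ASHAScheduler, PopulationBasedTraining, HyperBandForBOHB
from ray.tune.search.bohb import TuneBOHB

# ASHA: start many trials, aggressively stop the worst ones early.
asha = ASHAScheduler(metric="eval_loss", mode="min",
                     grace_period=1, max_t=5, reduction_factor=2)

# PBT: periodically clone top trials into poor ones and perturb (mutate) their hyperparameters.
pbt = PopulationBasedTraining(
    metric="eval_loss", mode="min",
    perturbation_interval=1,
    hyperparam_mutations={"lr": tune.loguniform(5e-6, 2e-4), "wd": [0.0, 0.01]},
)

# BOHB: HyperBand-style early stopping paired with a Bayesian search algorithm.
bohb_scheduler = HyperBandForBOHB(metric="eval_loss", mode="min", max_t=5)
bohb_search = TuneBOHB(metric="eval_loss", mode="min")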

Introduction to DeepSpeed#

DeepSpeed is an optimization library designed for scaling and accelerating deep learning training, especially for large models like BLOOM.

In this workshop, DeepSpeed is used to:

  • Reduce memory usage via ZeRO Stage 3 optimization.

  • Enable mixed precision training (fp16) for faster computation and lower memory.

  • Automatically scale batch sizes to maximize GPU utilization.

  • Train large models efficiently on limited hardware (e.g., 1–2 GPUs).

It integrates seamlessly with Hugging Face’s Trainer and requires only a config file — no modification to training code is needed.
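
As a rough illustration of that integration, the sketch below shows the Hugging Face Trainer pointed at a DeepSpeed config file. The model name, output path, and config path are placeholders rather than the workshop's exact settings.

from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments

model_name = "bigscience/bloom-560m"  # placeholder BLOOM checkpoint for illustration
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

args = TrainingArguments(
    output_dir="outputs",
    per_device_train_batch_size=1,
    fp16=True,                          # mixed precision, as in the workshop's DeepSpeed setup
    deepspeed="config/ds_config.json",  # ZeRO-3 + fp16 settings come entirely from this file
)

# The training code itself is unchanged; only the `deepspeed` argument points to the config.
trainer = Trainer(model=model, args=args, tokenizer=tokenizer)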

Note

For a full breakdown of the DeepSpeed config used in this workshop, see Deepspeed configuration in kaust-rccl/hpo-with-ray.

Initial Setup#

This repository Kaust-rccl/HPO with Ray is organized into modular directories for code, configuration, and experiments.

Start by cloning the repo:

git clone https://github.com/kaust-rccl/hpo-with-ray.git
cd hpo-with-ray/deepspeed

Repository Structure

.
├── deepspeed/
│   ├── config/                  # DeepSpeed configuration files
│   │   ├── ds_config.json       # ZeRO-3 + FP16 training config
│   │   └── README.md            # Explanation of the config fields
│
├── experiments/                # SLURM job scripts and run setups
│   ├── manual/                 # Manual grid search HPO
│   │   ├── bloom_hpo_manual.slurm
│   │   └── README.md
│   └── raytune/
│       ├── scheduler/
│       │   ├── asha/           # ASHA-based Ray Tune setup
│       │   │   ├── head_node_raytune_asha_hpo.slurm
│       │   │   ├── worker_node_raytune_asha_hpo.slurm
│       │   │   └── README.md
│       │   ├── bayesian/       # BOHB setup (Bayesian Optimization with HyperBand)
│       │   │   └── README.md
│       │   └── pbt/            # Population-Based Training setup
│       │       └── README.md
│       └── README.md           # Ray Tune general overview
│
├── scripts/                    # Python training scripts
│   ├── manual/
│   │   ├── bloom_hpo_manual.py # Runs single grid search config
│   │   └── logs_parser.py      # Parses manual run logs into CSV
│   └── raytune/
│       ├── scheduler/
│       │   ├── asha/raytune_asha_hpo.py
│       │   ├── bayesian/README.md
│       │   └── pbt/README.md
│       └── README.md           # Ray Tune script overview
│
└── README.md                   # Main workshop overview and grouping instructions

Environment Setup#

To run the Ray Tune experiments, you’ll need a properly configured Conda environment.

  1. If you haven’t installed Conda yet, please follow the Using Conda on Ibex guide to get started.

  2. Build the required Conda environment from the recommended YAML file in the project directory, using the command:

conda env create -f environment/hpo-raytune.yml

Note

The Conda environment should be built on an allocated GPU node. Please ensure you allocate a GPU node before starting the build.

Running Experiments with Ray Tune#

In this project, you will experiment with the three schedulers: ASHA, Bayesian (BOHB), and PBT.

Note

All runs were performed using the same SQuAD subset, model configuration, and DeepSpeed setup for fair comparison.
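
For orientation, loading a small SQuAD subset with Hugging Face datasets looks roughly like the sketch below; the workshop scripts define the actual subset size and preprocessing.

from datasets import load_dataset

# Illustrative 1% training slice; the workshop's scripts choose their own subset.
squad_subset = load_dataset("squad", split="train[:1%]")
print(len(squad_subset), squad_subset[0]["question"])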

Baseline: Manual Experiment#

In this baseline experiment, we manually perform hyperparameter optimization (HPO) by iterating through a predefined grid of parameters, including learning rate, batch size, and weight decay.
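
Conceptually, the manual run expands the grid into every (lr, bs, wd) combination and launches one training run per combination, roughly as in this illustrative Python sketch (the actual iteration is driven by the SLURM script):

from itertools import product

learning_rates = [1e-5, 2e-4, 5e-6]
batch_sizes = [1, 2]
weight_decays = [0.0, 0.01]

# 3 x 2 x 2 = 12 combinations, each trained and evaluated once.
for i, (lr, bs, wd) in enumerate(product(learning_rates, batch_sizes, weight_decays), start=1):
    print(f"combo {i}: lr={lr} bs={bs} wd={wd}")  # the real script launches a fine-tuning run here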

Experiment Setup#

  • The SLURM script bloom_hpo_manual.slurm manages job submission, environment setup, and iteration control.

  • The Python script (bloom_hpo_manual.py) handles data preprocessing, model fine-tuning, and evaluation.

Running the Experiment#

To run the manual experiment job:

cd hpo-with-ray/deepspeed/experiments/manual
sbatch bloom_hpo_manual.slurm

Results#

  • The outputs are logged under the logs/ directory.

  • Here is an example output for reference:

    | #  | lr     | bs | wd   | eval_loss          | runtime_s |
    |----|--------|----|------|--------------------|-----------|
    | 1  | 1e-05  | 1  | 0.0  | 9.720768928527832  | 963.49    |
    | 2  | 1e-05  | 1  | 0.01 | 9.720768928527832  | 962.06    |
    | 3  | 1e-05  | 1  | 0.01 | 9.720768928527832  | 962.63    |
    | 4  | 1e-05  | 2  | 0.0  | 10.004271507263184 | 600.88    |
    | 5  | 1e-05  | 2  | 0.0  | 10.004271507263184 | 600.64    |
    | 6  | 1e-05  | 2  | 0.01 | 10.004271507263184 | 603.96    |
    | 7  | 1e-05  | 2  | 0.01 | 10.004271507263184 | 604.13    |
    | 8  | 0.0002 | 1  | 0.0  | 30.220291137695312 | 1109.93   |
    | 9  | 0.0002 | 1  | 0.0  | 30.220291137695312 | 1110.14   |
    | 10 | 0.0002 | 1  | 0.01 | 30.220291137695312 | 1088.39   |
    | 11 | 0.0002 | 1  | 0.01 | 30.220291137695312 | 1088.09   |
    | 12 | 0.0002 | 2  | 0.0  | 21.152585983276367 | 725.37    |
    | 13 | 0.0002 | 2  | 0.01 | 21.152585983276367 | 727.05    |
    | 14 | 5e-06  | 1  | 0.0  | 9.334717750549316  | 911.04    |
    | 15 | 5e-06  | 1  | 0.01 | 9.334717750549316  | 917.87    |
    | 16 | 5e-06  | 2  | 0.0  | 9.569835662841797  | 513.53    |
    | 17 | 5e-06  | 2  | 0.01 | 9.569835662841797  | 518.2     |

Additional Tuning#

  1. You can modify the hyperparameter grid arrays defined inside bloom_hpo_manual.slurm:

    LRs=(1e-5 2e-4 5e-6)
    BSs=(1 2)
    WDs=(0.0 0.01)
    

Automated HPO with Ray Tune Using ASHA Scheduler#

In this exercise, we perform automated hyperparameter optimization using Ray Tune’s ASHA (Asynchronous Successive Halving Algorithm) scheduler. Unlike the manual grid search, ASHA runs multiple trials concurrently and stops poor-performing trials early, freeing resources for more promising ones.

Experiment Setup#

Breaking Down the Building Blocks#

Training Python File

  • Loads and preprocesses the dataset.

  • Sets up the model.

  • Integrates Ray Tune.

  • Defines the hyperparameter search space:

    "train_loop_config": {
        "lr": tune.loguniform(5e-6, 2e-4),
        "per_device_bs": tune.choice([1, 2]),
        "wd": tune.choice([0.0, 0.01])
    }
    
  • Configures the ASHA scheduler:

    scheduler = tune.schedulers.ASHAScheduler(
        metric="eval_loss",
        mode="min",
        grace_period=1,
        max_t=5,
        reduction_factor=2
    )
    

SLURM Scripts

  • Follows preparation steps similar to the manual run (environment, CUDA, Conda, logging).

  • Adjusted to enable distributed and concurrent Ray trials.

  • Head Node Script:
    • Logs start/end times via trap.

    • Starts a Ray head node with dynamic dashboard/worker ports.

    • Spawns worker jobs automatically (worker_node_v100.slurm).

    • Runs bloom_ray_tune.py once; Ray schedules trials.

  • Worker Node Script:
    • Not used in manual HPO.

    • Joins the Ray head node to run concurrent trials.

    • Allocates full node resources (e.g., 8×V100 GPUs per worker).

Note

For additional details, refer to the GitHub repo for Ray-Tune (ASHA Scheduler) HPO.

Running the Experiment#

  1. To run the ASHA experiment job, make sure you are in the directory of the experiment:

    cd hpo-with-ray/deepspeed/experiments/raytune/scheduler/asha
    
  2. Submit the job using sbatch, optionally overriding the search-space hyperparameters with environment variables (a sketch of how the training script might read these overrides appears after this list):

    LR_LOWER=1e-5 \
    LR_UPPER=2e-4 \
    BS_CHOICES="1 2" \
    WD_CHOICES="0.0 0.01" \
    sbatch head_node_raytune_asha_hpo.slurm
    
  3. Monitor the job in the queue with:

    squeue --me
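
The sketch below shows how a training script could consume the environment-variable overrides from step 2. It is an assumption about the mechanism; the actual parsing in raytune_asha_hpo.py may differ.

import os
from ray import tune

lr_lower = float(os.environ.get("LR_LOWER", "5e-6"))
lr_upper = float(os.environ.get("LR_UPPER", "2e-4"))
bs_choices = [int(x) for x in os.environ.get("BS_CHOICES", "1 2").split()]
wd_choices = [float(x) for x in os.environ.get("WD_CHOICES", "0.0 0.01").split()]

# This dict would then be passed to Ray Tune as the search space.
search_space = {
    "train_loop_config": {
        "lr": tune.loguniform(lr_lower, lr_upper),
        "per_device_bs": tune.choice(bs_choices),
        "wd": tune.choice(wd_choices),
    }
}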
    

Results#

  1. Navigate to and open the Ray Tune log file (produced by the head SLURM script):

    cd ./logs
    cat ray_head_bloom_5epochs-<jobid>.out
    
  2. Find the logged job start and finish times; they should look like:

    ===== JOB 39567495 START  : yyyy-mm-dd hh:mm:ss +03 =====
    ...
    ===== JOB 39567495 FINISH : yyyy-mm-dd hh:mm:ss +03 =====
    
  3. Scroll through the log to locate the Ray Tune trials table (ASHA prints it automatically); it will look similar to:

    ╭─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
    │ Trial name              status         train_loop_config/lr     ...fig/per_device_bs     train_loop_config/wd     iter     total time (s)     eval_loss │
    ├─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
    │ TorchTrainer_c89517f2   TERMINATED              5.39429e-05                        2                     0           5            484.505      10.2314  │
    │ TorchTrainer_46b3bb6c   TERMINATED              5.64985e-06                        1                     0.01        5            793.556       9.27868 │
    │ ...                                                                                                                                                      │
    ╰─────────────────
    
  4. Extract trial details to fill the following table:

    | Combo ID | Learning Rate (lr) | Batch Size (bs) | Weight Decay (wd) | Eval Loss | Runtime (s) |
    |----------|--------------------|-----------------|-------------------|-----------|-------------|
    | 1        |                    |                 |                   |           |             |

  • Here is an example output for reference:

    | Combo ID | Learning Rate (lr) | Batch Size (bs) | Weight Decay (wd) | Eval Loss | Runtime (s) |
    |----------|--------------------|-----------------|-------------------|-----------|-------------|
    | 1  | 3.32647e-05 | 1 | 0    | 13.8639 | 198.073 |
    | 2  | 2.59232e-05 | 1 | 0    | 12.8137 | 180.903 |
    | 3  | 2.2648e-05  | 2 | 0    | 9.95289 | 454.752 |
    | 4  | 1.19538e-05 | 1 | 0.01 | 9.85759 | 642.327 |
    | 5  | 1.6357e-05  | 2 | 0    | 9.74721 | 338.861 |
    | 6  | 1.01433e-05 | 2 | 0    | 9.75011 | 366.073 |
    | 7  | 5.15709e-05 | 1 | 0    | 16.5181 | 192.285 |
    | 8  | 0.000127644 | 1 | 0    | 19.1133 | 180.044 |
    | 9  | 0.000187984 | 2 | 0    | 17.7984 | 108.882 |
    | 10 | 1.20387e-05 | 1 | 0    | 9.49014 | 336.217 |
    | 11 | 3.23901e-05 | 1 | 0.01 | 13.7576 | 184.294 |
    | 12 | 0.000153832 | 2 | 0.01 | 16.9441 | 107.262 |

  5. At the bottom of the log, find the Best Trial Result printed by Ray Tune; it should be similar to:

    {'eval_loss': 9.490140914916992, 'eval_runtime': 1.1291, 'eval_samples_per_second': 88.564, 'eval_steps_per_second': 6.2, 'epoch': 2.0, 'timestamp': 1755163992, 'checkpoint_dir_name': None, 'done': True, 'training_iteration': 2, 'trial_id': '9e128c22', 'date': '2025-08-14_12-33-12', 'time_this_iter_s': 155.61834907531738, 'time_total_s': 336.2171709537506, 'pid': 1152287, 'hostname': 'gpu211-18', 'node_ip': '10.109.25.103', 'config': {'train_loop_config': {'lr': 1.2038662726466814e-05, 'per_device_bs': 1, 'wd': 0.0}}, 'time_since_restore': 336.2171709537506, 'iterations_since_restore': 2, 'experiment_tag': '10_lr=0.0000,per_device_bs=1,wd=0.0000'}
    
  • Filling it into a table:

    | Best Learning Rate (lr) | Best Batch Size (bs) | Best Weight Decay (wd) | Best Eval Loss | Total Runtime (s) | Epochs |
    |-------------------------|----------------------|------------------------|----------------|-------------------|--------|
    | 1.20387e-05 | 1 | 0.0 | 9.49014 | 336.217 | 2 |
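
Instead of scrolling through the log, the same best-trial information can be queried programmatically from the experiment's results directory. This is a hedged sketch: the path below is a placeholder for the run's actual output directory under ~/ray_results.

from ray.tune import ExperimentAnalysis

analysis = ExperimentAnalysis("~/ray_results/<experiment_dir>")  # placeholder path
best_trial = analysis.get_best_trial(metric="eval_loss", mode="min")
print(best_trial.config["train_loop_config"])  # best lr / per_device_bs / wd
print(best_trial.last_result["eval_loss"],
      best_trial.last_result["time_total_s"])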

Additional Experiments#

For the complete set of HPO experiments, including Population-Based Training (PBT) and Bayesian Optimization, please refer to the workshop section below.

The following results are shared as a reference to illustrate the expected outcomes of the PBT and Bayesian scheduler experiments:

Note

These experiments were run on A100 GPUs.

  1. Results for Automated HPO with Ray Tune Using Population-Based Training (PBT)

    | Combo ID | Learning Rate (lr) | Batch Size (bs) | Weight Decay (wd) | Eval Loss | Runtime (s) |
    |----------|--------------------|-----------------|-------------------|-----------|-------------|
    | 1  | 1.47127e-05 | 1 | 0.01 | 10.3883 | 1899.7  |
    | 2  | 9.7152e-06  | 1 | 0    | 9.70347 | 1033.4  |
    | 3  | 7.80525e-06 | 1 | 0    | 9.57514 | 1139.95 |
    | 4  | 1.4623e-05  | 2 | 0.01 | 9.8301  | 714.587 |
    | 5  | 5.02773e-05 | 1 | 0    | 20.8425 | 1178.73 |
    | 6  | 0.000112314 | 2 | 0.01 | 11.7429 | 776.113 |
    | 7  | 1.11141e-05 | 2 | 0    | 10.0171 | 1055.79 |
    | 8  | 1.61802e-05 | 1 | 0.01 | 11.3779 | 1165.33 |
    | 9  | 2.6012e-05  | 2 | 0.01 | 10.0428 | 779.346 |
    | 10 | 2.55566e-05 | 1 | 0    | 14.8689 | 1217.76 |
    | 11 | 9.26179e-06 | 2 | 0    | 9.97798 | 630.755 |
    | 12 | 2.75884e-05 | 1 | 0.01 | 15.4906 | 1137.58 |

    | Best Learning Rate (lr) | Best Batch Size (bs) | Best Weight Decay (wd) | Best Eval Loss | Total Runtime (s) |
    |-------------------------|----------------------|------------------------|----------------|-------------------|
    | 7.805253063551074e-06 | 1 | 0.0 | 9.575139045715332 | 1139.949450492859 |

  2. Results for Automated HPO with Ray Tune Using Bayesian Optimization

    | Combo ID | Learning Rate (lr) | Batch Size (bs) | Weight Decay (wd) | Eval Loss | Runtime (s) |
    |----------|--------------------|-----------------|-------------------|-----------|-------------|
    | 1  | 1.85055e-05 | 1 | 0    | 10.2312 | 472.107 |
    | 2  | 6.96989e-05 | 2 | 0    | 10.4519 | 349.592 |
    | 3  | 1.64635e-05 | 2 | 0.01 | 9.47936 | 379.546 |
    | 4  | 1.12689e-05 | 1 | 0    | 9.3748  | 439.334 |
    | 5  | 7.12641e-06 | 1 | 0    | 9.56968 | 834.863 |
    | 6  | 0.000110208 | 2 | 0.01 | 11.7313 | 463.481 |
    | 7  | 5.41714e-05 | 1 | 0    | 16.9458 | 177.505 |
    | 8  | 8.38166e-05 | 1 | 0.01 | 19.6679 | 187.143 |
    | 9  | 7.2306e-06  | 1 | 0    | 9.52462 | 665.797 |
    | 10 | 5.23056e-05 | 1 | 0.01 | 16.5229 | 186.682 |
    | 11 | 2.50368e-05 | 1 | 0    | 13.0254 | 851.923 |
    | 12 | 0.000116277 | 2 | 0    | 19.1532 | 421.786 |

    | Best Learning Rate (lr) | Best Batch Size (bs) | Best Weight Decay (wd) | Best Eval Loss | Total Runtime (s) | Epochs |
    |-------------------------|----------------------|------------------------|----------------|-------------------|--------|
    | 1.1268857461796244e-05 | 1 | 0.0 | 9.374804496765137 | 439.33418583869934 | 1 |


Workshop Reference and Next Steps#

Overview#

This workshop focuses on Hyperparameter Optimization (HPO) for fine-tuning large language models. You will experiment with both manual and automated approaches to explore how different hyperparameters affect model performance and training cost.

Note

Please follow the workshop Kaust-rccl/HPO with Ray GitHub repo.

Team Grouping & HPO Assignment Instructions#

In this workshop, you’ll work in teams of 3 students. Each group will:

  1. Choose a hyperparameter range for:
    • Learning Rate (lr)

    • Weight Decay (wd)

    • Batch Size (bs)

  2. Divide up the HPO strategies as follows:
    • Member 1: Automated HPO with ASHA Scheduler

    • Member 2: Automated HPO with Population-Based Training (PBT)

    • Member 3: Automated HPO with Bayesian Optimization (BOHB)

  3. Run the experiments using your assigned method.

  4. At the end, collect results, compare them as a team, and fill in the provided group summary.

Group Submission Checklist#

Each group must submit the following:

  • ☐ A filled results table from each method.

  • ☐ Quiz answers from each scheduler’s README.

  • ☐ A 5–7 line comparison discussing:

    • Which method found the best configuration?

    • Which used fewer GPU-hours?

    • Which was faster overall?

    • What would you use for real-world tuning?

Cost Comparison (Fill-in Template)#

You can use this format to summarize and compare results across methods, and to justify your preferred tuning strategy.

| Run Type | Eval Loss (30 Epochs) | Runtime to find best HP (min) | # GPUs | GPU Minutes | Cost Ratio (Ray/Manual) |
|----------|-----------------------|-------------------------------|--------|-------------|-------------------------|
| Manual Best         | 11.7463 | 177 | 2 | 354 | 1 (reference) |
| Ray Best (ASHA)     |         |     |   |     |               |
| Ray Best (PBT)      |         |     |   |     |               |
| Ray Best (Bayesian) |         |     |   |     |               |

Note

Cost ratio is based on total GPU time consumed to find the best configuration (e.g., Ray GPU-minutes / Manual GPU-minutes).
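
A quick way to compute the last two columns, assuming the runtime column is in minutes (consistent with the 2 × 177 = 354 GPU-minutes reference row); the Ray values are left for you to fill in from your own runs.

def gpu_minutes(num_gpus: int, runtime_minutes: float) -> float:
    return num_gpus * runtime_minutes

manual = gpu_minutes(2, 177)    # Manual Best reference row: 354 GPU-minutes
ray_run = gpu_minutes(2, 0.0)   # fill in your own Ray run's GPU count and runtime
cost_ratio = ray_run / manual   # Ray GPU-minutes / Manual GPU-minutes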

Warning

⚠️ Do not run multiple experiments simultaneously. This may cause job failures.