Automated Hyperparameter Optimization (HPO) with Ray Tune#
Ray is an open-source framework designed to scale Python and ML/DL workloads seamlessly across clusters and cloud environments. Ray Tune, part of the Ray ecosystem, enables efficient hyperparameter tuning at any scale. It integrates with popular machine learning frameworks such as PyTorch, TensorFlow, Keras, and XGBoost, leveraging proven search algorithms to identify optimal hyperparameter combinations for objectives like minimizing loss or maximizing accuracy. Additionally, Ray Tune provides flexible scheduling and resource management to optimize computational efficiency and reduce experiment time.
Introduction#
This guide focuses on Hyperparameter Optimization (HPO) for fine-tuning large language models.
You will experiment with both manual and automated approaches to explore how different hyperparameters affect model performance and training cost.
Note
This guide focuses on running the experiments step by step. For deeper insights into the scripts, internal configurations, and implementation details, please see the accompanying repository: Kaust-rccl/HPO with Ray.
Key Concepts#
Before we proceed, let’s review some essential concepts.
What is Hyperparameter Optimization (HPO)?#
Hyperparameter optimization is the process of finding the set of hyperparameters (such as learning rate, batch size, and weight decay) that yields the best model performance.
Proper tuning can significantly reduce training time and GPU usage, and improve evaluation metrics.
| Hyperparameter | Objective | Details |
|---|---|---|
| Learning Rate (lr) | Controls how much the model updates weights after each training step. | Too high → unstable training. Too low → very slow convergence. |
| Weight Decay (wd) | A form of regularization that prevents overfitting by penalizing large weights. | |
| Batch Size (bs) | Number of samples processed before updating model weights. | Larger batches can speed up training but need more GPU memory. |
Evaluation Metrics#
| Metric | Details |
|---|---|
| Evaluation Loss (eval_loss) | Measures how well the model predicts on the validation dataset. Lower is better (indicates better generalization). |
| GPU Hours | Total amount of GPU time consumed. 1 GPU hour = 1 GPU used for 1 hour (e.g., 4 GPUs × 30 minutes = 2 GPU hours). Useful for comparing the cost-efficiency of different methods. |
Introduction to Ray Tune (Automated HPO Framework)#
Ray Tune is a Python library for distributed hyperparameter optimization that automates running multiple experiments in parallel and selecting the best configurations. It saves time and GPU resources by intelligently stopping poor-performing trials early and focusing on promising ones.
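To make the workflow concrete, here is a minimal, hedged sketch of the Ray Tune Tuner API (Ray 2.x); the objective function and search ranges are illustrative placeholders, not this workshop’s actual training code:

```python
# Minimal Ray Tune sketch (Ray 2.x Tuner API). The objective function and
# ranges are illustrative placeholders, not this workshop's training code.
from ray import train, tune

def objective(config):
    # Stand-in for a fine-tuning run; the real scripts report eval_loss
    # after each training epoch.
    eval_loss = (config["lr"] * 1e4 - 0.1) ** 2 + config["wd"]
    train.report({"eval_loss": eval_loss})

tuner = tune.Tuner(
    objective,
    param_space={
        "lr": tune.loguniform(1e-6, 2e-4),  # sampled on a log scale
        "wd": tune.choice([0.0, 0.01]),     # discrete choices
    },
    tune_config=tune.TuneConfig(metric="eval_loss", mode="min", num_samples=12),
)
results = tuner.fit()
print(results.get_best_result().config)
```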
Ray Tune Schedulers Overview#
Schedulers decide how trials are run, paused, or stopped:
| Scheduler | Details |
|---|---|
| ASHA (Asynchronous Successive Halving Algorithm) | Starts many trials with different hyperparameter combinations, then periodically stops the worst-performing trials early to free resources for better ones. Best used for fast, exploratory HPO where you want to test many configurations quickly. |
| PBT (Population-Based Training) | Starts with a “population” of trials, then periodically copies the weights and hyperparameters from top-performing trials to worse ones. Also mutates (perturbs) hyperparameters dynamically during training. Best used for long-running training where hyperparameters might need to change over time. The “best hyperparameters” at the end are only the final phase’s values; the best weights depend on the entire sequence of changes. |
| BOHB (Bayesian Optimization with HyperBand) | Combines Bayesian Optimization (learns from past trials to suggest better hyperparameters) with HyperBand (an efficient early-stopping strategy). Starts with many short trials and promotes the best ones for longer training, refining them with probabilistic guidance. Best used when you have a limited GPU budget and want to balance smart search with efficient resource use. Unlike ASHA, BOHB doesn’t just explore randomly — it builds a model of the search space and chooses configurations based on predicted performance. |
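As a hedged sketch (Ray 2.x API; the numeric values are illustrative, not this workshop’s settings), attaching the ASHA scheduler to a tuning run looks like this:

```python
# Hedged sketch: attaching an ASHA scheduler to a Ray Tune run (Ray 2.x).
# max_t, grace_period, and reduction_factor values are illustrative.
from ray import tune
from ray.tune.schedulers import ASHAScheduler

asha = ASHAScheduler(
    max_t=5,             # maximum training iterations (e.g., epochs) per trial
    grace_period=1,      # minimum iterations before a trial can be stopped
    reduction_factor=2,  # at each rung, keep roughly the top 1/2 of trials
)

tune_config = tune.TuneConfig(
    metric="eval_loss",  # objective reported by each trial
    mode="min",          # lower eval_loss is better
    scheduler=asha,
    num_samples=12,      # number of hyperparameter combinations to try
)
```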
Introduction to DeepSpeed#
DeepSpeed is an optimization library designed for scaling and accelerating deep learning training, especially for large models like BLOOM.
In this workshop, DeepSpeed is used to:
- Reduce memory usage via ZeRO Stage 3 optimization.
- Enable mixed-precision training (fp16) for faster computation and lower memory use.
- Automatically scale batch sizes to maximize GPU utilization.
- Train large models efficiently on limited hardware (e.g., 1–2 GPUs).
It integrates seamlessly with Hugging Face’s Trainer and requires only a config file — no modification to training code is needed.
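As an illustration (not the exact config shipped in this repo — see deepspeed/config/ds_config.json for that), a minimal ZeRO-3 + fp16 DeepSpeed config can be passed to the Hugging Face Trainer as a plain Python dict:

```python
# Illustrative ZeRO-3 + fp16 DeepSpeed config passed to the Hugging Face
# Trainer as a dict (a path to a JSON file works too). Values are a sketch;
# see deepspeed/config/ds_config.json for the config used in this workshop.
from transformers import TrainingArguments

ds_config = {
    "zero_optimization": {"stage": 3},        # ZeRO Stage 3: partition optimizer states, gradients, parameters
    "fp16": {"enabled": True},                # mixed-precision training
    "train_batch_size": "auto",               # "auto" defers sizing to the Trainer
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}

training_args = TrainingArguments(
    output_dir="outputs",
    fp16=True,
    deepspeed=ds_config,  # no changes to the training loop are needed
)
```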
Note
For a full breakdown of the DeepSpeed config used in this workshop, see the DeepSpeed configuration README in the kaust-rccl/hpo-with-ray repo.
Initial Setup#
This repository Kaust-rccl/HPO with Ray is organized into modular directories for code, configuration, and experiments.
Start by cloning the repo:
git clone https://github.com/kaust-rccl/hpo-with-ray.git
cd hpo-with-ray/deepspeed
Repository Structure
.
├── deepspeed/
│ ├── config/ # DeepSpeed configuration files
│ │ ├── ds_config.json # ZeRO-3 + FP16 training config
│ │ └── README.md # Explanation of the config fields
│
├── experiments/ # SLURM job scripts and run setups
│ ├── manual/ # Manual grid search HPO
│ │ ├── bloom_hpo_manual.slurm
│ │ └── README.md
│ └── raytune/
│ ├── scheduler/
│ │ ├── asha/ # ASHA-based Ray Tune setup
│ │ │ ├── head_node_raytune_asha_hpo.slurm
│ │ │ ├── worker_node_raytune_asha_hpo.slurm
│ │ │ └── README.md
│ │ ├── bayesian/ # BOHB setup (Bayesian Optimization with HyperBand)
│ │ │ └── README.md
│ │ └── pbt/ # Population-Based Training setup
│ │ └── README.md
│ └── README.md # Ray Tune general overview
│
├── scripts/ # Python training scripts
│ ├── manual/
│ │ ├── bloom_hpo_manual.py # Runs single grid search config
│ │ └── logs_parser.py # Parses manual run logs into CSV
│ └── raytune/
│ ├── scheduler/
│ │ ├── asha/raytune_asha_hpo.py
│ │ ├── bayesian/README.md
│ │ └── pbt/README.md
│ └── README.md # Ray Tune script overview
│
└── README.md # Main workshop overview and grouping instructions
Environment Setup#
To run the Ray Tune experiments, you’ll need a properly configured Conda environment.
If you haven’t installed Conda yet, please follow the Using Conda on Ibex guide to get started.
Build the required Conda environment from the YAML file provided in the project directory:
conda env create -f environment/hpo-raytune.yml
Note
The Conda environment should be built on an allocated GPU node. Please ensure you allocate a GPU node before starting the build.
Running Experiments with Ray Tune#
In this project, you will experiment with the three schedulers: asha, bayesian, and pbt.
Note
All runs were performed using the same SQuAD subset, model configuration, and DeepSpeed setup for fair comparison.
Baseline: Manual Experiment#
In this baseline experiment, we manually perform hyperparameter optimization (HPO) by iterating through a predefined grid of parameters, including learning rate, batch size, and weight decay.
Experiment Setup#
The SLURM script bloom_hpo_manual.slurm manages job submission, environment setup, and iteration control, while the Python script bloom_hpo_manual.py handles data preprocessing, model fine-tuning, and evaluation.
Running the Experiment#
To run the manual experiment job:
cd hpo-with-ray/deepspeed/experiments/manual
sbatch bloom_hpo_manual.slurm
Results#
The outputs are logged under logs/-.out.
Here is an example output for reference:
| # | lr | bs | wd | eval_loss | runtime_s |
|---|---|---|---|---|---|
| 1 | 1e-05 | 1 | 0.0 | 9.720768928527832 | 963.49 |
| 2 | 1e-05 | 1 | 0.01 | 9.720768928527832 | 962.06 |
| 3 | 1e-05 | 1 | 0.01 | 9.720768928527832 | 962.63 |
| 4 | 1e-05 | 2 | 0.0 | 10.004271507263184 | 600.88 |
| 5 | 1e-05 | 2 | 0.0 | 10.004271507263184 | 600.64 |
| 6 | 1e-05 | 2 | 0.01 | 10.004271507263184 | 603.96 |
| 7 | 1e-05 | 2 | 0.01 | 10.004271507263184 | 604.13 |
| 8 | 0.0002 | 1 | 0.0 | 30.220291137695312 | 1109.93 |
| 9 | 0.0002 | 1 | 0.0 | 30.220291137695312 | 1110.14 |
| 10 | 0.0002 | 1 | 0.01 | 30.220291137695312 | 1088.39 |
| 11 | 0.0002 | 1 | 0.01 | 30.220291137695312 | 1088.09 |
| 12 | 0.0002 | 2 | 0.0 | 21.152585983276367 | 725.37 |
| 13 | 0.0002 | 2 | 0.01 | 21.152585983276367 | 727.05 |
| 14 | 5e-06 | 1 | 0.0 | 9.334717750549316 | 911.04 |
| 15 | 5e-06 | 1 | 0.01 | 9.334717750549316 | 917.87 |
| 16 | 5e-06 | 2 | 0.0 | 9.569835662841797 | 513.53 |
| 17 | 5e-06 | 2 | 0.01 | 9.569835662841797 | 518.2 |
Additional Tuning#
You can modify the hyperparameter grid arrays defined inside bloom_hpo_manual.slurm:
LRs=(1e-5 2e-4 5e-6)
BSs=(1 2)
WDs=(0.0 0.01)
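Every combination of these arrays is one training run (3 × 2 × 2 = 12 configurations). A minimal Python sketch of the enumeration the script performs, assuming the same three arrays:

```python
# Sketch of the grid enumeration performed by bloom_hpo_manual.slurm,
# assuming the same three arrays: every combination is one training run.
from itertools import product

LRS = [1e-5, 2e-4, 5e-6]
BSS = [1, 2]
WDS = [0.0, 0.01]

for i, (lr, bs, wd) in enumerate(product(LRS, BSS, WDS), start=1):
    # The SLURM script launches bloom_hpo_manual.py once per combination.
    print(f"run {i}: lr={lr}, bs={bs}, wd={wd}")
```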
Automated HPO with Ray Tune Using ASHA Scheduler#
In this exercise, we perform automated hyperparameter optimization using Ray Tune’s ASHA (Asynchronous Successive Halving Algorithm) scheduler. Unlike the manual grid search, ASHA runs multiple trials concurrently and stops poor-performing trials early, freeing resources for more promising ones.
Experiment Setup#
The training is handled by the Python script raytune_asha_hpo.py, while job orchestration on the HPC cluster is handled by two SLURM scripts:
Head node launcher: head_node_raytune_asha_hpo.slurm
Worker node launcher: worker_node_raytune_asha_hpo.slurm
Breaking Down the Building Blocks#

| Component | Description |
|---|---|
| Training Python file | raytune_asha_hpo.py handles data preprocessing, model fine-tuning, evaluation, and reporting metrics to Ray Tune. |
| SLURM scripts | head_node_raytune_asha_hpo.slurm starts the Ray head node and launches the tuning job; worker_node_raytune_asha_hpo.slurm attaches worker nodes to the Ray cluster. |
Note
For additional details, refer to the GitHub repo for Ray Tune (ASHA Scheduler) HPO.
Running the Experiment#
To run the experiment, make sure you are in the experiment’s directory:
cd hpo-with-ray/deepspeed/experiments/raytune/scheduler/asha
Submit the job using sbatch, and optionally override the search space hyperparameters using environment variables:
LR_LOWER=1e-5 \
LR_UPPER=2e-4 \
BS_CHOICES="1 2" \
WD_CHOICES="0.0 0.01" \
sbatch head_node_raytune_asha_hpo.slurm
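For reference, here is a hedged sketch of how these environment variables could be translated into a Ray Tune search space. The variable names match the sbatch command above, but the actual parsing logic in raytune_asha_hpo.py may differ; the key names (train_loop_config, lr, per_device_bs, wd) follow the trial table printed in the logs.

```python
# Hedged sketch: translating the sbatch environment variables into a Ray Tune
# search space. The exact parsing in raytune_asha_hpo.py may differ.
import os
from ray import tune

lr_lower = float(os.environ.get("LR_LOWER", "1e-5"))
lr_upper = float(os.environ.get("LR_UPPER", "2e-4"))
bs_choices = [int(x) for x in os.environ.get("BS_CHOICES", "1 2").split()]
wd_choices = [float(x) for x in os.environ.get("WD_CHOICES", "0.0 0.01").split()]

param_space = {
    "train_loop_config": {
        "lr": tune.loguniform(lr_lower, lr_upper),   # continuous, log scale
        "per_device_bs": tune.choice(bs_choices),    # discrete batch sizes
        "wd": tune.choice(wd_choices),               # discrete weight decays
    }
}
```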
Monitor the job in the queue with:
squeue --me
Results#
Navigate to the logs directory and open the Ray Tune log file (produced by the head SLURM script):
cd ./logs
cat ray_head_bloom_5epochs-<jobid>.out
Find the logged job start and finish times; they should look like:
===== JOB 39567495 START : yyyy-mm-dd hh:mm:ss +03 =====
...
===== JOB 39567495 FINISH : yyyy-mm-dd hh:mm:ss +03 =====
Scroll inside the log to locate the Ray Tune trials table (ASHA prints it automatically); it will look similar to:
Trial name             status        train_loop_config/lr   ...fig/per_device_bs   train_loop_config/wd   iter   total time (s)   eval_loss
--------------------------------------------------------------------------------------------------------------------------------------------
TorchTrainer_c89517f2  TERMINATED    5.39429e-05            2                      0                      5      484.505          10.2314
TorchTrainer_46b3bb6c  TERMINATED    5.64985e-06            1                      0.01                   5      793.556          9.27868
...
Extract trial details to fill the following table:
| Combo ID | Learning Rate (lr) | Batch Size (bs) | Weight Decay (wd) | Eval Loss | Runtime (s) |
|---|---|---|---|---|---|
| 1 | | | | | |
Here is an example output for reference:
| Combo ID | Learning Rate (lr) | Batch Size (bs) | Weight Decay (wd) | Eval Loss | Runtime (s) |
|---|---|---|---|---|---|
| 1 | 3.32647e-05 | 1 | 0 | 13.8639 | 198.073 |
| 2 | 2.59232e-05 | 1 | 0 | 12.8137 | 180.903 |
| 3 | 2.2648e-05 | 2 | 0 | 9.95289 | 454.752 |
| 4 | 1.19538e-05 | 1 | 0.01 | 9.85759 | 642.327 |
| 5 | 1.6357e-05 | 2 | 0 | 9.74721 | 338.861 |
| 6 | 1.01433e-05 | 2 | 0 | 9.75011 | 366.073 |
| 7 | 5.15709e-05 | 1 | 0 | 16.5181 | 192.285 |
| 8 | 0.000127644 | 1 | 0 | 19.1133 | 180.044 |
| 9 | 0.000187984 | 2 | 0 | 17.7984 | 108.882 |
| 10 | 1.20387e-05 | 1 | 0 | 9.49014 | 336.217 |
| 11 | 3.23901e-05 | 1 | 0.01 | 13.7576 | 184.294 |
| 12 | 0.000153832 | 2 | 0.01 | 16.9441 | 107.262 |
At the bottom of the log, find the best trial result printed by Ray Tune; it should be similar to:
{'eval_loss': 9.490140914916992, 'eval_runtime': 1.1291, 'eval_samples_per_second': 88.564, 'eval_steps_per_second': 6.2, 'epoch': 2.0, 'timestamp': 1755163992, 'checkpoint_dir_name': None, 'done': True, 'training_iteration': 2, 'trial_id': '9e128c22', 'date': '2025-08-14_12-33-12', 'time_this_iter_s': 155.61834907531738, 'time_total_s': 336.2171709537506, 'pid': 1152287, 'hostname': 'gpu211-18', 'node_ip': '10.109.25.103', 'config': {'train_loop_config': {'lr': 1.2038662726466814e-05, 'per_device_bs': 1, 'wd': 0.0}}, 'time_since_restore': 336.2171709537506, 'iterations_since_restore': 2, 'experiment_tag': '10_lr=0.0000,per_device_bs=1,wd=0.0000'}
Filling it into a table:
| Best Learning Rate (lr) | Best Batch Size (bs) | Best Weight Decay (wd) | Best Eval Loss | Total Runtime (s) | Epochs |
|---|---|---|---|---|---|
| 1.20387e-05 | 1 | 0.0 | 9.49014 | 336.217 | 2 |
Additional Experiments#
For the complete set of HPO experiments, including Population-Based Training (PBT) and Bayesian Optimization, please refer to the workshop section below.
The following results are shared as a reference to illustrate the expected outcomes of the PBT and Bayesian scheduler experiments:
Note
These experiments were run on A100 GPUs.
Results for Automated HPO with Ray Tune Using Population-Based Training (PBT)
| Combo ID | Learning Rate (lr) | Batch Size (bs) | Weight Decay (wd) | Eval Loss | Runtime (s) |
|---|---|---|---|---|---|
| 1 | 1.47127e-05 | 1 | 0.01 | 10.3883 | 1899.7 |
| 2 | 9.7152e-06 | 1 | 0 | 9.70347 | 1033.4 |
| 3 | 7.80525e-06 | 1 | 0 | 9.57514 | 1139.95 |
| 4 | 1.4623e-05 | 2 | 0.01 | 9.8301 | 714.587 |
| 5 | 5.02773e-05 | 1 | 0 | 20.8425 | 1178.73 |
| 6 | 0.000112314 | 2 | 0.01 | 11.7429 | 776.113 |
| 7 | 1.11141e-05 | 2 | 0 | 10.0171 | 1055.79 |
| 8 | 1.61802e-05 | 1 | 0.01 | 11.3779 | 1165.33 |
| 9 | 2.6012e-05 | 2 | 0.01 | 10.0428 | 779.346 |
| 10 | 2.55566e-05 | 1 | 0 | 14.8689 | 1217.76 |
| 11 | 9.26179e-06 | 2 | 0 | 9.97798 | 630.755 |
| 12 | 2.75884e-05 | 1 | 0.01 | 15.4906 | 1137.58 |
| Best Learning Rate (lr) | Best Batch Size (bs) | Best Weight Decay (wd) | Best Eval Loss | Total Runtime (s) |
|---|---|---|---|---|
| 7.805253063551074e-06 | 1 | 0.0 | 9.575139045715332 | 1139.949450492859 |
Results for Automated HPO with Ray Tune Using Bayesian Optimization (BOHB)
| Combo ID | Learning Rate (lr) | Batch Size (bs) | Weight Decay (wd) | Eval Loss | Runtime (s) |
|---|---|---|---|---|---|
| 1 | 1.85055e-05 | 1 | 0 | 10.2312 | 472.107 |
| 2 | 6.96989e-05 | 2 | 0 | 10.4519 | 349.592 |
| 3 | 1.64635e-05 | 2 | 0.01 | 9.47936 | 379.546 |
| 4 | 1.12689e-05 | 1 | 0 | 9.3748 | 439.334 |
| 5 | 7.12641e-06 | 1 | 0 | 9.56968 | 834.863 |
| 6 | 0.000110208 | 2 | 0.01 | 11.7313 | 463.481 |
| 7 | 5.41714e-05 | 1 | 0 | 16.9458 | 177.505 |
| 8 | 8.38166e-05 | 1 | 0.01 | 19.6679 | 187.143 |
| 9 | 7.2306e-06 | 1 | 0 | 9.52462 | 665.797 |
| 10 | 5.23056e-05 | 1 | 0.01 | 16.5229 | 186.682 |
| 11 | 2.50368e-05 | 1 | 0 | 13.0254 | 851.923 |
| 12 | 0.000116277 | 2 | 0 | 19.1532 | 421.786 |
| Best Learning Rate (lr) | Best Batch Size (bs) | Best Weight Decay (wd) | Best Eval Loss | Total Runtime (s) | Epochs |
|---|---|---|---|---|---|
| 1.1268857461796244e-05 | 1 | 0.0 | 9.374804496765137 | 439.33418583869934 | 1 |
Workshop Reference and Next Steps#
Overview#
This workshop focuses on Hyperparameter Optimization (HPO) for fine-tuning large language models. You will experiment with both manual and automated approaches to explore how different hyperparameters affect model performance and training cost.
Note
Please follow along with the workshop in the Kaust-rccl/HPO with Ray GitHub repo.
Team Grouping & HPO Assignment Instructions#
In this workshop, you’ll work in teams of 3 students. Each group will:

1. Choose a hyperparameter range for:
   - Learning Rate (lr)
   - Weight Decay (wd)
   - Batch Size (bs)
2. Divide up the HPO strategies as follows:
   - Member 1: Automated HPO with the ASHA scheduler
   - Member 2: Automated HPO with Population-Based Training (PBT)
   - Member 3: Automated HPO with Bayesian Optimization (BOHB)
3. Run the experiments using your assigned method.
4. At the end, collect results, compare them as a team, and fill in the provided group summary.
Group Submission Checklist#
Each group must submit the following:
☐ A filled results table from each method.
☐ Quiz answers from each scheduler’s README.
☐ A 5–7 line comparison discussing:
Which method found the best configuration?
Which used fewer GPU-hours?
Which was faster overall?
What would you use for real-world tuning?
Cost Comparison (Fill-in Template)#
You can use this format to summarize and compare results across methods, and to justify your preferred tuning strategy.
| Run Type | Eval Loss (30 Epochs) | Runtime to find best HP (min) | # GPUs | GPU Minutes | Cost Ratio (Ray/Manual) |
|---|---|---|---|---|---|
| Manual Best | 11.7463 | 177 | 2 | 354 | 1 (reference) |
| Ray Best (ASHA) | | | | | |
| Ray Best (PBT) | | | | | |
| Ray Best (Bayesian) | | | | | |
Note
Cost ratio is based on total GPU time consumed to find the best configuration (e.g., Ray GPU-minutes / Manual GPU-minutes).
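A quick worked example of that arithmetic (the Ray numbers below are hypothetical placeholders; substitute your own results):

```python
# Worked example of the cost-ratio arithmetic, using the manual baseline row
# above (177 minutes on 2 GPUs). The Ray numbers are hypothetical.
manual_gpu_minutes = 177 * 2  # runtime (min) x number of GPUs = 354
ray_gpu_minutes = 120 * 2     # hypothetical ASHA run: 120 min on 2 GPUs

cost_ratio = ray_gpu_minutes / manual_gpu_minutes
print(f"Cost ratio (Ray/Manual): {cost_ratio:.2f}")  # -> 0.68
```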
Warning
⚠️ Do not run multiple experiments simultaneously in parallel. This may cause job failures.