Accelerate API by Hugging Face#
Accelerate provides an easy API to make your scripts run on any kind of distributed setting (multi-GPU on one node, multi-GPU on several nodes) while still letting you write your own training loop.
Accelerate on Ibex#
Installing Accelerate#
You’ll need to install conda first; please check the conda installation instructions.
You can save the following as a file named env.yml:
name: acc_env
channels:
- conda-forge
- pytorch
- nvidia
- anaconda
- defaults
dependencies:
- datasets=2.3.2
- python=3.9
- pip=23.2.1
- accelerate=0.29
- cudatoolkit=11.8
- transformers=4.33.1
- pytorch=2.0.1
- torchvision=0.15.2
- torchaudio=2.0.2
- pytorch-cuda=11.8
- scikit-learn=1.2.2
- evaluate=0.4.0
- pytest
Once you have created the file, run the following command to create the conda environment:
conda env create -f env.yml
Running Accelerate#
You can find an example Python training file in: complete_nlp_example.py
Launching accelerate in interactive session#
You can start by requesting an interactive session from Slurm with the desired number of GPUs. For example:
$ srun -N 1 --gres=gpu:v100:8 --time=3:0:0 --pty bash
You’ll then need to activate the conda environment.
conda activate acc_env
Finally, you can start the training process by calling Accelerate’s launcher.
accelerate launch --multi_gpu complete_nlp_example.py
Optionally, you can append --checkpointing_steps epoch to create checkpoints after each epoch.
The output should look like the following.
The following values were not passed to `accelerate launch` and had defaults used instead:
        `--num_processes` was set to a value of `8`
        `--num_machines` was set to a value of `1`
        `--mixed_precision` was set to a value of `'no'`
        `--dynamo_backend` was set to a value of `'no'`
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
epoch 0: {'accuracy': 0.6862745098039216, 'f1': 0.8134110787172011}
epoch 1: {'accuracy': 0.75, 'f1': 0.840625}
epoch 2: {'accuracy': 0.8137254901960784, 'f1': 0.8671328671328671}
MultiGPU on single node as batch job#
You can simply run Accelerate through a Slurm jobscript.
Replace <conda_installation_path> with the installation path of your conda.
#!/bin/bash
#SBATCH --job-name=multiGPU
#SBATCH --ntasks-per-node=1
#SBATCH --gpus=8
#SBATCH --gpus-per-node=8
#SBATCH --cpus-per-task=32
#SBATCH --time=00:10:00
source <conda_installation_path>/bin/activate acc_env
accelerate launch --multi_gpu complete_nlp_example.py --checkpointing_steps epoch
The output should be redirected to a slurm-####.out file.
$ cat gpu210-10-27293860.out
epoch 0: {'accuracy': 0.6862745098039216, 'f1': 0.8134110787172011}
epoch 1: {'accuracy': 0.7352941176470589, 'f1': 0.8291139240506329}
epoch 2: {'accuracy': 0.7647058823529411, 'f1': 0.8426229508196722}
MultiGPU on multiple nodes as batch job#
To scale out and run on more GPUs, we need to request multiple nodes with multiple GPUs on each. The following example jobscript demonstrates how to scale to 16 V100 GPUs across 4 nodes, with 4 GPUs per node.
#!/bin/bash
#SBATCH --job-name=multiGPU
#SBATCH --ntasks-per-node=1
#SBATCH --gpus=16
#SBATCH --gpus-per-node=4
#SBATCH --cpus-per-task=16
#SBATCH --time=00:10:00
source <conda_installation_path>/bin/activate acc_env
export master_ip=$(/bin/hostname -I | cut -d " " -f 2 )
export master_port=$(python -c 'import socket; s=socket.socket(); s.bind(("", 0)); print(s.getsockname()[1]); s.close()')
export HF_METRICS_CACHE=$PWD/cache
mkdir -p $HF_METRICS_CACHE
export TOKENIZERS_PARALLELISM=false
srun -l ./wrapper.sh
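The dense master_port one-liner in the jobscript above relies on a standard OS trick: binding a socket to port 0 asks the kernel for any free ephemeral port. Expanded into a readable sketch:

```python
# Expanded version of the master_port one-liner: bind to port 0 so the
# kernel picks a free ephemeral port, read it back, then release it so
# accelerate's rendezvous can bind to it.
import socket

def find_free_port() -> int:
    s = socket.socket()
    s.bind(("", 0))               # port 0 means "any available port"
    port = s.getsockname()[1]     # the port the kernel actually chose
    s.close()                     # free it again for the launcher to use
    return port

print(find_free_port())
```

Note there is a small window between closing the socket and Accelerate binding the port, but in practice a just-released ephemeral port is very unlikely to be reused immediately.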
The wrapper.sh, shown below, is an executable script that launches the processes on each node. It is important to relocate the cache directory to somewhere under the /ibex/user/$USER directory, where the Hugging Face evaluate library will maintain its metrics files. The library manages file locking to avoid race conditions; we have found that keeping this directory in its default location on the HOME filesystem causes issues, while /ibex/user/$USER resolves them thanks to its lower latency in fulfilling the frequent metadata queries involved.
#!/bin/bash
export LAUNCHER="accelerate launch \
--num_processes ${SLURM_GPUS} \
--num_machines ${SLURM_NNODES} \
--rdzv_backend static \
--machine_rank ${SLURM_NODEID} \
--main_process_ip ${master_ip} \
--main_process_port ${master_port} \
--same_network \
"
export SCRIPT="../complete_nlp_example.py"
export SCRIPT_ARGS=" \
--mixed_precision no
"
echo $LAUNCHER $SCRIPT $SCRIPT_ARGS
$LAUNCHER $SCRIPT $SCRIPT_ARGS