Nsight-systems#
Nsight Systems is a suite of profiling tools that replaces nvprof as CUDA releases progress. It provides a more detailed view of the workload than nvprof, and can be used to identify bottlenecks and optimize performance.
To collect the profiling information, submit a job as follows. This is the same job as above, but using Nsight Systems to profile and the machine learning module.
#!/bin/bash -l
#SABTCH --job-name=nsys
#SBATCH --time=00:30:00
#SBATCH --gres=gpu:1
#SBATCH --constraint=v100
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --partition=batch
#SBATCH --mail-type=ALL
#SBATCH --output=%x-%j-slurm.out
#SBATCH --error=%x-%j-slurm.err
module load machine_learning/2023.04
cmd="python ./train_nvtx.py"
nsys profile --trace='cuda','cublas','cudnn','osrt' --stats='true' --sample=none --export=sqlite -o profile.${SLURM_JOBID} ${cmd}
The above jobscript launches our python training using nsys
profiler. Notice that we are loading only the machine learning module. As options, the command line also accepts the tracers you would like to use to trace different API calls by your code. In the jobscript above, we are choosing to trace cuda,cublas,cudnn
API calls, and also osrt
or OS Runtime calls (e.g. I/O calls). --stats=true
allows printing a concise report in your SLURM output file for quick examination. In addition to this, the jobscript also instructs nsys
to export the output collected and a SQLlite database which the Nsight-systems visual tool can easily search.
Note
To launch the visualization of the profile on ibex glogin node (required OpenGL support)
nsight-sys profile.11264040.nsys-rep
Where profile.11264040.nsys-rep
is our profile.
The output is a stacked time series of all the resources and events traced. Hover your mouse on the event profile bar of CUDA HW(0000:b2:00.0Tesla V100-SXM2-32GB)
and you will notice how busy you GPU has been. The time series can be zoomed in to inspect the events at short time scales down to micro, even nanoseconds. You can expand the above tab to show more event in finer granularity to see timing and sequence of different kernels. (Right click on CUDA HW(0000:b2:00.0Tesla V100-SXM2-32GB)
tab and choose Show in Events View
to inspect the table of the kernels profiled).
Nsight-systems with NVTX instrumentation#
In a typical epoch of DL training, multiple mini-batches are trained, and often it is tricky to demarcate a mini-batch where it ends and the next one starts. NVIDIA Tools Extension or NVTX is a way to instrument your training script to annotate different operations of the training of a mini-batch. The code requires minimal change:
If you are using the machine learning model, you can directly add this line to your code
#load nvtx package
import nvtx
Annotate various operations of your training process
for epoch in range(5):
for i, (images, labels) in enumerate(train_loader):
with nvtx.annotate("Batch" + str(i), color="green"):
#load images and labels to device
with nvtx.annotate("Copy to device", color="red"):
images, labels = images.to(device), labels.to(device)
# Forward pass
with nvtx.annotate("Forward Pass", color="yellow"):
outputs = model(images)
# Calculate the loss
loss = criterion(outputs, labels)
# Backpropagate the loss
optimizer.zero_grad()
with nvtx.annotate("Backward Pass", color="blue"):
loss.backward()
with nvtx.annotate("Optimizer step", color="orange"):
optimizer.step()
To instruct nsys
profiler to collect the annotated profile in the training loop, the launch command will add nvtx
tracer.
nsys profile --trace='cuda','cublas','cudnn','osrt','nvtx' --stats='true' --sample=none --export=sqlite -o profile.${SLURM_JOBID} ${cmd}
Upon visualizing, you can see an annotated training profile that is easier to track with the labels and colors you selected on the script