Profiling an NVIDIA RAPIDS workflow using Nsight Systems
This documentation explains how to profile a machine learning workload using RAPIDS cuML and Nsight Systems. Example scripts are available at this repository. See the src folder for rapids_tsne.py and rapids_tsne.sh.
Quick Start
To collect profiling information, submit a job script like the one below. It is similar to the job in Nsight Systems with NVTX instrumentation, but here Nsight Systems profiles a t-SNE dimensionality reduction performed with the RAPIDS cuML library.
#!/bin/bash
#SBATCH --job-name=Nsys_rapids_tsne
#SBATCH --output=rapids_tsne_output.%j.out
#SBATCH --error=rapids_tsne_error.%j.err
#SBATCH --time=00:30:00
#SBATCH --gres=gpu:1
#SBATCH --constraint=v100
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
module load machine_learning
module load nvidia-sdk/nvhpc/25.5
# Run the RAPIDS t-SNE script with Nsight Systems profiling
cmd="python ./rapids_tsne.py"
nsys profile --trace='cuda','cublas','cudnn','osrt','nvtx' --stats=true --sample=none --export=sqlite -o profile.${SLURM_JOBID} ${cmd}
The job script above loads the machine learning and NVIDIA SDK modules, then runs the Python t-SNE script under the nsys profiler. The --trace option selects which APIs to trace. This example traces cuda, cublas, cudnn, osrt (OS runtime, e.g., I/O calls), and nvtx (the annotations described under Adding NVTX Instrumentation). --stats=true prints a concise summary report in your SLURM output file, and --sample=none disables CPU sampling. The -o option names the report file (profile.<job ID>.nsys-rep), which you can open in the Nsight Systems visual tool, and --export=sqlite additionally exports the collected data to a SQLite database for easy searching.
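The exported SQLite database can also be queried outside the visual tool. The following is a minimal sketch using Python's standard sqlite3 module. The helper names list_tables and top_kernels are hypothetical, and the table and column names (CUPTI_ACTIVITY_KIND_KERNEL with start, end, and shortName; StringIds with id and value) are assumptions about the export schema, which can differ between Nsight Systems versions. Run list_tables first to confirm what your file actually contains.

```python
import sqlite3
from contextlib import closing


def list_tables(db_path):
    """Return the names of all tables in a SQLite file, sorted by name."""
    with closing(sqlite3.connect(db_path)) as conn:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    return [name for (name,) in rows]


def top_kernels(db_path, limit=10):
    """Return (kernel name, call count, total duration in ns), longest first.

    ASSUMPTION: the export contains CUPTI_ACTIVITY_KIND_KERNEL(start, end,
    shortName) and StringIds(id, value), where shortName refers to
    StringIds.id. Verify this with list_tables() before relying on it.
    """
    query = """
        SELECT s.value AS kernel,
               COUNT(*) AS calls,
               SUM(k."end" - k."start") AS total_ns
        FROM CUPTI_ACTIVITY_KIND_KERNEL AS k
        JOIN StringIds AS s ON s.id = k.shortName
        GROUP BY s.value
        ORDER BY total_ns DESC
        LIMIT ?
    """
    with closing(sqlite3.connect(db_path)) as conn:
        return conn.execute(query, (limit,)).fetchall()
```

For example, top_kernels("profile.40118944.sqlite", limit=5) would list the five kernels with the largest accumulated run time, assuming the export was written next to the report under that name.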
Note
To visualize the profile on the Ibex glogin node (requires OpenGL support), run:
nsys-ui profile.40118944.nsys-rep
where profile.40118944.nsys-rep is your profile output file.
The output is a stacked timeline of all traced resources and events. Hover over the event bar of the CUDA HW row, for example CUDA HW(0000:8a:00.0-Tesla V100-SXM2-32GB), to see how busy your GPU has been. The PCI address in the row label depends on the GPU assigned to your job. You can zoom in to inspect events at microsecond or nanosecond scales. Expand the row to show events at finer granularity and to see the timing and sequence of individual kernels. To inspect the table of profiled kernels, right-click the CUDA HW row and choose Show in Events View.
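To make "how busy your GPU has been" concrete, the sketch below computes the fraction of a time window that is covered by at least one kernel, given (start, end) timestamps in nanoseconds. This is only an illustration of the idea behind the timeline. The function name busy_fraction is hypothetical, and the Nsight Systems visual tool may compute its own utilization display differently.

```python
def busy_fraction(intervals, window_start, window_end):
    """Fraction of [window_start, window_end) covered by at least one interval.

    intervals is an iterable of (start, end) pairs in the same time unit as
    the window bounds. Overlapping intervals are counted only once.
    """
    if window_end <= window_start:
        raise ValueError("window_end must be greater than window_start")

    busy = 0
    covered_until = window_start  # everything before this point is already counted
    for start, end in sorted(intervals):
        start = max(start, covered_until)  # skip the part already counted
        end = min(end, window_end)         # clip to the window
        if end > start:
            busy += end - start
            covered_until = end
    return busy / (window_end - window_start)
```

For instance, three kernels spanning (0, 10), (5, 20), and (30, 40) ns inside a 40 ns window keep the GPU busy for 30 ns in total, so busy_fraction returns 0.75.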

Adding NVTX Instrumentation
NVIDIA Tools Extension (NVTX) lets you annotate the different operations in your script so that they appear as named ranges in the profile. The code requires minimal changes. First, import the nvtx package in your script:
# Load nvtx package
import nvtx
You can annotate any operation in your workflow. Decorate a complete function, or use the context manager to annotate a block of code. For example, to annotate a function:
# TSNE is the RAPIDS cuML implementation
from cuml.manifold import TSNE

# Use the nvtx package to annotate the t-SNE operation for profiling
@nvtx.annotate("TSNE", color="blue")
def run_tsne(X, n_components=2, perplexity=30.0, n_iter=1000):
    """
    Run t-SNE on the dataset using RAPIDS cuML implementation.

    Parameters:
        X: Input data
        n_components: Number of dimensions for embedding
        perplexity: t-SNE perplexity parameter
        n_iter: Number of optimization iterations

    Returns:
        Embedded data in lower dimensions

    Note:
        Number of Nearest Neighbors should be at least 3 * perplexity.
    """
    n_neighbors = max(90, int(3 * perplexity))  # Ensure n_neighbors >= 3 * perplexity
    tsne = TSNE(n_components=n_components, perplexity=perplexity, n_iter=n_iter,
                random_state=23, method='fft', n_neighbors=n_neighbors)
    X_embedded = tsne.fit_transform(X)
    return X_embedded
Alternatively, use the context manager to annotate a block of code:
with nvtx.annotate("Main Execution", color="yellow"):
    # Load the Fashion-MNIST dataset from the specified directory
    X, y = load_mnist_train('data/fashion')
    # Run t-SNE dimensionality reduction on the dataset
    X_embedded = run_tsne(X, n_components=2, perplexity=30.0, n_iter=1000)
    # Print the shape of the embedded data and the first 5 points for inspection
    print("Shape of embedded data:", X_embedded.shape)
    print("First 5 embedded points:\n", X_embedded[:5])
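The decorator and context-manager forms can be combined, and ranges can be nested so that inner steps appear inside the outer range on the timeline. The sketch below shows one possible pattern, not the method used in rapids_tsne.py. The functions preprocess and pipeline are hypothetical placeholders for real work, and the no-op fallback class is an assumption added here so that the script still runs on a machine where the nvtx package is not installed.

```python
import contextlib

try:
    import nvtx
    annotate = nvtx.annotate
except ImportError:
    # ASSUMPTION: fallback for environments without nvtx. It accepts the
    # arguments used below and does nothing, both as a decorator and as a
    # context manager.
    class annotate(contextlib.ContextDecorator):
        def __init__(self, message=None, color=None, domain=None):
            self.message = message

        def __enter__(self):
            return self

        def __exit__(self, *exc):
            return False  # never suppress exceptions


@annotate("preprocess", color="green")
def preprocess(values):
    """Hypothetical step: center the values around their mean."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def pipeline(values):
    """Hypothetical workflow with an outer range and a nested inner range."""
    with annotate("pipeline", color="yellow"):
        centered = preprocess(values)  # shows up as the "preprocess" range
        with annotate("reduce", color="blue"):
            return sum(v * v for v in centered)
```

When such a script runs under nsys with nvtx included in --trace, the "preprocess" and "reduce" ranges should appear nested inside "pipeline". Without a profiler attached, the annotations have no visible effect on the result.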