Compute Nodes#
Compute nodes are the HPC-optimized servers on which the job scheduler schedules and runs jobs on users' behalf. The Ibex cluster is composed of compute nodes with various CPU and GPU microarchitectures, e.g. Intel Cascade Lake, Intel Skylake, AMD Rome, NVIDIA RTX-2080Ti, V100, A100, etc. The allocatable resources include CPU cores or GPUs, CPU memory, local fast storage on a node, and duration or wall time.
The heterogeneity of compute nodes allows users to submit various types of applications and workflows. At times, Ibex becomes a de facto choice for workflows that are not suitable to run on other KSL systems. Users should submit CPU jobs from the ilogin login nodes and GPU jobs from the glogin login nodes.
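As a quick illustration of requesting these allocatable resources, the following sbatch invocation is a minimal sketch; the resource values are placeholders, not recommendations:

```bash
# Hypothetical CPU job submitted from an ilogin node: 1 task with 32 cores,
# 64GB of memory, and a 2-hour wall time. Adjust the values to your workload.
sbatch --ntasks=1 --cpus-per-task=32 --mem=64G --time=02:00:00 jobscript.slurm
```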
CPU Compute Nodes#
Ibex has multiple CPU architectures available across its compute nodes. A single compute node, however, is always homogeneous in its CPU architecture. These CPU compute nodes differ in processor architecture, memory capacity, and other features such as the availability of local disk or large memory.
If you wish to understand more about the available processor types and their features, please see Available CPU architecture on KSL systems.
All CPU nodes are connected to the HDR100 InfiniBand high-speed network with a theoretical peak of 100 gigabits per second (or 12.5GB/s). Each compute node also has access to the shared parallel filesystem and the home filesystem. For more technical information, please refer to the Filesystems section.
The table below summarizes the CPU nodes available in the Ibex cluster. The values in the SLURM constraints column show how to target a specific type of compute node in your SLURM jobs. For more details on how to do this, please see the CPU jobs section for Ibex.
| CPU Family | CPU | Nodes | Cores/node | Clock (GHz) | FLOPS | Memory | SLURM constraints | Local storage |
|---|---|---|---|---|---|---|---|---|
| Intel Skylake | skylake | 106 | 40 | 2.60 | 32 | 350GB | intel, skylake | 744GB |
| Intel Cascade Lake | cascadelake | 106 | 40 | 2.50 | 32 | 350GB | intel, cascadelake | 744GB |
| AMD Rome | Rome | 108 | 128 | 2.00 | 32 | 475GB | amd, rome | 744GB |
Some nodes have larger memory for workloads that require loading big data into memory, e.g. some bioinformatics workloads, or data processing/wrangling that creates input data for Machine Learning and Deep Learning training jobs.
| CPU Family | CPU | Nodes | Cores/node | Clock (GHz) | FLOPS | Memory | Local storage | SLURM constraints |
|---|---|---|---|---|---|---|---|---|
| Intel Cascade Lake | cascadelake | 18 | 48 | 4.20 | 32 | 3TB | 6TB | intel, largemem, cascadelake |
| Intel Skylake | skylake | 4 | 32 | 3.70 | 32 | 3TB | 10TB | intel, largemem, skylake |
To submit a job to a particular type of compute node, a set of constraints must be used to help SLURM pick the correct one. Users can either add them to their jobscript as a SLURM directive or pass them as a command line argument to the sbatch command:
```bash
sbatch --constraint="intel&cascadelake" jobscript.slurm
```
The above specifies to SLURM that the job should run on an Intel Cascade Lake node.
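Alternatively, the constraint can be set inside the jobscript as a SLURM directive. The sketch below is illustrative only; the job name, resource values, and application command are placeholders, and it targets a large-memory Cascade Lake node as described in the table above:

```bash
#!/bin/bash
#SBATCH --job-name=largemem-example                 # placeholder job name
#SBATCH --constraint="intel&largemem&cascadelake"   # target a large-memory Cascade Lake node
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=48                          # cores available on that node type
#SBATCH --mem=2900G                                 # illustrative value within the 3TB node memory
#SBATCH --time=12:00:00                             # wall time; adjust to your workload

srun ./my_application                               # placeholder for your actual command
```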
GPU Compute Nodes#
There are GPU nodes in the Ibex cluster with GPUs of different microarchitectures. Note that all the GPUs on a single node are always of the same microarchitecture; there is no heterogeneity within a node.
At present, all GPUs in the Ibex cluster are NVIDIA GPUs. All compute nodes with GPUs have multiple GPUs, ranging from 2 to 8 per node.
If you are new to using GPUs or would like to refresh your understanding of how a GPU works, Understanding GPU architecture is a good article to start with. It is common practice to compare the performance and software capabilities of the available GPUs and match the requirements of your application to the most appropriate one for your jobs. If you have questions about which GPU to use, please contact support for guidance.
All GPU nodes on the Ibex cluster are connected to the HDR InfiniBand network. Some nodes are capable of 200 gigabits per second or 25GB/s (e.g. nodes with A100 GPUs), while the others connect at 100 gigabits per second (12.5GB/s). Some have more Network Interface Cards (NICs) than others. With more NICs on a compute node, the aggregate bandwidth for operations such as GPU Direct RDMA is higher when using multiple GPUs on each node.
Each compute node also has access to the shared parallel filesystem and the home filesystem. For more technical information, please refer to the Filesystems section.
The table below summarizes the GPU nodes available in the Ibex cluster. The SLURM constraints listed in the second table further below show how to target a specific type of GPU node in your SLURM jobs. For more details on how to do this, please see the GPU jobs section.
| Model | GPU Arch | Host CPU | Nodes | GPUs/node | Cores/node | GPU Mem | GPU Mem type | CPU Mem | GPU Clock (GHz) | CPU Clock (GHz) |
|---|---|---|---|---|---|---|---|---|---|---|
| P6000 | Pascal | Intel Haswell | 3 | 2 | 36(34) | 24GB | GDDR5X | 256GB | 1.5 | 2.3 |
| P100 | Pascal | Intel Haswell | 5 | 4 | 36(34) | 16GB | HBM2 | 256GB | 1.19 | 2.3 |
| GTX-1080Ti | Pascal | Intel Haswell | 8 | 4 | 36(34) | 11GB | GDDR5X | 256GB | 1.48 | 2.3 |
| GTX-1080Ti | Pascal | Intel Skylake | 4 | 8 | 32(30) | 11GB | GDDR5X | 256GB | 1.48 | 2.6 |
| RTX-2080Ti | Turing | Intel Skylake | 3 | 8 | 32(30) | 11GB | GDDR6 | 383GB | 1.35 | 2.6 |
| V100 | Volta | Intel Skylake | 6 | 4 | 32(30) | 32GB | HBM2 | 383GB | 1.29 | 2.6 |
| V100 | Volta | Intel Cascade Lake | 1 | 2 | 40(38) | 32GB | HBM2 | 383GB | 1.23 | 2.5 |
| V100 | Volta | Intel Cascade Lake | 30 | 8 | 48(46) | 32GB | HBM2 | 383GB | 1.29 | 2.6 |
| A100 | Ampere | AMD Milan | 46 | 4 | 64(62) | 80GB | HBM2 | 512GB | 1.16 | 1.99 |
| A100 | Ampere | AMD Milan | 8 | 8 | 128(126) | 80GB | HBM2 | 1TB | 1.16 | 1.5 |
Note
The allocatable cores per node on GPU compute nodes are fewer than the total available in hardware. The Ibex cluster uses two cores per node to run the high performance shared parallel filesystem, WekaIO. On compute nodes with V100 and A100 GPUs these are pinned cores, whereas on the others they float (i.e. the WekaIO process takes precedence on cores 1 and 2). The SLURM scheduler can allocate at most the number of CPU cores per node listed in parentheses in the Cores/node column of the table above.
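For example, a node with 8 V100 GPUs has 48 hardware cores but only 46 allocatable ones; requesting more than the allocatable maximum may cause the job to be rejected or to remain pending. A minimal sketch of requesting such a node in full (the values are illustrative):

```bash
# Request all 8 GPUs and the 46 allocatable cores of an 8-GPU V100 node;
# two cores are reserved for the WekaIO client.
sbatch --gres=gpu:8 --constraint=v100 --ntasks=1 --cpus-per-task=46 jobscript.slurm
```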
Some additional details about the compute nodes with GPUs are worth knowing when choosing them for your jobs. The following table describes the maximum CUDA compute capability the GPU supports, the interconnect between GPUs on the same node, and the interconnect between CPUs and GPUs. Also listed is whether the node is capable of GPU Direct RDMA, which bypasses the CPU when communicating with a GPU on a different compute node in the Ibex cluster. In addition to the parallel filesystem, some compute nodes have storage local to the node.
| Model | GPUs/node | CUDA Cap | GPU-GPU | CPU-GPU | NICs | GDRDMA | Local storage | SLURM constraints |
|---|---|---|---|---|---|---|---|---|
| P6000 | 2 | 6.0 | PCIe | PCIe | 1 | IB | 400GB | p6000 |
| P100 | 4 | 6.0 | PCIe | PCIe | 1 | IB | 70GB | p100 |
| GTX-1080Ti | 4 | 6.1 | PCIe | PCIe | 1 | IB | 70GB | gtx1080ti & cpu_intel_e5_2699_v3 |
| GTX-1080Ti | 8 | 6.1 | PCIe | PCIe | 1 | IB | 700GB | gtx1080ti & cpu_intel_gold_6142 |
| RTX-2080Ti | 8 | 7.5 | PCIe | PCIe | 1 | IB | 700GB | rtx2080ti |
| V100 | 4 | 7.0 | NVLink 2.0 | PCIe | 1 | IB | 400GB | v100, cpu_intel_gold_6142 |
| V100 | 2 | 7.0 | PCIe | PCIe | 1 | IB | 400GB | v100, cpu_intel_gold_6248 |
| V100 | 8 | 7.0 | NVLink 2.0 | PCIe | 4 | IB | 7TB | v100, cpu_intel_platinum_8260, gpu_ai |
| A100 | 4 | 8.0 | NVLink 3.0 | PCIe | 2 | IB | 5TB | a100, 4gpus |
| A100 | 8 | 8.0 | NVLink 3.0 | PCIe | 4 | IB | 11TB | a100, 8gpus |
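Putting it together, the constraints in the last column can be combined with a GPU request, either on the command line or as jobscript directives. The following is a minimal sketch, with an illustrative GPU count and wall time, targeting a 4-GPU A100 node:

```bash
# Request 4 GPUs on an A100 node that has 4 GPUs in total; the "a100" and
# "4gpus" constraints come from the table above. Values are illustrative.
sbatch --gres=gpu:4 --constraint="a100&4gpus" --time=04:00:00 jobscript.slurm
```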