GPU Cluster Capella

Overview

The Lenovo multi-GPU cluster Capella was installed by MEGWARE for AI-related computations and traditional HPC simulations. Capella is fully integrated into the ZIH HPC infrastructure, so its usage is similar to that of the other clusters.

In November 2024, Capella was ranked #51 in the TOP500 list of the world's fastest computers, which makes it #3 among German systems, and #5 in the GREEN500 list. Background information on how Capella reached these positions can be found in this Golem article.
Hardware Specifications

The hardware specification is documented on the page HPC Resources.
Access and Login Nodes

Use login[1-2].capella.hpc.tu-dresden.de to access the cluster Capella from the campus (or VPN) network. To verify the SSH fingerprints of the login nodes, please refer to the page Key Fingerprints.

On the login nodes you have access to the same filesystems and software stack as on the compute nodes, but no GPUs are available there.

The subsections Filesystems and Software and Modules provide further information on these two topics.
Filesystems

As with all other clusters, your /home directory is also available on Capella. For convenience, the filesystems horse and walrus are accessible as well. Please note that the filesystem horse should not be used as the working filesystem on the cluster Capella, because we have something better (see below).
Cluster-Specific Filesystem cat

With Capella comes the new filesystem cat, designed to meet the high I/O requirements of AI and ML workflows. It is a WEKAio filesystem and is mounted under /data/cat. It is only available on the cluster Capella and the Datamover nodes.

The filesystem cat should be used as the main working filesystem and has to be used with workspaces. Workspaces on the filesystem cat can only be created on the Capella login and compute nodes, not on the other clusters, since cat is not available there.

cat has only limited capacity, hence the maximum workspace duration is significantly shorter than on the other filesystems. We recommend that you only store actively used data there.
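As a minimal sketch, a workspace on cat can be allocated with the usual workspace tools; the workspace name and the duration of 7 days are placeholders, and the allowed maximum duration on cat may differ.

```bash
# Allocate a workspace named "training_data" on the cat filesystem for 7 days
# (name and duration are placeholders; check the allowed limits for cat).
ws_allocate -F cat training_data 7

# List your existing workspaces on cat, including their remaining lifetime.
ws_list -F cat
```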
To transfer input and result data to and from the filesystems horse and walrus, respectively, you will need to use the Datamover nodes. Regardless of the direction of transfer, you should pack your data into archives (e.g., using the dttar command) for the transfer.

Do not invoke data transfers to the filesystems horse and walrus from the login nodes. Both login nodes are part of the cluster; failures, reboots and other work might affect your data transfer and result in data corruption.
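As a sketch, packing results from a cat workspace into an archive on walrus with the Datamover tool dttar could look as follows; all paths are placeholders and have to be adapted to your own workspaces.

```bash
# Pack results from a cat workspace into an archive on walrus via the Datamover
# (all paths are placeholders; adjust them to your own workspaces).
dttar -czf /data/walrus/ws/<username>-archive/results.tar.gz \
      /data/cat/ws/<username>-training_data/results
```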
All other shared filesystems (/home, /software, /data/horse, /data/walrus, etc.) are also mounted.
Software and Modules

The most straightforward way to use the software is through the well-known module system. All software available from the module system has been specifically built for the cluster Capella, i.e., with optimization for the Zen4 (Genoa) microarchitecture and with CUDA support enabled.
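For illustration, a session could look like the following sketch; the module name PyTorch is only an assumption, so please check what is actually provided on Capella first.

```bash
# Search for available versions of a package (PyTorch is only an example;
# names and versions provided on Capella may differ).
module spider PyTorch

# Load the module and verify that the CUDA build sees a GPU
# (run this inside a job allocation, since the login nodes have no GPUs).
module load PyTorch
python -c "import torch; print(torch.cuda.is_available())"
```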
Python Virtual Environments

Virtual environments allow you to install additional Python packages and create an isolated runtime environment. We recommend using venv for this purpose.

Virtual environments in workspaces: We recommend using workspaces for your virtual environments.
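A minimal sketch, assuming an existing workspace on cat; the workspace path, the Python module name, and the installed package are placeholders.

```bash
# Create and activate a virtual environment inside a cat workspace
# (the workspace path and the installed package are placeholders).
module load Python
python -m venv /data/cat/ws/<username>-training_data/env
source /data/cat/ws/<username>-training_data/env/bin/activate
pip install numpy
```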
Batch System

The batch system Slurm may be used as usual. Please refer to the page Batch System Slurm for detailed information. In addition, the page Job Examples with GPU provides examples of GPU allocation with Slurm. A minimal sketch of a GPU job file is shown below.
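The following job file is only a sketch; the project account, module, and resource values are placeholders and should be adapted to your needs (see the pages linked above for authoritative examples).

```bash
#!/bin/bash
#SBATCH --job-name=gpu-test
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=14          # placeholder; stay within the per-GPU CPU limits
#SBATCH --gres=gpu:1                # request one GPU
#SBATCH --time=04:00:00             # must not exceed the current 24 h limit
#SBATCH --account=<project>         # placeholder for your project

module load PyTorch                 # example module; adjust to your software
srun python train.py                # placeholder for your application
```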
You can find out about upcoming reservations (e.g., for acceptance benchmarks) via sinfo -T. Acceptance testing has priority, so your own reservation requests can currently not be considered.
Slurm limits and job runtime

Although each compute node is equipped with 64 CPU cores in total, only a maximum of 56 can be requested via Slurm (cf. Slurm Resource Limits Table).

The maximum runtime of jobs and interactive sessions is currently 24 hours. However, to allow for greater fluctuation in testing, please make your jobs shorter if possible. You can use Chain Jobs to split a long-running job that exceeds the batch queue limits into parts and chain these parts. Applications with built-in checkpoint-restart functionality are very well suited for this approach! If your application provides checkpoint-restart, please use /data/cat for temporary data and remove these data afterwards.
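As a sketch, such a chain can be built with Slurm's job dependencies; the job file name part.sbatch is a placeholder.

```bash
# Submit the first part of a long-running computation ...
JOBID=$(sbatch --parsable part.sbatch)

# ... and queue the next part so that it only starts after the previous one
# has finished successfully (part.sbatch is a placeholder for your job file).
JOBID=$(sbatch --parsable --dependency=afterok:${JOBID} part.sbatch)
```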
The partition capella-interactive can be used for small tests and for compiling software. In addition, JupyterHub instances that require low GPU utilization, or that use GPUs only for a short period of their allocation, are intended to use this partition.

To address this partition, add #SBATCH --partition=capella-interactive to your job file, or pass --partition=capella-interactive on your sbatch, srun or salloc command line, respectively.

The partition capella-interactive is configured to use a MIG configuration of 1/7.
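For example, an interactive session on this partition could be requested as follows; the time and CPU values are placeholders.

```bash
# Request an interactive shell with one 1/7 MIG shard on the
# capella-interactive partition (time and CPU values are placeholders).
srun --partition=capella-interactive --gres=gpu:1 --cpus-per-task=8 \
     --time=01:00:00 --pty bash
```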
Virtual GPUs - MIG

Starting with the Capella cluster, we introduce virtual GPUs. They are based on Nvidia's MIG technology. From an application point of view, each virtual GPU looks like a normal physical GPU, but offers only a fraction of the compute resources and of the maximum allocatable memory on the device. Accordingly, we only account you a fraction of a full GPU hour.

By using virtual GPUs, we expect to improve overall system utilization for jobs that cannot take advantage of a full H100 GPU. In addition, we can provide you with more resources and therefore shorter waiting times. We intend to use these partitions for all applications that cannot use a full H100 GPU, such as Jupyter notebooks.
Users can check the compute and memory usage of the GPU with the help of the job monitoring system PIKA.

Since a GPU in the Capella cluster offers 3.2-3.5x more peak performance than an A100 GPU in the cluster Alpha Centauri, a 1/7 shard of a GPU in Capella provides about half the performance of a GPU in Alpha Centauri.
At the moment we only offer a 1/7 partitioning in the capella-interactive partition, but we are free to create more configurations in the future. For this, users' demands and an expected high utilization of the smaller GPUs are essential.
| Configuration Name  | Compute Resources | Memory in GiB | Accounted GPU hour |
|---------------------|-------------------|---------------|--------------------|
| capella-interactive | 1/7               | 11            | 1/7                |
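Inside a job on this partition you can check which device was assigned; with a 1/7 shard, nvidia-smi lists a MIG device instead of a full H100. A sketch, using the srun options from above:

```bash
# List the visible GPU devices of an allocation on capella-interactive;
# a 1/7 MIG shard shows up as a MIG device rather than a full H100.
srun --partition=capella-interactive --gres=gpu:1 --time=00:05:00 nvidia-smi -L
```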