GPU Cluster Capella¶
Acceptance phase
The cluster Capella is currently in the acceptance phase, i.e., interruptions, reboots without notice, and node failures are possible. Furthermore, the system's configuration might be adjusted further.
Do not yet move your "production" to Capella, but feel free to test it using moderately sized workloads. Please read this page carefully to understand what you need to adapt in your existing workflows w.r.t. filesystems, software and modules, and batch jobs.
We highly appreciate your hints and would be pleased to receive your comments and experiences regarding its operation via e-mail to hpc-support@tu-dresden.de using the subject Capella.
Please understand that our current priority is the acceptance, configuration, and rollout of the system. Consequently, we are unable to address any support requests at this time.
Overview¶
The multi-GPU cluster Capella
has been installed for AI-related computations and traditional
HPC simulations. Capella is fully integrated into the ZIH HPC infrastructure.
Therefore, its usage is similar to that of the other clusters.
Hardware Specifications¶
The hardware specification is documented on the page HPC Resources.
Access and Login Nodes¶
You use the login nodes login[1-2].capella.hpc.tu-dresden.de to access the cluster Capella from the campus (or VPN) network.
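For example, a connection from the campus or VPN network might look like this (replace <zih-login> with your ZIH username; either login node works):

```bash
# Log in to one of the two Capella login nodes via SSH
ssh <zih-login>@login1.capella.hpc.tu-dresden.de
```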
In order to verify the SSH fingerprints of the login nodes, please refer to the page
Key Fingerprints.
On the login nodes you have access to the same filesystems and software stack as on the compute nodes. GPUs are not available there.
In the subsections Filesystems and Software and Modules we provide further information on these two topics.
Filesystems¶
As with all other clusters, your /home directory is also available on Capella.
For convenience, the filesystems horse and walrus are also accessible. Please note that the filesystem horse is not to be used as the working filesystem on the cluster Capella.
With Capella comes the new filesystem cat, designed to meet the high I/O requirements of AI and ML workflows. It is a WEKAio filesystem mounted under /data/cat. It is only available on the cluster Capella and the Datamover nodes.
Main working filesystem is cat
The filesystem cat should be used as the main working filesystem and has to be used with workspaces. Workspaces on the filesystem cat can only be created on the login and compute nodes (see the sketch below), not on the other clusters, since cat is not available there. All other filesystems (/home, /software, /data/horse, /data/walrus, etc.) are nevertheless also available.
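As an illustration, allocating a workspace on cat from a Capella login node might look like the following sketch; the workspace name and duration are placeholders:

```bash
# Allocate a workspace named "my_experiment" on the cat filesystem for 30 days (example values)
ws_allocate -F cat my_experiment 30

# List your existing workspaces
ws_list
```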
Data transfer to and from /data/cat
Please utilize the new filesystem cat as the working filesystem on Capella. It has limited capacity, so we advise you to only hold hot data on cat.
To transfer input and result data from and to the filesystems horse and walrus, respectively, you will need to use the Datamover nodes. Regardless of the direction of transfer, you should pack your data into archives for the transfer, e.g., using the dttar command as sketched below.
Do not invoke data transfers to the filesystems horse and walrus from the login nodes. Both login nodes are part of the cluster; failures, reboots, and other work might affect your data transfer, resulting in data corruption.
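A hedged sketch of such an archive transfer using dttar; the workspace paths below are placeholders and the direction (cat to walrus) is only an example:

```bash
# Pack results from a cat workspace into an archive on walrus via the Datamover (paths are placeholders)
dttar -cvf /data/walrus/ws/<user>-archive/results.tar -C /data/cat/ws/<user>-experiment results
```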
Software and Modules¶
The most straightforward method for utilizing the software is through the well-known
module system.
All software available from the module system has been specifically built for the cluster Capella, i.e., with optimizations for the Zen 4 (Genoa) microarchitecture and CUDA support enabled.
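For example, searching for and loading software via the module system might look like this; the module name and version are placeholders and may differ on Capella:

```bash
# Search for available versions of a package (example name)
module spider PyTorch

# Load a specific version (example only; use a version reported by module spider)
module load PyTorch/2.1.2
```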
Python Virtual Environments¶
Virtual environments allow you to install
additional Python packages and create an isolated runtime environment. We recommend using
venv
for this purpose.
Virtual environments in workspaces
We recommend using workspaces for your virtual environments, for example as sketched below.
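A minimal sketch, assuming you have already allocated a workspace on cat; the module name and workspace path are placeholders:

```bash
# Load a Python module (version depends on the installed module tree)
module load Python

# Create and activate a virtual environment inside the workspace
python -m venv /data/cat/ws/<user>-python-env/env
source /data/cat/ws/<user>-python-env/env/bin/activate

# Install additional packages into the isolated environment
pip install --upgrade pip
```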
Batch System¶
The batch system Slurm may be used as usual. Please refer to the page Batch System Slurm for detailed information. In addition, the page Job Examples provides examples on GPU allocation with Slurm.
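As a sketch, a single-GPU batch job could look like the following; the module name, resource values, and script name are placeholders, not Capella-specific recommendations:

```bash
#!/bin/bash
#SBATCH --job-name=gpu-test
#SBATCH --nodes=1
#SBATCH --gres=gpu:1           # request one GPU (example)
#SBATCH --cpus-per-task=14     # example value; stay within the per-node limits
#SBATCH --time=04:00:00        # well below the current 24 hour limit
#SBATCH --output=slurm-%j.out

module load PyTorch            # example module; adjust to your software
srun python train.py           # train.py is a placeholder for your application
```

Submit the job file with sbatch <jobfile>.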
You can find out about upcoming reservations (e.g., for acceptance benchmarks) via sinfo -T.
Acceptance has priority, so your reservation requests cannot currently be considered.
Slurm limits and job runtime
Although each compute node is equipped with 64 CPU cores in total, only a maximum of 56 can be requested via Slurm (cf. Slurm Resource Limits Table).
The maximum runtime of jobs and interactive sessions is currently 24 hours. However, to allow for greater fluctuation in testing, please make the jobs shorter if possible. You can use Chain Jobs to split a long-running job exceeding the batch queue's limits into parts and chain these parts. Applications with built-in checkpoint-restart functionality are very suitable for this approach! If your application provides checkpoint-restart, please use /data/cat for temporary data. Remove these data afterwards!
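One way to realize such a chain is Slurm's job dependencies; a minimal sketch, assuming a job script part.sbatch that writes a checkpoint and resumes from it:

```bash
# Submit the first part and capture its job ID
jobid=$(sbatch --parsable part.sbatch)

# Submit the next part so that it starts only after the previous one completed successfully
sbatch --dependency=afterok:${jobid} part.sbatch
```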
The partition capella-interactive can be used for your small tests and the compilation of software. You need to add #SBATCH --partition=capella-interactive to your job file, or pass --partition=capella-interactive on the sbatch, srun, or salloc command line, to address this partition (see the example below). The partition's configuration might be adapted during the acceptance phase. You can get the current settings via scontrol show partitions capella-interactive.
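For instance, a short interactive session on this partition might be requested as follows (the resource values are examples only):

```bash
# Interactive shell on the capella-interactive partition (example values)
srun --partition=capella-interactive --time=01:00:00 --cpus-per-task=4 --pty bash
```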