GPU Cluster Alpha Centauri¶

Overview¶

The multi-GPU cluster Alpha Centauri has been installed for AI-related computations (ScaDS.AI).

Hardware Specification¶

The hardware specification is documented on the page HPC Resources.

Filesystems¶

Since 5 July 2024, Alpha Centauri has been fully integrated into the InfiniBand infrastructure of Barnard. As a result, all filesystems (/home, /software, /data/horse, /data/walrus, etc.) are available.

Cluster-Specific Filesystem `quokka`¶

Similar to the filesystem cat on Capella, quokka is designed to meet the high I/O requirements of AI and ML workflows. It is a Quobyte filesystem mounted under /data/quokka. It is also available on Romeo for pre- and post-processing where no GPUs are required, as well as on the Datamover and Dataport nodes for data transfer.

quokka has only limited capacity, so workspace durations are significantly shorter than on other filesystems. We recommend storing only actively used data there. To transfer input data from, and result data to, the filesystems horse and walrus, use the Datamover nodes. Regardless of the transfer direction, pack your data into archives for the transfer (e.g., using the dttar command).
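
If you prefer to build the archive yourself before handing it to the Datamover, the packing step can be sketched in Python with the standard `tarfile` module. This is only an illustration: the function name `pack_directory` and any paths you pass in are hypothetical, and it does not replace the dttar command mentioned above.

```python
import tarfile
from pathlib import Path


def pack_directory(source_dir: str, archive_path: str) -> int:
    """Pack source_dir into a gzip-compressed tar archive.

    Returns the number of regular files stored. The archive should be
    written outside of source_dir so that it does not include itself.
    """
    source = Path(source_dir)
    if not source.is_dir():
        raise NotADirectoryError(f"{source} is not a directory")

    file_count = 0
    with tarfile.open(archive_path, "w:gz") as archive:
        for path in sorted(source.rglob("*")):
            # Store names relative to the parent directory so the archive
            # unpacks into a single top-level directory.
            arcname = str(path.relative_to(source.parent))
            archive.add(str(path), arcname=arcname, recursive=False)
            if path.is_file():
                file_count += 1
    return file_count
```

A single archive keeps the number of files moved between filesystems small, which is the point of the recommendation above.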

Do not start data transfers to the filesystems horse and walrus from the login nodes. Both login nodes are part of the cluster, so failures, reboots, and other maintenance work might interrupt your transfer and corrupt your data.

Usage¶

Note

The NVIDIA A100 GPUs can only be used with CUDA 11 or later. Earlier versions do not recognize the hardware properly. Make sure the software you use is built with CUDA 11 or later.

Each node has 48 physical cores. SMT is active, so 96 logical cores are available per node. Each node of the cluster Alpha has 2x AMD EPYC CPUs, 8x NVIDIA A100-SXM4 GPUs, 1 TB RAM and 3.5 TB local space (/tmp) on an NVMe device.
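
The node figures above suggest a simple rule of thumb for sizing a job: 48 physical cores shared by 8 GPUs gives 6 cores per GPU. The sketch below only does this arithmetic. The proportional split is an assumption for illustration, not an enforced limit, and the function name `cores_for_gpus` is hypothetical.

```python
# Node figures taken from the hardware description above.
PHYSICAL_CORES_PER_NODE = 48
GPUS_PER_NODE = 8


def cores_for_gpus(gpus: int, use_smt: bool = False) -> int:
    """Return the proportional core count for `gpus` GPUs on one node."""
    if not 1 <= gpus <= GPUS_PER_NODE:
        raise ValueError(f"gpus must be between 1 and {GPUS_PER_NODE}")
    cores_per_gpu = PHYSICAL_CORES_PER_NODE // GPUS_PER_NODE  # 48 // 8 = 6
    # With SMT enabled, each physical core provides two logical cores.
    threads_per_core = 2 if use_smt else 1
    return gpus * cores_per_gpu * threads_per_core


if __name__ == "__main__":
    for n in (1, 2, 8):
        print(n, "GPU(s):", cores_for_gpus(n), "physical cores")
```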

Note

Multithreading is disabled by default in a job. See the Slurm page on how to enable it.
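
To see how many logical cores a job can actually use, you can query the CPU affinity of the running process. This is a minimal sketch, assuming Slurm confines the job to its allocated CPUs and sets the `SLURM_CPUS_PER_TASK` environment variable when `--cpus-per-task` is given. The function names are hypothetical.

```python
import os
from typing import Optional


def visible_cpus() -> int:
    """Return the number of logical CPUs this process may run on."""
    if hasattr(os, "sched_getaffinity"):
        # Linux: reflects the CPU set the process is bound to.
        return len(os.sched_getaffinity(0))
    # Fallback for platforms without affinity support.
    return os.cpu_count() or 1


def requested_cpus() -> Optional[int]:
    """Return the value of SLURM_CPUS_PER_TASK, or None if it is unset."""
    value = os.environ.get("SLURM_CPUS_PER_TASK")
    return int(value) if value else None


if __name__ == "__main__":
    print("CPUs visible to this process:", visible_cpus())
    print("CPUs requested per task:", requested_cpus())
```

Running the script inside a job with and without multithreading enabled should show whether the second hardware thread of each core is available to you.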

To ensure a fair share of the resources, per-job resource limits are enforced. Please refer to the subsection QOS Resource Limits for details.

Modules¶

The easiest way to access software is the module system. All software available from the module system has been built specifically for the cluster Alpha, i.e., with optimizations for the Zen 2 microarchitecture and with CUDA support enabled.

To check the available modules for Alpha, use the command

marie@login.alpha$ module spider <module_name>

Example: Searching and loading PyTorch

For example, to check which PyTorch versions are available you can invoke

marie@login.alpha$ module spider PyTorch
-------------------------------------------------------------------------------------------------------------------------
  PyTorch:
-------------------------------------------------------------------------------------------------------------------------
    Description:
      Tensors and Dynamic neural networks in Python with strong GPU acceleration. PyTorch is a deep learning framework
      that puts Python first.

     Versions:
        PyTorch/1.12.0
        PyTorch/1.12.1-CUDA-11.7.0
        PyTorch/1.12.1
[...]

Not all modules can be loaded directly. Most modules are built with a certain compiler or toolchain that needs to be loaded beforehand. Luckily, the module system can tell us what we need to do for a specific module or software version:

marie@login.alpha$ module spider PyTorch/1.12.1-CUDA-11.7.0

-------------------------------------------------------------------------------------------------------------------------
  PyTorch: PyTorch/1.12.1-CUDA-11.7.0
-------------------------------------------------------------------------------------------------------------------------
    Description:
      Tensors and Dynamic neural networks in Python with strong GPU acceleration. PyTorch is a deep learning framework
      that puts Python first.


    You will need to load all module(s) on any one of the lines below before the "PyTorch/1.12.1" module is available to load.

      release/23.04  GCC/11.3.0  OpenMPI/4.1.4
[...]

Finally, the command line to load the PyTorch/1.12.1-CUDA-11.7.0 module is

marie@login.alpha$ module load release/23.04  GCC/11.3.0  OpenMPI/4.1.4 PyTorch/1.12.1-CUDA-11.7.0
Module GCC/11.3.0, OpenMPI/4.1.4, PyTorch/1.12.1-CUDA-11.7.0 and 64 dependencies loaded.

Now, you can verify with the following command that the PyTorch module is available

marie@login.alpha$ python -c "import torch; print(torch.__version__); print(torch.cuda.is_available())"
1.12.1
True
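
For a slightly more detailed check than the one-liner above, the following sketch also reports the CUDA version PyTorch was built with and the names of the visible GPUs. The function name `gpu_report` and the dictionary keys are hypothetical. The sketch returns an empty report instead of failing when PyTorch is not installed.

```python
def gpu_report() -> dict:
    """Collect basic information about PyTorch and the visible GPUs."""
    try:
        import torch
    except ImportError:
        # PyTorch is not available, e.g. because no module is loaded.
        return {"torch": None, "cuda_build": None,
                "cuda_available": False, "devices": []}

    available = torch.cuda.is_available()
    devices = []
    if available:
        # One entry per GPU that is visible to this job.
        devices = [torch.cuda.get_device_name(i)
                   for i in range(torch.cuda.device_count())]
    return {
        "torch": torch.__version__,
        "cuda_build": torch.version.cuda,  # None for CPU-only builds
        "cuda_available": available,
        "devices": devices,
    }


if __name__ == "__main__":
    for key, value in gpu_report().items():
        print(f"{key}: {value}")
```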

Python Virtual Environments¶

Virtual environments allow you to install additional Python packages and create an isolated runtime environment. We recommend using virtualenv for this purpose.

Hint

We recommend using workspaces for your virtual environments.

Example: Creating a virtual environment and installing the torchvision package

As a first step, start an interactive job and allocate a workspace:

marie@login.alpha$ srun --nodes=1 --cpus-per-task=1 --gres=gpu:1 --time=01:00:00 --pty bash -l
marie@alpha$ ws_allocate --name python_virtual_environment --duration 1
Info: creating workspace.
/horse/ws/marie-python_virtual_environment
remaining extensions  : 2
remaining time in days: 1

Now, you can load the desired modules and create a virtual environment within the allocated workspace.

marie@alpha$ module load release/23.04 GCCcore/11.3.0 GCC/11.3.0 OpenMPI/4.1.4 Python/3.10.4
Module GCC/11.3.0, OpenMPI/4.1.4, Python/3.10.4 and 21 dependencies loaded.
marie@alpha$ module load PyTorch/1.12.1-CUDA-11.7.0
Module PyTorch/1.12.1-CUDA-11.7.0 and 42 dependencies loaded.
marie@alpha$ which python
/software/rome/r23.04/Python/3.10.4-GCCcore-11.3.0/bin/python
marie@alpha$ pip list
[...]
marie@alpha$ virtualenv --system-site-packages /data/horse/ws/marie-python_virtual_environment/my-torch-env
created virtual environment CPython3.8.6.final.0-64 in 42960ms
  creator CPython3Posix(dest=/horse/.global1/ws/marie-python_virtual_environment/my-torch-env, clear=False, global=True)
  seeder FromAppData(download=False, pip=bundle, setuptools=bundle, wheel=bundle, via=copy, app_data_dir=~/.local/share/virtualenv)
    added seed packages: pip==21.1.3, setuptools==57.2.0, wheel==0.36.2
  activators BashActivator,CShellActivator,FishActivator,PowerShellActivator,PythonActivator,XonshActivator
marie@alpha$ source /data/horse/ws/marie-python_virtual_environment/my-torch-env/bin/activate
(my-torch-env) marie@alpha$ pip install torchvision==0.13.1
[...]
Installing collected packages: torchvision
Successfully installed torchvision-0.13.1
[...]
(my-torch-env) marie@alpha$ python -c "import torchvision; print(torchvision.__version__)"
0.13.1+cu102
(my-torch-env) marie@alpha$ deactivate
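
While the environment is still active, you can confirm from within Python that the interpreter runs inside a virtual environment and see where a package is imported from. This is a minimal sketch using only the standard library. The function names are hypothetical.

```python
import importlib.util
import sys
from typing import Optional


def in_virtual_environment() -> bool:
    """Return True if the interpreter runs inside a virtual environment."""
    # Older virtualenv releases set sys.real_prefix. Newer ones and venv
    # make sys.prefix differ from sys.base_prefix.
    base_prefix = getattr(sys, "base_prefix", sys.prefix)
    return hasattr(sys, "real_prefix") or sys.prefix != base_prefix


def package_location(name: str) -> Optional[str]:
    """Return the file a top-level package would be imported from."""
    spec = importlib.util.find_spec(name)
    if spec is None:
        return None  # package is not installed
    return spec.origin


if __name__ == "__main__":
    print("virtual environment:", in_virtual_environment())
    print("torchvision from:", package_location("torchvision"))
```

A package installed with pip inside the environment should resolve to a path below the environment directory, whereas packages provided by loaded modules resolve to the module installation.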

JupyterHub¶

JupyterHub can be used to run Jupyter notebooks on the Alpha Centauri cluster. You can either use the standard profiles for Alpha or use the advanced form and define the resources for your JupyterHub job. The "Alpha GPU (NVIDIA Ampere A100)" preset is a good starting configuration.

Containers¶

Singularity containers enable users to have full control of their software environment. For more information, see the Singularity container details.

NVIDIA NGC containers can be an effective solution for machine learning tasks (downloading containers requires registration). NVIDIA-prepared containers with software solutions for specific scientific problems can simplify the deployment of deep learning workloads on HPC. NGC containers have shown performance comparable to that of code run directly.