
GPU Programming

Available GPUs

The full hardware specifications of the GPU compute nodes can be found on the HPC Resources page. Note that the clusters may have different modules available.

For example, the available CUDA versions can be listed with:

marie@compute$ module spider CUDA

Note that some modules use a specific CUDA version which is visible in the module name, e.g. GDRCopy/2.1-CUDA-11.1.1 or Horovod/0.28.1-CUDA-11.7.0-TensorFlow-2.11.0.

This especially applies to the optimized CUDA libraries like cuDNN, NCCL and magma.

CUDA-aware MPI

When running CUDA applications that use MPI for interprocess communication, you additionally need to load the modules that enable CUDA-aware MPI, which may provide improved performance. These are UCX-CUDA and UCC-CUDA, which supplement the UCX and UCC modules, respectively. Some modules, like NCCL, load them automatically.

Using GPUs with Slurm

For general information on how to use Slurm, read the respective page in this compendium. When allocating resources on a GPU-node, you must specify the number of requested GPUs by using the --gres=gpu:<N> option, like this:

#!/bin/bash                           # Batch script starts with shebang line

#SBATCH --ntasks=1                    # All #SBATCH lines have to follow uninterrupted
#SBATCH --time=01:00:00               # after the shebang line
#SBATCH --account=p_number_crunch     # Comments start with # and do not count as interruptions
#SBATCH --job-name=fancyExp
#SBATCH --output=simulation-%j.out
#SBATCH --error=simulation-%j.err
#SBATCH --gres=gpu:1                  # request GPU(s) from Slurm

module purge                          # Set up environment, e.g., clean modules environment
module load module/version module2    # and load necessary modules

srun ./application [options]          # Execute parallel application with srun

Alternatively, you can work on the clusters interactively:

marie@login.<cluster_name>$ srun --nodes=1 --gres=gpu:<N> --time=00:30:00 --pty bash
marie@compute$ module purge; module switch release/<env>

Directive Based GPU Programming

Directives are special compiler commands in your C/C++ or Fortran source code. They tell the compiler how to parallelize and offload work to a GPU. This section explains how to use this technique.

OpenACC

OpenACC is a directive-based GPU programming model. It currently supports only NVIDIA GPUs as a target.

Please use the following information as a starting point for OpenACC:


OpenACC can be used with the PGI and NVIDIA HPC compilers. The NVIDIA HPC compiler, as part of the NVIDIA HPC SDK, supersedes the PGI compiler.

The nvc compiler (NOT the nvcc compiler, which is used for CUDA) is available for the NVIDIA Tesla V100 and NVIDIA A100 nodes.

Using OpenACC with PGI compilers

  • Load the latest version via module load PGI or search for available versions with module spider PGI
  • For compilation, please add the compiler flag -acc to enable OpenACC interpretation by the compiler
  • -Minfo tells you what the compiler is actually doing to your code
  • Add -ta=nvidia:ampere to enable optimizations for the A100 GPUs
  • You may find further information on the PGI compiler in the user guide and in the reference guide, which includes descriptions of available command line options

Using OpenACC with NVIDIA HPC compilers

  • Switch into the correct module environment for your selected compute nodes (see list of available GPUs)
  • Load the NVHPC module for the correct module environment. Either load the default (module load NVHPC) or search for a specific version.
  • Use the correct compiler for your code: nvc for C, nvc++ for C++ and nvfortran for Fortran
  • Use the -acc and -Minfo flags as with the PGI compiler
  • To create optimized code for either the V100 or A100, use -gpu=cc70 or -gpu=cc80, respectively
  • Further information on this compiler is provided in the user guide and the reference guide, which includes descriptions of available command line options
  • Information specific to the use of OpenACC with the NVIDIA HPC compiler is compiled in a dedicated guide

OpenMP target offloading

OpenMP supports target offloading as of version 4.0. A dedicated set of compiler directives can be used to annotate code sections that are intended for execution on the GPU (i.e., target offloading). Not all OpenMP-capable compilers support target offloading; refer to the official list for details. Furthermore, some compilers, such as GCC, have only basic support for target offloading, do not enable these features by default, and/or achieve poor performance.

On the ZIH system, compilers with OpenMP target offloading support are provided on the clusters power9 and alpha. Two compilers with good performance can be used: the NVIDIA HPC compiler and the IBM XL compiler.

Using OpenMP target offloading with NVIDIA HPC compilers

  • Load the module environments and the NVIDIA HPC SDK as described in the OpenACC section
  • Use the -mp=gpu flag to enable OpenMP with offloading
  • -Minfo tells you what the compiler is actually doing to your code
  • The same compiler options as for OpenACC are available for OpenMP, including the -gpu=ccXY flag
  • OpenMP-specific advice may be found in the respective section in the user guide

Using OpenMP target offloading with the IBM XL compilers

The IBM XL compilers (xlc for C, xlc++ for C++, and xlf for Fortran, with sub-versions for the different Fortran standards) are only available on the cluster power9 with NVIDIA Tesla V100 GPUs.

Native GPU Programming

CUDA

Native CUDA programs can sometimes offer better performance. NVIDIA provides some introductory material and links. An introduction to CUDA is provided as well. The toolkit documentation page links to the programming guide and the best practice guide. Optimization guides for supported NVIDIA architectures are available, including for Volta (V100) and Ampere (A100).

In order to compile an application with CUDA, use the nvcc compiler command, which is described in detail in the nvcc documentation. This compiler is available via several CUDA modules; a default version can be loaded via module load CUDA. Additionally, the NVHPC modules provide the CUDA tools as well.

To use CUDA with Open MPI across multiple nodes, the loaded Open MPI module must have been compiled with CUDA support. If you are not sure whether the module you are using supports it, you can check as follows:

marie@compute$ ompi_info --parsable --all | grep mpi_built_with_cuda_support:value | awk -F":" '{print "Open MPI supports CUDA:",$7}'

Usage of the CUDA Compiler

The simple invocation nvcc <file>.cu will compile a valid CUDA program. nvcc differentiates between device code and host code, which are compiled in separate phases. Therefore, compiler options can be defined specifically for the device as well as for the host code. By default, GCC is used as the host compiler. The following flags may be useful:

  • --generate-code (-gencode): generate optimized code for a target GPU (caution: these binaries cannot be used with GPUs of other generations).
    • For Volta (V100): --generate-code arch=compute_70,code=sm_70
    • For Ampere (A100): --generate-code arch=compute_80,code=sm_80
  • -Xcompiler: pass flags to the host compiler, e.g., generate OpenMP-parallel host code with -Xcompiler -fopenmp. The -Xcompiler flag has to be invoked for each host flag
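As a minimal illustration of this host/device split (a hypothetical sketch; the kernel name and sizes are chosen for brevity), a CUDA vector addition might look like this:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// __global__ marks device code compiled by nvcc for the GPU;
// main() is host code handed to the host compiler (GCC by default).
__global__ void add(const float *x, float *y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] += x[i];
}

int main() {
    const int n = 1 << 20;
    float *x, *y;
    cudaMallocManaged(&x, n * sizeof(float));  // unified memory keeps the sketch short
    cudaMallocManaged(&y, n * sizeof(float));
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    add<<<(n + 255) / 256, 256>>>(x, y, n);    // one thread per element
    cudaDeviceSynchronize();

    printf("y[0] = %.1f\n", y[0]);             // 1 + 2 = 3.0
    cudaFree(x);
    cudaFree(y);
    return 0;
}
```

On an A100 node this could be compiled with, e.g., nvcc --generate-code arch=compute_80,code=sm_80 vector_add.cu.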

Performance Analysis

Consult NVIDIA's Best Practices Guide and the performance guidelines for possible steps to take for the performance analysis and optimization.

Multiple tools can be used for the performance analysis. For the analysis of applications on the newer GPUs (V100 and A100), we recommend the use of the newer NVIDIA Nsight tools, Nsight Systems for a system-wide sampling and tracing and Nsight Compute for a detailed analysis of individual kernels.

NVIDIA nvprof & Visual Profiler

The nvprof command line tool and the Visual Profiler are available once a CUDA module has been loaded. For a simple analysis, you can call nvprof without any options, like this:

marie@compute$ nvprof ./application [options]

For a more in-depth analysis, we recommend you use the command line tool first to generate a report file, which you can later analyze in the Visual Profiler. In order to collect a set of general metrics for the analysis in the Visual Profiler, use the --analysis-metrics flag to collect metrics and --export-profile to generate a report file, like this:

marie@compute$ nvprof --analysis-metrics --export-profile <output>.nvvp ./application [options]

Transfer the report file to your local system and analyze it in the Visual Profiler (nvvp) locally. This will give the smoothest user experience. Alternatively, you can use X11-forwarding. Refer to the documentation for details about the individual features and views of the Visual Profiler.

Besides these generic analysis methods, you can profile specific aspects of your GPU kernels. nvprof can profile specific events. For this, use

marie@compute$ nvprof --query-events

to get a list of available events. Analyze one or more events by specifying them, separated by commas:

marie@compute$ nvprof --events <event_1>[,<event_2>[,...]] ./application [options]

Additionally, you can analyze specific metrics. Similar to the profiling of events, you can get a list of available metrics:

marie@compute$ nvprof --query-metrics

One or more metrics can be profiled at the same time:

marie@compute$ nvprof --metrics <metric_1>[,<metric_2>[,...]] ./application [options]

If you want to limit the profiler's scope to one or more kernels, you can use the --kernels <kernel_1>[,<kernel_2>] flag. For further command line options, refer to the documentation on command line options.

NVIDIA Nsight Systems

Use NVIDIA Nsight Systems for a system-wide sampling of your code. Refer to the NVIDIA Nsight Systems User Guide for details. With this, you can identify parts of your code that take a long time to run and are suitable optimization candidates.

Use the command-line version to sample your code and create a report file for later analysis:

marie@compute$ nsys profile [--stats=true] ./application [options]

The --stats=true flag is optional and will print a summary on the command line. Depending on your needs, this analysis may be sufficient to identify optimization targets.

The graphical user interface version can be used for a thorough analysis of your previously generated report file. For an optimal user experience, we recommend a local installation of NVIDIA Nsight Systems. In this case, you can transfer the report file to your local system. Alternatively, you can use X11-forwarding. The graphical user interface is usually available as nsys-ui.

Furthermore, you can use the command line interface for further analyses. Refer to the documentation for a list of available command line options.

NVIDIA Nsight Compute

Nsight Compute is used for the analysis of individual GPU kernels. It supports GPUs from the Volta architecture onward (on the ZIH system: V100 and A100). If you are familiar with nvprof, you may want to consult the Nvprof Transition Guide, as Nsight Compute uses a new scheme for metrics. We recommend targeting those kernels that account for a large portion of your runtime, according to Nsight Systems. Nsight Compute is particularly useful for CUDA code, as you have much greater control over your code compared to the directive-based approaches.

Nsight Compute comes in a command line and a graphical version. Refer to the Kernel Profiling Guide to get an overview of the functionality of these tools.

You can call the command line version (ncu) without further options to get a broad overview of your kernel's performance:

marie@compute$ ncu ./application [options]

As with the other profiling tools, the Nsight Compute profiler can generate report files like this:

marie@compute$ ncu --export <report> ./application [options]

The report file automatically gets the file extension .ncu-rep; you do not need to specify it manually.

This report file can be analyzed in the graphical user interface profiler. Again, we recommend you generate a report file on a compute node and transfer the report file to your local system. Alternatively, you can use X11-forwarding. The graphical user interface is usually available as ncu-ui or nv-nsight-cu.

Similar to the nvprof profiler, you can analyze specific metrics. NVIDIA provides a Metrics Guide. Use --query-metrics to get a list of available metrics, listing them by base name. Individual metrics can be collected by using

marie@compute$ ncu --metrics <metric_1>[,<metric_2>,...] ./application [options]

Collection of events is no longer possible with Nsight Compute. Instead, many nvprof events can be measured with metrics.

You can collect metrics for individual kernels by specifying the --kernel-name flag.