# Compare System Performance with SPEChpc
SPEChpc 2021 is a benchmark suite developed by the Standard Performance Evaluation Corporation (SPEC) for the evaluation of a broad range of heterogeneous HPC systems. Documentation and released benchmark results can be found on their web page. In fact, our system Taurus (partition `haswell`) is the benchmark's reference system and thus represents the baseline score.
The suite includes nine real-world scientific applications (see the benchmark table) with different workload sizes ranging from tiny and small to medium and large, and different parallelization models including MPI only, MPI+OpenACC, MPI+OpenMP and MPI+OpenMP with target offloading.

With this benchmark suite you can compare the performance of different HPC systems and, furthermore, evaluate parallelization strategies for applications on a target HPC system. If you want to implement an algorithm, port an application to another platform or integrate acceleration into your code, you can determine which target system and parallelization model your application's performance would benefit from most. You can also check whether an acceleration scheme can be deployed and run on a given system at all, since software issues may restrict otherwise capable hardware (see this CUDA issue).
Since TU Dresden is a member of the SPEC consortium, the HPC benchmarks can be requested by anyone interested. Please contact Holger Brunst for access.
## Installation
The target partition determines which of the parallelization models can be used, and vice versa. For example, if you want to run a model including acceleration, you would have to use a partition with GPUs.
Once the target partition is determined, follow SPEC's Installation Guide. The process is straightforward and easy to follow.
Building for partition `ml`

The partition `ml` is based on the Power9 architecture. Thus, you need to provide the `-e ppc64le` switch when installing.
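For illustration, the installation call could then look like the following sketch (the destination directory and the `-d` option are assumptions based on SPEC's usual `install.sh` interface):

```bash
# Run the SPEC installer with the ppc64le tool set for the Power9 nodes
# (paths are placeholders and have to be adapted)
cd /path/to/spechpc/distribution
./install.sh -e ppc64le -d /scratch/ws/spec/installation
```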
Building with NVHPC for partition `alpha`

To build the benchmark for partition `alpha`, you don't need an interactive session on the target architecture. You can stay on the login nodes as long as you set the flag `-tp=zen`. You can add this compiler flag to the configuration file.
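In the configuration file, this could look like the following sketch (the compiler wrappers are placeholders; `CC`, `CXX` and `FC` follow SPEC's config file conventions):

```
# Illustrative compiler section for NVHPC on AMD Zen (partition alpha)
CC  = mpicc  -tp=zen
CXX = mpicxx -tp=zen
FC  = mpif90 -tp=zen
```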
If you are facing errors during the installation process, check the solved and unresolved issues sections for our systems. The problem might already be listed there.
## Configuration
The behavior in terms of how to build, run and report the benchmark in a particular environment is controlled by a configuration file. There are a few examples included in the source code. Here you can apply compiler tuning and porting, specify the runtime environment and describe the system under test. SPEChpc 2021 has been deployed on the partitions `haswell`, `ml` and `alpha`, and corresponding configurations are available for each of them.
No matter which one you choose as a starting point, double-check the line that defines the submit command and make sure it says `srun [...]`, e.g.

`submit = srun $command`

Otherwise this can cause trouble (see Slurm Bug). You can also put Slurm options in the configuration, but it is recommended to do this in a job script (see chapter Execution). Use the following to apply your configuration to the benchmark run:

`runhpc --config <configfile.cfg> [...]`
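For orientation, a stripped-down configuration could contain lines like the following sketch (label, compilers and flags are placeholders; this is not a complete, working config):

```
# Illustrative SPEChpc config file excerpt
label    = my_run              # arbitrary label for this build/run
tune     = base
submit   = srun $command       # let Slurm place the MPI ranks

default:
CC       = mpicc
CXX      = mpicxx
FC       = mpif90
OPTIMIZE = -O2
```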
For more details about configuration settings, consult SPEC's documentation.
## Execution
The SPEChpc 2021 benchmark suite is executed with the `runhpc` command, which also sets its configuration and controls its runtime behavior. For all options, see SPEC's documentation about `runhpc` options.
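A typical invocation could look like the following sketch (the options besides `--config` are examples based on SPEC's `runhpc` documentation and have to be adapted):

```bash
# Run the tiny workload in base tuning with 32 MPI ranks
runhpc --config=myconfig.cfg --tune=base --ranks=32 tiny
```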
First, execute `source shrc` in your SPEC installation directory. Then use a job script to submit a job with the benchmark or parts of it.
In the following, job scripts are shown for the partitions `haswell`, `ml` and `alpha`, respectively. You can use them as a template in order to reproduce results or to transfer the execution to a different partition.
- Replace `<p_number_crunch>` (line 2) with your project name
- Replace `ws=</scratch/ws/spec/installation>` (line 15/18) with your SPEC installation path
### Submit SPEChpc Benchmarks with a Job File
*(Job file templates for the partitions `haswell`, `ml` and `alpha`.)*
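For orientation, a minimal job file for an MPI-only run on partition `haswell` could look like the following sketch (module names, resource amounts and the `runhpc` options are assumptions and have to be adapted to your installation):

```bash
#!/bin/bash
#SBATCH --account=<p_number_crunch>      # your project name
#SBATCH --partition=haswell
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=24
#SBATCH --time=08:00:00

# Load the environment that was used to build the benchmark binaries
module purge
module load <compiler> <mpi>

# Path to your SPEC installation
ws=</scratch/ws/spec/installation>
cd ${ws}
source shrc

# runhpc launches the MPI ranks itself via the submit command (srun) from the config file
runhpc --config=<configfile.cfg> --ranks=${SLURM_NTASKS} tiny
```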
## Solved Issues

### Fortran Compilation Error
PGF90-F-0004-Corrupt or Old Module file
Explanation
If this error arises during runtime, it means that the benchmark binaries and the MPI module do not fit together. This happens when you have built the benchmarks written in Fortran with a different compiler than the one used to build the MPI module that was loaded for the run.
Solution
- Use the correct MPI module
    - The MPI module in use must be compiled with the same compiler that was used to build the benchmark binaries. Check the results of `module avail` and choose a corresponding module.
- Rebuild the binaries
    - Rebuild the binaries using the same compiler as for the compilation of the MPI module of choice.
- Request a new module
    - Ask the HPC support to install a compatible MPI module.
- Build your own MPI module (as a last resort)
    - Download and build a private MPI module using the same compiler as for building the benchmark binaries.
### pmix Error
PMIX ERROR
It looks like the function `pmix_init` failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during pmix_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
mix_progress_thread_start failed
--> Returned value -1 instead of PMIX_SUCCESS
*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
Explanation
This is most probably an MPI-related issue. If you built your own MPI module, PMIx support might be configured incorrectly.
Solution
Use `--with-pmix=internal` during the `configure` routine of your MPI build.
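If you build a private Open MPI, for example, the configure step could look like the following sketch (version, paths and compilers are placeholders; only `--with-pmix=internal` is the relevant option here):

```bash
# Build a private Open MPI with the bundled (internal) PMIx
./configure --prefix=$HOME/sw/openmpi --with-pmix=internal \
            CC=<same C compiler as for the benchmarks> FC=<same Fortran compiler>
make -j 8 && make install
```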
### ORTE Error (too many processes)
Error: system limit exceeded on number of processes that can be started
ORTE_ERROR_LOG: The system limit on number of children a process can have was reached.
Explanation
There are too many processes spawned, probably due to a wrong job allocation and/or invocation.
Solution
Check the invocation command line in your job script. It must not say `srun runhpc [...]` there, but only `runhpc [...]`. The submit command in the configuration file already contains `srun`. When `srun` is called in both places, too many parallel processes are spawned.
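In short, the call in the job script should look like this:

```bash
# Wrong: srun is already part of the submit command in the config file
srun runhpc --config=<configfile.cfg> [...]

# Correct: runhpc launches the ranks itself via the configured submit command
runhpc --config=<configfile.cfg> [...]
```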
### Error with OpenFabrics Device
There was an error initializing an OpenFabrics device
Explanation
"I think it’s just trying to find the InfiniBand libraries, which aren’t used, but can’t. It’s probably safe to ignore."
Matthew Colgrove, Nvidia
Solution
This is just a warning which cannot be suppressed, but can be ignored.
### Out of Memory
Out of memory
Out of memory allocating [...] bytes of device memory
call to cuMemAlloc returned error 2: Out of memory
Explanation
- When running on a single node with all of its memory allocated, there is not enough memory for the benchmark.
- When running on multiple nodes, this might be a wrong resource distribution caused by Slurm. Check the `$SLURM_NTASKS_PER_NODE` environment variable. If it says something like `15,1` when you requested 8 processes per node, Slurm was not able to hand over the resource distribution to `mpirun`.
Solution
- Expand your job from a single node to multiple nodes.
- Reduce the workload (e.g. from small to tiny).
- Make sure to use `srun` instead of `mpirun` as the submit command in your configuration file.
## Unresolved Issues

### CUDA Reduction Operation Error
There was a problem while initializing support for the CUDA reduction operations.
Explanation
For OpenACC, NVHPC was in the process of adding OpenMP array reduction support which is needed for the `pot3d` benchmark. An Nvidia driver version of 450.80.00 or higher is required. Since the driver version on partition `ml` is 440.64.00, it is not supported and not possible to run the `pot3d` benchmark in OpenACC mode here.
Workaround
As for the partition `ml`, you can only wait until the OS update to CentOS 8 is carried out, as no driver update will be done beforehand. As a workaround, you can do one of the following:

- Exclude the `pot3d` benchmark (see the sketch below).
- Switch the partition (e.g. to partition `alpha`).
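If `runhpc` follows the benchmark selection syntax of SPEC's other run tools (an assumption; check `runhpc --help`), excluding the benchmark could look like this:

```bash
# Run the tiny suite but skip pot3d (the ^ prefix excludes a benchmark)
runhpc --config=<configfile.cfg> tiny ^pot3d
```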
### Slurm Bug
Wrong resource distribution
When working with multiple nodes on partition `ml` or `alpha`, the Slurm parameter `$SLURM_NTASKS_PER_NODE` does not work as intended when used in conjunction with `mpirun`.
Explanation
In the described case, when setting e.g. `SLURM_NTASKS_PER_NODE=8` and calling `mpirun`, Slurm is not able to pass on the allocation settings correctly. With two nodes, this leads to a distribution of 15 processes on the first node and 1 process on the second node instead. In fact, none of the methods proposed in Slurm's man page (like `--distribution=plane=8`) will give the intended result in this case.
Workaround

Use `srun` as the submit command in your configuration file instead of `mpirun` (see the Out of Memory issue above).
### Benchmark Hangs Forever
The benchmark runs forever and produces a timeout.
Explanation
The reason for this is not known; however, it is caused by the flag `-DSPEC_ACCEL_AWARE_MPI`.
Workaround
Remove the flag `-DSPEC_ACCEL_AWARE_MPI` from the compiler options in your configuration file.
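For example, if the compiler options in your configuration contain the define (the variable name and surrounding flags are placeholders), simply drop it:

```
# before
OPTIMIZE = -O3 -DSPEC_ACCEL_AWARE_MPI
# after
OPTIMIZE = -O3
```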
## Other Issues
For any further issues you can consult SPEC's FAQ page, search through their known issues or contact their support.