# MEGWARE PC-Farm Atlas (Outdated)

**Warning:** This page is deprecated! Atlas is a former system!

## System
The PC farm Atlas is a heterogeneous, general-purpose cluster based on the multi-core AMD Opteron 6274 ("Bulldozer") processor. The nodes run the Linux operating system SUSE SLES 11 with a 2.6 kernel. Currently, the following hardware is installed:
Component | Count / Type |
---|---|
CPUs | AMD Opteron 6274 |
number of cores | 5120 |
th. peak performance | 45 TFLOPS |
compute nodes | 4-way Saxonid nodes with 64 cores |
nodes with 64 GB RAM | 48 |
nodes with 128 GB RAM | 12 |
nodes with 512 GB RAM | 8 |
Mars and Deimos users: Please read the migration hints.
All nodes share the `/home` and `/fastfs` filesystems with our other HPC systems. Each node has 180 GB of local disk space for scratch, mounted on `/tmp`. Jobs for the compute nodes are scheduled by the Platform LSF batch system from the login nodes `atlas.hrsk.tu-dresden.de`.
A QDR InfiniBand interconnect provides the communication and I/O infrastructure for low-latency/high-throughput data traffic.
Users with a login on the SGI Altix can access their home directory via NFS below the mount point `/hpc_work`.
## CPU AMD Opteron 6274

Characteristic | Value |
---|---|
Clock rate | 2.2 GHz |
cores | 16 |
L1 data cache | 16 KB per core |
L1 instruction cache | 64 KB shared in a module (i.e. 2 cores) |
L2 cache | 2 MB per module |
L3 cache | 12 MB total, 6 MB shared between 4 modules = 8 cores |
FP units | 1 per module (supports fused multiply-add) |
th. peak performance | 8.8 GFLOPS per core (w/o turbo) |
The CPU belongs to the x86_64 family. Since it is fully capable of running 32-bit x86 code as well, it is worth comparing the performance of the 32-bit and 64-bit versions of the same code.
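As a rough illustration, both variants can be built and timed with GCC; this is a sketch only (the source file `kernel.c`, the optimization level, and the availability of 32-bit runtime libraries are assumptions, not statements about Atlas):

```bash
# Build the same source as a 32-bit and a 64-bit binary
# (requires GCC multilib support for the -m32 build)
gcc -O2 -m32 -o kernel_32 kernel.c
gcc -O2 -m64 -o kernel_64 kernel.c

# Compare the run times of the two versions
time ./kernel_32
time ./kernel_64
```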
For more architectural details, see the AMD Bulldozer block diagram and topology of Atlas compute nodes.
## Usage

### Compiling Parallel Applications

When loading a compiler module on Atlas, the module for the MPI implementation OpenMPI is also loaded in most cases. If not, explicitly load the OpenMPI module with `module load openmpi`. This also applies when you use the system's (old) GNU compiler.
Use the wrapper commands `mpicc`, `mpiCC`, `mpif77`, or `mpif90` to compile MPI source code. They use the currently loaded compiler. To reveal the command lines behind the wrappers, use the option `-show`.
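A minimal compile sequence might look like the following sketch (the source file name `mpi_hello.c` is hypothetical, and the exact module names may differ; check `module avail`):

```bash
# Load a compiler module; on Atlas this usually pulls in OpenMPI as well
module load gcc
module load openmpi        # only needed if OpenMPI was not loaded automatically

# Reveal the real compiler command line behind the wrapper
mpicc -show

# Compile the MPI program (file name is illustrative)
mpicc -O2 -o mpi_hello mpi_hello.c
```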
To run your code, you have to load the same compiler and MPI modules that were used for compiling the program. Please follow the guidelines outlined below to run your parallel program using the batch system.
### Batch System

Applications on an HPC system cannot be run on the login node. They have to be submitted to compute nodes with dedicated resources for the user's job. Normally a job is submitted with the following data (an example script follows the list):
- number of CPU cores,
- whether the requested CPU cores have to be on one node (OpenMP programs) or can be distributed over several nodes (MPI),
- memory per process,
- maximum wall clock time (after reaching this limit the process is killed automatically),
- files for redirection of output and error messages,
- executable and command line parameters.
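The sketch below shows how these data map to an LSF job script; all concrete values, file names, and the program name `my_prog` are illustrative assumptions, while the `#BSUB` options themselves are standard LSF:

```bash
#!/bin/bash
#BSUB -n 16                 # number of CPU cores (job slots)
#BSUB -M 900                # memory per process in MB (see "Memory Limits" below)
#BSUB -W 02:00              # maximum wall clock time (hh:mm)
#BSUB -o myjob_%J.out       # redirect standard output (%J = job ID)
#BSUB -e myjob_%J.err       # redirect error messages

# executable and command line parameters
mpirun ./my_prog --input data.in
```

Such a script would be submitted with `bsub < myjob.sh`.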
#### LSF

The batch system on Atlas is LSF; see also the general information on LSF.
#### Submission of Parallel Jobs

To run MPI jobs, ensure that the same MPI module is loaded as at compile time. If in doubt, check your loaded modules with `module list`. If your code has been compiled with the standard OpenMPI installation, you can load the OpenMPI module via `module load openmpi`.
Please pay attention to the messages you get when loading the module; they are more up-to-date than this manual. To submit a job, use a script or a command line like this:
`bsub -n <N> mpirun <program name>`
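For instance, a complete command-line submission could look like this sketch (the core count, runtime, output file, and program name are assumptions; `-W` and `-o` are standard `bsub` options):

```bash
# Load the MPI module used at compile time, then submit a 64-core MPI job
module load openmpi
bsub -n 64 -W 08:00 -o mpi_job_%J.out mpirun ./my_mpi_prog
```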
#### Memory Limits

Memory limits are enforced. This means that jobs which exceed their per-node memory limit may be killed automatically by the batch system.
The default limit is 300 MB per job slot (`bsub -n`).
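If your application needs more than the default, request it explicitly with `-M`; a one-line sketch with illustrative values and a hypothetical program name:

```bash
# Request 4 job slots with 1800 MB per slot instead of the 300 MB default
bsub -n 4 -M 1800 mpirun ./my_mpi_prog
```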
Atlas has sets of nodes with different amounts of installed memory, which affects where your job may run. To achieve the shortest possible waiting time for your jobs, you should be aware of the limits shown in the following table and read through the explanation below.
Nodes | No. of Cores | Avail. Memory per Job Slot | Max. Memory per Job Slot for Oversubscription |
---|---|---|---|
`n[001-047]` | 3008 | 940 MB | 1880 MB |
`n[049-072]` | 1536 | 1950 MB | 3900 MB |
`n[085-092]` | 512 | 8050 MB | 16100 MB |
##### Explanation

The amount of memory that you request for your job (`-M`) restricts the nodes to which it can be scheduled. Usually, the column "Avail. Memory per Job Slot" shows the maximum that will be allowed on the respective nodes.
However, we allow oversubscription of job slot memory. This means that jobs which use `-n 32` or less may be scheduled to smaller memory nodes, up to the limits in the column "Max. Memory per Job Slot for Oversubscription".
Have a look at the examples below.
##### Monitoring Memory Usage

At the end of the job completion mail there will be a link to a website which shows the memory usage over time per node. This will only be available for longer running jobs (>10 min).
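In addition, depending on the LSF version, the memory consumption of a running job can be queried on the command line; a sketch (the job ID is a placeholder):

```bash
# Show detailed job information; newer LSF versions include a MEMORY USAGE section
bjobs -l <jobid>
```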
##### Examples

Job Spec. | Nodes Allowed | Remark |
---|---|---|
`bsub -n 1 -M 500` | All nodes | <= 940 MB, fits everywhere |
`bsub -n 64 -M 700` | All nodes | <= 940 MB, fits everywhere |
`bsub -n 4 -M 1800` | All nodes | Is allowed to oversubscribe on the small nodes `n[001-047]` |
`bsub -n 64 -M 1800` | `n[049-092]` | 64 * 1800 MB will not fit onto a single small node, so the job is restricted to the medium and large nodes |
`bsub -n 4 -M 2000` | `n[049-092]` | Over the oversubscription limit of the small nodes `n[001-047]`, but may still go to the medium nodes |
`bsub -n 32 -M 2000` | `n[049-092]` | Same as above |
`bsub -n 32 -M 1880` | All nodes | Using at most 1880 MB, the job is eligible for running on any node |
`bsub -n 64 -M 2000` | `n[085-092]` | Maximum for the medium nodes is 1950 MB per slot; does the job really need 2000 MB per process? |
`bsub -n 64 -M 1950` | `n[049-092]` | With 1950 MB as the maximum, the job fits onto the medium nodes |
`bsub -n 32 -M 16000` | `n[085-092]` | Wait time might be very long |
`bsub -n 64 -M 16000` | `n[085-092]` | Memory request cannot be satisfied (64 * 16000 MB = 1 TB), the job cannot be scheduled |
#### Batch Queues

Batch queues are subject to (mostly minor) changes at any time. The scheduling policy prefers short-running jobs over long-running ones. This means that short jobs get higher priorities and are usually started earlier than long-running jobs.
Batch Queue | Admitted Users | Max. Cores | Default Runtime | Max. Runtime |
---|---|---|---|---|
`interactive` | all | n/a | 12h 00min | 12h 00min |
`short` | all | 1024 | 1h 00min | 24h 00min |
`medium` | all | 1024 | 24h 01min | 72h 00min |
`long` | all | 1024 | 72h 01min | 120h 00min |
`rtc` | on request | 4 | 12h 00min | 300h 00min |
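To target a particular queue and runtime explicitly, a final hedged example (the queue name is taken from the table above, while the runtime, core count, memory, and program name are illustrative; `-q` and `-W` are standard `bsub` options):

```bash
# Submit a 16-core job to the "short" queue with a 4-hour wall clock limit
bsub -q short -W 04:00 -n 16 -M 900 mpirun ./my_mpi_prog
```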