
High Performance Computing

Warning

This text is not yet ready…

Introduction

The HPC servers are available for calculations that cannot be completed on a desktop or laptop within an acceptable amount of time. These servers are equipped with a large number of cores and a large amount of memory. Some of them also have powerful GPU cards or high-speed network connections (InfiniBand).

Multiple users can use these servers simultaneously and can start multiple jobs in parallel. A load-balancing mechanism allots computation time fairly to these jobs. We strongly encourage our users to also apply social load balancing (also known as social queueing): monitor how your co-workers are currently using the system and use it in a way that shares the resources evenly. This also means communicating with each other about how the system is used. If you plan to run very large calculations, you are invited to ask your co-workers whether they can spare you some computation time or resources.

Important

In the end it’s all about sharing the available resources, respecting your co-workers and helping each other out.

A distinction is made between standalone servers, head-nodes and cluster-nodes. Although all these computers are more or less the same, their usage and network interconnection differ:

Standalone servers
servers that allow direct login and can be used interactively. Jobs are not executed by a workload manager (e.g. Slurm) as on a cluster. These servers provide only basic load balancing of the jobs, so users should be considerate of each other by applying social queueing (communicating with fellow users about the use of a server) and by starting all jobs with nice. This distributes CPU time more fairly between users while leaving enough CPU time for the system itself to run smoothly (see man nice for more information, and the first sketch after this overview).
Head-node
a server that is only used for submitting jobs to a cluster and for small pre- and/or post-processing of the data used on the cluster. Users log in to the server via SSH and start jobs on the cluster with the workload manager (e.g. Slurm), as in the second sketch after this overview. The head-node is also used for monitoring the progress of these jobs on the cluster.
Cluster-nodes
or simply ‘nodes’, are only accessible by submitting jobs via the workload manager on the head-node. These servers cannot be accessed directly and are typically not used interactively. In some cases the nodes have a high-speed network connection such as InfiniBand, which can be used for very fast transport of data between jobs on different nodes (MPI).
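
To make the advice about nice on standalone servers concrete, the following minimal Python sketch lowers its own priority before doing the heavy work; this has the same effect as prefixing the command with nice on the command line. The script contents are only an illustration, not an actual workload.

```python
import os

def heavy_computation():
    # Placeholder for the actual research calculation.
    return sum(i * i for i in range(10_000_000))

if __name__ == "__main__":
    # Increase the niceness of this process (Unix only) so that other
    # users and the system itself keep enough CPU time; this is the
    # in-script equivalent of starting the job with `nice -n 19 ...`.
    os.nice(19)
    print(heavy_computation())
```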
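
For the head-node, work is typically described in a small batch script and handed to the workload manager. The Python sketch below writes such a script and submits it with sbatch, then lists the user's jobs with squeue; the requested resources, the file names and the compute.py script are purely illustrative, and the exact options depend on the local Slurm configuration.

```python
import getpass
import subprocess
from pathlib import Path

# Example Slurm batch script; the requested resources are illustrative.
job_script = """#!/bin/bash
#SBATCH --job-name=example
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=8G
#SBATCH --time=01:00:00
srun python compute.py
"""

Path("job.sh").write_text(job_script)

# Submit the job from the head-node and show this user's queue.
subprocess.run(["sbatch", "job.sh"], check=True)
subprocess.run(["squeue", "-u", getpass.getuser()], check=True)
```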

To keep as much CPU power and memory as possible available for research calculations, the use of a GUI (Graphical User Interface) should be kept to a minimum or, better still, avoided altogether. Preferably, graphic viewing of the results is done on your local workstation (desktop or laptop) by transferring the output data to your workstation, as illustrated below.
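
A minimal sketch of this workflow, assuming matplotlib is available on the server: the plot is rendered to a file with a non-interactive backend instead of being shown in a GUI window, after which the file can be copied to the workstation (for example with scp or rsync) and viewed there.

```python
import matplotlib
matplotlib.use("Agg")            # non-interactive backend: no display needed
import matplotlib.pyplot as plt
import numpy as np

# Render the result to a file on the HPC server instead of opening a window.
x = np.linspace(0, 10, 1000)
plt.plot(x, np.sin(x))
plt.savefig("result.png", dpi=150)
# Copy result.png to your local workstation and view it there.
```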

For all HPC servers, access is granted on a per-user basis. Contact the system administrator of MI or CI for more information.

HPC usage

Concerning computational needs, users often apply the following workflow in their research:

  1. preliminary research: for easy and quick testing of an algorithm, the local desktop or laptop is used with Matlab or a scripting language like Python
  2. proof of concept: depending on the speed requirements and the available tools and libraries, the code is implemented in a programming language like C++, still on the desktop or laptop
  3. testing real data: as the required compute power and memory grow with larger datasets, the HPC server is used to continue the research with tools like OpenMP and CUDA
  4. scaling: for even more compute power and a finer granularity (parallel computing), the HPC cluster is used with tools like MPI (a small example follows below)
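
As an impression of step 4, the sketch below uses the mpi4py bindings (one possible choice; MPI from C or C++ works just as well) to let every process report its rank. On a cluster such a program is started through the workload manager, for example with srun under Slurm.

```python
# Minimal MPI example: every process reports its rank and node.
# Typically launched via the workload manager, e.g. `srun python mpi_hello.py`
# under Slurm, or `mpirun -n 4 python mpi_hello.py` with a plain MPI setup.
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()              # id of this process
size = comm.Get_size()              # total number of processes
node = MPI.Get_processor_name()     # name of the node this process runs on

print(f"Hello from rank {rank} of {size} on node {node}")
```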

Of course there are many other ways of using computing power for research. Some users decide not to use the HPC facilities but instead use a local workstation with more memory and cores, which offers the possibility to rapidly change the hardware and software (OS, libraries, applications) without consulting a third party such as the HPC administrators. Others decide to scale up even further and resort to clusters outside the university, such as SURFsara.

Warning

To be continued…


Last update: 2021-05-26