k-Wave on HPC¶
Contribution by Rick Waasdorp
If you don’t want to go through the hassle of compiling these binaries, shoot me an email :)
Introduction¶
This manual will help you run k-Wave simulations on the MI/CI HPC servers with GPU CUDA support or with the C++/OpenMP code. See the cluster documentation for an overview of the available HPC servers and for information on how to access them.
This manual is written for k-Wave version 1.3, released on 28 February 2020.
Download k-Wave¶
First, download the k-Wave MATLAB toolbox and the Linux source needed to compile the CPU or GPU binaries. Log in on the k-Wave website, then download:
- the MATLAB toolbox (k-wave-toolbox-version-1.3.zip)
- the C++/CUDA Linux source (k-wave-toolbox-version-1.3-cpp-linux-source.zip)
Transfer the files to a directory accessible from the HPC cluster, e.g. your directory in the BULK storage: /tudelft.net/staff-bulk/tnw/IST/AK/hpc/<netid> (for example with scp, as shown below). To make this directory easier to access, we can set up a symbolic link from our home directory:
$ ln -s /tudelft.net/staff-bulk/tnw/IST/AK/hpc/<netid>/ ~/tudbulk
Now this directory is accessible as ~/tudbulk or $HOME/tudbulk (where the $HOME variable equals /home/nfs/<netid>).
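For example, if you downloaded the archives on your own machine, you could copy them to the bulk directory with scp. A sketch (the hostname <hpc-login-node> is a placeholder; use the login node you normally connect to):
$ scp k-wave-toolbox-version-1.3.zip \
      k-wave-toolbox-version-1.3-cpp-linux-source.zip \
      <netid>@<hpc-login-node>:/tudelft.net/staff-bulk/tnw/IST/AK/hpc/<netid>/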
Extract the zip archives using
$ unzip k-wave-toolbox-version-1.3.zip -d kwave-matlab
$ unzip k-wave-toolbox-version-1.3-cpp-linux-source.zip -d kwave-linux-source
From here on, we assume the extracted files can be found at the following locations:
~/tudbulk/kwave-matlab
~/tudbulk/kwave-linux-source
Download and compile dependencies¶
To use the CPU or GPU optimized binaries, you first have to make sure the dependencies are available.
First create a folder called local in your BULK storage directory:
$ mkdir ~/tudbulk/local
So we have the following path: /tudelft.net/staff-bulk/tnw/IST/AK/hpc/<netid>/local
We have to download and compile the following dependencies for both CPU and GPU:
- HDF5 1.8.16
- zlib (tested with 1.2.11)
- Szip (tested with 2.1.1)
If you want to compile the CPU optimization, you have to download and compile the following as well:
- fftw 3.3.x (tested with 3.3.8)
Download and copy HDF5 (CPU and GPU)¶
To make our life easy, we will download compiled binaries of HDF5 1.8.16. Go to https://portal.hdfgroup.org/display/support/HDF5+1.8.16 and find the link that will redirect you to the ftp index. Browse the index and find the following file:
- HDF5-1.8.16 for Linux 3.10 CentOS 7 x86_64
Copy the link (it should end with .tar.gz) and open an SSH connection to the HPC cluster. There, in your home directory, run:
# Download the file:
$ curl -o HDF5-1.8.16-binaries.tar.gz <link-you-copied>
# Extract file contents
$ tar xfv HDF5-1.8.16-binaries.tar.gz
A folder with the contents of the tar file will be created. Copy the bin, include, lib and share folders from it to the local folder we created earlier (see the example below).
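A minimal sketch, assuming the archive extracted to a directory called hdf5-1.8.16-linux-centos7-x86_64 (the actual name depends on the file you downloaded; adjust it accordingly):
# copy the HDF5 binaries, headers and libraries into our local prefix
$ cp -r hdf5-1.8.16-linux-centos7-x86_64/bin \
        hdf5-1.8.16-linux-centos7-x86_64/include \
        hdf5-1.8.16-linux-centos7-x86_64/lib \
        hdf5-1.8.16-linux-centos7-x86_64/share \
        ~/tudbulk/local/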
Now your local folder will look like this:
/local
├-- bin
├-- include
├-- lib
└-- share
That’s all for setting up HDF5. On to the next dependency.
Download and compile zlib (CPU and GPU)¶
Go to https://zlib.net/ and copy the link to the tar.gz for the source of the 1.2.11 release. Through SSH, run
# Download the file:
$ curl -o zlib-1.2.11.tar.gz <link-you-copied>
# Extract file contents
$ tar xfv zlib-1.2.11.tar.gz
Now we have to compile zlib.
$ module use /opt/insy/modulefiles/
$ module load devtoolset/8
# go in zlib dir
$ cd zlib-1.2.11
# configure, setup output path for install
$ ./configure --prefix=$HOME/tudbulk/local --static
$ make
$ make install
To verify that everything went as intended, check that libz.a is in the $HOME/tudbulk/local/lib folder:
$ ls -al $HOME/tudbulk/local/lib
If there are other libz files but no libz.a, you probably forgot the --static flag in the configuration step.
Download and compile Szip (CPU and GPU)¶
Go to https://support.hdfgroup.org/doc_resource/SZIP/ and copy the ‘SZIP Source Code’ link at the top right. In your terminal, go to a download or home directory and run:
$ curl -o szip-2.1.1.tar.gz <link-you-copied>
$ tar xfv szip-2.1.1.tar.gz
$ cd szip-2.1.1
# make sure devtoolset/8 is loaded before running make
$ ./configure --prefix=$HOME/tudbulk/local
$ make
$ cp src/.libs/libsz.a ~/tudbulk/local/lib
This copies the static libsz library to the lib folder, where we can easily find it later.
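As an optional sanity check, you can verify that the static library ended up where the k-Wave Makefile will look for it:
$ ls -al $HOME/tudbulk/local/lib | grep libsz
# expect to see libsz.a listed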
Download and compile fftw (CPU only)¶
Go to http://www.fftw.org/download.html and copy the link for the source of fftw-3.3.x (should look like http://www.fftw.org/fftw-3.3.versionnumber.tar.gz). In your terminal, go to a download or home directory and run:
$ curl -o fftw-3.3.8.tar.gz <link-you-copied>
$ tar xfv fftw-3.3.8.tar.gz
$ cd fftw-3.3.8
# make sure devtoolset/8 is loaded before running make
$ ./configure --enable-single --enable-avx --enable-openmp \
--enable-shared --prefix=$HOME/fftw3
$ make
$ make install
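To verify the install, list the contents of the fftw prefix. With --enable-single the libraries are the single-precision, f-suffixed variants (the exact set of files depends on the configure flags):
$ ls $HOME/fftw3/lib
# expect to see libfftw3f* (and libfftw3f_omp*) libraries here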
Compile k-Wave GPU linux source¶
We should now have compiled all the dependencies and put them in folders where we can easily find them. On to our main purpose: compiling k-Wave to run on the GPU.
On the login node of the HPC cluster, browse to:
$ cd $HOME/tudbulk/kwave-linux-source/kspaceFirstOrder-CUDA
We will use CUDA 11.0 and GCC 8.3 to build the source.
$ module use /opt/insy/modulefiles/
$ module load devtoolset/8
$ module load cuda/11.0
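Before building, you can quickly confirm that the intended toolchain is active (a quick check, not strictly required):
$ gcc --version    # should report GCC 8.3.x (from devtoolset/8)
$ nvcc --version   # should report CUDA release 11.0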
We will use the Makefile in the kspaceFirstOrder-CUDA directory to build the project, but first we have to make some adjustments so that it points to the correct paths of the dependencies. Open the Makefile with your favorite text editor (vi or vim):
$ vi Makefile
Make the following adjustments to the Makefile:
--- a/Makefile
+++ b/Makefile
@@ -88,10 +88,10 @@ LINKING = SEMI
# Set up paths: If using modules, the paths are set up automatically,
# otherwise, set paths manually
-CUDA_DIR = $(CUDA_HOME)
-HDF5_DIR = $(EBROOTHDF5)
-ZLIB_DIR = $(EBROOTZLIB)
-SZIP_DIR = $(EBROOTSZIP)
+CUDA_DIR = $(CUDA_PATH)
+HDF5_DIR = $(HOME)/tudbulk/local
+ZLIB_DIR = $(HOME)/tudbulk/local
+SZIP_DIR = $(HOME)/tudbulk/local
# Select CPU architecture (what instruction set to be used).
# The native architecture will compile and optimize the code for the underlying
@@ -110,8 +110,7 @@ GIT_HASH = -D__KWAVE_GIT_HASH__=\"468dc31c2842a7df5f2a07c3a13c16c9b0b2b770
.RECIPEPREFIX +=
# What CUDA GPU architectures to include in the binary
-CUDA_ARCH = --generate-code arch=compute_30,code=sm_30 \
- --generate-code arch=compute_32,code=sm_32 \
+CUDA_ARCH = \
--generate-code arch=compute_35,code=sm_35 \
--generate-code arch=compute_37,code=sm_37 \
--generate-code arch=compute_50,code=sm_50 \
That should make us ready to compile! Now it’s just a matter of
$ make
and wait.
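Compilation can take a while. If the login node allows it, you can optionally speed things up with a parallel build:
$ make -j4   # run 4 compile jobs in parallel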
After a while, the binary kspaceFirstOrder-CUDA should have been created. Check that you can run it by typing:
$ ./kspaceFirstOrder-CUDA
┌---------------------------------------------------------------┐
│ kspaceFirstOrder-CUDA v1.3 │
├---------------------------------------------------------------┤
│ Usage │
├---------------------------------------------------------------┤
│ Mandatory parameters │
├---------------------------------------------------------------┤
│ -i <file_name> │ HDF5 input file │
│ -o <file_name> │ HDF5 output file │
├-------------------------------┴-------------------------------┤
│ Optional parameters │
├-------------------------------┬-------------------------------┤
│ -t <num_threads> │ Number of CPU threads │
│ │ (default = 8) │
│ -g <device_number> │ GPU device to run on │
│ │ (default = the first free) │
│ -r <interval_in_%> │ Progress print interval │
│ │ (default = 5%) │
│ -c <compression_level> │ Compression level <0,9> │
│ │ (default = 0) │
│ --benchmark <time_steps> │ Run only a specified number │
│ │ of time steps │
│ --verbose <level> │ Level of verbosity <0,2> │
│ │ 0 - basic, 1 - advanced, │
│ │ 2 - full │
│ │ (default = basic) │
│ -h, --help │ Print help │
│ --version │ Print version and build info │
├-------------------------------┼-------------------------------┤
│ --checkpoint_file <file_name> │ HDF5 checkpoint file │
│ --checkpoint_interval <sec> │ Checkpoint after a given │
│ │ number of seconds │
│ --checkpoint_timesteps <step> │ Checkpoint after a given │
│ │ number of time steps │
├-------------------------------┴-------------------------------┤
│ Output flags │
├-------------------------------┬-------------------------------┤
│ -p │ Store acoustic pressure │
│ │ (default output flag) │
│ │ (the same as --p_raw) │
│ --p_raw │ Store raw time series of p │
│ --p_rms │ Store rms of p │
│ --p_max │ Store max of p │
│ --p_min │ Store min of p │
│ --p_max_all │ Store max of p (whole domain) │
│ --p_min_all │ Store min of p (whole domain) │
│ --p_final │ Store final pressure field │
├-------------------------------┼-------------------------------┤
│ -u │ Store ux, uy, uz │
│ │ (the same as --u_raw) │
│ --u_raw │ Store raw time series of │
│ │ ux, uy, uz │
│ --u_non_staggered_raw │ Store non-staggered raw time │
│ │ series of ux, uy, uz │
│ --u_rms │ Store rms of ux, uy, uz │
│ --u_max │ Store max of ux, uy, uz │
│ --u_min │ Store min of ux, uy, uz │
│ --u_max_all │ Store max of ux, uy, uz │
│ │ (whole domain) │
│ --u_min_all │ Store min of ux, uy, uz │
│ │ (whole domain) │
│ --u_final │ Store final acoustic velocity │
├-------------------------------┼-------------------------------┤
│ -s <time_step> │ When data collection begins │
│ │ (default = 1) │
│ --copy_sensor_mask │ Copy sensor mask to the │
│ │ output file │
└-------------------------------┴-------------------------------┘
┌---------------------------------------------------------------┐
│ !!! K-Wave experienced a fatal error !!! │
├---------------------------------------------------------------┤
│ Error: Input file was not specified. │
├---------------------------------------------------------------┤
│ Execution terminated │
└---------------------------------------------------------------┘
Now, for easy use of the binary with MATLAB, you can copy it to the k-Wave MATLAB binaries folder:
$ cp kspaceFirstOrder-CUDA $HOME/tudbulk/kwave-matlab/binaries
How to use¶
Now create a k-Wave MATLAB script. Make sure that this script adds the kwave-matlab directory to the MATLAB path (e.g. with addpath)!
To use the GPU binary, call the G-suffixed function instead of the default MATLAB routine:
% calls default MATLAB routine
sensor_data = kspaceFirstOrder2D(kgrid, medium, source, sensor)
% Will call the CUDA binary
sensor_data = kspaceFirstOrder2DG(kgrid, medium, source, sensor)
% ^ note G here!
Now, to use the GPU on the HPC cluster, set up a batch file as follows:
#!/bin/sh
#SBATCH --partition=general
#SBATCH --qos=<your qos>
#SBATCH --time=<your time>
#SBATCH --ntasks=1
#SBATCH --mem=<estimated memory>
#SBATCH --cpus-per-task=4
#SBATCH --mail-type=END
# to use the awi node:
#SBATCH --nodelist=awi01
# to use the Tesla V100
#SBATCH --gres=gpu:v100
# ==============================================================================
# Your job commands go below here
# ==============================================================================
# Load the software modules that your job requires
module use /opt/insy/modulefiles
module load matlab/R2019b
module load cuda/11.0
# Complex or heavy commands should be started with 'srun' (see 'man srun' for more information)
srun matlab -nodesktop -r "your_kwave_file"
Important
You have to load cuda, otherwise the binary will not work!
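With the batch script saved (the file name kwave_gpu.sbatch below is just an example), submit and monitor the job as usual:
$ sbatch kwave_gpu.sbatch
$ squeue -u $USER            # check the state of your job
$ tail -f slurm-<jobid>.out  # follow the MATLAB output once the job is running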
That’s about it. Enjoy your fast k-Wave simulations!
Compile k-Wave CPU linux source¶
We should now have compiled all the dependencies and put them in folders where we can easily find them. On to our main purpose: compiling k-Wave to run with the C++ OpenMP optimizations.
On the login node of the HPC cluster, browse to:
$ cd $HOME/tudbulk/kwave-linux-source/kspaceFirstOrder-OMP
We will use GCC 8.3 to build the source.
$ module use /opt/insy/modulefiles/
$ module load devtoolset/8
$ module load openmpi
We will use the Makefile in the kspaceFirstOrder-OMP directory to build the project, but first we have to make some adjustments so that it points to the correct paths of the dependencies. Open the Makefile with your favorite text editor (vi or vim):
$ vi Makefile
Make the following adjustments to the Makefile:
# LINKING = STATIC
LINKING = DYNAMIC
# Set up paths: If using modules, the paths are set up automatically,
# otherwise, set paths manually
MKL_DIR = $(EBROOTMKL)
FFT_DIR = $(HOME)/fftw3
HDF5_DIR = $(HOME)/tudbulk/local
ZLIB_DIR = $(HOME)/tudbulk/local
SZIP_DIR = $(HOME)/tudbulk/local
That should make us ready to compile! Now it’s just a matter of
$ make
Check if we were successful:
$ ./kspaceFirstOrder-OMP
┌---------------------------------------------------------------┐
│ kspaceFirstOrder-OMP v1.3 │
├---------------------------------------------------------------┤
│ Usage │
├---------------------------------------------------------------┤
│ Mandatory parameters │
├---------------------------------------------------------------┤
│ -i <file_name> │ HDF5 input file │
│ -o <file_name> │ HDF5 output file │
├-------------------------------┴-------------------------------┤
│ Optional parameters │
├-------------------------------┬-------------------------------┤
│ -t <num_threads> │ Number of CPU threads │
│ │ (default = 1) │
│ -r <interval_in_%> │ Progress print interval │
│ │ (default = 5%) │
│ -c <compression_level> │ Compression level <0,9> │
│ │ (default = 0) │
│ --benchmark <time_steps> │ Run only a specified number │
│ │ of time steps │
│ --verbose <level> │ Level of verbosity <0,2> │
│ │ 0 - basic, 1 - advanced, │
│ │ 2 - full │
│ │ (default = basic) │
│ -h, --help │ Print help │
│ --version │ Print version and build info │
├-------------------------------┼-------------------------------┤
│ --checkpoint_file <file_name> │ HDF5 checkpoint file │
│ --checkpoint_interval <sec> │ Checkpoint after a given │
│ │ number of seconds │
│ --checkpoint_timesteps <step> │ Checkpoint after a given │
│ │ number of time steps │
├-------------------------------┴-------------------------------┤
│ Output flags │
├-------------------------------┬-------------------------------┤
│ -p │ Store acoustic pressure │
│ │ (default output flag) │
│ │ (the same as --p_raw) │
│ --p_raw │ Store raw time series of p │
│ --p_rms │ Store rms of p │
│ --p_max │ Store max of p │
│ --p_min │ Store min of p │
│ --p_max_all │ Store max of p (whole domain) │
│ --p_min_all │ Store min of p (whole domain) │
│ --p_final │ Store final pressure field │
├-------------------------------┼-------------------------------┤
│ -u │ Store ux, uy, uz │
│ │ (the same as --u_raw) │
│ --u_raw │ Store raw time series of │
│ │ ux, uy, uz │
│ --u_non_staggered_raw │ Store non-staggered raw time │
│ │ series of ux, uy, uz │
│ --u_rms │ Store rms of ux, uy, uz │
│ --u_max │ Store max of ux, uy, uz │
│ --u_min │ Store min of ux, uy, uz │
│ --u_max_all │ Store max of ux, uy, uz │
│ │ (whole domain) │
│ --u_min_all │ Store min of ux, uy, uz │
│ │ (whole domain) │
│ --u_final │ Store final acoustic velocity │
├-------------------------------┼-------------------------------┤
│ -s <time_step> │ When data collection begins │
│ │ (default = 1) │
│ --copy_sensor_mask │ Copy sensor mask to the │
│ │ output file │
└-------------------------------┴-------------------------------┘
┌---------------------------------------------------------------┐
│ !!! K-Wave experienced a fatal error !!! │
├---------------------------------------------------------------┤
│ Error: Input file was not specified. │
├---------------------------------------------------------------┤
│ Execution terminated │
└---------------------------------------------------------------┘
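Because we set LINKING = DYNAMIC, the binary resolves its libraries at run time. As an optional check, you can list the shared libraries it finds (and spot missing ones) with ldd:
$ ldd ./kspaceFirstOrder-OMP
# any line ending in "not found" means a required module or library path is missing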
Now, for easy use of the binary with MATLAB, you can copy it to the k-Wave MATLAB binaries folder:
$ cp kspaceFirstOrder-OMP $HOME/tudbulk/kwave-matlab/binaries
How to use¶
Now create a k-Wave MATLAB script. Make sure that this script adds the kwave-matlab directory to the MATLAB path (e.g. with addpath)!
To use the CPU optimizations, call the C-suffixed function instead of the default MATLAB routine:
% calls default MATLAB routine
sensor_data = kspaceFirstOrder2D(kgrid, medium, source, sensor)
% Will call the C++/OpenMP binary
sensor_data = kspaceFirstOrder2DC(kgrid, medium, source, sensor)
% ^ note C here!
Now, to run the CPU-optimized binary on the HPC cluster, set up a batch file as follows:
#!/bin/sh
#SBATCH --partition=general
#SBATCH --qos=<your qos>
#SBATCH --time=<your time>
#SBATCH --ntasks=1
#SBATCH --mem=<estimated memory>
# maybe increase the number of CPUs for OpenMP:
#SBATCH --cpus-per-task=4
# ========================================================================
# Your job commands go below here
# ========================================================================
# Load the software modules that your job requires
module use /opt/insy/modulefiles
module load matlab/R2019b
module load openmpi
# Complex or heavy commands should be started with 'srun' (see 'man srun' for more information)
srun matlab -nodesktop -r "your_kwave_file"
Important
You have to load openmpi, otherwise the binary will not work!
That’s about it. Enjoy your fast k-Wave simulations!