
The IST Austria HPC cluster handbook

This handbook is intended to give you an overview of HPC (High Performance Computing) in general and the HPC cluster at IST Austria in particular.

Our system is based on Debian Linux. Please be aware that some basic UNIX/Linux knowledge is necessary to use our cluster.

Some free resources for learning Linux are available online.

Topics of the handbook

  • Why work in a cluster environment?
  • Supercomputers for beginners
  • Connecting to the IST HPC cluster
  • File storage options
  • Module environment and available software
  • Installing software individually
  • Working with the Workload Manager Slurm
      • Introduction
      • Submitting jobs
      • What is my job doing?
      • Hands-on example
  • Special hardware (GPUs, large memory nodes)

Scope of the IST HPC cluster

Computers have been an integral part of science for quite a while now. More and more research questions are tackled by sophisticated data analysis approaches or in-silico experiments. The requirements on the computational infrastructure are therefore steadily growing and must be managed economically. That is why IST Austria maintains a centralized computation infrastructure to serve the needs of its scientists: an HPC (High-performance computing) cluster. Although the present infrastructure is certainly adequate to be called high-performance, nobody should feel intimidated by its name. Intended use cases for the cluster are all scenarios where the run time, the RAM requirements or the number of required processors challenge your individual laptop/workstation or interfere with its normal usage.

Cluster usage in a nutshell

Here we present a short documentation of the general principles and of the most useful commands and practices for carrying out calculations on the computational cluster at IST. The cluster itself is a collection of different computers, also called nodes, some of which have specialized purposes. The cluster is well connected to, but distinct from, the general IST network. This means that before you can make use of the computational resources of the cluster nodes, you have to connect to the cluster. Ideally you do not want to be bothered with which individual nodes are performing the calculation; you only want to specify what has to be done, and something else takes care of the tedious details. This something else is the workload manager Slurm. So ultimately using the cluster means instructing Slurm what should be done.

General workflow on using the IST cluster

To perform calculations on the cluster, these are the main steps:

1.) You will need an account on the cluster. Accounts on the cluster and general IST system accounts are distinct. If you need access to the computational resources of the cluster for the first time, please contact us via the IT ticketing system (it@ist.ac.at).

2.) Log into a head node using SSH. From your workstation or laptop you need to log into one of the head nodes:

bea81.ista.local
gpu62.ista.local

Head nodes are servers dedicated to submitting jobs to the remaining compute nodes. The connection is typically established via the command line tool SSH, invoked by running:

ssh [YourUserName]@bea81.ista.local

Each time you establish a connection you will be prompted for your password. To prevent this you can also generate an SSH key, which handles the authentication automatically in the background. In order to generate your SSH keys, please run the script mkclustersshkey.sh once; it is available after you have logged into the head node.

After running the script once, your SSH key will be generated and stored for a couple of months, which means no password is needed to establish a connection via SSH or transfer data via SCP.

3.) Upload your data to the cluster using SCP or SFTP. In case you want to perform calculations on existing data sets which are not yet on the cluster storage, you first need to copy the data into your home directory on the cluster. From Linux and Mac workstations use the scp or sftp command line tools, whereas on Windows you can use WinSCP.
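For example, assuming your cluster user name is YourUserName and your cluster home directory is /cluster/home/YourUserName (both placeholders), a transfer from your local machine could look like this:

# copy a local file to your home directory on the cluster (run on your workstation/laptop)
scp MyInputFile.dat YourUserName@bea81.ista.local:/cluster/home/YourUserName/

# or open an interactive SFTP session and upload interactively
sftp YourUserName@bea81.ista.local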

From the head nodes you also have read/write access to your data available under:

  • /fs3/home/YourUserName (this is the H: drive under Windows)

and also your group data under

  • /fs3/group/YourGroupName (this is the K: drive under Windows)

This means you can copy data from or to these locations using the Linux command cp. Compute nodes do not have write access to these paths. Having access to your group data and to your system-wide data from the head node is therefore mainly for convenience, so that you can easily transfer data between the different storage locations. Output data generated by your executables running on the cluster, however, has to be written to your local home directory on the cluster, and only after finishing the analysis can it be copied to your regular storage location. This last step is also highly recommended, since the storage in the local cluster home is not as secure as the dedicated storage servers.
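For example, on a head node you could stage input data from the group share into your cluster home and, after the analysis has finished, copy the results back (all paths and file names below are placeholders):

# on a head node: copy input data from the group share to your cluster home
cp /fs3/group/YourGroupName/project/input.dat /cluster/home/YourUserName/project/

# after the analysis: copy the results back to the group share
cp /cluster/home/YourUserName/project/results.tar.gz /fs3/group/YourGroupName/project/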

4.) Executables for your code performing the computation. On the cluster there is a wide range of software packages compiled and ready for use. Executables for scientific software can easily be loaded through the module system. To list the currently available modules, issue the command:

module avail

and hit the space bar to list the next screen of modules. Modules can be loaded with the module load command and the module name. For example, to load Mathematica version 11.1.0:

module load mathematica/11.1.0

Since the calculations will be performed using job scripts, the respective module load command needs to be present in the script (see next step) before you call the executable.

5.) Prepare a Slurm job script. The cluster at IST is not used interactively like a traditional workstation; instead, jobs need to be submitted to a queuing system. We use the Slurm queuing system, where Slurm manages the available computer resources and schedules jobs for execution. The basic idea is that you specify what kind of resources you need and Slurm allocates these resources once they become available. For details on how to set up a Slurm script, see the sections below. In case you experience any problems preparing your job scripts, please do not hesitate to contact us via the ticketing system.

6.) Submit your job script to Slurm. Once prepared, job scripts can be submitted to the queue from your shell with the command:

sbatch YourJobScriptNameHere

If you have not specified a working directory in your Slurm job script, the job will start in the directory from which it was submitted.
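If you prefer to set the working directory explicitly, sbatch also accepts it on the command line via -D (long form --chdir on recent Slurm versions, --workdir on older ones); the path below is a placeholder:

# submit the job and let it start in an explicitly chosen working directory
sbatch -D /cluster/home/YourUserName/project YourJobScriptNameHere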

7.) Monitor the status of your jobs. After the job is submitted, it might start executing right away or wait in the queue, depending on the current usage of the compute nodes. To monitor the status of your jobs, the command squeue can be used. The squeue command accepts several command line parameters which can be used to modify the output. A useful combination of such parameters is shown in the following command:

squeue --user=`whoami` -l -o "%.7i %.8u %.8j %.9a %.9P %.7T %.10M %.9L %.4C %.7m %.6D %R"

The above command will print useful information regarding your current jobs which are either actively running or waiting in the queue. If you wish to use similar commands more often, you can create an alias for them in your Bash profile.

8.) Post-processing, plotting graphs, analyzing results. For short pre- and post-processing steps which are not resource intensive you can use the head node Bjoern22 interactively. However, steps which require several CPU cores, a substantial amount of RAM, or run for many hours should be carried out on compute nodes by submitting them as jobs to Slurm.

9.) Copy your result data back to your or your group's storage server. Revisit point 3.) for details.

10.) Billing. As with any service provided by an SSU, usage of the cluster will be billed to your group's account. This happens automatically every quarter, based on the logged cluster usage. Please visit the price list for up-to-date rates.

Slurm commands for submitting, monitoring and cancelling jobs

To submit an existing job script to Slurm:

sbatch YourJobScriptNameHere

To monitor the status of your own jobs on the IST cluster:

squeue --sort=+i --user=`whoami` -l -o "%.7i %.8u %.8j %.9a %.9P %.7T %.10M %.9L %.4C %.7m %.6D %R %b"

This will generate an output similar to the one below, where the ID of your jobs and the status (under the column STATE) will be listed together with lots of other useful information:

JOBID   USER     NAME    ACCOUNT PARTITION   STATE   TIME TIME_LEFT  CPUS MIN_MEM  NODES NODELIST(REASON)
1173   jkiss   YourJobN    itgrp2  defaultp PENDING  0:00 1-12:00:00   20      4K      1 (Resources)
1172   jkiss   YourJobN    itgrp2  defaultp RUNNING  0:31 1-11:59:29   20      4K      2 nick[02,42]

In the above output the job with JOBID 1172 is actively running, while the job with JOBID 1173 is waiting in the queue for resources to become available. In addition, the output displays the job name in the queue, your group account, the machine partition on which the job is running or being scheduled, how long the job has been running, the remaining time relative to the maximum runtime you defined, the number of CPU cores in use, the amount of RAM requested in megabytes (in this case 4K refers to 4*1024 MB = 4 GB), the number of nodes on which the job is running or being scheduled, and the names of the nodes on which the job is running. The easiest approach is probably to create an alias for the above command in your .bashrc or equivalent file.
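For example, a line like the following could be added to your ~/.bashrc (the alias name myjobs is arbitrary):

# shorthand for the detailed squeue listing of your own jobs
alias myjobs='squeue --sort=+i --user=`whoami` -l -o "%.7i %.8u %.8j %.9a %.9P %.7T %.10M %.9L %.4C %.7m %.6D %R %b"'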

To get an overview of all jobs running/waiting on the IST cluster:

squeue --sort=+i -l -o "%.7i %.8u %.8j %.9a %.9P %.7T %.10M %.9L %.4C %.7m %.6D %R %b"

To cancel/kill a job, first look up the JOBID of the particular job you intend to kill, then issue the command scancel YourJOBID, where YourJOBID is the ID of a currently running job or of a job which is still waiting in the queue. For example, to kill the job with JOBID 1173:

scancel 1173
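scancel can also select jobs by user, which is handy if you want to cancel all of your own jobs at once:

# cancel every job belonging to your user (running and pending)
scancel --user=`whoami`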

To display full information about the paths and other related environment settings of a given job (1174 in the example), issue the command:

scontrol show jobid -dd 1174

To estimate when your job will start: when you have a job with a given JOBID (1174 in the example) waiting in the queue, Slurm can provide you with a guesstimate of when this job might start, based on the currently running jobs and the resources used by those jobs. For this, issue the command:

squeue --start --jobs 1174

which will print out something similar to:

JOBID  PARTITION   NAME      USER    ST START_TIME          NODES SCHEDNODES           NODELIST(REASON)
1174   gpu         Relion_4  jkiss   PD 2017-07-15T05:14:41      1 (null)               (Resources)

The date under START_TIME is when your job is likely to start, because this is the time when the requested resources will likely become available. However, keep in mind that this is a guesstimate, since we have multiple user groups running a very heterogeneous range of jobs, from single-CPU to large distributed parallel jobs. Also, some jobs might finish much earlier than their requested time, and the scheduler might fit another job into the freed-up slot to use the resources more efficiently.

To get a quick overview of the resources/partitions currently used and still available, issue the command:

sinfo

This will return output similar to:

PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
nick         up 10-00:00:0      3  down* nick[02-03,07]
nick         up 10-00:00:0      2   idle nick06,nick08
hybrid       up 10-00:00:0     23  alloc nick[23-34,36-46]
gpu          up 10-00:00:0      2    mix gpu[62,114]
sazangpu     up 10-00:00:0      1    mix gpu113

In the above output, nodes which are either defective or under maintenance are in the down state, nodes which are idle and waiting for jobs are in the idle state, nodes in the mix state have part of their CPU/RAM resources already used by an active job, and nodes in the alloc state have all of their CPU resources in use, i.e. they are full.

To list the currently used/free resources in the compute nodes:

scontrol -o show nodes | awk '{ print $1, $6, $4, $15, $16, $17}' | sort -n

this will print an output similar to:

NodeName=gpu62  CPUTot=12 CPUAlloc=4  RealMemory=127000 AllocMem=20480 FreeMem=30993
NodeName=gpu113 CPUTot=40 CPUAlloc=16 RealMemory=254000 AllocMem=180224 FreeMem=65350
NodeName=gpu114 CPUTot=40 CPUAlloc=20 RealMemory=254000 AllocMem=225280 FreeMem=43386
NodeName=nick36 CPUTot=32 CPUAlloc=32 RealMemory=62000 AllocMem=32768 FreeMem=4904
NodeName=nick37 CPUTot=32 CPUAlloc=32 RealMemory=62000 AllocMem=32768 FreeMem=4376
NodeName=nick38 CPUTot=32 CPUAlloc=32 RealMemory=62000 AllocMem=32768 FreeMem=6445
NodeName=nick39 CPUTot=32 CPUAlloc=32 RealMemory=62000 AllocMem=32768 FreeMem=14584
NodeName=nick40 CPUTot=32 CPUAlloc=32 RealMemory=62000 AllocMem=32768 FreeMem=26392
NodeName=nick41 CPUTot=32 CPUAlloc=32 RealMemory=62000 AllocMem=32768 FreeMem=358

indicating the name of the compute node, the total number of CPUs, the number of CPUs allocated out of that total, the total amount of RAM in megabytes, the amount of RAM allocated to jobs in MB, and the amount of free RAM in MB. In the case of the free RAM, the buffered/cached RAM used by the underlying Linux operating system has already been deducted.

To list the current software and hardware features of all compute nodes of the cluster, which can be used as constraints, please issue the following Slurm command from your shell:

sinfo -o "%.5a %.10l %.6D %.6t %.20N %.48f"

To look up the resources used by a job which has finished you will need the JOBID of that job (1836 in the example). You will find the JOBID printed in most of the output files generated by the job. To list the most important resources (both those which you allocated to the job and those which were actually used, side by side), issue the following command:

sacct -j 1836 --format=User,Account,State,JobID,JobName,AllocCPUS,AllocNodes,CPUTime,ReqMem,MaxRSS,Elapsed,AllocGRES

This will print something similar to:

     User    Account      State        JobID    JobName  AllocCPUS AllocNodes    CPUTime     ReqMem     MaxRSS    Elapsed    AllocGRES 
--------- ---------- ---------- ------------ ---------- ---------- ---------- ---------- ---------- ---------- ---------- ------------ 
   jkiss      itgrp   COMPLETED 1836          script.sh         32          1 37-02:57:04     1024Mc            1-03:50:32              
              itgrp   COMPLETED 1836.batch        batch         32          1 37-02:57:04     1024Mc     10052K 1-03:50:32              
              itgrp   COMPLETED 1836.0        linpack.+         32          1 37-02:56:32     1024Mc    158132K 1-03:50:31    

The above output shows that the job with JOBID 1836 completed, that it was allocated 32 CPUs and 1024M of RAM per core by the user, that the maximum peak resident memory size (MaxRSS column) reached by the job was about 158M, and that it ran for 1 day 3 hours 50 minutes and 32 seconds. The overall accumulated active CPU time on the 32 CPUs was 37 days 2 hours 57 minutes and 4 seconds, which is almost exactly 32 times the elapsed time (32 × ~27.8 hours ≈ 37.1 days). This means that the code is highly efficient, producing a hog-factor close to 100%, i.e. most of the CPU time is spent in the code itself instead of waiting for I/O or for synchronization between processes. There were no GPU resources allocated to the job, so the AllocGRES column is empty. The job consisted of three steps: first the original Slurm script was read in, next it was executed by batch, and then the actual binary executable was started.

Similar to sbatch there is also the Slurm command srun. While the former submits the job and leaves your terminal ready for further commands, the latter submits the job and waits for its completion. Therefore srun can also be used to work interactively on a cluster node. For example, if you want to allocate 4 CPUs and a maximum of 4GB of RAM, and will finish your work within the next 8 hours, use the following command:

srun -N 1 --cpus-per-task=4  -p defaultp  --time=08:00:00  --mem=4G  --pty --x11=first bash

Please note that the requested resources, which in this case are specified as command line parameters to srun, are typically specified in the Slurm script itself when it is submitted via sbatch.

Composing a Slurm script

Single CPU serial job (for e.g. Octave, Python, R etc.)

A very common scenario among the users of the IST cluster is to carry out calculations using a single CPU on a single node. For example, one would like to run Octave, Python, a single-threaded serial C or Fortran code, R statistical computing etc.

The following example script can be used to submit such a single-CPU serial job from the head node (Bjoern22) to the Slurm queues on the cluster:

#!/bin/bash
#
#-------------------------------------------------------------
#example script for running a single-CPU serial job via SLURM
#-------------------------------------------------------------
#
#SBATCH --job-name=YourJobNameHere
#SBATCH --output=YourOutputFileNameHere
#
#Define the number of hours the job should run. 
#Maximum runtime is limited to 10 days, ie. 240 hours
#SBATCH --time=36:00:00
#
#Define the amount of RAM used by your job in GigaBytes
#SBATCH --mem=2G
#
#Send emails when a job starts, it is finished or it exits
#SBATCH --mail-user=YourEmail@ist.ac.at
#SBATCH --mail-type=ALL
#
#Pick whether you prefer requeue or not. If you use the --requeue
#option, the requeued job script will start from the beginning, 
#potentially overwriting your previous progress, so be careful.
#For some people the --requeue option might be desired if their
#application will continue from the last state.
#Do not requeue the job in the case it fails.
#SBATCH --no-requeue
#
#Do not export the local environment to the compute nodes
#SBATCH --export=NONE
unset SLURM_EXPORT_ENV
#
#for single-CPU jobs make sure that they use a single thread
export OMP_NUM_THREADS=1
#
#load the respective software module you intend to use
module load YourModuleHere
#
#
#run the respective binary through SLURM's srun
srun --cpu_bind=verbose  YourExecutableHere  /Path/To/Your/Input/Files/Here

The above script asks Slurm for a runtime of 36 hours on a single CPU and for 2GB of RAM to be allocated to the job. The default runtime in Slurm is set to one minute and the default memory to 10MB, i.e. users have to explicitly define the desired runtime and amount of RAM in their job scripts. After the job starts executing, it loads a software module through the module system and starts the actual computation through srun. Please adjust the name under which the job is displayed in the queue, the name of the output file, the maximum runtime, the maximum amount of memory, the software module to be loaded, the name of the binary to be executed, and the path to your input files according to your needs. For example:

#!/bin/bash
#
#-------------------------------------------------------------
#example script for running a single-CPU serial job via SLURM
#-------------------------------------------------------------
#
#SBATCH --ntasks=1
#
#SBATCH --job-name=CircleRadius
#SBATCH --output=CircleOut.txt
#
#Define the number of hours the job should run. 
#Maximum runtime is limited to 10 days, ie. 240 hours
#SBATCH --time=1:00:00
#
#Define the amount of RAM used by your job in GigaBytes
#SBATCH --mem=1G
#
#Send emails when a job starts, it is finished or it exits
#SBATCH --mail-user=YourEmail@ist.ac.at
#SBATCH --mail-type=ALL
#
#Pick whether you prefer requeue or not. If you use the --requeue
#option, the requeued job script will start from the beginning, 
#potentially overwriting your previous progress, so be careful.
#For some people the --requeue option might be desired if their
#application will continue from the last state.
#Do not requeue the job in the case it fails.
#SBATCH --no-requeue
#
#Do not export the local environment to the compute nodes
#SBATCH --export=NONE
unset SLURM_EXPORT_ENV
#
#for single-CPU jobs make sure that they use a single thread
export OMP_NUM_THREADS=1
#
#load the respective software module you intend to use
module load octave/4.2.1
#
#run the respective binary through SLURM's srun
srun --cpu_bind=verbose  octave  /cluster/home/username/InputFile

Since no other job parameters are explicitly defined, after the job is submitted to Slurm it will run on the machines belonging to the defaultp partition. To steer a job towards a more specific list of hosts, one can provide additional job parameters. For example, instead of the defaultp partition, one can explicitly request a partition dedicated to your workgroup. This can be done by adding the following job parameter to the above script:

#SBATCH --partition=epsilon

If, for some very specific reason (e.g. reproducing a crash on a specific node), you wish to schedule a job on a particular compute node, the name of the compute node (epsilon85 in this example) can be specified by adding the following parameter to the above script:

#SBATCH --nodelist=epsilon85

Since Slurm uses the short hostnames instead of the long aliases to identify compute nodes, if you wish to specify a given compute node you should check the output of the sinfo command and look up in the NODELIST column the short hostname corresponding to the compute node you wish to use.
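For instance, to list only the nodes of a particular partition (epsilon used as an example) together with their current state, the node-oriented output of sinfo can be used:

# show the short hostnames and states of all nodes in the epsilon partition
sinfo --partition=epsilon -N -o "%N %t"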

Array jobs

In case your analysis involves a job, perhaps with small resource requirements, which has to be repeated multiple times, maybe with different input data or different parameters, an array job will be useful. It allows you to submit collections of similar jobs quickly and with little effort. The jobs are executed independently but in parallel, as far as the available resources allow. Job arrays with millions of tasks can be submitted in milliseconds. All jobs must have the same initial options (e.g. size, time limit, etc.).

An array Slurm script invoked by sbatch --array=1,10,100 array.sh submits the commands specified in the body of the script three times. Within the script special variables are available, most importantly $SLURM_ARRAY_TASK_ID, which can be used to vary input data, run parameters or output files between the iterations of the array job. In the above example $SLURM_ARRAY_TASK_ID would take on the values 1, 10, and 100, respectively.

Please find below a very simple executable toy implementation of an array Slurm job. In its body a Perl one-liner generates a random DNA sequence. If you need to generate random sequences of variable length you could invoke it (after saving it as array.sh) with sbatch --array=10-100:15 array.sh. There the array task ID is set to every integer between 10 and 100 with step size 15 (i.e. 10, 25, 40, 55, 70, 85, 100). In each iteration a random DNA sequence of the corresponding length is generated.

#!/bin/bash
#
#----------------------------------------------------------------
# example of an array job via SLURM
#----------------------------------------------------------------
#


#  Defining options for slurm how to run
#----------------------------------------------------------------
#
#SBATCH --job-name=arraySlurmExpl
#SBATCH --output=array.log
#
#Number of CPU cores to use within one node
#SBATCH -c 1
#
#Define the number of hours the job should run. 
#Maximum runtime is limited to 10 days, ie. 240 hours
#SBATCH --time=0:05:00
#
#Define the amount of RAM used by your job in GigaBytes
#In shared memory applications this is shared among multiple CPUs
#SBATCH --mem=1G
#
#Do not requeue the job in the case it fails.
#SBATCH --no-requeue
#
#Do not export the local environment to the compute nodes
#SBATCH --export=NONE
unset SLURM_EXPORT_ENV

# load the respective software module(s) you intend to use
#----------------------------------------------------------------
# none needed for this example

# define sequence of jobs to run as you would do in a BASH script
# use variable $SLURM_ARRAY_TASK_ID to address individual behaviour
# in different iteration of the script execution
#----------------------------------------------------------------

perl -wle 'print ">length_'${SLURM_ARRAY_TASK_ID}'";print map { qw{A C G T}[rand 4] } 0..'${SLURM_ARRAY_TASK_ID}'' > length_${SLURM_ARRAY_TASK_ID}.fa
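A common variation of this pattern is to use the array task ID to pick a different input file for each task. The following fragment is only a sketch; the file names data_1.txt, data_2.txt, ... and the executable are placeholders. It could replace the Perl one-liner in the body of the script above and be submitted with, e.g., sbatch --array=1-10 array.sh:

# each array task processes one numbered input file and writes its own output
INPUT=data_${SLURM_ARRAY_TASK_ID}.txt
srun --cpu_bind=verbose YourExecutableHere $INPUT > result_${SLURM_ARRAY_TASK_ID}.out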

Shared memory example for multithreaded applications

Some software can take advantage of parallel execution. In a shared memory multithreaded code the application runs over multiple CPUs located within one compute node, where the memory is shared among the CPUs. How many CPUs should be used is often indicated to the software tool by options such as -t, --threads, --cores, or --cpus. Since the memory is shared, such multithreaded jobs are limited to the maximum amount of RAM and CPU cores available within a single compute node. To execute a shared memory parallel job running over multiple CPUs, one can use the following Slurm job script:

#!/bin/bash
#
#-------------------------------------------------------------
#running a shared memory (multithreaded) job over multiple CPUs
#-------------------------------------------------------------
#
#SBATCH --job-name=YourJobNameHere
#SBATCH --output=YourOutputFileNameHere
#
#Number of CPU cores to use within one node
#SBATCH -c 4
#
#Define the number of hours the job should run. 
#Maximum runtime is limited to 10 days, ie. 240 hours
#SBATCH --time=36:00:00
#
#Define the amount of RAM used by your job in GigaBytes
#In shared memory applications this is shared among multiple CPUs
#SBATCH --mem=8G
#
#Send emails when a job starts, it is finished or it exits
#SBATCH --mail-user=YourEmail@ist.ac.at
#SBATCH --mail-type=ALL
#
#Pick whether you prefer requeue or not. If you use the --requeue
#option, the requeued job script will start from the beginning, 
#potentially overwriting your previous progress, so be careful.
#For some people the --requeue option might be desired if their
#application will continue from the last state.
#Do not requeue the job in the case it fails.
#SBATCH --no-requeue
#
#Do not export the local environment to the compute nodes
#SBATCH --export=NONE
unset SLURM_EXPORT_ENV
#
#Set the number of threads to the SLURM internal variable
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
#
#load the respective software module you intend to use
module load YourModuleHere
#
#run the respective binary through SLURM's srun
srun --cpu_bind=verbose  YourBinaryHere  /Path/To/Your/Input/Files/Here

In the above example, we are asking Slurm to schedule a job with a runtime of 36 hours on a machine which has four free CPU cores (-c 4 in the example) and 8GB of RAM. This 8GB of memory will be shared by the four CPUs, and srun will run your job on the number of CPU cores you specified.
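As a concrete but hypothetical illustration, a multithreaded tool that accepts a --threads option would be started in the script like this, so that it uses exactly the number of cores Slurm allocated (tool name and paths are placeholders):

# pass the Slurm CPU allocation to the tool's own thread option
srun --cpu_bind=verbose SomeMultithreadedTool --threads $SLURM_CPUS_PER_TASK /Path/To/Your/Input/Files/Here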

Simple executable example

As seen above, a Slurm script consists of specifications for Slurm, lines typically starting with #SBATCH, which tell Slurm how to run the job, and of executable bash commands, which tell the compute node what to do. The following example from the realm of genomics tells Slurm to allocate 1 GB of RAM and two CPUs on one node, to download paired-end NGS sequencing reads from the Short Read Archive (SRA) and the associated bacterial reference genome from Ensembl. It loads all needed software, namely a short read mapper and two tools to manipulate SAM files. Finally it calls genomic variants from the downloaded data and filters them for quality. You can copy/paste this example into a file (maybe name it genomic_Slurm_example.sh) in a working directory on the cluster and execute it with sbatch genomic_Slurm_example.sh.

#!/bin/bash

#  Defining options for slurm how to run
#----------------------------------------------------------------
#
#SBATCH --job-name=bioSlurmExpl
#SBATCH --output=genomicSlurmExample.log
#
#Number of CPU cores to use within one node
#SBATCH -c 2
#
#Define the number of hours the job should run. 
#SBATCH --time=0:05:00
#
#Define the amount of RAM used by your job in GigaBytes
#SBATCH --mem=1G
#
#Do not requeue the job in the case it fails.
#SBATCH --no-requeue
#
#Do not export the local environment to the compute nodes
#SBATCH --export=NONE
unset SLURM_EXPORT_ENV
#
#Set the number of threads to the SLURM internal variable
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK


# load the respective software module(s) you intend to use
#----------------------------------------------------------------
module load minimap2/2.15
module load samtools/1.8
module load bcftools/1.8

# define sequence of jobs to run as you would do in a BASH script
#----------------------------------------------------------------

## set variables
CPUS=$SLURM_CPUS_PER_TASK
REF=Pasteurella_multocida_subsp_multocida_str_pm70.ASM682v1.dna.chromosome.Chromosome.fa
R1=SRR4124989_1.fastq.gz
R2=SRR4124989_2.fastq.gz

## get data
if [ ! -f $REF ];then
  wget ftp://ftp.ensemblgenomes.org/pub/bacteria/release-41/fasta/bacteria_0_collection/pasteurella_multocida_subsp_multocida_str_pm70/dna/${REF}.gz
  gunzip ${REF}.gz
fi
if [ ! -f $R1 ];then
  wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR412/009/SRR4124989/$R1
fi
if [ ! -f $R2 ];then
  wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR412/009/SRR4124989/$R2
fi

## run analysis
minimap2 -a -x sr -t $CPUS $REF $R1 $R2\
  | samtools sort -l 0 --threads $CPUS\
  | bcftools mpileup -Ou -B --min-MQ 60 -f $REF -\
  | bcftools call -Ou -v -m -\
  | bcftools norm -Ou -f $REF -d any -\
  | bcftools filter -Ov -e 'QUAL<20' > variants.vcf   # QUAL threshold is an example value, adjust to your needs

## example script stolen without permission from Torsten Seemann's The Genome Factory
## http://thegenomefactory.blogspot.com/2018/10/a-unix-one-liner-to-call-bacterial.html

Distributed memory example for MPI parallel jobs

Distributed memory parallelization uses a programming model where messages are passed between processes through the MPI software interface, and the processes are distributed over multiple different compute nodes. For jobs to work in parallel via the distributed MPI model, the software needs to be implemented from the ground up with a distributed memory model in mind. Thus, only massively parallel applications have MPI support. Such applications are those where either the problem one would like to tackle is far too large to fit into the memory of a single compute node, or the CPU resources of a single compute node are not enough to solve the problem in a reasonable time. Due to the distributed memory programming model, the memory required for a job needs to be specified on a per-CPU basis. The total amount of memory used will be equal to the number of CPU cores multiplied by the amount of RAM per CPU. For MPI parallel jobs the user must explicitly provide a partition in the job script where the job should run (in the example below this is the bjoern partition).

The following example shows how to run a pure MPI job distributed over two compute nodes (on the bjoern partition) using 64 CPUs and 1GB of memory per CPU core (i.e. 64*1GB = 64GB of memory in total):

#!/bin/bash
#
#-------------------------------------------------------------
#running a distributed memory MPI job over multiple nodes
#-------------------------------------------------------------
#
#Take full nodes with 32CPUs/threads for large parallel jobs
#SBATCH --ntasks-per-node=32
#
#Define the number of nodes the job should be distributed on
#SBATCH --nodes=2
#
#Total number of CPU cores to be used for the MPI job
#SBATCH --ntasks=64
#
#Define the number of hours the job should run. 
#Maximum runtime is limited to 10 days, ie. 240 hours
#SBATCH --time=36:00:00
#
#Define the amount of RAM used per CPU in GigaBytes
#In distributed memory applications the total amount of RAM 
#used will be:   number of CPUs * memory per CPU
#SBATCH --mem-per-cpu=1G
#
#Send emails when a job starts, it is finished or it exits
#SBATCH --mail-user=YourEmail@ist.ac.at
#SBATCH --mail-type=ALL
#
#Pick whether you prefer requeue or not. If you use the --requeue
#option, the requeued job script will start from the beginning, 
#potentially overwriting your previous progress, so be careful.
#For some people the --requeue option might be desired if their
#application will continue from the last state.
#Do not requeue the job in the case it fails.
#SBATCH --no-requeue
#
#Define the partition of nodes for distributed memory jobs
#SBATCH --partition=bjoern
#
#Do not export the local environment to the compute nodes
#SBATCH --export=NONE
unset SLURM_EXPORT_ENV
#
#load an MPI module with SLURM support, or a software module with MPI support
module load openmpi/3.1.3
#
#for pure MPI jobs the number of threads has to be one
export OMP_NUM_THREADS=1
#
#run the respective binary through SLURM's srun
srun --mpi=pmi2 --cpu_bind=verbose  YourBinaryHere  /Path/To/Your/Input/File/Here

For large distributed memory jobs it is good practice to take full compute nodes in order to improve the load balancing and to reduce the congestion of the InfiniBand network. In the above example, the job will run on the partition bjoern, where each compute node has 32 threads (--ntasks-per-node=32). Next, decide how many nodes you will need for your MPI job. In the example, two compute nodes will be used (--nodes=2). Multiply the number of nodes by the number of CPUs per node, and use this number to define the total number of CPUs (--ntasks=64 in the example). Thus, the above example will submit an MPI parallel job where 64 CPUs/threads will be used in total on the partition bjoern, and the respective MPI processes will be distributed over two compute nodes such that 32 processes start on each node, fully using the available CPU resources of the nodes.

If your parallel jobs do not require load balancing, you can remove the

#SBATCH --ntasks-per-node=32
#SBATCH --nodes=2

job parameters and just define the total number of CPUs with the parameter:

#SBATCH --ntasks=64

Such jobs will be distributed over multiple compute nodes on the partition you defined, and Slurm will fit the individual processes wherever it finds a free slot. Therefore, this type of scheduling is not optimal for balanced MPI jobs, and the performance will be lower.

Hybrid MPI + OpenMP parallel jobs

For some software packages written for very large problem sizes, hybrid parallelization might also be available. In a hybrid job the load is first distributed over multiple MPI processes running on multiple separate compute nodes. In addition, on top of every MPI process one also runs multiple threads within a single node, using the more traditional OpenMP fork-join model. Because of this complex programming model, only very little software supports hybrid parallelization.

The total number of CPU cores allocated to a hybrid job can be calculated as the number of MPI tasks multiplied by the number of OpenMP threads per task (in the following example this is 12*4 = 48 CPUs). The total amount of memory used by the job is calculated as the number of MPI tasks multiplied by the amount of memory per MPI task (in the following example 12*5GB = 60GB of RAM in total). Similar to pure MPI jobs, for hybrid jobs the user must also specify a partition on which the job should be distributed. To run a hybrid workload one can use a script like, for example:

#!/bin/bash
#
#-----------------------------------------
# Hybrid MPI + OpenMP example for SLURM
#-----------------------------------------
#SBATCH --job-name=YourJobNameHere
#SBATCH --output=YourOutputFileNameHere
#
#Number of MPI tasks (MPI ranks in the hybrid job)
#SBATCH --ntasks=12
#
#Define the amount of RAM used per MPI task in GigaBytes
#The total amount of memory used by the job
#can be calculated as:  number of MPI tasks * memory per CPU
#SBATCH --mem-per-cpu=5G
#
#Send emails when a job starts, it is finished or it exits
#SBATCH --mail-user=YourEmail@ist.ac.at
#SBATCH --mail-type=ALL
#
#Pick whether you prefer requeue or not. If you use the --requeue
#option, the requeued job script will start from the beginning, 
#potentially overwriting your previous progress, so be careful.
#For some people the --requeue option might be desired if their
#application will continue from the last state.
#Do not requeue the job in the case it fails.
#SBATCH --no-requeue
#
#Number of OpenMP threads per MPI task
#SBATCH --cpus-per-task=4
#The total number of CPUs used by the job
#can be calculated as:  number of MPI tasks * cpus per task
#
#Define the number of hours the job should run. 
#Maximum runtime is limited to 10 days, ie. 240 hours
#SBATCH --time=36:00:00
#
#Define the partition of nodes for hybrid jobs
#SBATCH --partition=epsilon
#
#Do not export the local environment to the compute nodes
#SBATCH --export=NONE
unset SLURM_EXPORT_ENV
#
#load an MPI module with SLURM support or a software module
module load openmpi/3.1.3
#
#Set the number of OpenMP threads per MPI task to SLURM internal variable value
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
#
#run the binary through SLURM's srun
srun --mpi=pmi2 --cpu_bind=verbose  YourBinaryHere /Path/To/Your/Input/File/Here

GPU computing

GPU computing is the use of a GPU (graphics processing unit) as a co-processor to accelerate CPUs. Due to the architecture of GPUs, they allow for the efficient parallelization of basic computing operations.

Single GPU single-CPU job example

If your application supports GPUs via CUDA and you need GPU acceleration for your job, please first contact us to include you in the partition called gpu, which contains compute nodes with multiple GPUs. The GPUs have to be allocated through Slurm together with the respective CPU resources and system memory. For this, users must explicitly define the gpu partition in their job scripts.

Currently in the gpu partition we have the following hardware:

  • one GPU compute node with four Nvidia GTX980 GPU cards (hostname GPU62)
  • two GPU compute nodes with four Nvidia GTX1080 Ti GPU cards (hostnames GPU113 and GPU114)

GPU113 is currently dedicated to a research group; GPU114 is available to all users.

For running a job using one CPU and one GTX980, the following example script can be used. The script is asking for a single CPU, 5GB of system RAM and a single GPU:

#!/bin/bash
#
#----------------------------------
# single GPU + single CPU example
#----------------------------------
#
#SBATCH --job-name=YourJobNameHere
#SBATCH --output=YourOutputFileNameHere
#
#number of CPUs to be used
#SBATCH --ntasks=1
#
#Define the number of hours the job should run. 
#Maximum runtime is limited to 10 days, ie. 240 hours
#SBATCH --time=36:00:00
#
#Define the amount of system RAM used by your job in GigaBytes
#SBATCH --mem=5G
#
#Send emails when a job starts, it is finished or it exits
#SBATCH --mail-user=YourEmail@ist.ac.at
#SBATCH --mail-type=ALL
#
#Pick whether you prefer requeue or not. If you use the --requeue
#option, the requeued job script will start from the beginning, 
#potentially overwriting your previous progress, so be careful.
#For some people the --requeue option might be desired if their
#application will continue from the last state.
#Do not requeue the job in the case it fails.
#SBATCH --no-requeue
#
#Define the gpu partition for GPU-accelerated jobs
#SBATCH --partition=gpu
#
#Define the number of GPUs used by your job
#SBATCH --gres=gpu:1
#
#Define the GPU architecture (GTX980 in the example, other options are GTX1080Ti, K40)
#SBATCH --constraint=GTX980
#
#Do not export the local environment to the compute nodes
#SBATCH --export=NONE
unset SLURM_EXPORT_ENV
#
#for single-CPU jobs make sure that they use a single thread
export OMP_NUM_THREADS=1
#
#load a CUDA software module
module load YourCUDAmodule
#
#print out the list of GPUs before the job is started
/usr/bin/nvidia-smi
#
#run your CUDA binary through SLURM's srun
srun --cpu_bind=verbose  YourCUDAbinaryHere /Path/To/Your/Input/File/Here

In case the available device IDs need to be provided to the binary, one can identify the list of available GPU devices with the following bash commands:

# get comma-separated list of available devices
export CUDA_VISIBLE_DEVICES=$(nvidia-smi pmon -c 1 | awk '/^[^#]/ {if ($2=="-") {printf("%s,",$1);}}')
# remove last comma
export CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES::-1} 
# display available cuda devices
echo $CUDA_VISIBLE_DEVICES

External resources to get help

A very useful resource, especially if you are new to Slurm but familiar with another job scheduler, can be found here. You will find a well-arranged list of the key command line tools, their parameters and the names of internally usable variables, and how they relate between the different systems.

Common Slurm errors and their cause

Out of memory error in Slurm

If your job tries to use more RAM than was allocated to it based on the resources you requested in the job script, Slurm will kill the process and exit the job script. In such a case you will see messages similar to the following in the output file:

slurmstepd-epsilon88: error: Step 1686.0 exceeded memory limit (5676892 > 2097152), being killed
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd-epsilon88: error: Exceeded job memory limit
srun: error: epsilon88: task 0: Killed

Such a message is a clear indication that you have to allocate more RAM to your job.

Out of time error in Slurm

If your job runs longer than the amount of time you requested in the job script, Slurm will kill the process and exit the job script. In such a case you will see messages similar to the following in the output file:

slurmstepd-serbyn128: error: *** JOB 18052208 ON serbyn128 CANCELLED AT 2020-04-17T17:56:50 DUE TO TIME LIMIT ***

Such a message is a clear indication that you have to allocate more time to your job using the --time parameter (e.g. --time=0:12:00).

Installing software

As already mentioned above, with the module system you have a comprehensive collection of software tools, with fine-grained control over version numbers, at your fingertips. If essential tools are missing, please do not hesitate to request a new module via our IT ticketing system (mail to it@ist.ac.at). If you would like to have more control over your installations, there are several options available.

Local installation

In contrast to your laptop/workstation, you typically do not have sudo rights on any cluster node. This prevents you from using convenient package management tools such as dpkg, apt, yum, and the like. Nevertheless, you can still install any software in your home directory; typically this is done in ~/.local/bin. To make it accessible you might want to extend the $PATH environment variable. If you apply this strategy, please do not use the Slurm parameter #SBATCH --export=NONE; unset SLURM_EXPORT_ENV as seen in the above examples. Otherwise, the Slurm job will not inherit your local settings, e.g. your $PATH.
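For example, to make executables installed under ~/.local/bin available in every new shell, a line like the following could be added to your ~/.bashrc:

# prepend the private installation directory to the search path
export PATH="$HOME/.local/bin:$PATH"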

Install via pip and virtualenv

If you want to install your private Python programs, pip is the conventional way to go. While the use of pip is convenient, it is also subject to some pitfalls on the cluster. If such an installation is shared between compute nodes with different setups (e.g. GPU vs. non-GPU nodes, CUDA 8 vs. CUDA 9, cuDNN 6 vs. 7, etc.), you might end up with some unexpected and undefined behavior. A better alternative is the use of virtualenv. You can define multiple environments in the following way:

virtualenv -p python3 MYENV001   # 1) set up environment MYENV001 (or any other name)
source MYENV001/bin/activate     # 2) activate that environment
pip install PKG                  # 3) install your packages
python ...                       # 4) do your tasks
deactivate                       # 5) deactivate the environment

Install via conda

Similar in spirit but not restricted to Python programs is the package manager and environment management system conda. It allows you to install and maintain different software versions in parallel and have detailed control over their use, thereby facilitating the reproducibility of your data analysis workflows. To install conda please follow the instructions here. A list of available packages can be found here. Environments are created and used in the following way:

conda create -n MyEnv01 program1 program2 ... # create new env
conda remove -n MyEnv01 program2              # remove a package from an env
conda install -n MyEnv01 program3             # add a package to an env
conda list -n MyEnv01                         # list all packages in an env
conda activate MyEnv01                        # activate conda env
conda deactivate                              # deactivate conda env

Cluster Storage

In our HPC cluster at IST we use storage servers which are attached to the compute nodes and head nodes via a specialized high-throughput, low-latency network called InfiniBand (see the red lines in the attached figure). Each storage server (currently 4 in production) hosts the cluster home directories for multiple research groups, so that the load is distributed over multiple servers. Within the cluster, the storage is attached via 40 Gbps QDR and 100 Gbps EDR InfiniBand in order to provide high sustained throughput. Since the specialized InfiniBand network only connects cluster head nodes and compute nodes, the data located in your home directory on the HPC cluster is not directly available on your Windows/Linux/Mac workstation. Therefore, in order to access/process/manipulate your data on the cluster, please log into one of the head nodes (currently gpu62 and bea81 are set up as head nodes). As an important detail, please be aware that there is no backup made of this cluster storage. Therefore, important data/results should be kept elsewhere. More thorough guidelines regarding the storage of scientific data at IST are presented under the following link: https://it.pages.ist.ac.at/docs/it-policies/general-policies/research-data-handling-guideline/

Since 2017 the storage hierarchy in the HPC cluster is based on research groups, and the home directories of individual members are located within the main directory of their affiliated research group. A disk quota is set up for every research group. This way a single user/group cannot fill up the whole disk space, which would prevent all other users at IST from using the cluster. However, there are no per-user quotas within a group, i.e. a single user can fill up the whole space allocated to a research group. This group hierarchy also makes it easier for researchers within a group to share data with each other, since new subdirectories can be created within the group directory and the access rights on them set by the head of the group.

In contrast to the InfiniBand network, the campus-wide storage servers are exported via standard 10 Gbps TCP/IP network connections (shown with green lines in the figure below). This campus-wide storage is used by all laptops/desktops, microscopes and other experimental equipment as well. Because of this, one has to regard the campus-wide storage and the storage on the HPC cluster as two separate entities.

On your personal Windows workstations/laptops the campus-wide storage servers are available under the drive letters H:, K:, and Q: (drive letters may vary), whereas on Linux and Mac workstations/laptops these are mounted under /fs3/home, /fs3/group and /archive3, respectively. If you are an IST member, you should have access to your data located on these servers from anywhere on campus.

As shown in the figure, the head nodes of the HPC cluster have both the cluster storage and the campus-wide storage mounted with read-write access, i.e. you can access/modify your data from a head node. However, due to the difference in the underlying network technology, and especially due to the difference in throughput (100 Gbps InfiniBand vs. 10 Gbps TCP/IP), the compute nodes have only read access to the campus-wide storage, but no write access. One can use the head nodes to transfer data between the cluster storage and the campus-wide storage via SCP, SFTP or rsync.
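For example, logged into a head node, a results directory could be copied from the cluster storage to the campus-wide group share with rsync (all paths are placeholders):

# run on a head node: synchronize results from the cluster storage to the group share
rsync -av /cluster/home/YourUserName/project/results/ /fs3/group/YourGroupName/project/results/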

Because of these two different storage networks, if you perform a measurement on a microscope, that data will be written to the campus-wide storage. In order to carry out calculations on this data on the cluster, please first copy the data to the cluster storage via a head node. Useful guidelines regarding the storage and handling of data are available under: https://it.pages.ist.ac.at/docs/it-policies/general-policies/research-data-handling-guideline/