MXQ cluster scheduler
This is just a brief overview of the MXQ cluster scheduler. Please contact helpdesk@molgen.mpg.de with any questions.
You can also file issues and bug reports at https://github.molgen.mpg.de/mariux64/mxq/issues
General information
The cluster provides a large number of processors and a large amount of memory. It is assembled from various heterogeneous servers. Every server adds a different number of slots to the overall cluster, where every slot represents one (hyperthreaded) core of a processor. The amount of memory per slot a server offers may vary (memory_per_slot = server_total_memory / number_of_slots).
Job submission
Resources
There are four resources that we encourage the user to specify when submitting a job:
- number of processors
- amount of memory
- estimated running time
- size of the per-job temporary disk storage ($TMPDIR)
If nothing is specified, a job runs with
--processors=1 --memory=2G --time=15m --tmpdir=10G
Consult `mxqsub --help` for further information.
Jobs that exceed these specifications are killed by the cluster.
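For illustration, a submission that overrides all four defaults could look like the following (the script name and resource values are placeholders, not recommendations):

```shell
# Hypothetical example: request 4 processors, 16 GiB of memory,
# 4 hours of runtime and 50G of private scratch space.
mxqsub --processors=4 --memory=16G --time=4h --tmpdir=50G ./my_analysis.sh
```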
Jobs that need more resources are likely to be run with an effectively lower priority by the cluster software.
Jobs that run longer than 24 hours are not guaranteed to finish. Since we need to be able to reboot machines occasionally, we might kill jobs that are not scheduled to finish within the next 24 hours. Make sure to implement some kind of checkpointing to be able to resume your jobs.
Output
The standard output and standard error channels of a process can be captured to files using the `--stdout` and `--stderr` options of `mxqsub`. By default, the standard error is redirected to the standard output, which by default is redirected to /dev/null. Thus, using `--stderr` is only useful if you want to redirect those channels to different files.
If standard output or standard error are given as relative paths, they are interpreted relative to the working directory.
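For example, a hypothetical submission that captures the two channels in separate files (script name and log paths are made up) might look like:

```shell
# stdout and stderr go to different files,
# interpreted relative to the working directory
mxqsub --stdout=logs/job.out --stderr=logs/job.err ./my_job.sh
```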
Job execution
Every job is executed in a defined environment. **No environment variables are inherited** from the process that submitted the job.
Output is redirected as specified at submission time. During execution of a job, the output is redirected to temporary files in the same directory; once the job has finished, the output files are renamed into place.
The standard input is redirected from /dev/null.
Execution Environment
Standard environment
- `USER`: username
- `PATH`: set to /sbin:/bin:/usr/sbin:/usr/bin:/usr/local/sbin:/usr/local/bin:/usr/local/package/bin
- `PWD`: the working directory
- `HOME`: the user's home directory
- `SHELL`: the user's login shell
- `TMPDIR`: directory for temporary files
mxq special environment variables
- `MXQ_JOBID`: numeric job id
- `MXQ_THREADS`: number of processors requested for this job
- `MXQ_SLOTS`: number of slots this job occupies on this host
- `MXQ_MEMORY`: amount of memory reserved for this job (in MiB)
- `MXQ_TIME`: number of minutes this job is supposed to run
- `MXQ_HOSTID`: host identifier (can be used to file bug reports)
- `MXQ_JOB_TMPDIR`: same as TMPDIR
compatibility environment variables
- `HOSTNAME`: hostname
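A job script can use these variables directly. The following sketch falls back to local values when the MXQ_* variables are unset (i.e. when the script is run outside the cluster), so that it can also be tried interactively:

```shell
#! /usr/bin/bash
# Fall back to local values when not running under MXQ
# (the MXQ_* variables are only set inside a cluster job).
jobid="${MXQ_JOBID:-local}"
scratch="${MXQ_JOB_TMPDIR:-$(mktemp -d)}"

echo "job $jobid running on ${HOSTNAME:-$(hostname)}"
: > "$scratch/work.tmp"    # create a temporary work file in the scratch space
echo "temporary data in $scratch"
```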
Kill groups
Groups can be cancelled/killed at any time.
mxqkill --group-id <GROUP_ID>
Running jobs are killed with a SIGINT signal, which can take several seconds depending on the state of the server. It might just be too lazy to stop working.
List jobs
Please use mxqdump to dump information about running or historical groups or jobs.
mxqdump
mxqdump --group-id <GROUP-ID>
mxqdump --groups [...]
mxqdump --job-id <JOB-ID>
mxqdump --jobs [...]
mxqdump --help
See what's going on
There is also a small web service to show the cluster state: http://afk.molgen.mpg.de/mxq/mxq/
Best Practice
Heavy IO
Your jobs will be started with an individual preallocated temporary filesystem. Use the `mxqsub` option `--tmpdir` to change the capacity from the default (10G).
The environment variables MXQ_JOB_TMPDIR and TMPDIR both point to that filesystem. TMPDIR is used by some tools automatically. Use `$MXQ_JOB_TMPDIR` or `$TMPDIR` from your scripts for temporary files.
Advantages:
- The temporary space is on a local disk.
- The requested space is preallocated. When your job starts, it is guaranteed to have the requested space at its disposal (no "disk full" because of other users' bad behavior). Your job will only be started on a node which has enough free local scratch space.
- The space available to your job is limited to the requested size. There is no risk that your job accidentally fills up a local disk.
- You don't need to care about cleanup. The MXQ daemon will clean up the space after your job has completed.
- You don't need to restrict yourself to queueing just a few jobs. Submit as many jobs as you like - the nodes will only run as many jobs in parallel as the free disk space allows.
- The directory is a job-private directory which is empty when your job starts. There will be no conflicts with other jobs of the same group or name clashes with files from other users in a shared scratch directory.
The requested tmpdir size is shown in the "Active Groups" table and on the group details pages at http://afk.molgen.mpg.de/mxq/mxq/groups. The daemon doesn't track how much disk space your job actually used. If you want to know that, just put a `df -h $MXQ_JOB_TMPDIR` at the end of your script.
**How to be nice to the NFS servers**
On the cluster, it is allowed to access the global NFS namespace (/project/...). This is convenient for the users of the cluster, but often results in problems for everybody: cluster jobs can easily slow down or completely bring down a fileserver if they are doing a lot of I/O in parallel. With MXQ_JOB_TMPDIR there are two things you can do to avoid the problem:
- **Avoid temporary files on the NFS server**

  Often users put pipelines into the cluster which not only read input files and write the final output files, but also produce a lot of intermediate data. This data may live in the current working directory or in the same directories as the input and output files, so these files most likely end up on a remote fileserver. This can be addressed by the following procedure:

  - copy your input files from their directory to MXQ_JOB_TMPDIR
  - cd into MXQ_JOB_TMPDIR
  - run your program
  - copy your output files to their final directory.
So if your script, which you submit to the cluster, looks like this:

    :::bash
    #! /usr/bin/bash
    /project/somewhere/some-strange-pipeline

and `some-strange-pipeline` reads, for example, fasta files from a subdirectory `fasta`, does a lot of processing and writes `analyze.txt` at the end, you could change that to:

    :::bash
    #! /usr/bin/bash
    set -ve
    ORIGDIR="$(/bin/pwd)"
    cp -a fasta $MXQ_JOB_TMPDIR/
    cd $MXQ_JOB_TMPDIR/
    /project/somewhere/some-strange-pipeline
    cp analyze.txt "$ORIGDIR/"
    df -h $MXQ_JOB_TMPDIR
You might need some trial-and-error attempts to identify all required input and output files, but it is worth it. Note for the above code: you don't need to quote `$MXQ_JOB_TMPDIR`, because it won't contain shell metacharacters. You only need to quote `$(/bin/pwd)` and `$ORIGDIR` if your directory may contain shell metacharacters (e.g. spaces).

- **Avoid parallel copies**
The above approach is good, but if you submit this script multiple times, multiple copies of it might run in parallel and still do parallel I/O to the fileserver when reading the input files or writing the output files.

A possible solution is to use a lock to make sure that only a single job does a copy at any time. The lock can be based on any file, but all jobs should use the same file, even when started from different directories, so use an absolute path name for it. Transform the script to:

    :::bash
    #! /usr/bin/bash
    set -ve
    ORIGDIR="$(/bin/pwd)"
    flock /project/somewhere/.my_io_lock -c '
        cp -a fasta $MXQ_JOB_TMPDIR/
    '
    cd $MXQ_JOB_TMPDIR/
    /project/somewhere/some-strange-pipeline
    ORIGDIR="$ORIGDIR" flock /project/somewhere/.my_io_lock -c '
        cp analyze.txt "$ORIGDIR/"
    '
    df -h $MXQ_JOB_TMPDIR
The `ORIGDIR="$ORIGDIR"` before the second `flock` sets the environment variable ORIGDIR for that command to the value of the shell variable ORIGDIR, so that the variable can be used in the subshell, because the subshell creates a shell variable for each environment variable when it starts. The `-c` argument is a single-quoted string which spans multiple lines. It could contain more commands than just a single `cp`.
Checkpointing your programs
Software that runs for several hours, days, weeks or even months should support checkpointing, because execution might fail and the calculation should then be able to continue instead of being forced to restart from scratch.
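The idea can be sketched in a few lines of shell; the checkpoint file name and the work loop below are made up for illustration. On a restart, the loop continues from the last recorded step instead of beginning again at step 1:

```shell
#! /usr/bin/bash
# Illustrative checkpointing sketch: resume from checkpoint.txt if present.
ckpt="checkpoint.txt"
start=1
[ -f "$ckpt" ] && start=$(( $(cat "$ckpt") + 1 ))

for (( step = start; step <= 10; step++ )); do
    # ... one unit of real work would happen here ...
    echo "$step" > "$ckpt.tmp"    # write the checkpoint to a temporary file
    mv "$ckpt.tmp" "$ckpt"        # and rename it into place once complete
done
echo "finished at step $(cat "$ckpt")"
```

Writing the checkpoint to a temporary file and renaming it ensures a kill in mid-write never leaves a truncated checkpoint behind.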
Very fast jobs
Do not queue very fast-running jobs without grouping them. A job should run for at least 1 to 5 minutes. Otherwise, the overhead of managing the jobs by the cluster servers will slow down the overall performance.
Thus, if you have jobs that run under 60 seconds, you should group them in a shell script to be executed in batches. 10'000 jobs each running 600 seconds are preferable to 100'000 jobs running 60 seconds each.
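One way to do that is a small wrapper script that processes a whole batch of inputs in a single cluster job; the function and input names below are placeholders for the real per-item work:

```shell
#! /usr/bin/bash
# Hypothetical batch wrapper: run many short tasks inside one job.
process_one() {
    # stand-in for the real work done per input item
    echo "processed $1"
}

# the batch would normally be passed as arguments to the script;
# here we set an example batch of three hypothetical inputs:
set -- input001 input002 input003
for item in "$@"; do
    process_one "$item"
done
```

Such a wrapper could then be submitted once per batch instead of once per item, turning many sub-minute tasks into one job of reasonable length.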
Support automatic grouping
Start a sane amount of different groups - a group for every job is insane by definition.
- use `mxqsub --group-name NAME` to specify a group name for the jobs you're submitting
- use `mxqsub --command-alias ALIAS` to merge all jobs started by a shell wrapper or script interpreter into one group
  - e.g.: for `perl SCRIPTNAME`, set ALIAS to SCRIPTNAME
  - e.g.: for `wrapperX.sh`, set ALIAS to the name of the program the wrapper is executing
  - e.g.: for /path/to/a/PROGRAM and /path/to/b/PROGRAM, set ALIAS to PROGRAM
File content
You can test whether a file exists. You cannot test whether a file that exists is complete. Copying tebibytes of data takes time. Therefore: use temporary files (in the same directory or at least on the same filesystem) and rename (`mv`, `ln`) them once the content has been written completely.
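That pattern is just a few lines of shell; the file names are placeholders:

```shell
# Write to a temporary file on the same filesystem first, then rename it
# into place. The rename is atomic, so other processes either see no file
# or the complete file, never a half-written one.
echo "final content" > result.txt.tmp
mv result.txt.tmp result.txt
```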
GPU usage
By adding `--gpu` to your `mxqsub` command, you can request a GPU for your job. The GPUs currently available in the cluster are shown in the following table. Currently there are only NVIDIA (CUDA) GPUs. On demand, the A100 GPUs can be split into multiple instances using NVIDIA MIG technology.
| number available | type | GPU memory | Max. Runtime |
|---|---|---|---|
| 3 | A100 | 40 GB | 1 week |
| 1 | A100 | 40 GB | 1 hour |
    :::bash
    mxqsub --gpu ./test.sh    # run on any gpu

Note that we might change the configuration according to user demand.
python virtual environment
create a bash script which
- activates your virtual environment
- runs your python code

and submit that script to the cluster:

    :::bash
    cat > cluster-job
    #! /usr/bin/bash
    . /PATH-TO-VENV/bin/activate
    python whatever.py
    ^D
    chmod +x cluster-job
    mxqsub ./cluster-job