MXQ cluster scheduler
This is just a brief overview of the MXQ cluster scheduler. Please contact helpdesk@molgen.mpg.de with any questions.
You can also file issues and bug reports at https://github.molgen.mpg.de/mariux64/mxq/issues
General information
The cluster provides a large number of processors and a large amount of memory. It is assembled from various heterogeneous servers. Every server adds a different number of slots to the overall cluster, where every slot represents one (hyperthreaded) core of a processor. The amount of memory per slot a server offers may vary (memory_per_slot = server_total_memory / number_of_slots).
Job submission
Resources
There are four resources that we encourage the user to specify when submitting a job:
- number of processors
- amount of memory
- estimated running time
- size of the per-job temporary disk storage ($TMPDIR)
If nothing is specified, a job runs with
--processors=1 --memory=2G --time=15m --tmpdir=10G
Consult `mxqsub --help` for further information.
Jobs that exceed these specifications are killed by the cluster.
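For illustration, a submission that overrides all four defaults could look like the following (the script name and resource values are placeholders, not recommendations):

```shell
# Hypothetical example: request 4 processors, 16 GiB of memory,
# 4 hours of runtime and 50G of private scratch space.
mxqsub --processors=4 --memory=16G --time=4h --tmpdir=50G ./my_analysis.sh
```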
Jobs that need more resources are likely to be run with an effectively lower priority by the cluster software.
Jobs that run longer than 24 hours are not guaranteed to finish. Since we need to be able to reboot machines occasionally, we might kill jobs that are not scheduled to finish within the next 24 hours. Make sure to implement some kind of checkpointing to be able to resume your jobs.
Output
The standard output and standard error channels of a process can be captured to files using the `--stdout` and `--stderr` options of `mxqsub`. By default, the standard error is redirected to the standard output, which by default is redirected to /dev/null. Thus, using `--stderr` is only useful if you want to redirect those channels to different files.
If standard output or standard error are given as relative paths, they are interpreted relative to the working directory.
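For example, a hypothetical submission that captures the two channels in separate files (script name and log paths are made up) might look like:

```shell
# stdout and stderr go to different files,
# interpreted relative to the working directory
mxqsub --stdout=logs/job.out --stderr=logs/job.err ./my_job.sh
```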
Job execution
Every job is executed in a defined environment. **No environment variables are inherited** from the process that submitted the job.
Output is redirected as specified at submission time. During execution of a job, the output is redirected to temporary files in the same directory; once the job has finished, the output files are renamed into place.
The standard input is redirected from /dev/null.
Execution Environment
Standard environment
- `USER`: username
- `PATH`: set to /sbin:/bin:/usr/sbin:/usr/bin:/usr/local/sbin:/usr/local/bin:/usr/local/package/bin
- `PWD`: the working directory
- `HOME`: the user's home directory
- `SHELL`: the user's login shell
- `TMPDIR`: directory for temporary files
mxq special environment variables
- `MXQ_JOBID`: numeric job id
- `MXQ_THREADS`: number of processors requested for this job
- `MXQ_SLOTS`: number of slots this job occupies on this host
- `MXQ_MEMORY`: amount of memory reserved for this job (in MiB)
- `MXQ_TIME`: number of minutes this job is supposed to run
- `MXQ_HOSTID`: host identifier (can be used to file bug reports)
- `MXQ_JOB_TMPDIR`: same as TMPDIR
compatibility environment variables
- `HOSTNAME`: hostname
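A job script can use these variables directly. The following sketch falls back to local values when the MXQ_* variables are unset (i.e. when the script is run outside the cluster), so that it can also be tried interactively:

```shell
#! /usr/bin/bash
# Fall back to local values when not running under MXQ
# (the MXQ_* variables are only set inside a cluster job).
jobid="${MXQ_JOBID:-local}"
scratch="${MXQ_JOB_TMPDIR:-$(mktemp -d)}"

echo "job $jobid running on ${HOSTNAME:-$(hostname)}"
: > "$scratch/work.tmp"    # create a temporary work file in the scratch space
echo "temporary data in $scratch"
```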
Kill groups
Groups can be cancelled/killed at any time.
mxqkill --group-id <GROUP_ID>
Running jobs are killed with a SIGINT signal, which can take several seconds depending on the state of the server. It might just be too lazy to stop working.
List jobs
Please use mxqdump to dump information about running or historical groups or jobs.
mxqdump
mxqdump --group-id <GROUP-ID>
mxqdump --groups [...]
mxqdump --job-id <JOB-ID>
mxqdump --jobs [...]
mxqdump --help
See what's going on
There is also a small web service to show the cluster state: http://afk.molgen.mpg.de/mxq/mxq/
Best Practice
Heavy IO
Your jobs will be started with an individual preallocated temporary filesystem. Use the `mxqsub` option `--tmpdir` to change the capacity from the default (10G).
The environment variables MXQ_JOB_TMPDIR and TMPDIR both point to that filesystem. TMPDIR is used by some tools automatically. Use `$MXQ_JOB_TMPDIR` or `$TMPDIR` from your scripts for temporary files.
Advantages:
- The temporary space is on a local disk.
- The requested space is preallocated. When your job starts, it is guaranteed to have the requested space at its disposal (no "disk full" because of other users' bad behavior). Your job will only be started on a node which has enough free local scratch space.
- The space available to your job is limited to the requested size. There is no risk that your job accidentally fills up a local disk.
- You don't need to care about cleanup. The MXQ daemon will clean up the space after your job has completed.
- You don't need to restrict yourself to queueing just a few jobs. Submit as many jobs as you like - the nodes will only run as many jobs in parallel as the free disk space allows.
- The directory is a job-private directory which is empty when your job starts. There will be no conflicts with other jobs of the same group or name clashes with files from other users in a shared scratch directory.
The requested tmpdir size is shown in the "Active Groups" table and on the group details pages at http://afk.molgen.mpg.de/mxq/mxq/groups. The daemon doesn't track how much disk space your job actually used. If you want to know that, just put a `df -h $MXQ_JOB_TMPDIR` at the end of your script.
**How to be nice to the NFS servers**
On the cluster, it is allowed to access the global NFS namespace (/project/...). This is convenient for the users of the cluster, but often results in problems for everybody: cluster jobs can easily slow down or completely bring down a fileserver if they are doing a lot of I/O in parallel. With MXQ_JOB_TMPDIR there are two things you can do to avoid the problem:
- **Avoid temporary files on the NFS server**

  Often users put pipelines into the cluster which not only read input files and write the final output files, but also produce a lot of intermediate data. This data may live in the current working directory or in the same directories as the input and output files, so these files most likely end up on a remote fileserver. This can be addressed by the following procedure:

  - copy your input files from their directory to MXQ_JOB_TMPDIR
  - cd into MXQ_JOB_TMPDIR
  - run your program
  - copy your output files to their final directory.
So if your script, which you submit to the cluster, looks like this:

    :::bash
    #! /usr/bin/bash
    /project/somewhere/some-strange-pipeline

and `some-strange-pipeline` reads, for example, fasta files from a subdirectory `fasta`, does a lot of processing and writes `analyze.txt` at the end, you could change that to:

    :::bash
    #! /usr/bin/bash
    set -ve
    ORIGDIR="$(/bin/pwd)"
    cp -a fasta $MXQ_JOB_TMPDIR/
    cd $MXQ_JOB_TMPDIR/
    /project/somewhere/some-strange-pipeline
    cp analyze.txt "$ORIGDIR/"
    df -h $MXQ_JOB_TMPDIR
You might need some trial-and-error attempts to identify all required input and output files, but it is worth it. Note for the above code: you don't need to quote `$MXQ_JOB_TMPDIR`, because it won't contain shell metacharacters. You only need to quote `$(/bin/pwd)` and `$ORIGDIR` if your directory may contain shell metacharacters (e.g. spaces).

- **Avoid parallel copies**
The above approach is good, but if you submit this script multiple times, multiple copies of it might run in parallel and still do parallel I/O to the fileserver when reading the input files or writing the output files.

A possible solution is to use a lock to make sure that only a single job does a copy at any time. The lock can be based on any file, but all jobs should use the same file, even when started from different directories, so use an absolute path name for it. Transform the script to:

    :::bash
    #! /usr/bin/bash
    set -ve
    ORIGDIR="$(/bin/pwd)"
    flock /project/somewhere/.my_io_lock -c '
        cp -a fasta $MXQ_JOB_TMPDIR/
    '
    cd $MXQ_JOB_TMPDIR/
    /project/somewhere/some-strange-pipeline
    ORIGDIR="$ORIGDIR" flock /project/somewhere/.my_io_lock -c '
        cp analyze.txt "$ORIGDIR/"
    '
    df -h $MXQ_JOB_TMPDIR
The `ORIGDIR="$ORIGDIR"` before the second `flock` sets the environment variable ORIGDIR for that command to the value of the shell variable ORIGDIR, so that the variable can be used in the subshell, because the subshell creates a shell variable for each environment variable when it starts. The `-c` argument is a single-quoted string which spans multiple lines. It could contain more commands than just a single `cp`.
Checkpointing your programs
Software that runs for several hours, days, weeks or even months should support checkpointing, because execution might fail and the calculation should then be able to continue instead of being forced to restart from scratch.
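The idea can be sketched in a few lines of shell; the checkpoint file name and the work loop below are made up for illustration. On a restart, the loop continues from the last recorded step instead of beginning again at step 1:

```shell
#! /usr/bin/bash
# Illustrative checkpointing sketch: resume from checkpoint.txt if present.
ckpt="checkpoint.txt"
start=1
[ -f "$ckpt" ] && start=$(( $(cat "$ckpt") + 1 ))

for (( step = start; step <= 10; step++ )); do
    # ... one unit of real work would happen here ...
    echo "$step" > "$ckpt.tmp"    # write the checkpoint to a temporary file
    mv "$ckpt.tmp" "$ckpt"        # and rename it into place once complete
done
echo "finished at step $(cat "$ckpt")"
```

Writing the checkpoint to a temporary file and renaming it ensures a kill in mid-write never leaves a truncated checkpoint behind.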
Very fast jobs
Do not queue very fast-running jobs without grouping them. A job should run for at least 1 to 5 minutes. Otherwise, the overhead of managing the jobs by the cluster servers will slow down the overall performance.
Thus, if you have jobs that run under 60 seconds, you should group them in a shell script to be executed in batches. 10'000 jobs each running 600 seconds are preferable to 100'000 jobs running 60 seconds each.
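One way to do that is a small wrapper script that processes a whole batch of inputs in a single cluster job; the function and input names below are placeholders for the real per-item work:

```shell
#! /usr/bin/bash
# Hypothetical batch wrapper: run many short tasks inside one job.
process_one() {
    # stand-in for the real work done per input item
    echo "processed $1"
}

# the batch would normally be passed as arguments to the script;
# here we set an example batch of three hypothetical inputs:
set -- input001 input002 input003
for item in "$@"; do
    process_one "$item"
done
```

Such a wrapper could then be submitted once per batch instead of once per item, turning many sub-minute tasks into one job of reasonable length.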
Support automatic grouping
Start a sane amount of different groups - a group for every job is insane by definition.
- use `mxqsub --group-name NAME` to specify a group name for the jobs you're submitting
- use `mxqsub --command-alias ALIAS` to merge all jobs started by a shell wrapper or script interpreter into one group
  - e.g.: for `perl SCRIPTNAME`, set ALIAS to SCRIPTNAME
  - e.g.: for `wrapperX.sh`, set ALIAS to the name of the program the wrapper is executing
  - e.g.: for /path/to/a/PROGRAM and /path/to/b/PROGRAM, set ALIAS to PROGRAM
File content
You can test whether a file exists. You cannot test whether a file that exists is complete. Copying tebibytes of data takes time. Therefore: use temporary files (in the same directory or at least on the same filesystem) and rename (`mv`, `ln`) them once the content has been written completely.
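That pattern is just a few lines of shell; the file names are placeholders:

```shell
# Write to a temporary file on the same filesystem first, then rename it
# into place. The rename is atomic, so other processes either see no file
# or the complete file, never a half-written one.
echo "final content" > result.txt.tmp
mv result.txt.tmp result.txt
```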
GPU usage
By adding `--gpu` to your `mxqsub` command, you can request a GPU for your job. The GPUs currently available in the cluster are shown in the following table. Currently there are only NVIDIA (CUDA) GPUs. On demand, the A100 GPUs can be split into multiple instances using NVIDIA MIG technology.
| number available | type | GPU memory | Max. Runtime |
|---|---|---|---|
| 3 | A100 | 40 GB | 1 week |
| 1 | A100 | 40 GB | 1 hour |
    :::bash
    mxqsub --gpu ./test.sh    # run on any gpu

Note that we might change the configuration according to user demand.
python virtual environment
create a bash script which
- activates your virtual environment
- runs your python code

and submit that script to the cluster:

    :::bash
    cat > cluster-job
    #! /usr/bin/bash
    . /PATH-TO-VENV/bin/activate
    python whatever.py
    ^D
    chmod +x cluster-job
    mxqsub ./cluster-job