MXQ may wrongly apply RLIMIT_CPU which can cause a job to receive SIGXCPU long before the requested job_time timeout #165

donald · 2024-09-11T12:14:44Z

Example: http://afk.molgen.mpg.de/mxq/mxq/job/52186272

Job:

job_status       : FAILED
host_hostname    : freshwatercrocodile.molgen.mpg.de
host_slots       : 9
host_cpu_set     : 87-95
date_end         : 2024-09-10 17:15:59 (7 minutes runtime)

Group:

job_threads     : 1
job_memory      : 32768 MiB
job_time        : 60 minutes

Server:

daemon_slots   : 256
daemon_memory  : 979817

So this job was submitted with --processors 1 (job_threads: 1) but used 9 server 'slots' because of the memory constraints (32768 / (979817 / 256) = 8.56, rounded up to 9).
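The slot arithmetic above can be sketched as follows. This is a minimal illustration, not code from mxqd.c: the function name `slots_for_job` and its parameters are hypothetical, and it assumes that job_memory and daemon_memory use the same unit (MiB) and that the slot count is the larger of the thread count and the rounded-up memory share.

```c
#include <stdint.h>

/*
 * Hypothetical helper (not taken from mxqd.c): number of server slots a job
 * occupies. A slot represents daemon_memory / daemon_slots of memory, so a
 * job needs ceil(job_memory / memory_per_slot) slots, but never fewer than
 * the number of threads it requested.
 */
uint64_t slots_for_job(uint64_t job_threads, uint64_t job_memory,
                       uint64_t daemon_slots, uint64_t daemon_memory)
{
    /* ceil(job_memory * daemon_slots / daemon_memory) in integer arithmetic */
    uint64_t by_memory =
        (job_memory * daemon_slots + daemon_memory - 1) / daemon_memory;

    return by_memory > job_threads ? by_memory : job_threads;
}
```

With the values from this job, `slots_for_job(1, 32768, 256, 979817)` yields 9, matching host_slots.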

At mxq/mxqd.c, line 1177 in adb6f2f,

    cpuset_init_job(&job->host_cpu_set, &server->cpu_set_available, &server->cpu_set_running, glist->slots_per_job);

it was granted 9 processors because of the number of slots. However at

mxq/mxqd.c, line 1022 in adb6f2f,

    if (group->job_threads == 1) {

the code wrongly assumes that jobs with job_threads == 1 only have one processor and uses setrlimit(RLIMIT_CPU,...) in that case.
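To illustrate the mismatch, here is a minimal sketch of a guard keyed on the number of processors actually granted instead of on job_threads. The function names and the `granted_cpus` parameter are hypothetical and do not come from mxqd.c; the 5% grace period is an assumption derived from the 63 minute figure in this report. Note that the fix referenced in this thread (#166) takes a different route and, per its title, removes the RLIMIT_CPU usage altogether.

```c
#include <sys/resource.h>

/* job_time in minutes plus an assumed 5% grace period, in seconds. */
unsigned long cpu_limit_seconds(unsigned long job_time_minutes)
{
    unsigned long seconds = job_time_minutes * 60;

    return seconds + seconds / 20;
}

/*
 * Hypothetical guard: RLIMIT_CPU counts CPU time summed over all threads of
 * a process, so it only approximates a wall-clock limit when the job can run
 * on exactly one processor.
 * Returns 1 if the limit was applied, 0 if it was skipped, -1 on error.
 */
int limit_cpu_time_if_single_cpu(int granted_cpus,
                                 unsigned long job_time_minutes)
{
    struct rlimit rl;

    if (granted_cpus != 1)
        return 0; /* more than one CPU: CPU time can outrun wall-clock time */

    if (getrlimit(RLIMIT_CPU, &rl) != 0)
        return -1;

    /* Only lower the soft limit and never exceed the existing hard limit. */
    rl.rlim_cur = cpu_limit_seconds(job_time_minutes);
    if (rl.rlim_max != RLIM_INFINITY && rl.rlim_cur > rl.rlim_max)
        rl.rlim_cur = rl.rlim_max;

    return setrlimit(RLIMIT_CPU, &rl) == 0 ? 1 : -1;
}
```

For the job in this report, `limit_cpu_time_if_single_cpu(9, 60)` would skip the limit, because the cpuset holds 9 processors even though job_threads is 1.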

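The referenced example job used 256 threads in one of its processes. Because RLIMIT_CPU counts the CPU time of all threads of a process, the 9 processors were able to use up the CPU time limit of 63 minutes (job_time of 60 minutes + 5%) within 7 minutes of runtime, so the process received a SIGXCPU.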
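The timing can be checked with a small calculation. This is an idealized sketch that assumes all granted processors are kept fully busy by threads of a single process; the function name is hypothetical.

```c
/*
 * Wall-clock seconds until a process reaches its RLIMIT_CPU soft limit when
 * its threads keep busy_cpus processors fully loaded. CPU time accumulates
 * busy_cpus times faster than wall-clock time in that case.
 */
unsigned long wall_seconds_until_sigxcpu(unsigned long cpu_limit_sec,
                                         unsigned long busy_cpus)
{
    return cpu_limit_sec / busy_cpus;
}
```

For a limit of 63 minutes (3780 seconds) and 9 busy processors this gives 420 seconds, which is the 7 minutes of runtime reported for the job. With a single processor the same limit would only be reached after the full 63 minutes.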


donald · 2024-09-11T13:57:16Z

Test submitter and job which trigger the problem: https://github.molgen.mpg.de/gist/donald/802910efeba9975ac883c5e6bb26383d

donald mentioned this issue Sep 11, 2024

mxqd: Remove RLIMIT_CPU usage for runtime limitation #166

Merged

donald closed this as completed in #166 Sep 11, 2024
