MXQ may wrongly apply RLIMIT_CPU which can cause a job to receive SIGXCPU long before the requested job_time timeout #165

Closed
donald opened this issue Sep 11, 2024 · 1 comment · Fixed by #166

Comments

donald commented Sep 11, 2024

Example: http://afk.molgen.mpg.de/mxq/mxq/job/52186272

Job:

job_status       : FAILED
host_hostname    : freshwatercrocodile.molgen.mpg.de
host_slots       : 9
host_cpu_set     : 87-95
date_end         : 2024-09-10 17:15:59 (7 minutes runtime)

Group:

job_threads     : 1
job_memory      : 32768 MiB
job_time        : 60 minutes

Server:

daemon_slots   : 256
daemon_memory  : 979817

So this job was submitted with --processors 1 (job_threads: 1) but occupied 9 server 'slots' because of its memory request: 32768 / (979817 / 256) = 8.56, which rounds up to 9.
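The slot arithmetic above can be reproduced with a small sketch. The helper name `slots_for_job` and its parameters are hypothetical and not taken from the MXQ source. It assumes that the memory figures share one unit and that the slot count is the larger of the thread count and the rounded-up memory share.

```c
#include <assert.h>

/* Hypothetical helper (not the MXQ implementation): number of server slots
 * a job occupies, assuming slots = max(threads, ceil(memory share)). */
unsigned long slots_for_job(unsigned long job_threads,
                            unsigned long job_memory,
                            unsigned long daemon_memory,
                            unsigned long daemon_slots)
{
    unsigned long mem_slots;

    assert(daemon_memory > 0 && daemon_slots > 0);

    /* ceil(job_memory / (daemon_memory / daemon_slots)) in integer math */
    mem_slots = (job_memory * daemon_slots + daemon_memory - 1) / daemon_memory;

    return mem_slots > job_threads ? mem_slots : job_threads;
}
```

With the values from this job, `slots_for_job(1, 32768, 979817, 256)` yields 9, matching host_slots.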

At

mxq/mxqd.c

Line 1177 in adb6f2f

cpuset_init_job(&job->host_cpu_set, &server->cpu_set_available, &server->cpu_set_running, glist->slots_per_job);

it was granted 9 processors because of the number of slots. However at

mxq/mxqd.c

Line 1022 in adb6f2f

if (group->job_threads == 1) {

the code wrongly assumes that jobs with job_threads == 1 only have one processor and uses setrlimit(RLIMIT_CPU,...) in that case.
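One possible guard, as a minimal sketch and not necessarily what the fix in #166 does, is to apply the CPU-time limit only when the job was also granted a single slot. All function names here are hypothetical. Using job_time plus 5% as both soft and hard limit is an assumption based on the figures in this report.

```c
#include <assert.h>
#include <sys/resource.h>

/* Hypothetical: CPU time only approximates wall-clock time when the job
 * can run on exactly one processor. */
int cpu_limit_is_safe(unsigned job_threads, unsigned slots_per_job)
{
    return job_threads == 1 && slots_per_job == 1;
}

/* Assumption: the limit is job_time plus 5%, expressed in seconds. */
unsigned long job_cpu_limit_seconds(unsigned long job_time_minutes)
{
    assert(job_time_minutes > 0);
    return job_time_minutes * 63; /* 60 s per minute * 1.05 */
}

/* Returns 1 if RLIMIT_CPU was set, 0 if it was skipped, -1 on error. */
int apply_cpu_limit(unsigned job_threads, unsigned slots_per_job,
                    unsigned long job_time_minutes)
{
    struct rlimit rlim;

    if (!cpu_limit_is_safe(job_threads, slots_per_job))
        return 0; /* multi-processor job: rely on the wall-clock timeout */

    rlim.rlim_cur = (rlim_t)job_cpu_limit_seconds(job_time_minutes);
    rlim.rlim_max = rlim.rlim_cur;

    return setrlimit(RLIMIT_CPU, &rlim) == 0 ? 1 : -1;
}
```

For the job in this report, `apply_cpu_limit(1, 9, 60)` skips the limit, because 9 slots were granted even though job_threads is 1.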

The referenced example job ran 256 threads in one of its processes. On its 9 processors it accumulated 63 minutes of CPU time, which is the job_time limit of 60 minutes plus 5%, within 7 minutes of wall-clock time, so the process received a SIGXCPU.
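RLIMIT_CPU counts CPU time summed over all threads of a process, so the wall-clock time until the limit is reached shrinks with the number of busy processors. The sketch below checks the arithmetic for this job. The helper name is hypothetical, and it assumes that all granted processors are fully busy the whole time.

```c
#include <assert.h>

/* Hypothetical helper: wall-clock seconds until a process-wide CPU-time
 * limit is reached, assuming busy_cpus processors run at 100%. */
unsigned long wallclock_seconds_to_cpu_limit(unsigned long cpu_limit_seconds,
                                             unsigned busy_cpus)
{
    assert(busy_cpus > 0);
    return cpu_limit_seconds / busy_cpus;
}
```

For a limit of 63 minutes (3780 seconds) and 9 processors, this gives 420 seconds, the 7 minutes of runtime observed for the job.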

donald commented Sep 11, 2024

Test submitter and job that trigger the problem: https://github.molgen.mpg.de/gist/donald/802910efeba9975ac883c5e6bb26383d
