Better handle terminating jobs when /var/spool is full #136

pmenzel · 2022-07-18T10:44:47Z

Despite a user cancelling/killing a job with

$ mxqkill -g 507924
WARNING: no active group with group_id=507924 found for user=lo(5421)

the job 40881611 is still shown as running. No process from that user runs on superbia anymore.

2022-07-16 00:57:58 +0200 mxqd[5723]:    job=lo(5421):507924:40881611 :: started. pid=81303
2022-07-16 00:57:58 +0200 mxqd[5723]: Main loop started 30 slots.
2022-07-16 01:09:02 +0200 mxqd[5723]: ERROR: /var/spool/mxqd/main/40881611.stat : parse error (res=-1)
2022-07-16 01:09:22 +0200 mxqd[5723]: ERROR: /var/spool/mxqd/main/40881611.stat : parse error (res=-1)
2022-07-16 01:09:23 +0200 mxqd[5723]: WARNING: killall_over_memory(): Can't find process with pid 81303 in process tree
2022-07-16 01:09:43 +0200 mxqd[5723]: ERROR: /var/spool/mxqd/main/40881611.stat : parse error (res=-1)
2022-07-16 01:09:44 +0200 mxqd[5723]: WARNING: killall_over_memory(): Can't find process with pid 81303 in process tree
2022-07-16 01:10:04 +0200 mxqd[5723]: ERROR: /var/spool/mxqd/main/40881611.stat : parse error (res=-1)
2022-07-16 01:10:04 +0200 mxqd[5723]: WARNING: killall_over_memory(): Can't find process with pid 81303 in process tree
2022-07-16 01:10:24 +0200 mxqd[5723]: ERROR: /var/spool/mxqd/main/40881611.stat : parse error (res=-1)
2022-07-16 01:10:25 +0200 mxqd[5723]: WARNING: killall_over_memory(): Can't find process with pid 81303 in process tree
2022-07-16 01:10:45 +0200 mxqd[5723]: ERROR: /var/spool/mxqd/main/40881611.stat : parse error (res=-1)
2022-07-16 01:10:45 +0200 mxqd[5723]: WARNING: killall_over_memory(): Can't find process with pid 81303 in process tree
2022-07-16 01:11:05 +0200 mxqd[5723]: ERROR: /var/spool/mxqd/main/40881611.stat : parse error (res=-1)
2022-07-16 01:11:06 +0200 mxqd[5723]: WARNING: killall_over_memory(): Can't find process with pid 81303 in process tree
2022-07-16 01:11:26 +0200 mxqd[5723]: ERROR: /var/spool/mxqd/main/40881611.stat : parse error (res=-1)
2022-07-16 01:11:26 +0200 mxqd[5723]: WARNING: killall_over_memory(): Can't find process with pid 81303 in process tree
2022-07-16 01:11:46 +0200 mxqd[5723]: ERROR: /var/spool/mxqd/main/40881611.stat : parse error (res=-1)
2022-07-16 01:11:47 +0200 mxqd[5723]: WARNING: killall_over_memory(): Can't find process with pid 81303 in process tree
2022-07-16 01:12:07 +0200 mxqd[5723]: ERROR: /var/spool/mxqd/main/40881611.stat : parse error (res=-1)
2022-07-16 01:12:08 +0200 mxqd[5723]: WARNING: killall_over_memory(): Can't find process with pid 81303 in process tree
2022-07-16 01:12:28 +0200 mxqd[5723]: ERROR: /var/spool/mxqd/main/40881611.stat : parse error (res=-1)
2022-07-16 01:12:28 +0200 mxqd[5723]: WARNING: killall_over_memory(): Can't find process with pid 81303 in process tree
2022-07-16 01:12:48 +0200 mxqd[5723]: ERROR: /var/spool/mxqd/main/40881611.stat : parse error (res=-1)
2022-07-16 01:12:49 +0200 mxqd[5723]: WARNING: killall_over_memory(): Can't find process with pid 81303 in process tree
2022-07-16 01:13:09 +0200 mxqd[5723]: ERROR: /var/spool/mxqd/main/40881611.stat : parse error (res=-1)
2022-07-16 01:13:09 +0200 mxqd[5723]: WARNING: killall_over_memory(): Can't find process with pid 81303 in process tree
2022-07-16 01:13:29 +0200 mxqd[5723]: ERROR: /var/spool/mxqd/main/40881611.stat : parse error (res=-1)
2022-07-16 01:13:30 +0200 mxqd[5723]: WARNING: killall_over_memory(): Can't find process with pid 81303 in process tree
2022-07-16 01:13:50 +0200 mxqd[5723]: ERROR: /var/spool/mxqd/main/40881611.stat : parse error (res=-1)
2022-07-16 01:13:50 +0200 mxqd[5723]: WARNING: killall_over_memory(): Can't find process with pid 81303 in process tree
2022-07-16 01:14:10 +0200 mxqd[5723]: ERROR: /var/spool/mxqd/main/40881611.stat : parse error (res=-1)
2022-07-16 01:14:11 +0200 mxqd[5723]: WARNING: killall_over_memory(): Can't find process with pid 81303 in process tree
2022-07-16 01:14:31 +0200 mxqd[5723]: ERROR: /var/spool/mxqd/main/40881611.stat : parse error (res=-1)
2022-07-16 01:14:32 +0200 mxqd[5723]: WARNING: killall_over_memory(): Can't find process with pid 81303 in process tree
2022-07-16 01:14:52 +0200 mxqd[5723]: ERROR: /var/spool/mxqd/main/40881611.stat : parse error (res=-1)
2022-07-16 01:14:52 +0200 mxqd[5723]: WARNING: killall_over_memory(): Can't find process with pid 81303 in process tree
2022-07-16 01:15:12 +0200 mxqd[5723]: ERROR: /var/spool/mx


david · 2022-07-18T11:14:00Z

Maybe the disk is out of free space:

This is /project/admin/nachtwaechter/nachtwaechter.pl. There is a new alarm:

     ** Platte /               superbia(68G) : 40k left **

See http://afk.molgen.mpg.de/alarms for current status

donald · 2022-07-19T17:44:49Z

Exactly

root@superbia:/var/spool/mxqd/main# ls -l .
total 0
-rw-rw---- 1 root lo     0 Jul 16 01:09 40881611.stat
-rw-rw---- 1 root spwgrp 0 Jul 16 01:34 40881632.stat
-rw-rw---- 1 root spwgrp 0 Jul 16 01:34 40881633.stat
-rw-r----- 1 root haas   0 Jul 16 11:57 40882326.stat

mxqd is not prepared for empty stat files. Suggestion: remove the empty stat files and restart the daemon (mxqdctl-hostconfig reload). The new daemon should detect that the jobs are gone, and the jobs should go from "running" to "unknown". Maybe the restart is not even required and it would be enough to remove these files. Not sure, and no time currently to check...
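For anyone hitting the same state, the zero-length files can be listed before anything is removed. This is only a minimal Python sketch and not part of mxq: the function name `find_empty_stat_files` is made up for illustration, and the only assumption taken from this thread is that the spool files sit in one directory and end in `.stat`.

```python
from pathlib import Path


def find_empty_stat_files(spool_dir):
    """Return the zero-length *.stat files in spool_dir, sorted by name.

    Hypothetical helper for illustration only; it lists files and never
    deletes anything.
    """
    return sorted(
        p
        for p in Path(spool_dir).glob("*.stat")
        if p.is_file() and p.stat().st_size == 0
    )


if __name__ == "__main__":
    # Directory taken from the listing above; adjust for other setups.
    for path in find_empty_stat_files("/var/spool/mxqd/main"):
        print(path)
```

Printing the candidates first and deleting by hand keeps the cleanup reviewable, which seems safer while it is unclear how the daemon reacts.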

pmenzel · 2022-07-20T09:32:22Z

I removed the files.

@superbia:~$ sudo rm /var/spool/mxqd/main/40881611.stat
@superbia:~$ sudo rm /var/spool/mxqd/main/4088163{2,3}.stat
@superbia:~$ sudo rm /var/spool/mxqd/main/40882326.stat

Let’s see whether mxqd detects that the jobs are gone without a restart.

donald · 2023-12-30T12:39:01Z

Should the reaper wait and retry on write failure?

thomas · 2023-12-30T16:03:42Z

Sorry, but I do not understand what you are talking about ...


donald · 2023-12-31T09:28:46Z

Every running mxq job has a privileged process ("mxq reaper") as its top process. It just reaps the user processes until no more are left and writes the exit status of the main process and the resource usage into a spool file.

The issue reported here was triggered when the root filesystem was full: the reaper consequently produced an empty spool file, and the mxq daemon complained about the unexpected format and wasn't able to finish the jobs.

My suggestion was that the reaper process, if it finds itself unable to write the spool file, just waits and retries forever. In the reported case, that would have helped.
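The wait-and-retry idea could look roughly like the sketch below. It is written in Python for brevity and is not the mxq reaper code. The function name `write_spool_file`, the `.tmp` suffix, the 20 second default delay and the `max_attempts` and `sleep` parameters are all assumptions made for illustration. The sketch also writes to a temporary file and renames it into place, which goes beyond what was suggested above: on a full filesystem, creating a file can succeed while writing its data fails, and the rename keeps such an empty file from ever appearing under the final name.

```python
import os
import time


def write_spool_file(path, payload, retry_delay=20.0, max_attempts=None,
                     sleep=time.sleep):
    """Write payload (bytes) to path, waiting and retrying on failure.

    Hypothetical helper, not mxq code. With max_attempts=None it retries
    forever. Returns the number of attempts that were needed.
    """
    tmp_path = f"{path}.tmp"
    attempt = 0
    while True:
        attempt += 1
        try:
            fd = os.open(tmp_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o660)
            try:
                view = memoryview(payload)
                while view:
                    # os.write may write fewer bytes than requested.
                    written = os.write(fd, view)
                    view = view[written:]
                os.fsync(fd)  # surface delayed write errors before the rename
            finally:
                os.close(fd)
            # Only a completely written file becomes visible under `path`.
            os.rename(tmp_path, path)
            return attempt
        except OSError:
            # Covers ENOSPC as well as any other I/O error.
            if max_attempts is not None and attempt >= max_attempts:
                raise
            sleep(retry_delay)
```

Retrying forever means a job stays in the "running" state until space is freed, which matches the behaviour proposed here. A caller that prefers to give up can pass `max_attempts`.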

donald changed the title ~~ERROR: /var/spool/mxqd/main/40881611.stat : parse error (res=-1)~~ Better handle terminating jobs when /var/spool is full Dec 30, 2023

donald mentioned this issue Jan 1, 2024

next #151

Merged

donald closed this as completed in #151 Feb 17, 2024
