Better handle terminating jobs when /var/spool is full #136

Closed
pmenzel opened this issue Jul 18, 2022 · 6 comments · Fixed by #151

Comments

Contributor

pmenzel commented Jul 18, 2022

Despite a user cancelling/killing a job with

$ mxqkill -g 507924
WARNING: no active group with group_id=507924 found for user=lo(5421)

the job 40881611 is still shown as running. No process from that user runs on superbia anymore.

2022-07-16 00:57:58 +0200 mxqd[5723]:    job=lo(5421):507924:40881611 :: started. pid=81303
2022-07-16 00:57:58 +0200 mxqd[5723]: Main loop started 30 slots.
2022-07-16 01:09:02 +0200 mxqd[5723]: ERROR: /var/spool/mxqd/main/40881611.stat : parse error (res=-1)
2022-07-16 01:09:22 +0200 mxqd[5723]: ERROR: /var/spool/mxqd/main/40881611.stat : parse error (res=-1)
2022-07-16 01:09:23 +0200 mxqd[5723]: WARNING: killall_over_memory(): Can't find process with pid 81303 in process tree
2022-07-16 01:09:43 +0200 mxqd[5723]: ERROR: /var/spool/mxqd/main/40881611.stat : parse error (res=-1)
2022-07-16 01:09:44 +0200 mxqd[5723]: WARNING: killall_over_memory(): Can't find process with pid 81303 in process tree
2022-07-16 01:10:04 +0200 mxqd[5723]: ERROR: /var/spool/mxqd/main/40881611.stat : parse error (res=-1)
2022-07-16 01:10:04 +0200 mxqd[5723]: WARNING: killall_over_memory(): Can't find process with pid 81303 in process tree
2022-07-16 01:10:24 +0200 mxqd[5723]: ERROR: /var/spool/mxqd/main/40881611.stat : parse error (res=-1)
2022-07-16 01:10:25 +0200 mxqd[5723]: WARNING: killall_over_memory(): Can't find process with pid 81303 in process tree
2022-07-16 01:10:45 +0200 mxqd[5723]: ERROR: /var/spool/mxqd/main/40881611.stat : parse error (res=-1)
2022-07-16 01:10:45 +0200 mxqd[5723]: WARNING: killall_over_memory(): Can't find process with pid 81303 in process tree
2022-07-16 01:11:05 +0200 mxqd[5723]: ERROR: /var/spool/mxqd/main/40881611.stat : parse error (res=-1)
2022-07-16 01:11:06 +0200 mxqd[5723]: WARNING: killall_over_memory(): Can't find process with pid 81303 in process tree
2022-07-16 01:11:26 +0200 mxqd[5723]: ERROR: /var/spool/mxqd/main/40881611.stat : parse error (res=-1)
2022-07-16 01:11:26 +0200 mxqd[5723]: WARNING: killall_over_memory(): Can't find process with pid 81303 in process tree
2022-07-16 01:11:46 +0200 mxqd[5723]: ERROR: /var/spool/mxqd/main/40881611.stat : parse error (res=-1)
2022-07-16 01:11:47 +0200 mxqd[5723]: WARNING: killall_over_memory(): Can't find process with pid 81303 in process tree
2022-07-16 01:12:07 +0200 mxqd[5723]: ERROR: /var/spool/mxqd/main/40881611.stat : parse error (res=-1)
2022-07-16 01:12:08 +0200 mxqd[5723]: WARNING: killall_over_memory(): Can't find process with pid 81303 in process tree
2022-07-16 01:12:28 +0200 mxqd[5723]: ERROR: /var/spool/mxqd/main/40881611.stat : parse error (res=-1)
2022-07-16 01:12:28 +0200 mxqd[5723]: WARNING: killall_over_memory(): Can't find process with pid 81303 in process tree
2022-07-16 01:12:48 +0200 mxqd[5723]: ERROR: /var/spool/mxqd/main/40881611.stat : parse error (res=-1)
2022-07-16 01:12:49 +0200 mxqd[5723]: WARNING: killall_over_memory(): Can't find process with pid 81303 in process tree
2022-07-16 01:13:09 +0200 mxqd[5723]: ERROR: /var/spool/mxqd/main/40881611.stat : parse error (res=-1)
2022-07-16 01:13:09 +0200 mxqd[5723]: WARNING: killall_over_memory(): Can't find process with pid 81303 in process tree
2022-07-16 01:13:29 +0200 mxqd[5723]: ERROR: /var/spool/mxqd/main/40881611.stat : parse error (res=-1)
2022-07-16 01:13:30 +0200 mxqd[5723]: WARNING: killall_over_memory(): Can't find process with pid 81303 in process tree
2022-07-16 01:13:50 +0200 mxqd[5723]: ERROR: /var/spool/mxqd/main/40881611.stat : parse error (res=-1)
2022-07-16 01:13:50 +0200 mxqd[5723]: WARNING: killall_over_memory(): Can't find process with pid 81303 in process tree
2022-07-16 01:14:10 +0200 mxqd[5723]: ERROR: /var/spool/mxqd/main/40881611.stat : parse error (res=-1)
2022-07-16 01:14:11 +0200 mxqd[5723]: WARNING: killall_over_memory(): Can't find process with pid 81303 in process tree
2022-07-16 01:14:31 +0200 mxqd[5723]: ERROR: /var/spool/mxqd/main/40881611.stat : parse error (res=-1)
2022-07-16 01:14:32 +0200 mxqd[5723]: WARNING: killall_over_memory(): Can't find process with pid 81303 in process tree
2022-07-16 01:14:52 +0200 mxqd[5723]: ERROR: /var/spool/mxqd/main/40881611.stat : parse error (res=-1)
2022-07-16 01:14:52 +0200 mxqd[5723]: WARNING: killall_over_memory(): Can't find process with pid 81303 in process tree
2022-07-16 01:15:12 +0200 mxqd[5723]: ERROR: /var/spool/mx

david commented Jul 18, 2022

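Maybe the root filesystem is out of free disk space. The alarm below ("Platte" is German for "disk") reports only 40k left on / of superbia: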

This is /project/admin/nachtwaechter/nachtwaechter.pl. There is a new alarm:

     ** Platte /               superbia(68G) : 40k left **

See http://afk.molgen.mpg.de/alarms for current status

Contributor

donald commented Jul 19, 2022

Exactly

root@superbia:/var/spool/mxqd/main# ls -l .
total 0
-rw-rw---- 1 root lo     0 Jul 16 01:09 40881611.stat
-rw-rw---- 1 root spwgrp 0 Jul 16 01:34 40881632.stat
-rw-rw---- 1 root spwgrp 0 Jul 16 01:34 40881633.stat
-rw-r----- 1 root haas   0 Jul 16 11:57 40882326.stat

mxqd is not prepared for that. Suggestion: remove the empty stat files and restart the daemon (mxqdctl-hostconfig reload). The new daemon should detect that the jobs are gone, and the jobs should go from "running" to "unknown". Maybe the restart is not even required and removing these files would be enough. I'm not sure and currently have no time to check.
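
A minimal sketch of how the empty stat files could be located before removing them, assuming the spool directory from the listing above. The helper name `find_empty_stat_files` is hypothetical and not part of mxq; it only reports files and deletes nothing.

```python
from pathlib import Path


def find_empty_stat_files(spool_dir):
    """Return the sorted paths of zero-length *.stat files in spool_dir."""
    spool = Path(spool_dir)
    return sorted(
        p for p in spool.glob("*.stat")
        if p.is_file() and p.stat().st_size == 0
    )
```

On the affected host this could be called as `find_empty_stat_files("/var/spool/mxqd/main")`, most likely as root given the file permissions shown in the listing.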

Contributor Author

pmenzel commented Jul 20, 2022

I removed the files.

@superbia:~$ sudo rm /var/spool/mxqd/main/40881611.stat
@superbia:~$ sudo rm /var/spool/mxqd/main/4088163{2,3}.stat
@superbia:~$ sudo rm /var/spool/mxqd/main/40882326.stat

Let’s see if mxqd detects that the jobs are gone without a restart.

donald changed the title from "ERROR: /var/spool/mxqd/main/40881611.stat : parse error (res=-1)" to "Better handle terminating jobs when /var/spool is full" on Dec 30, 2023
Contributor

donald commented Dec 30, 2023

Should the reaper wait and retry on write failure?

Contributor

thomas commented Dec 30, 2023 via email

Contributor

donald commented Dec 31, 2023

Every running mxq job has a privileged process (the "mxq reaper") as its top process. It just reaps the user processes until none are left and then writes the exit status of the main process and the resource usage into a spool file.

The issue reported here was triggered when the root filesystem was full: the reaper consequently produced an empty spool file, and the mxq daemon complained about the unexpected format and wasn't able to finish the jobs.

My suggestion was that the reaper process, if it finds itself unable to write the spool file, just waits and retries forever. In the reported case, that would have helped.
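
The wait-and-retry idea can be sketched roughly as follows. This is an illustration in Python, not the actual reaper code. The function name `write_spool_file`, its parameters, and the 20-second default delay are assumptions made for the example. Writing to a temporary file and renaming it into place is an additional technique not mentioned in this thread; it is used here so that a reader of the spool file never sees an empty or partial file.

```python
import os
import time


def write_spool_file(path, data, retry_delay=20.0, max_attempts=None,
                     sleep=time.sleep):
    """Write data to path, waiting and retrying on OSError (e.g. ENOSPC).

    With max_attempts=None the function retries forever.
    Returns the number of attempts that were needed.
    """
    tmp_path = path + ".tmp"  # hypothetical temporary name
    attempt = 0
    while True:
        attempt += 1
        try:
            with open(tmp_path, "w") as f:
                f.write(data)
                f.flush()
                os.fsync(f.fileno())  # surface "disk full" errors here
            os.replace(tmp_path, path)  # atomic rename within one filesystem
            return attempt
        except OSError:
            # Drop a partial temporary file, if one was created.
            try:
                os.unlink(tmp_path)
            except OSError:
                pass
            if max_attempts is not None and attempt >= max_attempts:
                raise
            sleep(retry_delay)
```

A call such as `write_spool_file("/var/spool/mxqd/main/40881611.stat", contents)` would then block until space becomes available, where `contents` stands for whatever the reaper normally writes (the real stat file format is not shown in this thread).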

donald mentioned this issue Jan 1, 2024
4 participants