mxqd recover and pid wrap #90

Closed
donald opened this issue Aug 12, 2020 · 5 comments · Fixed by #110

Comments

@donald
Contributor

donald commented Aug 12, 2020

We've seen this:

  • mxqd starts job with reaper pid 1042
  • someone reboots the system. Pid 1042 is killed with -9 during reboot, so no spool file is created
  • system boots and starts an nfsd as pid 1042
  • mxqd is started, doesn't find a spool file for the job, but finds pid 1042 alive, so it assumes the job is still running.
  • job times out and mxqd tries (unsuccessfully) to kill the nfsd.

mxqd could notice the reboot and declare all pids as dead. But similar events could happen without reboot if pids wrap:

  • mxqd is stopped by admin
  • user job is killed by admin with kill -9
  • pids wrap, some new process is started with the job's pid
  • mxqd is started by admin

So we might consider validating the processes further.
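
For illustration, one possible way to validate a pid is to record a fingerprint when the job is started and compare it on recovery. This is only a minimal sketch, not what mxqd does: it assumes Linux, and it assumes that the kernel boot id (`/proc/sys/kernel/random/boot_id`) plus the process start time (field 22 of `/proc/[pid]/stat`) are good enough as an identity. All function names are hypothetical.

```python
BOOT_ID_PATH = "/proc/sys/kernel/random/boot_id"


def read_boot_id():
    """Return the kernel's boot id, which changes on every reboot."""
    with open(BOOT_ID_PATH) as f:
        return f.read().strip()


def parse_starttime(stat_line):
    """Extract field 22 (starttime, in clock ticks since boot) from the
    contents of /proc/[pid]/stat."""
    # Field 2 (comm) is wrapped in parentheses and may itself contain
    # spaces or ')', so split only after the last ')'.
    rest = stat_line[stat_line.rindex(")") + 2:].split()
    # rest[0] is field 3 (state), so field 22 is rest[19].
    return int(rest[19])


def current_fingerprint(pid):
    """Return (boot_id, starttime) for a live pid, or None if it is gone."""
    try:
        with open(f"/proc/{pid}/stat") as f:
            stat_line = f.read()
    except OSError:
        return None
    return (read_boot_id(), parse_starttime(stat_line))


def is_same_process(recorded, current):
    """True only if the pid is alive and both boot id and start time match."""
    return current is not None and recorded == current
```

On recovery, a daemon could call `is_same_process(recorded, current_fingerprint(pid))`. A differing boot id covers the reboot case, and a differing start time covers the pid wrap case, since a new process that reuses pid 1042 gets its own start time.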

@donald
Contributor Author

donald commented Aug 21, 2021

What is a solid way to identify whether a certain process is an mxqd reaper or not? We can't use "same program file as me" because we support upgrading and restarting the daemon without killing the jobs.

@wwwutz
Contributor

wwwutz commented Aug 21, 2021

> What is a solid way to identify whether a certain process is an mxqd reaper or not?

rewrite $0 ?

@donald
Contributor Author

donald commented Aug 21, 2021

Would be possible. Potential caveat: if $0 is overwritten in place, it can only get shorter, not longer. And "./mxqd" is just 6 characters.

Maybe use comm (/proc/[pid]/task/[tid]/comm). The thread name buffer is 16 bytes (15 characters plus the terminating NUL), so we could always stick some readable magic value in there like "mxq reaper".
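
A rough sketch of the comm idea, assuming Linux and using the "mxq reaper" value from above. The helper names are hypothetical, and this is not the actual mxqd code. Note that any process may set its own comm, so a match is a strong hint rather than proof, but an unrelated process such as nfsd would not match.

```python
REAPER_COMM = "mxq reaper"  # magic value suggested above; must fit in 15 bytes


def set_own_comm(name=REAPER_COMM):
    """Set the comm of the calling process's main thread.

    Writing /proc/self/comm has a similar effect to prctl(PR_SET_NAME)
    called from the main thread; the kernel truncates names over 15 bytes.
    """
    with open("/proc/self/comm", "w") as f:
        f.write(name)


def read_comm(pid):
    """Return the comm of a pid's main thread, or None if the pid is gone."""
    try:
        with open(f"/proc/{pid}/comm") as f:
            return f.read().rstrip("\n")
    except OSError:
        return None


def looks_like_reaper(comm):
    """True if a comm value read from /proc equals the magic reaper name."""
    return comm == REAPER_COMM
```

A reaper would call `set_own_comm()` right after it is forked, and a restarted daemon would check `looks_like_reaper(read_comm(pid))` before assuming that a live pid still belongs to one of its jobs.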

@pmenzel
Contributor

pmenzel commented Aug 21, 2021

Could you send it a signal (USR1, …), and get a certain reaction/response?

@donald
Contributor Author

donald commented Aug 21, 2021

> Could you send it a signal (USR1, …), and get a certain reaction/response?

Like not dying? :-)

3 participants