Verify job pids after restart #110

Merged
merged 3 commits into from Sep 17, 2021

@donald donald commented Sep 16, 2021

All running jobs on our cluster were started after the last update, so the "comm" name of every reaper process is now "mxqd reaper" (#104).

We can now do the next step and let the daemon check for the name when it is restarting to avoid confusion if pids are reused by unrelated processes.

Fixes #90.

Also included is a fix for a minor problem that was found while testing the above change and that only applies if mxqd is killed via kill -9.

Add function to get "comm"-value of a process.
When the daemon restarts, it has to figure out which of the jobs that
the database shows as running on the server are in fact still running
and which are gone.

Currently we only check whether the process with the pid from the
database still exists.  However, this can give wrong results if the pid
of a job is reused after a system reboot or after a pid wrap. mxqd might
regard an unrelated process as one of its jobs and nanny and kill it.

Update code to only regard a process as a running mxqd job if its
"comm"-value (/proc/PID/comm) is "mxqd reaper".
mxqd holds a flock lock on /dev/shm/mxqd.HOST.DAEMON.lck to avoid being
run multiple times. The open file and the lock are inherited by forked
children. A child loses them when it does an execve(), because the file
is opened O_CLOEXEC. However, the reaper is a long-running forked child
which doesn't do execve(), so it holds the lock as well.

Usually this isn't a problem, because if mxqd terminates via one of the
signals defined for that purpose, it will unlink the lock file before
terminating. When it is restarted, a new file is generated and the lock
on the new file is not in conflict with locks on the unlinked file.

Only if mxqd is terminated by SIGKILL (which can't be ignored or
handled) it does not clean up the lock file. In that case, trying to
restart the daemon can fail because of the locks held by jobs of the
previous daemon.

Unlock the lock in the reaper after it has been forked.
@donald donald merged commit d417de8 into master Sep 17, 2021
@donald donald deleted the verify-job-pids-after-restart branch October 28, 2022 14:21
Successfully merging this pull request may close these issues.

mxqd recover and pid wrap