Verify job pids after restart #110

Merged 3 commits on Sep 17, 2021

Commits on Sep 16, 2021

  1. mx_proc: Add mx_proc_get_comm

    Add a function to get the "comm" value of a process.
    donald committed Sep 16, 2021
    89aca1f
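
As a rough illustration of what reading a "comm" value involves, the sketch below reads /proc/PID/comm on Linux. The function name `read_proc_comm` and its signature are hypothetical and chosen for this example; the actual `mx_proc_get_comm` added by the commit may look different.

```c
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper (not the real mx_proc_get_comm): copy the "comm"
 * value of process `pid` into `buf`, without the trailing newline.
 * Returns 0 on success, -1 if the process does not exist or the file
 * cannot be read. */
static int read_proc_comm(pid_t pid, char *buf, size_t size)
{
    char path[64];
    FILE *f;

    if (size == 0)
        return -1;

    snprintf(path, sizeof(path), "/proc/%ld/comm", (long)pid);
    f = fopen(path, "r");
    if (f == NULL)
        return -1;

    if (fgets(buf, (int)size, f) == NULL) {
        fclose(f);
        return -1;
    }
    fclose(f);

    /* The kernel terminates the value with a newline; strip it. */
    buf[strcspn(buf, "\n")] = '\0';
    return 0;
}
```

A caller would pass a pid and a small buffer; the kernel limits the value to 15 characters, so 16 bytes suffice for the name itself.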
  2. mxqd: Verify names of reaper processes on restart

    When the daemon restarts, it has to figure out which of the jobs that
    the database shows as running on the server are in fact still running
    and which are gone.
    
    Currently we only check whether the process with the pid from the
    database still exists.  However, this can give wrong results if the pid
    of a job is reused after a system reboot or after a pid wrap. mxqd might
    regard an unrelated process as one of its jobs, nanny it and kill it.
    
    Update the code to only regard a process as a running mxqd job if its
    "comm" value (/proc/PID/comm) is "mxqd reaper".
    donald committed Sep 16, 2021
    f218ec2
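
The check described in this commit could be sketched as follows. The names `is_mxqd_reaper` and `set_reaper_name` are hypothetical, and setting the name via `prctl(PR_SET_NAME)` is an assumption made so the example is self-contained; the commit message only states that the expected "comm" value is "mxqd reaper".

```c
#include <stdio.h>
#include <string.h>
#include <sys/prctl.h>
#include <sys/types.h>
#include <unistd.h>

#define REAPER_COMM "mxqd reaper"

/* Hypothetical helper: give the calling process the reaper name.
 * Assumption: the name is set with prctl(PR_SET_NAME); how mxqd itself
 * sets it is not shown here. Returns 0 on success, -1 on error. */
static int set_reaper_name(void)
{
    return prctl(PR_SET_NAME, (unsigned long)REAPER_COMM, 0UL, 0UL, 0UL);
}

/* Hypothetical check: return 1 if `pid` exists and its "comm" value is
 * "mxqd reaper", 0 otherwise (including when the process is gone). */
static int is_mxqd_reaper(pid_t pid)
{
    char path[64];
    char comm[32];
    FILE *f;

    snprintf(path, sizeof(path), "/proc/%ld/comm", (long)pid);
    f = fopen(path, "r");
    if (f == NULL)
        return 0;               /* no such process: job is gone */

    if (fgets(comm, sizeof(comm), f) == NULL) {
        fclose(f);
        return 0;
    }
    fclose(f);

    comm[strcspn(comm, "\n")] = '\0';
    /* A reused pid belonging to an unrelated process fails here. */
    return strcmp(comm, REAPER_COMM) == 0;
}
```

With such a check, a pid from the database that now belongs to an unrelated process is treated the same way as a pid that no longer exists.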
  3. mxqd: Release mxqd lock in reaper

    mxqd holds a flock lock on /dev/shm/mxqd.HOST.DAEMON.lck to avoid being
    run multiple times. The open file and the lock are inherited by forked
    children. They are lost when the child does an execve(), because the
    file is opened O_CLOEXEC. However, the reaper is a long-running forked
    child which doesn't do execve(), so it holds the lock as well.
    
    Usually this isn't a problem, because if mxqd terminates via one of the
    signals defined for that purpose, it will unlink the lock file before
    terminating. When it is restarted, a new file is generated and the lock
    on the new file is not in conflict with locks on the unlinked file.
    
    Only if mxqd is terminated by SIGKILL (which can't be ignored or
    handled) does it fail to clean up the lock file. In that case, trying
    to restart the daemon can fail because of the locks held by jobs of
    the previous daemon.
    
    Unlock the lock in the reaper after it has been forked.
    donald committed Sep 16, 2021
    035059c