Add reaper process to handle job execution and reaping of childs #30
Comments
If we keep both alternatives for detecting finished user jobs, the reaper would have to reflect the exit status of the user process, including signals. This is possible but ugly. |
It's not that ugly (reset all signal handlers and ...), but true. So the main process should only reap the pids, and should only take action if no file is available, while in the normal case ... |
In that case: if a reaper terminates without leaving a finished job file, it would always be an error, and a rather serious and unexpected one, like a full spool file system (which the reaper could handle by waiting) or, of course, a coding error or hardware error. In that case it might be an idea for mxqd to terminate, or at least to stop starting new jobs, to prevent more jobs from going down the same sink. |
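A minimal sketch of the parent-side rule described in this comment, written in Python for brevity (mxqd itself is C). The spool file naming (`<pid>.stat`), the `ServerState` class, and `handle_reaper_exit` are hypothetical names for illustration and are not taken from mxqd.

```python
import os


class ServerState:
    """Hypothetical stand-in for the daemon's bookkeeping."""

    def __init__(self, spool_dir):
        self.spool_dir = spool_dir
        self.accept_new_jobs = True


def handle_reaper_exit(state, reaper_pid):
    """Reap one reaper and decide what the daemon does next.

    Returns "finished" if the reaper left a finished-job file and
    "lost" otherwise. The reaper's own wait status is deliberately
    not interpreted.
    """
    os.waitpid(reaper_pid, 0)  # only collect the zombie
    # Assumed naming scheme: one spool file per reaper pid.
    path = os.path.join(state.spool_dir, f"{reaper_pid}.stat")
    if os.path.exists(path):
        return "finished"
    # No file: unexpected and serious, so stop starting new jobs.
    state.accept_new_jobs = False
    return "lost"
```

Whether the daemon should terminate outright or only stop starting jobs is left open in the thread; the sketch picks the milder option.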
So the reaper should keep trying to write the file forever, if the error is recoverable? ;) |
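The reaper side of this idea could look roughly like the following Python sketch (mxqd itself is C). It assumes that "recoverable" means `ENOSPC` only, that the status is stored as JSON, and that the file is written via a temporary name and renamed; all function names, the file format, and the retry delay are hypothetical.

```python
import errno
import json
import os
import subprocess
import time


def describe_status(returncode):
    """Turn a subprocess return code into an explicit record.

    subprocess reports "killed by signal N" as return code -N.
    """
    if returncode < 0:
        return {"exited": False, "signal": -returncode}
    return {"exited": True, "exit_code": returncode}


def write_atomic(path, text):
    """Write to a temporary file, then rename it into place."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        f.write(text)
        f.flush()
        os.fsync(f.fileno())
    os.rename(tmp, path)


def write_with_retry(path, text, delay=5.0, write=write_atomic, sleep=time.sleep):
    """Retry forever while the spool is full; re-raise anything else."""
    while True:
        try:
            write(path, text)
            return
        except OSError as e:
            if e.errno != errno.ENOSPC:
                raise
            sleep(delay)


def reaper(argv, stat_path):
    """Run the user process, wait for it, and record how it ended."""
    returncode = subprocess.Popen(argv).wait()
    write_with_retry(stat_path, json.dumps(describe_status(returncode)))
```

Because the file carries both exit code and signal, the reaper itself does not need to mimic the user process's exit status.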
Because "terminate daemon but leave jobs running" is now a valid option, what meaning do we want signals to mxqd to have? a) exit (leave children running) SIGINT for a, SIGTERM for c, we don't need b? |
What do you think? Tick your choice ;))))) hehehe |
I just remembered and checked again and found that
So the only new signal to handle is |
Hmmm. I don't like the idea of a "signal escalation order". This isn't defined anywhere and is not really useful.

We have a more or less defined meaning for "TERM" in signal(7), which is the termination signal: the process should clean up and exit. IMO in our case it should leave the user jobs running, because with the reaper that's a perfectly legal state and there is no reason to wait for the processes to finish. A system shutdown would send signals to the user processes anyway.

"QUIT" and "INT" are defined as keyboard signals. If I have a daemon running in the foreground on a terminal, I'd expect ^C to terminate it (the same as TERM in the above paragraph). A detached daemon has no terminal, so the signals have no predefined meaning in that context. We could reuse the signals for other purposes if we run out of SIGUSR1, SIGUSR2.

What do we want to ask the daemon besides "Terminate"?

We might want a way to ask the daemon to stop accepting new jobs. But we could get to the same state by killing the daemon and restarting it with -m 1 or whatever. So this would just be a convenience.

We might want a way to ask the daemon to kill all running jobs. But we could do the same manually. So again, this would just be a convenience.

We might want a way to ask the daemon to exit when the jobs are finished and their cleanup work is done, which only makes sense when we don't accept jobs while waiting. On the other hand, we could leave the daemon running, because it won't do anything when it is not accepting new jobs or finishing old ones. So a third time, just convenience.

Hmmm.... |
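As an illustration of the position argued in this comment (TERM and ^C both mean "clean up and exit, leave the user jobs running"), here is a small Python sketch; mxqd itself is C. The `Daemon` class, its `children` table, and `shutdown` are hypothetical and only show that termination touches the daemon's own state, not the jobs.

```python
import os
import signal


class Daemon:
    def __init__(self):
        self.exit_requested = False
        self.children = {}  # pid -> job id; hypothetical bookkeeping

    def _on_terminate(self, signum, frame):
        # Only record the request. No child is signalled or waited for.
        self.exit_requested = True

    def install_handlers(self):
        # TERM and INT (^C in the foreground) get the same meaning.
        for sig in (signal.SIGTERM, signal.SIGINT):
            signal.signal(sig, self._on_terminate)

    def shutdown(self):
        """Drop daemon-owned state and return the pids left running."""
        left_running = sorted(self.children)
        self.children.clear()
        return left_running
```

A main loop would check `exit_requested` once per iteration and call `shutdown()` when it is set; the reapers keep running and a later daemon instance can pick their results up from the spool.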
it does not matter how a signal is sent because there is no difference in sending And for Stopping a daemon and restarting it with different (less) resources ( But after all: The only new action is to quit the daemon immediately and leave all reapers running. There is no need to change or remove existing behavior and signal actions. |
see mariux64#30
* donald/reaper:
  mxqd: reaper: ignore signals from mxqd
  mxqd: set cpu_set_running in group_add_job
  database: store and retrieve cpuset of job
  mxq_job: add a string version of host_cpu_set
  mxq_job: refactor (add do_jobs_statement)
  mxqd: do not finish jobs from signals when we have reaper output
  mxqd: better loglevels for killall_over_time
  mxqd: remove unused member
  mxqd: let reaper call setsid instread of user process
  mxqd: add job_is_lost
  mxqd: do not kill children in catchall
  mxq_job: add mxq_set_job_status_unknown
  mxqd: add SIGQUIT processing : do not kill or wait for children
  mxqd: let recover_from_previous_crash rebuild state for previous jobs
  mxqd: refactor (add reset_signals)
  mxq_job: add mxq_load_jobs_running_on_server
  mxqd: stop recover_from_previous_crash from deleting running jobs
  mxqd: add reaper
  mxqd: add help functions for fspool (finished job spool directory)
  mxqd: create MXQ_FINISHED_JOBSDIR on startup
  make: add FINISHED_JOBSDIR
  mx_util: add mx_mkdir_p
  mxqd: refactor (add job_has_finished)
  mxqd: refactor (add user_process)
implements parts of mariux64#30
* mariux/issues/issue30:
  mxqd: free structures to remove leftover memory in reaper process
  mxqd: be a bit more verbose when starting processes to log pids
  mxq_job: Minor cleanup
  mx_flock: export mx_flock_free() to free without releasing lock
  mxqd: Fix memory leak for host_cpu_set_str
  mxqd: Fix kill signals: send kill to pgrp instead of reaper pid
  mxqd: Cleanup reaper_process()
  mxqd: Cleanup user_process()
  mxqd: Cleanup init_child_process()
  mxqd: Cleanup job_has_finished() and job_lost()
  mxqd: Remove fspool_unlink()
  test_mxqd_control: Init server structure
  mxqd: Fix fspool_process_file()
  mxqd: Fix and rename server_reload_running() to load_running_jobs()
  mxqd: Rename load_groups() to load_running_groups()
  mxqd_control: Refactor and export server structure management
  mxqd: Cleanup start_job()
  mxqd: Cleanup server_close()
  mxqd: Cleanup server_dump()
  mxqd: Rename server_find_user() to server_find_user_by_uid()
  mxqd: Rename lost_scan_one()
  mxqd: Cleanup server_reload_running()
  mxqd: Cleanup catchall()
  mxqd: Cleanup load_groups()
  mxqd: Rename server_find_group() to server_get_group_list_by_group_id()
  mxqd: Rename server_find_job() to server_get_job_list_by_job_id()
  mxqd: Rename server_remove_job() to job_list_remove_self()
  mxqd: Rename server_find_job_by_pid() to server_get_job_list_by_pid()
  mxqd: Rename server_remove_job_by_pid() to server_remove_job_list_by_pid()
  mxqd: Rename killallcancelled() to killall_cancelled()
  mxqd: Cleanup killall()
  mxqd: Cleanup killall_over_time()
  mxqd: Cleanup start_users()
  mxqd: Cleanup start_user()
  mxqd: Rename remove_orphaned_groups() to remove_orphaned_group_lists()
  mxqd: Rename group_list_find_group() to _group_list_find_by_group()
  mxqd: Rename group_add_job() to group_list_add_job()
  mxqd: Rename user_list_find_uid() to _user_list_find_by_uid()
  mxqd: Rename server_update_groupdata() to server_update_group()
  mxqd: Rename server_add_user() to _server_add_group()
  mxqd: Rename user_update_groupdata() to _user_list_update_group()
  mxqd: Rename user_add_group() to _user_list_add_group()
  mxqd: Rename group_init() to _group_list_init()
  mxqd: reaper: ignore signals from mxqd
  mxqd: set cpu_set_running in group_add_job
  database: store and retrieve cpuset of job
  mxq_job: add a string version of host_cpu_set
  mxq_job: refactor (add do_jobs_statement)
  mxqd: do not finish jobs from signals when we have reaper output
  mxqd: killall_over_memory: Send SIGKILL after sending SIGTERM
  mxqd: better loglevels for killall_over_time
  mxqd: remove unused member
  mxqd: let reaper call setsid instread of user process
  mxqd: add job_is_lost
  mxqd: do not kill children in catchall
  mxq_job: add mxq_set_job_status_unknown
  mxqd: add SIGQUIT processing : do not kill or wait for children
  mxqd: let recover_from_previous_crash rebuild state for previous jobs
  mxqd: refactor (add reset_signals)
  mxq_job: add mxq_load_jobs_running_on_server
  mxqd: stop recover_from_previous_crash from deleting running jobs
  mxqd: add reaper
  mxqd: killall_over_memory: rename/cleanup variables
  mxqd: add help functions for fspool (finished job spool directory)
  mxqd: create MXQ_FINISHED_JOBSDIR on startup
  make: add FINISHED_JOBSDIR
  mx_util: add mx_mkdir_p
  mxqd: refactor (add job_has_finished)
  mxqd: refactor (add user_process)
Completed (2015) by 9e9c3a5 ("Merge remote-tracking branch 'mariux/issues/issue30'") |
Add reaper process to handle job execution and reaping of childs.

new mxqd -> reaper interactions may include:
- send signal SIGUSR1: kill children slowly with SIGTERM !?
- send signal SIGUSR2: kill children now with SIGKILL !?

other issues:
- review:

bugs to be fixed:
- date_start and date_end missing in reaper stat file
- LOADED jobs PPID == PID(reaper)
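The SIGUSR1/SIGUSR2 interaction floated in the issue text (both items are marked "!?" there, so this is a proposal, not necessarily what was merged) could be sketched like this in Python; mxqd itself is C. The mapping table, `install_forwarding`, and the injectable `killpg` parameter are hypothetical; the sketch assumes the user job runs in its own process group whose id the reaper knows.

```python
import os
import signal

# Proposed meaning of signals sent from mxqd to a reaper.
FORWARD = {
    signal.SIGUSR1: signal.SIGTERM,  # kill children slowly
    signal.SIGUSR2: signal.SIGKILL,  # kill children now
}


def install_forwarding(job_pgid, killpg=os.killpg):
    """Make this (reaper) process forward USR1/USR2 to the job's group.

    killpg is injectable so the mapping can be exercised without
    signalling real processes.
    """
    def handler(signum, frame):
        killpg(job_pgid, FORWARD[signum])

    for sig in FORWARD:
        signal.signal(sig, handler)
```

Signalling the whole process group rather than a single pid also reaches processes the user job has spawned itself.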