Add reaper process to handle job execution and reaping of childs #30

Closed
6 of 12 tasks
mariux opened this issue Oct 24, 2015 · 10 comments

@mariux
Contributor

mariux commented Oct 24, 2015

Add a reaper process to handle job execution and reaping of children.

  • spawn reaper to handle job execution
  • store finished child status in a file instead of the database, to be able to collect data after failures
  • when recovering from a failure, read status files and store the data in the database (late finish jobs)
  • when recovering and jobs are still running from previous mxqd runs "reattach" those jobs to mxqd
    • mark memory, thread/cpuset to be occupied by those jobs.
  • catch signals from server to kill user jobs
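
A minimal sketch of the first two points, assuming the reaper forks the user job, waits for it, and records the raw wait status in a file. The names `reaper_run` and `write_status_file` and the `pid=... status=...` file format are hypothetical and not taken from mxqd. The file is written under a temporary name and then renamed, so a scanner never sees a half-written file.

```c
#include <errno.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: store pid and wait status in `path`.
 * Writes to "<path>.tmp" first and renames it into place.
 * Returns 0 on success, -1 on error. */
static int write_status_file(const char *path, pid_t pid, int status)
{
    char tmp[4096];
    FILE *f;
    int n = snprintf(tmp, sizeof tmp, "%s.tmp", path);

    if (n < 0 || (size_t)n >= sizeof tmp)
        return -1;
    f = fopen(tmp, "w");
    if (!f)
        return -1;
    if (fprintf(f, "pid=%ld status=%d\n", (long)pid, status) < 0) {
        fclose(f);
        unlink(tmp);
        return -1;
    }
    if (fclose(f) != 0 || rename(tmp, path) != 0) {
        unlink(tmp);
        return -1;
    }
    return 0;
}

/* Hypothetical reaper body: run argv as a child, wait for it,
 * and record its wait status in `path`. */
static int reaper_run(char *const argv[], const char *path)
{
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0) {
        execvp(argv[0], argv);
        _exit(127); /* exec failed */
    }
    while (waitpid(pid, &status, 0) < 0) {
        if (errno != EINTR)
            return -1;
    }
    return write_status_file(path, pid, status);
}
```

On recovery, the daemon could parse such files and decode the stored value with `WIFEXITED`/`WIFSIGNALED` before updating the database.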

new mxqd -> reaper interactions may include:

  • send SIGUSR1: kill children slowly with SIGTERM!?
  • send SIGUSR2: kill children now with SIGKILL!?

other issues:

  • log returned children.

review: bugs to be fixed:

  • date_start and date_end missing in reaper stat file
  • handle LOADED jobs
  • save status from MAIN pid to database (atm it seems to be the last PID reaped?)
  • reaper: when MAINPID exits kill all left over children
    • scan proctree and kill all with PPID == PID(reaper)
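
The process-tree scan from the last point could look roughly like this on Linux, assuming `/proc` is mounted. The function names are hypothetical. The parent pid is read from `/proc/<pid>/stat`, parsing after the last `)` because the command name in that file may itself contain spaces or parentheses.

```c
#include <ctype.h>
#include <dirent.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Read the parent pid of `pid` from /proc/<pid>/stat; -1 on error. */
static pid_t read_ppid(pid_t pid)
{
    char path[64], buf[1024], state;
    char *p;
    FILE *f;
    size_t n;
    int ppid;

    snprintf(path, sizeof path, "/proc/%ld/stat", (long)pid);
    f = fopen(path, "r");
    if (!f)
        return -1;
    n = fread(buf, 1, sizeof buf - 1, f);
    fclose(f);
    buf[n] = '\0';

    p = strrchr(buf, ')'); /* end of the "(comm)" field */
    if (!p || sscanf(p + 1, " %c %d", &state, &ppid) != 2)
        return -1;
    return (pid_t)ppid;
}

/* Send `sig` to every process whose parent is `parent`.
 * With sig == 0 nothing is sent and the children are only counted.
 * Returns the number of matching processes, or -1 on error. */
static int signal_children_of(pid_t parent, int sig)
{
    DIR *d = opendir("/proc");
    struct dirent *e;
    int count = 0;

    if (!d)
        return -1;
    while ((e = readdir(d)) != NULL) {
        char *end;
        long pid;

        if (!isdigit((unsigned char)e->d_name[0]))
            continue;
        pid = strtol(e->d_name, &end, 10);
        if (*end != '\0' || read_ppid((pid_t)pid) != parent)
            continue;
        if (sig != 0)
            kill((pid_t)pid, sig);
        count++;
    }
    closedir(d);
    return count;
}

/* What the reaper would call once MAINPID has exited. */
static int kill_leftover_children(void)
{
    return signal_children_of(getpid(), SIGKILL);
}
```

The scan is inherently racy: a child may fork again between reading `/proc` and sending the signal, so a real implementation would probably repeat it until no children are left.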
@mariux mariux added this to the 1.0 milestone Oct 24, 2015
@donald
Contributor

donald commented Oct 24, 2015

If we keep both alternatives to detect finished user jobs, the reaper would have to reflect the exit status of the user process, including signals. This is possible but ugly.
Perhaps we should go for files only.

@mariux
Contributor Author

mariux commented Oct 25, 2015

it's not that ugly: reset all signal handlers and then kill(getpid(), SIGNAL) or exit(STATUS) ;)

but true... so the main process should only reap the pids and act only if no file is available, while in the normal case fspool_scan should be the only function that updates database entries... sounds good ;)

@donald
Contributor

donald commented Oct 25, 2015

In that case... If a reaper terminates without leaving a finished job file, it would always be an error, and a rather serious and unexpected one: a full spool file system (which the reaper could handle by waiting) or, of course, a coding error or hardware error. In that case it might be an idea for mxqd to terminate, or at least to stop starting new jobs, to prevent more jobs from going down the same sink.

@mariux
Contributor Author

mariux commented Oct 25, 2015

so reaper should try to write the file forever - if recoverable? ;)
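
One way to picture "retry forever if recoverable", as a sketch only: the set of errno values treated as recoverable is an assumption, and all names are hypothetical. `fail_n_times` is just a stand-in for the real write so the loop can be exercised.

```c
#include <errno.h>
#include <unistd.h>

/* Errors that may go away by waiting (assumed selection). */
static int is_recoverable(int err)
{
    return err == ENOSPC || err == EDQUOT || err == EINTR || err == EAGAIN;
}

/* Call attempt(arg) until it returns 0 or fails with a
 * non-recoverable errno. attempt() must return 0 on success
 * or -1 with errno set. */
static int retry_while_recoverable(int (*attempt)(void *), void *arg,
                                   unsigned delay_sec)
{
    for (;;) {
        if (attempt(arg) == 0)
            return 0;
        if (!is_recoverable(errno))
            return -1;
        if (delay_sec > 0)
            sleep(delay_sec);
    }
}

/* Stand-in for the status file write: fails with ENOSPC
 * until the counter behind `arg` reaches zero. */
static int fail_n_times(void *arg)
{
    int *left = arg;

    if (*left > 0) {
        (*left)--;
        errno = ENOSPC;
        return -1;
    }
    return 0;
}
```

A non-recoverable error still ends the loop, which matches the case above where a missing finished job file signals a serious problem.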

@donald
Contributor

donald commented Oct 28, 2015

because "terminate daemon but leave jobs running" is now a valid option, what meaning do we want to have for signals to mxqd ?

a) exit (leave children running)
b) wait for children to complete and exit
c) kill children, wait for children to complete and exit

SIGINT for a , SIGTERM for c, we don't need b ?

@mariux
Contributor Author

mariux commented Oct 28, 2015

  • a) SIGQUIT because it quits the daemon?
  • b) SIGTERM because it (slowly) terminates all actions?
  • c) SIGINT because it interrupts all actions and exits as fast as possible?
  • maybe SIGTERM does not have to wait forever in future releases and also kills processes that are not scheduled to finish within the (configurable) 24h limit? (e.g. where runtime_left >= 24h and/or where current_runtime >= 24h)
  • but for all signals it should be possible to first TERM and later INT or kill all leftovers if QUIT was used?

what do you think? tick your choices ;))))) hehehe

@mariux
Contributor Author

mariux commented Oct 28, 2015

I just remembered and checked again and found that SIGTERM and SIGINT are already implemented that way:

  • TERM stops starting new jobs, waits for running ones and exits.
  • INT stops starting new jobs, kills all running ones, waits for the killed ones and exits.

So the only new signal to handle is SIGQUIT or any other signal that should just quit daemon.

@donald
Contributor

donald commented Oct 29, 2015

Hmmm. I don't like the idea of a "signal escalation order". This isn't defined anywhere and is not really useful.

We have a more or less defined meaning for "TERM" in signal(7), which is the termination signal: The process should clean up and exit. IMO in our case it should leave the user jobs running, because with the reaper that's a perfectly legal state and there is no reason to wait for the processes to finish. A system shutdown would send signals to the user processes anyway.

"QUIT" and "INT" are defined as keyboard signals. If I have a daemon running in the foreground on a terminal, I'd expect ^C to terminate it (the same as TERM in the above paragraph). A detached daemon has no terminal so the signals have no predefined meaning in that context. We could reuse the signals for other purposes, if we run out of SIGUSR1,SIGUSR2.

What do we want to ask the daemon besides "Terminate" ?

We might want a way to ask the daemon to stop accepting new jobs. But we could get to the same state by killing the daemon and restarting it with -m 1 or whatever. So this would just be a convenience.

We might want a way to ask the daemon to kill all running jobs. But we could do the same manually. So again, this would just be a convenience.

We might want a way to ask the daemon to exit, when the jobs are finished and their cleanup work is done, which only makes sense, when we don't accept jobs while waiting. On the other hand, we could leave the daemon running, because it won't do anything when not accepting new jobs or finishing old ones. So a third time just convenience.

Hmmm....

@mariux
Contributor Author

mariux commented Oct 29, 2015

it does not matter how a signal is sent because there is no difference in sending SIGQUIT/SIGINT via keyboard or via kill. When using keyboard shortcuts for signals you want the foreground process to end now: and both signals end the main mxqd - only sending SIGTERM will not exit immediately but wait for a clean shutdown.

And for SIGTERM signal(7) defines no "cleanup and exit." action. Yes, it's a termination signal and that is what currently happens directly after waiting for all jobs to finish in a clean way.

Stopping a daemon and restarting it with different (less) resources (-m 1) is currently undefined behavior.
And removing current SIGTERM action and replacing it with something like SIGQUIT+mxqd -m 1 should not be an option.

But after all: The only new action is to quit the daemon immediately and leave all reapers running. There is no need to change or remove existing behavior and signal actions.

@mariux mariux assigned mariux and unassigned donald Nov 3, 2015
mariux added a commit to mariux/mxq that referenced this issue Nov 3, 2015
see mariux64#30

* donald/reaper:
  mxqd: reaper: ignore signals from mxqd
  mxqd: set cpu_set_running in group_add_job
  database: store and retrieve cpuset of job
  mxq_job: add a string version of host_cpu_set
  mxq_job: refactor (add do_jobs_statement)
  mxqd: do not finish jobs from signals when we have reaper output
  mxqd: better loglevels for killall_over_time
  mxqd: remove unused member
  mxqd: let reaper call setsid instread of user process
  mxqd: add job_is_lost
  mxqd: do not kill children in catchall
  mxq_job: add mxq_set_job_status_unknown
  mxqd: add SIGQUIT processing : do not kill or wait for children
  mxqd: let recover_from_previous_crash rebuild state for previous jobs
  mxqd: refactor (add reset_signals)
  mxq_job: add mxq_load_jobs_running_on_server
  mxqd: stop recover_from_previous_crash from deleting running jobs
  mxqd: add reaper
  mxqd: add help functions for fspool (finished job spool directory)
  mxqd: create MXQ_FINISHED_JOBSDIR on startup
  make: add FINISHED_JOBSDIR
  mx_util: add mx_mkdir_p
  mxqd: refactor (add job_has_finished)
  mxqd: refactor (add user_process)
mariux added a commit to mariux/mxq that referenced this issue Nov 3, 2015
implements parts of mariux64#30

* mariux/issues/issue30:
  mxqd: free structures to remove leftover memory in reaper process
  mxqd: be a bit more verbose when starting processes to log pids
  mxq_job: Minor cleanup
  mx_flock: export mx_flock_free() to free without releasing lock
  mxqd: Fix memory leak for host_cpu_set_str
  mxqd: Fix kill signals: send kill to pgrp instead of reaper pid
  mxqd: Cleanup reaper_process()
  mxqd: Cleanup user_process()
  mxqd: Cleanup init_child_process()
  mxqd: Cleanup job_has_finished() and job_lost()
  mxqd: Remove fspool_unlink()
  test_mxqd_control: Init server structure
  mxqd: Fix fspool_process_file()
  mxqd: Fix and rename server_reload_running() to load_running_jobs()
  mxqd: Rename load_groups() to load_running_groups()
  mxqd_control: Refactor and export server structure management
  mxqd: Cleanup start_job()
  mxqd: Cleanup server_close()
  mxqd: Cleanup server_dump()
  mxqd: Rename server_find_user() to server_find_user_by_uid()
  mxqd: Rename lost_scan_one()
  mxqd: Cleanup server_reload_running()
  mxqd: Cleanup catchall()
  mxqd: Cleanup load_groups()
  mxqd: Rename server_find_group() to server_get_group_list_by_group_id()
  mxqd: Rename server_find_job() to server_get_job_list_by_job_id()
  mxqd: Rename server_remove_job() to job_list_remove_self()
  mxqd: Rename server_find_job_by_pid() to server_get_job_list_by_pid()
  mxqd: Rename server_remove_job_by_pid() to server_remove_job_list_by_pid()
  mxqd: Rename killallcancelled() to killall_cancelled()
  mxqd: Cleanup killall()
  mxqd: Cleanup killall_over_time()
  mxqd: Cleanup start_users()
  mxqd: Cleanup start_user()
  mxqd: Rename remove_orphaned_groups() to remove_orphaned_group_lists()
  mxqd: Rename group_list_find_group() to _group_list_find_by_group()
  mxqd: Rename group_add_job() to group_list_add_job()
  mxqd: Rename user_list_find_uid() to _user_list_find_by_uid()
  mxqd: Rename server_update_groupdata() to server_update_group()
  mxqd: Rename server_add_user() to _server_add_group()
  mxqd: Rename user_update_groupdata() to _user_list_update_group()
  mxqd: Rename user_add_group() to _user_list_add_group()
  mxqd: Rename group_init() to _group_list_init()
  mxqd: reaper: ignore signals from mxqd
  mxqd: set cpu_set_running in group_add_job
  database: store and retrieve cpuset of job
  mxq_job: add a string version of host_cpu_set
  mxq_job: refactor (add do_jobs_statement)
  mxqd: do not finish jobs from signals when we have reaper output
  mxqd: killall_over_memory: Send SIGKILL after sending SIGTERM
  mxqd: better loglevels for killall_over_time
  mxqd: remove unused member
  mxqd: let reaper call setsid instread of user process
  mxqd: add job_is_lost
  mxqd: do not kill children in catchall
  mxq_job: add mxq_set_job_status_unknown
  mxqd: add SIGQUIT processing : do not kill or wait for children
  mxqd: let recover_from_previous_crash rebuild state for previous jobs
  mxqd: refactor (add reset_signals)
  mxq_job: add mxq_load_jobs_running_on_server
  mxqd: stop recover_from_previous_crash from deleting running jobs
  mxqd: add reaper
  mxqd: killall_over_memory: rename/cleanup variables
  mxqd: add help functions for fspool (finished job spool directory)
  mxqd: create MXQ_FINISHED_JOBSDIR on startup
  make: add FINISHED_JOBSDIR
  mx_util: add mx_mkdir_p
  mxqd: refactor (add job_has_finished)
  mxqd: refactor (add user_process)
@mariux mariux removed their assignment May 14, 2017
@donald
Contributor

donald commented Jan 1, 2024

Completed (2015) by 9e9c3a5 ("Merge remote-tracking branch 'mariux/issues/issue30'")

@donald donald closed this as completed Jan 1, 2024