Work #63

donald · 2017-07-06T10:06:25Z

No description provided.

Add extra parentheses around assigments evaluated as conditions as suggested by -Wparentheses. This is to avoid CC test_mx_util.o test_mx_util.c: In function ‘test_mx_strskipwhitespaces’: test_mx_util.c:17:4: warning: suggest parentheses around assignment used as truth value [-Wparentheses] assert(s = mx_strskipwhitespaces(" abc ")); ^ test_mx_util.c:20:4: warning: suggest parentheses around assignment used as truth value [-Wparentheses] assert(s = mx_strskipwhitespaces("abc ")); ^ test_mx_util.c:23:4: warning: suggest parentheses around assignment used as truth value [-Wparentheses] assert(s = mx_strskipwhitespaces("")); ^ test_mx_util.c: In function ‘test_mx_strscan’: test_mx_util.c:311:5: warning: suggest parentheses around assignment used as truth value [-Wparentheses] assert(s = strdup("123 456 -789 246 abc")); ^ etc.

This is a general purpose routine to sort a single linked list of objects. It requires callbacks for comparison and for locating the next pointer in the object.

No functional change.

Add a utility routing to sort the groups of all users of the server to their group priority.

When active groups are (re-)loaded from the database, we sort them by group priority. This implements group priority (mxqsub -P).

We intend to expand the functionality which can be triggered by signals, e.g. switch log level to debug or trigger a state dump. By processing signals synchronously, we are free to call non-reentrant functions and know, that our own data is not in a transient state. Block asynchronous signals and receive and process signals explicitly. Re-enable the asynchronous signals when the user process in initialized.

Now hat all signals are blocked, we don't need to ignore them.

It is no longer necessary to reset signals, because we no longer change them. The signals may stay blocked for the reaper process and are explicitly unblocked for the user process.

We process signals synchronously now. Change "volatile sig_atomic_t" to "int" type.

A dump of server state can be triggerd by env kill -usr2 -q 10 mxqd

The loglevel may be set to info with env kill -usr2 -q 21 mxqd and to info with env kill -usr2 -q 20 mxqd

We intent to make loggin on info level less verbose so that we don'T need the deduplication here. At debug level we want to see all events as they happen.

Try to start as many jobs jobs as possible without going through a main loop iteration (database roundtrip).

Use 10 seconds everywhere to decrease the load on the database and races between the mxq daemons a bit. At the same time this increases the chance that multiple jobs of the same group are started on the same server, which is good (better use of caches, smaller failure surface). This is the maximum time a single server will need to react to database changes (mxqsub or mxqkill). Administrative signals will get immediate reaction. Finished user jobs will usually also get immediate reaction. However, this is not true for jobs we picked up from a previous daemon incarnation and which are not our children. If these jobs finish, we will not get a signal, so we need to look into the spool directory from time to time. This is another reason, why we need a timeout at all. Now that we want to use 10 seconds everywhere, we can make it a constant.

We don't want spurious signals (like SIGWINCH) to trigger a new evaluation of the main loop. Only wait for signals we want to be processed.

Add commands killall quitall reloadall reloadall (=restartall) dumpall setdebugall setinfoall (=setnodebugall)

The server status if multiple times set from the main loop to IDLE,RUNNING,BACKFILL,CPUOPTIMAL or FULL based on server->slots_running, server->slots, server->threads_running. Factor this out into a separate function to avoid repetition. Add a call to this function before the main loop so that we immediately get a correct status after a server start. Simplify the loop code by setting server status at end of loop and avoid the continue pattern. As a side effect this also resolves a bug, that the server status was not updated in a loop iteration where new jobs were started.

This is just to avoid valgrind leak warnings. We have two variables with cleanup attribute in the scope of main, which are not freed, when we leave by exit() instead of running out of scope. ==10696== 21 bytes in 1 blocks are still reachable in loss record 1 of 3 ==10696== at 0x4C29AC6: malloc (vg_replace_malloc.c:299) ==10696== by 0x5E19449: strdup (strdup.c:42) ==10696== by 0x5E750DD: get_current_dir_name (getdirname.c:40) ==10696== by 0x409F64: main (mxqd.c:2383) ==10696== ==10696== 45 bytes in 1 blocks are still reachable in loss record 2 of 3 ==10696== at 0x4C29AC6: malloc (vg_replace_malloc.c:299) ==10696== by 0x40CFF9: mx_strvec_to_str (mx_util.c:1059) ==10696== by 0x409F58: main (mxqd.c:2382)

The mysql client library sets SIGPIPE to SIG_IGN which gets inherited by the user processes: root@sigchld:~# cat /proc/27900/status |grep Sig SigQ: 0/30656 SigPnd: 0000000000000000 SigBlk: 0000000000000000 SigIgn: 0000000000001000 SigCgt: 0000000000000000 This makes a difference for things like CMD | head -1 because CMD is not killed by SIGPIPE when `head` is finished. It will get an error "broken pipe" on the next write and may or may not emit this to stderr. This bug war partially caused by #63, where we changed our own signal handling to synchronous mode. While in the previous mode, we unknowingly fixed the change done by the mysql library, we then just set and reset the blocking mask, which doesn't restore an ignored signal. Restore SIGCHLD to SIG_DFL when we initialize the child process for the user. With this patch applied, no more signals are blocked for the user process. buczek@theinternet:~/git/mxq (restore-sigpipe)$ cat /proc/17081/status |grep Sig SigQ: 1/127301 SigPnd: 0000000000000000 SigBlk: 0000000000000000 SigIgn: 0000000000000000 SigCgt: 0000000000000000 Fixes #71

The mysql client library sets SIGPIPE to SIG_IGN which gets inherited by the user processes: root@sigchld:~# cat /proc/27900/status |grep Sig SigQ: 0/30656 SigPnd: 0000000000000000 SigBlk: 0000000000000000 SigIgn: 0000000000001000 SigCgt: 0000000000000000 This makes a difference for things like CMD | head -1 because CMD is not killed by SIGPIPE when `head` is finished. It will get an error "broken pipe" on the next write and may or may not emit this to stderr. This bug was partially caused by #63 (commit ff2f49f (mxqd: Use synchronous signals)), where we changed our own signal handling to synchronous mode. While in the previous mode, we unknowingly fixed the change done by the mysql library, we then just set and reset the blocking mask, which doesn't restore an ignored signal. Restore SIGCHLD to SIG_DFL when we initialize the child process for the user. With this patch applied, no more signals are blocked for the user process. buczek@theinternet:~/git/mxq (restore-sigpipe)$ cat /proc/17081/status |grep Sig SigQ: 1/127301 SigPnd: 0000000000000000 SigBlk: 0000000000000000 SigIgn: 0000000000000000 SigCgt: 0000000000000000 Fixes: #71

donald added 20 commits July 4, 2017 09:36

mx_util: Add mx_sort_linked_list

8496f6a

This is a general purpose routine to sort a single linked list of objects. It requires callbacks for comparison and for locating the next pointer in the object.

mxqd_control: Use generic sort function to sort users

c956da7

No functional change.

mxqd_config: Add server_sort_groups_by_priority

e10e3d0

Add a utility routing to sort the groups of all users of the server to their group priority.

mxqd: Sort groups by priority

d99d562

When active groups are (re-)loaded from the database, we sort them by group priority. This implements group priority (mxqsub -P).

mxqd: Remove obsolete signal() calls in reaper

7219743

Now hat all signals are blocked, we don't need to ignore them.

mxqd: Remove reset_signals

784dcbc

It is no longer necessary to reset signals, because we no longer change them. The signals may stay blocked for the reaper process and are explicitly unblocked for the user process.

mxqd: Remove volatile sig_atomic_t type

e64d665

We process signals synchronously now. Change "volatile sig_atomic_t" to "int" type.

mxqd: Dump server state on SIGUSR2 q=10

8660ae5

A dump of server state can be triggerd by env kill -usr2 -q 10 mxqd

mxqd: Change loglevel on SIGUSR2 q=20 or 21

3a7b5f6

The loglevel may be set to info with env kill -usr2 -q 21 mxqd and to info with env kill -usr2 -q 20 mxqd

mxq_log: Do not deduplicate messages

447cc7f

We intent to make loggin on info level less verbose so that we don'T need the deduplication here. At debug level we want to see all events as they happen.

mxqd: Reduce logging at log level info in main loop

6a15b2b

mxqd: Start jobs without main loop iteration

82d7a31

Try to start as many jobs jobs as possible without going through a main loop iteration (database roundtrip).

mxqd: Only process signals we expect

0717a76

We don't want spurious signals (like SIGWINCH) to trigger a new evaluation of the main loop. Only wait for signals we want to be processed.

mxqd: Change some messages from info to debug

3fe7c3d

mxqdctl-hostconfig: Add debug commands

adbe312

Add commands killall quitall reloadall reloadall (=restartall) dumpall setdebugall setinfoall (=setnodebugall)

donald merged commit b5fe300 into mariux64:mpi Jul 6, 2017

donald mentioned this pull request Mar 20, 2018

Restore sigpipe #72

Merged

Work #63

Work #63

Conversation

donald commented Jul 6, 2017