
Add disabled servers #81

Closed
wants to merge 24 commits

Conversation


@donald donald commented Apr 13, 2020

No description provided.


@pmenzel pmenzel left a comment


Thank you for the implementation. I also thought about this, and believe it should be a white list.

+    "  -H, --hosts=LIST         Whitelist cluster nodes (hosts) on which job may run\n"
+    "                           (default: all)\n"
+    "\n"
+    "                           If a job requires a certain feature on a cluster node like\n"
+    "                           a processor instruction, this allows listing the cluster nodes\n"
+    "                           with that feature\n"
+    "\n"
+    "        [LIST] Hostnames separated by whitespace\n"
+    "               Example: "$(hostconfigctl --list todsuende)"\n"
+    "               Example: "kreios uselessbox"\n"

mxqset is a nice idea, but breaks with the current design of MXQ, where groups are immutable. I think users wanting to restrict their jobs to certain features of cluster nodes should test these features beforehand and then submit this list with their job.

@@ -1277,6 +1288,26 @@ unsigned long start_job(struct mxq_group_list *glist)

/**********************************************************************/

static int server_is_qualified(struct mxq_server *server, struct mxq_group *group) {
int is_qualified = 1;
Contributor


Do we have type bool?

Contributor Author

@donald donald Apr 14, 2020


Is there a single agreed-on standard? Some revisions of the C language have "bool" or "_Bool", but not all. Many C libraries, frameworks, and applications bring their own definitions, all different. It's a mess. If you include <stdbool.h> you might get into conflict with a self-defined "bool".

All this is very unclear to me. And if there is no single standard, you don't know the semantics of some self-invented boolean type definition unless you consult the individual sources or documentation. Is it a macro, a typedef, an enum, a compiler built-in? Can you use arithmetic operators on it (e.g. xor)? Is the "bool" from one library compatible with the "bool" from another one? What is the size and alignment? Can you pack it? Does the code spend cpu cycles converting from a boolean expression to this boolean type? What is the legal range of values?

On the other hand, if you just use a standard C type instead, a C programmer immediately knows all features of the type.

If the user-defined boolean type resolves to some integral type (which it usually does, because everything else doesn't play well with boolean expressions), there will be no added type safety, because everything will be promoted.

There might be some kind of documentation effect. But often you can get the same by picking good names for variables and parameters, e.g. including "is" in the name as is done here.

Just my own spontaneous thoughts about that, but I'd like to learn. Can you point me to a good argument for which standard should be used?

Avoid an unneeded warning when scanning JOB_TMPDIR_MNTDIR for possible
leftover job mounts. The parent directory of JOB_TMPDIR_MNTDIR is
created when the first job with the --tmpdir feature is started, so
it won't exist on a new server.

This cleanup path is only used in the exceptional case that the usual
cleanup path (via job_has_finished) didn't succeed, e.g. when mxqd was
killed. The cleanup is not essential for the currently running server.
Add function attribute ((unused)) to avoid a compiler warning when the
static inline function defined in the header file is not used by the
compilation unit.
Add header file for the quasi-standard [1] xmalloc call.

Include x-version of strndup. Other functions (e.g. xrealloc) can be
added as needed.

The current approach implemented in mx_util is to wait and retry on an
ENOMEM malloc failure. However, this overhead doesn't seem to be
justified, because it is difficult to imagine a case where malloc
would fail with ENOMEM at one time and a retry would succeed.

[1] https://www.gnu.org/software/libc/manual/html_node/Malloc-Examples.html
Add a utility module which can store a set of keywords. The set can be
created and updated from a string and can be serialized into a string.

The string representation is a space-separated list of keywords. Strings
created by this utility contain the keywords stored in the set, sorted
in lexical order and separated by a single space character.

Input strings used to create or update a keyword set contain keywords
separated by whitespace. Words from the string are added to the set
unless a word starts with '-', in which case it will be removed from the
set.

Usage example:

    struct keywordset *kws = keywordset_new("xxx yyy");
    keywordset_update(kws, "-yyy aaaa");

    char *s = keywordset_get(kws);   // s is now "aaaa xxx"
    free(s);                         // caller must free()

    if (keywordset_ismember(kws, "xxx")) ... // true

    keywordset_free(kws);
Add a new option so that specific servers can be excluded from starting
jobs for this group. E.g.:

    mxqsub --disabled-servers="dontpanic sigusr2" sleep 100
Add a new command which can be used to modify an existing group:

    mxqset group 123 --closed
    mxqset group 123 --open
    mxqset group 123 --disabled-servers=dontpanic
    mxqset group 123 --update-disabled-servers="sigusr sigusr2 -dontpanic"
    mxqset group 123 --disabled-servers=

The flags open and closed can be set from mxqadmin as well
(`mxqadmin --close=123`), but that synopsis, in which the options take
the group id as a value, is difficult to extend to options which take a
value themselves.
Check whether this server is qualified to start jobs from a group.
Lazy-evaluate the qualification criteria and cache the result.
Don't start jobs we are not qualified for.

For now, the only qualification criterion is that our short or fully
qualified hostname is not included in the excluded_servers set of the
group.

This can later be expanded to additional criteria (e.g. hostconfig or
processor flags).
@donald
Contributor Author

donald commented Apr 14, 2020

Thank you for the implementation. I also thought about this, and believe it should be a white list.

Hmm, I have a feature in the queue so that you could use arbitrary boolean expressions from cpu feature flags or hostconfig flags ("cuda & !pcpool & ssse3"). I could register the short hostname as well, and then you could use that to select specific hosts, too ("ira|kreios|luxuria").

mxqset is a nice idea, but breaks with the current design of MXQ, where groups are immutable.

Yes. At first I was reluctant to change that model, too. But Peter ( @wwwutz not @arndt ) convinced me that this feature would be useful enough to justify getting rid of this principle. The model is that you have several jobs in the queue. They complete one by one, and you recognize that jobs executed by a specific server fail, although you don't yet know why. So it would be useful if you could just exclude that specific server for the group without the need to cancel the group and resubmit everything unfinished.

So what is the disadvantage of adding a read/write group attribute? It might be a bit surprising that an mxqsub command might go into a new group because someone changed the attributes of a previous group. But it's only surprising when you've already internalized the previous rules. Plus, this has already been weakened a bit, because you can already close a group with mxqadmin.

@pmenzel
Contributor

pmenzel commented Apr 14, 2020

Thank you for the implementation. I also thought about this, and believe it should be a white list.

Hmm, I have a feature in the queue so that you could use arbitrary boolean expressions from cpu feature flags or hostconfig flags ("cuda & !pcpool & ssse3"). I could register the short hostname as well, and then you could use that to select specific hosts, too ("ira|kreios|luxuria").

I forgot to justify the whitelist. It’s because new cluster nodes can be added at any time. If a user requests a feature, it’s uncertain whether a new node fulfills it.

Support for arbitrary boolean expressions is overkill in my opinion, and just adds unneeded complexity. Users have to test beforehand what cluster nodes support their requirements, so they will get a list of names quite easily.

Semantics also become quite confusing, as the expression depends on the question asked: ira|kreios|luxuria or ira&kreios&luxuria.

mxqset is a nice idea, but breaks with the current design of MXQ, where groups are immutable.

Yes. At first I was reluctant to change that model, too. But Peter ( @wwwutz not @arndt ) convinced me that this feature would be useful enough to justify getting rid of this principle. The model is that you have several jobs in the queue. They complete one by one, and you recognize that jobs executed by a specific server fail, although you don't yet know why. So it would be useful if you could just exclude that specific server for the group without the need to cancel the group and resubmit everything unfinished.

Killing the group would still finish the running jobs, wouldn’t it? So I do not see the advantage of using mxqset and resubmitting the failed jobs over mxqkill and resubmitting (the not yet finished jobs). The unfinished/failed job has to be resubmitted anyway, so the pipeline should support resubmitting jobs anyway.

So what is the disadvantage of adding a read/write group attribute? It might be a bit surprising that an mxqsub command might go into a new group because someone changed the attributes of a previous group. But it's only surprising when you've already internalized the previous rules. Plus, this has already been weakened a bit, because you can already close a group with mxqadmin.

Can all group attributes be changed? If not, the users might ask, which ones? If all group attributes can be changed, the scheduler could be cheated mainly by changing the runtime.

@donald
Contributor Author

donald commented Apr 14, 2020

I forgot to justify the whitelist. It’s because new cluster nodes can be added at any time. If a user requests a feature, it’s uncertain whether a new node fulfills it.

Okay, but this is another use case. @wwwutz would want his jobs to go into the new node as well, assuming that the failing node is the exception. Same with @arndt and dontpanic before this was identified as "missing ssse3".

Support for arbitrary boolean expressions is overkill in my opinion, and just adds unneeded complexity. Users have to test beforehand what cluster nodes support their requirements, so they will get a list of names quite easily.

The boolean expression looked complex, because I made a complex example. But we always wanted to be able to submit to "cuda nodes only", in which case the expression would be a simple "cuda". Or, in the @arndt case, the simple expression "ssse3" would do and would be better than blacklisting dontpanic and much better than whitelisting all hosts but dontpanic.

The boolean prerequisite expression could be used to blacklist or whitelist hosts, but I've added excluded-servers nonetheless, because with mxqset it's easier to add or remove individual hosts from the list. We could add whitelist-servers as well, if there is a use case.

Semantics also become quite confusing, as the expression depends on the question asked: ira|kreios|luxuria or ira&kreios&luxuria.

Well, on "ira", "ira" is true and "kreios" and "luxuria" are false. So "ira|kreios|luxuria" is correct, because it evaluates to true on these three hosts and to false on any other. "ira&kreios&luxuria" would always be false. Is that confusing?

mxqset is a nice idea, but breaks with the current design of MXQ, where groups are immutable.

Yes. At first I was reluctant to change that model, too. But Peter ( @wwwutz not @arndt ) convinced me that this feature would be useful enough to justify getting rid of this principle. The model is that you have several jobs in the queue. They complete one by one, and you recognize that jobs executed by a specific server fail, although you don't yet know why. So it would be useful if you could just exclude that specific server for the group without the need to cancel the group and resubmit everything unfinished.

Killing the group would still finish the running jobs, wouldn’t it? So I do not see the advantage of using mxqset and resubmitting the failed jobs over mxqkill and resubmitting (the not yet finished jobs). The unfinished/failed job has to be resubmitted anyway, so the pipeline should support resubmitting jobs anyway.

But you'd need to resubmit not only the failed jobs, but the "not yet started" jobs as well, which might be more complicated. And while you are sorting that out, the pipeline would be stopped. With this approach you could just exclude the bad host, and your pipeline would keep running while you identify and resubmit the failed jobs.

Another difference would be that, if the issue on the failing host is resolved, you could use "mxqset" to remove the host from the excluded servers list and wouldn't need to resubmit at all.

So what is the disadvantage of adding a read/write group attribute? It might be a bit surprising that an mxqsub command might go into a new group because someone changed the attributes of a previous group. But it's only surprising when you've already internalized the previous rules. Plus, this has already been weakened a bit, because you can already close a group with mxqadmin.

Can all group attributes be changed? If not, the users might ask, which ones? If all group attributes can be changed, the scheduler could be cheated mainly by changing the runtime.

Of course, you can't change all group attributes.

Currently the "closed" state can be changed, and technically you could say that mxqkill also changes group attributes ("cancelled").

With this PR, the excluded hosts can be changed. If I implement the boolean prerequisite expression, this should be changeable, too. We might consider allowing "root" to change the maximum run time, because we fulfilled some requests of that kind in a few exceptional cases in the past via direct SQL commands, which is error-prone.

The optimizing compiler predicts that pointer != NULL is more likely
than pointer == NULL anyway. The code generated by "gcc -O2" is the
same with or without the __builtin_expect builtin.

However, probably based on special knowledge of malloc, either because
of its __attribute__ ((malloc)) declaration or hard-coded, the
optimization without the __builtin_expect builtin seems to be a bit
stronger: the function out_of_memory() is put into a .text.unlikely
code segment, as if attribute ((cold)) had been declared for it.

Remove __builtin_expect.
@pmenzel
Contributor

pmenzel commented Apr 14, 2020

I forgot to justify the whitelist. It’s because new cluster nodes can be added at any time. If a user requests a feature, it’s uncertain whether a new node fulfills it.

Okay, but this is another use case. @wwwutz would want his jobs to go into the new node as well, assuming that the failing node is the exception. Same with @arndt and dontpanic before this was identified as "missing ssse3".

I understood @arndt’s and the user’s request differently. They will figure out the requirements pretty early and will resubmit.

Support for arbitrary boolean expressions is overkill in my opinion, and just adds unneeded complexity. Users have to test beforehand what cluster nodes support their requirements, so they will get a list of names quite easily.

The boolean expression looked complex, because I made a complex example. But we always wanted to be able to submit to "cuda nodes only", in which case the expression would be a simple "cuda". Or, in the @arndt case, the simple expression "ssse3" would do and would be better than blacklisting dontpanic and much better than whitelisting all hosts but dontpanic.

The boolean prerequisite expression could be used to blacklist or whitelist hosts, but I've added excluded-servers nonetheless, because with mxqset it's easier to add or remove individual hosts from the list. We could add whitelist-servers as well, if there is a use case.

Well, a whitelist in combination with hostconfig --list is a little more intuitive, as you do not need to negate.

Semantics also become quite confusing, as the expression depends on the question asked: ira|kreios|luxuria or ira&kreios&luxuria.

Well, on "ira", "ira" is true and "kreios" and "luxuria" are false. So "ira|kreios|luxuria" is correct, because it evaluates to true on these three hosts and to false on any other. "ira&kreios&luxuria" would always be false. Is that confusing?

Yes, because you ask the question “Which systems have that feature?” and not “On which system should it run?”.

mxqset is a nice idea, but breaks with the current design of MXQ, where groups are immutable.

Yes. At first I was reluctant to change that model, too. But Peter ( @wwwutz not @arndt ) convinced me that this feature would be useful enough to justify getting rid of this principle. The model is that you have several jobs in the queue. They complete one by one, and you recognize that jobs executed by a specific server fail, although you don't yet know why. So it would be useful if you could just exclude that specific server for the group without the need to cancel the group and resubmit everything unfinished.

Killing the group would still finish the running jobs, wouldn’t it? So I do not see the advantage of using mxqset and resubmitting the failed jobs over mxqkill and resubmitting (the not yet finished jobs). The unfinished/failed job has to be resubmitted anyway, so the pipeline should support resubmitting jobs anyway.

But you'd need to resubmit not only the failed jobs, but the "not yet started" jobs as well, which might be more complicated. And while you are sorting that out, the pipeline would be stopped. With this approach you could just exclude the bad host, and your pipeline would keep running while you identify and resubmit the failed jobs.

You should stop it anyway, because until mxqset is run, your jobs might be queued on the “bad” node again.

Another difference would be that, if the issue on the failing host is resolved, you could use "mxqset" to remove the host from the excluded servers list and wouldn't need to resubmit at all.

So what is the disadvantage of adding a read/write group attribute? It might be a bit surprising that an mxqsub command might go into a new group because someone changed the attributes of a previous group. But it's only surprising when you've already internalized the previous rules. Plus, this has already been weakened a bit, because you can already close a group with mxqadmin.

Can all group attributes be changed? If not, the users might ask, which ones? If all group attributes can be changed, the scheduler could be cheated mainly by changing the runtime.

Of course, you can't change all group attributes.

Currently the "closed" state can be changed, and technically you could say that mxqkill also changes group attributes ("cancelled").

Yes, but that’s an implementation detail, and not obvious to the user. From the user’s perspective, group attributes couldn’t be changed until now.

With this PR, the excluded hosts can be changed. If I implement the boolean prerequisite expression, this should be changeable, too. We might consider allowing "root" to change the maximum run time, because we fulfilled some requests of that kind in a few exceptional cases in the past via direct SQL commands, which is error-prone.

Well that was always the exception, and until now not wanted.

Anyway, despite me disagreeing with the implementation, the main problem is the missing documented requirements, so it’s not possible to validate whether these requirements are fulfilled. Regardless of the implementation, the new feature will be useful to some users.

@donald
Contributor Author

donald commented Apr 14, 2020

I forgot to justify the whitelist. It’s because new cluster nodes can be added at any time. If a user requests a feature, it’s uncertain whether a new node fulfills it.
Okay, but this is another use case. @wwwutz would want his jobs to go into the new node as well, assuming that the failing node is the exception. Same with @arndt and dontpanic before this was identified as "missing ssse3".
I understood @arndt’s and the user’s request differently. They will figure out the requirements pretty early and will resubmit.

So let's hope they speak up on the whitelist vs. blacklist question.

Support for arbitrary boolean expressions is overkill in my opinion, and just adds unneeded complexity. Users have to test beforehand what cluster nodes support their requirements, so they will get a list of names quite easily.

The boolean expression looked complex, because I made a complex example. But we always wanted to be able to submit to "cuda nodes only", in which case the expression would be a simple "cuda". Or, in the @arndt case, the simple expression "ssse3" would do and would be better than blacklisting dontpanic and much better than whitelisting all hosts but dontpanic.
The boolean prerequisite expression could be used to blacklist or whitelist hosts, but I've added excluded-servers nonetheless, because with mxqset it's easier to add or remove individual hosts from the list. We could add whitelist-servers as well, if there is a use case.

Well, a whitelist in combination with hostconfig --list is a little more intuitive, as you do not need to negate.

So you think

mxqsub --whitelist-server "$(hostconfig --list todsuende)"

is more intuitive than e.g.

mxqsub --prerequisite todsuende

or whatever we call the option. And try to do

mxqsub --prerequisite '!todsuende'

with hostconfig expansion. The '!' gets pretty much in the way

mxqsub --whitelist-server  "$(hostconfig --list 'mxqd & !todsuende')"
-bash: !todsuende': event not found

I wasn't able to figure that out in an elegant way.

But you'd need to resubmit not only the failed jobs, but the "not yet started" jobs as well, which might be more complicated. And while you are sorting that out, the pipeline would be stopped. With this approach you could just exclude the bad host, and your pipeline would keep running while you identify and resubmit the failed jobs.

You should stop it anyway, because until mxqset is run, your jobs might be queued on the “bad” node again.

If you are looking at new jobs, you need to change your mxqsub command; "mxqset" is not needed and no help there. It's for already submitted jobs. As soon as you have used mxqset, the jobs will no longer be started on the excluded servers; they will be started on other servers. So there is no need to mxqkill the running group.

With this PR, the excluded hosts can be changed. If I implement the boolean prerequisite expression, this should be changeable, too. We might consider allowing "root" to change the maximum run time, because we fulfilled some requests of that kind in a few exceptional cases in the past via direct SQL commands, which is error-prone.

Well that was always the exception, and until now not wanted.

Correct. But "that was always" is not an argument by itself. And now it is wanted. As I wrote, Peter called for exactly this. @wwwutz, can you confirm or deny that?

Anyway, despite me disagreeing with the implementation, the main problem is the missing documented requirements, so it’s not possible to validate whether these requirements are fulfilled.

So let's wait for the users.

  • host whitelist ?
  • host blacklist ?
  • prerequisite expression?

We can have all three. A whitelist would be rather trivial to add now. And, as I said, the expression parser is finished, too. Just need to read the hostconfig and /proc/cpu flags.

@donald
Contributor Author

donald commented Apr 14, 2020

Whitelisting might have a nice additional application: we could add a server flag so that a server takes ONLY jobs which explicitly whitelisted it. That way we can run mxqd on special servers - e.g. for testing or special applications - and these servers wouldn't accidentally consume jobs not intended for them.

@wwwutz
Contributor

wwwutz commented Apr 14, 2020

Just my short notes: My idea was to have the ability to disable a host while a certain group is running, not when starting a group. Starting a group with a set of constraints is nice to have but does not solve the initial problem; it's a completely different solution to a different problem. To make it clear: if someone submits 10k jobs and after 4k jobs went through we find a single host is doing bad things ( no sse3 ), it would be most helpful to just disable that single host and let the 6k jobs finish. The few hundred which failed can be resubmitted with constraints, blacklist or whitelist.

So just a whitelist does not solve this.

I would be able to build a submitter which blacklists bad hosts on the fly, by first checking something like cpuinfo | grep sse3 || mxqset --disable $HOSTID $GROUPID && bailout and then doing the computation ( or not ). If the job lands on a bad machine, the next one of my group will not try that again.

I would be able to select the nodes by either a list supplied on mxqsub or, once submitted, prevent the jobs to run a second time when they failed.

I would even be able to submit cuda jobs on intel cpus with 2G level2 cache only... just check if all my requirements are met, and if not: disable this host and resubmit.

And yes, this is certainly not a feature which is easy peasy to handle. But a solution to actual problems.

@arndt
Contributor

arndt commented Apr 14, 2020

So - as a user - I would not quite know how to configure such a prerequisite expression. A week ago I was neither able to diagnose what was wrong with this julia-1.4 version nor did I have any idea what sse3 is. I had just seen that my job always crashes on dontpanic - and to blacklist just this host would be just fine for me. I would therefore prefer the blacklist option.

To have a whitelist would also be nice. This way one could target certain nodes, e.g. some host(s) where one has already some huge data deposited from a previous run, or just gpus.

To disable certain hosts while a group is already running would be nice, but in most of my applications not necessary (either the whole thing would just go through slower, or I would resubmit a few jobs to get sufficient statistics for some simulations). But other users might see this differently - I guess.

@donald
Contributor Author

donald commented Apr 14, 2020

#82 is on top of this and has whitelist-servers too.

@donald donald mentioned this pull request Apr 15, 2020
@donald
Contributor Author

donald commented Apr 15, 2020

closed for #83

@donald donald closed this Apr 15, 2020