
Add disabled servers #81

Closed
wants to merge 24 commits

Conversation


@donald donald commented Apr 13, 2020

No description provided.


@pmenzel pmenzel left a comment


Thank you for the implementation. I also thought about this, and believe it should be a white list.

+    "  -H, --hosts=LIST         Whitelist cluster nodes (hosts) on which job may run\n"
+    "                           (default: all)\n"
+    "\n"
+    "                           If a job requires a certain feature on a cluster node like\n"
+    "                           a processor instruction, this allows listing the cluster nodes\n"
+    "                           with that feature\n"
+    "\n"
+    "        [LIST] Hostnames separated by whitespace\n"
+    "               Example: "$(hostconfigctl --list todsuende)"\n"
+    "               Example: "kreios uselessbox"\n"

mxqset is a nice idea, but breaks with the current design of MXQ, where groups are immutable. I think users wanting to restrict their jobs to certain features of cluster nodes should test these features beforehand and then submit this list with their job.

@@ -1277,6 +1288,26 @@ unsigned long start_job(struct mxq_group_list *glist)

/**********************************************************************/

static int server_is_qualified(struct mxq_server *server, struct mxq_group *group) {
int is_qualified = 1;
Contributor


Do we have type bool?

Contributor Author

@donald donald Apr 14, 2020


Is there a single agreed-on standard? Some revisions of the C language have "bool" or "_Bool", but not all. Many C libraries, frameworks, and applications bring their own definitions, all different. It's a mess. If you include <stdbool.h> you might get into conflict with a self-defined "bool".

All this is very unclear to me. And if there is no single standard, you don't know the semantics of some self-invented boolean type definition unless you consult the individual sources or documentation. Is it a macro, a typedef, an enum, a compiler built-in? Can you use arithmetic operators on it (e.g. xor)? Is the "bool" from one library compatible with the "bool" from another one? What is the size and alignment? Can you pack it? Does the code spend cpu cycles converting from a boolean expression to this boolean type? What is the legal range of values?

On the other hand, if you just use a standard C type instead, a C programmer immediately knows all features of the type.

If the user-defined boolean type resolves to some integral type (which it usually does, because everything else doesn't play well with boolean expressions), there will be no added type safety, because everything will be promoted.

There might be some kind of documentation effect. But often you can get the same by picking good names for variables and parameters, e.g. including "is" in the name as is done here.

Just my own spontaneous thoughts about that, but I'd like to learn. Can you point me to a good argument for which standard should be used?

Avoid an unneeded warning when scanning JOB_TMPDIR_MNTDIR for possible
leftover job mounts. The parent directory of JOB_TMPDIR_MNTDIR is
created when the first job with the --tmpdir feature is started, so
it won't exist on a new server.

This cleanup path is only used in the exceptional case that the usual
cleanup path (via job_has_finished) didn't succeed, e.g. when mxqd was
killed. The cleanup is not essential for the currently running server.
Add function attribute ((unused)) to avoid a compiler warning when the
static inline function defined in the header file is not used by the
compilation unit.
Add header file for the quasi-standard [1] xmalloc call.

Include x-version of strndup. Other functions (e.g. xrealloc) can be
added as needed.

The current approach implemented in mx_util is to wait and retry on an
ENOMEM malloc failure. However, this overhead doesn't seem to be
justified, because it is difficult to imagine a case where malloc
would fail with ENOMEM at one time and a retry would succeed.

[1] https://www.gnu.org/software/libc/manual/html_node/Malloc-Examples.html
Add a utility module which can store a set of keywords. The set can be
created and updated from a string and can be serialized into a string.

The string representation is a space-separated list of keywords. Strings
created by this utility contain the keywords stored in the set, sorted
in lexical order and separated by a single space character.

Input strings used to create or update a keyword set contain keywords
separated by whitespace. Words from the string are added to the set
unless a word starts with '-', in which case it will be removed from the
set.

Usage example:

    struct keywordset *kws = keywordset_new("xxx yyy");
    keywordset_update(kws, "-yyy aaaa");

    char *s = keywordset_get(kws);   // s is now "aaaa xxx"
    free(s);                         // caller must free()

    if (keywordset_ismember(kws, "xxx")) ... // true

    keywordset_free(kws);
Add a new option so that specific servers can be excluded from starting
jobs for this group. E.g.:

    mxqsub --disabled-servers="dontpanic sigusr2" sleep 100
Add a new command which can be used to modify an existing group:

    mxqset group 123 --closed
    mxqset group 123 --open
    mxqset group 123 --disabled-servers=dontpanic
    mxqset group 123 --update-disabled-servers="sigusr sigusr2 -dontpanic"
    mxqset group 123 --disabled-servers=

The flags open and closed can be set from mxqadmin as well
(`mxqadmin --close=123`), but that synopsis, in which the options take
the group id as a value, is difficult to extend to options which take a
value themselves.
Check whether this server is qualified to start jobs from a group.
Lazy-evaluate the qualification criteria and cache the result.
Don't start jobs we are not qualified for.

For now, the only qualification criterion is that our short or fully
qualified hostname is not included in the excluded_servers set of the
group.

This can later be expanded to additional criteria (e.g. hostconfig or
processor flags).
@donald
Contributor Author

donald commented Apr 14, 2020

Thank you for the implementation. I also thought about this, and believe it should be a white list.

Hmm, I have a feature in the queue so that you could use arbitrary boolean expressions from cpu feature flags or hostconfig flags ("cuda & !pcpool & ssse3"). I could register the short hostname as well, and then you could use that to select specific hosts, too ("ira|kreios|luxuria").

mxqset is a nice idea, but breaks with the current design of MXQ, where groups are immutable.

Yes. At first I was reluctant to change that model, too. But Peter ( @wwwutz not @arndt ) convinced me that this feature would be useful enough to justify getting rid of this principle. The model is that you have several jobs in the queue. They complete one by one, and you recognize that jobs executed by a specific server fail, although you don't yet know why. So it would be useful if you could just exclude that specific server for the group without the need to cancel the group and resubmit everything unfinished.

So what is the disadvantage of adding a read/write group attribute? It might be a bit surprising that an mxqsub command might go into a new group because someone changed the attributes of a previous group. But it's only surprising when you've already internalized the previous rules. Plus, this has already been weakened a bit, because you can already close a group with mxqadmin.

@pmenzel
Contributor

pmenzel commented Apr 14, 2020

Thank you for the implementation. I also thought about this, and believe it should be a white list.

Hmm, I have a feature in the queue so that you could use arbitrary boolean expressions from cpu feature flags or hostconfig flags ("cuda & !pcpool & ssse3"). I could register the short hostname as well, and then you could use that to select specific hosts, too ("ira|kreios|luxuria").

I forgot to justify the whitelist. It’s because new cluster nodes can be added at any time. If a user requests a feature, it’s uncertain whether a new node fulfills it.

Support for arbitrary boolean expressions is overkill in my opinion, and just adds unneeded complexity. Users have to test beforehand what cluster nodes support their requirements, so they will get a list of names quite easily.

Semantics also become quite confusing, as the expression depends on the question asked: ira|kreios|luxuria or ira&kreios&luxuria.

mxqset is a nice idea, but breaks with the current design of MXQ, where groups are immutable.

Yes. At first I was reluctant to change that model, too. But Peter ( @wwwutz not @arndt ) convinced me that this feature would be useful enough to justify getting rid of this principle. The model is that you have several jobs in the queue. They complete one by one, and you recognize that jobs executed by a specific server fail, although you don't yet know why. So it would be useful if you could just exclude that specific server for the group without the need to cancel the group and resubmit everything unfinished.

Killing the group would still finish the running jobs, wouldn’t it? So I do not see the advantage of using mxqset and resubmitting the failed jobs over mxqkill and resubmitting (the not yet finished jobs). The unfinished/failed job has to be resubmitted anyway, so the pipeline should support resubmitting jobs anyway.

So what is the disadvantage of adding a read/write group attribute? It might be a bit surprising that an mxqsub command might go into a new group because someone changed the attributes of a previous group. But it's only surprising when you've already internalized the previous rules. Plus, this has already been weakened a bit, because you can already close a group with mxqadmin.

Can all group attributes be changed? If not, the users might ask, which ones? If all group attributes can be changed, the scheduler could be cheated mainly by changing the runtime.

@donald
Contributor Author

donald commented Apr 14, 2020

I forgot to justify the whitelist. It’s because new cluster nodes can be added at any time. If a user requests a feature, it’s uncertain whether a new node fulfills it.

Okay, but this is another use case. @wwwutz would want his jobs to go into the new node as well, assuming that the failing node is the exception. Same with @arndt and dontpanic before this was identified as "missing ssse3".

Support for arbitrary boolean expressions is overkill in my opinion, and just adds unneeded complexity. Users have to test beforehand what cluster nodes support their requirements, so they will get a list of names quite easily.

The boolean expression looked complex, because I made a complex example. But we always wanted to be able to submit to "cuda nodes only", in which case the expression would be a simple "cuda". Or, in the @arndt case, the simple expression "ssse3" would do and would be better than blacklisting dontpanic and much better than whitelisting all hosts but dontpanic.

The boolean prerequisite expression could be used to blacklist or whitelist hosts, but I've added excluded-servers nonetheless, because with mxqset it's easier to add or remove individual hosts from the list. We could add whitelist-servers as well, if there is a use case.

Semantics also become quite confusing, as the expression depends on the question asked: ira|kreios|luxuria or ira&kreios&luxuria.

Well, on "ira", "ira" is true and "kreios" and "luxuria" are false. So "ira|kreios|luxuria" is correct, because it evaluates to true on these three hosts and to false on any other. "ira&kreios&luxuria" would always be false. Is that confusing?

mxqset is a nice idea, but breaks with the current design of MXQ, where groups are immutable.

Yes. At first I was reluctant to change that model, too. But Peter ( @wwwutz not @arndt ) convinced me that this feature would be useful enough to justify getting rid of this principle. The model is that you have several jobs in the queue. They complete one by one, and you recognize that jobs executed by a specific server fail, although you don't yet know why. So it would be useful if you could just exclude that specific server for the group without the need to cancel the group and resubmit everything unfinished.

Killing the group would still finish the running jobs, wouldn’t it? So I do not see the advantage of using mxqset and resubmitting the failed jobs over mxqkill and resubmitting (the not yet finished jobs). The unfinished/failed job has to be resubmitted anyway, so the pipeline should support resubmitting jobs anyway.

But you'd need to resubmit not only the failed jobs, but the "not yet started" jobs as well, which might be more complicated. And while you are sorting that out, the pipeline would be stopped. With this approach you could just exclude the bad host, and your pipeline would keep running while you identify and resubmit the failed jobs.

Another difference would be that, if the issue on the failing host is resolved, you could use "mxqset" to remove the host from the excluded servers list and wouldn't need to resubmit at all.

So what is the disadvantage of adding a read/write group attribute? It might be a bit surprising that an mxqsub command might go into a new group because someone changed the attributes of a previous group. But it's only surprising when you've already internalized the previous rules. Plus, this has already been weakened a bit, because you can already close a group with mxqadmin.

Can all group attributes be changed? If not, the users might ask, which ones? If all group attributes can be changed, the scheduler could be cheated mainly by changing the runtime.

Of course, you can't change all group attributes.

Currently the "closed" state can be changed, and technically you could say that mxqkill also changes group attributes ("cancelled").

With this PR, the excluded hosts can be changed. If I implement the boolean prerequisite expression, this should be changeable, too. We might consider allowing "root" to change the maximum run time, because we fulfilled some requests of that kind in a few exceptional cases in the past via direct SQL commands, which is error-prone.

The optimizing compiler predicts that pointer != NULL is more likely
than pointer == NULL anyway. The code generated by "gcc -O2" is the
same with or without the __builtin_expect builtin.

However, probably based on special knowledge of malloc, either because
of its __attribute__ ((malloc)) declaration or hard-coded, the
optimization without the __builtin_expect builtin seems to be a bit
stronger: the function out_of_memory() is put into a .text.unlikely
code segment, as if attribute ((cold)) had been declared for it.

Remove __builtin_expect.
@pmenzel
Contributor

pmenzel commented Apr 14, 2020

I forgot to justify the whitelist. It’s because new cluster nodes can be added at any time. If a user requests a feature, it’s uncertain whether a new node fulfills it.

Okay, but this is another use case. @wwwutz would want his jobs to go into the new node as well, assuming that the failing node is the exception. Same with @arndt and dontpanic before this was identified as "missing ssse3".

I understood @arndt’s and the user’s request differently. They will figure out the requirements pretty early and will resubmit.

Support for arbitrary boolean expressions is overkill in my opinion, and just adds unneeded complexity. Users have to test beforehand what cluster nodes support their requirements, so they will get a list of names quite easily.

The boolean expression looked complex, because I made a complex example. But we always wanted to be able to submit to "cuda nodes only", in which case the expression would be a simple "cuda". Or, in the @arndt case, the simple expression "ssse3" would do and would be better than blacklisting dontpanic and much better than whitelisting all hosts but dontpanic.

The boolean prerequisite expression could be used to blacklist or whitelist hosts, but I've added excluded-servers nonetheless, because with mxqset it's easier to add or remove individual hosts from the list. We could add whitelist-servers as well, if there is a use case.

Well, a whitelist in combination with hostconfig --list is a little more intuitive, as you do not need to negate.

Semantics also become quite confusing, as the expression depends on the question asked: ira|kreios|luxuria or ira&kreios&luxuria.

Well, on "ira", "ira" is true and "kreios" and "luxuria" are false. So "ira|kreios|luxuria" is correct, because it evaluates to true on these three hosts and to false on any other. "ira&kreios&luxuria" would always be false. Is that confusing?

Yes, because you ask the question “Which systems have that feature?” and not “On which system should it run?”.

mxqset is a nice idea, but breaks with the current design of MXQ, where groups are immutable.

Yes. At first I was reluctant to change that model, too. But Peter ( @wwwutz not @arndt ) convinced me that this feature would be useful enough to justify getting rid of this principle. The model is that you have several jobs in the queue. They complete one by one, and you recognize that jobs executed by a specific server fail, although you don't yet know why. So it would be useful if you could just exclude that specific server for the group without the need to cancel the group and resubmit everything unfinished.

Killing the group would still finish the running jobs, wouldn’t it? So I do not see the advantage of using mxqset and resubmitting the failed jobs over mxqkill and resubmitting (the not yet finished jobs). The unfinished/failed job has to be resubmitted anyway, so the pipeline should support resubmitting jobs anyway.

But you'd need to resubmit not only the failed jobs, but the "not yet started" jobs as well, which might be more complicated. And while you are sorting that out, the pipeline would be stopped. With this approach you could just exclude the bad host, and your pipeline would keep running while you identify and resubmit the failed jobs.

You should stop it anyway, because until mxqset is run, your jobs might be queued on the “bad” node again.

Another difference would be that, if the issue on the failing host is resolved, you could use "mxqset" to remove the host from the excluded servers list and wouldn't need to resubmit at all.

So what is the disadvantage of adding a read/write group attribute? It might be a bit surprising that an mxqsub command might go into a new group because someone changed the attributes of a previous group. But it's only surprising when you've already internalized the previous rules. Plus, this has already been weakened a bit, because you can already close a group with mxqadmin.

Can all group attributes be changed? If not, the users might ask, which ones? If all group attributes can be changed, the scheduler could be cheated mainly by changing the runtime.

Of course, you can't change all group attributes.

Currently the "closed" state can be changed, and technically you could say that mxqkill also changes group attributes ("cancelled").

Yes, but that’s an implementation detail, and not obvious to the user. From the user’s perspective, group attributes couldn’t be changed until now.

With this PR, the excluded hosts can be changed. If I implement the boolean prerequisite expression, this should be changeable, too. We might consider allowing "root" to change the maximum run time, because we fulfilled some requests of that kind in a few exceptional cases in the past via direct SQL commands, which is error-prone.

Well that was always the exception, and until now not wanted.

Anyway, despite me disagreeing with the implementation, the main problem is the missing documented requirements, so it’s not possible to validate whether these requirements are fulfilled. Regardless of the implementation, the new feature will be useful to some users.

@donald
Contributor Author

donald commented Apr 14, 2020

I forgot to justify the whitelist. It’s because new cluster nodes can be added at any time. If a user requests a feature, it’s uncertain whether a new node fulfills it.
Okay, but this is another use case. @wwwutz would want his jobs to go into the new node as well, assuming that the failing node is the exception. Same with @arndt and dontpanic before this was identified as "missing ssse3".
I understood @arndt’s and the user’s request differently. They will figure out the requirements pretty early and will resubmit.

So let's hope they speak up on the whitelist vs. blacklist question.

Support for arbitrary boolean expressions is overkill in my opinion, and just adds unneeded complexity. Users have to test beforehand what cluster nodes support their requirements, so they will get a list of names quite easily.

The boolean expression looked complex, because I made a complex example. But we always wanted to be able to submit to "cuda nodes only", in which case the expression would be a simple "cuda". Or, in the @arndt case, the simple expression "ssse3" would do and would be better than blacklisting dontpanic and much better than whitelisting all hosts but dontpanic.
The boolean prerequisite expression could be used to blacklist or whitelist hosts, but I've added excluded-servers nonetheless, because with mxqset it's easier to add or remove individual hosts from the list. We could add whitelist-servers as well, if there is a use case.

Well, a whitelist in combination with hostconfig --list is a little more intuitive, as you do not need to negate.

So you think

mxqsub --whitelist-server "$(hostconfig --list todsuende)"

is more intuitive than e.g.

mxqsub --prerequisite todsuende

or whatever we call the option. And try to do

mxqsub --prerequisite '!todsuende'

with hostconfig expansion. The '!' gets pretty much in the way

mxqsub --whitelist-server  "$(hostconfig --list 'mxqd & !todsuende')"
-bash: !todsuende': event not found

I wasn't able to figure that out in an elegant way.

But you'd need to resubmit not only the failed jobs, but the "not yet started" jobs as well, which might be more complicated. And while you are sorting that out, the pipeline would be stopped. With this approach you could just exclude the bad host, and your pipeline would keep running while you identify and resubmit the failed jobs.

You should stop it anyway, because until mxqset is run, your jobs might be queued on the “bad” node again.

If you are looking at new jobs, you need to change your mxqsub command; "mxqset" is not needed and no help there. It's for already submitted jobs. As soon as you have used mxqset, the jobs will no longer be started on the excluded servers; they will be started on other servers. So there is no need to mxqkill the running group.

With this PR, the excluded hosts can be changed. If I implement the boolean prerequisite expression, this should be changeable, too. We might consider allowing "root" to change the maximum run time, because we fulfilled some requests of that kind in a few exceptional cases in the past via direct SQL commands, which is error-prone.

Well that was always the exception, and until now not wanted.

Correct. But "that was always" is not an argument by itself. And now it is wanted. As I wrote, Peter called for exactly this. @wwwutz, can you confirm or deny that?

Anyway, despite me disagreeing with the implementation, the main problem is the missing documented requirements, so it’s not possible to validate whether these requirements are fulfilled.

So let's wait for the users.

  • host whitelist ?
  • host blacklist ?
  • prerequisite expression?

We can have all three. A whitelist would be rather trivial to add now. And, as I said, the expression parser is finished, too. Just need to read the hostconfig and /proc/cpu flags.

@donald
Contributor Author

donald commented Apr 14, 2020

Whitelisting might have a nice additional application: we could add a server flag so that a server takes ONLY jobs which explicitly whitelisted it. That way we can run mxqd on special servers - e.g. for testing or special applications - and these servers wouldn't accidentally consume jobs not intended for them.

@wwwutz
Contributor

wwwutz commented Apr 14, 2020

Just my short notes: My idea was to have the ability to disable a host while a certain group is running, not when starting a group. Starting a group with a set of constraints is nice to have but does not solve the initial problem; it's a completely different solution to a different problem. To make it clear: if someone submits 10k jobs and after 4k jobs went through we find a single host is doing bad things ( no sse3 ), it would be most helpful to just disable that single host and let the 6k jobs finish. The few hundred which failed can be resubmitted with constraints, blacklist or whitelist.

So just a whitelist does not solve this.

I would be able to build a submitter which blacklists bad hosts on the fly, by first checking something like cpuinfo | grep sse3 || mxqset --disable $HOSTID $GROUPID && bailout and then doing the computation ( or not ). If the job lands on a bad machine, the next one of my group will not try that again.

I would be able to select the nodes by either a list supplied on mxqsub or, once submitted, prevent the jobs to run a second time when they failed.

I would even be able to submit cuda jobs on intel cpus with 2G level2 cache only... just check if all my requirements are met, and if not: disable this host and resubmit.

And yes, this is certainly not a feature which is easy peasy to handle. But a solution to actual problems.

@arndt
Contributor

arndt commented Apr 14, 2020

So - as a user - I would not quite know how to configure such a prerequisite expression. A week ago I was neither able to diagnose what was wrong with this julia-1.4 version nor did I have any idea what sse3 is. I had just seen that my job always crashes on dontpanic - and to blacklist just this host would be just fine for me. I would therefore prefer the blacklist option.

To have a whitelist would also be nice. This way one could target certain nodes, e.g. some host(s) where one has already some huge data deposited from a previous run, or just gpus.

To disable certain hosts while a group is already running would be nice, but in most of my applications not necessary (either the whole thing would just go through slower, or I would resubmit a few jobs to get sufficient statistics for some simulations). But other users might see this differently - I guess.

@donald
Contributor Author

donald commented Apr 14, 2020

#82 is on top of this and has whitelist-servers too.

@donald donald mentioned this pull request Apr 15, 2020
@donald
Contributor Author

donald commented Apr 15, 2020

closed for #83

@donald donald closed this Apr 15, 2020