Skip to content

next #151

Merged
merged 24 commits into from
Feb 17, 2024
Merged

next #151

merged 24 commits into from
Feb 17, 2024

Conversation

donald
Copy link
Contributor

@donald donald commented Dec 30, 2023

Fixes #145
Fixes #140
Adresses #139 by adding temporary gpu setup debug messages (#141)
Fixes #138
Fixes #144
Fixes #136
Adresses #56
Adresses #153
Fixes #146
Fixes #46
Fixes #156

@donald donald changed the title mxqsub: Don't fail with --umask 000 next Dec 30, 2023
Currently, the gpu lock file `pid` is released (removed) to early, so
that there is a small race condition with a new GPU allocation:

```
   MXQ           job1                        job2
* fork job1
                 * other initialization
                 * reserve gpu:
                 * * find slot without pid
                 * * change access to UID
                 * run user program
                 * exit
* fork job2
                                             *  other initialization
* cleanup job 1:
* * rm .../pid
                                             * reserve gpu:
                                             * * find slot without pid
                                             * * change access to UID
* * change access to root
```

On release, keep the `pid` file until after the access mode has been
changed back to root.
Bison-3.4.2 together with llvm 15.0.4 produce the warning

    parser.tab.c:1078:9: warning: variable 'yynerrs' set but not used

Disable this warning for the parser.tab.c compilation.
@donald donald force-pushed the next branch 2 times, most recently from c77eb3e to 70a0089 Compare January 10, 2024 16:53
If we can't write the spool file for any reason, do not write to stderr,
as this is the users stderr of the mxq job. Just wait and retry until
success.
Add default characer set option to the mxq tables, so that we don't
depend on the default character set of the database or the server.
A select on `(group_jobs_inq > 0 OR group_jobs_running > 0)` is done in
the main loop on each daemon. Add two indexes to drastically reduce the
load on the MySQL server.
Allow mx_mysql_statement_close* functions to be called with a pointer to
a NULL value. This makes the daemon more robust when the functions are
used as cleanup functions, e.g.

    __attribute__((cleanup(mx_mysql_statement_close)))
    struct mx_mysql_stmt *stmt = mx_mysql_statement_prepare(mysql,"...");
    if (!stmt) {
        fprintf(stderr, "mx_mysql_stmt_prepare: %s\n", mx_mysql_error());
        return;
    }
The only caller of the static find_short_option() function doesn't call
it with a NULL pointer value as the third argument:

    idx = find_short_option(optctl->options, &optctl->_unhandled_shortopts, &optctl->optarg);

So it is not necessary for the function to check the pointer for NULL.

The disadvantage of doing is, that the llvm static analyzer concludes
from the check, that it was possible that the function would be called
with a NULL pointer and questions a later access in the same function,
which is unchecked:

    mx_getopt.c:158:16: warning: Dereference of null pointer (loaded from variable 'optarg') [core.NullDereference]
           *optarg = *name;
            ~~~~~~ ^
    1 warning generated.

Remove the check.
Remove the warning for the default runtime option in mxqsub. Users found
the message "option '--runtime' or '-t' not used. Your job will get
killed if it runs longer than the default of 15 minutes" annoying,
especially as there's no equivalent warning for the default memory
limit.
After commit 2e80051 ("mx_mysql: Fix warning in
mx_mysql_option_set_default_file"), mx_mysql_option_set_default_file()
no longer outputs a warning by itself when the file is not readable.
MySQL itself silently ignores unreadable config files.

Make all callers handle the error.
All callers were made to handle errors returned by
mx_mysql_option_set_default_file() in the previous commit. Now add the
function attribute `warn_unused_result` to it, so that future callers
will do, too.
mx_mysql_option_set_default_group() and mx_mysql_option_set_reconnect()
can't fail, so make them void functions.
Currently, setup_cronolog() returns on errors. However, in most error
paths, it leaks one or more file descriptors.  The leaked descriptors
trigger warnings in the GCC 13 static analyzer.

As the only caller of this functions terminates the program on any error
anyway, don't fix the file descriptor leaks, but terminate right after
any error from inside the function.
Newer Valgrind versions complain, when `execv()` is called with
argv==NULL or with argv[0]==NULL.

argv == NULL is not allowed by Posix, but the Linux kernel transforms
that into an empty argument list. An empty argument list with
argv[0]==NULL does not go against any specs, altough it might confuse
programms and led to pkexec being exploitable (CVE-2021-4034)

It is not wrong on Linux to call execv() or a wrapper function with
argv==NULL, but avoid it anyway to prevent Valgrind warnings.
Commit d25a77e ("mxq_job: Remove unused job_flags column") accidentally
removed group_flags together with job_flags from the create_tables.sql
file. Undo.
Sign in to join this conversation on GitHub.