
gpu-init job-release releases lock too early #138

Closed
donald opened this issue Dec 22, 2022 · 2 comments · Fixed by #151

Comments

@donald
Contributor

donald commented Dec 22, 2022

```
rm $d/pid
```

We should keep the lock while we modify the access file, otherwise we race with a new allocation.
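A minimal sketch of what keeping the lock through the cleanup could look like, assuming `flock(1)` as the locking primitive; the slot directory, lock file, and `access` file here are made up for illustration and stand in for the real script's layout:

```shell
#!/bin/sh
# Hypothetical layout (the real gpu-init script's paths differ):
# one directory per GPU slot, a shared lock file, and an "access"
# file whose content stands in for the ownership change.
base=$(mktemp -d)
d=$base/slot0
lock=$base/lock
mkdir -p "$d"

echo 4711 > "$d/pid"           # slot is taken by the finished job
echo "uid=1000" > "$d/access"

# Cleanup for the finished job: take the lock, and only while holding
# it remove the pid file AND reset the access file. A concurrent
# allocation also runs under this lock, so it can never observe
# "slot free" while the access file is still being rewritten.
(
    flock -x 9
    rm "$d/pid"
    echo "uid=root" > "$d/access"
) 9> "$lock"

cat "$d/access"                # slot is free again and back to root
```

With `rm $d/pid` outside the critical section, a new allocation can grab the slot between the `rm` and the ownership reset; inside it, the two steps are atomic with respect to allocations.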

@thomas
Contributor

thomas commented Dec 22, 2022

Wow, this was fast!

So removal must be done at a different location in the script, or will it be kind of an extra/cleanup task?

@donald
Contributor Author

donald commented Dec 22, 2022

This is not the answer to #139!

I just spotted this. It's a race. I think the bad outcome is mostly theoretical, because it is nearly impossible to hit in practice. The result would be that the CUDA access files are owned by root and not by the UID of the job, so that accessing the GPUs would fail.

```
  MXQ             job1                        job2
* fork job1
                  * other initialization
                  * reserve gpu:
                  * * find slot without pid
                  * * change access to UID
                  * run user program
                  * exit
* fork job2
                                              * other initialization
* cleanup job 1:
* * rm .../pid
                                              * reserve gpu:
                                              * * find slot without pid
                                              * * change access to UID
* * change access to root
                                              * run user program
```

So now job2 would have no access to its gpu.
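The interleaving above can only happen because the slot scan and the ownership change are not atomic with respect to the cleanup. A hedged sketch of the allocation side, using the same made-up layout and `flock(1)` as before (not the actual script), shows the two steps as one critical section:

```shell
#!/bin/sh
# Hypothetical slot layout, as in the sketch above: the point is that
# "find slot without pid" and "change access to UID" form ONE critical
# section, so a cleanup running under the same lock can never
# interleave between them as in the timeline.
base=$(mktemp -d)
d=$base/slot0
lock=$base/lock
mkdir -p "$d"
echo "uid=root" > "$d/access"      # slot starts out free

(
    flock -x 9
    if [ ! -e "$d/pid" ]; then         # find slot without pid ...
        echo $$ > "$d/pid"             # ... claim it ...
        echo "uid=1000" > "$d/access"  # ... and change access to the UID
    fi
) 9> "$lock"

cat "$d/access"
```

Since cleanup and allocation both take the exclusive lock, job2 either sees the slot before cleanup starts (pid still present, slot skipped) or after it has fully finished, never the half-cleaned state.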

@donald donald mentioned this issue Dec 30, 2023