Enforce/default to tmpdir usage #113
Comments
So the idea is that every job is started with its own (and guaranteed) $TMPDIR, even if not explicitly requested. That wouldn't help if people got used to using /scratch/local2, though. When a job can't be killed, this is a kernel bug to me, no matter whether disks are full or not.
53622 seems to be gone now. Maybe not deadlocks, but just slow?
Sorry for being unclear. The user came to us with the issue that killing the group worked, but the job was still listed. After deleting files on
I edited the paste and added the first time KILL was signaled to the process.
Hmmm. Yes, a job exiting 12 minutes after kill -9 doesn't look right. The above
Anyway, full disks are a problem (and might even produce wrong results if applications fail to check for write errors). So we should try to avoid that as much as possible. Enforcing a TMPDIR in managed space for every job would be one way to help in that regard. What would be a good TMPDIR default? 100G?
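To illustrate the point about write errors, here is a minimal sketch of what checking for a full disk could look like in Python. The helper name `write_checked` is made up for this example and is not part of any tool discussed here; the idea is only that the write is flushed and synced so that an out-of-space condition surfaces as an error instead of silently truncated output.

```python
import errno
import os


def write_checked(path, data):
    """Write bytes to path. Return False if the disk is full; re-raise other errors."""
    try:
        with open(path, "wb") as fh:
            fh.write(data)
            fh.flush()
            # Force the data out to the device so ENOSPC shows up here
            # rather than being lost in a buffer.
            os.fsync(fh.fileno())
    except OSError as exc:
        if exc.errno == errno.ENOSPC:
            return False  # no space left on device
        raise
    return True
```

A caller would treat a `False` return as a failed job step rather than continuing with incomplete data.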
(Aren’t you on vacation? ;-)) I would have said 10 GB, to make users more aware of what their programs are doing and to not exclude too many nodes (which shouldn’t happen too often anymore, though, once this is the default). It could always be increased in case it impairs too many users. But 100 GB is also fine.
This just crashed some of my jobs, which apparently needed more tmpdir ...
Sorry, but see it this way: Your jobs also used to crash when $TMPDIR was filled by other users. Now, after you've set --tmpdir to what you need, you are guaranteed to have this space. If someone fills /scratch/local2, your job won't start and die on that node but will start on another node which has the requested disk space.
All good, having this guarantee is good, but now I also have to think about how much tmp space I need. In my case I didn't even know that I needed tmp space, but apparently some internal routine needs it to store a gzip-decompressed file when I read it.
Keep doubling until enough....
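One way to find out how much temporary space a program really uses is to measure the directory while it runs, or just before it exits. The following is only a sketch: the function names are invented for this example, and it reports apparent file sizes, which can differ from the blocks actually allocated on disk.

```python
import os
import tempfile


def dir_usage_bytes(root):
    """Sum the apparent sizes of all files below root (directory symlinks are not followed)."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                total += os.lstat(path).st_size
            except FileNotFoundError:
                pass  # file vanished between listing and stat
    return total


def tmpdir_usage_bytes():
    """Usage of the directory Python's tempfile module picks (it honours $TMPDIR)."""
    return dir_usage_bytes(tempfile.gettempdir())
```

Calling `tmpdir_usage_bytes()` at a few points in a test run gives a rough number to request for the real job.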
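The doubling approach can be sketched as below. `job_succeeds` is a hypothetical callback that would submit the job with the given tmpdir size and report whether it finished; it stands in for whatever submission wrapper is used and is not a real scheduler API. The 10 GB start and the 1000 GB cap are assumptions taken from the numbers mentioned in this thread.

```python
def find_tmpdir_size(job_succeeds, start_gb=10, max_gb=1000):
    """Double the requested tmpdir size until the job succeeds.

    Returns the first size (in GB) that worked, or None if even the
    largest size tried within max_gb was not enough.
    """
    size = start_gb
    while size <= max_gb:
        if job_succeeds(size):
            return size
        size *= 2  # not enough space: retry with twice as much
    return None
```

Doubling overshoots the real requirement by at most a factor of two and needs only a handful of retries, which is usually acceptable for a one-off calibration.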
As we have cluster nodes with a small (that means 1 TB) TMPDIR (/scratch/local2), and users sporadically fill up the temporary space, causing run-time issues, let’s enforce tmpdir usage with some default.

Example problem: Jobs in uninterruptible sleep cannot be killed: