
Enforce/default to tmpdir usage #113

Closed
pmenzel opened this issue Oct 14, 2021 · 11 comments · Fixed by #133

Comments

@pmenzel
Contributor

pmenzel commented Oct 14, 2021

As we have cluster nodes with small TMPDIR space (1 TB on /scratch/local2), and users sporadically fill up that temporary space and cause run-time issues, let’s enforce tmpdir usage with some default.

Example problem: Jobs in uninterruptible sleep cannot be killed:

@internetguide:~$ uname -a
Linux internetguide.molgen.mpg.de 5.10.24.mx64.375 #1 SMP Fri Mar 19 12:29:21 CET 2021 x86_64 GNU/Linux
@internetguide:~$ ps aux | grep XXX
XXX 53622 24.5  0.0   7860  3424 ?        D    15:55   4:25 awk BEGIN{OFS="\t"}NF>=11{$1=$1"/1"; print} /scratch/local2/juicer_job_tmpdir/63521/splits/mpimg_L18466-1_906-02-8_S102_R1_001.fastq.gz_sort.sam
@internetguide:~$ sudo more /proc/53622/stack
[<0>] __flush_work+0x142/0x1c0
[<0>] xfs_file_buffered_aio_write+0x2d2/0x320
[<0>] new_sync_write+0x11f/0x1b0
[<0>] vfs_write+0x218/0x280
[<0>] ksys_write+0xa1/0xe0
[<0>] do_syscall_64+0x33/0x40
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
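For reference, a quick way to spot such stuck processes on a node is to list everything in uninterruptible sleep together with its kernel wait channel (a generic sketch, not specific to mxq; column widths and the <PID> placeholder are arbitrary):

# list processes in D state ("uninterruptible sleep") with their kernel wait channel
ps -eo pid,user,stat,wchan:32,cmd --no-headers | awk '$3 ~ /^D/'
# for a single suspicious PID, the kernel stack shows where it is blocked
sudo cat /proc/<PID>/stack
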
@donald
Contributor

donald commented Oct 15, 2021

So the idea is that every job is started with its own (and guaranteed) $TMPDIR, even if it is not explicitly requested with --tmpdir, right? I had this in mind from the very beginning. I don't remember if somebody was against it? Anyway, we wanted to see first whether this continuous creation and mounting of filesystems runs into problems. It didn't. So I agree, we should do that.

It wouldn't help if people got used to using /scratch/local2 directly, though.

When a job can't be killed, that is a kernel bug to me, no matter whether the disks are full or not.
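As an illustration of the "creation and mounting of filesystems" mentioned above, a per-job TMPDIR with a hard size limit could look roughly like this (a minimal sketch only; it needs root, and the image path, job id 12345 and the 10G size are made up for illustration — this is not necessarily how mxqd implements it):

# create a sparse image file with the size requested for this job
truncate -s 10G /scratch/local2/mxqd_tmp_job_12345.img
mkfs.xfs -q /scratch/local2/mxqd_tmp_job_12345.img
# mount it as the job's private TMPDIR; the job cannot use more than 10G
mkdir -p /scratch/local2/mxqd_tmp_job_12345
mount -o loop /scratch/local2/mxqd_tmp_job_12345.img /scratch/local2/mxqd_tmp_job_12345
export TMPDIR=/scratch/local2/mxqd_tmp_job_12345
# ... run the job ...
# clean up afterwards
umount /scratch/local2/mxqd_tmp_job_12345
rm -f /scratch/local2/mxqd_tmp_job_12345.img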

@donald
Contributor

donald commented Oct 15, 2021

53622 seems to be gone now. Maybe not a deadlock, but just slow?

@pmenzel
Contributor Author

pmenzel commented Oct 15, 2021

53622 seems to be gone now. Maybe not a deadlock, but just slow?

Sorry for being unclear. The user came to me with the issue that killing the group worked, but the job was still listed. After deleting files on /scratch/local2/ to free up some space, job 32522716 was successfully killed and gone.

@pmenzel
Contributor Author

pmenzel commented Oct 15, 2021

2021-10-14 16:02:32 +0200 mxqd[1844]: job=XXX(15013):291947:32522716 cancelled
2021-10-14 16:02:32 +0200 mxqd[1844]: sending signal=15 to job=XXX(15013):291947:32522716
[…]
2021-10-14 16:03:05 +0200 mxqd[1844]: sending signal=9 to job=XXX(15013):291947:32522716
[…]
2021-10-14 16:15:35 +0200 mxqd[1844]: sending signal=9 to job=XXX(15013):291947:32522716
[…]
2021-10-14 16:15:58 +0200 mxqd[1844]: job finished (via fspool) : job 32522716 pid 63520 status 15

@pmenzel
Contributor Author

pmenzel commented Oct 15, 2021

I edited the paste and added the first time KILL was signaled to the process.

@donald
Contributor

donald commented Oct 15, 2021

Hmmm. Yes, a job exiting 12 minutes after kill -9 doesn't look right. The above awk writes to stdout (which seems to go into a /scratch/local2 xfs file). awk probably does not check for write errors, so it might just keep going when the disk is full. We could simulate that to find out whether we can trigger the problem.
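One way to simulate it (a rough sketch; the sizes and paths are arbitrary, and the loop mount needs root):

# build a small xfs filesystem on a loopback file and fill it up
truncate -s 512M /tmp/full.img
mkfs.xfs -q /tmp/full.img
mkdir -p /mnt/full
mount -o loop /tmp/full.img /mnt/full
dd if=/dev/zero of=/mnt/full/filler bs=1M    # runs until "No space left on device"
# now let awk write into the full filesystem and watch whether it
# reports the write error and exits, or just keeps running
yes | awk '{print}' > /mnt/full/out.txt
echo "awk exit status: $?"
# clean up
umount /mnt/full
rm -f /tmp/full.img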

Anyway, full disks are a problem (and might even produce wrong results if applications fail to check for write errors). So we should try to avoid that as much as possible. Enforcing a TMPDIR on managed space for every job would be one way to help in that regard. What would be a good TMPDIR default? 100G?

@pmenzel
Contributor Author

pmenzel commented Oct 15, 2021

(Aren’t you on vacation? ;-))

I would have said 10 GB, to make users more aware of what their programs are doing and to not exclude too many nodes (which shouldn’t happen too often anymore once this is the default). It could always be increased if it impairs too many users. But 100 GB is also fine.

donald mentioned this issue May 11, 2022
donald mentioned this issue May 12, 2022
@arndt
Contributor

arndt commented May 12, 2022

This just crashed some of my jobs, which apparently needed more tmpdir space ...

@donald
Contributor

donald commented May 12, 2022

Sorry, but see it this way: your jobs also used to crash when $TMPDIR was filled by other users. Now, after you've set --tmpdir to what you need, you are guaranteed to have this space. If someone fills /scratch/local2, your job won't start and die on that node but will start on another node which has the requested disk space.
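For example, requesting guaranteed scratch space at submit time might look like this (a sketch only: I'm assuming the submit tool is mxqsub and that --tmpdir takes a size value; check the submit tool's --help for the exact option syntax and units):

# request job-private TMPDIR space up front; size format is assumed, ./my_pipeline.sh is a placeholder
mxqsub --tmpdir=20G ./my_pipeline.sh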

@arndt
Contributor

arndt commented May 12, 2022

All good - having this guarantee is good - but now I also have to think about how much tmp space I need. In my case I didn't even know that I needed tmp space, but apparently some internal routine needs it to store a gzip-decompressed file when I read it.
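One way to get a rough idea of how much space that decompression needs is to ask gzip for the stored uncompressed size (note the size field in the gzip header is only 32 bits, so it is unreliable for files larger than 4 GiB); input.fastq.gz here stands for whatever file is actually being read:

# print compressed and (claimed) uncompressed size
gzip -l input.fastq.gz
# or decompress to a pipe and count bytes, which is slower but exact
zcat input.fastq.gz | wc -c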

@donald
Contributor

donald commented May 12, 2022

Keep doubling until enough....
