From d58562ca6c992fc5577838d010c8a37401c2a831 Mon Sep 17 00:00:00 2001 From: Chris Mason Date: Thu, 30 Jun 2022 09:52:57 -0700 Subject: [PATCH 1/3] iomap: skip pages past eof in iomap_do_writepage() iomap_do_writepage() sends pages past i_size through folio_redirty_for_writepage(), which normally isn't a problem because truncate and friends clean them very quickly. When the system has cgroups configured, we can end up in situations where one cgroup has almost no dirty pages at all, and other cgroups consume the entire background dirty limit. This is especially common in our XFS workloads in production because they have cgroups using O_DIRECT for almost all of the IO mixed in with cgroups that do more traditional buffered IO work. We've hit storms where the redirty path hits millions of times in a few seconds, on all a single file that's only ~40 pages long. This leads to long tail latencies for file writes because the pdflush workers are hogging the CPU from some kworkers bound to the same CPU. Reproducing this on 5.18 was tricky because 869ae85dae ("xfs: flush new eof page on truncate...") ends up writing/waiting most of these dirty pages before truncate gets a chance to wait on them. The actual repro looks like this: /* * run me in a cgroup all alone. Start a second cgroup with dd * streaming IO into the block device. */ int main(int ac, char **av) { int fd; int ret; char buf[BUFFER_SIZE]; char *filename = av[1]; memset(buf, 0, BUFFER_SIZE); if (ac != 2) { fprintf(stderr, "usage: looper filename\n"); exit(1); } fd = open(filename, O_WRONLY | O_CREAT, 0600); if (fd < 0) { err(errno, "failed to open"); } fprintf(stderr, "looping on %s\n", filename); while(1) { /* * skip past page 0 so truncate doesn't write and wait * on our extent before changing i_size */ ret = lseek(fd, 8192, SEEK_SET); if (ret < 0) err(errno, "lseek"); ret = write(fd, buf, BUFFER_SIZE); if (ret != BUFFER_SIZE) err(errno, "write failed"); /* start IO so truncate has to wait after i_size is 0 */ ret = sync_file_range(fd, 16384, 4095, SYNC_FILE_RANGE_WRITE); if (ret < 0) err(errno, "sync_file_range"); ret = ftruncate(fd, 0); if (ret < 0) err(errno, "truncate"); usleep(1000); } } And this bpftrace script will show when you've hit a redirty storm: kretprobe:xfs_vm_writepages { delete(@dirty[pid]); } kprobe:xfs_vm_writepages { @dirty[pid] = 1; } kprobe:folio_redirty_for_writepage /@dirty[pid] > 0/ { $inode = ((struct folio *)arg1)->mapping->host->i_ino; @inodes[$inode] = count(); @redirty++; if (@redirty > 90000) { printf("inode %d redirty was %d", $inode, @redirty); exit(); } } This patch has the same number of failures on xfstests as unpatched 5.18: Failures: generic/648 xfs/019 xfs/050 xfs/168 xfs/299 xfs/348 xfs/506 xfs/543 I also ran it through a long stress of multiple fsx processes hammering. (Johannes Weiner did significant tracing and debugging on this as well) Signed-off-by: Chris Mason Co-authored-by: Johannes Weiner Reviewed-by: Johannes Weiner Reported-by: Domas Mituzas Reviewed-by: Darrick J. Wong Signed-off-by: Darrick J. Wong --- fs/iomap/buffered-io.c | 11 ++++++----- 1 file changed, 6 insertions(+), 5 deletions(-) diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c index d2a9f699e17ed..02b8bb46e0b33 100644 --- a/fs/iomap/buffered-io.c +++ b/fs/iomap/buffered-io.c @@ -1478,10 +1478,10 @@ iomap_do_writepage(struct page *page, struct writeback_control *wbc, void *data) pgoff_t end_index = isize >> PAGE_SHIFT; /* - * Skip the page if it's fully outside i_size, e.g. due to a - * truncate operation that's in progress. We must redirty the - * page so that reclaim stops reclaiming it. Otherwise - * iomap_release_folio() is called on it and gets confused. + * Skip the page if it's fully outside i_size, e.g. + * due to a truncate operation that's in progress. We've + * cleaned this page and truncate will finish things off for + * us. * * Note that the end_index is unsigned long. If the given * offset is greater than 16TB on a 32-bit system then if we @@ -1496,7 +1496,7 @@ iomap_do_writepage(struct page *page, struct writeback_control *wbc, void *data) */ if (folio->index > end_index || (folio->index == end_index && poff == 0)) - goto redirty; + goto unlock; /* * The page straddles i_size. It must be zeroed out on each @@ -1514,6 +1514,7 @@ iomap_do_writepage(struct page *page, struct writeback_control *wbc, void *data) redirty: folio_redirty_for_writepage(wbc, folio); +unlock: folio_unlock(folio); return 0; } From 98eb8d95025bd96d78fa4d27fb9e1e8d162c7227 Mon Sep 17 00:00:00 2001 From: Kaixu Xia Date: Thu, 30 Jun 2022 10:04:18 -0700 Subject: [PATCH 2/3] iomap: set did_zero to true when zeroing successfully It is unnecessary to check and set did_zero value in while() loop in iomap_zero_iter(), we can set did_zero to true only when zeroing successfully at last. Signed-off-by: Kaixu Xia Reviewed-by: Chaitanya Kulkarni Reviewed-by: Darrick J. Wong Signed-off-by: Darrick J. Wong Reviewed-by: Christoph Hellwig --- fs/iomap/buffered-io.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c index 02b8bb46e0b33..afd260632836d 100644 --- a/fs/iomap/buffered-io.c +++ b/fs/iomap/buffered-io.c @@ -917,10 +917,10 @@ static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero) pos += bytes; length -= bytes; written += bytes; - if (did_zero) - *did_zero = true; } while (length > 0); + if (did_zero) + *did_zero = true; return written; } From f8189d5d5fbf082786fb91c549f5127f23daec09 Mon Sep 17 00:00:00 2001 From: Kaixu Xia Date: Thu, 30 Jun 2022 10:04:18 -0700 Subject: [PATCH 3/3] dax: set did_zero to true when zeroing successfully It is unnecessary to check and set did_zero value in while() loop in dax_zero_iter(), we can set did_zero to true only when zeroing successfully at last. Signed-off-by: Kaixu Xia Reviewed-by: Chaitanya Kulkarni Reviewed-by: Darrick J. Wong Signed-off-by: Darrick J. Wong Reviewed-by: Christoph Hellwig --- fs/dax.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/fs/dax.c b/fs/dax.c index 4155a6107fa10..649ff51c9a267 100644 --- a/fs/dax.c +++ b/fs/dax.c @@ -1088,10 +1088,10 @@ static s64 dax_zero_iter(struct iomap_iter *iter, bool *did_zero) pos += size; length -= size; written += size; - if (did_zero) - *did_zero = true; } while (length > 0); + if (did_zero) + *did_zero = true; return written; }