Skip to content

Commit

Permalink
xfs: don't take a spinlock unconditionally in the DIO fastpath
Browse files Browse the repository at this point in the history
Because this happens at high thread counts on high IOPS devices
doing mixed read/write AIO-DIO to a single file at about a million
iops:

   64.09%     0.21%  [kernel]            [k] io_submit_one
   - 63.87% io_submit_one
      - 44.33% aio_write
         - 42.70% xfs_file_write_iter
            - 41.32% xfs_file_dio_write_aligned
               - 25.51% xfs_file_write_checks
                  - 21.60% _raw_spin_lock
                     - 21.59% do_raw_spin_lock
                        - 19.70% __pv_queued_spin_lock_slowpath

This also happens of the IO completion IO path:

   22.89%     0.69%  [kernel]            [k] xfs_dio_write_end_io
   - 22.49% xfs_dio_write_end_io
      - 21.79% _raw_spin_lock
         - 20.97% do_raw_spin_lock
            - 20.10% __pv_queued_spin_lock_slowpath

IOWs, fio is burning ~14 whole CPUs on this spin lock.

So, do an unlocked check against inode size first, then if we are
at/beyond EOF, take the spinlock and recheck. This makes the
spinlock disappear from the overwrite fastpath.

I'd like to report that fixing this makes things go faster. It
doesn't - it just exposes the the XFS_ILOCK as the next severe
contention point doing extent mapping lookups, and that now burns
all the 14 CPUs this spinlock was burning.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
  • Loading branch information
Dave Chinner authored and Darrick J. Wong committed Jun 2, 2021
1 parent 5a981e4 commit 977ec4d
Showing 1 changed file with 31 additions and 11 deletions.
42 changes: 31 additions & 11 deletions fs/xfs/xfs_file.c
Original file line number Diff line number Diff line change
Expand Up @@ -384,21 +384,30 @@ xfs_file_write_checks(
}
goto restart;
}

/*
* If the offset is beyond the size of the file, we need to zero any
* blocks that fall between the existing EOF and the start of this
* write. If zeroing is needed and we are currently holding the
* iolock shared, we need to update it to exclusive which implies
* having to redo all checks before.
* write. If zeroing is needed and we are currently holding the iolock
* shared, we need to update it to exclusive which implies having to
* redo all checks before.
*
* We need to serialise against EOF updates that occur in IO completions
* here. We want to make sure that nobody is changing the size while we
* do this check until we have placed an IO barrier (i.e. hold the
* XFS_IOLOCK_EXCL) that prevents new IO from being dispatched. The
* spinlock effectively forms a memory barrier once we have the
* XFS_IOLOCK_EXCL so we are guaranteed to see the latest EOF value and
* hence be able to correctly determine if we need to run zeroing.
*
* We need to serialise against EOF updates that occur in IO
* completions here. We want to make sure that nobody is changing the
* size while we do this check until we have placed an IO barrier (i.e.
* hold the XFS_IOLOCK_EXCL) that prevents new IO from being dispatched.
* The spinlock effectively forms a memory barrier once we have the
* XFS_IOLOCK_EXCL so we are guaranteed to see the latest EOF value
* and hence be able to correctly determine if we need to run zeroing.
* We can do an unlocked check here safely as IO completion can only
* extend EOF. Truncate is locked out at this point, so the EOF can
* not move backwards, only forwards. Hence we only need to take the
* slow path and spin locks when we are at or beyond the current EOF.
*/
if (iocb->ki_pos <= i_size_read(inode))
goto out;

spin_lock(&ip->i_flags_lock);
isize = i_size_read(inode);
if (iocb->ki_pos > isize) {
Expand Down Expand Up @@ -426,7 +435,7 @@ xfs_file_write_checks(
drained_dio = true;
goto restart;
}

trace_xfs_zero_eof(ip, isize, iocb->ki_pos - isize);
error = iomap_zero_range(inode, isize, iocb->ki_pos - isize,
NULL, &xfs_buffered_write_iomap_ops);
Expand All @@ -435,6 +444,7 @@ xfs_file_write_checks(
} else
spin_unlock(&ip->i_flags_lock);

out:
return file_modified(file);
}

Expand Down Expand Up @@ -500,7 +510,17 @@ xfs_dio_write_end_io(
* other IO completions here to update the EOF. Failing to serialise
* here can result in EOF moving backwards and Bad Things Happen when
* that occurs.
*
* As IO completion only ever extends EOF, we can do an unlocked check
* here to avoid taking the spinlock. If we land within the current EOF,
* then we do not need to do an extending update at all, and we don't
* need to take the lock to check this. If we race with an update moving
* EOF, then we'll either still be beyond EOF and need to take the lock,
* or we'll be within EOF and we don't need to take it at all.
*/
if (offset + size <= i_size_read(inode))
goto out;

spin_lock(&ip->i_flags_lock);
if (offset + size > i_size_read(inode)) {
i_size_write(inode, offset + size);
Expand Down

0 comments on commit 977ec4d

Please sign in to comment.