Skip to content

Commit

Permalink
xfs: Improve scalability of busy extent tracking
Browse files Browse the repository at this point in the history
When we free a metadata extent, we record it in the per-AG busy
extent array so that it is not re-used before the freeing
transaction hits the disk. This array is fixed size, so when it
overflows we make further allocation transactions synchronous
because we cannot track more freed extents until those transactions
hit the disk and are completed. Under heavy mixed allocation and
freeing workloads with large log buffers, we can overflow this array
quite easily.

Further, the array is sparsely populated, which means that inserts
need to search for a free slot, and array searches often have to
search many more slots that are actually used to check all the
busy extents. Quite inefficient, really.

To enable this aspect of extent freeing to scale better, we need
a structure that can grow dynamically. While in other areas of
XFS we have used radix trees, the extents being freed are at random
locations on disk so are better suited to being indexed by an rbtree.

So, use a per-AG rbtree indexed by block number to track busy
extents.  This incures a memory allocation when marking an extent
busy, but should not occur too often in low memory situations. This
should scale to an arbitrary number of extents so should not be a
limitation for features such as in-memory aggregation of
transactions.

However, there are still situations where we can't avoid allocating
busy extents (such as allocation from the AGFL). To minimise the
overhead of such occurences, we need to avoid doing a synchronous
log force while holding the AGF locked to ensure that the previous
transactions are safely on disk before we use the extent. We can do
this by marking the transaction doing the allocation as synchronous
rather issuing a log force.

Because of the locking involved and the ordering of transactions,
the synchronous transaction provides the same guarantees as a
synchronous log force because it ensures that all the prior
transactions are already on disk when the synchronous transaction
hits the disk. i.e. it preserves the free->allocate order of the
extent correctly in recovery.

By doing this, we avoid holding the AGF locked while log writes are
in progress, hence reducing the length of time the lock is held and
therefore we increase the rate at which we can allocate and free
from the allocation group, thereby increasing overall throughput.

The only problem with this approach is that when a metadata buffer is
marked stale (e.g. a directory block is removed), then buffer remains
pinned and locked until the log goes to disk. The issue here is that
if that stale buffer is reallocated in a subsequent transaction, the
attempt to lock that buffer in the transaction will hang waiting
the log to go to disk to unlock and unpin the buffer. Hence if
someone tries to lock a pinned, stale, locked buffer we need to
push on the log to get it unlocked ASAP. Effectively we are trading
off a guaranteed log force for a much less common trigger for log
force to occur.

Ideally we should not reallocate busy extents. That is a much more
complex fix to the problem as it involves direct intervention in the
allocation btree searches in many places. This is left to a future
set of modifications.

Finally, now that we track busy extents in allocated memory, we
don't need the descriptors in the transaction structure to point to
them. We can replace the complex busy chunk infrastructure with a
simple linked list of busy extents. This allows us to remove a large
chunk of code, making the overall change a net reduction in code
size.

Signed-off-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
  • Loading branch information
Dave Chinner authored and Alex Elder committed May 24, 2010
1 parent 955833c commit ed3b4d6
Show file tree
Hide file tree
Showing 11 changed files with 350 additions and 322 deletions.
9 changes: 9 additions & 0 deletions fs/xfs/linux-2.6/xfs_buf.c
Original file line number Diff line number Diff line change
Expand Up @@ -37,6 +37,7 @@

#include "xfs_sb.h"
#include "xfs_inum.h"
#include "xfs_log.h"
#include "xfs_ag.h"
#include "xfs_dmapi.h"
#include "xfs_mount.h"
Expand Down Expand Up @@ -850,13 +851,21 @@ xfs_buf_lock_value(
* Note that this in no way locks the underlying pages, so it is only
* useful for synchronizing concurrent use of buffer objects, not for
* synchronizing independent access to the underlying pages.
*
* If we come across a stale, pinned, locked buffer, we know that we
* are being asked to lock a buffer that has been reallocated. Because
* it is pinned, we know that the log has not been pushed to disk and
* hence it will still be locked. Rather than sleeping until someone
* else pushes the log, push it ourselves before trying to get the lock.
*/
void
xfs_buf_lock(
xfs_buf_t *bp)
{
trace_xfs_buf_lock(bp, _RET_IP_);

if (atomic_read(&bp->b_pin_count) && (bp->b_flags & XBF_STALE))
xfs_log_force(bp->b_mount, 0);
if (atomic_read(&bp->b_io_remaining))
blk_run_address_space(bp->b_target->bt_mapping);
down(&bp->b_sema);
Expand Down
1 change: 1 addition & 0 deletions fs/xfs/linux-2.6/xfs_quotaops.c
Original file line number Diff line number Diff line change
Expand Up @@ -19,6 +19,7 @@
#include "xfs_dmapi.h"
#include "xfs_sb.h"
#include "xfs_inum.h"
#include "xfs_log.h"
#include "xfs_ag.h"
#include "xfs_mount.h"
#include "xfs_quota.h"
Expand Down
83 changes: 56 additions & 27 deletions fs/xfs/linux-2.6/xfs_trace.h
Original file line number Diff line number Diff line change
Expand Up @@ -1059,83 +1059,112 @@ TRACE_EVENT(xfs_bunmap,

);

#define XFS_BUSY_SYNC \
{ 0, "async" }, \
{ 1, "sync" }

TRACE_EVENT(xfs_alloc_busy,
TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno, xfs_agblock_t agbno,
xfs_extlen_t len, int slot),
TP_ARGS(mp, agno, agbno, len, slot),
TP_PROTO(struct xfs_trans *trans, xfs_agnumber_t agno,
xfs_agblock_t agbno, xfs_extlen_t len, int sync),
TP_ARGS(trans, agno, agbno, len, sync),
TP_STRUCT__entry(
__field(dev_t, dev)
__field(struct xfs_trans *, tp)
__field(int, tid)
__field(xfs_agnumber_t, agno)
__field(xfs_agblock_t, agbno)
__field(xfs_extlen_t, len)
__field(int, slot)
__field(int, sync)
),
TP_fast_assign(
__entry->dev = mp->m_super->s_dev;
__entry->dev = trans->t_mountp->m_super->s_dev;
__entry->tp = trans;
__entry->tid = trans->t_ticket->t_tid;
__entry->agno = agno;
__entry->agbno = agbno;
__entry->len = len;
__entry->slot = slot;
__entry->sync = sync;
),
TP_printk("dev %d:%d agno %u agbno %u len %u slot %d",
TP_printk("dev %d:%d trans 0x%p tid 0x%x agno %u agbno %u len %u %s",
MAJOR(__entry->dev), MINOR(__entry->dev),
__entry->tp,
__entry->tid,
__entry->agno,
__entry->agbno,
__entry->len,
__entry->slot)
__print_symbolic(__entry->sync, XFS_BUSY_SYNC))

);

#define XFS_BUSY_STATES \
{ 0, "found" }, \
{ 1, "missing" }

TRACE_EVENT(xfs_alloc_unbusy,
TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno,
int slot, int found),
TP_ARGS(mp, agno, slot, found),
xfs_agblock_t agbno, xfs_extlen_t len),
TP_ARGS(mp, agno, agbno, len),
TP_STRUCT__entry(
__field(dev_t, dev)
__field(xfs_agnumber_t, agno)
__field(int, slot)
__field(int, found)
__field(xfs_agblock_t, agbno)
__field(xfs_extlen_t, len)
),
TP_fast_assign(
__entry->dev = mp->m_super->s_dev;
__entry->agno = agno;
__entry->slot = slot;
__entry->found = found;
__entry->agbno = agbno;
__entry->len = len;
),
TP_printk("dev %d:%d agno %u slot %d %s",
TP_printk("dev %d:%d agno %u agbno %u len %u",
MAJOR(__entry->dev), MINOR(__entry->dev),
__entry->agno,
__entry->slot,
__print_symbolic(__entry->found, XFS_BUSY_STATES))
__entry->agbno,
__entry->len)
);

#define XFS_BUSY_STATES \
{ 0, "missing" }, \
{ 1, "found" }

TRACE_EVENT(xfs_alloc_busysearch,
TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno, xfs_agblock_t agbno,
xfs_extlen_t len, xfs_lsn_t lsn),
TP_ARGS(mp, agno, agbno, len, lsn),
TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno,
xfs_agblock_t agbno, xfs_extlen_t len, int found),
TP_ARGS(mp, agno, agbno, len, found),
TP_STRUCT__entry(
__field(dev_t, dev)
__field(xfs_agnumber_t, agno)
__field(xfs_agblock_t, agbno)
__field(xfs_extlen_t, len)
__field(xfs_lsn_t, lsn)
__field(int, found)
),
TP_fast_assign(
__entry->dev = mp->m_super->s_dev;
__entry->agno = agno;
__entry->agbno = agbno;
__entry->len = len;
__entry->lsn = lsn;
__entry->found = found;
),
TP_printk("dev %d:%d agno %u agbno %u len %u force lsn 0x%llx",
TP_printk("dev %d:%d agno %u agbno %u len %u %s",
MAJOR(__entry->dev), MINOR(__entry->dev),
__entry->agno,
__entry->agbno,
__entry->len,
__print_symbolic(__entry->found, XFS_BUSY_STATES))
);

TRACE_EVENT(xfs_trans_commit_lsn,
TP_PROTO(struct xfs_trans *trans),
TP_ARGS(trans),
TP_STRUCT__entry(
__field(dev_t, dev)
__field(struct xfs_trans *, tp)
__field(xfs_lsn_t, lsn)
),
TP_fast_assign(
__entry->dev = trans->t_mountp->m_super->s_dev;
__entry->tp = trans;
__entry->lsn = trans->t_commit_lsn;
),
TP_printk("dev %d:%d trans 0x%p commit_lsn 0x%llx",
MAJOR(__entry->dev), MINOR(__entry->dev),
__entry->tp,
__entry->lsn)
);

Expand Down
24 changes: 15 additions & 9 deletions fs/xfs/xfs_ag.h
Original file line number Diff line number Diff line change
Expand Up @@ -175,14 +175,20 @@ typedef struct xfs_agfl {
} xfs_agfl_t;

/*
* Busy block/extent entry. Used in perag to mark blocks that have been freed
* but whose transactions aren't committed to disk yet.
* Busy block/extent entry. Indexed by a rbtree in perag to mark blocks that
* have been freed but whose transactions aren't committed to disk yet.
*
* Note that we use the transaction ID to record the transaction, not the
* transaction structure itself. See xfs_alloc_busy_insert() for details.
*/
typedef struct xfs_perag_busy {
xfs_agblock_t busy_start;
xfs_extlen_t busy_length;
struct xfs_trans *busy_tp; /* transaction that did the free */
} xfs_perag_busy_t;
struct xfs_busy_extent {
struct rb_node rb_node; /* ag by-bno indexed search tree */
struct list_head list; /* transaction busy extent list */
xfs_agnumber_t agno;
xfs_agblock_t bno;
xfs_extlen_t length;
xlog_tid_t tid; /* transaction that created this */
};

/*
* Per-ag incore structure, copies of information in agf and agi,
Expand Down Expand Up @@ -216,7 +222,8 @@ typedef struct xfs_perag {
xfs_agino_t pagl_leftrec;
xfs_agino_t pagl_rightrec;
#ifdef __KERNEL__
spinlock_t pagb_lock; /* lock for pagb_list */
spinlock_t pagb_lock; /* lock for pagb_tree */
struct rb_root pagb_tree; /* ordered tree of busy extents */

atomic_t pagf_fstrms; /* # of filestreams active in this AG */

Expand All @@ -226,7 +233,6 @@ typedef struct xfs_perag {
int pag_ici_reclaimable; /* reclaimable inodes */
#endif
int pagb_count; /* pagb slots in use */
xfs_perag_busy_t pagb_list[XFS_PAGB_NUM_SLOTS]; /* unstable blocks */
} xfs_perag_t;

/*
Expand Down
Loading

0 comments on commit ed3b4d6

Please sign in to comment.