Merge branch 'bpf-xdp-unaligned-chunk'
Kevin Laatz says:

====================
This patch set adds the ability to use unaligned chunks in the XDP umem.

Currently, all chunk addresses passed to the umem are masked to be chunk
size aligned (the maximum chunk size is PAGE_SIZE). This limits where we
can place chunks within the umem and also limits the packet sizes that
are supported.

The changes in this patch set remove these restrictions, allowing XDP to
be more flexible in where it can place a chunk within a umem. By relaxing
where the chunks can be placed, we can use an arbitrary buffer size and
place it wherever there is a free address in the umem. These changes add
support for arbitrary frame sizes up to 4k (PAGE_SIZE) and make it easy
to integrate with other existing frameworks that have their own memory
management systems, such as DPDK.
In DPDK, for example, there is already support for AF_XDP with zero-copy.
However, with this patch set the integration will be much more seamless.
You can find the DPDK AF_XDP driver at:
https://git.dpdk.org/dpdk/tree/drivers/net/af_xdp

Since we are now dealing with arbitrary frame sizes, we also need to
update how we pass around addresses. Currently, an address can simply be
masked with the 2k chunk size to get back to the original address. This
becomes less trivial when using frame sizes that are not a power of 2.
This patch set modifies the Rx/Tx descriptor format to use the upper 16
bits of the addr field for an offset value, leaving the lower 48 bits for
the address (this leaves us with 256 terabytes, which should be enough!).
We only need to use the upper 16 bits to store the offset when running in
unaligned mode. Rather than adding the offset (headroom etc.) to the
address, we store it in the upper 16 bits of the address field. This way,
we can easily add the offset to the address where we need it, using some
bit manipulation and addition, and we can also easily recover the original
address wherever we need it (for example in i40e_zca_free) by simply
masking to keep the lower 48 bits of the address field.
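
A rough sketch of that layout for reference (the names below mirror the
new uapi/header definitions added by this set, but treat them as
illustrative rather than authoritative):

  /* Lower 48 bits: base address. Upper 16 bits: offset (unaligned mode
   * only). __u64 comes from <linux/types.h>.
   */
  #define XSK_UNALIGNED_BUF_OFFSET_SHIFT 48
  #define XSK_UNALIGNED_BUF_ADDR_MASK \
          ((1ULL << XSK_UNALIGNED_BUF_OFFSET_SHIFT) - 1)

  /* Recover the original chunk address, e.g. as i40e_zca_free needs it. */
  static inline __u64 xsk_umem_extract_addr(__u64 addr)
  {
          return addr & XSK_UNALIGNED_BUF_ADDR_MASK;
  }

  /* Apply the offset carried in the upper 16 bits to the base address. */
  static inline __u64 xsk_umem_add_offset_to_addr(__u64 addr)
  {
          return (addr & XSK_UNALIGNED_BUF_ADDR_MASK) +
                 (addr >> XSK_UNALIGNED_BUF_OFFSET_SHIFT);
  }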

The patch set was tested with the following set up:
  - Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz
  - Intel Corporation Ethernet Controller XXV710 for 25GbE SFP28 (rev 02)
  - Driver: i40e
  - Application: xdpsock with l2fwd (single interface)
  - Turbo disabled in BIOS

There is no change in performance before and after these patches for SKB
mode and copy mode. Zero-copy mode sees a performance degradation of ~1.5%.

This patch set has been applied against
commit 0bb52b0 ("tools: bpftool: add 'bpftool map freeze' subcommand")

Structure of the patch set:

Patch 1:
  - Remove unnecessary masking and headroom addition during zero-copy Rx
    buffer recycling in i40e. This change is required in order for the
    buffer recycling to work in the unaligned chunk mode.

Patch 2:
  - Remove unnecessary masking and headroom addition during
    zero-copy Rx buffer recycling in ixgbe. This change is required in
    order for the buffer recycling to work in the unaligned chunk mode.

Patch 3:
  - Add infrastructure for unaligned chunks. Since we are dealing with
    unaligned chunks that could potentially cross a physical page boundary,
    we add checks to keep track of that information. We can later use this
    information to correctly handle buffers that are placed at an address
    where they cross a page boundary. This patch also modifies the
    existing Rx and Tx functions to use the new descriptor format. To
    handle addresses correctly, we need to mask appropriately based on
    whether we are in aligned or unaligned mode. (A rough sketch of the
    page-crossing check follows the patch list below.)

Patch 4:
  - This patch updates the i40e driver to make use of the new descriptor
    format.

Patch 5:
  - This patch updates the ixgbe driver to make use of the new descriptor
    format.

Patch 6:
  - This patch updates the mlx5e driver to make use of the new descriptor
    format. These changes are required to handle the new descriptor format
    and for unaligned chunk support.

Patch 7:
  - This patch allows XSK frames smaller than page size in the mlx5e
    driver. Relax the requirements on the XSK frame size to allow it to be
    smaller than a page and even not a power of two. The current
    implementation can work in this mode, both with Striding RQ and without
    it.

Patch 8:
  - Add flags for umem configuration to libbpf. Since we increase the size
    of the struct by adding flags, we also need to add ABI versioning in
    this patch. (See the libbpf configuration sketch after this list.)

Patch 9:
  - Modify the xdpsock application to add a command-line option for
    unaligned chunks.

Patch 10:
  - Since we can now run the application in unaligned chunk mode, we need
    to make sure we recycle the buffers appropriately.

Patch 11:
  - Add hugepage support to the xdpsock application (see the mmap sketch
    after this list).

Patch 12:
  - Documentation update to include the unaligned chunk scenario. We need
    to explicitly state that the incoming addresses are only masked in the
    aligned chunk mode and not the unaligned chunk mode.
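
As mentioned in the Patch 3 description above, a simplified sketch of the
page-boundary check the unaligned infrastructure needs (the helper below
and its next_pg_contig() lookup are illustrative assumptions, not the
exact kernel code):

  /* Reject a chunk of 'len' bytes at 'addr' if it straddles a page
   * boundary and the following page is not physically contiguous.
   * next_pg_contig() stands in for the per-page flag kept in the umem.
   */
  static bool chunk_crosses_non_contig_pg(struct xdp_umem *umem,
                                          u64 addr, u64 len)
  {
          bool cross_pg = (addr & (PAGE_SIZE - 1)) + len > PAGE_SIZE;

          return cross_pg && !next_pg_contig(umem, addr >> PAGE_SHIFT);
  }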
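
For the Patch 8 libbpf changes, a sketch of how an application might opt
in to unaligned chunks once the flags field exists (a minimal sketch; the
exact flag and field names should be checked against the final headers):

  struct xsk_umem *umem;
  struct xsk_ring_prod fq;
  struct xsk_ring_cons cq;
  struct xsk_umem_config cfg = {
          .fill_size = XSK_RING_PROD__DEFAULT_NUM_DESCS,
          .comp_size = XSK_RING_CONS__DEFAULT_NUM_DESCS,
          .frame_size = 3072,        /* arbitrary size, not a power of 2 */
          .frame_headroom = XSK_UMEM__DEFAULT_FRAME_HEADROOM,
          .flags = XDP_UMEM_UNALIGNED_CHUNK_FLAG,
  };
  int ret;

  /* bufs/size describe the user-allocated umem area (see the hugepage
   * sketch below).
   */
  ret = xsk_umem__create(&umem, bufs, size, &fq, &cq, &cfg);
  if (ret)
          exit(EXIT_FAILURE);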
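
And for Patch 11, a sketch of backing the xdpsock umem area with
hugepages (a minimal sketch using standard mmap flags; it assumes
hugepages have been reserved on the system):

  /* size: total umem size, e.g. number of frames * frame size. Try a
   * hugepage-backed mapping first, fall back to regular pages.
   */
  void *bufs = mmap(NULL, size, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
  if (bufs == MAP_FAILED) {
          fprintf(stderr, "hugepage mmap failed, using regular pages\n");
          bufs = mmap(NULL, size, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
          if (bufs == MAP_FAILED)
                  exit(EXIT_FAILURE);
  }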

v2:
  - fixed checkpatch issues
  - fixed Rx buffer recycling for unaligned chunks in xdpsock
  - removed unused defines
  - fixed how chunk_size is calculated in xsk_diag.c
  - added some performance numbers to cover letter
  - modified descriptor format to make it easier to retrieve original
    address
  - removed patch adding off_t off to the zero copy allocator. This is no
    longer needed with the new descriptor format.

v3:
  - added patch for mlx5 driver changes needed for unaligned chunks
  - moved offset handling to new helper function
  - changed the value used for the umem chunk_mask. We now use the new
    descriptor format to save us doing the calculations in a number of
    places, meaning more of the code is left unchanged while adding
    unaligned chunk support.

v4:
  - reworked the next_pg_contig field in the xdp_umem_page struct. We now
    use the low 12 bits of the addr for flags rather than adding an extra
    field in the struct.
  - modified unaligned chunks flag define
  - fixed page_start calculation in __xsk_rcv_memcpy().
  - moved offset handling to the xdp_umem_get_* functions
  - modified the len field in xdp_umem_reg struct. We now use 16 bits from
    this for the flags field.
  - fixed headroom addition to the handle in the mlx5e driver
  - other minor changes based on review comments

v5:
  - Added ABI versioning in the libbpf patch
  - Removed bitfields in the xdp_umem_reg struct. Added a new flags field.
  - Added accessors for getting addr and offset.
  - Added helper function for adding the offset to the addr.
  - Fixed conflicts with 'bpf-af-xdp-wakeup' which was merged recently.
  - Fixed typo in mlx driver patch.
  - Moved libbpf patch to later in the set (7/11, just before the sample
    app changes)

v6:
  - Added support for XSK frames smaller than page in mlx5e driver (Maxim
    Mikityanskiy <maximmi@mellanox.com>).
  - Fixed offset handling in xsk_generic_rcv.
  - Added check for base address in xskq_is_valid_addr_unaligned.
====================

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Daniel Borkmann committed Aug 30, 2019
2 parents 1c6d6e0 + d57f172 commit bdb15a2
Showing 20 changed files with 417 additions and 103 deletions.
10 changes: 6 additions & 4 deletions Documentation/networking/af_xdp.rst
@@ -153,10 +153,12 @@ an example, if the UMEM is 64k and each chunk is 4k, then the UMEM has

Frames passed to the kernel are used for the ingress path (RX rings).

-The user application produces UMEM addrs to this ring. Note that the
-kernel will mask the incoming addr. E.g. for a chunk size of 2k, the
-log2(2048) LSB of the addr will be masked off, meaning that 2048, 2050
-and 3000 refers to the same chunk.
+The user application produces UMEM addrs to this ring. Note that, if
+running the application with aligned chunk mode, the kernel will mask
+the incoming addr. E.g. for a chunk size of 2k, the log2(2048) LSB of
+the addr will be masked off, meaning that 2048, 2050 and 3000 refers
+to the same chunk. If the user application is run in the unaligned
+chunks mode, then the incoming addr will be left untouched.


UMEM Completion Ring
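
To restate the af_xdp.rst change above in code form (illustrative only):
in aligned chunk mode the kernel reduces every incoming addr to its chunk
base, while unaligned chunk mode leaves the addr untouched.

  /* Aligned mode, chunk_size = 2048: 2048, 2050 and 3000 -> 2048. */
  static u64 aligned_mode_chunk(u64 addr, u64 chunk_size)
  {
          return addr & ~(chunk_size - 1);
  }
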
26 changes: 12 additions & 14 deletions drivers/net/ethernet/intel/i40e/i40e_xsk.c
@@ -190,7 +190,9 @@ int i40e_xsk_umem_setup(struct i40e_vsi *vsi, struct xdp_umem *umem,
**/
static int i40e_run_xdp_zc(struct i40e_ring *rx_ring, struct xdp_buff *xdp)
{
+struct xdp_umem *umem = rx_ring->xsk_umem;
int err, result = I40E_XDP_PASS;
+u64 offset = umem->headroom;
struct i40e_ring *xdp_ring;
struct bpf_prog *xdp_prog;
u32 act;
@@ -201,7 +203,10 @@ static int i40e_run_xdp_zc(struct i40e_ring *rx_ring, struct xdp_buff *xdp)
*/
xdp_prog = READ_ONCE(rx_ring->xdp_prog);
act = bpf_prog_run_xdp(xdp_prog, xdp);
-xdp->handle += xdp->data - xdp->data_hard_start;
+offset += xdp->data - xdp->data_hard_start;
+
+xdp->handle = xsk_umem_adjust_offset(umem, xdp->handle, offset);
+
switch (act) {
case XDP_PASS:
break;
@@ -262,7 +267,7 @@ static bool i40e_alloc_buffer_zc(struct i40e_ring *rx_ring,
bi->addr = xdp_umem_get_data(umem, handle);
bi->addr += hr;

-bi->handle = handle + umem->headroom;
+bi->handle = handle;

xsk_umem_discard_addr(umem);
return true;
@@ -299,7 +304,7 @@ static bool i40e_alloc_buffer_slow_zc(struct i40e_ring *rx_ring,
bi->addr = xdp_umem_get_data(umem, handle);
bi->addr += hr;

-bi->handle = handle + umem->headroom;
+bi->handle = handle;

xsk_umem_discard_addr_rq(umem);
return true;
@@ -420,23 +425,16 @@ static void i40e_reuse_rx_buffer_zc(struct i40e_ring *rx_ring,
struct i40e_rx_buffer *old_bi)
{
struct i40e_rx_buffer *new_bi = &rx_ring->rx_bi[rx_ring->next_to_alloc];
-unsigned long mask = (unsigned long)rx_ring->xsk_umem->chunk_mask;
-u64 hr = rx_ring->xsk_umem->headroom + XDP_PACKET_HEADROOM;
u16 nta = rx_ring->next_to_alloc;

/* update, and store next to alloc */
nta++;
rx_ring->next_to_alloc = (nta < rx_ring->count) ? nta : 0;

/* transfer page from old buffer to new buffer */
-new_bi->dma = old_bi->dma & mask;
-new_bi->dma += hr;
-
-new_bi->addr = (void *)((unsigned long)old_bi->addr & mask);
-new_bi->addr += hr;
-
-new_bi->handle = old_bi->handle & mask;
-new_bi->handle += rx_ring->xsk_umem->headroom;
+new_bi->dma = old_bi->dma;
+new_bi->addr = old_bi->addr;
+new_bi->handle = old_bi->handle;

old_bi->addr = NULL;
}
@@ -471,7 +469,7 @@ void i40e_zca_free(struct zero_copy_allocator *alloc, unsigned long handle)
bi->addr = xdp_umem_get_data(rx_ring->xsk_umem, handle);
bi->addr += hr;

-bi->handle = (u64)handle + rx_ring->xsk_umem->headroom;
+bi->handle = (u64)handle;
}

/**
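
For context on the i40e changes above: instead of adding headroom and
data offsets directly to the handle, the driver now routes them through
an addressing-mode-aware helper. Roughly (a sketch based on this set's
xdp_sock.h additions; treat the exact form as illustrative):

  static inline u64 xsk_umem_adjust_offset(struct xdp_umem *umem, u64 addr,
                                           u64 offset)
  {
          /* Unaligned mode carries the offset in the upper 16 bits of
           * the addr; aligned mode simply adds it as before.
           */
          if (umem->flags & XDP_UMEM_UNALIGNED_CHUNK_FLAG)
                  return addr + (offset << XSK_UNALIGNED_BUF_OFFSET_SHIFT);

          return addr + offset;
  }
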
26 changes: 12 additions & 14 deletions drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c
@@ -143,15 +143,20 @@ static int ixgbe_run_xdp_zc(struct ixgbe_adapter *adapter,
struct ixgbe_ring *rx_ring,
struct xdp_buff *xdp)
{
+struct xdp_umem *umem = rx_ring->xsk_umem;
int err, result = IXGBE_XDP_PASS;
+u64 offset = umem->headroom;
struct bpf_prog *xdp_prog;
struct xdp_frame *xdpf;
u32 act;

rcu_read_lock();
xdp_prog = READ_ONCE(rx_ring->xdp_prog);
act = bpf_prog_run_xdp(xdp_prog, xdp);
-xdp->handle += xdp->data - xdp->data_hard_start;
+offset += xdp->data - xdp->data_hard_start;
+
+xdp->handle = xsk_umem_adjust_offset(umem, xdp->handle, offset);
+
switch (act) {
case XDP_PASS:
break;
@@ -201,8 +206,6 @@ ixgbe_rx_buffer *ixgbe_get_rx_buffer_zc(struct ixgbe_ring *rx_ring,
static void ixgbe_reuse_rx_buffer_zc(struct ixgbe_ring *rx_ring,
struct ixgbe_rx_buffer *obi)
{
-unsigned long mask = (unsigned long)rx_ring->xsk_umem->chunk_mask;
-u64 hr = rx_ring->xsk_umem->headroom + XDP_PACKET_HEADROOM;
u16 nta = rx_ring->next_to_alloc;
struct ixgbe_rx_buffer *nbi;

@@ -212,14 +215,9 @@ static void ixgbe_reuse_rx_buffer_zc(struct ixgbe_ring *rx_ring,
rx_ring->next_to_alloc = (nta < rx_ring->count) ? nta : 0;

/* transfer page from old buffer to new buffer */
-nbi->dma = obi->dma & mask;
-nbi->dma += hr;
-
-nbi->addr = (void *)((unsigned long)obi->addr & mask);
-nbi->addr += hr;
-
-nbi->handle = obi->handle & mask;
-nbi->handle += rx_ring->xsk_umem->headroom;
+nbi->dma = obi->dma;
+nbi->addr = obi->addr;
+nbi->handle = obi->handle;

obi->addr = NULL;
obi->skb = NULL;
@@ -250,7 +248,7 @@ void ixgbe_zca_free(struct zero_copy_allocator *alloc, unsigned long handle)
bi->addr = xdp_umem_get_data(rx_ring->xsk_umem, handle);
bi->addr += hr;

-bi->handle = (u64)handle + rx_ring->xsk_umem->headroom;
+bi->handle = (u64)handle;
}

static bool ixgbe_alloc_buffer_zc(struct ixgbe_ring *rx_ring,
@@ -276,7 +274,7 @@ static bool ixgbe_alloc_buffer_zc(struct ixgbe_ring *rx_ring,
bi->addr = xdp_umem_get_data(umem, handle);
bi->addr += hr;

-bi->handle = handle + umem->headroom;
+bi->handle = handle;

xsk_umem_discard_addr(umem);
return true;
@@ -303,7 +301,7 @@ static bool ixgbe_alloc_buffer_slow_zc(struct ixgbe_ring *rx_ring,
bi->addr = xdp_umem_get_data(umem, handle);
bi->addr += hr;

-bi->handle = handle + umem->headroom;
+bi->handle = handle;

xsk_umem_discard_addr_rq(umem);
return true;
23 changes: 19 additions & 4 deletions drivers/net/ethernet/mellanox/mlx5/core/en/params.c
@@ -25,18 +25,33 @@ u16 mlx5e_get_linear_rq_headroom(struct mlx5e_params *params,
return headroom;
}

-u32 mlx5e_rx_get_linear_frag_sz(struct mlx5e_params *params,
-struct mlx5e_xsk_param *xsk)
+u32 mlx5e_rx_get_min_frag_sz(struct mlx5e_params *params,
+struct mlx5e_xsk_param *xsk)
{
u32 hw_mtu = MLX5E_SW2HW_MTU(params, params->sw_mtu);
u16 linear_rq_headroom = mlx5e_get_linear_rq_headroom(params, xsk);
-u32 frag_sz = linear_rq_headroom + hw_mtu;

+return linear_rq_headroom + hw_mtu;
+}
+
+u32 mlx5e_rx_get_linear_frag_sz(struct mlx5e_params *params,
+struct mlx5e_xsk_param *xsk)
+{
+u32 frag_sz = mlx5e_rx_get_min_frag_sz(params, xsk);
+
/* AF_XDP doesn't build SKBs in place. */
if (!xsk)
frag_sz = MLX5_SKB_FRAG_SZ(frag_sz);

-/* XDP in mlx5e doesn't support multiple packets per page. */
+/* XDP in mlx5e doesn't support multiple packets per page. AF_XDP is a
+ * special case. It can run with frames smaller than a page, as it
+ * doesn't allocate pages dynamically. However, here we pretend that
+ * fragments are page-sized: it allows to treat XSK frames like pages
+ * by redirecting alloc and free operations to XSK rings and by using
+ * the fact there are no multiple packets per "page" (which is a frame).
+ * The latter is important, because frames may come in a random order,
+ * and we will have trouble assemblying a real page of multiple frames.
+ */
if (mlx5e_rx_is_xdp(params, xsk))
frag_sz = max_t(u32, frag_sz, PAGE_SIZE);

2 changes: 2 additions & 0 deletions drivers/net/ethernet/mellanox/mlx5/core/en/params.h
@@ -76,6 +76,8 @@ static inline bool mlx5e_qid_validate(const struct mlx5e_profile *profile,

u16 mlx5e_get_linear_rq_headroom(struct mlx5e_params *params,
struct mlx5e_xsk_param *xsk);
+u32 mlx5e_rx_get_min_frag_sz(struct mlx5e_params *params,
+struct mlx5e_xsk_param *xsk);
u32 mlx5e_rx_get_linear_frag_sz(struct mlx5e_params *params,
struct mlx5e_xsk_param *xsk);
u8 mlx5e_mpwqe_log_pkts_per_wqe(struct mlx5e_params *params,
8 changes: 6 additions & 2 deletions drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
@@ -122,6 +122,7 @@ bool mlx5e_xdp_handle(struct mlx5e_rq *rq, struct mlx5e_dma_info *di,
void *va, u16 *rx_headroom, u32 *len, bool xsk)
{
struct bpf_prog *prog = READ_ONCE(rq->xdp_prog);
+struct xdp_umem *umem = rq->umem;
struct xdp_buff xdp;
u32 act;
int err;
@@ -138,8 +139,11 @@ bool mlx5e_xdp_handle(struct mlx5e_rq *rq, struct mlx5e_dma_info *di,
xdp.rxq = &rq->xdp_rxq;

act = bpf_prog_run_xdp(prog, &xdp);
-if (xsk)
-xdp.handle += xdp.data - xdp.data_hard_start;
+if (xsk) {
+u64 off = xdp.data - xdp.data_hard_start;
+
+xdp.handle = xsk_umem_adjust_offset(umem, xdp.handle, off);
+}
switch (act) {
case XDP_PASS:
*rx_headroom = xdp.data - xdp.data_hard_start;
5 changes: 3 additions & 2 deletions drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c
@@ -24,7 +24,8 @@ int mlx5e_xsk_page_alloc_umem(struct mlx5e_rq *rq,
if (!xsk_umem_peek_addr_rq(umem, &handle))
return -ENOMEM;

-dma_info->xsk.handle = handle + rq->buff.umem_headroom;
+dma_info->xsk.handle = xsk_umem_adjust_offset(umem, handle,
+rq->buff.umem_headroom);
dma_info->xsk.data = xdp_umem_get_data(umem, dma_info->xsk.handle);

/* No need to add headroom to the DMA address. In striding RQ case, we
@@ -104,7 +105,7 @@ struct sk_buff *mlx5e_xsk_skb_from_cqe_mpwrq_linear(struct mlx5e_rq *rq,

/* head_offset is not used in this function, because di->xsk.data and
* di->addr point directly to the necessary place. Furthermore, in the
-* current implementation, one page = one packet = one frame, so
+* current implementation, UMR pages are mapped to XSK frames, so
* head_offset should always be 0.
*/
WARN_ON_ONCE(head_offset);
15 changes: 10 additions & 5 deletions drivers/net/ethernet/mellanox/mlx5/core/en/xsk/setup.c
@@ -4,18 +4,23 @@
#include "setup.h"
#include "en/params.h"

+/* It matches XDP_UMEM_MIN_CHUNK_SIZE, but as this constant is private and may
+ * change unexpectedly, and mlx5e has a minimum valid stride size for striding
+ * RQ, keep this check in the driver.
+ */
+#define MLX5E_MIN_XSK_CHUNK_SIZE 2048
+
bool mlx5e_validate_xsk_param(struct mlx5e_params *params,
struct mlx5e_xsk_param *xsk,
struct mlx5_core_dev *mdev)
{
-/* AF_XDP doesn't support frames larger than PAGE_SIZE, and the current
- * mlx5e XDP implementation doesn't support multiple packets per page.
- */
-if (xsk->chunk_size != PAGE_SIZE)
+/* AF_XDP doesn't support frames larger than PAGE_SIZE. */
+if (xsk->chunk_size > PAGE_SIZE ||
+xsk->chunk_size < MLX5E_MIN_XSK_CHUNK_SIZE)
return false;

/* Current MTU and XSK headroom don't allow packets to fit the frames. */
-if (mlx5e_rx_get_linear_frag_sz(params, xsk) > xsk->chunk_size)
+if (mlx5e_rx_get_min_frag_sz(params, xsk) > xsk->chunk_size)
return false;

/* frag_sz is different for regular and XSK RQs, so ensure that linear
