Merge branch 'bpf-af-xdp-fixes'
Björn Töpel says:

====================
An issue with the current AF_XDP uapi raised by Mykyta Iziumtsev (see
https://www.spinics.net/lists/netdev/msg503664.html) is that it does
not support NICs that have a "type-writer" model in an efficient
way. In this model, a memory window is passed to the hardware and
multiple frames might be filled into that window, instead of just one
that we have in the current fixed frame-size model.

This patch set fixes two bugs in the current implementation and then
changes the uapi so that the type-writer model can be supported
efficiently by a possible future extension of AF_XDP.

These are the uapi changes in this patch:

* Change the "u32 idx" in the descriptors to "u64 addr". The current
  idx-based format does NOT work for the type-writer model (as packets
  can start anywhere within a frame), whereas a relative address
  pointer (the u64 addr) works well for both models in the prototype
  code we have that supports both models. We increased it from u32 to
  u64 to support umems larger than 4G. We have also removed the u16
  offset when having a "u64 addr" since that information is already
  carried in the least significant bits of the address.

* We want to use "u8 padding[5]" for something useful in the future
  (since we are not allowed to change its name), so we now call it
  just options so it can be extended for various purposes in the
  future. It is a u32, as that is what is left of the 16-byte
  descriptor.

* We changed the name of frame_size in the UMEM_REG setsockopt to
  chunk_size since this naming also makes sense to the type-writer
  model.

With these changes to the uapi, we believe the type-writer model can
be supported without having to resort to a new descriptor format. The
type-writer model could then be supported, from the uapi point of
view, by setting a flag at bind time and providing a new flag bit in
the options field of the descriptor that signals to user space that
all packets have been written in a chunk. Or with a new chunk
completion queue as suggested by Mykyta in his latest feedback mail on
the list.

We based this patch set on bpf-next commit bd3a08a ("bpf:
flowlabel in bpf_fib_lookup should be flowinfo")

The structure of the patch set is as follows:

Patches 1-2: Fixes two bugs in the current implementation.
Patches 3-4: Prepares the uapi for a "type-writer" model and modifies
             the sample application so that it works with the new
	     uapi.
Patch 5: Small performance improvement patch for the sample application.

Cheers: Magnus and Björn
====================

Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
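
The descriptor change described in the cover letter can be illustrated with a small user-space sketch. This is not code from the patch set: the names `old_desc`, `new_desc` and `idx_to_addr` are hypothetical and exist only for this illustration, and `<stdint.h>` types stand in for the kernel's `__u32`/`__u64`. The field lists mirror the old and new `struct xdp_desc` shown in the diff below; the 16-byte size check assumes a typical ABI with natural alignment.

```c
#include <stdint.h>

/* Illustrative copy of the old layout: frame index plus in-frame offset. */
struct old_desc {
	uint32_t idx;
	uint32_t len;
	uint16_t offset;
	uint8_t  flags;
	uint8_t  padding[5];
};

/* Illustrative copy of the new layout: one offset into the whole UMEM. */
struct new_desc {
	uint64_t addr;
	uint32_t len;
	uint32_t options;
};

/* Both layouts are expected to occupy 16 bytes on common ABIs. */
_Static_assert(sizeof(struct old_desc) == 16, "old descriptor is 16 bytes");
_Static_assert(sizeof(struct new_desc) == 16, "new descriptor is 16 bytes");

/*
 * Hypothetical helper: express an old-style (idx, offset) pair as a
 * new-style addr, i.e. a byte offset from the start of the UMEM.
 */
static uint64_t idx_to_addr(uint32_t idx, uint16_t offset, uint32_t frame_size)
{
	return (uint64_t)idx * frame_size + offset;
}
```

With 2048-byte frames, for example, frame 3 at offset 256 corresponds to addr 3 * 2048 + 256 = 6400, which is how a single u64 can carry both pieces of information.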
Daniel Borkmann committed Jun 4, 2018
2 parents bd3a08a + a65ea68 commit 6499536
Showing 9 changed files with 172 additions and 203 deletions.
101 changes: 58 additions & 43 deletions Documentation/networking/af_xdp.rst
@@ -12,7 +12,7 @@ packet processing.

 This document assumes that the reader is familiar with BPF and XDP. If
 not, the Cilium project has an excellent reference guide at
-http://cilium.readthedocs.io/en/doc-1.0/bpf/.
+http://cilium.readthedocs.io/en/latest/bpf/.

 Using the XDP_REDIRECT action from an XDP program, the program can
 redirect ingress frames to other XDP enabled netdevs, using the
@@ -33,22 +33,22 @@ for a while due to a possible retransmit, the descriptor that points
 to that packet can be changed to point to another and reused right
 away. This again avoids copying data.

-The UMEM consists of a number of equally size frames and each frame
-has a unique frame id. A descriptor in one of the rings references a
-frame by referencing its frame id. The user space allocates memory for
-this UMEM using whatever means it feels is most appropriate (malloc,
-mmap, huge pages, etc). This memory area is then registered with the
-kernel using the new setsockopt XDP_UMEM_REG. The UMEM also has two
-rings: the FILL ring and the COMPLETION ring. The fill ring is used by
-the application to send down frame ids for the kernel to fill in with
-RX packet data. References to these frames will then appear in the RX
-ring once each packet has been received. The completion ring, on the
-other hand, contains frame ids that the kernel has transmitted
-completely and can now be used again by user space, for either TX or
-RX. Thus, the frame ids appearing in the completion ring are ids that
-were previously transmitted using the TX ring. In summary, the RX and
-FILL rings are used for the RX path and the TX and COMPLETION rings
-are used for the TX path.
+The UMEM consists of a number of equally sized chunks. A descriptor in
+one of the rings references a frame by referencing its addr. The addr
+is simply an offset within the entire UMEM region. The user space
+allocates memory for this UMEM using whatever means it feels is most
+appropriate (malloc, mmap, huge pages, etc). This memory area is then
+registered with the kernel using the new setsockopt XDP_UMEM_REG. The
+UMEM also has two rings: the FILL ring and the COMPLETION ring. The
+fill ring is used by the application to send down addr for the kernel
+to fill in with RX packet data. References to these frames will then
+appear in the RX ring once each packet has been received. The
+completion ring, on the other hand, contains frame addr that the
+kernel has transmitted completely and can now be used again by user
+space, for either TX or RX. Thus, the frame addrs appearing in the
+completion ring are addrs that were previously transmitted using the
+TX ring. In summary, the RX and FILL rings are used for the RX path
+and the TX and COMPLETION rings are used for the TX path.

 The socket is then finally bound with a bind() call to a device and a
 specific queue id on that device, and it is not until bind is
@@ -59,13 +59,13 @@ wants to do this, it simply skips the registration of the UMEM and its
 corresponding two rings, sets the XDP_SHARED_UMEM flag in the bind
 call and submits the XSK of the process it would like to share UMEM
 with as well as its own newly created XSK socket. The new process will
-then receive frame id references in its own RX ring that point to this
-shared UMEM. Note that since the ring structures are single-consumer /
-single-producer (for performance reasons), the new process has to
-create its own socket with associated RX and TX rings, since it cannot
-share this with the other process. This is also the reason that there
-is only one set of FILL and COMPLETION rings per UMEM. It is the
-responsibility of a single process to handle the UMEM.
+then receive frame addr references in its own RX ring that point to
+this shared UMEM. Note that since the ring structures are
+single-consumer / single-producer (for performance reasons), the new
+process has to create its own socket with associated RX and TX rings,
+since it cannot share this with the other process. This is also the
+reason that there is only one set of FILL and COMPLETION rings per
+UMEM. It is the responsibility of a single process to handle the UMEM.

 How is then packets distributed from an XDP program to the XSKs? There
 is a BPF map called XSKMAP (or BPF_MAP_TYPE_XSKMAP in full). The
@@ -102,10 +102,10 @@ UMEM

 UMEM is a region of virtual contiguous memory, divided into
 equal-sized frames. An UMEM is associated to a netdev and a specific
-queue id of that netdev. It is created and configured (frame size,
-frame headroom, start address and size) by using the XDP_UMEM_REG
-setsockopt system call. A UMEM is bound to a netdev and queue id, via
-the bind() system call.
+queue id of that netdev. It is created and configured (chunk size,
+headroom, start address and size) by using the XDP_UMEM_REG setsockopt
+system call. A UMEM is bound to a netdev and queue id, via the bind()
+system call.

 An AF_XDP is socket linked to a single UMEM, but one UMEM can have
 multiple AF_XDP sockets. To share an UMEM created via one socket A,
@@ -147,13 +147,17 @@ UMEM Fill Ring
 ~~~~~~~~~~~~~~

 The Fill ring is used to transfer ownership of UMEM frames from
-user-space to kernel-space. The UMEM indicies are passed in the
-ring. As an example, if the UMEM is 64k and each frame is 4k, then the
-UMEM has 16 frames and can pass indicies between 0 and 15.
+user-space to kernel-space. The UMEM addrs are passed in the ring. As
+an example, if the UMEM is 64k and each chunk is 4k, then the UMEM has
+16 chunks and can pass addrs between 0 and 64k.

 Frames passed to the kernel are used for the ingress path (RX rings).

-The user application produces UMEM indicies to this ring.
+The user application produces UMEM addrs to this ring. Note that the
+kernel will mask the incoming addr. E.g. for a chunk size of 2k, the
+log2(2048) LSB of the addr will be masked off, meaning that 2048, 2050
+and 3000 refers to the same chunk.
+

 UMEM Completetion Ring
 ~~~~~~~~~~~~~~~~~~~~~~
@@ -165,16 +169,15 @@ used.
 Frames passed from the kernel to user-space are frames that has been
 sent (TX ring) and can be used by user-space again.

-The user application consumes UMEM indicies from this ring.
+The user application consumes UMEM addrs from this ring.


 RX Ring
 ~~~~~~~

 The RX ring is the receiving side of a socket. Each entry in the ring
-is a struct xdp_desc descriptor. The descriptor contains UMEM index
-(idx), the length of the data (len), the offset into the frame
-(offset).
+is a struct xdp_desc descriptor. The descriptor contains UMEM offset
+(addr) and the length of the data (len).

 If no frames have been passed to kernel via the Fill ring, no
 descriptors will (or can) appear on the RX ring.
@@ -221,38 +224,50 @@ side is xdpsock_user.c and the XDP side xdpsock_kern.c.

 Naive ring dequeue and enqueue could look like this::

+    // struct xdp_rxtx_ring {
+    //     __u32 *producer;
+    //     __u32 *consumer;
+    //     struct xdp_desc *desc;
+    // };
+
+    // struct xdp_umem_ring {
+    //     __u32 *producer;
+    //     __u32 *consumer;
+    //     __u64 *desc;
+    // };
+
     // typedef struct xdp_rxtx_ring RING;
     // typedef struct xdp_umem_ring RING;

     // typedef struct xdp_desc RING_TYPE;
-    // typedef __u32 RING_TYPE;
+    // typedef __u64 RING_TYPE;

     int dequeue_one(RING *ring, RING_TYPE *item)
     {
-        __u32 entries = ring->ptrs.producer - ring->ptrs.consumer;
+        __u32 entries = *ring->producer - *ring->consumer;

         if (entries == 0)
             return -1;

         // read-barrier!

-        *item = ring->desc[ring->ptrs.consumer & (RING_SIZE - 1)];
-        ring->ptrs.consumer++;
+        *item = ring->desc[*ring->consumer & (RING_SIZE - 1)];
+        (*ring->consumer)++;
         return 0;
     }

     int enqueue_one(RING *ring, const RING_TYPE *item)
     {
-        u32 free_entries = RING_SIZE - (ring->ptrs.producer - ring->ptrs.consumer);
+        u32 free_entries = RING_SIZE - (*ring->producer - *ring->consumer);

         if (free_entries == 0)
             return -1;

-        ring->desc[ring->ptrs.producer & (RING_SIZE - 1)] = *item;
+        ring->desc[*ring->producer & (RING_SIZE - 1)] = *item;

         // write-barrier!

-        ring->ptrs.producer++;
+        (*ring->producer)++;
         return 0;
     }
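
The pointer-based ring helpers in the documentation hunk above are pseudocode (the `RING` typedefs are commented out and the barriers are only placeholders). A compilable user-space sketch of the same idea, for the UMEM ring of `__u64` addrs, might look like the following. It rests on several assumptions: `struct umem_ring` and the fixed `RING_SIZE` of 8 are hypothetical stand-ins, C11 fences take the place of the "read-barrier!" and "write-barrier!" comments, and the plain loads and stores of the indices are only adequate for a single-threaded demonstration. A real producer/consumer pair sharing the ring with the kernel would need properly ordered accesses to the indices as well.

```c
#include <stdatomic.h>
#include <stdint.h>

#define RING_SIZE 8u /* assumed for the sketch; must be a power of two */

/* Hypothetical stand-in for the documented struct xdp_umem_ring. */
struct umem_ring {
	uint32_t *producer;
	uint32_t *consumer;
	uint64_t *desc;
};

/* Take one addr off the ring; returns -1 if the ring is empty. */
static int dequeue_one(struct umem_ring *ring, uint64_t *item)
{
	uint32_t entries = *ring->producer - *ring->consumer;

	if (entries == 0)
		return -1;

	/* Stands in for the read barrier: read the entry after the index. */
	atomic_thread_fence(memory_order_acquire);

	*item = ring->desc[*ring->consumer & (RING_SIZE - 1)];
	(*ring->consumer)++;
	return 0;
}

/* Put one addr on the ring; returns -1 if the ring is full. */
static int enqueue_one(struct umem_ring *ring, const uint64_t *item)
{
	uint32_t free_entries = RING_SIZE - (*ring->producer - *ring->consumer);

	if (free_entries == 0)
		return -1;

	ring->desc[*ring->producer & (RING_SIZE - 1)] = *item;

	/* Stands in for the write barrier: publish the entry before the index. */
	atomic_thread_fence(memory_order_release);

	(*ring->producer)++;
	return 0;
}
```

Because the producer and consumer indices are free-running 32-bit counters that are only masked when indexing `desc`, the subtraction `producer - consumer` gives the fill level even after the counters wrap, provided `RING_SIZE` is a power of two.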

12 changes: 5 additions & 7 deletions include/uapi/linux/if_xdp.h
@@ -48,8 +48,8 @@ struct xdp_mmap_offsets {
 struct xdp_umem_reg {
 	__u64 addr; /* Start of packet data area */
 	__u64 len; /* Length of packet data area */
-	__u32 frame_size; /* Frame size */
-	__u32 frame_headroom; /* Frame head room */
+	__u32 chunk_size;
+	__u32 headroom;
 };

 struct xdp_statistics {
@@ -66,13 +66,11 @@ struct xdp_statistics {

 /* Rx/Tx descriptor */
 struct xdp_desc {
-	__u32 idx;
+	__u64 addr;
 	__u32 len;
-	__u16 offset;
-	__u8 flags;
-	__u8 padding[5];
+	__u32 options;
 };

-/* UMEM descriptor is __u32 */
+/* UMEM descriptor is __u64 */

 #endif /* _LINUX_IF_XDP_H */
33 changes: 15 additions & 18 deletions net/xdp/xdp_umem.c
@@ -14,7 +14,7 @@

 #include "xdp_umem.h"

-#define XDP_UMEM_MIN_FRAME_SIZE 2048
+#define XDP_UMEM_MIN_CHUNK_SIZE 2048

 static void xdp_umem_unpin_pages(struct xdp_umem *umem)
 {
@@ -151,12 +151,12 @@ static int xdp_umem_account_pages(struct xdp_umem *umem)

 static int xdp_umem_reg(struct xdp_umem *umem, struct xdp_umem_reg *mr)
 {
-	u32 frame_size = mr->frame_size, frame_headroom = mr->frame_headroom;
+	u32 chunk_size = mr->chunk_size, headroom = mr->headroom;
+	unsigned int chunks, chunks_per_page;
 	u64 addr = mr->addr, size = mr->len;
-	unsigned int nframes, nfpp;
 	int size_chk, err;

-	if (frame_size < XDP_UMEM_MIN_FRAME_SIZE || frame_size > PAGE_SIZE) {
+	if (chunk_size < XDP_UMEM_MIN_CHUNK_SIZE || chunk_size > PAGE_SIZE) {
 		/* Strictly speaking we could support this, if:
 		 * - huge pages, or*
 		 * - using an IOMMU, or
@@ -166,7 +166,7 @@ static int xdp_umem_reg(struct xdp_umem *umem, struct xdp_umem_reg *mr)
 		return -EINVAL;
 	}

-	if (!is_power_of_2(frame_size))
+	if (!is_power_of_2(chunk_size))
 		return -EINVAL;

 	if (!PAGE_ALIGNED(addr)) {
@@ -179,33 +179,30 @@ static int xdp_umem_reg(struct xdp_umem *umem, struct xdp_umem_reg *mr)
 	if ((addr + size) < addr)
 		return -EINVAL;

-	nframes = (unsigned int)div_u64(size, frame_size);
-	if (nframes == 0 || nframes > UINT_MAX)
+	chunks = (unsigned int)div_u64(size, chunk_size);
+	if (chunks == 0)
 		return -EINVAL;

-	nfpp = PAGE_SIZE / frame_size;
-	if (nframes < nfpp || nframes % nfpp)
+	chunks_per_page = PAGE_SIZE / chunk_size;
+	if (chunks < chunks_per_page || chunks % chunks_per_page)
 		return -EINVAL;

-	frame_headroom = ALIGN(frame_headroom, 64);
+	headroom = ALIGN(headroom, 64);

-	size_chk = frame_size - frame_headroom - XDP_PACKET_HEADROOM;
+	size_chk = chunk_size - headroom - XDP_PACKET_HEADROOM;
 	if (size_chk < 0)
 		return -EINVAL;

 	umem->pid = get_task_pid(current, PIDTYPE_PID);
-	umem->size = (size_t)size;
 	umem->address = (unsigned long)addr;
-	umem->props.frame_size = frame_size;
-	umem->props.nframes = nframes;
-	umem->frame_headroom = frame_headroom;
+	umem->props.chunk_mask = ~((u64)chunk_size - 1);
+	umem->props.size = size;
+	umem->headroom = headroom;
+	umem->chunk_size_nohr = chunk_size - headroom;
 	umem->npgs = size / PAGE_SIZE;
 	umem->pgs = NULL;
 	umem->user = NULL;

-	umem->frame_size_log2 = ilog2(frame_size);
-	umem->nfpp_mask = nfpp - 1;
-	umem->nfpplog2 = ilog2(nfpp);
 	refcount_set(&umem->users, 1);

 	err = xdp_umem_account_pages(umem);
27 changes: 6 additions & 21 deletions net/xdp/xdp_umem.h
@@ -18,35 +18,20 @@ struct xdp_umem {
 	struct xsk_queue *cq;
 	struct page **pgs;
 	struct xdp_umem_props props;
-	u32 npgs;
-	u32 frame_headroom;
-	u32 nfpp_mask;
-	u32 nfpplog2;
-	u32 frame_size_log2;
+	u32 headroom;
+	u32 chunk_size_nohr;
 	struct user_struct *user;
 	struct pid *pid;
 	unsigned long address;
-	size_t size;
 	refcount_t users;
 	struct work_struct work;
+	u32 npgs;
 };

-static inline char *xdp_umem_get_data(struct xdp_umem *umem, u32 idx)
-{
-	u64 pg, off;
-	char *data;
-
-	pg = idx >> umem->nfpplog2;
-	off = (idx & umem->nfpp_mask) << umem->frame_size_log2;
-
-	data = page_address(umem->pgs[pg]);
-	return data + off;
-}
-
-static inline char *xdp_umem_get_data_with_headroom(struct xdp_umem *umem,
-						    u32 idx)
+static inline char *xdp_umem_get_data(struct xdp_umem *umem, u64 addr)
 {
-	return xdp_umem_get_data(umem, idx) + umem->frame_headroom;
+	return page_address(umem->pgs[addr >> PAGE_SHIFT]) +
+		(addr & (PAGE_SIZE - 1));
 }

 bool xdp_umem_validate_queues(struct xdp_umem *umem);
4 changes: 2 additions & 2 deletions net/xdp/xdp_umem_props.h
@@ -7,8 +7,8 @@
 #define XDP_UMEM_PROPS_H_

 struct xdp_umem_props {
-	u32 frame_size;
-	u32 nframes;
+	u64 chunk_mask;
+	u64 size;
 };

 #endif /* XDP_UMEM_PROPS_H_ */
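
The `chunk_mask` computed in `xdp_umem_reg()` and the page lookup in the new `xdp_umem_get_data()` can be tried out in user space. The sketch below is only an illustration under stated assumptions: the `SKETCH_PAGE_SHIFT` and `SKETCH_PAGE_SIZE` macros assume 4 KiB pages and stand in for the kernel's `PAGE_SHIFT` and `PAGE_SIZE`, and the helper names `chunk_mask`, `chunk_base`, `page_index` and `page_offset` are hypothetical, not kernel functions.

```c
#include <stdint.h>

/* Assumption for the sketch: 4 KiB pages. */
#define SKETCH_PAGE_SHIFT 12
#define SKETCH_PAGE_SIZE  (1ull << SKETCH_PAGE_SHIFT)

/* Same expression as umem->props.chunk_mask; chunk_size is a power of two. */
static uint64_t chunk_mask(uint32_t chunk_size)
{
	return ~((uint64_t)chunk_size - 1);
}

/* Start of the chunk that contains addr (the low bits are masked off). */
static uint64_t chunk_base(uint64_t addr, uint32_t chunk_size)
{
	return addr & chunk_mask(chunk_size);
}

/* Index into the pinned page array, as in umem->pgs[addr >> PAGE_SHIFT]. */
static uint64_t page_index(uint64_t addr)
{
	return addr >> SKETCH_PAGE_SHIFT;
}

/* Byte offset within that page, as in addr & (PAGE_SIZE - 1). */
static uint64_t page_offset(uint64_t addr)
{
	return addr & (SKETCH_PAGE_SIZE - 1);
}
```

For a chunk size of 2048, the addrs 2048, 2050 and 3000 all map to chunk base 2048, matching the example in the updated documentation. An addr of 6144 falls in page 1 at offset 2048 under the 4 KiB page assumption.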