Skip to content

Commit

Permalink
Merge branch 'bpf-sockmap-ulp'
Browse files Browse the repository at this point in the history
John Fastabend says:

====================
This series adds a BPF hook for sendmsg and senfile by using
the ULP infrastructure and sockmap. A simple pseudocode example
would be,

  // load the programs
  bpf_prog_load(SOCKMAP_TCP_MSG_PROG, BPF_PROG_TYPE_SK_MSG,
                &obj, &msg_prog);

  // lookup the sockmap
  bpf_map_msg = bpf_object__find_map_by_name(obj, "my_sock_map");

  // get fd for sockmap
  map_fd_msg = bpf_map__fd(bpf_map_msg);

  // attach program to sockmap
  bpf_prog_attach(msg_prog, map_fd_msg, BPF_SK_MSG_VERDICT, 0);

  // Add a socket 'fd' to sockmap at location 'i'
  bpf_map_update_elem(map_fd_msg, &i, fd, BPF_ANY);

After the above snippet any socket attached to the map would run
msg_prog on sendmsg and sendfile system calls.

Three additional helpers are added bpf_msg_apply_bytes(),
bpf_msg_cork_bytes(), and bpf_msg_pull_data(). With
bpf_msg_apply_bytes BPF programs can tell the infrastructure how
many bytes the given verdict should apply to. This has two cases.
First, a BPF program applies verdict to fewer bytes than in the
current sendmsg/sendfile msg this will apply the verdict to the
first N bytes of the message then run the BPF program again with
data pointers recalculated to the N+1 byte. The second case is the
BPF program applies a verdict to more bytes than the current sendmsg
or sendfile system call. In this case the infrastructure will cache
the verdict and apply it to future sendmsg/sendfile calls until the
byte limit is reached. This avoids the overhead of running BPF
programs on large payloads.

The helper bpf_msg_cork_bytes() handles a different case where
a BPF program can not reach a verdict on a msg until it receives
more bytes AND the program doesn't want to forward the packet
until it is known to be "good". The example case being a user
(albeit a dumb one probably) sends a N byte header in 1B system
calls. The BPF program can call bpf_msg_cork_bytes with the
required byte limit to reach a verdict and then the program will
only be called again once N bytes are received.

The last helper added in this series is bpf_msg_pull_data(). It
is used to pull data in for modification or reading. Similar to
how sk_pull_data() works msg_pull_data can be used to access data
not in the initial (data_start, data_end) range. For sendpage()
calls this is needed if any data is accessed because the BPF
sendpage hook initializes the data_start and data_end pointers to
zero. We do this because sendpage data is shared with the user
and can be modified during or after the BPF verdict possibly
invalidating any verdict the BPF program decides. For sendmsg
the data is already copied by the sendmsg bpf infrastructure so
we only copy the data if the user request a data range that is
not already linearized. This happens if the user requests larger
blocks of data that are not in a single scatterlist element. The
common case seems to be accessing headers which normally are
in the first scatterlist element and already linearized.

For more examples please review the sample program. There are
examples for all the actions and helpers there.

Patches 1-8 implement the above sockmap/BPF infrastructure. The
remaining patches flush out some minimal selftests and the sample
sockmap program. The sockmap sample program is the main vehicle
for testing this infrastructure and will be moved into selftests
shortly. The final patch in this series is a simple shell script
to run a set of tests. These are the tests I run after any changes
to sockmap. The next task on the list after this series is to
push those into selftests so we can avoid manually testing.

Couple notes on future items in the pipeline,

  0. move sample sockmap programs into selftests (noted above)
  1. add additional support for tcp flags, most are ignored now.
  2. add a Documentation/bpf/sockmap file with these details
  3. support stacked ULP types to allow this and ktls to cooperate
  4. Ingress flag support, redirect only supports egress here. The
     other redirect helpers support ingress and egress flags.
  5. add optimizations, I cut a few optimizations here in the
     first iteration of the code for later study/implementation

-v3 updates
  : u32 data pointers in msg_md changed to void *
  : page_address NULL check and flag verification in msg_pull_data
  : remove old note in commit msg that is no longer relevant
  : remove enum sk_msg_action its not used anywhere
  : fixup test_verifier W -> DW insn to account for data pointers
  : unintentionally dropped a smap_stop_tx() call in sockmap.c

I propagated the ACKs forward because above changes were small
one/two line changes.

-v2 updates (discussion):

Dave noticed that sendpage call was previously (in v1) running
on the data directly. This allowed users to potentially modify
the data after or during the BPF program. However doing a copy
automatically even if the data is not accessed has measurable
performance impact. So we added another helper modeled after
the existing skb_pull_data() helper to allow users to selectively
pull data from the msg. This is also useful in the sendmsg case
when users need to access data outside the first scatterlist
element or across scatterlist boundaries.

While doing this I also unified the sendmsg and sendfile handlers
a bit. Originally the sendfile call was optimized for never
touching the data. I've decided for a first submission to drop
this optimization and we can add it back later. It introduced
unnecessary complexity, at least for a first posting, for a
use case I have not entirely flushed out yet. When the use
case is deployed we can add it back if needed. Then we can
review concrete performance deltas as well on real-world
use-cases/applications.

Lastly, I reorganized the patches a bit. Now all sockmap
changes are in a single patch and each helper gets its own
patch. This, at least IMO, makes it easier to review because
sockmap changes are not spread across the patch series. On
the other hand now apply_bytes, cork_bytes logic is only
activated later in the series. But that should be OK.
====================

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
  • Loading branch information
Daniel Borkmann committed Mar 19, 2018
2 parents 318df9f + ae30727 commit d48ce3e
Show file tree
Hide file tree
Showing 26 changed files with 2,236 additions and 131 deletions.
1 change: 1 addition & 0 deletions include/linux/bpf.h
Original file line number Diff line number Diff line change
Expand Up @@ -21,6 +21,7 @@ struct bpf_verifier_env;
struct perf_event;
struct bpf_prog;
struct bpf_map;
struct sock;

/* map is generic key/value storage optionally accesible by eBPF programs */
struct bpf_map_ops {
Expand Down
1 change: 1 addition & 0 deletions include/linux/bpf_types.h
Original file line number Diff line number Diff line change
Expand Up @@ -13,6 +13,7 @@ BPF_PROG_TYPE(BPF_PROG_TYPE_LWT_OUT, lwt_inout)
BPF_PROG_TYPE(BPF_PROG_TYPE_LWT_XMIT, lwt_xmit)
BPF_PROG_TYPE(BPF_PROG_TYPE_SOCK_OPS, sock_ops)
BPF_PROG_TYPE(BPF_PROG_TYPE_SK_SKB, sk_skb)
BPF_PROG_TYPE(BPF_PROG_TYPE_SK_MSG, sk_msg)
#endif
#ifdef CONFIG_BPF_EVENTS
BPF_PROG_TYPE(BPF_PROG_TYPE_KPROBE, kprobe)
Expand Down
17 changes: 17 additions & 0 deletions include/linux/filter.h
Original file line number Diff line number Diff line change
Expand Up @@ -507,6 +507,22 @@ struct xdp_buff {
struct xdp_rxq_info *rxq;
};

struct sk_msg_buff {
void *data;
void *data_end;
__u32 apply_bytes;
__u32 cork_bytes;
int sg_copybreak;
int sg_start;
int sg_curr;
int sg_end;
struct scatterlist sg_data[MAX_SKB_FRAGS];
bool sg_copy[MAX_SKB_FRAGS];
__u32 key;
__u32 flags;
struct bpf_map *map;
};

/* Compute the linear packet data range [data, data_end) which
* will be accessed by various program types (cls_bpf, act_bpf,
* lwt, ...). Subsystems allowing direct data access must (!)
Expand Down Expand Up @@ -771,6 +787,7 @@ xdp_data_meta_unsupported(const struct xdp_buff *xdp)
void bpf_warn_invalid_xdp_action(u32 act);

struct sock *do_sk_redirect_map(struct sk_buff *skb);
struct sock *do_msg_redirect_map(struct sk_msg_buff *md);

#ifdef CONFIG_BPF_JIT
extern int bpf_jit_enable;
Expand Down
1 change: 1 addition & 0 deletions include/linux/socket.h
Original file line number Diff line number Diff line change
Expand Up @@ -287,6 +287,7 @@ struct ucred {
#define MSG_SENDPAGE_NOTLAST 0x20000 /* sendpage() internal : not the last page */
#define MSG_BATCH 0x40000 /* sendmmsg(): more messages coming */
#define MSG_EOF MSG_FIN
#define MSG_NO_SHARED_FRAGS 0x80000 /* sendpage() internal : page frags are not shared */

#define MSG_ZEROCOPY 0x4000000 /* Use user data in kernel path */
#define MSG_FASTOPEN 0x20000000 /* Send data in TCP SYN */
Expand Down
4 changes: 4 additions & 0 deletions include/net/sock.h
Original file line number Diff line number Diff line change
Expand Up @@ -2141,6 +2141,10 @@ static inline struct page_frag *sk_page_frag(struct sock *sk)

bool sk_page_frag_refill(struct sock *sk, struct page_frag *pfrag);

int sk_alloc_sg(struct sock *sk, int len, struct scatterlist *sg,
int sg_start, int *sg_curr, unsigned int *sg_size,
int first_coalesce);

/*
* Default write policy as shown to user space via poll/select/SIGIO
*/
Expand Down
25 changes: 24 additions & 1 deletion include/uapi/linux/bpf.h
Original file line number Diff line number Diff line change
Expand Up @@ -133,6 +133,7 @@ enum bpf_prog_type {
BPF_PROG_TYPE_SOCK_OPS,
BPF_PROG_TYPE_SK_SKB,
BPF_PROG_TYPE_CGROUP_DEVICE,
BPF_PROG_TYPE_SK_MSG,
};

enum bpf_attach_type {
Expand All @@ -143,6 +144,7 @@ enum bpf_attach_type {
BPF_SK_SKB_STREAM_PARSER,
BPF_SK_SKB_STREAM_VERDICT,
BPF_CGROUP_DEVICE,
BPF_SK_MSG_VERDICT,
__MAX_BPF_ATTACH_TYPE
};

Expand Down Expand Up @@ -718,6 +720,15 @@ union bpf_attr {
* int bpf_override_return(pt_regs, rc)
* @pt_regs: pointer to struct pt_regs
* @rc: the return value to set
*
* int bpf_msg_redirect_map(map, key, flags)
* Redirect msg to a sock in map using key as a lookup key for the
* sock in map.
* @map: pointer to sockmap
* @key: key to lookup sock in map
* @flags: reserved for future use
* Return: SK_PASS
*
*/
#define __BPF_FUNC_MAPPER(FN) \
FN(unspec), \
Expand Down Expand Up @@ -779,7 +790,11 @@ union bpf_attr {
FN(perf_prog_read_value), \
FN(getsockopt), \
FN(override_return), \
FN(sock_ops_cb_flags_set),
FN(sock_ops_cb_flags_set), \
FN(msg_redirect_map), \
FN(msg_apply_bytes), \
FN(msg_cork_bytes), \
FN(msg_pull_data),

/* integer value in 'imm' field of BPF_CALL instruction selects which helper
* function eBPF program intends to call
Expand Down Expand Up @@ -942,6 +957,14 @@ enum sk_action {
SK_PASS,
};

/* user accessible metadata for SK_MSG packet hook, new fields must
* be added to the end of this structure
*/
struct sk_msg_md {
void *data;
void *data_end;
};

#define BPF_TAG_SIZE 8

struct bpf_prog_info {
Expand Down
Loading

0 comments on commit d48ce3e

Please sign in to comment.