Skip to content

Commit

Permalink
Merge branch 'bpf-direct-pkt-access'
Browse files Browse the repository at this point in the history
Alexei Starovoitov says:

====================
bpf: introduce direct packet access

This set of patches introduce 'direct packet access' from
cls_bpf and act_bpf programs (which are root only).

Current bpf programs use LD_ABS, LD_INS instructions which have
to do 'if (off < skb_headlen)' for every packet access.
It's ok for socket filters, but too slow for XDP, since single
LD_ABS insn consumes 3% of cpu. Therefore we have to amortize the cost
of length check over multiple packet accesses via direct access
to skb->data, data_end pointers.

The existing packet parser typically look like:
  if (load_half(skb, offsetof(struct ethhdr, h_proto)) != ETH_P_IP)
     return 0;
  if (load_byte(skb, ETH_HLEN + offsetof(struct iphdr, protocol)) != IPPROTO_UDP ||
      load_byte(skb, ETH_HLEN) != 0x45)
     return 0;
  ...
with 'direct packet access' the bpf program becomes:
   void *data = (void *)(long)skb->data;
   void *data_end = (void *)(long)skb->data_end;
   struct eth_hdr *eth = data;
   struct iphdr *iph = data + sizeof(*eth);

   if (data + sizeof(*eth) + sizeof(*iph) + sizeof(*udp) > data_end)
      return 0;
   if (eth->h_proto != htons(ETH_P_IP))
      return 0;
   if (iph->protocol != IPPROTO_UDP || iph->ihl != 5)
      return 0;
   ...
which is more natural to write and significantly faster.
See patch 6 for performance tests:
21Mpps(old) vs 24Mpps(new) with just 5 loads.
For more complex parsers the performance gain is higher.

The other approach implemented in [1] was adding two new instructions
to interpreter and JITs and was too hard to use from llvm side.
The approach presented here doesn't need any instruction changes,
but the verifier has to work harder to check safety of the packet access.

Patch 1 prepares the code and Patch 2 adds new checks for direct
packet access and all of them are gated with 'env->allow_ptr_leaks'
which is true for root only.
Patch 3 improves search pruning for large programs.
Patch 4 wires in verifier's changes with net/core/filter side.
Patch 5 updates docs
Patches 6 and 7 add tests.

[1] https://git.kernel.org/cgit/linux/kernel/git/ast/bpf.git/?h=ld_abs_dw
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
  • Loading branch information
David S. Miller committed May 6, 2016
2 parents 95aef7c + 883e44e commit 4b307a8
Show file tree
Hide file tree
Showing 14 changed files with 1,004 additions and 82 deletions.
85 changes: 83 additions & 2 deletions Documentation/networking/filter.txt
Original file line number Diff line number Diff line change
Expand Up @@ -1095,6 +1095,87 @@ all use cases.

See details of eBPF verifier in kernel/bpf/verifier.c

Direct packet access
--------------------
In cls_bpf and act_bpf programs the verifier allows direct access to the packet
data via skb->data and skb->data_end pointers.
Ex:
1: r4 = *(u32 *)(r1 +80) /* load skb->data_end */
2: r3 = *(u32 *)(r1 +76) /* load skb->data */
3: r5 = r3
4: r5 += 14
5: if r5 > r4 goto pc+16
R1=ctx R3=pkt(id=0,off=0,r=14) R4=pkt_end R5=pkt(id=0,off=14,r=14) R10=fp
6: r0 = *(u16 *)(r3 +12) /* access 12 and 13 bytes of the packet */

this 2byte load from the packet is safe to do, since the program author
did check 'if (skb->data + 14 > skb->data_end) goto err' at insn #5 which
means that in the fall-through case the register R3 (which points to skb->data)
has at least 14 directly accessible bytes. The verifier marks it
as R3=pkt(id=0,off=0,r=14).
id=0 means that no additional variables were added to the register.
off=0 means that no additional constants were added.
r=14 is the range of safe access which means that bytes [R3, R3 + 14) are ok.
Note that R5 is marked as R5=pkt(id=0,off=14,r=14). It also points
to the packet data, but constant 14 was added to the register, so
it now points to 'skb->data + 14' and accessible range is [R5, R5 + 14 - 14)
which is zero bytes.

More complex packet access may look like:
R0=imm1 R1=ctx R3=pkt(id=0,off=0,r=14) R4=pkt_end R5=pkt(id=0,off=14,r=14) R10=fp
6: r0 = *(u8 *)(r3 +7) /* load 7th byte from the packet */
7: r4 = *(u8 *)(r3 +12)
8: r4 *= 14
9: r3 = *(u32 *)(r1 +76) /* load skb->data */
10: r3 += r4
11: r2 = r1
12: r2 <<= 48
13: r2 >>= 48
14: r3 += r2
15: r2 = r3
16: r2 += 8
17: r1 = *(u32 *)(r1 +80) /* load skb->data_end */
18: if r2 > r1 goto pc+2
R0=inv56 R1=pkt_end R2=pkt(id=2,off=8,r=8) R3=pkt(id=2,off=0,r=8) R4=inv52 R5=pkt(id=0,off=14,r=14) R10=fp
19: r1 = *(u8 *)(r3 +4)
The state of the register R3 is R3=pkt(id=2,off=0,r=8)
id=2 means that two 'r3 += rX' instructions were seen, so r3 points to some
offset within a packet and since the program author did
'if (r3 + 8 > r1) goto err' at insn #18, the safe range is [R3, R3 + 8).
The verifier only allows 'add' operation on packet registers. Any other
operation will set the register state to 'unknown_value' and it won't be
available for direct packet access.
Operation 'r3 += rX' may overflow and become less than original skb->data,
therefore the verifier has to prevent that. So it tracks the number of
upper zero bits in all 'uknown_value' registers, so when it sees
'r3 += rX' instruction and rX is more than 16-bit value, it will error as:
"cannot add integer value with N upper zero bits to ptr_to_packet"
Ex. after insn 'r4 = *(u8 *)(r3 +12)' (insn #7 above) the state of r4 is
R4=inv56 which means that upper 56 bits on the register are guaranteed
to be zero. After insn 'r4 *= 14' the state becomes R4=inv52, since
multiplying 8-bit value by constant 14 will keep upper 52 bits as zero.
Similarly 'r2 >>= 48' will make R2=inv48, since the shift is not sign
extending. This logic is implemented in evaluate_reg_alu() function.

The end result is that bpf program author can access packet directly
using normal C code as:
void *data = (void *)(long)skb->data;
void *data_end = (void *)(long)skb->data_end;
struct eth_hdr *eth = data;
struct iphdr *iph = data + sizeof(*eth);
struct udphdr *udp = data + sizeof(*eth) + sizeof(*iph);

if (data + sizeof(*eth) + sizeof(*iph) + sizeof(*udp) > data_end)
return 0;
if (eth->h_proto != htons(ETH_P_IP))
return 0;
if (iph->protocol != IPPROTO_UDP || iph->ihl != 5)
return 0;
if (udp->dest == 53 || udp->source == 9)
...;
which makes such programs easier to write comparing to LD_ABS insn
and significantly faster.

eBPF maps
---------
'maps' is a generic storage of different types for sharing data between kernel
Expand Down Expand Up @@ -1293,5 +1374,5 @@ to give potential BPF hackers or security auditors a better overview of
the underlying architecture.

Jay Schulist <jschlst@samba.org>
Daniel Borkmann <dborkman@redhat.com>
Alexei Starovoitov <ast@plumgrid.com>
Daniel Borkmann <daniel@iogearbox.net>
Alexei Starovoitov <ast@kernel.org>
16 changes: 16 additions & 0 deletions include/linux/filter.h
Original file line number Diff line number Diff line change
Expand Up @@ -352,6 +352,22 @@ struct sk_filter {

#define BPF_SKB_CB_LEN QDISC_CB_PRIV_LEN

struct bpf_skb_data_end {
struct qdisc_skb_cb qdisc_cb;
void *data_end;
};

/* compute the linear packet data range [data, data_end) which
* will be accessed by cls_bpf and act_bpf programs
*/
static inline void bpf_compute_data_end(struct sk_buff *skb)
{
struct bpf_skb_data_end *cb = (struct bpf_skb_data_end *)skb->cb;

BUILD_BUG_ON(sizeof(*cb) > FIELD_SIZEOF(struct sk_buff, cb));
cb->data_end = skb->data + skb_headlen(skb);
}

static inline u8 *bpf_skb_cb(struct sk_buff *skb)
{
/* eBPF programs may read/write skb->cb[] area to transfer meta
Expand Down
2 changes: 2 additions & 0 deletions include/uapi/linux/bpf.h
Original file line number Diff line number Diff line change
Expand Up @@ -370,6 +370,8 @@ struct __sk_buff {
__u32 cb[5];
__u32 hash;
__u32 tc_classid;
__u32 data;
__u32 data_end;
};

struct bpf_tunnel_key {
Expand Down
5 changes: 5 additions & 0 deletions kernel/bpf/core.c
Original file line number Diff line number Diff line change
Expand Up @@ -794,6 +794,11 @@ void __weak bpf_int_jit_compile(struct bpf_prog *prog)
{
}

bool __weak bpf_helper_changes_skb_data(void *func)
{
return false;
}

/* To execute LD_ABS/LD_IND instructions __bpf_prog_run() may call
* skb_copy_bits(), so provide a weak definition of it for NET-less config.
*/
Expand Down
Loading

0 comments on commit 4b307a8

Please sign in to comment.