Merge branch 'bpf_get_stack'

Yonghong Song says:

====================
Currently, the stackmap and the bpf_get_stackid helper are provided
for bpf programs to get stack traces. This approach has
a limitation, though: if two stack traces have the same hash,
only one gets stored in the stackmap table, regardless of
whether BPF_F_REUSE_STACKID is specified or not,
so some stack traces may be missing from the user's perspective.

This patch series implements a new helper, bpf_get_stack, which
sends stack traces directly to the bpf program. The bpf program
can then see all stack traces, and can do in-kernel
processing or send the stack traces to user space through
a shared map or bpf_perf_event_output.
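
A rough usage sketch (editor's illustration, not part of this series;
the attach point, buffer depth and the libbpf-style SEC()/bpf_helpers.h
scaffolding are assumptions), showing a tracing program pulling the
stack into its own buffer:

    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>      /* SEC(), helper prototypes (libbpf) */

    #define MAX_DEPTH 32              /* 32 * 8 = 256 bytes, fits the BPF stack */

    SEC("kprobe/some_kernel_func")    /* attach point is only an example */
    int collect_stack(struct pt_regs *ctx)
    {
            __u64 ips[MAX_DEPTH];
            long len;

            /* 0 selects the kernel stack; OR in BPF_F_USER_STACK for the
             * user stack instead.
             */
            len = bpf_get_stack(ctx, ips, sizeof(ips), 0);
            if (len < 0)
                    return 0;

            /* len bytes of instruction pointers are now in ips[] and can
             * be processed here or forwarded to user space via a shared
             * map or bpf_perf_event_output().
             */
            return 0;
    }

    char _license[] SEC("license") = "GPL";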

Patches #1 and #2 implement the core kernel support.
Patch #3 removes two never-hit branches in the verifier.
Patches #4 and #5 are two verifier improvements that make
bpf programming easier. Patch #6 syncs the new helper
to the tools headers. Patch #7 moves the perf_event polling
and ksym lookup code from samples/bpf to
tools/testing/selftests/bpf. Patch #8 adds a verifier
test in tools/bpf for the new verifier change.
Patches #9 and #10 add tests for a raw tracepoint prog
and a tracepoint prog, respectively.

Changelogs:
  v8 -> v9:
    . make function perf_event_mmap (in trace_helpers.c) extern
      to decouple perf_event_mmap and perf_event_poller.
    . add jit-enabled handling for kernel stack verification
      in Patch #9. Since we do not have a good way to
      verify a jit-enabled kernel stack, just return true if
      the kernel stack is not empty.
    . in Patch #9, use raw_syscalls/sys_enter instead of
      sched/sched_switch, and remove the call to cmd
      "task 1 dd if=/dev/zero of=/dev/null", which left a
      dangling process behind after the program exited.
  v7 -> v8:
    . rebase on top of latest bpf-next
    . simplify BPF_ARSH dst_reg->smin_val/smax_value tracking
    . rewrite the description of bpf_get_stack() in uapi bpf.h
      based on the new format.
  v6 -> v7:
    . do perf callchain buffer allocation inside the
      verifier, so that if prog->has_callchain_buf is set,
      the buffer is guaranteed to have been allocated.
    . change condition "trace_nr <= skip" to "trace_nr < skip"
      so that a zero-size buffer returns 0 instead of -EFAULT.
  v5 -> v6:
    . after refining return register smax_value and umax_value
      for helpers bpf_get_stack and bpf_probe_read_str,
      bounds and var_off of the return register are further refined.
    . added missing commit message for tools header sync commit.
    . removed one unnecessary empty line.
  v4 -> v5:
    . relied on dst_reg->var_off to refine umin_val/umax_val
      in the verifier's handling of BPF_ARSH value range tracking,
      as suggested by Edward.
  v3 -> v4:
    . fixed a bug when meta ptr is set to NULL in check_func_arg.
    . introduced tnum_arshift and added detailed comments on
      the underlying implementation.
    . avoided using a VLA in tools/bpf test_progs.
  v2 -> v3:
    . used meta to track helper memory size argument
    . implemented range checking for ARSH in verifier
    . moved perf event polling and ksym related functions
      from samples/bpf to tools/bpf
    . added a test to compare build ids between bpf_get_stackid
      and bpf_get_stack.
  v1 -> v2:
    . fixed compilation error when CONFIG_PERF_EVENTS is not enabled
====================

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Alexei Starovoitov committed Apr 29, 2018
2 parents 2c25fc9 + 79b4535 commit f60ad0a
Showing 27 changed files with 927 additions and 222 deletions.
1 change: 1 addition & 0 deletions include/linux/bpf.h
@@ -692,6 +692,7 @@ extern const struct bpf_func_proto bpf_get_current_comm_proto;
extern const struct bpf_func_proto bpf_skb_vlan_push_proto;
extern const struct bpf_func_proto bpf_skb_vlan_pop_proto;
extern const struct bpf_func_proto bpf_get_stackid_proto;
extern const struct bpf_func_proto bpf_get_stack_proto;
extern const struct bpf_func_proto bpf_sock_map_update_proto;

/* Shared helpers among cBPF and eBPF. */
3 changes: 2 additions & 1 deletion include/linux/filter.h
@@ -468,7 +468,8 @@ struct bpf_prog {
dst_needed:1, /* Do we need dst entry? */
blinded:1, /* Was blinded */
is_func:1, /* program is a bpf function */
kprobe_override:1; /* Do we override a kprobe? */
kprobe_override:1, /* Do we override a kprobe? */
has_callchain_buf:1; /* callchain buffer allocated? */
enum bpf_prog_type type; /* Type of BPF program */
enum bpf_attach_type expected_attach_type; /* For some prog types */
u32 len; /* Number of filter blocks */
4 changes: 3 additions & 1 deletion include/linux/tnum.h
@@ -23,8 +23,10 @@ struct tnum tnum_range(u64 min, u64 max);
/* Arithmetic and logical ops */
/* Shift a tnum left (by a fixed shift) */
struct tnum tnum_lshift(struct tnum a, u8 shift);
/* Shift a tnum right (by a fixed shift) */
/* Shift (rsh) a tnum right (by a fixed shift) */
struct tnum tnum_rshift(struct tnum a, u8 shift);
/* Shift (arsh) a tnum right (by a fixed min_shift) */
struct tnum tnum_arshift(struct tnum a, u8 min_shift);
/* Add two tnums, return @a + @b */
struct tnum tnum_add(struct tnum a, struct tnum b);
/* Subtract two tnums, return @a - @b */
42 changes: 40 additions & 2 deletions include/uapi/linux/bpf.h
@@ -1767,6 +1767,40 @@ union bpf_attr {
* **CONFIG_XFRM** configuration option.
* Return
* 0 on success, or a negative error in case of failure.
*
* int bpf_get_stack(struct pt_regs *regs, void *buf, u32 size, u64 flags)
* Description
* Return a user or a kernel stack in bpf program provided buffer.
* To achieve this, the helper needs *ctx*, which is a pointer
* to the context on which the tracing program is executed.
* To store the stacktrace, the bpf program provides *buf* with
* a nonnegative *size*.
*
* The last argument, *flags*, holds the number of stack frames to
* skip (from 0 to 255), masked with
* **BPF_F_SKIP_FIELD_MASK**. The next bits can be used to set
* the following flags:
*
* **BPF_F_USER_STACK**
* Collect a user space stack instead of a kernel stack.
* **BPF_F_USER_BUILD_ID**
* Collect buildid+offset instead of ips for user stack,
* only valid if **BPF_F_USER_STACK** is also specified.
*
* **bpf_get_stack**\ () can collect up to
* **PERF_MAX_STACK_DEPTH** both kernel and user frames, subject
* to a sufficiently large buffer size. Note that
* this limit can be controlled with the **sysctl** program, and
* that it should be manually increased in order to profile long
* user stacks (such as stacks for Java programs). To do so, use:
*
* ::
*
* # sysctl kernel.perf_event_max_stack=<new value>
*
* Return
* A non-negative value equal to or less than *size* on success, or
* a negative error in case of failure.
*/
#define __BPF_FUNC_MAPPER(FN) \
FN(unspec), \
@@ -1835,7 +1869,8 @@ union bpf_attr {
FN(msg_pull_data), \
FN(bind), \
FN(xdp_adjust_tail), \
FN(skb_get_xfrm_state),
FN(skb_get_xfrm_state), \
FN(get_stack),

/* integer value in 'imm' field of BPF_CALL instruction selects which helper
* function eBPF program intends to call
@@ -1869,11 +1904,14 @@ enum bpf_func_id {
/* BPF_FUNC_skb_set_tunnel_key and BPF_FUNC_skb_get_tunnel_key flags. */
#define BPF_F_TUNINFO_IPV6 (1ULL << 0)

/* BPF_FUNC_get_stackid flags. */
/* flags for both BPF_FUNC_get_stackid and BPF_FUNC_get_stack. */
#define BPF_F_SKIP_FIELD_MASK 0xffULL
#define BPF_F_USER_STACK (1ULL << 8)
/* flags used by BPF_FUNC_get_stackid only. */
#define BPF_F_FAST_STACK_CMP (1ULL << 9)
#define BPF_F_REUSE_STACKID (1ULL << 10)
/* flags used by BPF_FUNC_get_stack only. */
#define BPF_F_USER_BUILD_ID (1ULL << 11)

/* BPF_FUNC_skb_set_tunnel_key flags. */
#define BPF_F_ZERO_CSUM_TX (1ULL << 1)
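
A hedged fragment (editor's illustration; frame count and variable names
are made up) showing how the flag layout above is used in practice: a
skip count masked by BPF_F_SKIP_FIELD_MASK combined with the user-stack
and build-id bits, with the buffer sized in bpf_stack_build_id units as
the helper requires:

    /* Inside a tracing program: collect a user stack as build_id+offset
     * records, skipping the first two frames. The buffer size must be a
     * multiple of the element size, here sizeof(struct bpf_stack_build_id).
     */
    struct bpf_stack_build_id ids[8];            /* 8 * 32 = 256 bytes */
    __u64 flags = (2 & BPF_F_SKIP_FIELD_MASK) |  /* skip 2 frames */
                  BPF_F_USER_STACK |             /* user stack, not kernel */
                  BPF_F_USER_BUILD_ID;           /* build_id+offset, not ips */
    long len = bpf_get_stack(ctx, ids, sizeof(ids), flags);

    /* On success, len / sizeof(ids[0]) complete entries were written and
     * the remainder of the buffer has been zeroed by the helper. */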
5 changes: 5 additions & 0 deletions kernel/bpf/core.c
@@ -31,6 +31,7 @@
#include <linux/rbtree_latch.h>
#include <linux/kallsyms.h>
#include <linux/rcupdate.h>
#include <linux/perf_event.h>

#include <asm/unaligned.h>

@@ -1722,6 +1723,10 @@ static void bpf_prog_free_deferred(struct work_struct *work)
aux = container_of(work, struct bpf_prog_aux, work);
if (bpf_prog_is_dev_bound(aux))
bpf_prog_offload_destroy(aux->prog);
#ifdef CONFIG_PERF_EVENTS
if (aux->prog->has_callchain_buf)
put_callchain_buffers();
#endif
for (i = 0; i < aux->func_cnt; i++)
bpf_jit_free(aux->func[i]);
if (aux->func_cnt) {
80 changes: 72 additions & 8 deletions kernel/bpf/stackmap.c
@@ -262,16 +262,11 @@ static int stack_map_get_build_id(struct vm_area_struct *vma,
return ret;
}

static void stack_map_get_build_id_offset(struct bpf_map *map,
struct stack_map_bucket *bucket,
static void stack_map_get_build_id_offset(struct bpf_stack_build_id *id_offs,
u64 *ips, u32 trace_nr, bool user)
{
int i;
struct vm_area_struct *vma;
struct bpf_stack_build_id *id_offs;

bucket->nr = trace_nr;
id_offs = (struct bpf_stack_build_id *)bucket->data;

/*
* We cannot do up_read() in nmi context, so build_id lookup is
@@ -361,8 +356,10 @@ BPF_CALL_3(bpf_get_stackid, struct pt_regs *, regs, struct bpf_map *, map,
pcpu_freelist_pop(&smap->freelist);
if (unlikely(!new_bucket))
return -ENOMEM;
stack_map_get_build_id_offset(map, new_bucket, ips,
trace_nr, user);
new_bucket->nr = trace_nr;
stack_map_get_build_id_offset(
(struct bpf_stack_build_id *)new_bucket->data,
ips, trace_nr, user);
trace_len = trace_nr * sizeof(struct bpf_stack_build_id);
if (hash_matches && bucket->nr == trace_nr &&
memcmp(bucket->data, new_bucket->data, trace_len) == 0) {
@@ -405,6 +402,73 @@ const struct bpf_func_proto bpf_get_stackid_proto = {
.arg3_type = ARG_ANYTHING,
};

BPF_CALL_4(bpf_get_stack, struct pt_regs *, regs, void *, buf, u32, size,
u64, flags)
{
u32 init_nr, trace_nr, copy_len, elem_size, num_elem;
bool user_build_id = flags & BPF_F_USER_BUILD_ID;
u32 skip = flags & BPF_F_SKIP_FIELD_MASK;
bool user = flags & BPF_F_USER_STACK;
struct perf_callchain_entry *trace;
bool kernel = !user;
int err = -EINVAL;
u64 *ips;

if (unlikely(flags & ~(BPF_F_SKIP_FIELD_MASK | BPF_F_USER_STACK |
BPF_F_USER_BUILD_ID)))
goto clear;
if (kernel && user_build_id)
goto clear;

elem_size = (user && user_build_id) ? sizeof(struct bpf_stack_build_id)
: sizeof(u64);
if (unlikely(size % elem_size))
goto clear;

num_elem = size / elem_size;
if (sysctl_perf_event_max_stack < num_elem)
init_nr = 0;
else
init_nr = sysctl_perf_event_max_stack - num_elem;
trace = get_perf_callchain(regs, init_nr, kernel, user,
sysctl_perf_event_max_stack, false, false);
if (unlikely(!trace))
goto err_fault;

trace_nr = trace->nr - init_nr;
if (trace_nr < skip)
goto err_fault;

trace_nr -= skip;
trace_nr = (trace_nr <= num_elem) ? trace_nr : num_elem;
copy_len = trace_nr * elem_size;
ips = trace->ip + skip + init_nr;
if (user && user_build_id)
stack_map_get_build_id_offset(buf, ips, trace_nr, user);
else
memcpy(buf, ips, copy_len);

if (size > copy_len)
memset(buf + copy_len, 0, size - copy_len);
return copy_len;

err_fault:
err = -EFAULT;
clear:
memset(buf, 0, size);
return err;
}

const struct bpf_func_proto bpf_get_stack_proto = {
.func = bpf_get_stack,
.gpl_only = true,
.ret_type = RET_INTEGER,
.arg1_type = ARG_PTR_TO_CTX,
.arg2_type = ARG_PTR_TO_UNINIT_MEM,
.arg3_type = ARG_CONST_SIZE_OR_ZERO,
.arg4_type = ARG_ANYTHING,
};

/* Called from eBPF program */
static void *stack_map_lookup_elem(struct bpf_map *map, void *key)
{
10 changes: 10 additions & 0 deletions kernel/bpf/tnum.c
@@ -43,6 +43,16 @@ struct tnum tnum_rshift(struct tnum a, u8 shift)
return TNUM(a.value >> shift, a.mask >> shift);
}

struct tnum tnum_arshift(struct tnum a, u8 min_shift)
{
/* if a.value is negative, arithmetic shifting by minimum shift
* will have larger negative offset compared to more shifting.
* If a.value is nonnegative, arithmetic shifting by minimum shift
* will have larger positive offset compared to more shifting.
*/
return TNUM((s64)a.value >> min_shift, (s64)a.mask >> min_shift);
}

struct tnum tnum_add(struct tnum a, struct tnum b)
{
u64 sm, sv, sigma, chi, mu;
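
For readers unfamiliar with tnums, a small standalone illustration
(plain userspace C, editor's sketch mirroring the kernel formula above)
of what the arithmetic shift preserves when the sign bit is a known 1:

    #include <inttypes.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Mirrors the kernel's tnum: value = known-1 bits, mask = unknown bits. */
    struct tnum { uint64_t value; uint64_t mask; };

    static struct tnum tnum_arshift(struct tnum a, uint8_t min_shift)
    {
            /* Same formula as in kernel/bpf/tnum.c above; assumes >> on a
             * signed value is an arithmetic shift, as the kernel does.
             */
            return (struct tnum){ (int64_t)a.value >> min_shift,
                                  (int64_t)a.mask >> min_shift };
    }

    int main(void)
    {
            /* Sign bit known to be 1, low nibble unknown. */
            struct tnum a = { .value = 0x8000000000000000ULL, .mask = 0xfULL };
            struct tnum r = tnum_arshift(a, 4);

            /* Prints value=0xf800000000000000 mask=0: the shifted-in high
             * bits stay known 1s, whereas tnum_rshift would have made them
             * known 0s.
             */
            printf("value=%#" PRIx64 " mask=%#" PRIx64 "\n", r.value, r.mask);
            return 0;
    }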
80 changes: 71 additions & 9 deletions kernel/bpf/verifier.c
@@ -22,6 +22,7 @@
#include <linux/stringify.h>
#include <linux/bsearch.h>
#include <linux/sort.h>
#include <linux/perf_event.h>

#include "disasm.h"

@@ -164,6 +165,8 @@ struct bpf_call_arg_meta {
bool pkt_access;
int regno;
int access_size;
s64 msize_smax_value;
u64 msize_umax_value;
};

static DEFINE_MUTEX(bpf_verifier_lock);
@@ -1984,6 +1987,12 @@ static int check_func_arg(struct bpf_verifier_env *env, u32 regno,
} else if (arg_type_is_mem_size(arg_type)) {
bool zero_size_allowed = (arg_type == ARG_CONST_SIZE_OR_ZERO);

/* remember the mem_size which may be used later
* to refine return values.
*/
meta->msize_smax_value = reg->smax_value;
meta->msize_umax_value = reg->umax_value;

/* The register is SCALAR_VALUE; the access check
* happens using its boundaries.
*/
@@ -2323,6 +2332,23 @@ static int prepare_func_exit(struct bpf_verifier_env *env, int *insn_idx)
return 0;
}

static void do_refine_retval_range(struct bpf_reg_state *regs, int ret_type,
int func_id,
struct bpf_call_arg_meta *meta)
{
struct bpf_reg_state *ret_reg = &regs[BPF_REG_0];

if (ret_type != RET_INTEGER ||
(func_id != BPF_FUNC_get_stack &&
func_id != BPF_FUNC_probe_read_str))
return;

ret_reg->smax_value = meta->msize_smax_value;
ret_reg->umax_value = meta->msize_umax_value;
__reg_deduce_bounds(ret_reg);
__reg_bound_offset(ret_reg);
}

static int check_helper_call(struct bpf_verifier_env *env, int func_id, int insn_idx)
{
const struct bpf_func_proto *fn = NULL;
@@ -2446,10 +2472,30 @@ static int check_helper_call(struct bpf_verifier_env *env, int func_id, int insn
return -EINVAL;
}

do_refine_retval_range(regs, fn->ret_type, func_id, &meta);

err = check_map_func_compatibility(env, meta.map_ptr, func_id);
if (err)
return err;

if (func_id == BPF_FUNC_get_stack && !env->prog->has_callchain_buf) {
const char *err_str;

#ifdef CONFIG_PERF_EVENTS
err = get_callchain_buffers(sysctl_perf_event_max_stack);
err_str = "cannot get callchain buffer for func %s#%d\n";
#else
err = -ENOTSUPP;
err_str = "func %s#%d not supported without CONFIG_PERF_EVENTS\n";
#endif
if (err) {
verbose(env, err_str, func_id_name(func_id), func_id);
return err;
}

env->prog->has_callchain_buf = true;
}

if (changes_data)
clear_all_pkt_pointers(env);
return 0;
@@ -2894,10 +2940,7 @@ static int adjust_scalar_min_max_vals(struct bpf_verifier_env *env,
dst_reg->umin_value <<= umin_val;
dst_reg->umax_value <<= umax_val;
}
if (src_known)
dst_reg->var_off = tnum_lshift(dst_reg->var_off, umin_val);
else
dst_reg->var_off = tnum_lshift(tnum_unknown, umin_val);
dst_reg->var_off = tnum_lshift(dst_reg->var_off, umin_val);
/* We may learn something more from the var_off */
__update_reg_bounds(dst_reg);
break;
@@ -2925,16 +2968,35 @@ static int adjust_scalar_min_max_vals(struct bpf_verifier_env *env,
*/
dst_reg->smin_value = S64_MIN;
dst_reg->smax_value = S64_MAX;
if (src_known)
dst_reg->var_off = tnum_rshift(dst_reg->var_off,
umin_val);
else
dst_reg->var_off = tnum_rshift(tnum_unknown, umin_val);
dst_reg->var_off = tnum_rshift(dst_reg->var_off, umin_val);
dst_reg->umin_value >>= umax_val;
dst_reg->umax_value >>= umin_val;
/* We may learn something more from the var_off */
__update_reg_bounds(dst_reg);
break;
case BPF_ARSH:
if (umax_val >= insn_bitness) {
/* Shifts greater than 31 or 63 are undefined.
* This includes shifts by a negative number.
*/
mark_reg_unknown(env, regs, insn->dst_reg);
break;
}

/* Upon reaching here, src_known is true and
* umax_val is equal to umin_val.
*/
dst_reg->smin_value >>= umin_val;
dst_reg->smax_value >>= umin_val;
dst_reg->var_off = tnum_arshift(dst_reg->var_off, umin_val);

/* blow away the dst_reg umin_value/umax_value and rely on
* dst_reg var_off to refine the result.
*/
dst_reg->umin_value = 0;
dst_reg->umax_value = U64_MAX;
__update_reg_bounds(dst_reg);
break;
default:
mark_reg_unknown(env, regs, insn->dst_reg);
break;
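
One practical consequence of do_refine_retval_range() above, sketched
here (editor's illustration, loosely following the pattern the selftests
in this series exercise; map and section names are assumptions): because
the verifier now bounds the helper's return value by the size argument,
the program can pass that return value straight to another helper as a
length:

    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    struct {
            __uint(type, BPF_MAP_TYPE_PERF_EVENT_ARRAY);
            __uint(key_size, sizeof(__u32));
            __uint(value_size, sizeof(__u32));
    } events SEC(".maps");

    SEC("raw_tracepoint/sys_enter")          /* attach point is illustrative */
    int dump_stack(void *ctx)
    {
            __u64 buf[32];
            long len;

            len = bpf_get_stack(ctx, buf, sizeof(buf), 0);
            if (len <= 0)
                    return 0;

            /* The verifier now knows 0 < len <= sizeof(buf), so passing a
             * variable length below is accepted without extra clamping.
             */
            bpf_perf_event_output(ctx, &events, BPF_F_CURRENT_CPU, buf, len);
            return 0;
    }

    char _license[] SEC("license") = "GPL";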