Skip to content

Commit

Permalink
Merge branch 'bpf-timers'
Browse files Browse the repository at this point in the history
Alexei Starovoitov says:

====================
The first request to support timers in bpf was made in 2013 before sys_bpf
syscall was added. That use case was periodic sampling. It was address with
attaching bpf programs to perf_events. Then during XDP development the timers
were requested to do garbage collection and health checks. They were worked
around by implementing timers in user space and triggering progs with
BPF_PROG_RUN command. The user space timers and perf_event+bpf timers are not
armed by the bpf program. They're done asynchronously vs program execution.
The XDP program cannot send a packet and arm the timer at the same time. The
tracing prog cannot record an event and arm the timer right away. This large
class of use cases remained unaddressed. The jiffy based and hrtimer based
timers are essential part of the kernel development and with this patch set
the hrtimer based timers will be available to bpf programs.

TLDR: bpf timers is a wrapper of hrtimers with all the extra safety added
to make sure bpf progs cannot crash the kernel.

v6->v7:
- address Andrii's comments and add his Acks.

v5->v6:
- address code review feedback from Martin and add his Acks.
- add usercnt > 0 check to bpf_timer_init and remove timers_cancel_and_free
second loop in map_free callbacks.
- add cond_resched_rcu.

v4->v5:
- Martin noticed the following issues:
. prog could be reallocated bpf_patch_insn_data().
Fixed by passing 'aux' into bpf_timer_set_callback, since 'aux' is stable
during insn patching.
. Added missing rcu_read_lock.
. Removed redundant record_map.
- Discovered few bugs with stress testing:
. One cpu does htab_free_prealloced_timers->bpf_timer_cancel_and_free->hrtimer_cancel
while another is trying to do something with the timer like bpf_timer_start/set_callback.
Those ops try to acquire bpf_spin_lock that is already taken by bpf_timer_cancel_and_free,
so both cpus spin forever. The same problem existed in bpf_timer_cancel().
One bpf prog on one cpu might call bpf_timer_cancel and wait, while another cpu is in
the timer callback that tries to do bpf_timer_*() helper on the same timer.
The fix is to do drop_prog_refcnt() and unlock. And only then hrtimer_cancel.
Because of this had to add callback_fn != NULL check to bpf_timer_cb().
Also removed redundant bpf_prog_inc/put from bpf_timer_cb() and replaced
with rcu_dereference_check similar to recent rcu_read_lock-removal from drivers.
bpf_timer_cb is in softirq.
. Managed to hit refcnt==0 while doing bpf_prog_put from bpf_timer_cancel_and_free().
That exposed the issue that bpf_prog_put wasn't ready to be called from irq context.
Fixed similar to bpf_map_put which is irq ready.
- Refactored BPF_CALL_1(bpf_spin_lock) into __bpf_spin_lock_irqsave() to
make the main logic more clear, since Martin and Yonghong brought up this concern.

v3->v4:
1.
Split callback_fn from bpf_timer_start into bpf_timer_set_callback as
suggested by Martin. That makes bpf timer api match one to one to
kernel hrtimer api and provides greater flexibility.
2.
Martin also discovered the following issue with uref approach:
bpftool prog load xdp_timer.o /sys/fs/bpf/xdp_timer type xdp
bpftool net attach xdpgeneric pinned /sys/fs/bpf/xdp_timer dev lo
rm /sys/fs/bpf/xdp_timer
nc -6 ::1 8888
bpftool net detach xdpgeneric dev lo
The timer callback stays active in the kernel though the prog was detached
and map usercnt == 0.
It happened because 'bpftool prog load' pinned the prog only.
The map usercnt went to zero. Subsequent attach and runs didn't
affect map usercnt. The timer was able to start and bpf_prog_inc itself.
When the prog was detached the prog stayed active.
To address this issue added
if (!atomic64_read(&(t->map->usercnt))) return -EPERM;
to the first patch.
Which means that timers are allowed only in the maps that are held
by user space with open file descriptor or maps pinned in bpffs.
3.
Discovered that timers in inner maps were broken.
The inner map pointers are dynamic. Therefore changed bpf_timer_init()
to accept explicit map pointer supplied by the program instead
of hidden map pointer supplied by the verifier.
To make sure that pointer to a timer actually belongs to that map
added the verifier check in patch 3.
4.
Addressed Yonghong's feedback. Improved comments and added
dynamic in_nmi() check.
Added Acks.

v2->v3:
The v2 approach attempted to bump bpf_prog refcnt when bpf_timer_start is
called to make sure callback code doesn't disappear when timer is active and
drop refcnt when timer cb is done. That led to a ton of race conditions between
callback running and concurrent bpf_timer_init/start/cancel on another cpu,
and concurrent bpf_map_update/delete_elem, and map destroy.

Then v2.5 approach skipped prog refcnt altogether. Instead it remembered all
timers that bpf prog armed in a link list and canceled them when prog refcnt
went to zero. The race conditions disappeared, but timers in map-in-map could
not be supported cleanly, since timers in inner maps have inner map's life time
and don't match prog's life time.

This v3 approach makes timers to be owned by maps. It allows timers in inner
maps to be supported from the start. This apporach relies on "user refcnt"
scheme used in prog_array that stores bpf programs for bpf_tail_call. The
bpf_timer_start() increments prog refcnt, but unlike 1st approach the timer
callback does decrement the refcnt. The ops->map_release_uref is
responsible for cancelling the timers and dropping prog refcnt when user space
reference to a map is dropped. That addressed all the races and simplified
locking.

Andrii presented a use case where specifying callback_fn in bpf_timer_init()
is inconvenient vs specifying in bpf_timer_start(). The bpf_timer_init()
typically is called outside for timer callback, while bpf_timer_start() most
likely will be called from the callback.
timer_cb() { ... bpf_timer_start(timer_cb); ...} looks like recursion and as
infinite loop to the verifier. The verifier had to be made smarter to recognize
such async callbacks. Patches 7,8,9 addressed that.

Patch 1 and 2 refactoring.
Patch 3 implements bpf timer helpers and locking.
Patch 4 implements map side of bpf timer support.
Patch 5 prevent pointer mismatch in bpf_timer_init.
Patch 6 adds support for BTF in inner maps.
Patch 7 teaches check_cfg() pass to understand async callbacks.
Patch 8 teaches do_check() pass to understand async callbacks.
Patch 9 teaches check_max_stack_depth() pass to understand async callbacks.
Patches 10 and 11 are the tests.

v1->v2:
- Addressed great feedback from Andrii and Toke.
- Fixed race between parallel bpf_timer_*() ops.
- Fixed deadlock between timer callback and LRU eviction or bpf_map_delete/update.
- Disallowed mmap and global timers.
- Allow spin_lock and bpf_timer in an element.
- Fixed memory leaks due to map destruction and LRU eviction.
- A ton more tests.
====================

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
  • Loading branch information
Daniel Borkmann committed Jul 15, 2021
2 parents de587d5 + 61f71e7 commit 7628317
Show file tree
Hide file tree
Showing 20 changed files with 1,651 additions and 64 deletions.
47 changes: 36 additions & 11 deletions include/linux/bpf.h
Original file line number Diff line number Diff line change
Expand Up @@ -168,6 +168,7 @@ struct bpf_map {
u32 max_entries;
u32 map_flags;
int spin_lock_off; /* >=0 valid offset, <0 error */
int timer_off; /* >=0 valid offset, <0 error */
u32 id;
int numa_node;
u32 btf_key_type_id;
Expand Down Expand Up @@ -197,30 +198,53 @@ static inline bool map_value_has_spin_lock(const struct bpf_map *map)
return map->spin_lock_off >= 0;
}

static inline void check_and_init_map_lock(struct bpf_map *map, void *dst)
static inline bool map_value_has_timer(const struct bpf_map *map)
{
if (likely(!map_value_has_spin_lock(map)))
return;
*(struct bpf_spin_lock *)(dst + map->spin_lock_off) =
(struct bpf_spin_lock){};
return map->timer_off >= 0;
}

/* copy everything but bpf_spin_lock */
static inline void check_and_init_map_value(struct bpf_map *map, void *dst)
{
if (unlikely(map_value_has_spin_lock(map)))
*(struct bpf_spin_lock *)(dst + map->spin_lock_off) =
(struct bpf_spin_lock){};
if (unlikely(map_value_has_timer(map)))
*(struct bpf_timer *)(dst + map->timer_off) =
(struct bpf_timer){};
}

/* copy everything but bpf_spin_lock and bpf_timer. There could be one of each. */
static inline void copy_map_value(struct bpf_map *map, void *dst, void *src)
{
u32 s_off = 0, s_sz = 0, t_off = 0, t_sz = 0;

if (unlikely(map_value_has_spin_lock(map))) {
u32 off = map->spin_lock_off;
s_off = map->spin_lock_off;
s_sz = sizeof(struct bpf_spin_lock);
} else if (unlikely(map_value_has_timer(map))) {
t_off = map->timer_off;
t_sz = sizeof(struct bpf_timer);
}

memcpy(dst, src, off);
memcpy(dst + off + sizeof(struct bpf_spin_lock),
src + off + sizeof(struct bpf_spin_lock),
map->value_size - off - sizeof(struct bpf_spin_lock));
if (unlikely(s_sz || t_sz)) {
if (s_off < t_off || !s_sz) {
swap(s_off, t_off);
swap(s_sz, t_sz);
}
memcpy(dst, src, t_off);
memcpy(dst + t_off + t_sz,
src + t_off + t_sz,
s_off - t_off - t_sz);
memcpy(dst + s_off + s_sz,
src + s_off + s_sz,
map->value_size - s_off - s_sz);
} else {
memcpy(dst, src, map->value_size);
}
}
void copy_map_value_locked(struct bpf_map *map, void *dst, void *src,
bool lock_src);
void bpf_timer_cancel_and_free(void *timer);
int bpf_obj_name_cpy(char *dst, const char *src, unsigned int size);

struct bpf_offload_dev;
Expand Down Expand Up @@ -314,6 +338,7 @@ enum bpf_arg_type {
ARG_PTR_TO_FUNC, /* pointer to a bpf program function */
ARG_PTR_TO_STACK_OR_NULL, /* pointer to stack or NULL */
ARG_PTR_TO_CONST_STR, /* pointer to a null terminated read-only string */
ARG_PTR_TO_TIMER, /* pointer to bpf_timer */
__BPF_ARG_TYPE_MAX,
};

Expand Down
19 changes: 17 additions & 2 deletions include/linux/bpf_verifier.h
Original file line number Diff line number Diff line change
Expand Up @@ -53,7 +53,14 @@ struct bpf_reg_state {
/* valid when type == CONST_PTR_TO_MAP | PTR_TO_MAP_VALUE |
* PTR_TO_MAP_VALUE_OR_NULL
*/
struct bpf_map *map_ptr;
struct {
struct bpf_map *map_ptr;
/* To distinguish map lookups from outer map
* the map_uid is non-zero for registers
* pointing to inner maps.
*/
u32 map_uid;
};

/* for PTR_TO_BTF_ID */
struct {
Expand Down Expand Up @@ -201,12 +208,19 @@ struct bpf_func_state {
* zero == main subprog
*/
u32 subprogno;
/* Every bpf_timer_start will increment async_entry_cnt.
* It's used to distinguish:
* void foo(void) { for(;;); }
* void foo(void) { bpf_timer_set_callback(,foo); }
*/
u32 async_entry_cnt;
bool in_callback_fn;
bool in_async_callback_fn;

/* The following fields should be last. See copy_func_state() */
int acquired_refs;
struct bpf_reference_state *refs;
int allocated_stack;
bool in_callback_fn;
struct bpf_stack_state *stack;
};

Expand Down Expand Up @@ -392,6 +406,7 @@ struct bpf_subprog_info {
bool has_tail_call;
bool tail_call_reachable;
bool has_ld_abs;
bool is_async_cb;
};

/* single container for all structs
Expand Down
1 change: 1 addition & 0 deletions include/linux/btf.h
Original file line number Diff line number Diff line change
Expand Up @@ -99,6 +99,7 @@ bool btf_member_is_reg_int(const struct btf *btf, const struct btf_type *s,
const struct btf_member *m,
u32 expected_offset, u32 expected_size);
int btf_find_spin_lock(const struct btf *btf, const struct btf_type *t);
int btf_find_timer(const struct btf *btf, const struct btf_type *t);
bool btf_type_is_void(const struct btf_type *t);
s32 btf_find_by_name_kind(const struct btf *btf, const char *name, u8 kind);
const struct btf_type *btf_type_skip_modifiers(const struct btf *btf,
Expand Down
73 changes: 73 additions & 0 deletions include/uapi/linux/bpf.h
Original file line number Diff line number Diff line change
Expand Up @@ -4777,6 +4777,70 @@ union bpf_attr {
* Execute close syscall for given FD.
* Return
* A syscall result.
*
* long bpf_timer_init(struct bpf_timer *timer, struct bpf_map *map, u64 flags)
* Description
* Initialize the timer.
* First 4 bits of *flags* specify clockid.
* Only CLOCK_MONOTONIC, CLOCK_REALTIME, CLOCK_BOOTTIME are allowed.
* All other bits of *flags* are reserved.
* The verifier will reject the program if *timer* is not from
* the same *map*.
* Return
* 0 on success.
* **-EBUSY** if *timer* is already initialized.
* **-EINVAL** if invalid *flags* are passed.
* **-EPERM** if *timer* is in a map that doesn't have any user references.
* The user space should either hold a file descriptor to a map with timers
* or pin such map in bpffs. When map is unpinned or file descriptor is
* closed all timers in the map will be cancelled and freed.
*
* long bpf_timer_set_callback(struct bpf_timer *timer, void *callback_fn)
* Description
* Configure the timer to call *callback_fn* static function.
* Return
* 0 on success.
* **-EINVAL** if *timer* was not initialized with bpf_timer_init() earlier.
* **-EPERM** if *timer* is in a map that doesn't have any user references.
* The user space should either hold a file descriptor to a map with timers
* or pin such map in bpffs. When map is unpinned or file descriptor is
* closed all timers in the map will be cancelled and freed.
*
* long bpf_timer_start(struct bpf_timer *timer, u64 nsecs, u64 flags)
* Description
* Set timer expiration N nanoseconds from the current time. The
* configured callback will be invoked in soft irq context on some cpu
* and will not repeat unless another bpf_timer_start() is made.
* In such case the next invocation can migrate to a different cpu.
* Since struct bpf_timer is a field inside map element the map
* owns the timer. The bpf_timer_set_callback() will increment refcnt
* of BPF program to make sure that callback_fn code stays valid.
* When user space reference to a map reaches zero all timers
* in a map are cancelled and corresponding program's refcnts are
* decremented. This is done to make sure that Ctrl-C of a user
* process doesn't leave any timers running. If map is pinned in
* bpffs the callback_fn can re-arm itself indefinitely.
* bpf_map_update/delete_elem() helpers and user space sys_bpf commands
* cancel and free the timer in the given map element.
* The map can contain timers that invoke callback_fn-s from different
* programs. The same callback_fn can serve different timers from
* different maps if key/value layout matches across maps.
* Every bpf_timer_set_callback() can have different callback_fn.
*
* Return
* 0 on success.
* **-EINVAL** if *timer* was not initialized with bpf_timer_init() earlier
* or invalid *flags* are passed.
*
* long bpf_timer_cancel(struct bpf_timer *timer)
* Description
* Cancel the timer and wait for callback_fn to finish if it was running.
* Return
* 0 if the timer was not active.
* 1 if the timer was active.
* **-EINVAL** if *timer* was not initialized with bpf_timer_init() earlier.
* **-EDEADLK** if callback_fn tried to call bpf_timer_cancel() on its
* own timer which would have led to a deadlock otherwise.
*/
#define __BPF_FUNC_MAPPER(FN) \
FN(unspec), \
Expand Down Expand Up @@ -4948,6 +5012,10 @@ union bpf_attr {
FN(sys_bpf), \
FN(btf_find_by_name_kind), \
FN(sys_close), \
FN(timer_init), \
FN(timer_set_callback), \
FN(timer_start), \
FN(timer_cancel), \
/* */

/* integer value in 'imm' field of BPF_CALL instruction selects which helper
Expand Down Expand Up @@ -6074,6 +6142,11 @@ struct bpf_spin_lock {
__u32 val;
};

struct bpf_timer {
__u64 :64;
__u64 :64;
} __attribute__((aligned(8)));

struct bpf_sysctl {
__u32 write; /* Sysctl is being read (= 0) or written (= 1).
* Allows 1,2,4-byte read, but no write.
Expand Down
21 changes: 21 additions & 0 deletions kernel/bpf/arraymap.c
Original file line number Diff line number Diff line change
Expand Up @@ -287,6 +287,12 @@ static int array_map_get_next_key(struct bpf_map *map, void *key, void *next_key
return 0;
}

static void check_and_free_timer_in_array(struct bpf_array *arr, void *val)
{
if (unlikely(map_value_has_timer(&arr->map)))
bpf_timer_cancel_and_free(val + arr->map.timer_off);
}

/* Called from syscall or from eBPF program */
static int array_map_update_elem(struct bpf_map *map, void *key, void *value,
u64 map_flags)
Expand Down Expand Up @@ -321,6 +327,7 @@ static int array_map_update_elem(struct bpf_map *map, void *key, void *value,
copy_map_value_locked(map, val, value, false);
else
copy_map_value(map, val, value);
check_and_free_timer_in_array(array, val);
}
return 0;
}
Expand Down Expand Up @@ -374,6 +381,19 @@ static void *array_map_vmalloc_addr(struct bpf_array *array)
return (void *)round_down((unsigned long)array, PAGE_SIZE);
}

static void array_map_free_timers(struct bpf_map *map)
{
struct bpf_array *array = container_of(map, struct bpf_array, map);
int i;

if (likely(!map_value_has_timer(map)))
return;

for (i = 0; i < array->map.max_entries; i++)
bpf_timer_cancel_and_free(array->value + array->elem_size * i +
map->timer_off);
}

/* Called when map->refcnt goes to zero, either from workqueue or from syscall */
static void array_map_free(struct bpf_map *map)
{
Expand Down Expand Up @@ -668,6 +688,7 @@ const struct bpf_map_ops array_map_ops = {
.map_alloc = array_map_alloc,
.map_free = array_map_free,
.map_get_next_key = array_map_get_next_key,
.map_release_uref = array_map_free_timers,
.map_lookup_elem = array_map_lookup_elem,
.map_update_elem = array_map_update_elem,
.map_delete_elem = array_map_delete_elem,
Expand Down
77 changes: 63 additions & 14 deletions kernel/bpf/btf.c
Original file line number Diff line number Diff line change
Expand Up @@ -3046,43 +3046,92 @@ static void btf_struct_log(struct btf_verifier_env *env,
btf_verifier_log(env, "size=%u vlen=%u", t->size, btf_type_vlen(t));
}

/* find 'struct bpf_spin_lock' in map value.
* return >= 0 offset if found
* and < 0 in case of error
*/
int btf_find_spin_lock(const struct btf *btf, const struct btf_type *t)
static int btf_find_struct_field(const struct btf *btf, const struct btf_type *t,
const char *name, int sz, int align)
{
const struct btf_member *member;
u32 i, off = -ENOENT;

if (!__btf_type_is_struct(t))
return -EINVAL;

for_each_member(i, t, member) {
const struct btf_type *member_type = btf_type_by_id(btf,
member->type);
if (!__btf_type_is_struct(member_type))
continue;
if (member_type->size != sizeof(struct bpf_spin_lock))
if (member_type->size != sz)
continue;
if (strcmp(__btf_name_by_offset(btf, member_type->name_off),
"bpf_spin_lock"))
if (strcmp(__btf_name_by_offset(btf, member_type->name_off), name))
continue;
if (off != -ENOENT)
/* only one 'struct bpf_spin_lock' is allowed */
/* only one such field is allowed */
return -E2BIG;
off = btf_member_bit_offset(t, member);
if (off % 8)
/* valid C code cannot generate such BTF */
return -EINVAL;
off /= 8;
if (off % __alignof__(struct bpf_spin_lock))
/* valid struct bpf_spin_lock will be 4 byte aligned */
if (off % align)
return -EINVAL;
}
return off;
}

static int btf_find_datasec_var(const struct btf *btf, const struct btf_type *t,
const char *name, int sz, int align)
{
const struct btf_var_secinfo *vsi;
u32 i, off = -ENOENT;

for_each_vsi(i, t, vsi) {
const struct btf_type *var = btf_type_by_id(btf, vsi->type);
const struct btf_type *var_type = btf_type_by_id(btf, var->type);

if (!__btf_type_is_struct(var_type))
continue;
if (var_type->size != sz)
continue;
if (vsi->size != sz)
continue;
if (strcmp(__btf_name_by_offset(btf, var_type->name_off), name))
continue;
if (off != -ENOENT)
/* only one such field is allowed */
return -E2BIG;
off = vsi->offset;
if (off % align)
return -EINVAL;
}
return off;
}

static int btf_find_field(const struct btf *btf, const struct btf_type *t,
const char *name, int sz, int align)
{

if (__btf_type_is_struct(t))
return btf_find_struct_field(btf, t, name, sz, align);
else if (btf_type_is_datasec(t))
return btf_find_datasec_var(btf, t, name, sz, align);
return -EINVAL;
}

/* find 'struct bpf_spin_lock' in map value.
* return >= 0 offset if found
* and < 0 in case of error
*/
int btf_find_spin_lock(const struct btf *btf, const struct btf_type *t)
{
return btf_find_field(btf, t, "bpf_spin_lock",
sizeof(struct bpf_spin_lock),
__alignof__(struct bpf_spin_lock));
}

int btf_find_timer(const struct btf *btf, const struct btf_type *t)
{
return btf_find_field(btf, t, "bpf_timer",
sizeof(struct bpf_timer),
__alignof__(struct bpf_timer));
}

static void __btf_struct_show(const struct btf *btf, const struct btf_type *t,
u32 type_id, void *data, u8 bits_offset,
struct btf_show *show)
Expand Down
Loading

0 comments on commit 7628317

Please sign in to comment.