Skip to content

Commit

Permalink
Merge tag 'probes-v6.5' of git://git.kernel.org/pub/scm/linux/kernel/…
Browse files Browse the repository at this point in the history
…git/trace/linux-trace

Pull probes updates from Masami Hiramatsu:

 - fprobe: Pass return address to the fprobe entry/exit callbacks so
   that the callbacks don't need to analyze pt_regs/stack to find the
   function return address.

 - kprobe events: cleanup usage of TPARG_FL_FENTRY and TPARG_FL_RETURN
   flags so that those are not set at once.

 - fprobe events:
      - Add a new fprobe events for tracing arbitrary function entry and
        exit as a trace event.
      - Add a new tracepoint events for tracing raw tracepoint as a
        trace event. This allows user to trace non user-exposed
        tracepoints.
      - Move eprobe's event parser code into probe event common file.
      - Introduce BTF (BPF type format) support to kernel probe (kprobe,
        fprobe and tracepoint probe) events so that user can specify
        traced function arguments by name. This also applies the type of
        argument when fetching the argument.
      - Introduce '$arg*' wildcard support if BTF is available. This
        expands the '$arg*' meta argument to all function argument
        automatically.
      - Check the return value types by BTF. If the function returns
        'void', '$retval' is rejected.
      - Add some selftest script for fprobe events, tracepoint events
        and BTF support.
      - Update documentation about the fprobe events.
      - Some fixes for above features, document and selftests.

 - selftests for ftrace (in addition to the new fprobe events):
      - Add a test case for multiple consecutive probes in a function
        which checks if ftrace based kprobe, optimized kprobe and normal
        kprobe can be defined in the same target function.
      - Add a test case for optimized probe, which checks whether kprobe
        can be optimized or not.

* tag 'probes-v6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  tracing/probes: Fix tracepoint event with $arg* to fetch correct argument
  Documentation: Fix typo of reference file name
  tracing/probes: Fix to return NULL and keep using current argc
  selftests/ftrace: Add new test case which checks for optimized probes
  selftests/ftrace: Add new test case which adds multiple consecutive probes in a function
  Documentation: tracing/probes: Add fprobe event tracing document
  selftests/ftrace: Add BTF arguments test cases
  selftests/ftrace: Add tracepoint probe test case
  tracing/probes: Add BTF retval type support
  tracing/probes: Add $arg* meta argument for all function args
  tracing/probes: Support function parameters if BTF is available
  tracing/probes: Move event parameter fetching code to common parser
  tracing/probes: Add tracepoint support on fprobe_events
  selftests/ftrace: Add fprobe related testcases
  tracing/probes: Add fprobe events for tracing function entry and exit.
  tracing/probes: Avoid setting TPARG_FL_FENTRY and TPARG_FL_RETURN
  fprobe: Pass return address to the handlers
  • Loading branch information
Linus Torvalds committed Jun 30, 2023
2 parents cccf0c2 + 5343179 commit d2a6fd4
Show file tree
Hide file tree
Showing 31 changed files with 2,476 additions and 164 deletions.
188 changes: 188 additions & 0 deletions Documentation/trace/fprobetrace.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,188 @@
.. SPDX-License-Identifier: GPL-2.0
==========================
Fprobe-based Event Tracing
==========================

.. Author: Masami Hiramatsu <mhiramat@kernel.org>
Overview
--------

Fprobe event is similar to the kprobe event, but limited to probe on
the function entry and exit only. It is good enough for many use cases
which only traces some specific functions.

This document also covers tracepoint probe events (tprobe) since this
is also works only on the tracepoint entry. User can trace a part of
tracepoint argument, or the tracepoint without trace-event, which is
not exposed on tracefs.

As same as other dynamic events, fprobe events and tracepoint probe
events are defined via `dynamic_events` interface file on tracefs.

Synopsis of fprobe-events
-------------------------
::

f[:[GRP1/][EVENT1]] SYM [FETCHARGS] : Probe on function entry
f[MAXACTIVE][:[GRP1/][EVENT1]] SYM%return [FETCHARGS] : Probe on function exit
t[:[GRP2/][EVENT2]] TRACEPOINT [FETCHARGS] : Probe on tracepoint

GRP1 : Group name for fprobe. If omitted, use "fprobes" for it.
GRP2 : Group name for tprobe. If omitted, use "tracepoints" for it.
EVENT1 : Event name for fprobe. If omitted, the event name is
"SYM__entry" or "SYM__exit".
EVENT2 : Event name for tprobe. If omitted, the event name is
the same as "TRACEPOINT", but if the "TRACEPOINT" starts
with a digit character, "_TRACEPOINT" is used.
MAXACTIVE : Maximum number of instances of the specified function that
can be probed simultaneously, or 0 for the default value
as defined in Documentation/trace/fprobe.rst

FETCHARGS : Arguments. Each probe can have up to 128 args.
ARG : Fetch "ARG" function argument using BTF (only for function
entry or tracepoint.) (\*1)
@ADDR : Fetch memory at ADDR (ADDR should be in kernel)
@SYM[+|-offs] : Fetch memory at SYM +|- offs (SYM should be a data symbol)
$stackN : Fetch Nth entry of stack (N >= 0)
$stack : Fetch stack address.
$argN : Fetch the Nth function argument. (N >= 1) (\*2)
$retval : Fetch return value.(\*3)
$comm : Fetch current task comm.
+|-[u]OFFS(FETCHARG) : Fetch memory at FETCHARG +|- OFFS address.(\*4)(\*5)
\IMM : Store an immediate value to the argument.
NAME=FETCHARG : Set NAME as the argument name of FETCHARG.
FETCHARG:TYPE : Set TYPE as the type of FETCHARG. Currently, basic types
(u8/u16/u32/u64/s8/s16/s32/s64), hexadecimal types
(x8/x16/x32/x64), "char", "string", "ustring", "symbol", "symstr"
and bitfield are supported.

(\*1) This is available only when BTF is enabled.
(\*2) only for the probe on function entry (offs == 0).
(\*3) only for return probe.
(\*4) this is useful for fetching a field of data structures.
(\*5) "u" means user-space dereference.

For the details of TYPE, see :ref:`kprobetrace documentation <kprobetrace_types>`.

BTF arguments
-------------
BTF (BPF Type Format) argument allows user to trace function and tracepoint
parameters by its name instead of ``$argN``. This feature is available if the
kernel is configured with CONFIG_BPF_SYSCALL and CONFIG_DEBUG_INFO_BTF.
If user only specify the BTF argument, the event's argument name is also
automatically set by the given name. ::

# echo 'f:myprobe vfs_read count pos' >> dynamic_events
# cat dynamic_events
f:fprobes/myprobe vfs_read count=count pos=pos

It also chooses the fetch type from BTF information. For example, in the above
example, the ``count`` is unsigned long, and the ``pos`` is a pointer. Thus, both
are converted to 64bit unsigned long, but only ``pos`` has "%Lx" print-format as
below ::

# cat events/fprobes/myprobe/format
name: myprobe
ID: 1313
format:
field:unsigned short common_type; offset:0; size:2; signed:0;
field:unsigned char common_flags; offset:2; size:1; signed:0;
field:unsigned char common_preempt_count; offset:3; size:1; signed:0;
field:int common_pid; offset:4; size:4; signed:1;

field:unsigned long __probe_ip; offset:8; size:8; signed:0;
field:u64 count; offset:16; size:8; signed:0;
field:u64 pos; offset:24; size:8; signed:0;

print fmt: "(%lx) count=%Lu pos=0x%Lx", REC->__probe_ip, REC->count, REC->pos

If user unsures the name of arguments, ``$arg*`` will be helpful. The ``$arg*``
is expanded to all function arguments of the function or the tracepoint. ::

# echo 'f:myprobe vfs_read $arg*' >> dynamic_events
# cat dynamic_events
f:fprobes/myprobe vfs_read file=file buf=buf count=count pos=pos

BTF also affects the ``$retval``. If user doesn't set any type, the retval type is
automatically picked from the BTF. If the function returns ``void``, ``$retval``
is rejected.

Usage examples
--------------
Here is an example to add fprobe events on ``vfs_read()`` function entry
and exit, with BTF arguments.
::

# echo 'f vfs_read $arg*' >> dynamic_events
# echo 'f vfs_read%return $retval' >> dynamic_events
# cat dynamic_events
f:fprobes/vfs_read__entry vfs_read file=file buf=buf count=count pos=pos
f:fprobes/vfs_read__exit vfs_read%return arg1=$retval
# echo 1 > events/fprobes/enable
# head -n 20 trace | tail
# TASK-PID CPU# ||||| TIMESTAMP FUNCTION
# | | | ||||| | |
sh-70 [000] ...1. 335.883195: vfs_read__entry: (vfs_read+0x4/0x340) file=0xffff888005cf9a80 buf=0x7ffef36c6879 count=1 pos=0xffffc900005aff08
sh-70 [000] ..... 335.883208: vfs_read__exit: (ksys_read+0x75/0x100 <- vfs_read) arg1=1
sh-70 [000] ...1. 335.883220: vfs_read__entry: (vfs_read+0x4/0x340) file=0xffff888005cf9a80 buf=0x7ffef36c6879 count=1 pos=0xffffc900005aff08
sh-70 [000] ..... 335.883224: vfs_read__exit: (ksys_read+0x75/0x100 <- vfs_read) arg1=1
sh-70 [000] ...1. 335.883232: vfs_read__entry: (vfs_read+0x4/0x340) file=0xffff888005cf9a80 buf=0x7ffef36c687a count=1 pos=0xffffc900005aff08
sh-70 [000] ..... 335.883237: vfs_read__exit: (ksys_read+0x75/0x100 <- vfs_read) arg1=1
sh-70 [000] ...1. 336.050329: vfs_read__entry: (vfs_read+0x4/0x340) file=0xffff888005cf9a80 buf=0x7ffef36c6879 count=1 pos=0xffffc900005aff08
sh-70 [000] ..... 336.050343: vfs_read__exit: (ksys_read+0x75/0x100 <- vfs_read) arg1=1

You can see all function arguments and return values are recorded as signed int.

Also, here is an example of tracepoint events on ``sched_switch`` tracepoint.
To compare the result, this also enables the ``sched_switch`` traceevent too.
::

# echo 't sched_switch $arg*' >> dynamic_events
# echo 1 > events/sched/sched_switch/enable
# echo 1 > events/tracepoints/sched_switch/enable
# echo > trace
# head -n 20 trace | tail
# TASK-PID CPU# ||||| TIMESTAMP FUNCTION
# | | | ||||| | |
sh-70 [000] d..2. 3912.083993: sched_switch: prev_comm=sh prev_pid=70 prev_prio=120 prev_state=S ==> next_comm=swapper/0 next_pid=0 next_prio=120
sh-70 [000] d..3. 3912.083995: sched_switch: (__probestub_sched_switch+0x4/0x10) preempt=0 prev=0xffff88800664e100 next=0xffffffff828229c0 prev_state=1
<idle>-0 [000] d..2. 3912.084183: sched_switch: prev_comm=swapper/0 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=rcu_preempt next_pid=16 next_prio=120
<idle>-0 [000] d..3. 3912.084184: sched_switch: (__probestub_sched_switch+0x4/0x10) preempt=0 prev=0xffffffff828229c0 next=0xffff888004208000 prev_state=0
rcu_preempt-16 [000] d..2. 3912.084196: sched_switch: prev_comm=rcu_preempt prev_pid=16 prev_prio=120 prev_state=I ==> next_comm=swapper/0 next_pid=0 next_prio=120
rcu_preempt-16 [000] d..3. 3912.084196: sched_switch: (__probestub_sched_switch+0x4/0x10) preempt=0 prev=0xffff888004208000 next=0xffffffff828229c0 prev_state=1026
<idle>-0 [000] d..2. 3912.085191: sched_switch: prev_comm=swapper/0 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=rcu_preempt next_pid=16 next_prio=120
<idle>-0 [000] d..3. 3912.085191: sched_switch: (__probestub_sched_switch+0x4/0x10) preempt=0 prev=0xffffffff828229c0 next=0xffff888004208000 prev_state=0

As you can see, the ``sched_switch`` trace-event shows *cooked* parameters, on
the other hand, the ``sched_switch`` tracepoint probe event shows *raw*
parameters. This means you can access any field values in the task
structure pointed by the ``prev`` and ``next`` arguments.

For example, usually ``task_struct::start_time`` is not traced, but with this
traceprobe event, you can trace it as below.
::

# echo 't sched_switch comm=+1896(next):string start_time=+1728(next):u64' > dynamic_events
# head -n 20 trace | tail
# TASK-PID CPU# ||||| TIMESTAMP FUNCTION
# | | | ||||| | |
sh-70 [000] d..3. 5606.686577: sched_switch: (__probestub_sched_switch+0x4/0x10) comm="rcu_preempt" usage=1 start_time=245000000
rcu_preempt-16 [000] d..3. 5606.686602: sched_switch: (__probestub_sched_switch+0x4/0x10) comm="sh" usage=1 start_time=1596095526
sh-70 [000] d..3. 5606.686637: sched_switch: (__probestub_sched_switch+0x4/0x10) comm="swapper/0" usage=2 start_time=0
<idle>-0 [000] d..3. 5606.687190: sched_switch: (__probestub_sched_switch+0x4/0x10) comm="rcu_preempt" usage=1 start_time=245000000
rcu_preempt-16 [000] d..3. 5606.687202: sched_switch: (__probestub_sched_switch+0x4/0x10) comm="swapper/0" usage=2 start_time=0
<idle>-0 [000] d..3. 5606.690317: sched_switch: (__probestub_sched_switch+0x4/0x10) comm="kworker/0:1" usage=1 start_time=137000000
kworker/0:1-14 [000] d..3. 5606.690339: sched_switch: (__probestub_sched_switch+0x4/0x10) comm="swapper/0" usage=2 start_time=0
<idle>-0 [000] d..3. 5606.692368: sched_switch: (__probestub_sched_switch+0x4/0x10) comm="kworker/0:1" usage=1 start_time=137000000

Currently, to find the offset of a specific field in the data structure,
you need to build kernel with debuginfo and run `perf probe` command with
`-D` option. e.g.
::

# perf probe -D "__probestub_sched_switch next->comm:string next->start_time"
p:probe/__probestub_sched_switch __probestub_sched_switch+0 comm=+1896(%cx):string start_time=+1728(%cx):u64

And replace the ``%cx`` with the ``next``.
1 change: 1 addition & 0 deletions Documentation/trace/index.rst
Original file line number Diff line number Diff line change
Expand Up @@ -13,6 +13,7 @@ Linux Tracing Technologies
kprobes
kprobetrace
uprobetracer
fprobetrace
tracepoints
events
events-kmem
Expand Down
2 changes: 2 additions & 0 deletions Documentation/trace/kprobetrace.rst
Original file line number Diff line number Diff line change
Expand Up @@ -66,6 +66,8 @@ Synopsis of kprobe_events
(\*3) this is useful for fetching a field of data structures.
(\*4) "u" means user-space dereference. See :ref:`user_mem_access`.

.. _kprobetrace_types:

Types
-----
Several types are supported for fetchargs. Kprobe tracer will access memory
Expand Down
11 changes: 9 additions & 2 deletions include/linux/fprobe.h
Original file line number Diff line number Diff line change
Expand Up @@ -35,9 +35,11 @@ struct fprobe {
int nr_maxactive;

int (*entry_handler)(struct fprobe *fp, unsigned long entry_ip,
struct pt_regs *regs, void *entry_data);
unsigned long ret_ip, struct pt_regs *regs,
void *entry_data);
void (*exit_handler)(struct fprobe *fp, unsigned long entry_ip,
struct pt_regs *regs, void *entry_data);
unsigned long ret_ip, struct pt_regs *regs,
void *entry_data);
};

/* This fprobe is soft-disabled. */
Expand All @@ -64,6 +66,7 @@ int register_fprobe(struct fprobe *fp, const char *filter, const char *notfilter
int register_fprobe_ips(struct fprobe *fp, unsigned long *addrs, int num);
int register_fprobe_syms(struct fprobe *fp, const char **syms, int num);
int unregister_fprobe(struct fprobe *fp);
bool fprobe_is_registered(struct fprobe *fp);
#else
static inline int register_fprobe(struct fprobe *fp, const char *filter, const char *notfilter)
{
Expand All @@ -81,6 +84,10 @@ static inline int unregister_fprobe(struct fprobe *fp)
{
return -EOPNOTSUPP;
}
static inline bool fprobe_is_registered(struct fprobe *fp)
{
return false;
}
#endif

/**
Expand Down
2 changes: 1 addition & 1 deletion include/linux/rethook.h
Original file line number Diff line number Diff line change
Expand Up @@ -14,7 +14,7 @@

struct rethook_node;

typedef void (*rethook_handler_t) (struct rethook_node *, void *, struct pt_regs *);
typedef void (*rethook_handler_t) (struct rethook_node *, void *, unsigned long, struct pt_regs *);

/**
* struct rethook - The rethook management data structure.
Expand Down
3 changes: 3 additions & 0 deletions include/linux/trace_events.h
Original file line number Diff line number Diff line change
Expand Up @@ -318,6 +318,7 @@ enum {
TRACE_EVENT_FL_KPROBE_BIT,
TRACE_EVENT_FL_UPROBE_BIT,
TRACE_EVENT_FL_EPROBE_BIT,
TRACE_EVENT_FL_FPROBE_BIT,
TRACE_EVENT_FL_CUSTOM_BIT,
};

Expand All @@ -332,6 +333,7 @@ enum {
* KPROBE - Event is a kprobe
* UPROBE - Event is a uprobe
* EPROBE - Event is an event probe
* FPROBE - Event is an function probe
* CUSTOM - Event is a custom event (to be attached to an exsiting tracepoint)
* This is set when the custom event has not been attached
* to a tracepoint yet, then it is cleared when it is.
Expand All @@ -346,6 +348,7 @@ enum {
TRACE_EVENT_FL_KPROBE = (1 << TRACE_EVENT_FL_KPROBE_BIT),
TRACE_EVENT_FL_UPROBE = (1 << TRACE_EVENT_FL_UPROBE_BIT),
TRACE_EVENT_FL_EPROBE = (1 << TRACE_EVENT_FL_EPROBE_BIT),
TRACE_EVENT_FL_FPROBE = (1 << TRACE_EVENT_FL_FPROBE_BIT),
TRACE_EVENT_FL_CUSTOM = (1 << TRACE_EVENT_FL_CUSTOM_BIT),
};

Expand Down
1 change: 1 addition & 0 deletions include/linux/tracepoint-defs.h
Original file line number Diff line number Diff line change
Expand Up @@ -35,6 +35,7 @@ struct tracepoint {
struct static_call_key *static_call_key;
void *static_call_tramp;
void *iterator;
void *probestub;
int (*regfunc)(void);
void (*unregfunc)(void);
struct tracepoint_func __rcu *funcs;
Expand Down
5 changes: 5 additions & 0 deletions include/linux/tracepoint.h
Original file line number Diff line number Diff line change
Expand Up @@ -303,13 +303,15 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
__section("__tracepoints_strings") = #_name; \
extern struct static_call_key STATIC_CALL_KEY(tp_func_##_name); \
int __traceiter_##_name(void *__data, proto); \
void __probestub_##_name(void *__data, proto); \
struct tracepoint __tracepoint_##_name __used \
__section("__tracepoints") = { \
.name = __tpstrtab_##_name, \
.key = STATIC_KEY_INIT_FALSE, \
.static_call_key = &STATIC_CALL_KEY(tp_func_##_name), \
.static_call_tramp = STATIC_CALL_TRAMP_ADDR(tp_func_##_name), \
.iterator = &__traceiter_##_name, \
.probestub = &__probestub_##_name, \
.regfunc = _reg, \
.unregfunc = _unreg, \
.funcs = NULL }; \
Expand All @@ -330,6 +332,9 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
} \
return 0; \
} \
void __probestub_##_name(void *__data, proto) \
{ \
} \
DEFINE_STATIC_CALL(tp_func_##_name, __traceiter_##_name);

#define DEFINE_TRACE(name, proto, args) \
Expand Down
1 change: 1 addition & 0 deletions kernel/kprobes.c
Original file line number Diff line number Diff line change
Expand Up @@ -2127,6 +2127,7 @@ static int pre_handler_kretprobe(struct kprobe *p, struct pt_regs *regs)
NOKPROBE_SYMBOL(pre_handler_kretprobe);

static void kretprobe_rethook_handler(struct rethook_node *rh, void *data,
unsigned long ret_addr,
struct pt_regs *regs)
{
struct kretprobe *rp = (struct kretprobe *)data;
Expand Down
26 changes: 26 additions & 0 deletions kernel/trace/Kconfig
Original file line number Diff line number Diff line change
Expand Up @@ -665,6 +665,32 @@ config BLK_DEV_IO_TRACE

If unsure, say N.

config FPROBE_EVENTS
depends on FPROBE
depends on HAVE_REGS_AND_STACK_ACCESS_API
bool "Enable fprobe-based dynamic events"
select TRACING
select PROBE_EVENTS
select DYNAMIC_EVENTS
default y
help
This allows user to add tracing events on the function entry and
exit via ftrace interface. The syntax is same as the kprobe events
and the kprobe events on function entry and exit will be
transparently converted to this fprobe events.

config PROBE_EVENTS_BTF_ARGS
depends on HAVE_FUNCTION_ARG_ACCESS_API
depends on FPROBE_EVENTS || KPROBE_EVENTS
depends on DEBUG_INFO_BTF && BPF_SYSCALL
bool "Support BTF function arguments for probe events"
default y
help
The user can specify the arguments of the probe event using the names
of the arguments of the probed function, when the probe location is a
kernel function entry or a tracepoint.
This is available only if BTF (BPF Type Format) support is enabled.

config KPROBE_EVENTS
depends on KPROBES
depends on HAVE_REGS_AND_STACK_ACCESS_API
Expand Down
1 change: 1 addition & 0 deletions kernel/trace/Makefile
Original file line number Diff line number Diff line change
Expand Up @@ -104,6 +104,7 @@ obj-$(CONFIG_BOOTTIME_TRACING) += trace_boot.o
obj-$(CONFIG_FTRACE_RECORD_RECURSION) += trace_recursion_record.o
obj-$(CONFIG_FPROBE) += fprobe.o
obj-$(CONFIG_RETHOOK) += rethook.o
obj-$(CONFIG_FPROBE_EVENTS) += trace_fprobe.o

obj-$(CONFIG_TRACEPOINT_BENCHMARK) += trace_benchmark.o
obj-$(CONFIG_RV) += rv/
Expand Down
6 changes: 4 additions & 2 deletions kernel/trace/bpf_trace.c
Original file line number Diff line number Diff line change
Expand Up @@ -2652,7 +2652,8 @@ kprobe_multi_link_prog_run(struct bpf_kprobe_multi_link *link,

static int
kprobe_multi_link_handler(struct fprobe *fp, unsigned long fentry_ip,
struct pt_regs *regs, void *data)
unsigned long ret_ip, struct pt_regs *regs,
void *data)
{
struct bpf_kprobe_multi_link *link;

Expand All @@ -2663,7 +2664,8 @@ kprobe_multi_link_handler(struct fprobe *fp, unsigned long fentry_ip,

static void
kprobe_multi_link_exit_handler(struct fprobe *fp, unsigned long fentry_ip,
struct pt_regs *regs, void *data)
unsigned long ret_ip, struct pt_regs *regs,
void *data)
{
struct bpf_kprobe_multi_link *link;

Expand Down
Loading

0 comments on commit d2a6fd4

Please sign in to comment.