Skip to content

Commit

Permalink
Merge branch 'skb-mono-delivery-time'
Browse files Browse the repository at this point in the history
Martin KaFai Lau says:

====================
Preserve mono delivery time (EDT) in skb->tstamp

skb->tstamp was first used as the (rcv) timestamp.
The major usage is to report it to the user (e.g. SO_TIMESTAMP).

Later, skb->tstamp is also set as the (future) delivery_time (e.g. EDT in TCP)
during egress and used by the qdisc (e.g. sch_fq) to make decision on when
the skb can be passed to the dev.

Currently, there is no way to tell skb->tstamp having the (rcv) timestamp
or the delivery_time, so it is always reset to 0 whenever forwarded
between egress and ingress.

While it makes sense to always clear the (rcv) timestamp in skb->tstamp
to avoid confusing sch_fq that expects the delivery_time, it is a
performance issue [0] to clear the delivery_time if the skb finally
egress to a fq@phy-dev.

This set is to keep the mono delivery time and make it available to
the final egress interface.  Please see individual patch for
the details.

[0] (slide 22): https://linuxplumbersconf.org/event/11/contributions/953/attachments/867/1658/LPC_2021_BPF_Datapath_Extensions.pdf

v6:
- Add kdoc and use non-UAPI type in patch 6 (Jakub)

v5:
netdev:
- Patch 3 in v4 is broken down into smaller patches 3, 4, and 5 in v5
- The mono_delivery_time bit clearing in __skb_tstamp_tx() is
  done in __net_timestamp() instead.  This is patch 4 in v5.
- Missed a skb_clear_delivery_time() for the 'skip_classify' case
  in dev.c in v4.  That is fixed in patch 5 in v5 for correctness.
  The skb_clear_delivery_time() will be moved to a later
  stage in Patch 10, so it was an intermediate error in v4.
- Added delivery time handling for nfnetlink_{log, queue}.c in patch 9 (Daniel)
- Added delivery time handling in the IPv6 IOAM hop-by-hop option which has
  an experimental IANA assigned value 49 in patch 8
- Added delivery time handling in nf_conntrack for the ipv6 defrag case
  in patch 7
- Removed unlikely() from testing skb->mono_delivery_time (Daniel)

bpf:
- Remove the skb->tstamp dance in ingress.  Depends on bpf insn
  rewrite to return 0 if skb->tstamp has delivery time in patch 11.
  It is to backward compatible with the existing tc-bpf@ingress in
  patch 11.
- bpf_set_delivery_time() will also allow dtime == 0 and
  dtime_type == BPF_SKB_DELIVERY_TIME_NONE as argument
  in patch 12.

v4:
netdev:
- Push the skb_clear_delivery_time() from
  ip_local_deliver() and ip6_input() to
  ip_local_deliver_finish() and ip6_input_finish()
  to accommodate the ipvs forward path.
  This is the notable change in v4 at the netdev side.

    - Patch 3/8 first does the skb_clear_delivery_time() after
      sch_handle_ingress() in dev.c and this will make the
      tc-bpf forward path work via the bpf_redirect_*() helper.

    - The next patch 4/8 (new in v4) will then postpone the
      skb_clear_delivery_time() from dev.c to
      the ip_local_deliver_finish() and ip6_input_finish() after
      taking care of the tstamp usage in the ip defrag case.
      This will make the kernel forward path also work, e.g.
      the ip[6]_forward().

- Fixed a case v3 which missed setting the skb->mono_delivery_time bit
  when sending TCP rst/ack in some cases (e.g. from a ctl_sk).
  That case happens at ip_send_unicast_reply() and
  tcp_v6_send_response().  It is fixed in patch 1/8 (and
  then patch 3/8) in v4.

bpf:
- Adding __sk_buff->delivery_time_type instead of adding
  __sk_buff->mono_delivery_time as in v3.  The tc-bpf can stay with
  one __sk_buff->tstamp instead of having two 'time' fields
  while one is 0 and another is not.
  tc-bpf can use the new __sk_buff->delivery_time_type to tell
  what is stored in __sk_buff->tstamp.
- bpf_skb_set_delivery_time() helper is added to set
  __sk_buff->tstamp from non mono delivery_time to
  mono delivery_time
- Most of the convert_ctx_access() bpf insn rewrite in v3
  is gone, so no new rewrite added for __sk_buff->tstamp.
  The only rewrite added is for reading the new
  __sk_buff->delivery_time_type.
- Added selftests, test_tc_dtime.c

v3:
- Feedback from v2 is using shinfo(skb)->tx_flags could be racy.
- Considered to reuse a few bits in skb->tstamp to represent
  different semantics, other than more code churns, it will break
  the bpf usecase which currently can write and then read back
  the skb->tstamp.
- Went back to v1 idea on adding a bit to skb and address the
  feedbacks on v1:
- Added one bit skb->mono_delivery_time to flag that
  the skb->tstamp has the mono delivery_time (EDT), instead
  of adding a bit to flag if the skb->tstamp has been forwarded or not.
- Instead of resetting the delivery_time back to the (rcv) timestamp
  during recvmsg syscall which may be too late and not useful,
  the delivery_time reset in v3 happens earlier once the stack
  knows that the skb will be delivered locally.
- Handled the tapping@ingress case by af_packet
- No need to change the (rcv) timestamp to mono clock base as in v1.
  The added one bit to flag skb->mono_delivery_time is enough
  to keep the EDT delivery_time during forward.
- Added logic to the bpf side to make the existing bpf
  running at ingress can still get the (rcv) timestamp
  when reading the __sk_buff->tstamp.  New __sk_buff->mono_delivery_time
  is also added.  Test is still needed to test this piece.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
  • Loading branch information
David S. Miller committed Mar 3, 2022
2 parents 6fb8661 + c803475 commit 01e2d15
Show file tree
Hide file tree
Showing 38 changed files with 1,174 additions and 73 deletions.
2 changes: 1 addition & 1 deletion drivers/net/loopback.c
Original file line number Diff line number Diff line change
Expand Up @@ -74,7 +74,7 @@ static netdev_tx_t loopback_xmit(struct sk_buff *skb,
skb_tx_timestamp(skb);

/* do not fool net_timestamp_check() with various clock bases */
skb->tstamp = 0;
skb_clear_tstamp(skb);

skb_orphan(skb);

Expand Down
3 changes: 2 additions & 1 deletion include/linux/filter.h
Original file line number Diff line number Diff line change
Expand Up @@ -572,7 +572,8 @@ struct bpf_prog {
has_callchain_buf:1, /* callchain buffer allocated? */
enforce_expected_attach_type:1, /* Enforce expected_attach_type checking at attach time */
call_get_stack:1, /* Do we call bpf_get_stack() or bpf_get_stackid() */
call_get_func_ip:1; /* Do we call get_func_ip() */
call_get_func_ip:1, /* Do we call get_func_ip() */
delivery_time_access:1; /* Accessed __sk_buff->delivery_time_type */
enum bpf_prog_type type; /* Type of BPF program */
enum bpf_attach_type expected_attach_type; /* For some prog types */
u32 len; /* Number of filter blocks */
Expand Down
74 changes: 68 additions & 6 deletions include/linux/skbuff.h
Original file line number Diff line number Diff line change
Expand Up @@ -795,6 +795,10 @@ typedef unsigned char *sk_buff_data_t;
* @dst_pending_confirm: need to confirm neighbour
* @decrypted: Decrypted SKB
* @slow_gro: state present at GRO time, slower prepare step required
* @mono_delivery_time: When set, skb->tstamp has the
* delivery_time in mono clock base (i.e. EDT). Otherwise, the
* skb->tstamp has the (rcv) timestamp at ingress and
* delivery_time at egress.
* @napi_id: id of the NAPI struct this skb came from
* @sender_cpu: (aka @napi_id) source CPU in XPS
* @secmark: security marking
Expand Down Expand Up @@ -937,8 +941,12 @@ struct sk_buff {
__u8 vlan_present:1; /* See PKT_VLAN_PRESENT_BIT */
__u8 csum_complete_sw:1;
__u8 csum_level:2;
__u8 csum_not_inet:1;
__u8 dst_pending_confirm:1;
__u8 mono_delivery_time:1;
#ifdef CONFIG_NET_CLS_ACT
__u8 tc_skip_classify:1;
__u8 tc_at_ingress:1;
#endif
#ifdef CONFIG_IPV6_NDISC_NODETYPE
__u8 ndisc_nodetype:2;
#endif
Expand All @@ -949,10 +957,6 @@ struct sk_buff {
#ifdef CONFIG_NET_SWITCHDEV
__u8 offload_fwd_mark:1;
__u8 offload_l3_fwd_mark:1;
#endif
#ifdef CONFIG_NET_CLS_ACT
__u8 tc_skip_classify:1;
__u8 tc_at_ingress:1;
#endif
__u8 redirected:1;
#ifdef CONFIG_NET_REDIRECT
Expand All @@ -965,6 +969,7 @@ struct sk_buff {
__u8 decrypted:1;
#endif
__u8 slow_gro:1;
__u8 csum_not_inet:1;

#ifdef CONFIG_NET_SCHED
__u16 tc_index; /* traffic control index */
Expand Down Expand Up @@ -1042,10 +1047,16 @@ struct sk_buff {
/* if you move pkt_vlan_present around you also must adapt these constants */
#ifdef __BIG_ENDIAN_BITFIELD
#define PKT_VLAN_PRESENT_BIT 7
#define TC_AT_INGRESS_MASK (1 << 0)
#define SKB_MONO_DELIVERY_TIME_MASK (1 << 2)
#else
#define PKT_VLAN_PRESENT_BIT 0
#define TC_AT_INGRESS_MASK (1 << 7)
#define SKB_MONO_DELIVERY_TIME_MASK (1 << 5)
#endif
#define PKT_VLAN_PRESENT_OFFSET offsetof(struct sk_buff, __pkt_vlan_present_offset)
#define TC_AT_INGRESS_OFFSET offsetof(struct sk_buff, __pkt_vlan_present_offset)
#define SKB_MONO_DELIVERY_TIME_OFFSET offsetof(struct sk_buff, __pkt_vlan_present_offset)

#ifdef __KERNEL__
/*
Expand Down Expand Up @@ -3976,13 +3987,64 @@ static inline void skb_get_new_timestampns(const struct sk_buff *skb,
static inline void __net_timestamp(struct sk_buff *skb)
{
skb->tstamp = ktime_get_real();
skb->mono_delivery_time = 0;
}

static inline ktime_t net_timedelta(ktime_t t)
{
return ktime_sub(ktime_get_real(), t);
}

static inline void skb_set_delivery_time(struct sk_buff *skb, ktime_t kt,
bool mono)
{
skb->tstamp = kt;
skb->mono_delivery_time = kt && mono;
}

DECLARE_STATIC_KEY_FALSE(netstamp_needed_key);

/* It is used in the ingress path to clear the delivery_time.
* If needed, set the skb->tstamp to the (rcv) timestamp.
*/
static inline void skb_clear_delivery_time(struct sk_buff *skb)
{
if (skb->mono_delivery_time) {
skb->mono_delivery_time = 0;
if (static_branch_unlikely(&netstamp_needed_key))
skb->tstamp = ktime_get_real();
else
skb->tstamp = 0;
}
}

static inline void skb_clear_tstamp(struct sk_buff *skb)
{
if (skb->mono_delivery_time)
return;

skb->tstamp = 0;
}

static inline ktime_t skb_tstamp(const struct sk_buff *skb)
{
if (skb->mono_delivery_time)
return 0;

return skb->tstamp;
}

static inline ktime_t skb_tstamp_cond(const struct sk_buff *skb, bool cond)
{
if (!skb->mono_delivery_time && skb->tstamp)
return skb->tstamp;

if (static_branch_unlikely(&netstamp_needed_key) || cond)
return ktime_get_real();

return 0;
}

static inline u8 skb_metadata_len(const struct sk_buff *skb)
{
return skb_shinfo(skb)->meta_len;
Expand Down Expand Up @@ -4839,7 +4901,7 @@ static inline void skb_set_redirected(struct sk_buff *skb, bool from_ingress)
#ifdef CONFIG_NET_REDIRECT
skb->from_ingress = from_ingress;
if (skb->from_ingress)
skb->tstamp = 0;
skb_clear_tstamp(skb);
#endif
}

Expand Down
2 changes: 2 additions & 0 deletions include/net/inet_frag.h
Original file line number Diff line number Diff line change
Expand Up @@ -70,6 +70,7 @@ struct frag_v6_compare_key {
* @stamp: timestamp of the last received fragment
* @len: total length of the original datagram
* @meat: length of received fragments so far
* @mono_delivery_time: stamp has a mono delivery time (EDT)
* @flags: fragment queue flags
* @max_size: maximum received fragment size
* @fqdir: pointer to struct fqdir
Expand All @@ -90,6 +91,7 @@ struct inet_frag_queue {
ktime_t stamp;
int len;
int meat;
u8 mono_delivery_time;
__u8 flags;
u16 max_size;
struct fqdir *fqdir;
Expand Down
41 changes: 40 additions & 1 deletion include/uapi/linux/bpf.h
Original file line number Diff line number Diff line change
Expand Up @@ -5086,6 +5086,37 @@ union bpf_attr {
* Return
* 0 on success, or a negative error in case of failure. On error
* *dst* buffer is zeroed out.
*
* long bpf_skb_set_delivery_time(struct sk_buff *skb, u64 dtime, u32 dtime_type)
* Description
* Set a *dtime* (delivery time) to the __sk_buff->tstamp and also
* change the __sk_buff->delivery_time_type to *dtime_type*.
*
* When setting a delivery time (non zero *dtime*) to
* __sk_buff->tstamp, only BPF_SKB_DELIVERY_TIME_MONO *dtime_type*
* is supported. It is the only delivery_time_type that will be
* kept after bpf_redirect_*().
*
* If there is no need to change the __sk_buff->delivery_time_type,
* the delivery time can be directly written to __sk_buff->tstamp
* instead.
*
* *dtime* 0 and *dtime_type* BPF_SKB_DELIVERY_TIME_NONE
* can be used to clear any delivery time stored in
* __sk_buff->tstamp.
*
* Only IPv4 and IPv6 skb->protocol are supported.
*
* This function is most useful when it needs to set a
* mono delivery time to __sk_buff->tstamp and then
* bpf_redirect_*() to the egress of an iface. For example,
* changing the (rcv) timestamp in __sk_buff->tstamp at
* ingress to a mono delivery time and then bpf_redirect_*()
* to sch_fq@phy-dev.
* Return
* 0 on success.
* **-EINVAL** for invalid input
* **-EOPNOTSUPP** for unsupported delivery_time_type and protocol
*/
#define __BPF_FUNC_MAPPER(FN) \
FN(unspec), \
Expand Down Expand Up @@ -5280,6 +5311,7 @@ union bpf_attr {
FN(xdp_load_bytes), \
FN(xdp_store_bytes), \
FN(copy_from_user_task), \
FN(skb_set_delivery_time), \
/* */

/* integer value in 'imm' field of BPF_CALL instruction selects which helper
Expand Down Expand Up @@ -5469,6 +5501,12 @@ union { \
__u64 :64; \
} __attribute__((aligned(8)))

enum {
BPF_SKB_DELIVERY_TIME_NONE,
BPF_SKB_DELIVERY_TIME_UNSPEC,
BPF_SKB_DELIVERY_TIME_MONO,
};

/* user accessible mirror of in-kernel sk_buff.
* new fields can only be added to the end of this structure
*/
Expand Down Expand Up @@ -5509,7 +5547,8 @@ struct __sk_buff {
__u32 gso_segs;
__bpf_md_ptr(struct bpf_sock *, sk);
__u32 gso_size;
__u32 :32; /* Padding, future use. */
__u8 delivery_time_type;
__u32 :24; /* Padding, future use. */
__u64 hwtstamp;
};

Expand Down
2 changes: 1 addition & 1 deletion net/bridge/br_forward.c
Original file line number Diff line number Diff line change
Expand Up @@ -62,7 +62,7 @@ EXPORT_SYMBOL_GPL(br_dev_queue_push_xmit);

int br_forward_finish(struct net *net, struct sock *sk, struct sk_buff *skb)
{
skb->tstamp = 0;
skb_clear_tstamp(skb);
return NF_HOOK(NFPROTO_BRIDGE, NF_BR_POST_ROUTING,
net, sk, skb, NULL, skb->dev,
br_dev_queue_push_xmit);
Expand Down
5 changes: 3 additions & 2 deletions net/bridge/netfilter/nf_conntrack_bridge.c
Original file line number Diff line number Diff line change
Expand Up @@ -32,6 +32,7 @@ static int nf_br_ip_fragment(struct net *net, struct sock *sk,
struct sk_buff *))
{
int frag_max_size = BR_INPUT_SKB_CB(skb)->frag_max_size;
bool mono_delivery_time = skb->mono_delivery_time;
unsigned int hlen, ll_rs, mtu;
ktime_t tstamp = skb->tstamp;
struct ip_frag_state state;
Expand Down Expand Up @@ -81,7 +82,7 @@ static int nf_br_ip_fragment(struct net *net, struct sock *sk,
if (iter.frag)
ip_fraglist_prepare(skb, &iter);

skb->tstamp = tstamp;
skb_set_delivery_time(skb, tstamp, mono_delivery_time);
err = output(net, sk, data, skb);
if (err || !iter.frag)
break;
Expand Down Expand Up @@ -112,7 +113,7 @@ static int nf_br_ip_fragment(struct net *net, struct sock *sk,
goto blackhole;
}

skb2->tstamp = tstamp;
skb_set_delivery_time(skb2, tstamp, mono_delivery_time);
err = output(net, sk, data, skb2);
if (err)
goto blackhole;
Expand Down
8 changes: 5 additions & 3 deletions net/core/dev.c
Original file line number Diff line number Diff line change
Expand Up @@ -2047,7 +2047,8 @@ void net_dec_egress_queue(void)
EXPORT_SYMBOL_GPL(net_dec_egress_queue);
#endif

static DEFINE_STATIC_KEY_FALSE(netstamp_needed_key);
DEFINE_STATIC_KEY_FALSE(netstamp_needed_key);
EXPORT_SYMBOL(netstamp_needed_key);
#ifdef CONFIG_JUMP_LABEL
static atomic_t netstamp_needed_deferred;
static atomic_t netstamp_wanted;
Expand Down Expand Up @@ -2108,14 +2109,15 @@ EXPORT_SYMBOL(net_disable_timestamp);
static inline void net_timestamp_set(struct sk_buff *skb)
{
skb->tstamp = 0;
skb->mono_delivery_time = 0;
if (static_branch_unlikely(&netstamp_needed_key))
__net_timestamp(skb);
skb->tstamp = ktime_get_real();
}

#define net_timestamp_check(COND, SKB) \
if (static_branch_unlikely(&netstamp_needed_key)) { \
if ((COND) && !(SKB)->tstamp) \
__net_timestamp(SKB); \
(SKB)->tstamp = ktime_get_real(); \
} \

bool is_skb_forwardable(const struct net_device *dev, const struct sk_buff *skb)
Expand Down
Loading

0 comments on commit 01e2d15

Please sign in to comment.