Skip to content

Commit

Permalink
Merge branch 'mitigate-the-two-reallocations-issue-for-iptunnels'
Browse files Browse the repository at this point in the history
Justin Iurman says:

====================
Mitigate the two-reallocations issue for iptunnels

RESEND v5:
- v5 was sent just when net-next closed
v5:
- address Paolo's comments
- s/int dst_dev_overhead()/unsigned int dst_dev_overhead()/
v4:
- move static inline function to include/net/dst.h
v3:
- fix compilation error in seg6_iptunnel
v2:
- add missing "static" keywords in seg6_iptunnel
- use a static-inline function to return the dev overhead (as suggested
  by Olek, thanks)

The same pattern is found in ioam6, rpl6, and seg6. Basically, it first
makes sure there is enough room for inserting a new header:

(1) err = skb_cow_head(skb, len + skb->mac_len);

Then, when the insertion (encap or inline) is performed, the input and
output handlers respectively make sure there is enough room for layer 2:

(2) err = skb_cow_head(skb, LL_RESERVED_SPACE(dst->dev));

skb_cow_head() does nothing when there is enough room. Otherwise, it
reallocates more room, which depends on the architecture. Briefly,
skb_cow_head() calls __skb_cow() which then calls pskb_expand_head() as
follows:

pskb_expand_head(skb, ALIGN(delta, NET_SKB_PAD), 0, GFP_ATOMIC);

"delta" represents the number of bytes to be added. This value is
aligned with NET_SKB_PAD, which is defined as follows:

NET_SKB_PAD = max(32, L1_CACHE_BYTES)

... where L1_CACHE_BYTES also depends on the architecture. In our case
(x86), it is defined as follows:

L1_CACHE_BYTES = (1 << CONFIG_X86_L1_CACHE_SHIFT)

... where (again, in our case) CONFIG_X86_L1_CACHE_SHIFT equals 6
(=X86_GENERIC).

All this to say, skb_cow_head() would reallocate to the next multiple of
NET_SKB_PAD (in our case a 64-byte multiple) when there is not enough
room.

Back to the main issue with the pattern: in some cases, two
reallocations are triggered, resulting in a performance drop (i.e.,
lines (1) and (2) would both trigger an implicit reallocation). How's
that possible? Well, this is kind of bad luck as we hit an exact
NET_SKB_PAD boundary and when skb->mac_len (=14) is smaller than
LL_RESERVED_SPACE(dst->dev) (=16 in our case). For an x86 arch, it
happens in the following cases (with the default needed_headroom):

- ioam6:
 - (inline mode) pre-allocated data trace of 236 or 240 bytes
 - (encap mode) pre-allocated data trace of 196 or 200 bytes
- seg6:
 - (encap mode) for 13, 17, 21, 25, 29, 33, ...(+4)... prefixes

Let's illustrate the problem, i.e., when we fall on the exact
NET_SKB_PAD boundary. In the case of ioam6, for the above problematic
values, the total overhead is 256 bytes for both modes. Based on line
(1), skb->mac_len (=14) is added, therefore passing 270 bytes to
skb_cow_head(). At that moment, the headroom has 206 bytes available (in
our case). Since 270 > 206, skb_cow_head() performs a reallocation and
the new headroom is now 206 + 64 (NET_SKB_PAD) = 270. Which is exactly
the room we needed. After the insertion, the headroom has 0 byte
available. But, there's line (2) where 16 bytes are still needed. Which,
again, triggers another reallocation.

The same logic is applied to seg6 (although it does not happen with the
inline mode, i.e., -40 bytes). It happens with other L1 cache shifts too
(the larger the cache shift, the less often it happens). For example,
with a +32 cache shift (instead of +64), the following number of
segments would trigger two reallocations: 11, 15, 19, ... With a +128
cache shift, the following number of segments would trigger two
reallocations: 17, 25, 33, ... And so on and so forth. Note that it is
the same for both the "encap" and "l2encap" modes. For the "encap.red"
and "l2encap.red" modes, it is the same logic but with "segs+1" (e.g.,
14, 18, 22, 26, etc for a +64 cache shift). Note also that it may happen
with rpl6 (based on some calculations), although it did not in our case.

This series provides a solution to mitigate the aforementioned issue for
ioam6, seg6, and rpl6. It provides the dst_entry (in the cache) to
skb_cow_head() **before** the insertion (line (1)). As a result, the
very first iteration would still trigger two reallocations (i.e., empty
cache), while next iterations would only trigger a single reallocation.
====================

Link: https://patch.msgid.link/20241203124945.22508-1-justin.iurman@uliege.be
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
  • Loading branch information
Paolo Abeni committed Dec 5, 2024
2 parents 152d00a + 985ec6f commit da4fa00
Show file tree
Hide file tree
Showing 4 changed files with 123 additions and 90 deletions.
9 changes: 9 additions & 0 deletions include/net/dst.h
Original file line number Diff line number Diff line change
Expand Up @@ -440,6 +440,15 @@ static inline void dst_set_expires(struct dst_entry *dst, int timeout)
dst->expires = expires;
}

static inline unsigned int dst_dev_overhead(struct dst_entry *dst,
struct sk_buff *skb)
{
if (likely(dst))
return LL_RESERVED_SPACE(dst->dev);

return skb->mac_len;
}

INDIRECT_CALLABLE_DECLARE(int ip6_output(struct net *, struct sock *,
struct sk_buff *));
INDIRECT_CALLABLE_DECLARE(int ip_output(struct net *, struct sock *,
Expand Down
73 changes: 37 additions & 36 deletions net/ipv6/ioam6_iptunnel.c
Original file line number Diff line number Diff line change
Expand Up @@ -253,14 +253,15 @@ static int ioam6_do_fill(struct net *net, struct sk_buff *skb)
}

static int ioam6_do_inline(struct net *net, struct sk_buff *skb,
struct ioam6_lwt_encap *tuninfo)
struct ioam6_lwt_encap *tuninfo,
struct dst_entry *cache_dst)
{
struct ipv6hdr *oldhdr, *hdr;
int hdrlen, err;

hdrlen = (tuninfo->eh.hdrlen + 1) << 3;

err = skb_cow_head(skb, hdrlen + skb->mac_len);
err = skb_cow_head(skb, hdrlen + dst_dev_overhead(cache_dst, skb));
if (unlikely(err))
return err;

Expand Down Expand Up @@ -291,7 +292,8 @@ static int ioam6_do_encap(struct net *net, struct sk_buff *skb,
struct ioam6_lwt_encap *tuninfo,
bool has_tunsrc,
struct in6_addr *tunsrc,
struct in6_addr *tundst)
struct in6_addr *tundst,
struct dst_entry *cache_dst)
{
struct dst_entry *dst = skb_dst(skb);
struct ipv6hdr *hdr, *inner_hdr;
Expand All @@ -300,7 +302,7 @@ static int ioam6_do_encap(struct net *net, struct sk_buff *skb,
hdrlen = (tuninfo->eh.hdrlen + 1) << 3;
len = sizeof(*hdr) + hdrlen;

err = skb_cow_head(skb, len + skb->mac_len);
err = skb_cow_head(skb, len + dst_dev_overhead(cache_dst, skb));
if (unlikely(err))
return err;

Expand Down Expand Up @@ -334,7 +336,7 @@ static int ioam6_do_encap(struct net *net, struct sk_buff *skb,

static int ioam6_output(struct net *net, struct sock *sk, struct sk_buff *skb)
{
struct dst_entry *dst = skb_dst(skb);
struct dst_entry *dst = skb_dst(skb), *cache_dst;
struct in6_addr orig_daddr;
struct ioam6_lwt *ilwt;
int err = -EINVAL;
Expand All @@ -352,14 +354,18 @@ static int ioam6_output(struct net *net, struct sock *sk, struct sk_buff *skb)

orig_daddr = ipv6_hdr(skb)->daddr;

local_bh_disable();
cache_dst = dst_cache_get(&ilwt->cache);
local_bh_enable();

switch (ilwt->mode) {
case IOAM6_IPTUNNEL_MODE_INLINE:
do_inline:
/* Direct insertion - if there is no Hop-by-Hop yet */
if (ipv6_hdr(skb)->nexthdr == NEXTHDR_HOP)
goto out;

err = ioam6_do_inline(net, skb, &ilwt->tuninfo);
err = ioam6_do_inline(net, skb, &ilwt->tuninfo, cache_dst);
if (unlikely(err))
goto drop;

Expand All @@ -369,7 +375,7 @@ static int ioam6_output(struct net *net, struct sock *sk, struct sk_buff *skb)
/* Encapsulation (ip6ip6) */
err = ioam6_do_encap(net, skb, &ilwt->tuninfo,
ilwt->has_tunsrc, &ilwt->tunsrc,
&ilwt->tundst);
&ilwt->tundst, cache_dst);
if (unlikely(err))
goto drop;

Expand All @@ -387,41 +393,36 @@ static int ioam6_output(struct net *net, struct sock *sk, struct sk_buff *skb)
goto drop;
}

err = skb_cow_head(skb, LL_RESERVED_SPACE(dst->dev));
if (unlikely(err))
goto drop;
if (unlikely(!cache_dst)) {
struct ipv6hdr *hdr = ipv6_hdr(skb);
struct flowi6 fl6;

memset(&fl6, 0, sizeof(fl6));
fl6.daddr = hdr->daddr;
fl6.saddr = hdr->saddr;
fl6.flowlabel = ip6_flowinfo(hdr);
fl6.flowi6_mark = skb->mark;
fl6.flowi6_proto = hdr->nexthdr;

cache_dst = ip6_route_output(net, NULL, &fl6);
if (cache_dst->error) {
err = cache_dst->error;
dst_release(cache_dst);
goto drop;
}

if (!ipv6_addr_equal(&orig_daddr, &ipv6_hdr(skb)->daddr)) {
local_bh_disable();
dst = dst_cache_get(&ilwt->cache);
dst_cache_set_ip6(&ilwt->cache, cache_dst, &fl6.saddr);
local_bh_enable();

if (unlikely(!dst)) {
struct ipv6hdr *hdr = ipv6_hdr(skb);
struct flowi6 fl6;

memset(&fl6, 0, sizeof(fl6));
fl6.daddr = hdr->daddr;
fl6.saddr = hdr->saddr;
fl6.flowlabel = ip6_flowinfo(hdr);
fl6.flowi6_mark = skb->mark;
fl6.flowi6_proto = hdr->nexthdr;

dst = ip6_route_output(net, NULL, &fl6);
if (dst->error) {
err = dst->error;
dst_release(dst);
goto drop;
}

local_bh_disable();
dst_cache_set_ip6(&ilwt->cache, dst, &fl6.saddr);
local_bh_enable();
}
err = skb_cow_head(skb, LL_RESERVED_SPACE(cache_dst->dev));
if (unlikely(err))
goto drop;
}

if (!ipv6_addr_equal(&orig_daddr, &ipv6_hdr(skb)->daddr)) {
skb_dst_drop(skb);
skb_dst_set(skb, dst);

skb_dst_set(skb, cache_dst);
return dst_output(net, sk, skb);
}
out:
Expand Down
46 changes: 25 additions & 21 deletions net/ipv6/rpl_iptunnel.c
Original file line number Diff line number Diff line change
Expand Up @@ -125,7 +125,8 @@ static void rpl_destroy_state(struct lwtunnel_state *lwt)
}

static int rpl_do_srh_inline(struct sk_buff *skb, const struct rpl_lwt *rlwt,
const struct ipv6_rpl_sr_hdr *srh)
const struct ipv6_rpl_sr_hdr *srh,
struct dst_entry *cache_dst)
{
struct ipv6_rpl_sr_hdr *isrh, *csrh;
const struct ipv6hdr *oldhdr;
Expand Down Expand Up @@ -153,7 +154,7 @@ static int rpl_do_srh_inline(struct sk_buff *skb, const struct rpl_lwt *rlwt,

hdrlen = ((csrh->hdrlen + 1) << 3);

err = skb_cow_head(skb, hdrlen + skb->mac_len);
err = skb_cow_head(skb, hdrlen + dst_dev_overhead(cache_dst, skb));
if (unlikely(err)) {
kfree(buf);
return err;
Expand Down Expand Up @@ -186,7 +187,8 @@ static int rpl_do_srh_inline(struct sk_buff *skb, const struct rpl_lwt *rlwt,
return 0;
}

static int rpl_do_srh(struct sk_buff *skb, const struct rpl_lwt *rlwt)
static int rpl_do_srh(struct sk_buff *skb, const struct rpl_lwt *rlwt,
struct dst_entry *cache_dst)
{
struct dst_entry *dst = skb_dst(skb);
struct rpl_iptunnel_encap *tinfo;
Expand All @@ -196,7 +198,7 @@ static int rpl_do_srh(struct sk_buff *skb, const struct rpl_lwt *rlwt)

tinfo = rpl_encap_lwtunnel(dst->lwtstate);

return rpl_do_srh_inline(skb, rlwt, tinfo->srh);
return rpl_do_srh_inline(skb, rlwt, tinfo->srh, cache_dst);
}

static int rpl_output(struct net *net, struct sock *sk, struct sk_buff *skb)
Expand All @@ -208,14 +210,14 @@ static int rpl_output(struct net *net, struct sock *sk, struct sk_buff *skb)

rlwt = rpl_lwt_lwtunnel(orig_dst->lwtstate);

err = rpl_do_srh(skb, rlwt);
if (unlikely(err))
goto drop;

local_bh_disable();
dst = dst_cache_get(&rlwt->cache);
local_bh_enable();

err = rpl_do_srh(skb, rlwt, dst);
if (unlikely(err))
goto drop;

if (unlikely(!dst)) {
struct ipv6hdr *hdr = ipv6_hdr(skb);
struct flowi6 fl6;
Expand All @@ -237,15 +239,15 @@ static int rpl_output(struct net *net, struct sock *sk, struct sk_buff *skb)
local_bh_disable();
dst_cache_set_ip6(&rlwt->cache, dst, &fl6.saddr);
local_bh_enable();

err = skb_cow_head(skb, LL_RESERVED_SPACE(dst->dev));
if (unlikely(err))
goto drop;
}

skb_dst_drop(skb);
skb_dst_set(skb, dst);

err = skb_cow_head(skb, LL_RESERVED_SPACE(dst->dev));
if (unlikely(err))
goto drop;

return dst_output(net, sk, skb);

drop:
Expand All @@ -262,29 +264,31 @@ static int rpl_input(struct sk_buff *skb)

rlwt = rpl_lwt_lwtunnel(orig_dst->lwtstate);

err = rpl_do_srh(skb, rlwt);
if (unlikely(err))
goto drop;

local_bh_disable();
dst = dst_cache_get(&rlwt->cache);
local_bh_enable();

err = rpl_do_srh(skb, rlwt, dst);
if (unlikely(err))
goto drop;

if (!dst) {
ip6_route_input(skb);
dst = skb_dst(skb);
if (!dst->error) {
local_bh_disable();
dst_cache_set_ip6(&rlwt->cache, dst,
&ipv6_hdr(skb)->saddr);
local_bh_enable();
}

err = skb_cow_head(skb, LL_RESERVED_SPACE(dst->dev));
if (unlikely(err))
goto drop;
} else {
skb_dst_drop(skb);
skb_dst_set(skb, dst);
}
local_bh_enable();

err = skb_cow_head(skb, LL_RESERVED_SPACE(dst->dev));
if (unlikely(err))
goto drop;

return dst_input(skb);

Expand Down
Loading

0 comments on commit da4fa00

Please sign in to comment.