Merge branch 'add-layer-2-miss-indication-and-filtering'
Ido Schimmel says:

====================
Add layer 2 miss indication and filtering

tl;dr
=====

This patchset adds a single bit to the tc skb extension to indicate that
a packet encountered a layer 2 miss in the bridge and extends flower to
match on this metadata. This is required for non-DF (Designated
Forwarder) filtering in EVPN multi-homing which prevents decapsulated
BUM packets from being forwarded multiple times to the same multi-homed
host.

Background
==========

In a typical EVPN multi-homing setup each host is multi-homed using a
set of links called ES (Ethernet Segment, i.e., LAG) to multiple leaf
switches in a rack. These switches act as VTEPs and are not directly
connected (as opposed to MLAG), but can communicate with each other (as
well as with VTEPs in remote racks) via spine switches over L3.

When a host sends a BUM packet over ES1 to VTEP1, the VTEP will flood it
to other VTEPs in the network, including those connected to the host
over ES1. The receiving VTEPs must drop the packet and not forward it
back to the host. This is called "split-horizon filtering" (SPH) [1].

FRR configures SPH filtering using two tc filters. The first is an
ingress filter that matches on packets received from VTEP1 and marks
them using a fwmark (firewall mark). The second is an egress filter,
configured on the LAG interface connected to the host, that matches on
the fwmark and drops the packets. Example:

 # tc filter add dev vxlan0 ingress pref 1 proto all flower enc_src_ip $VTEP1_IP action skbedit mark 101
 # tc filter add dev bond0 egress pref 1 handle 101 fw action drop

Motivation
==========

For each ES, only one VTEP is elected by the control plane as the DF.
The DF is responsible for forwarding decapsulated BUM traffic to the
host over the ES. The non-DF VTEPs must drop such traffic as otherwise
the host will receive multiple copies of BUM traffic. This is called
"non-DF filtering" [2].

Filtering of multicast and broadcast traffic can be achieved using the
following flower filter:

 # tc filter add dev bond0 egress pref 1 proto all flower indev vxlan0 dst_mac 01:00:00:00:00:00/01:00:00:00:00:00 action drop

Unlike broadcast and multicast traffic, it is not currently possible to
filter unknown unicast traffic. The classification into unknown unicast
is performed by the bridge driver, but is not visible to other layers.

Implementation
==============

The proposed solution is to add a single bit to the tc skb extension
that is set by the bridge for packets that encountered an FDB or MDB
miss. The flower classifier is extended to be able to match on this new
metadata bit in a similar fashion to existing metadata options such as
'indev'.

A bit that is set for every flooded packet would also work, but it does
not allow us to differentiate between registered and unregistered
multicast traffic which might be useful in the future.
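
The value the bit is meant to carry for each class of traffic can be
sketched in plain C. This is a user-space illustration only: the enum
and the function name are hypothetical and do not exist in the kernel.
The mapping follows the bridge changes in this series (the bit is
cleared on input, left clear for broadcast and set on an FDB or MDB
miss).

```c
#include <stdbool.h>

/* Hypothetical traffic classes, named for illustration only. */
enum l2_class {
	L2_KNOWN_UNICAST,          /* FDB hit */
	L2_UNKNOWN_UNICAST,        /* FDB miss */
	L2_BROADCAST,              /* flooded, but not treated as a miss */
	L2_REGISTERED_MULTICAST,   /* MDB hit */
	L2_UNREGISTERED_MULTICAST, /* MDB miss */
};

/* Value the 'l2_miss' bit would carry for each class. */
bool l2_miss_for(enum l2_class c)
{
	return c == L2_UNKNOWN_UNICAST || c == L2_UNREGISTERED_MULTICAST;
}
```

Registered and unregistered multicast end up with different values,
which is the distinction a plain "flooded" bit could not express.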

A relatively generic name is chosen for this bit - 'l2_miss' - to allow
its use to be extended to other layer 2 devices such as VXLAN, should a
use case arise.

With the above, the control plane can implement a non-DF filter using
the following tc filters:

 # tc filter add dev bond0 egress pref 1 proto all flower indev vxlan0 dst_mac 01:00:00:00:00:00/01:00:00:00:00:00 action drop
 # tc filter add dev bond0 egress pref 2 proto all flower indev vxlan0 l2_miss true action drop

The first drops broadcast and multicast traffic and the second drops
unknown unicast traffic.
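
The combined effect of the two filters can be modelled as a single
predicate. The following is a minimal user-space sketch, not kernel
code: 'struct pkt_meta' and 'non_df_should_drop()' are hypothetical
names, and the ingress device is represented by an ifindex chosen by
the caller.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-packet metadata, for illustration only. */
struct pkt_meta {
	int indev_ifindex;   /* device the packet was received on */
	uint8_t dst_mac[6];
	bool l2_miss;        /* set by the bridge on FDB or MDB miss */
};

/* Returns true if the egress filters on the LAG would drop the packet. */
bool non_df_should_drop(const struct pkt_meta *p, int vxlan_ifindex)
{
	/* Both filters match on 'indev vxlan0'. */
	if (p->indev_ifindex != vxlan_ifindex)
		return false;

	/* pref 1: dst_mac 01:00:00:00:00:00/01:00:00:00:00:00 matches any
	 * destination with the group bit set (broadcast and multicast).
	 */
	if (p->dst_mac[0] & 0x01)
		return true;

	/* pref 2: 'l2_miss true' catches unknown unicast. */
	return p->l2_miss;
}
```

Known unicast received from the VXLAN device passes both filters, as
does any traffic that did not ingress through it.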

Testing
=======

A test exercising the different permutations of the 'l2_miss' bit is
added in patch #8.

Patchset overview
=================

Patch #1 adds the new bit to the tc skb extension and sets it in the
bridge driver for packets that encountered a miss. The marking of the
packets and the use of this extension is protected by the
'tc_skb_ext_tc' static key in order to keep performance impact to a
minimum when the feature is not in use.

Patch #2 extends the flow dissector to dissect this information from the
tc skb extension into the 'FLOW_DISSECTOR_KEY_META' key.

Patch #3 extends the flower classifier to be able to match on the new
layer 2 miss metadata. The classifier enables the 'tc_skb_ext_tc' static
key upon the installation of the first filter that matches on 'l2_miss'
and disables the key upon the removal of the last filter that matches on
it.

Patch #4 rejects matching on the new metadata in drivers that already
support the 'FLOW_DISSECTOR_KEY_META' key.

Patches #5-#6 are small preparations in mlxsw.

Patch #7 extends mlxsw to be able to match on layer 2 miss.

Patch #8 adds a selftest.
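
The enable/disable behaviour described for patch #3 amounts to
reference counting. The sketch below models it in user space with a
plain counter standing in for the 'tc_skb_ext_tc' static key; all names
in it are hypothetical and it makes no attempt at the locking or
patching a real static key involves.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the static key's user count. */
static int tc_ext_users;

/* Hypothetical filter object; only the flag relevant here is modelled. */
struct filter {
	bool needs_tc_skb_ext;
};

/* Take a reference when the filter matches on 'l2_miss'. */
void filter_install(struct filter *f, bool matches_l2_miss)
{
	f->needs_tc_skb_ext = matches_l2_miss;
	if (matches_l2_miss)
		tc_ext_users++;
}

/* Drop the reference taken at install time, if any. */
void filter_destroy(struct filter *f)
{
	if (f->needs_tc_skb_ext) {
		tc_ext_users--;
		f->needs_tc_skb_ext = false;
	}
}

/* The marking path is active only while at least one user exists. */
bool tc_ext_enabled(void)
{
	return tc_ext_users > 0;
}
```

Recording the flag per filter means destruction only drops a reference
that the same filter took, so filters that never match on 'l2_miss'
leave the count untouched.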

iproute2 patches can be found here [3].

[1] https://datatracker.ietf.org/doc/html/rfc7432#section-8.3
[2] https://datatracker.ietf.org/doc/html/rfc7432#section-8.5
[3] https://github.com/idosch/iproute2/tree/submit/non_df_filter_v1
[4] https://lore.kernel.org/netdev/20230518113328.1952135-1-idosch@nvidia.com/
[5] https://lore.kernel.org/netdev/20230509070446.246088-1-idosch@nvidia.com/
====================

Link: https://lore.kernel.org/r/20230529114835.372140-1-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski committed May 31, 2023
2 parents 2e246bc + 8c33266 commit e180a33
Showing 18 changed files with 485 additions and 16 deletions.
6 changes: 6 additions & 0 deletions drivers/net/ethernet/marvell/prestera/prestera_flower.c
@@ -148,6 +148,12 @@ static int prestera_flower_parse_meta(struct prestera_acl_rule *rule,
__be16 key, mask;

flow_rule_match_meta(f_rule, &match);

if (match.mask->l2_miss) {
NL_SET_ERR_MSG_MOD(f->common.extack, "Can't match on \"l2_miss\"");
return -EOPNOTSUPP;
}

if (match.mask->ingress_ifindex != 0xFFFFFFFF) {
NL_SET_ERR_MSG_MOD(f->common.extack,
"Unsupported ingress ifindex mask");
6 changes: 6 additions & 0 deletions drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
@@ -2587,6 +2587,12 @@ static int mlx5e_flower_parse_meta(struct net_device *filter_dev,
return 0;

flow_rule_match_meta(rule, &match);

if (match.mask->l2_miss) {
NL_SET_ERR_MSG_MOD(f->common.extack, "Can't match on \"l2_miss\"");
return -EOPNOTSUPP;
}

if (!match.mask->ingress_ifindex)
return 0;

1 change: 1 addition & 0 deletions drivers/net/ethernet/mellanox/mlxsw/core_acl_flex_keys.c
@@ -42,6 +42,7 @@ static const struct mlxsw_afk_element_info mlxsw_afk_element_infos[] = {
MLXSW_AFK_ELEMENT_INFO_BUF(DST_IP_64_95, 0x34, 4),
MLXSW_AFK_ELEMENT_INFO_BUF(DST_IP_32_63, 0x38, 4),
MLXSW_AFK_ELEMENT_INFO_BUF(DST_IP_0_31, 0x3C, 4),
MLXSW_AFK_ELEMENT_INFO_U32(FDB_MISS, 0x40, 0, 1),
};

struct mlxsw_afk {
3 changes: 2 additions & 1 deletion drivers/net/ethernet/mellanox/mlxsw/core_acl_flex_keys.h
@@ -35,6 +35,7 @@ enum mlxsw_afk_element {
MLXSW_AFK_ELEMENT_IP_DSCP,
MLXSW_AFK_ELEMENT_VIRT_ROUTER_MSB,
MLXSW_AFK_ELEMENT_VIRT_ROUTER_LSB,
MLXSW_AFK_ELEMENT_FDB_MISS,
MLXSW_AFK_ELEMENT_MAX,
};

@@ -69,7 +70,7 @@ struct mlxsw_afk_element_info {
MLXSW_AFK_ELEMENT_INFO(MLXSW_AFK_ELEMENT_TYPE_BUF, \
_element, _offset, 0, _size)

-#define MLXSW_AFK_ELEMENT_STORAGE_SIZE 0x40
+#define MLXSW_AFK_ELEMENT_STORAGE_SIZE 0x44

struct mlxsw_afk_element_inst { /* element instance in actual block */
enum mlxsw_afk_element element;
2 changes: 2 additions & 0 deletions drivers/net/ethernet/mellanox/mlxsw/spectrum_acl_flex_keys.c
@@ -123,10 +123,12 @@ const struct mlxsw_afk_ops mlxsw_sp1_afk_ops = {
};

static struct mlxsw_afk_element_inst mlxsw_sp_afk_element_info_mac_0[] = {
MLXSW_AFK_ELEMENT_INST_U32(FDB_MISS, 0x00, 3, 1),
MLXSW_AFK_ELEMENT_INST_BUF(DMAC_0_31, 0x04, 4),
};

static struct mlxsw_afk_element_inst mlxsw_sp_afk_element_info_mac_1[] = {
MLXSW_AFK_ELEMENT_INST_U32(FDB_MISS, 0x00, 3, 1),
MLXSW_AFK_ELEMENT_INST_BUF(SMAC_0_31, 0x04, 4),
};

45 changes: 32 additions & 13 deletions drivers/net/ethernet/mellanox/mlxsw/spectrum_flower.c
@@ -281,49 +281,68 @@ static int mlxsw_sp_flower_parse_actions(struct mlxsw_sp *mlxsw_sp,
return 0;
}

-static int mlxsw_sp_flower_parse_meta(struct mlxsw_sp_acl_rule_info *rulei,
-struct flow_cls_offload *f,
-struct mlxsw_sp_flow_block *block)
+static int
+mlxsw_sp_flower_parse_meta_iif(struct mlxsw_sp_acl_rule_info *rulei,
+const struct mlxsw_sp_flow_block *block,
+const struct flow_match_meta *match,
+struct netlink_ext_ack *extack)
{
-struct flow_rule *rule = flow_cls_offload_flow_rule(f);
struct mlxsw_sp_port *mlxsw_sp_port;
struct net_device *ingress_dev;
-struct flow_match_meta match;

-if (!flow_rule_match_key(rule, FLOW_DISSECTOR_KEY_META))
+if (!match->mask->ingress_ifindex)
return 0;

-flow_rule_match_meta(rule, &match);
-if (match.mask->ingress_ifindex != 0xFFFFFFFF) {
-NL_SET_ERR_MSG_MOD(f->common.extack, "Unsupported ingress ifindex mask");
+if (match->mask->ingress_ifindex != 0xFFFFFFFF) {
+NL_SET_ERR_MSG_MOD(extack, "Unsupported ingress ifindex mask");
return -EINVAL;
}

ingress_dev = __dev_get_by_index(block->net,
-match.key->ingress_ifindex);
+match->key->ingress_ifindex);
if (!ingress_dev) {
-NL_SET_ERR_MSG_MOD(f->common.extack, "Can't find specified ingress port to match on");
+NL_SET_ERR_MSG_MOD(extack, "Can't find specified ingress port to match on");
return -EINVAL;
}

if (!mlxsw_sp_port_dev_check(ingress_dev)) {
-NL_SET_ERR_MSG_MOD(f->common.extack, "Can't match on non-mlxsw ingress port");
+NL_SET_ERR_MSG_MOD(extack, "Can't match on non-mlxsw ingress port");
return -EINVAL;
}

mlxsw_sp_port = netdev_priv(ingress_dev);
if (mlxsw_sp_port->mlxsw_sp != block->mlxsw_sp) {
-NL_SET_ERR_MSG_MOD(f->common.extack, "Can't match on a port from different device");
+NL_SET_ERR_MSG_MOD(extack, "Can't match on a port from different device");
return -EINVAL;
}

mlxsw_sp_acl_rulei_keymask_u32(rulei,
MLXSW_AFK_ELEMENT_SRC_SYS_PORT,
mlxsw_sp_port->local_port,
0xFFFFFFFF);
+
return 0;
}
+
+static int mlxsw_sp_flower_parse_meta(struct mlxsw_sp_acl_rule_info *rulei,
+struct flow_cls_offload *f,
+struct mlxsw_sp_flow_block *block)
+{
+struct flow_rule *rule = flow_cls_offload_flow_rule(f);
+struct flow_match_meta match;
+
+if (!flow_rule_match_key(rule, FLOW_DISSECTOR_KEY_META))
+return 0;
+
+flow_rule_match_meta(rule, &match);
+
+mlxsw_sp_acl_rulei_keymask_u32(rulei, MLXSW_AFK_ELEMENT_FDB_MISS,
+match.key->l2_miss, match.mask->l2_miss);
+
+return mlxsw_sp_flower_parse_meta_iif(rulei, block, &match,
+f->common.extack);
+}

static void mlxsw_sp_flower_parse_ipv4(struct mlxsw_sp_acl_rule_info *rulei,
struct flow_cls_offload *f)
{
10 changes: 10 additions & 0 deletions drivers/net/ethernet/mscc/ocelot_flower.c
@@ -592,6 +592,16 @@ ocelot_flower_parse_key(struct ocelot *ocelot, int port, bool ingress,
return -EOPNOTSUPP;
}

if (flow_rule_match_key(rule, FLOW_DISSECTOR_KEY_META)) {
struct flow_match_meta match;

flow_rule_match_meta(rule, &match);
if (match.mask->l2_miss) {
NL_SET_ERR_MSG_MOD(extack, "Can't match on \"l2_miss\"");
return -EOPNOTSUPP;
}
}

/* For VCAP ES0 (egress rewriter) we can match on the ingress port */
if (!ingress) {
ret = ocelot_flower_parse_indev(ocelot, port, f, filter);
1 change: 1 addition & 0 deletions include/linux/skbuff.h
@@ -330,6 +330,7 @@ struct tc_skb_ext {
u8 post_ct_snat:1;
u8 post_ct_dnat:1;
u8 act_miss:1; /* Set if act_miss_cookie is used */
u8 l2_miss:1; /* Set by bridge upon FDB or MDB miss */
};
#endif

2 changes: 2 additions & 0 deletions include/net/flow_dissector.h
@@ -243,10 +243,12 @@ struct flow_dissector_key_ip {
* struct flow_dissector_key_meta:
* @ingress_ifindex: ingress ifindex
* @ingress_iftype: ingress interface type
* @l2_miss: packet did not match an L2 entry during forwarding
*/
struct flow_dissector_key_meta {
int ingress_ifindex;
u16 ingress_iftype;
u8 l2_miss;
};

/**
2 changes: 2 additions & 0 deletions include/uapi/linux/pkt_cls.h
@@ -594,6 +594,8 @@ enum {

TCA_FLOWER_KEY_L2TPV3_SID, /* be32 */

TCA_FLOWER_L2_MISS, /* u8 */

__TCA_FLOWER_MAX,
};

1 change: 1 addition & 0 deletions net/bridge/br_device.c
@@ -39,6 +39,7 @@ netdev_tx_t br_dev_xmit(struct sk_buff *skb, struct net_device *dev)
u16 vid = 0;

memset(skb->cb, 0, sizeof(struct br_input_skb_cb));
br_tc_skb_miss_set(skb, false);

rcu_read_lock();
nf_ops = rcu_dereference(nf_br_ops);
3 changes: 3 additions & 0 deletions net/bridge/br_forward.c
@@ -203,6 +203,8 @@ void br_flood(struct net_bridge *br, struct sk_buff *skb,
struct net_bridge_port *prev = NULL;
struct net_bridge_port *p;

br_tc_skb_miss_set(skb, pkt_type != BR_PKT_BROADCAST);

list_for_each_entry_rcu(p, &br->port_list, list) {
/* Do not flood unicast traffic to ports that turn it off, nor
* other traffic if flood off, except for traffic we originate
@@ -295,6 +297,7 @@ void br_multicast_flood(struct net_bridge_mdb_entry *mdst,
allow_mode_include = false;
} else {
p = NULL;
br_tc_skb_miss_set(skb, true);
}

while (p || rp) {
1 change: 1 addition & 0 deletions net/bridge/br_input.c
@@ -334,6 +334,7 @@ static rx_handler_result_t br_handle_frame(struct sk_buff **pskb)
return RX_HANDLER_CONSUMED;

memset(skb->cb, 0, sizeof(struct br_input_skb_cb));
br_tc_skb_miss_set(skb, false);

p = br_port_get_rcu(skb->dev);
if (p->flags & BR_VLAN_TUNNEL)
27 changes: 27 additions & 0 deletions net/bridge/br_private.h
@@ -15,6 +15,7 @@
#include <linux/u64_stats_sync.h>
#include <net/route.h>
#include <net/ip6_fib.h>
#include <net/pkt_cls.h>
#include <linux/if_vlan.h>
#include <linux/rhashtable.h>
#include <linux/refcount.h>
@@ -754,6 +755,32 @@ void br_boolopt_multi_get(const struct net_bridge *br,
struct br_boolopt_multi *bm);
void br_opt_toggle(struct net_bridge *br, enum net_bridge_opts opt, bool on);

#if IS_ENABLED(CONFIG_NET_TC_SKB_EXT)
static inline void br_tc_skb_miss_set(struct sk_buff *skb, bool miss)
{
struct tc_skb_ext *ext;

if (!tc_skb_ext_tc_enabled())
return;

ext = skb_ext_find(skb, TC_SKB_EXT);
if (ext) {
ext->l2_miss = miss;
return;
}
if (!miss)
return;
ext = tc_skb_ext_alloc(skb);
if (!ext)
return;
ext->l2_miss = true;
}
#else
static inline void br_tc_skb_miss_set(struct sk_buff *skb, bool miss)
{
}
#endif

/* br_device.c */
void br_dev_setup(struct net_device *dev);
void br_dev_delete(struct net_device *dev, struct list_head *list);
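
The control flow of br_tc_skb_miss_set() above can be exercised outside
the kernel with a small model. This is a sketch under stated
assumptions: 'struct ext' and 'struct pkt' are hypothetical stand-ins
for 'struct tc_skb_ext' and 'struct sk_buff', calloc() stands in for
tc_skb_ext_alloc(), and the static key is passed in as a plain boolean.

```c
#include <stdbool.h>
#include <stdlib.h>

struct ext { bool l2_miss; };    /* stand-in for struct tc_skb_ext */
struct pkt { struct ext *ext; }; /* stand-in for struct sk_buff */

void miss_set(struct pkt *p, bool key_enabled, bool miss)
{
	/* Nothing is touched while no filter uses the extension. */
	if (!key_enabled)
		return;

	/* An existing extension is always updated, so a stale 'true'
	 * can be cleared.
	 */
	if (p->ext) {
		p->ext->l2_miss = miss;
		return;
	}

	/* No extension and no miss: absence already means 'false'. */
	if (!miss)
		return;

	/* Allocate lazily, only when a miss has to be recorded. */
	p->ext = calloc(1, sizeof(*p->ext));
	if (!p->ext)
		return;
	p->ext->l2_miss = true;
}
```

The point of the ordering is that clearing the bit never allocates, so
packets that hit in the FDB or MDB pay only for the lookup of an
extension that is usually absent.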
10 changes: 10 additions & 0 deletions net/core/flow_dissector.c
@@ -27,6 +27,7 @@
#include <linux/tcp.h>
#include <linux/ptp_classify.h>
#include <net/flow_dissector.h>
#include <net/pkt_cls.h>
#include <scsi/fc/fc_fcoe.h>
#include <uapi/linux/batadv_packet.h>
#include <linux/bpf.h>
@@ -241,6 +242,15 @@ void skb_flow_dissect_meta(const struct sk_buff *skb,
FLOW_DISSECTOR_KEY_META,
target_container);
meta->ingress_ifindex = skb->skb_iif;
#if IS_ENABLED(CONFIG_NET_TC_SKB_EXT)
if (tc_skb_ext_tc_enabled()) {
struct tc_skb_ext *ext;

ext = skb_ext_find(skb, TC_SKB_EXT);
if (ext)
meta->l2_miss = ext->l2_miss;
}
#endif
}
EXPORT_SYMBOL(skb_flow_dissect_meta);

30 changes: 28 additions & 2 deletions net/sched/cls_flower.c
@@ -120,6 +120,7 @@ struct cls_fl_filter {
u32 handle;
u32 flags;
u32 in_hw_count;
u8 needs_tc_skb_ext:1;
struct rcu_work rwork;
struct net_device *hw_dev;
/* Flower classifier is unlocked, which means that its reference counter
@@ -415,6 +416,8 @@ static struct cls_fl_head *fl_head_dereference(struct tcf_proto *tp)

static void __fl_destroy_filter(struct cls_fl_filter *f)
{
if (f->needs_tc_skb_ext)
tc_skb_ext_tc_disable();
tcf_exts_destroy(&f->exts);
tcf_exts_put_net(&f->exts);
kfree(f);
@@ -615,7 +618,8 @@ static void *fl_get(struct tcf_proto *tp, u32 handle)
}

static const struct nla_policy fl_policy[TCA_FLOWER_MAX + 1] = {
-[TCA_FLOWER_UNSPEC] = { .type = NLA_UNSPEC },
+[TCA_FLOWER_UNSPEC] = { .strict_start_type =
+TCA_FLOWER_L2_MISS },
[TCA_FLOWER_CLASSID] = { .type = NLA_U32 },
[TCA_FLOWER_INDEV] = { .type = NLA_STRING,
.len = IFNAMSIZ },
@@ -720,7 +724,7 @@ static const struct nla_policy fl_policy[TCA_FLOWER_MAX + 1] = {
[TCA_FLOWER_KEY_PPPOE_SID] = { .type = NLA_U16 },
[TCA_FLOWER_KEY_PPP_PROTO] = { .type = NLA_U16 },
[TCA_FLOWER_KEY_L2TPV3_SID] = { .type = NLA_U32 },
-
+[TCA_FLOWER_L2_MISS] = NLA_POLICY_MAX(NLA_U8, 1),
};

static const struct nla_policy
@@ -1668,6 +1672,10 @@ static int fl_set_key(struct net *net, struct nlattr **tb,
mask->meta.ingress_ifindex = 0xffffffff;
}

fl_set_key_val(tb, &key->meta.l2_miss, TCA_FLOWER_L2_MISS,
&mask->meta.l2_miss, TCA_FLOWER_UNSPEC,
sizeof(key->meta.l2_miss));

fl_set_key_val(tb, key->eth.dst, TCA_FLOWER_KEY_ETH_DST,
mask->eth.dst, TCA_FLOWER_KEY_ETH_DST_MASK,
sizeof(key->eth.dst));
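
How the new attribute turns into a key/mask pair can be sketched as
follows. This is a user-space model, not the flower code: the struct
and both functions are hypothetical, a NULL pointer stands for an
absent TCA_FLOWER_L2_MISS attribute, and the range check imitates the
NLA_POLICY_MAX(NLA_U8, 1) policy. The all-ones mask reflects the
assumption that, with TCA_FLOWER_UNSPEC passed as the mask attribute,
fl_set_key_val() performs an exact match.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical key/mask pair for the l2_miss metadata. */
struct meta_match {
	uint8_t key;
	uint8_t mask;
};

/* Returns 0 on success, -1 if the attribute value is out of range. */
int l2_miss_parse(const uint8_t *attr, struct meta_match *m)
{
	m->key = 0;
	m->mask = 0;           /* zero mask: do not match on l2_miss */
	if (!attr)
		return 0;
	if (*attr > 1)         /* policy allows only 0 or 1 */
		return -1;
	m->key = *attr;
	m->mask = 0xff;        /* no mask attribute: exact match */
	return 0;
}

/* Masked comparison of the packet's bit against the filter. */
bool l2_miss_matches(const struct meta_match *m, uint8_t pkt_l2_miss)
{
	return (pkt_l2_miss & m->mask) == (m->key & m->mask);
}
```

A filter without the attribute therefore matches packets regardless of
the bit, while 'l2_miss true' and 'l2_miss false' each select exactly
one value.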
@@ -2085,6 +2093,11 @@ static int fl_check_assign_mask(struct cls_fl_head *head,
return ret;
}

static bool fl_needs_tc_skb_ext(const struct fl_flow_key *mask)
{
return mask->meta.l2_miss;
}

static int fl_set_parms(struct net *net, struct tcf_proto *tp,
struct cls_fl_filter *f, struct fl_flow_mask *mask,
unsigned long base, struct nlattr **tb,
@@ -2121,6 +2134,14 @@ static int fl_set_parms(struct net *net, struct tcf_proto *tp,
return -EINVAL;
}

/* Enable tc skb extension if filter matches on data extracted from
* this extension.
*/
if (fl_needs_tc_skb_ext(&mask->key)) {
f->needs_tc_skb_ext = 1;
tc_skb_ext_tc_enable();
}

return 0;
}

@@ -3074,6 +3095,11 @@ static int fl_dump_key(struct sk_buff *skb, struct net *net,
goto nla_put_failure;
}

if (fl_dump_key_val(skb, &key->meta.l2_miss,
TCA_FLOWER_L2_MISS, &mask->meta.l2_miss,
TCA_FLOWER_UNSPEC, sizeof(key->meta.l2_miss)))
goto nla_put_failure;

if (fl_dump_key_val(skb, key->eth.dst, TCA_FLOWER_KEY_ETH_DST,
mask->eth.dst, TCA_FLOWER_KEY_ETH_DST_MASK,
sizeof(key->eth.dst)) ||
1 change: 1 addition & 0 deletions tools/testing/selftests/net/forwarding/Makefile
@@ -83,6 +83,7 @@ TEST_PROGS = bridge_igmp.sh \
tc_chains.sh \
tc_flower_router.sh \
tc_flower.sh \
tc_flower_l2_miss.sh \
tc_mpls_l2vpn.sh \
tc_police.sh \
tc_shblocks.sh \