Merge branch 'dsa-changes-for-multiple-cpu-ports-part-4'
Vladimir Oltean says:

====================
DSA changes for multiple CPU ports (part 4)

Those who have been following part 1:
https://patchwork.kernel.org/project/netdevbpf/cover/20220511095020.562461-1-vladimir.oltean@nxp.com/
part 2:
https://patchwork.kernel.org/project/netdevbpf/cover/20220521213743.2735445-1-vladimir.oltean@nxp.com/
and part 3:
https://patchwork.kernel.org/project/netdevbpf/cover/20220819174820.3585002-1-vladimir.oltean@nxp.com/
will know that I am trying to enable the second internal port pair from
the NXP LS1028A Felix switch for DSA-tagged traffic via "ocelot-8021q".

This series represents the final part of that effort. We have:

- the introduction of new UAPI in the form of IFLA_DSA_MASTER, the
  iproute2 patch for which is here:
  https://patchwork.kernel.org/project/netdevbpf/patch/20220904190025.813574-1-vladimir.oltean@nxp.com/

- preparation for LAG DSA masters in terms of suppressing some
  operations for masters in the DSA core that simply don't make sense
  when those masters are a bonding/team interface

- handling all the net device events that occur between DSA and a
  LAG DSA master, including migration to a different DSA master when the
  current master joins a LAG, or the LAG gets destroyed

- updating documentation

- adding an implementation for NXP LS1028A, where things are insanely
  complicated due to hardware limitations. We have 2 tagging protocols:

  * the native "ocelot" protocol (NPI port mode). This does not support
    CPU ports in a LAG, and supports a single DSA master. The DSA master
    can be changed between eno2 (2.5G) and eno3 (1G), but all ports must
    be down during the changing process, and user ports assigned to the
    old DSA master will refuse to come up if the user requests that
    during a "transient" state.

  * the "ocelot-8021q" software-defined protocol, where the Ethernet
    ports connected to the CPU are not actually "god mode" ports as far
    as the hardware is concerned. So here, static assignment between
    user and CPU ports is possible by editing the PGID_SRC masks for
    the port-based forwarding matrix, and "CPU ports in a LAG" simply
    means "a LAG like any other".

The series was regression-tested on LS1028A using the local_termination.sh
kselftest, in most of the possible operating modes and tagging protocols.
I have not done a detailed performance evaluation yet, but using LAG, it is
possible to exceed the termination bandwidth of a single CPU port in an
iperf3 test with multiple senders and multiple receivers.

v1 at:
https://patchwork.kernel.org/project/netdevbpf/cover/20220830195932.683432-1-vladimir.oltean@nxp.com/

Previous (older) RFC at:
https://lore.kernel.org/netdev/20220523104256.3556016-1-olteanv@gmail.com/
====================

Link: https://lore.kernel.org/r/20220911010706.2137967-1-vladimir.oltean@nxp.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Paolo Abeni committed Sep 20, 2022
2 parents 42e53b4 + eca7010 commit e8b9f0d
Showing 27 changed files with 1,064 additions and 87 deletions.
96 changes: 96 additions & 0 deletions Documentation/networking/dsa/configuration.rst
@@ -49,6 +49,9 @@ In this documentation the following Ethernet interfaces are used:
*eth0*
the master interface

*eth1*
another master interface

*lan1*
a slave interface

@@ -360,3 +363,96 @@ the ``self`` flag) has been removed. This results in the following changes:
Script writers are therefore encouraged to use the ``master static`` set of
flags when working with bridge FDB entries on DSA switch interfaces.

Affinity of user ports to CPU ports
-----------------------------------

Typically, DSA switches are attached to the host via a single Ethernet
interface, but in cases where the switch chip is discrete, the hardware design
may permit the use of 2 or more ports connected to the host, for an increase in
termination throughput.

DSA can make use of multiple CPU ports in two ways. First, it is possible to
statically assign the termination traffic associated with a certain user port
to be processed by a certain CPU port. This way, user space can implement
custom policies of static load balancing between user ports, by spreading the
affinities according to the available CPU ports.

Secondly, it is possible to perform load balancing between CPU ports on a per
packet basis, rather than statically assigning user ports to CPU ports.
This can be achieved by placing the DSA masters under a LAG interface (bonding
or team). DSA monitors this operation and creates a mirror of this software LAG
on the CPU ports facing the physical DSA masters that constitute the LAG slave
devices.

To make use of multiple CPU ports, the firmware (device tree) description of
the switch must mark all the links between CPU ports and their DSA masters
using the ``ethernet`` reference/phandle. At startup, only a single CPU port
and DSA master will be used - the numerically first port from the firmware
description which has an ``ethernet`` property. It is up to the user to
configure the system for the switch to use other masters.

DSA uses the ``rtnl_link_ops`` mechanism (with a "dsa" ``kind``) to allow
changing the DSA master of a user port. The ``IFLA_DSA_MASTER`` u32 netlink
attribute contains the ifindex of the master device that handles each slave
device. The DSA master must be a valid candidate based on firmware node
information, or a LAG interface which contains only slaves which are valid
candidates.
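
As a rough illustration of the attribute layout described above, the following
user-space C sketch builds (but does not send) an ``RTM_NEWLINK`` request that
nests ``IFLA_DSA_MASTER`` inside ``IFLA_LINKINFO``/``IFLA_INFO_DATA`` for the
"dsa" kind. The helper names ``example_add_attr()`` and
``example_build_set_master()`` are hypothetical, and the numeric value of
``EXAMPLE_IFLA_DSA_MASTER`` is an assumption standing in for the enum value
exported by sufficiently new kernel headers; real tools should use the uapi
definition instead.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <linux/netlink.h>
#include <linux/rtnetlink.h>

/* Assumption: IFLA_DSA_MASTER directly follows IFLA_DSA_UNSPEC in the uapi
 * enum. Defined locally so the sketch also builds against older headers.
 */
#define EXAMPLE_IFLA_DSA_MASTER 1

/* Append one attribute at the end of the message and return it */
static struct rtattr *example_add_attr(struct nlmsghdr *nlh, size_t maxlen,
				       unsigned short type, const void *data,
				       size_t len)
{
	size_t off = NLMSG_ALIGN(nlh->nlmsg_len);
	struct rtattr *rta;

	/* The caller must provide a large enough buffer */
	assert(off + RTA_ALIGN(RTA_LENGTH(len)) <= maxlen);

	rta = (struct rtattr *)((char *)nlh + off);
	rta->rta_type = type;
	rta->rta_len = (unsigned short)RTA_LENGTH(len);
	if (len)
		memcpy(RTA_DATA(rta), data, len);
	nlh->nlmsg_len = (uint32_t)(off + RTA_ALIGN(rta->rta_len));

	return rta;
}

/* Fill buf with a request to change the DSA master of one user port.
 * Returns the total length of the netlink message.
 */
static size_t example_build_set_master(void *buf, size_t maxlen,
				       int user_ifindex,
				       uint32_t master_ifindex)
{
	struct nlmsghdr *nlh = buf;
	struct rtattr *linkinfo, *info_data;
	struct ifinfomsg *ifi;
	char *end;

	assert(maxlen >= NLMSG_LENGTH(sizeof(*ifi)));
	memset(buf, 0, maxlen);

	nlh->nlmsg_len = NLMSG_LENGTH(sizeof(*ifi));
	nlh->nlmsg_type = RTM_NEWLINK;
	nlh->nlmsg_flags = NLM_F_REQUEST | NLM_F_ACK;

	ifi = NLMSG_DATA(nlh);
	ifi->ifi_family = AF_UNSPEC;
	ifi->ifi_index = user_ifindex;	/* the DSA user port, e.g. swp0 */

	/* IFLA_LINKINFO { IFLA_INFO_KIND "dsa", IFLA_INFO_DATA { master } } */
	linkinfo = example_add_attr(nlh, maxlen, IFLA_LINKINFO, NULL, 0);
	example_add_attr(nlh, maxlen, IFLA_INFO_KIND, "dsa", strlen("dsa"));
	info_data = example_add_attr(nlh, maxlen, IFLA_INFO_DATA, NULL, 0);
	example_add_attr(nlh, maxlen, EXAMPLE_IFLA_DSA_MASTER,
			 &master_ifindex, sizeof(master_ifindex));

	/* Close both nests now that their payload length is known */
	end = (char *)nlh + nlh->nlmsg_len;
	info_data->rta_len = (unsigned short)(end - (char *)info_data);
	linkinfo->rta_len = (unsigned short)(end - (char *)linkinfo);

	return nlh->nlmsg_len;
}
```

Sending the resulting buffer over an ``AF_NETLINK``/``NETLINK_ROUTE`` socket
would correspond to ``ip link set swp0 type dsa master eth1``, with both
ifindex values resolved beforehand, for example via ``if_nametoindex()``.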

Using iproute2, the following manipulations are possible:

.. code-block:: sh

  # See the DSA master in current use
  ip -d link show dev swp0
      (...)
      dsa master eth0

  # Static CPU port distribution
  ip link set swp0 type dsa master eth1
  ip link set swp1 type dsa master eth0
  ip link set swp2 type dsa master eth1
  ip link set swp3 type dsa master eth0

  # CPU ports in LAG, using explicit assignment of the DSA master
  ip link add bond0 type bond mode balance-xor && ip link set bond0 up
  ip link set eth1 down && ip link set eth1 master bond0
  ip link set swp0 type dsa master bond0
  ip link set swp1 type dsa master bond0
  ip link set swp2 type dsa master bond0
  ip link set swp3 type dsa master bond0
  ip link set eth0 down && ip link set eth0 master bond0
  ip -d link show dev swp0
      (...)
      dsa master bond0

  # CPU ports in LAG, relying on implicit migration of the DSA master
  ip link add bond0 type bond mode balance-xor && ip link set bond0 up
  ip link set eth0 down && ip link set eth0 master bond0
  ip link set eth1 down && ip link set eth1 master bond0
  ip -d link show dev swp0
      (...)
      dsa master bond0

Notice that in the case of CPU ports under a LAG, the use of the
``IFLA_DSA_MASTER`` netlink attribute is not strictly needed, but rather, DSA
reacts to the ``IFLA_MASTER`` attribute change of its present master (``eth0``)
and migrates all user ports to the new upper of ``eth0``, ``bond0``. Similarly,
when ``bond0`` is destroyed using ``RTM_DELLINK``, DSA migrates the user ports
that were assigned to this interface to the first physical DSA master which is
eligible, based on the firmware description (it effectively reverts to the
startup configuration).

In a setup with more than 2 physical CPU ports, it is therefore possible to mix
static user to CPU port assignment with LAG between DSA masters. It is not
possible to statically assign a user port towards a DSA master that has any
upper interfaces (this includes LAG devices - the master must always be the LAG
in this case).

Live changing of the DSA master (and thus CPU port) affinity of a user port is
permitted, in order to allow dynamic redistribution in response to traffic.

Physical DSA masters are allowed to join and leave at any time a LAG interface
used as a DSA master; however, DSA will reject a LAG interface as a valid
candidate for being a DSA master unless it has at least one physical DSA master
as a slave device.
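
The candidate rules above can be condensed into a small user-space model. This
is only a sketch of the stated policy, not the kernel's actual validation
code: ``struct example_netdev`` and ``example_master_is_valid()`` are
hypothetical names, and the ``fw_candidate`` flag stands in for whatever
firmware node check the kernel performs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a network interface; not the kernel net_device */
struct example_netdev {
	const char *name;
	bool is_lag;			/* bonding or team interface */
	bool fw_candidate;		/* physical master named in firmware */
	struct example_netdev *upper;	/* e.g. the LAG this device joined */
	struct example_netdev **lowers;	/* slave devices, if this is a LAG */
	size_t num_lowers;
};

/* Can a user port be assigned to this interface as its DSA master? */
static bool example_master_is_valid(const struct example_netdev *master)
{
	size_t i;

	assert(master != NULL);

	if (!master->is_lag) {
		/* A physical master must be described in firmware and must
		 * not have any upper interface (the upper LAG must be used).
		 */
		return master->fw_candidate && master->upper == NULL;
	}

	/* A LAG needs at least one physical DSA master as a slave... */
	if (master->num_lowers == 0)
		return false;

	/* ...and may contain only slaves which are valid candidates */
	for (i = 0; i < master->num_lowers; i++)
		if (!master->lowers[i]->fw_candidate)
			return false;

	return true;
}
```

In this model, ``eth0`` stops being a valid direct target as soon as it is
enslaved to ``bond0``, while ``bond0`` becomes valid once it holds at least one
eligible slave.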
38 changes: 32 additions & 6 deletions Documentation/networking/dsa/dsa.rst
@@ -303,6 +303,20 @@ These frames are then queued for transmission using the master network device
Ethernet switch will be able to process these incoming frames from the
management interface and deliver them to the physical switch port.

When using multiple CPU ports, it is possible to stack a LAG (bonding/team)
device between the DSA slave devices and the physical DSA masters. The LAG
device is thus also a DSA master, but the LAG slave devices continue to be DSA
masters as well (just with no user port assigned to them; this is needed for
recovery in case the LAG DSA master disappears). Thus, the data path of the LAG
DSA master is used asymmetrically. On RX, the ``ETH_P_XDSA`` handler, which
calls ``dsa_switch_rcv()``, is invoked early (on the physical DSA master;
LAG slave). Therefore, the RX data path of the LAG DSA master is not used.
On the other hand, TX takes place linearly: ``dsa_slave_xmit`` calls
``dsa_enqueue_skb``, which calls ``dev_queue_xmit`` towards the LAG DSA master.
The latter calls ``dev_queue_xmit`` towards one physical DSA master or the
other, and in both cases, the packet exits the system through a hardware path
towards the switch.

Graphical representation
------------------------

@@ -629,6 +643,24 @@ Switch configuration
PHY cannot be found. In this case, probing of the DSA switch continues
without that particular port.

- ``port_change_master``: method through which the affinity (association used
  for traffic termination purposes) between a user port and a CPU port can be
  changed. By default all user ports from a tree are assigned to the first
  available CPU port that makes sense for them (most of the time this means
  the user ports of a tree are all assigned to the same CPU port, except for H
  topologies as described in commit 2c0b03258b8b). The ``port`` argument
  represents the index of the user port, and the ``master`` argument represents
  the new DSA master ``net_device``. The CPU port associated with the new
  master can be retrieved by looking at ``struct dsa_port *cpu_dp =
  master->dsa_ptr``. Additionally, the master can also be a LAG device where
  all the slave devices are physical DSA masters. LAG DSA masters also have a
  valid ``master->dsa_ptr`` pointer, however this is not unique, but rather a
  duplicate of the first physical DSA master's (LAG slave) ``dsa_ptr``. In case
  of a LAG DSA master, a further call to ``port_lag_join`` will be emitted
  separately for the physical CPU ports associated with the physical DSA
  masters, requesting them to create a hardware LAG associated with the LAG
  interface.
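
To make the ``master->dsa_ptr`` lookup concrete, here is a deliberately
simplified user-space model of the bookkeeping a ``port_change_master``
implementation has to do. None of these types are the kernel's:
``struct example_cpu_port``, ``struct example_master``,
``struct example_switch`` and ``example_port_change_master()`` are hypothetical
stand-ins, and a real driver would also reprogram the hardware forwarding
rules.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define EXAMPLE_MAX_PORTS 8

/* Hypothetical stand-in for the CPU port behind a DSA master */
struct example_cpu_port {
	int index;
};

/* Hypothetical stand-in for a DSA master (physical or LAG) */
struct example_master {
	const char *name;
	struct example_cpu_port *dsa_ptr;
};

struct example_switch {
	/* cpu_dp[port]: CPU port terminating traffic of that user port */
	struct example_cpu_port *cpu_dp[EXAMPLE_MAX_PORTS];
};

/* Record the new affinity of a user port, derived from master->dsa_ptr */
static int example_port_change_master(struct example_switch *sw, int port,
				      struct example_master *master)
{
	assert(sw != NULL && master != NULL);

	if (port < 0 || port >= EXAMPLE_MAX_PORTS)
		return -EINVAL;

	/* An interface without a dsa_ptr is not a DSA master at all */
	if (master->dsa_ptr == NULL)
		return -EINVAL;

	sw->cpu_dp[port] = master->dsa_ptr;

	return 0;
}
```

A LAG master can be modeled by letting its ``dsa_ptr`` alias the one of its
first physical slave, which mirrors the "not unique" remark above.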

PHY devices and link management
-------------------------------

@@ -1095,9 +1127,3 @@ capable hardware, but does not enforce a strict switch device driver model. On
the other hand, DSA enforces a fairly strict device driver model, and deals with
most of the switch specifics. At some point we should envision a merger between
these two subsystems and get the best of both worlds.

-Other hanging fruits
---------------------
-
-- allowing more than one CPU/management interface:
-  http://comments.gmane.org/gmane.linux.network/365657
4 changes: 2 additions & 2 deletions drivers/net/dsa/bcm_sf2.c
@@ -983,7 +983,7 @@ static int bcm_sf2_sw_resume(struct dsa_switch *ds)
 static void bcm_sf2_sw_get_wol(struct dsa_switch *ds, int port,
 			       struct ethtool_wolinfo *wol)
 {
-	struct net_device *p = dsa_to_port(ds, port)->cpu_dp->master;
+	struct net_device *p = dsa_port_to_master(dsa_to_port(ds, port));
 	struct bcm_sf2_priv *priv = bcm_sf2_to_priv(ds);
 	struct ethtool_wolinfo pwol = { };

@@ -1007,7 +1007,7 @@ static void bcm_sf2_sw_get_wol(struct dsa_switch *ds, int port,
 static int bcm_sf2_sw_set_wol(struct dsa_switch *ds, int port,
 			      struct ethtool_wolinfo *wol)
 {
-	struct net_device *p = dsa_to_port(ds, port)->cpu_dp->master;
+	struct net_device *p = dsa_port_to_master(dsa_to_port(ds, port));
 	struct bcm_sf2_priv *priv = bcm_sf2_to_priv(ds);
 	s8 cpu_port = dsa_to_port(ds, port)->cpu_dp->index;
 	struct ethtool_wolinfo pwol = { };
4 changes: 2 additions & 2 deletions drivers/net/dsa/bcm_sf2_cfp.c
@@ -1102,7 +1102,7 @@ static int bcm_sf2_cfp_rule_get_all(struct bcm_sf2_priv *priv,
 int bcm_sf2_get_rxnfc(struct dsa_switch *ds, int port,
 		      struct ethtool_rxnfc *nfc, u32 *rule_locs)
 {
-	struct net_device *p = dsa_to_port(ds, port)->cpu_dp->master;
+	struct net_device *p = dsa_port_to_master(dsa_to_port(ds, port));
 	struct bcm_sf2_priv *priv = bcm_sf2_to_priv(ds);
 	int ret = 0;

@@ -1145,7 +1145,7 @@ int bcm_sf2_get_rxnfc(struct dsa_switch *ds, int port,
 int bcm_sf2_set_rxnfc(struct dsa_switch *ds, int port,
 		      struct ethtool_rxnfc *nfc)
 {
-	struct net_device *p = dsa_to_port(ds, port)->cpu_dp->master;
+	struct net_device *p = dsa_port_to_master(dsa_to_port(ds, port));
 	struct bcm_sf2_priv *priv = bcm_sf2_to_priv(ds);
 	int ret = 0;

4 changes: 2 additions & 2 deletions drivers/net/dsa/lan9303-core.c
@@ -1092,7 +1092,7 @@ static int lan9303_port_enable(struct dsa_switch *ds, int port,
 	if (!dsa_port_is_user(dp))
 		return 0;

-	vlan_vid_add(dp->cpu_dp->master, htons(ETH_P_8021Q), port);
+	vlan_vid_add(dsa_port_to_master(dp), htons(ETH_P_8021Q), port);

 	return lan9303_enable_processing_port(chip, port);
 }
@@ -1105,7 +1105,7 @@ static void lan9303_port_disable(struct dsa_switch *ds, int port)
 	if (!dsa_port_is_user(dp))
 		return;

-	vlan_vid_del(dp->cpu_dp->master, htons(ETH_P_8021Q), port);
+	vlan_vid_del(dsa_port_to_master(dp), htons(ETH_P_8021Q), port);

 	lan9303_disable_processing_port(chip, port);
 	lan9303_phy_write(ds, chip->phy_addr_base + port, MII_BMCR, BMCR_PDOWN);
27 changes: 19 additions & 8 deletions drivers/net/dsa/mv88e6xxx/chip.c
@@ -6593,14 +6593,17 @@ static int mv88e6xxx_port_bridge_flags(struct dsa_switch *ds, int port,

 static bool mv88e6xxx_lag_can_offload(struct dsa_switch *ds,
 				      struct dsa_lag lag,
-				      struct netdev_lag_upper_info *info)
+				      struct netdev_lag_upper_info *info,
+				      struct netlink_ext_ack *extack)
 {
 	struct mv88e6xxx_chip *chip = ds->priv;
 	struct dsa_port *dp;
 	int members = 0;

-	if (!mv88e6xxx_has_lag(chip))
+	if (!mv88e6xxx_has_lag(chip)) {
+		NL_SET_ERR_MSG_MOD(extack, "Chip does not support LAG offload");
 		return false;
+	}

 	if (!lag.id)
 		return false;
@@ -6609,14 +6612,20 @@ static bool mv88e6xxx_lag_can_offload(struct dsa_switch *ds,
 		/* Includes the port joining the LAG */
 		members++;

-	if (members > 8)
+	if (members > 8) {
+		NL_SET_ERR_MSG_MOD(extack,
+				   "Cannot offload more than 8 LAG ports");
 		return false;
+	}

 	/* We could potentially relax this to include active
 	 * backup in the future.
 	 */
-	if (info->tx_type != NETDEV_LAG_TX_TYPE_HASH)
+	if (info->tx_type != NETDEV_LAG_TX_TYPE_HASH) {
+		NL_SET_ERR_MSG_MOD(extack,
+				   "Can only offload LAG using hash TX type");
 		return false;
+	}

 	/* Ideally we would also validate that the hash type matches
 	 * the hardware. Alas, this is always set to unknown on team
@@ -6769,12 +6778,13 @@ static int mv88e6xxx_port_lag_change(struct dsa_switch *ds, int port)

 static int mv88e6xxx_port_lag_join(struct dsa_switch *ds, int port,
 				   struct dsa_lag lag,
-				   struct netdev_lag_upper_info *info)
+				   struct netdev_lag_upper_info *info,
+				   struct netlink_ext_ack *extack)
 {
 	struct mv88e6xxx_chip *chip = ds->priv;
 	int err, id;

-	if (!mv88e6xxx_lag_can_offload(ds, lag, info))
+	if (!mv88e6xxx_lag_can_offload(ds, lag, info, extack))
 		return -EOPNOTSUPP;

 	/* DSA LAG IDs are one-based */
@@ -6827,12 +6837,13 @@ static int mv88e6xxx_crosschip_lag_change(struct dsa_switch *ds, int sw_index,

 static int mv88e6xxx_crosschip_lag_join(struct dsa_switch *ds, int sw_index,
 					int port, struct dsa_lag lag,
-					struct netdev_lag_upper_info *info)
+					struct netdev_lag_upper_info *info,
+					struct netlink_ext_ack *extack)
 {
 	struct mv88e6xxx_chip *chip = ds->priv;
 	int err;

-	if (!mv88e6xxx_lag_can_offload(ds, lag, info))
+	if (!mv88e6xxx_lag_can_offload(ds, lag, info, extack))
 		return -EOPNOTSUPP;

 	mv88e6xxx_reg_lock(chip);