Merge branch 'xsk-tx-metadata-launch-time-support'
Song Yoong Siang says:

====================
xsk: TX metadata Launch Time support

This series expands the XDP TX metadata framework to allow user
applications to pass per packet 64-bit launch time directly to the kernel
driver, requesting launch time hardware offload support. The XDP TX
metadata framework will not perform any clock conversion or packet
reordering.
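
As a minimal sketch of what "passing the launch time" could look like from user space, the helper below sets the launch time flag and the 64-bit value in a metadata structure. `struct tx_md_sketch`, `TXMD_LAUNCH_TIME_SKETCH` and `request_launch_time()` are hypothetical names used only for illustration; the layout and the bit value are assumptions standing in for `struct xsk_tx_metadata` and `XDP_TXMD_FLAGS_LAUNCH_TIME`, whose authoritative definitions are in include/uapi/linux/if_xdp.h.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical local mirror of the TX metadata layout, for illustration
 * only. The real definition lives in include/uapi/linux/if_xdp.h.
 */
struct tx_md_sketch {
	uint64_t flags;
	union {
		struct {
			uint16_t csum_start;
			uint16_t csum_offset;
			uint64_t launch_time;	/* nanoseconds, PHC time base */
		} request;
		struct {
			uint64_t tx_timestamp;
		} completion;
	};
};

/* Assumed bit value, standing in for XDP_TXMD_FLAGS_LAUNCH_TIME. */
#define TXMD_LAUNCH_TIME_SKETCH (1ULL << 2)

/* Clear the metadata area and request a launch time for this packet. */
static void request_launch_time(struct tx_md_sketch *md,
				uint64_t launch_time_ns)
{
	memset(md, 0, sizeof(*md));
	md->flags |= TXMD_LAUNCH_TIME_SKETCH;
	md->request.launch_time = launch_time_ns;
}
```

The value is handed to the driver as-is, so the caller is responsible for expressing it in the device's time base.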

Please note that the role of Tx metadata is just to pass the launch time,
not to enable the offload feature. Users will need to enable the launch
time hardware offload feature of the device by using the respective
command, such as the tc-etf command.

Although some devices use the tc-etf command to enable their launch time
hardware offload feature, xsk packets will not go through the etf qdisc.
Therefore, in my opinion, the launch time should always be based on the PTP
Hardware Clock (PHC). Thus, I did not include a clock ID to indicate the
clock source.

To simplify the test steps, I modified the xdp_hw_metadata bpf self-test
tool in such a way that it will set the launch time based on the offset
provided by the user and the value of the Receive Hardware Timestamp, which
is against the PHC. This will eliminate the need to discipline System Clock
with the PHC and then use clock_gettime() to get the time.
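
The computation described above can be sketched as follows. `launch_time_from_rx()` and `DEFAULT_LAUNCH_OFFSET_NS` are hypothetical names, not taken from the self-test; the 0.1 s value only mirrors the default mentioned in the V3 changelog below.

```c
#include <stdint.h>

/* Hypothetical default offset of 0.1 s, expressed in nanoseconds. */
#define DEFAULT_LAUNCH_OFFSET_NS 100000000ULL

/* The RX hardware timestamp is already in the PHC time base, so adding a
 * user-provided offset yields a launch time in the same time base without
 * consulting the system clock.
 */
static uint64_t launch_time_from_rx(uint64_t rx_hw_timestamp_ns,
				    uint64_t offset_ns)
{
	return rx_hw_timestamp_ns + offset_ns;
}
```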

Please note that AF_XDP lacks a feedback mechanism to inform the
application if the requested launch time is invalid. So, users are expected
to be familiar with the horizon of the launch time of the device they use and
not request a launch time that is beyond the horizon. Otherwise, the driver
might interpret the launch time incorrectly and react wrongly. For stmmac
and igc, where modulo computation is used, a launch time larger than the
horizon will cause the device to transmit the packet earlier than the
requested launch time.

Although there is no feedback mechanism for the launch time request
for now, users can still check whether the requested launch time is
working or not, by requesting the Transmit Completion Hardware Timestamp.
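
One way an application might perform that check is sketched below: compare the Transmit Completion Hardware Timestamp against the requested launch time. Both helpers and the `tolerance_ns` parameter are hypothetical; a suitable tolerance depends on the device and is not specified by this series.

```c
#include <stdbool.h>
#include <stdint.h>

/* Signed difference between the completion timestamp and the requested
 * launch time, both in nanoseconds in the PHC time base. A negative
 * result means the packet left before its launch time.
 */
static int64_t launch_error_ns(uint64_t tx_timestamp_ns,
			       uint64_t launch_time_ns)
{
	return (int64_t)(tx_timestamp_ns - launch_time_ns);
}

/* Treat the request as honoured when the packet left at or after the
 * launch time and no later than tolerance_ns afterwards.
 */
static bool launch_time_honoured(uint64_t tx_timestamp_ns,
				 uint64_t launch_time_ns,
				 uint64_t tolerance_ns)
{
	int64_t err = launch_error_ns(tx_timestamp_ns, launch_time_ns);

	return err >= 0 && (uint64_t)err <= tolerance_ns;
}
```

A packet whose completion timestamp is earlier than the requested launch time is a hint that the request exceeded the horizon or that the offload is not enabled.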

v12:
  - Fix the comment in include/uapi/linux/if_xdp.h to align with what is
    generated by ./tools/net/ynl/ynl-regen.sh to avoid dirty tree error in
    the netdev/ynl checks.

v11: https://lore.kernel.org/netdev/20250216074302.956937-1-yoong.siang.song@intel.com/
  - regenerate netdev_xsk_flags based on latest netdev.yaml (Jakub)

v10: https://lore.kernel.org/netdev/20250207021943.814768-1-yoong.siang.song@intel.com/
  - use net_err_ratelimited(), instead of net_ratelimit() (Maciej)
  - accumulate the amount of used descs in local variable and update the
    igc_metadata_request::used_desc once (Maciej)
  - Ensure reverse christmas tree rule (Maciej)

V9: https://lore.kernel.org/netdev/20250206060408.808325-1-yoong.siang.song@intel.com/
  - Remove the igc_desc_unused() checking (Maciej)
  - Ensure that skb allocation and DMA mapping work before proceeding to
    fill in igc_tx_buffer info, context desc, and data desc (Maciej)
  - Rate limit the error messages (Maciej)
  - Update the comment to indicate that the 2 descriptors needed by the
    empty frame are already taken into consideration (Maciej)
  - Handle the case where the insertion of an empty frame fails and
    explain the reason behind (Maciej)
  - put self SOB tag as last tag (Maciej)

V8: https://lore.kernel.org/netdev/20250205024116.798862-1-yoong.siang.song@intel.com/
  - check the number of used descriptors in xsk_tx_metadata_request()
    by using used_desc of struct igc_metadata_request, and then decrease
    the budget by it (Maciej)
  - submit another bug fix patch to set the buffer type for empty frame (Maciej):
    https://lore.kernel.org/netdev/20250205023603.798819-1-yoong.siang.song@intel.com/

V7: https://lore.kernel.org/netdev/20250204004907.789330-1-yoong.siang.song@intel.com/
  - split the refactoring code of igc empty packet insertion into a separate
    commit (Faizal)
  - add explanation on why the value "4" is used as igc transmit budget
    (Faizal)
  - perform a stress test by sending 1000 packets with 10ms interval and
    launch time set to 500us in the future (Faizal & Yong Liang)

V6: https://lore.kernel.org/netdev/20250116155350.555374-1-yoong.siang.song@intel.com/
  - fix selftest build errors by using asprintf() and realloc(), instead of
    managing the buffer sizes manually (Daniel, Stanislav)

V5: https://lore.kernel.org/netdev/20250114152718.120588-1-yoong.siang.song@intel.com/
  - change netdev feature name from tx-launch-time to tx-launch-time-fifo
    to explicitly state the FIFO behaviour (Stanislav)
  - improve the looping of xdp_hw_metadata app to wait for packet tx
    completion to be more readable by using clock_gettime() (Stanislav)
  - add launch time setup steps into xdp_hw_metadata app (Stanislav)

V4: https://lore.kernel.org/netdev/20250106135506.9687-1-yoong.siang.song@intel.com/
  - added XDP launch time support to the igc driver (Jesper & Florian)
  - added per-driver launch time limitation on xsk-tx-metadata.rst (Jesper)
  - added explanation on FIFO behavior on xsk-tx-metadata.rst (Jakub)
  - added step to enable launch time in the commit message (Jesper & Willem)
  - explicitly documented the type of launch_time and which clock source
    it is against (Willem)

V3: https://lore.kernel.org/netdev/20231203165129.1740512-1-yoong.siang.song@intel.com/
  - renamed to use launch time (Jesper & Willem)
  - changed the default launch time in xdp_hw_metadata apps from 1s to 0.1s
    because some NICs do not support such a large future time.

V2: https://lore.kernel.org/netdev/20231201062421.1074768-1-yoong.siang.song@intel.com/
  - renamed to use Earliest TxTime First (Willem)
  - renamed to use txtime (Willem)

V1: https://lore.kernel.org/netdev/20231130162028.852006-1-yoong.siang.song@intel.com/
====================

Link: https://patch.msgid.link/20250216093430.957880-1-yoong.siang.song@intel.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Martin KaFai Lau committed Feb 20, 2025
2 parents 68b92ac + d7c3a7f commit 494a044
Showing 15 changed files with 396 additions and 39 deletions.
4 changes: 4 additions & 0 deletions Documentation/netlink/specs/netdev.yaml
@@ -70,6 +70,10 @@ definitions:
name: tx-checksum
doc:
L3 checksum HW offload is supported by the driver.
-
name: tx-launch-time-fifo
doc:
Launch time HW offload is supported by the driver.
-
name: queue-type
type: enum
62 changes: 62 additions & 0 deletions Documentation/networking/xsk-tx-metadata.rst
@@ -50,6 +50,10 @@ The flags field enables the particular offload:
checksum. ``csum_start`` specifies byte offset of where the checksumming
should start and ``csum_offset`` specifies byte offset where the
device should store the computed checksum.
- ``XDP_TXMD_FLAGS_LAUNCH_TIME``: requests the device to schedule the
packet for transmission at a pre-determined time called launch time. The
value of launch time is indicated by ``launch_time`` field of
``union xsk_tx_metadata``.

Besides the flags above, in order to trigger the offloads, the first
packet's ``struct xdp_desc`` descriptor should set ``XDP_TX_METADATA``
@@ -65,6 +69,63 @@ In this case, when running in ``XDK_COPY`` mode, the TX checksum
is calculated on the CPU. Do not enable this option in production because
it will negatively affect performance.

Launch Time
===========

The value of the requested launch time should be based on the device's PTP
Hardware Clock (PHC) to ensure accuracy. AF_XDP takes a different data path
compared to the ETF queuing discipline, which organizes packets and delays
their transmission. Instead, AF_XDP immediately hands off the packets to
the device driver without rearranging their order or holding them prior to
transmission. Since the driver maintains FIFO behavior and does not perform
packet reordering, a packet with a launch time request will block other
packets in the same Tx Queue until it is sent. Therefore, it is recommended
to allocate separate queue for scheduling traffic that is intended for
future transmission.

In scenarios where the launch time offload feature is disabled, the device
driver is expected to disregard the launch time request. For correct
interpretation and meaningful operation, the launch time should never be
set to a value larger than the farthest programmable time in the future
(the horizon). Different devices have different hardware limitations on the
launch time offload feature.

stmmac driver
-------------

For stmmac, TSO and launch time (TBS) features are mutually exclusive for
each individual Tx Queue. By default, the driver configures Tx Queue 0 to
support TSO and the rest of the Tx Queues to support TBS. The launch time
hardware offload feature can be enabled or disabled by using the tc-etf
command to call the driver's ndo_setup_tc() callback.

The value of the launch time that is programmed in the Enhanced Normal
Transmit Descriptors is a 32-bit value, where the most significant 8 bits
represent the time in seconds and the remaining 24 bits represent the time
in 256 ns increments. The programmed launch time is compared against the
PTP time (bits[39:8]) and rolls over after 256 seconds. Therefore, the
horizon of the launch time for dwmac4 and dwxlgmac2 is 128 seconds in the
future.
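
To illustrate the rollover described above, the sketch below packs a nanosecond launch time into a 32-bit value laid out as the text describes (8 bits of seconds, 24 bits of 256 ns units) and checks a request against the 128-second horizon. This only models the documented layout; it is not the driver's descriptor-programming code, and `pack_launch_time()`, `within_horizon()` and `NSEC_PER_SEC_SKETCH` are hypothetical names.

```c
#include <stdbool.h>
#include <stdint.h>

#define NSEC_PER_SEC_SKETCH 1000000000ULL

/* Top 8 bits: seconds modulo 256. Low 24 bits: nanoseconds within the
 * second, in 256 ns units.
 */
static uint32_t pack_launch_time(uint64_t launch_time_ns)
{
	uint64_t sec = launch_time_ns / NSEC_PER_SEC_SKETCH;
	uint64_t nsec = launch_time_ns % NSEC_PER_SEC_SKETCH;

	return (uint32_t)(((sec & 0xff) << 24) | ((nsec >> 8) & 0xffffff));
}

/* A request is meaningful only if it lies no more than 128 seconds
 * ahead of the current PHC time.
 */
static bool within_horizon(uint64_t now_ns, uint64_t launch_time_ns)
{
	return launch_time_ns >= now_ns &&
	       launch_time_ns - now_ns <= 128 * NSEC_PER_SEC_SKETCH;
}
```

Because only the low 8 bits of the seconds survive, launch times of 44 s and 300 s pack to the same value, which is why a request beyond the horizon can be transmitted earlier than intended.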

igc driver
----------

For igc, all four Tx Queues support the launch time feature. The launch
time hardware offload feature can be enabled or disabled by using the
tc-etf command to call the driver's ndo_setup_tc() callback. When entering
TSN mode, the igc driver will reset the device and create a default Qbv
schedule with a 1-second cycle time, with all Tx Queues open at all times.

The value of the launch time that is programmed in the Advanced Transmit
Context Descriptor is a relative offset to the starting time of the Qbv
transmission window of the queue. The Frst flag of the descriptor can be
set to schedule the packet for the next Qbv cycle. Therefore, the horizon
of the launch time for i225 and i226 is the ending time of the next cycle
of the Qbv transmission window of the queue. For example, when the Qbv
cycle time is set to 1 second, the horizon of the launch time ranges
from 1 second to 2 seconds, depending on where the Qbv cycle is currently
running.
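
The horizon arithmetic in the example above can be sketched as follows, assuming the default schedule in which the queue's transmission window spans the whole Qbv cycle. `igc_horizon_sketch()` and its parameters are hypothetical and do not exist in the driver.

```c
#include <stdint.h>

/* Absolute time at which the next Qbv cycle ends.
 * Assumes now_ns >= base_time_ns and cycle_ns > 0.
 */
static uint64_t igc_horizon_sketch(uint64_t now_ns, uint64_t base_time_ns,
				   uint64_t cycle_ns)
{
	/* How far we are into the currently running cycle */
	uint64_t elapsed = (now_ns - base_time_ns) % cycle_ns;
	uint64_t cur_start = now_ns - elapsed;

	/* The current cycle ends at cur_start + cycle_ns, the next one
	 * ends one cycle later.
	 */
	return cur_start + 2 * cycle_ns;
}
```

With a 1-second cycle, a request made 0.25 s into the cycle has 1.75 s of usable future time, while a request made at the very start of a cycle has the full 2 s.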

Querying Device Capabilities
============================

@@ -74,6 +135,7 @@ Refer to ``xsk-flags`` features bitmask in

- ``tx-timestamp``: device supports ``XDP_TXMD_FLAGS_TIMESTAMP``
- ``tx-checksum``: device supports ``XDP_TXMD_FLAGS_CHECKSUM``
- ``tx-launch-time-fifo``: device supports ``XDP_TXMD_FLAGS_LAUNCH_TIME``

See ``tools/net/ynl/samples/netdev.c`` on how to query this information.

1 change: 1 addition & 0 deletions drivers/net/ethernet/intel/igc/igc.h
@@ -579,6 +579,7 @@ struct igc_metadata_request {
struct xsk_tx_metadata *meta;
struct igc_ring *tx_ring;
u32 cmd_type;
u16 used_desc;
};

struct igc_q_vector {
143 changes: 109 additions & 34 deletions drivers/net/ethernet/intel/igc/igc_main.c
@@ -1092,7 +1092,8 @@ static int igc_init_empty_frame(struct igc_ring *ring,

dma = dma_map_single(ring->dev, skb->data, size, DMA_TO_DEVICE);
if (dma_mapping_error(ring->dev, dma)) {
netdev_err_once(ring->netdev, "Failed to map DMA for TX\n");
net_err_ratelimited("%s: DMA mapping error for empty frame\n",
netdev_name(ring->netdev));
return -ENOMEM;
}

@@ -1108,20 +1109,12 @@ static int igc_init_empty_frame(struct igc_ring *ring,
return 0;
}

static int igc_init_tx_empty_descriptor(struct igc_ring *ring,
struct sk_buff *skb,
struct igc_tx_buffer *first)
static void igc_init_tx_empty_descriptor(struct igc_ring *ring,
struct sk_buff *skb,
struct igc_tx_buffer *first)
{
union igc_adv_tx_desc *desc;
u32 cmd_type, olinfo_status;
int err;

if (!igc_desc_unused(ring))
return -EBUSY;

err = igc_init_empty_frame(ring, first, skb);
if (err)
return err;

cmd_type = IGC_ADVTXD_DTYP_DATA | IGC_ADVTXD_DCMD_DEXT |
IGC_ADVTXD_DCMD_IFCS | IGC_TXD_DCMD |
@@ -1140,8 +1133,6 @@ static int igc_init_tx_empty_descriptor(struct igc_ring *ring,
ring->next_to_use++;
if (ring->next_to_use == ring->count)
ring->next_to_use = 0;

return 0;
}

#define IGC_EMPTY_FRAME_SIZE 60
@@ -1567,6 +1558,40 @@ static bool igc_request_tx_tstamp(struct igc_adapter *adapter, struct sk_buff *s
return false;
}

static int igc_insert_empty_frame(struct igc_ring *tx_ring)
{
struct igc_tx_buffer *empty_info;
struct sk_buff *empty_skb;
void *data;
int ret;

empty_info = &tx_ring->tx_buffer_info[tx_ring->next_to_use];
empty_skb = alloc_skb(IGC_EMPTY_FRAME_SIZE, GFP_ATOMIC);
if (unlikely(!empty_skb)) {
net_err_ratelimited("%s: skb alloc error for empty frame\n",
netdev_name(tx_ring->netdev));
return -ENOMEM;
}

data = skb_put(empty_skb, IGC_EMPTY_FRAME_SIZE);
memset(data, 0, IGC_EMPTY_FRAME_SIZE);

/* Prepare DMA mapping and Tx buffer information */
ret = igc_init_empty_frame(tx_ring, empty_info, empty_skb);
if (unlikely(ret)) {
dev_kfree_skb_any(empty_skb);
return ret;
}

/* Prepare advanced context descriptor for empty packet */
igc_tx_ctxtdesc(tx_ring, 0, false, 0, 0, 0);

/* Prepare advanced data descriptor for empty packet */
igc_init_tx_empty_descriptor(tx_ring, empty_skb, empty_info);

return 0;
}

static netdev_tx_t igc_xmit_frame_ring(struct sk_buff *skb,
struct igc_ring *tx_ring)
{
@@ -1586,6 +1611,7 @@ static netdev_tx_t igc_xmit_frame_ring(struct sk_buff *skb,
* + 1 desc for skb_headlen/IGC_MAX_DATA_PER_TXD,
* + 2 desc gap to keep tail from touching head,
* + 1 desc for context descriptor,
* + 2 desc for inserting an empty packet for launch time,
* otherwise try next time
*/
for (f = 0; f < skb_shinfo(skb)->nr_frags; f++)
@@ -1605,24 +1631,16 @@ static netdev_tx_t igc_xmit_frame_ring(struct sk_buff *skb,
launch_time = igc_tx_launchtime(tx_ring, txtime, &first_flag, &insert_empty);

if (insert_empty) {
struct igc_tx_buffer *empty_info;
struct sk_buff *empty;
void *data;

empty_info = &tx_ring->tx_buffer_info[tx_ring->next_to_use];
empty = alloc_skb(IGC_EMPTY_FRAME_SIZE, GFP_ATOMIC);
if (!empty)
goto done;

data = skb_put(empty, IGC_EMPTY_FRAME_SIZE);
memset(data, 0, IGC_EMPTY_FRAME_SIZE);

igc_tx_ctxtdesc(tx_ring, 0, false, 0, 0, 0);

if (igc_init_tx_empty_descriptor(tx_ring,
empty,
empty_info) < 0)
dev_kfree_skb_any(empty);
/* Reset the launch time if the required empty frame fails to
* be inserted. However, this packet is not dropped, so it
* "dirties" the current Qbv cycle. This ensures that the
* upcoming packet, which is scheduled in the next Qbv cycle,
* does not require an empty frame. This way, the launch time
* continues to function correctly despite the current failure
* to insert the empty frame.
*/
if (igc_insert_empty_frame(tx_ring))
launch_time = 0;
}

done:
@@ -2953,9 +2971,48 @@ static u64 igc_xsk_fill_timestamp(void *_priv)
return *(u64 *)_priv;
}

static void igc_xsk_request_launch_time(u64 launch_time, void *_priv)
{
struct igc_metadata_request *meta_req = _priv;
struct igc_ring *tx_ring = meta_req->tx_ring;
__le32 launch_time_offset;
bool insert_empty = false;
bool first_flag = false;
u16 used_desc = 0;

if (!tx_ring->launchtime_enable)
return;

launch_time_offset = igc_tx_launchtime(tx_ring,
ns_to_ktime(launch_time),
&first_flag, &insert_empty);
if (insert_empty) {
/* Disregard the launch time request if the required empty frame
* fails to be inserted.
*/
if (igc_insert_empty_frame(tx_ring))
return;

meta_req->tx_buffer =
&tx_ring->tx_buffer_info[tx_ring->next_to_use];
/* Inserting an empty packet requires two descriptors:
* one data descriptor and one context descriptor.
*/
used_desc += 2;
}

/* Use one context descriptor to specify launch time and first flag. */
igc_tx_ctxtdesc(tx_ring, launch_time_offset, first_flag, 0, 0, 0);
used_desc += 1;

/* Update the number of used descriptors in this request */
meta_req->used_desc += used_desc;
}

const struct xsk_tx_metadata_ops igc_xsk_tx_metadata_ops = {
.tmo_request_timestamp = igc_xsk_request_timestamp,
.tmo_fill_timestamp = igc_xsk_fill_timestamp,
.tmo_request_launch_time = igc_xsk_request_launch_time,
};

static void igc_xdp_xmit_zc(struct igc_ring *ring)
@@ -2978,7 +3035,13 @@ static void igc_xdp_xmit_zc(struct igc_ring *ring)
ntu = ring->next_to_use;
budget = igc_desc_unused(ring);

while (xsk_tx_peek_desc(pool, &xdp_desc) && budget--) {
/* Packets with launch time require one data descriptor and one context
* descriptor. When the launch time falls into the next Qbv cycle, we
* may need to insert an empty packet, which requires two more
* descriptors. Therefore, to be safe, we always ensure we have at least
* 4 descriptors available.
*/
while (xsk_tx_peek_desc(pool, &xdp_desc) && budget >= 4) {
struct igc_metadata_request meta_req;
struct xsk_tx_metadata *meta = NULL;
struct igc_tx_buffer *bi;
@@ -2999,9 +3062,19 @@ static void igc_xdp_xmit_zc(struct igc_ring *ring)
meta_req.tx_ring = ring;
meta_req.tx_buffer = bi;
meta_req.meta = meta;
meta_req.used_desc = 0;
xsk_tx_metadata_request(meta, &igc_xsk_tx_metadata_ops,
&meta_req);

/* xsk_tx_metadata_request() may have updated next_to_use */
ntu = ring->next_to_use;

/* xsk_tx_metadata_request() may have updated Tx buffer info */
bi = meta_req.tx_buffer;

/* xsk_tx_metadata_request() may use a few descriptors */
budget -= meta_req.used_desc;

tx_desc = IGC_TX_DESC(ring, ntu);
tx_desc->read.cmd_type_len = cpu_to_le32(meta_req.cmd_type);
tx_desc->read.olinfo_status = cpu_to_le32(olinfo_status);
@@ -3019,9 +3092,11 @@ static void igc_xdp_xmit_zc(struct igc_ring *ring)
ntu++;
if (ntu == ring->count)
ntu = 0;

ring->next_to_use = ntu;
budget--;
}

ring->next_to_use = ntu;
if (tx_desc) {
igc_flush_tx_descriptors(ring);
xsk_tx_release(pool);
2 changes: 2 additions & 0 deletions drivers/net/ethernet/stmicro/stmmac/stmmac.h
@@ -106,6 +106,8 @@ struct stmmac_metadata_request {
struct stmmac_priv *priv;
struct dma_desc *tx_desc;
bool *set_ic;
struct dma_edesc *edesc;
int tbs;
};

struct stmmac_xsk_tx_complete {
13 changes: 13 additions & 0 deletions drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
@@ -2491,9 +2491,20 @@ static u64 stmmac_xsk_fill_timestamp(void *_priv)
return 0;
}

static void stmmac_xsk_request_launch_time(u64 launch_time, void *_priv)
{
struct timespec64 ts = ns_to_timespec64(launch_time);
struct stmmac_metadata_request *meta_req = _priv;

if (meta_req->tbs & STMMAC_TBS_EN)
stmmac_set_desc_tbs(meta_req->priv, meta_req->edesc, ts.tv_sec,
ts.tv_nsec);
}

static const struct xsk_tx_metadata_ops stmmac_xsk_tx_metadata_ops = {
.tmo_request_timestamp = stmmac_xsk_request_timestamp,
.tmo_fill_timestamp = stmmac_xsk_fill_timestamp,
.tmo_request_launch_time = stmmac_xsk_request_launch_time,
};

static bool stmmac_xdp_xmit_zc(struct stmmac_priv *priv, u32 queue, u32 budget)
@@ -2577,6 +2588,8 @@ static bool stmmac_xdp_xmit_zc(struct stmmac_priv *priv, u32 queue, u32 budget)
meta_req.priv = priv;
meta_req.tx_desc = tx_desc;
meta_req.set_ic = &set_ic;
meta_req.tbs = tx_q->tbs;
meta_req.edesc = &tx_q->dma_entx[entry];
xsk_tx_metadata_request(meta, &stmmac_xsk_tx_metadata_ops,
&meta_req);
if (set_ic) {
10 changes: 10 additions & 0 deletions include/net/xdp_sock.h
@@ -110,11 +110,16 @@ struct xdp_sock {
* indicates position where checksumming should start.
* csum_offset indicates position where checksum should be stored.
*
* void (*tmo_request_launch_time)(u64 launch_time, void *priv)
* Called when AF_XDP frame requested launch time HW offload support.
* launch_time indicates the PTP time at which the device can schedule the
* packet for transmission.
*/
struct xsk_tx_metadata_ops {
void (*tmo_request_timestamp)(void *priv);
u64 (*tmo_fill_timestamp)(void *priv);
void (*tmo_request_checksum)(u16 csum_start, u16 csum_offset, void *priv);
void (*tmo_request_launch_time)(u64 launch_time, void *priv);
};

#ifdef CONFIG_XDP_SOCKETS
@@ -162,6 +167,11 @@ static inline void xsk_tx_metadata_request(const struct xsk_tx_metadata *meta,
if (!meta)
return;

if (ops->tmo_request_launch_time)
if (meta->flags & XDP_TXMD_FLAGS_LAUNCH_TIME)
ops->tmo_request_launch_time(meta->request.launch_time,
priv);

if (ops->tmo_request_timestamp)
if (meta->flags & XDP_TXMD_FLAGS_TIMESTAMP)
ops->tmo_request_timestamp(priv);