From 194ef9d0de9021df4a0ba8b112f91e56adaddd22 Mon Sep 17 00:00:00 2001 From: Vladimir Oltean Date: Fri, 13 Sep 2024 15:12:30 +0300 Subject: [PATCH 01/35] net: phy: aquantia: fix -ETIMEDOUT PHY probe failure when firmware not present The author of the blamed commit apparently did not notice something about aqr_wait_reset_complete(): it polls the exact same register - MDIO_MMD_VEND1:VEND1_GLOBAL_FW_ID - as aqr_firmware_load(). Thus, the entire logic after the introduction of aqr_wait_reset_complete() is now completely side-stepped, because if aqr_wait_reset_complete() succeeds, MDIO_MMD_VEND1:VEND1_GLOBAL_FW_ID could have only been a non-zero value. The handling of the case where the register reads as 0 is dead code, due to the previous -ETIMEDOUT having stopped execution and returning a fatal error to the caller. We never attempt to load new firmware if no firmware is present. Based on static code analysis, I guess we should simply introduce a switch/case statement based on the return code from aqr_wait_reset_complete(), to determine whether to load firmware or not. I am not intending to change the procedure through which the driver determines whether to load firmware or not, as I am unaware of alternative possibilities. At the same time, Russell King suggests that if aqr_wait_reset_complete() is expected to return -ETIMEDOUT as part of normal operation and not just catastrophic failure, the use of phy_read_mmd_poll_timeout() is improper, since that has an embedded print inside. Just open-code a call to read_poll_timeout() to avoid printing -ETIMEDOUT, but continue printing actual read errors from the MDIO bus. Fixes: ad649a1fac37 ("net: phy: aquantia: wait for FW reset before checking the vendor ID") Reported-by: Clark Wang Reported-by: Jon Hunter Closes: https://lore.kernel.org/netdev/8ac00a45-ac61-41b4-9f74-d18157b8b6bf@nvidia.com/ Reported-by: Hans-Frieder Vogt Closes: https://lore.kernel.org/netdev/c7c1a3ae-be97-4929-8d89-04c8aa870209@gmx.net/ Signed-off-by: Vladimir Oltean Tested-by: Bartosz Golaszewski Tested-by: Hans-Frieder Vogt Link: https://patch.msgid.link/20240913121230.2620122-1-vladimir.oltean@nxp.com Signed-off-by: Paolo Abeni --- drivers/net/phy/aquantia/aquantia_firmware.c | 42 +++++++++++--------- drivers/net/phy/aquantia/aquantia_main.c | 19 +++++++-- 2 files changed, 39 insertions(+), 22 deletions(-) diff --git a/drivers/net/phy/aquantia/aquantia_firmware.c b/drivers/net/phy/aquantia/aquantia_firmware.c index 524627a36c6fc..dac6464b5fe2e 100644 --- a/drivers/net/phy/aquantia/aquantia_firmware.c +++ b/drivers/net/phy/aquantia/aquantia_firmware.c @@ -353,26 +353,32 @@ int aqr_firmware_load(struct phy_device *phydev) { int ret; - ret = aqr_wait_reset_complete(phydev); - if (ret) - return ret; - - /* Check if the firmware is not already loaded by pooling - * the current version returned by the PHY. If 0 is returned, - * no firmware is loaded. + /* Check if the firmware is not already loaded by polling + * the current version returned by the PHY. */ - ret = phy_read_mmd(phydev, MDIO_MMD_VEND1, VEND1_GLOBAL_FW_ID); - if (ret > 0) - goto exit; - - ret = aqr_firmware_load_nvmem(phydev); - if (!ret) - goto exit; - - ret = aqr_firmware_load_fs(phydev); - if (ret) + ret = aqr_wait_reset_complete(phydev); + switch (ret) { + case 0: + /* Some firmware is loaded => do nothing */ + return 0; + case -ETIMEDOUT: + /* VEND1_GLOBAL_FW_ID still reads 0 after 2 seconds of polling. + * We don't have full confidence that no firmware is loaded (in + * theory it might just not have loaded yet), but we will + * assume that, and load a new image. + */ + ret = aqr_firmware_load_nvmem(phydev); + if (!ret) + return ret; + + ret = aqr_firmware_load_fs(phydev); + if (ret) + return ret; + break; + default: + /* PHY read error, propagate it to the caller */ return ret; + } -exit: return 0; } diff --git a/drivers/net/phy/aquantia/aquantia_main.c b/drivers/net/phy/aquantia/aquantia_main.c index e982e9ce44a59..57b8b8f400fd4 100644 --- a/drivers/net/phy/aquantia/aquantia_main.c +++ b/drivers/net/phy/aquantia/aquantia_main.c @@ -435,6 +435,9 @@ static int aqr107_set_tunable(struct phy_device *phydev, } } +#define AQR_FW_WAIT_SLEEP_US 20000 +#define AQR_FW_WAIT_TIMEOUT_US 2000000 + /* If we configure settings whilst firmware is still initializing the chip, * then these settings may be overwritten. Therefore make sure chip * initialization has completed. Use presence of the firmware ID as @@ -444,11 +447,19 @@ static int aqr107_set_tunable(struct phy_device *phydev, */ int aqr_wait_reset_complete(struct phy_device *phydev) { - int val; + int ret, val; + + ret = read_poll_timeout(phy_read_mmd, val, val != 0, + AQR_FW_WAIT_SLEEP_US, AQR_FW_WAIT_TIMEOUT_US, + false, phydev, MDIO_MMD_VEND1, + VEND1_GLOBAL_FW_ID); + if (val < 0) { + phydev_err(phydev, "Failed to read VEND1_GLOBAL_FW_ID: %pe\n", + ERR_PTR(val)); + return val; + } - return phy_read_mmd_poll_timeout(phydev, MDIO_MMD_VEND1, - VEND1_GLOBAL_FW_ID, val, val != 0, - 20000, 2000000, false); + return ret; } static void aqr107_chip_info(struct phy_device *phydev) From ba0da2dc934ec5ac32bbeecbd0670da16ba03565 Mon Sep 17 00:00:00 2001 From: Sean Anderson Date: Fri, 13 Sep 2024 10:57:11 -0400 Subject: [PATCH 02/35] net: xilinx: axienet: Schedule NAPI in two steps As advised by Documentation/networking/napi.rst, masking IRQs after calling napi_schedule can be racy. Avoid this by only masking/scheduling if napi_schedule_prep returns true. Fixes: 9e2bc267e780 ("net: axienet: Use NAPI for TX completion path") Fixes: cc37610caaf8 ("net: axienet: implement NAPI and GRO receive") Signed-off-by: Sean Anderson Reviewed-by: Shannon Nelson Reviewed-by: Eric Dumazet Link: https://patch.msgid.link/20240913145711.2284295-1-sean.anderson@linux.dev Signed-off-by: Paolo Abeni --- drivers/net/ethernet/xilinx/xilinx_axienet_main.c | 14 ++++++++------ 1 file changed, 8 insertions(+), 6 deletions(-) diff --git a/drivers/net/ethernet/xilinx/xilinx_axienet_main.c b/drivers/net/ethernet/xilinx/xilinx_axienet_main.c index ea7d7c03f48e5..a3da236489ec2 100644 --- a/drivers/net/ethernet/xilinx/xilinx_axienet_main.c +++ b/drivers/net/ethernet/xilinx/xilinx_axienet_main.c @@ -1282,9 +1282,10 @@ static irqreturn_t axienet_tx_irq(int irq, void *_ndev) u32 cr = lp->tx_dma_cr; cr &= ~(XAXIDMA_IRQ_IOC_MASK | XAXIDMA_IRQ_DELAY_MASK); - axienet_dma_out32(lp, XAXIDMA_TX_CR_OFFSET, cr); - - napi_schedule(&lp->napi_tx); + if (napi_schedule_prep(&lp->napi_tx)) { + axienet_dma_out32(lp, XAXIDMA_TX_CR_OFFSET, cr); + __napi_schedule(&lp->napi_tx); + } } return IRQ_HANDLED; @@ -1326,9 +1327,10 @@ static irqreturn_t axienet_rx_irq(int irq, void *_ndev) u32 cr = lp->rx_dma_cr; cr &= ~(XAXIDMA_IRQ_IOC_MASK | XAXIDMA_IRQ_DELAY_MASK); - axienet_dma_out32(lp, XAXIDMA_RX_CR_OFFSET, cr); - - napi_schedule(&lp->napi_rx); + if (napi_schedule_prep(&lp->napi_rx)) { + axienet_dma_out32(lp, XAXIDMA_RX_CR_OFFSET, cr); + __napi_schedule(&lp->napi_rx); + } } return IRQ_HANDLED; From 5a6caa2cfabb559309b5ce29ee7c8e9ce1a9a9df Mon Sep 17 00:00:00 2001 From: Sean Anderson Date: Fri, 13 Sep 2024 10:51:56 -0400 Subject: [PATCH 03/35] net: xilinx: axienet: Fix packet counting axienet_free_tx_chain returns the number of DMA descriptors it's handled. However, axienet_tx_poll treats the return as the number of packets. When scatter-gather SKBs are enabled, a single packet may use multiple DMA descriptors, which causes incorrect packet counts. Fix this by explicitly keepting track of the number of packets processed as separate from the DMA descriptors. Budget does not affect the number of Tx completions we can process for NAPI, so we use the ring size as the limit instead of budget. As we no longer return the number of descriptors processed to axienet_tx_poll, we now update tx_bd_ci in axienet_free_tx_chain. Fixes: 8a3b7a252dca ("drivers/net/ethernet/xilinx: added Xilinx AXI Ethernet driver") Signed-off-by: Sean Anderson Link: https://patch.msgid.link/20240913145156.2283067-1-sean.anderson@linux.dev Signed-off-by: Paolo Abeni --- .../net/ethernet/xilinx/xilinx_axienet_main.c | 23 +++++++++++-------- 1 file changed, 14 insertions(+), 9 deletions(-) diff --git a/drivers/net/ethernet/xilinx/xilinx_axienet_main.c b/drivers/net/ethernet/xilinx/xilinx_axienet_main.c index a3da236489ec2..fc35fcb22d94f 100644 --- a/drivers/net/ethernet/xilinx/xilinx_axienet_main.c +++ b/drivers/net/ethernet/xilinx/xilinx_axienet_main.c @@ -736,15 +736,15 @@ static int axienet_device_reset(struct net_device *ndev) * * Would either be called after a successful transmit operation, or after * there was an error when setting up the chain. - * Returns the number of descriptors handled. + * Returns the number of packets handled. */ static int axienet_free_tx_chain(struct axienet_local *lp, u32 first_bd, int nr_bds, bool force, u32 *sizep, int budget) { struct axidma_bd *cur_p; unsigned int status; + int i, packets = 0; dma_addr_t phys; - int i; for (i = 0; i < nr_bds; i++) { cur_p = &lp->tx_bd_v[(first_bd + i) % lp->tx_bd_num]; @@ -763,8 +763,10 @@ static int axienet_free_tx_chain(struct axienet_local *lp, u32 first_bd, (cur_p->cntrl & XAXIDMA_BD_CTRL_LENGTH_MASK), DMA_TO_DEVICE); - if (cur_p->skb && (status & XAXIDMA_BD_STS_COMPLETE_MASK)) + if (cur_p->skb && (status & XAXIDMA_BD_STS_COMPLETE_MASK)) { napi_consume_skb(cur_p->skb, budget); + packets++; + } cur_p->app0 = 0; cur_p->app1 = 0; @@ -780,7 +782,13 @@ static int axienet_free_tx_chain(struct axienet_local *lp, u32 first_bd, *sizep += status & XAXIDMA_BD_STS_ACTUAL_LEN_MASK; } - return i; + if (!force) { + lp->tx_bd_ci += i; + if (lp->tx_bd_ci >= lp->tx_bd_num) + lp->tx_bd_ci %= lp->tx_bd_num; + } + + return packets; } /** @@ -953,13 +961,10 @@ static int axienet_tx_poll(struct napi_struct *napi, int budget) u32 size = 0; int packets; - packets = axienet_free_tx_chain(lp, lp->tx_bd_ci, budget, false, &size, budget); + packets = axienet_free_tx_chain(lp, lp->tx_bd_ci, lp->tx_bd_num, false, + &size, budget); if (packets) { - lp->tx_bd_ci += packets; - if (lp->tx_bd_ci >= lp->tx_bd_num) - lp->tx_bd_ci %= lp->tx_bd_num; - u64_stats_update_begin(&lp->tx_stat_sync); u64_stats_add(&lp->tx_packets, packets); u64_stats_add(&lp->tx_bytes, size); From 9c778fe48d20ef362047e3376dee56d77f8500d4 Mon Sep 17 00:00:00 2001 From: Eric Dumazet Date: Fri, 13 Sep 2024 17:06:15 +0000 Subject: [PATCH 04/35] netfilter: nf_reject_ipv6: fix nf_reject_ip6_tcphdr_put() syzbot reported that nf_reject_ip6_tcphdr_put() was possibly sending garbage on the four reserved tcp bits (th->res1) Use skb_put_zero() to clear the whole TCP header, as done in nf_reject_ip_tcphdr_put() BUG: KMSAN: uninit-value in nf_reject_ip6_tcphdr_put+0x688/0x6c0 net/ipv6/netfilter/nf_reject_ipv6.c:255 nf_reject_ip6_tcphdr_put+0x688/0x6c0 net/ipv6/netfilter/nf_reject_ipv6.c:255 nf_send_reset6+0xd84/0x15b0 net/ipv6/netfilter/nf_reject_ipv6.c:344 nft_reject_inet_eval+0x3c1/0x880 net/netfilter/nft_reject_inet.c:48 expr_call_ops_eval net/netfilter/nf_tables_core.c:240 [inline] nft_do_chain+0x438/0x22a0 net/netfilter/nf_tables_core.c:288 nft_do_chain_inet+0x41a/0x4f0 net/netfilter/nft_chain_filter.c:161 nf_hook_entry_hookfn include/linux/netfilter.h:154 [inline] nf_hook_slow+0xf4/0x400 net/netfilter/core.c:626 nf_hook include/linux/netfilter.h:269 [inline] NF_HOOK include/linux/netfilter.h:312 [inline] ipv6_rcv+0x29b/0x390 net/ipv6/ip6_input.c:310 __netif_receive_skb_one_core net/core/dev.c:5661 [inline] __netif_receive_skb+0x1da/0xa00 net/core/dev.c:5775 process_backlog+0x4ad/0xa50 net/core/dev.c:6108 __napi_poll+0xe7/0x980 net/core/dev.c:6772 napi_poll net/core/dev.c:6841 [inline] net_rx_action+0xa5a/0x19b0 net/core/dev.c:6963 handle_softirqs+0x1ce/0x800 kernel/softirq.c:554 __do_softirq+0x14/0x1a kernel/softirq.c:588 do_softirq+0x9a/0x100 kernel/softirq.c:455 __local_bh_enable_ip+0x9f/0xb0 kernel/softirq.c:382 local_bh_enable include/linux/bottom_half.h:33 [inline] rcu_read_unlock_bh include/linux/rcupdate.h:908 [inline] __dev_queue_xmit+0x2692/0x5610 net/core/dev.c:4450 dev_queue_xmit include/linux/netdevice.h:3105 [inline] neigh_resolve_output+0x9ca/0xae0 net/core/neighbour.c:1565 neigh_output include/net/neighbour.h:542 [inline] ip6_finish_output2+0x2347/0x2ba0 net/ipv6/ip6_output.c:141 __ip6_finish_output net/ipv6/ip6_output.c:215 [inline] ip6_finish_output+0xbb8/0x14b0 net/ipv6/ip6_output.c:226 NF_HOOK_COND include/linux/netfilter.h:303 [inline] ip6_output+0x356/0x620 net/ipv6/ip6_output.c:247 dst_output include/net/dst.h:450 [inline] NF_HOOK include/linux/netfilter.h:314 [inline] ip6_xmit+0x1ba6/0x25d0 net/ipv6/ip6_output.c:366 inet6_csk_xmit+0x442/0x530 net/ipv6/inet6_connection_sock.c:135 __tcp_transmit_skb+0x3b07/0x4880 net/ipv4/tcp_output.c:1466 tcp_transmit_skb net/ipv4/tcp_output.c:1484 [inline] tcp_connect+0x35b6/0x7130 net/ipv4/tcp_output.c:4143 tcp_v6_connect+0x1bcc/0x1e40 net/ipv6/tcp_ipv6.c:333 __inet_stream_connect+0x2ef/0x1730 net/ipv4/af_inet.c:679 inet_stream_connect+0x6a/0xd0 net/ipv4/af_inet.c:750 __sys_connect_file net/socket.c:2061 [inline] __sys_connect+0x606/0x690 net/socket.c:2078 __do_sys_connect net/socket.c:2088 [inline] __se_sys_connect net/socket.c:2085 [inline] __x64_sys_connect+0x91/0xe0 net/socket.c:2085 x64_sys_call+0x27a5/0x3ba0 arch/x86/include/generated/asm/syscalls_64.h:43 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xcd/0x1e0 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f Uninit was stored to memory at: nf_reject_ip6_tcphdr_put+0x60c/0x6c0 net/ipv6/netfilter/nf_reject_ipv6.c:249 nf_send_reset6+0xd84/0x15b0 net/ipv6/netfilter/nf_reject_ipv6.c:344 nft_reject_inet_eval+0x3c1/0x880 net/netfilter/nft_reject_inet.c:48 expr_call_ops_eval net/netfilter/nf_tables_core.c:240 [inline] nft_do_chain+0x438/0x22a0 net/netfilter/nf_tables_core.c:288 nft_do_chain_inet+0x41a/0x4f0 net/netfilter/nft_chain_filter.c:161 nf_hook_entry_hookfn include/linux/netfilter.h:154 [inline] nf_hook_slow+0xf4/0x400 net/netfilter/core.c:626 nf_hook include/linux/netfilter.h:269 [inline] NF_HOOK include/linux/netfilter.h:312 [inline] ipv6_rcv+0x29b/0x390 net/ipv6/ip6_input.c:310 __netif_receive_skb_one_core net/core/dev.c:5661 [inline] __netif_receive_skb+0x1da/0xa00 net/core/dev.c:5775 process_backlog+0x4ad/0xa50 net/core/dev.c:6108 __napi_poll+0xe7/0x980 net/core/dev.c:6772 napi_poll net/core/dev.c:6841 [inline] net_rx_action+0xa5a/0x19b0 net/core/dev.c:6963 handle_softirqs+0x1ce/0x800 kernel/softirq.c:554 __do_softirq+0x14/0x1a kernel/softirq.c:588 Uninit was stored to memory at: nf_reject_ip6_tcphdr_put+0x2ca/0x6c0 net/ipv6/netfilter/nf_reject_ipv6.c:231 nf_send_reset6+0xd84/0x15b0 net/ipv6/netfilter/nf_reject_ipv6.c:344 nft_reject_inet_eval+0x3c1/0x880 net/netfilter/nft_reject_inet.c:48 expr_call_ops_eval net/netfilter/nf_tables_core.c:240 [inline] nft_do_chain+0x438/0x22a0 net/netfilter/nf_tables_core.c:288 nft_do_chain_inet+0x41a/0x4f0 net/netfilter/nft_chain_filter.c:161 nf_hook_entry_hookfn include/linux/netfilter.h:154 [inline] nf_hook_slow+0xf4/0x400 net/netfilter/core.c:626 nf_hook include/linux/netfilter.h:269 [inline] NF_HOOK include/linux/netfilter.h:312 [inline] ipv6_rcv+0x29b/0x390 net/ipv6/ip6_input.c:310 __netif_receive_skb_one_core net/core/dev.c:5661 [inline] __netif_receive_skb+0x1da/0xa00 net/core/dev.c:5775 process_backlog+0x4ad/0xa50 net/core/dev.c:6108 __napi_poll+0xe7/0x980 net/core/dev.c:6772 napi_poll net/core/dev.c:6841 [inline] net_rx_action+0xa5a/0x19b0 net/core/dev.c:6963 handle_softirqs+0x1ce/0x800 kernel/softirq.c:554 __do_softirq+0x14/0x1a kernel/softirq.c:588 Uninit was created at: slab_post_alloc_hook mm/slub.c:3998 [inline] slab_alloc_node mm/slub.c:4041 [inline] kmem_cache_alloc_node_noprof+0x6bf/0xb80 mm/slub.c:4084 kmalloc_reserve+0x13d/0x4a0 net/core/skbuff.c:583 __alloc_skb+0x363/0x7b0 net/core/skbuff.c:674 alloc_skb include/linux/skbuff.h:1320 [inline] nf_send_reset6+0x98d/0x15b0 net/ipv6/netfilter/nf_reject_ipv6.c:327 nft_reject_inet_eval+0x3c1/0x880 net/netfilter/nft_reject_inet.c:48 expr_call_ops_eval net/netfilter/nf_tables_core.c:240 [inline] nft_do_chain+0x438/0x22a0 net/netfilter/nf_tables_core.c:288 nft_do_chain_inet+0x41a/0x4f0 net/netfilter/nft_chain_filter.c:161 nf_hook_entry_hookfn include/linux/netfilter.h:154 [inline] nf_hook_slow+0xf4/0x400 net/netfilter/core.c:626 nf_hook include/linux/netfilter.h:269 [inline] NF_HOOK include/linux/netfilter.h:312 [inline] ipv6_rcv+0x29b/0x390 net/ipv6/ip6_input.c:310 __netif_receive_skb_one_core net/core/dev.c:5661 [inline] __netif_receive_skb+0x1da/0xa00 net/core/dev.c:5775 process_backlog+0x4ad/0xa50 net/core/dev.c:6108 __napi_poll+0xe7/0x980 net/core/dev.c:6772 napi_poll net/core/dev.c:6841 [inline] net_rx_action+0xa5a/0x19b0 net/core/dev.c:6963 handle_softirqs+0x1ce/0x800 kernel/softirq.c:554 __do_softirq+0x14/0x1a kernel/softirq.c:588 Fixes: c8d7b98bec43 ("netfilter: move nf_send_resetX() code to nf_reject_ipvX modules") Reported-by: syzbot Signed-off-by: Eric Dumazet Reviewed-by: Simon Horman Reviewed-by: Pablo Neira Ayuso Link: https://patch.msgid.link/20240913170615.3670897-1-edumazet@google.com Signed-off-by: Paolo Abeni --- net/ipv6/netfilter/nf_reject_ipv6.c | 14 ++------------ 1 file changed, 2 insertions(+), 12 deletions(-) diff --git a/net/ipv6/netfilter/nf_reject_ipv6.c b/net/ipv6/netfilter/nf_reject_ipv6.c index dedee264b8f6c..b9457473c176d 100644 --- a/net/ipv6/netfilter/nf_reject_ipv6.c +++ b/net/ipv6/netfilter/nf_reject_ipv6.c @@ -223,33 +223,23 @@ void nf_reject_ip6_tcphdr_put(struct sk_buff *nskb, const struct tcphdr *oth, unsigned int otcplen) { struct tcphdr *tcph; - int needs_ack; skb_reset_transport_header(nskb); - tcph = skb_put(nskb, sizeof(struct tcphdr)); + tcph = skb_put_zero(nskb, sizeof(struct tcphdr)); /* Truncate to length (no data) */ tcph->doff = sizeof(struct tcphdr)/4; tcph->source = oth->dest; tcph->dest = oth->source; if (oth->ack) { - needs_ack = 0; tcph->seq = oth->ack_seq; - tcph->ack_seq = 0; } else { - needs_ack = 1; tcph->ack_seq = htonl(ntohl(oth->seq) + oth->syn + oth->fin + otcplen - (oth->doff<<2)); - tcph->seq = 0; + tcph->ack = 1; } - /* Reset flags */ - ((u_int8_t *)tcph)[13] = 0; tcph->rst = 1; - tcph->ack = needs_ack; - tcph->window = 0; - tcph->urg_ptr = 0; - tcph->check = 0; /* Adjust TCP checksum */ tcph->check = csum_ipv6_magic(&ipv6_hdr(nskb)->saddr, From b5109b60ee4fcb2f2bb24f589575e10cc5283ad4 Mon Sep 17 00:00:00 2001 From: Kaixin Wang Date: Sun, 15 Sep 2024 22:40:46 +0800 Subject: [PATCH 05/35] net: seeq: Fix use after free vulnerability in ether3 Driver Due to Race Condition In the ether3_probe function, a timer is initialized with a callback function ether3_ledoff, bound to &prev(dev)->timer. Once the timer is started, there is a risk of a race condition if the module or device is removed, triggering the ether3_remove function to perform cleanup. The sequence of operations that may lead to a UAF bug is as follows: CPU0 CPU1 | ether3_ledoff ether3_remove | free_netdev(dev); | put_devic | kfree(dev); | | ether3_outw(priv(dev)->regs.config2 |= CFG2_CTRLO, REG_CONFIG2); | // use dev Fix it by ensuring that the timer is canceled before proceeding with the cleanup in ether3_remove. Fixes: 6fd9c53f7186 ("net: seeq: Convert timers to use timer_setup()") Signed-off-by: Kaixin Wang Link: https://patch.msgid.link/20240915144045.451-1-kxwang23@m.fudan.edu.cn Signed-off-by: Paolo Abeni --- drivers/net/ethernet/seeq/ether3.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/drivers/net/ethernet/seeq/ether3.c b/drivers/net/ethernet/seeq/ether3.c index c672f92d65e97..9319a2675e7b6 100644 --- a/drivers/net/ethernet/seeq/ether3.c +++ b/drivers/net/ethernet/seeq/ether3.c @@ -847,9 +847,11 @@ static void ether3_remove(struct expansion_card *ec) { struct net_device *dev = ecard_get_drvdata(ec); + ether3_outw(priv(dev)->regs.config2 |= CFG2_CTRLO, REG_CONFIG2); ecard_set_drvdata(ec, NULL); unregister_netdev(dev); + del_timer_sync(&priv(dev)->timer); free_netdev(dev); ecard_release_resources(ec); } From 93c21077bb9ba08807c459982d440dbbee4c7af3 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Thomas=20Wei=C3=9Fschuh?= Date: Mon, 16 Sep 2024 20:57:13 +0200 Subject: [PATCH 06/35] net: ipv6: select DST_CACHE from IPV6_RPL_LWTUNNEL MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The rpl sr tunnel code contains calls to dst_cache_*() which are only present when the dst cache is built. Select DST_CACHE to build the dst cache, similar to other kconfig options in the same file. Compiling the rpl sr tunnel without DST_CACHE will lead to linker errors. Fixes: a7a29f9c361f ("net: ipv6: add rpl sr tunnel") Signed-off-by: Thomas Weißschuh Reviewed-by: Simon Horman Tested-by: Simon Horman # build-tested Signed-off-by: David S. Miller --- net/ipv6/Kconfig | 1 + 1 file changed, 1 insertion(+) diff --git a/net/ipv6/Kconfig b/net/ipv6/Kconfig index 08d4b7132d4c4..1c9c686d9522f 100644 --- a/net/ipv6/Kconfig +++ b/net/ipv6/Kconfig @@ -323,6 +323,7 @@ config IPV6_RPL_LWTUNNEL bool "IPv6: RPL Source Routing Header support" depends on IPV6 select LWTUNNEL + select DST_CACHE help Support for RFC6554 RPL Source Routing Header using the lightweight tunnels mechanism. From 7ebf44c910690a7097442d4dd68f12315569b2f4 Mon Sep 17 00:00:00 2001 From: Lukas Bulwahn Date: Tue, 17 Sep 2024 13:15:03 +0200 Subject: [PATCH 07/35] MAINTAINERS: adjust file entry of the oa_tc6 header Commit aa58bec064ab ("net: ethernet: oa_tc6: implement register write operation") adds two new file entries to OPEN ALLIANCE 10BASE-T1S MACPHY SERIAL INTERFACE FRAMEWORK. One of the two entries mistakenly refers to drivers/include/linux/oa_tc6.h, whereas the intent is clearly to refer to include/linux/oa_tc6.h. Hence, ./scripts/get_maintainer.pl --self-test=patterns complains about a broken reference. Adjust the file entry to the intended location. Signed-off-by: Lukas Bulwahn Reviewed-by: Simon Horman Signed-off-by: David S. Miller --- MAINTAINERS | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/MAINTAINERS b/MAINTAINERS index 3bce6cc05553d..3af7f228a010c 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -17131,8 +17131,8 @@ M: Parthiban Veerasooran L: netdev@vger.kernel.org S: Maintained F: Documentation/networking/oa-tc6-framework.rst -F: drivers/include/linux/oa_tc6.h F: drivers/net/ethernet/oa_tc6.c +F: include/linux/oa_tc6.h OPEN FIRMWARE AND FLATTENED DEVICE TREE M: Rob Herring From c8770db2d54437a5f49417ae7b46f7de23d14db6 Mon Sep 17 00:00:00 2001 From: Josh Hunt Date: Tue, 10 Sep 2024 15:08:22 -0400 Subject: [PATCH 08/35] tcp: check skb is non-NULL in tcp_rto_delta_us() We have some machines running stock Ubuntu 20.04.6 which is their 5.4.0-174-generic kernel that are running ceph and recently hit a null ptr dereference in tcp_rearm_rto(). Initially hitting it from the TLP path, but then later we also saw it getting hit from the RACK case as well. Here are examples of the oops messages we saw in each of those cases: Jul 26 15:05:02 rx [11061395.780353] BUG: kernel NULL pointer dereference, address: 0000000000000020 Jul 26 15:05:02 rx [11061395.787572] #PF: supervisor read access in kernel mode Jul 26 15:05:02 rx [11061395.792971] #PF: error_code(0x0000) - not-present page Jul 26 15:05:02 rx [11061395.798362] PGD 0 P4D 0 Jul 26 15:05:02 rx [11061395.801164] Oops: 0000 [#1] SMP NOPTI Jul 26 15:05:02 rx [11061395.805091] CPU: 0 PID: 9180 Comm: msgr-worker-1 Tainted: G W 5.4.0-174-generic #193-Ubuntu Jul 26 15:05:02 rx [11061395.814996] Hardware name: Supermicro SMC 2x26 os-gen8 64C NVME-Y 256G/H12SSW-NTR, BIOS 2.5.V1.2U.NVMe.UEFI 05/09/2023 Jul 26 15:05:02 rx [11061395.825952] RIP: 0010:tcp_rearm_rto+0xe4/0x160 Jul 26 15:05:02 rx [11061395.830656] Code: 87 ca 04 00 00 00 5b 41 5c 41 5d 5d c3 c3 49 8b bc 24 40 06 00 00 eb 8d 48 bb cf f7 53 e3 a5 9b c4 20 4c 89 ef e8 0c fe 0e 00 <48> 8b 78 20 48 c1 ef 03 48 89 f8 41 8b bc 24 80 04 00 00 48 f7 e3 Jul 26 15:05:02 rx [11061395.849665] RSP: 0018:ffffb75d40003e08 EFLAGS: 00010246 Jul 26 15:05:02 rx [11061395.855149] RAX: 0000000000000000 RBX: 20c49ba5e353f7cf RCX: 0000000000000000 Jul 26 15:05:02 rx [11061395.862542] RDX: 0000000062177c30 RSI: 000000000000231c RDI: ffff9874ad283a60 Jul 26 15:05:02 rx [11061395.869933] RBP: ffffb75d40003e20 R08: 0000000000000000 R09: ffff987605e20aa8 Jul 26 15:05:02 rx [11061395.877318] R10: ffffb75d40003f00 R11: ffffb75d4460f740 R12: ffff9874ad283900 Jul 26 15:05:02 rx [11061395.884710] R13: ffff9874ad283a60 R14: ffff9874ad283980 R15: ffff9874ad283d30 Jul 26 15:05:02 rx [11061395.892095] FS: 00007f1ef4a2e700(0000) GS:ffff987605e00000(0000) knlGS:0000000000000000 Jul 26 15:05:02 rx [11061395.900438] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 Jul 26 15:05:02 rx [11061395.906435] CR2: 0000000000000020 CR3: 0000003e450ba003 CR4: 0000000000760ef0 Jul 26 15:05:02 rx [11061395.913822] PKRU: 55555554 Jul 26 15:05:02 rx [11061395.916786] Call Trace: Jul 26 15:05:02 rx [11061395.919488] Jul 26 15:05:02 rx [11061395.921765] ? show_regs.cold+0x1a/0x1f Jul 26 15:05:02 rx [11061395.925859] ? __die+0x90/0xd9 Jul 26 15:05:02 rx [11061395.929169] ? no_context+0x196/0x380 Jul 26 15:05:02 rx [11061395.933088] ? ip6_protocol_deliver_rcu+0x4e0/0x4e0 Jul 26 15:05:02 rx [11061395.938216] ? ip6_sublist_rcv_finish+0x3d/0x50 Jul 26 15:05:02 rx [11061395.943000] ? __bad_area_nosemaphore+0x50/0x1a0 Jul 26 15:05:02 rx [11061395.947873] ? bad_area_nosemaphore+0x16/0x20 Jul 26 15:05:02 rx [11061395.952486] ? do_user_addr_fault+0x267/0x450 Jul 26 15:05:02 rx [11061395.957104] ? ipv6_list_rcv+0x112/0x140 Jul 26 15:05:02 rx [11061395.961279] ? __do_page_fault+0x58/0x90 Jul 26 15:05:02 rx [11061395.965458] ? do_page_fault+0x2c/0xe0 Jul 26 15:05:02 rx [11061395.969465] ? page_fault+0x34/0x40 Jul 26 15:05:02 rx [11061395.973217] ? tcp_rearm_rto+0xe4/0x160 Jul 26 15:05:02 rx [11061395.977313] ? tcp_rearm_rto+0xe4/0x160 Jul 26 15:05:02 rx [11061395.981408] tcp_send_loss_probe+0x10b/0x220 Jul 26 15:05:02 rx [11061395.985937] tcp_write_timer_handler+0x1b4/0x240 Jul 26 15:05:02 rx [11061395.990809] tcp_write_timer+0x9e/0xe0 Jul 26 15:05:02 rx [11061395.994814] ? tcp_write_timer_handler+0x240/0x240 Jul 26 15:05:02 rx [11061395.999866] call_timer_fn+0x32/0x130 Jul 26 15:05:02 rx [11061396.003782] __run_timers.part.0+0x180/0x280 Jul 26 15:05:02 rx [11061396.008309] ? recalibrate_cpu_khz+0x10/0x10 Jul 26 15:05:02 rx [11061396.012841] ? native_x2apic_icr_write+0x30/0x30 Jul 26 15:05:02 rx [11061396.017718] ? lapic_next_event+0x21/0x30 Jul 26 15:05:02 rx [11061396.021984] ? clockevents_program_event+0x8f/0xe0 Jul 26 15:05:02 rx [11061396.027035] run_timer_softirq+0x2a/0x50 Jul 26 15:05:02 rx [11061396.031212] __do_softirq+0xd1/0x2c1 Jul 26 15:05:02 rx [11061396.035044] do_softirq_own_stack+0x2a/0x40 Jul 26 15:05:02 rx [11061396.039480] Jul 26 15:05:02 rx [11061396.041840] do_softirq.part.0+0x46/0x50 Jul 26 15:05:02 rx [11061396.046022] __local_bh_enable_ip+0x50/0x60 Jul 26 15:05:02 rx [11061396.050460] _raw_spin_unlock_bh+0x1e/0x20 Jul 26 15:05:02 rx [11061396.054817] nf_conntrack_tcp_packet+0x29e/0xbe0 [nf_conntrack] Jul 26 15:05:02 rx [11061396.060994] ? get_l4proto+0xe7/0x190 [nf_conntrack] Jul 26 15:05:02 rx [11061396.066220] nf_conntrack_in+0xe9/0x670 [nf_conntrack] Jul 26 15:05:02 rx [11061396.071618] ipv6_conntrack_local+0x14/0x20 [nf_conntrack] Jul 26 15:05:02 rx [11061396.077356] nf_hook_slow+0x45/0xb0 Jul 26 15:05:02 rx [11061396.081098] ip6_xmit+0x3f0/0x5d0 Jul 26 15:05:02 rx [11061396.084670] ? ipv6_anycast_cleanup+0x50/0x50 Jul 26 15:05:02 rx [11061396.089282] ? __sk_dst_check+0x38/0x70 Jul 26 15:05:02 rx [11061396.093381] ? inet6_csk_route_socket+0x13b/0x200 Jul 26 15:05:02 rx [11061396.098346] inet6_csk_xmit+0xa7/0xf0 Jul 26 15:05:02 rx [11061396.102263] __tcp_transmit_skb+0x550/0xb30 Jul 26 15:05:02 rx [11061396.106701] tcp_write_xmit+0x3c6/0xc20 Jul 26 15:05:02 rx [11061396.110792] ? __alloc_skb+0x98/0x1d0 Jul 26 15:05:02 rx [11061396.114708] __tcp_push_pending_frames+0x37/0x100 Jul 26 15:05:02 rx [11061396.119667] tcp_push+0xfd/0x100 Jul 26 15:05:02 rx [11061396.123150] tcp_sendmsg_locked+0xc70/0xdd0 Jul 26 15:05:02 rx [11061396.127588] tcp_sendmsg+0x2d/0x50 Jul 26 15:05:02 rx [11061396.131245] inet6_sendmsg+0x43/0x70 Jul 26 15:05:02 rx [11061396.135075] __sock_sendmsg+0x48/0x70 Jul 26 15:05:02 rx [11061396.138994] ____sys_sendmsg+0x212/0x280 Jul 26 15:05:02 rx [11061396.143172] ___sys_sendmsg+0x88/0xd0 Jul 26 15:05:02 rx [11061396.147098] ? __seccomp_filter+0x7e/0x6b0 Jul 26 15:05:02 rx [11061396.151446] ? __switch_to+0x39c/0x460 Jul 26 15:05:02 rx [11061396.155453] ? __switch_to_asm+0x42/0x80 Jul 26 15:05:02 rx [11061396.159636] ? __switch_to_asm+0x5a/0x80 Jul 26 15:05:02 rx [11061396.163816] __sys_sendmsg+0x5c/0xa0 Jul 26 15:05:02 rx [11061396.167647] __x64_sys_sendmsg+0x1f/0x30 Jul 26 15:05:02 rx [11061396.171832] do_syscall_64+0x57/0x190 Jul 26 15:05:02 rx [11061396.175748] entry_SYSCALL_64_after_hwframe+0x5c/0xc1 Jul 26 15:05:02 rx [11061396.181055] RIP: 0033:0x7f1ef692618d Jul 26 15:05:02 rx [11061396.184893] Code: 28 89 54 24 1c 48 89 74 24 10 89 7c 24 08 e8 ca ee ff ff 8b 54 24 1c 48 8b 74 24 10 41 89 c0 8b 7c 24 08 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 2f 44 89 c7 48 89 44 24 08 e8 fe ee ff ff 48 Jul 26 15:05:02 rx [11061396.203889] RSP: 002b:00007f1ef4a26aa0 EFLAGS: 00000293 ORIG_RAX: 000000000000002e Jul 26 15:05:02 rx [11061396.211708] RAX: ffffffffffffffda RBX: 000000000000084b RCX: 00007f1ef692618d Jul 26 15:05:02 rx [11061396.219091] RDX: 0000000000004000 RSI: 00007f1ef4a26b10 RDI: 0000000000000275 Jul 26 15:05:02 rx [11061396.226475] RBP: 0000000000004000 R08: 0000000000000000 R09: 0000000000000020 Jul 26 15:05:02 rx [11061396.233859] R10: 0000000000000000 R11: 0000000000000293 R12: 000000000000084b Jul 26 15:05:02 rx [11061396.241243] R13: 00007f1ef4a26b10 R14: 0000000000000275 R15: 000055592030f1e8 Jul 26 15:05:02 rx [11061396.248628] Modules linked in: vrf bridge stp llc vxlan ip6_udp_tunnel udp_tunnel nls_iso8859_1 amd64_edac_mod edac_mce_amd kvm_amd kvm crct10dif_pclmul ghash_clmulni_intel aesni_intel crypto_simd cryptd glue_helper wmi_bmof ipmi_ssif input_leds joydev rndis_host cdc_ether usbnet mii ast drm_vram_helper ttm drm_kms_helper i2c_algo_bit fb_sys_fops syscopyarea sysfillrect sysimgblt ccp mac_hid ipmi_si ipmi_devintf ipmi_msghandler nft_ct sch_fq_codel nf_tables_set nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nf_tables nfnetlink ramoops reed_solomon efi_pstore drm ip_tables x_tables autofs4 raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid0 multipath linear mlx5_ib ib_uverbs ib_core raid1 mlx5_core hid_generic pci_hyperv_intf crc32_pclmul tls usbhid ahci mlxfw bnxt_en libahci hid nvme i2c_piix4 nvme_core wmi Jul 26 15:05:02 rx [11061396.324334] CR2: 0000000000000020 Jul 26 15:05:02 rx [11061396.327944] ---[ end trace 68a2b679d1cfb4f1 ]--- Jul 26 15:05:02 rx [11061396.433435] RIP: 0010:tcp_rearm_rto+0xe4/0x160 Jul 26 15:05:02 rx [11061396.438137] Code: 87 ca 04 00 00 00 5b 41 5c 41 5d 5d c3 c3 49 8b bc 24 40 06 00 00 eb 8d 48 bb cf f7 53 e3 a5 9b c4 20 4c 89 ef e8 0c fe 0e 00 <48> 8b 78 20 48 c1 ef 03 48 89 f8 41 8b bc 24 80 04 00 00 48 f7 e3 Jul 26 15:05:02 rx [11061396.457144] RSP: 0018:ffffb75d40003e08 EFLAGS: 00010246 Jul 26 15:05:02 rx [11061396.462629] RAX: 0000000000000000 RBX: 20c49ba5e353f7cf RCX: 0000000000000000 Jul 26 15:05:02 rx [11061396.470012] RDX: 0000000062177c30 RSI: 000000000000231c RDI: ffff9874ad283a60 Jul 26 15:05:02 rx [11061396.477396] RBP: ffffb75d40003e20 R08: 0000000000000000 R09: ffff987605e20aa8 Jul 26 15:05:02 rx [11061396.484779] R10: ffffb75d40003f00 R11: ffffb75d4460f740 R12: ffff9874ad283900 Jul 26 15:05:02 rx [11061396.492164] R13: ffff9874ad283a60 R14: ffff9874ad283980 R15: ffff9874ad283d30 Jul 26 15:05:02 rx [11061396.499547] FS: 00007f1ef4a2e700(0000) GS:ffff987605e00000(0000) knlGS:0000000000000000 Jul 26 15:05:02 rx [11061396.507886] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 Jul 26 15:05:02 rx [11061396.513884] CR2: 0000000000000020 CR3: 0000003e450ba003 CR4: 0000000000760ef0 Jul 26 15:05:02 rx [11061396.521267] PKRU: 55555554 Jul 26 15:05:02 rx [11061396.524230] Kernel panic - not syncing: Fatal exception in interrupt Jul 26 15:05:02 rx [11061396.530885] Kernel Offset: 0x1b200000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff) Jul 26 15:05:03 rx [11061396.660181] ---[ end Kernel panic - not syncing: Fatal exception in interrupt ]--- After we hit this we disabled TLP by setting tcp_early_retrans to 0 and then hit the crash in the RACK case: Aug 7 07:26:16 rx [1006006.265582] BUG: kernel NULL pointer dereference, address: 0000000000000020 Aug 7 07:26:16 rx [1006006.272719] #PF: supervisor read access in kernel mode Aug 7 07:26:16 rx [1006006.278030] #PF: error_code(0x0000) - not-present page Aug 7 07:26:16 rx [1006006.283343] PGD 0 P4D 0 Aug 7 07:26:16 rx [1006006.286057] Oops: 0000 [#1] SMP NOPTI Aug 7 07:26:16 rx [1006006.289896] CPU: 5 PID: 0 Comm: swapper/5 Tainted: G W 5.4.0-174-generic #193-Ubuntu Aug 7 07:26:16 rx [1006006.299107] Hardware name: Supermicro SMC 2x26 os-gen8 64C NVME-Y 256G/H12SSW-NTR, BIOS 2.5.V1.2U.NVMe.UEFI 05/09/2023 Aug 7 07:26:16 rx [1006006.309970] RIP: 0010:tcp_rearm_rto+0xe4/0x160 Aug 7 07:26:16 rx [1006006.314584] Code: 87 ca 04 00 00 00 5b 41 5c 41 5d 5d c3 c3 49 8b bc 24 40 06 00 00 eb 8d 48 bb cf f7 53 e3 a5 9b c4 20 4c 89 ef e8 0c fe 0e 00 <48> 8b 78 20 48 c1 ef 03 48 89 f8 41 8b bc 24 80 04 00 00 48 f7 e3 Aug 7 07:26:16 rx [1006006.333499] RSP: 0018:ffffb42600a50960 EFLAGS: 00010246 Aug 7 07:26:16 rx [1006006.338895] RAX: 0000000000000000 RBX: 20c49ba5e353f7cf RCX: 0000000000000000 Aug 7 07:26:16 rx [1006006.346193] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff92d687ed8160 Aug 7 07:26:16 rx [1006006.353489] RBP: ffffb42600a50978 R08: 0000000000000000 R09: 00000000cd896dcc Aug 7 07:26:16 rx [1006006.360786] R10: ffff92dc3404f400 R11: 0000000000000001 R12: ffff92d687ed8000 Aug 7 07:26:16 rx [1006006.368084] R13: ffff92d687ed8160 R14: 00000000cd896dcc R15: 00000000cd8fca81 Aug 7 07:26:16 rx [1006006.375381] FS: 0000000000000000(0000) GS:ffff93158ad40000(0000) knlGS:0000000000000000 Aug 7 07:26:16 rx [1006006.383632] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 Aug 7 07:26:16 rx [1006006.389544] CR2: 0000000000000020 CR3: 0000003e775ce006 CR4: 0000000000760ee0 Aug 7 07:26:16 rx [1006006.396839] PKRU: 55555554 Aug 7 07:26:16 rx [1006006.399717] Call Trace: Aug 7 07:26:16 rx [1006006.402335] Aug 7 07:26:16 rx [1006006.404525] ? show_regs.cold+0x1a/0x1f Aug 7 07:26:16 rx [1006006.408532] ? __die+0x90/0xd9 Aug 7 07:26:16 rx [1006006.411760] ? no_context+0x196/0x380 Aug 7 07:26:16 rx [1006006.415599] ? __bad_area_nosemaphore+0x50/0x1a0 Aug 7 07:26:16 rx [1006006.420392] ? _raw_spin_lock+0x1e/0x30 Aug 7 07:26:16 rx [1006006.424401] ? bad_area_nosemaphore+0x16/0x20 Aug 7 07:26:16 rx [1006006.428927] ? do_user_addr_fault+0x267/0x450 Aug 7 07:26:16 rx [1006006.433450] ? __do_page_fault+0x58/0x90 Aug 7 07:26:16 rx [1006006.437542] ? do_page_fault+0x2c/0xe0 Aug 7 07:26:16 rx [1006006.441470] ? page_fault+0x34/0x40 Aug 7 07:26:16 rx [1006006.445134] ? tcp_rearm_rto+0xe4/0x160 Aug 7 07:26:16 rx [1006006.449145] tcp_ack+0xa32/0xb30 Aug 7 07:26:16 rx [1006006.452542] tcp_rcv_established+0x13c/0x670 Aug 7 07:26:16 rx [1006006.456981] ? sk_filter_trim_cap+0x48/0x220 Aug 7 07:26:16 rx [1006006.461419] tcp_v6_do_rcv+0xdb/0x450 Aug 7 07:26:16 rx [1006006.465257] tcp_v6_rcv+0xc2b/0xd10 Aug 7 07:26:16 rx [1006006.468918] ip6_protocol_deliver_rcu+0xd3/0x4e0 Aug 7 07:26:16 rx [1006006.473706] ip6_input_finish+0x15/0x20 Aug 7 07:26:16 rx [1006006.477710] ip6_input+0xa2/0xb0 Aug 7 07:26:16 rx [1006006.481109] ? ip6_protocol_deliver_rcu+0x4e0/0x4e0 Aug 7 07:26:16 rx [1006006.486151] ip6_sublist_rcv_finish+0x3d/0x50 Aug 7 07:26:16 rx [1006006.490679] ip6_sublist_rcv+0x1aa/0x250 Aug 7 07:26:16 rx [1006006.494779] ? ip6_rcv_finish_core.isra.0+0xa0/0xa0 Aug 7 07:26:16 rx [1006006.499828] ipv6_list_rcv+0x112/0x140 Aug 7 07:26:16 rx [1006006.503748] __netif_receive_skb_list_core+0x1a4/0x250 Aug 7 07:26:16 rx [1006006.509057] netif_receive_skb_list_internal+0x1a1/0x2b0 Aug 7 07:26:16 rx [1006006.514538] gro_normal_list.part.0+0x1e/0x40 Aug 7 07:26:16 rx [1006006.519068] napi_complete_done+0x91/0x130 Aug 7 07:26:16 rx [1006006.523352] mlx5e_napi_poll+0x18e/0x610 [mlx5_core] Aug 7 07:26:16 rx [1006006.528481] net_rx_action+0x142/0x390 Aug 7 07:26:16 rx [1006006.532398] __do_softirq+0xd1/0x2c1 Aug 7 07:26:16 rx [1006006.536142] irq_exit+0xae/0xb0 Aug 7 07:26:16 rx [1006006.539452] do_IRQ+0x5a/0xf0 Aug 7 07:26:16 rx [1006006.542590] common_interrupt+0xf/0xf Aug 7 07:26:16 rx [1006006.546421] Aug 7 07:26:16 rx [1006006.548695] RIP: 0010:native_safe_halt+0xe/0x10 Aug 7 07:26:16 rx [1006006.553399] Code: 7b ff ff ff eb bd 90 90 90 90 90 90 e9 07 00 00 00 0f 00 2d 36 2c 50 00 f4 c3 66 90 e9 07 00 00 00 0f 00 2d 26 2c 50 00 fb f4 90 0f 1f 44 00 00 55 48 89 e5 41 55 41 54 53 e8 dd 5e 61 ff 65 Aug 7 07:26:16 rx [1006006.572309] RSP: 0018:ffffb42600177e70 EFLAGS: 00000246 ORIG_RAX: ffffffffffffffc2 Aug 7 07:26:16 rx [1006006.580040] RAX: ffffffff8ed08b20 RBX: 0000000000000005 RCX: 0000000000000001 Aug 7 07:26:16 rx [1006006.587337] RDX: 00000000f48eeca2 RSI: 0000000000000082 RDI: 0000000000000082 Aug 7 07:26:16 rx [1006006.594635] RBP: ffffb42600177e90 R08: 0000000000000000 R09: 000000000000020f Aug 7 07:26:16 rx [1006006.601931] R10: 0000000000100000 R11: 0000000000000000 R12: 0000000000000005 Aug 7 07:26:16 rx [1006006.609229] R13: ffff93157deb5f00 R14: 0000000000000000 R15: 0000000000000000 Aug 7 07:26:16 rx [1006006.616530] ? __cpuidle_text_start+0x8/0x8 Aug 7 07:26:16 rx [1006006.620886] ? default_idle+0x20/0x140 Aug 7 07:26:16 rx [1006006.624804] arch_cpu_idle+0x15/0x20 Aug 7 07:26:16 rx [1006006.628545] default_idle_call+0x23/0x30 Aug 7 07:26:16 rx [1006006.632640] do_idle+0x1fb/0x270 Aug 7 07:26:16 rx [1006006.636035] cpu_startup_entry+0x20/0x30 Aug 7 07:26:16 rx [1006006.640126] start_secondary+0x178/0x1d0 Aug 7 07:26:16 rx [1006006.644218] secondary_startup_64+0xa4/0xb0 Aug 7 07:26:17 rx [1006006.648568] Modules linked in: vrf bridge stp llc vxlan ip6_udp_tunnel udp_tunnel nls_iso8859_1 nft_ct amd64_edac_mod edac_mce_amd kvm_amd kvm crct10dif_pclmul ghash_clmulni_intel aesni_intel crypto_simd cryptd glue_helper wmi_bmof ipmi_ssif input_leds joydev rndis_host cdc_ether usbnet ast mii drm_vram_helper ttm drm_kms_helper i2c_algo_bit fb_sys_fops syscopyarea sysfillrect sysimgblt ccp mac_hid ipmi_si ipmi_devintf ipmi_msghandler sch_fq_codel nf_tables_set nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nf_tables nfnetlink ramoops reed_solomon efi_pstore drm ip_tables x_tables autofs4 raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid0 multipath linear mlx5_ib ib_uverbs ib_core raid1 hid_generic mlx5_core pci_hyperv_intf crc32_pclmul usbhid ahci tls mlxfw bnxt_en hid libahci nvme i2c_piix4 nvme_core wmi [last unloaded: cpuid] Aug 7 07:26:17 rx [1006006.726180] CR2: 0000000000000020 Aug 7 07:26:17 rx [1006006.729718] ---[ end trace e0e2e37e4e612984 ]--- Prior to seeing the first crash and on other machines we also see the warning in tcp_send_loss_probe() where packets_out is non-zero, but both transmit and retrans queues are empty so we know the box is seeing some accounting issue in this area: Jul 26 09:15:27 kernel: ------------[ cut here ]------------ Jul 26 09:15:27 kernel: invalid inflight: 2 state 1 cwnd 68 mss 8988 Jul 26 09:15:27 kernel: WARNING: CPU: 16 PID: 0 at net/ipv4/tcp_output.c:2605 tcp_send_loss_probe+0x214/0x220 Jul 26 09:15:27 kernel: Modules linked in: vrf bridge stp llc vxlan ip6_udp_tunnel udp_tunnel nls_iso8859_1 nft_ct amd64_edac_mod edac_mce_amd kvm_amd kvm crct10dif_pclmul ghash_clmulni_intel aesni_intel crypto_simd cryptd glue_helper wmi_bmof ipmi_ssif joydev input_leds rndis_host cdc_ether usbnet mii ast drm_vram_helper ttm drm_kms_he> Jul 26 09:15:27 kernel: CPU: 16 PID: 0 Comm: swapper/16 Not tainted 5.4.0-174-generic #193-Ubuntu Jul 26 09:15:27 kernel: Hardware name: Supermicro SMC 2x26 os-gen8 64C NVME-Y 256G/H12SSW-NTR, BIOS 2.5.V1.2U.NVMe.UEFI 05/09/2023 Jul 26 09:15:27 kernel: RIP: 0010:tcp_send_loss_probe+0x214/0x220 Jul 26 09:15:27 kernel: Code: 08 26 01 00 75 e2 41 0f b6 54 24 12 41 8b 8c 24 c0 06 00 00 45 89 f0 48 c7 c7 e0 b4 20 a7 c6 05 8d 08 26 01 01 e8 4a c0 0f 00 <0f> 0b eb ba 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 41 Jul 26 09:15:27 kernel: RSP: 0018:ffffb7838088ce00 EFLAGS: 00010286 Jul 26 09:15:27 kernel: RAX: 0000000000000000 RBX: ffff9b84b5630430 RCX: 0000000000000006 Jul 26 09:15:27 kernel: RDX: 0000000000000007 RSI: 0000000000000096 RDI: ffff9b8e4621c8c0 Jul 26 09:15:27 kernel: RBP: ffffb7838088ce18 R08: 0000000000000927 R09: 0000000000000004 Jul 26 09:15:27 kernel: R10: 0000000000000000 R11: 0000000000000001 R12: ffff9b84b5630000 Jul 26 09:15:27 kernel: R13: 0000000000000000 R14: 000000000000231c R15: ffff9b84b5630430 Jul 26 09:15:27 kernel: FS: 0000000000000000(0000) GS:ffff9b8e46200000(0000) knlGS:0000000000000000 Jul 26 09:15:27 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 Jul 26 09:15:27 kernel: CR2: 000056238cec2380 CR3: 0000003e49ede005 CR4: 0000000000760ee0 Jul 26 09:15:27 kernel: PKRU: 55555554 Jul 26 09:15:27 kernel: Call Trace: Jul 26 09:15:27 kernel: Jul 26 09:15:27 kernel: ? show_regs.cold+0x1a/0x1f Jul 26 09:15:27 kernel: ? __warn+0x98/0xe0 Jul 26 09:15:27 kernel: ? tcp_send_loss_probe+0x214/0x220 Jul 26 09:15:27 kernel: ? report_bug+0xd1/0x100 Jul 26 09:15:27 kernel: ? do_error_trap+0x9b/0xc0 Jul 26 09:15:27 kernel: ? do_invalid_op+0x3c/0x50 Jul 26 09:15:27 kernel: ? tcp_send_loss_probe+0x214/0x220 Jul 26 09:15:27 kernel: ? invalid_op+0x1e/0x30 Jul 26 09:15:27 kernel: ? tcp_send_loss_probe+0x214/0x220 Jul 26 09:15:27 kernel: tcp_write_timer_handler+0x1b4/0x240 Jul 26 09:15:27 kernel: tcp_write_timer+0x9e/0xe0 Jul 26 09:15:27 kernel: ? tcp_write_timer_handler+0x240/0x240 Jul 26 09:15:27 kernel: call_timer_fn+0x32/0x130 Jul 26 09:15:27 kernel: __run_timers.part.0+0x180/0x280 Jul 26 09:15:27 kernel: ? timerqueue_add+0x9b/0xb0 Jul 26 09:15:27 kernel: ? enqueue_hrtimer+0x3d/0x90 Jul 26 09:15:27 kernel: ? do_error_trap+0x9b/0xc0 Jul 26 09:15:27 kernel: ? do_invalid_op+0x3c/0x50 Jul 26 09:15:27 kernel: ? tcp_send_loss_probe+0x214/0x220 Jul 26 09:15:27 kernel: ? invalid_op+0x1e/0x30 Jul 26 09:15:27 kernel: ? tcp_send_loss_probe+0x214/0x220 Jul 26 09:15:27 kernel: tcp_write_timer_handler+0x1b4/0x240 Jul 26 09:15:27 kernel: tcp_write_timer+0x9e/0xe0 Jul 26 09:15:27 kernel: ? tcp_write_timer_handler+0x240/0x240 Jul 26 09:15:27 kernel: call_timer_fn+0x32/0x130 Jul 26 09:15:27 kernel: __run_timers.part.0+0x180/0x280 Jul 26 09:15:27 kernel: ? timerqueue_add+0x9b/0xb0 Jul 26 09:15:27 kernel: ? enqueue_hrtimer+0x3d/0x90 Jul 26 09:15:27 kernel: ? recalibrate_cpu_khz+0x10/0x10 Jul 26 09:15:27 kernel: ? ktime_get+0x3e/0xa0 Jul 26 09:15:27 kernel: ? native_x2apic_icr_write+0x30/0x30 Jul 26 09:15:27 kernel: run_timer_softirq+0x2a/0x50 Jul 26 09:15:27 kernel: __do_softirq+0xd1/0x2c1 Jul 26 09:15:27 kernel: irq_exit+0xae/0xb0 Jul 26 09:15:27 kernel: smp_apic_timer_interrupt+0x7b/0x140 Jul 26 09:15:27 kernel: apic_timer_interrupt+0xf/0x20 Jul 26 09:15:27 kernel: Jul 26 09:15:27 kernel: RIP: 0010:native_safe_halt+0xe/0x10 Jul 26 09:15:27 kernel: Code: 7b ff ff ff eb bd 90 90 90 90 90 90 e9 07 00 00 00 0f 00 2d 36 2c 50 00 f4 c3 66 90 e9 07 00 00 00 0f 00 2d 26 2c 50 00 fb f4 90 0f 1f 44 00 00 55 48 89 e5 41 55 41 54 53 e8 dd 5e 61 ff 65 Jul 26 09:15:27 kernel: RSP: 0018:ffffb783801cfe70 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13 Jul 26 09:15:27 kernel: RAX: ffffffffa6908b20 RBX: 0000000000000010 RCX: 0000000000000001 Jul 26 09:15:27 kernel: RDX: 000000006fc0c97e RSI: 0000000000000082 RDI: 0000000000000082 Jul 26 09:15:27 kernel: RBP: ffffb783801cfe90 R08: 0000000000000000 R09: 0000000000000225 Jul 26 09:15:27 kernel: R10: 0000000000100000 R11: 0000000000000000 R12: 0000000000000010 Jul 26 09:15:27 kernel: R13: ffff9b8e390b0000 R14: 0000000000000000 R15: 0000000000000000 Jul 26 09:15:27 kernel: ? __cpuidle_text_start+0x8/0x8 Jul 26 09:15:27 kernel: ? default_idle+0x20/0x140 Jul 26 09:15:27 kernel: arch_cpu_idle+0x15/0x20 Jul 26 09:15:27 kernel: default_idle_call+0x23/0x30 Jul 26 09:15:27 kernel: do_idle+0x1fb/0x270 Jul 26 09:15:27 kernel: cpu_startup_entry+0x20/0x30 Jul 26 09:15:27 kernel: start_secondary+0x178/0x1d0 Jul 26 09:15:27 kernel: secondary_startup_64+0xa4/0xb0 Jul 26 09:15:27 kernel: ---[ end trace e7ac822987e33be1 ]--- The NULL ptr deref is coming from tcp_rto_delta_us() attempting to pull an skb off the head of the retransmit queue and then dereferencing that skb to get the skb_mstamp_ns value via tcp_skb_timestamp_us(skb). The crash is the same one that was reported a # of years ago here: https://lore.kernel.org/netdev/86c0f836-9a7c-438b-d81a-839be45f1f58@gmail.com/T/#t and the kernel we're running has the fix which was added to resolve this issue. Unfortunately we've been unsuccessful so far in reproducing this problem in the lab and do not have the luxury of pushing out a new kernel to try and test if newer kernels resolve this issue at the moment. I realize this is a report against both an Ubuntu kernel and also an older 5.4 kernel. I have reported this issue to Ubuntu here: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2077657 however I feel like since this issue has possibly cropped up again it makes sense to build in some protection in this path (even on the latest kernel versions) since the code in question just blindly assumes there's a valid skb without testing if it's NULL b/f it looks at the timestamp. Given we have seen crashes in this path before and now this case it seems like we should protect ourselves for when packets_out accounting is incorrect. While we should fix that root cause we should also just make sure the skb is not NULL before dereferencing it. Also add a warn once here to capture some information if/when the problem case is hit again. Fixes: e1a10ef7fa87 ("tcp: introduce tcp_rto_delta_us() helper for xmit timer fix") Signed-off-by: Josh Hunt Acked-by: Neal Cardwell Signed-off-by: David S. Miller --- include/net/tcp.h | 21 +++++++++++++++++++-- 1 file changed, 19 insertions(+), 2 deletions(-) diff --git a/include/net/tcp.h b/include/net/tcp.h index f77f812bfbe72..d1948d357dade 100644 --- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -2435,9 +2435,26 @@ static inline s64 tcp_rto_delta_us(const struct sock *sk) { const struct sk_buff *skb = tcp_rtx_queue_head(sk); u32 rto = inet_csk(sk)->icsk_rto; - u64 rto_time_stamp_us = tcp_skb_timestamp_us(skb) + jiffies_to_usecs(rto); - return rto_time_stamp_us - tcp_sk(sk)->tcp_mstamp; + if (likely(skb)) { + u64 rto_time_stamp_us = tcp_skb_timestamp_us(skb) + jiffies_to_usecs(rto); + + return rto_time_stamp_us - tcp_sk(sk)->tcp_mstamp; + } else { + WARN_ONCE(1, + "rtx queue emtpy: " + "out:%u sacked:%u lost:%u retrans:%u " + "tlp_high_seq:%u sk_state:%u ca_state:%u " + "advmss:%u mss_cache:%u pmtu:%u\n", + tcp_sk(sk)->packets_out, tcp_sk(sk)->sacked_out, + tcp_sk(sk)->lost_out, tcp_sk(sk)->retrans_out, + tcp_sk(sk)->tlp_high_seq, sk->sk_state, + inet_csk(sk)->icsk_ca_state, + tcp_sk(sk)->advmss, tcp_sk(sk)->mss_cache, + inet_csk(sk)->icsk_pmtu_cookie); + return jiffies_to_usecs(rto); + } + } /* From f011b313e8ebd5b7abd8521b5119aecef403de45 Mon Sep 17 00:00:00 2001 From: Youssef Samir Date: Mon, 16 Sep 2024 19:08:58 +0200 Subject: [PATCH 09/35] net: qrtr: Update packets cloning when broadcasting When broadcasting data to multiple nodes via MHI, using skb_clone() causes all nodes to receive the same header data. This can result in packets being discarded by endpoints, leading to lost data. This issue occurs when a socket is closed, and a QRTR_TYPE_DEL_CLIENT packet is broadcasted. All nodes receive the same destination node ID, causing the node connected to the client to discard the packet and remain unaware of the client's deletion. Replace skb_clone() with pskb_copy(), to create a separate copy of the header for each sk_buff. Fixes: bdabad3e363d ("net: Add Qualcomm IPC router") Signed-off-by: Youssef Samir Reviewed-by: Jeffery Hugo Reviewed-by: Carl Vanderlip Reviewed-by: Chris Lew Link: https://patch.msgid.link/20240916170858.2382247-1-quic_yabdulra@quicinc.com Signed-off-by: Paolo Abeni --- net/qrtr/af_qrtr.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/net/qrtr/af_qrtr.c b/net/qrtr/af_qrtr.c index 41ece61eb57ab..00c51cf693f3d 100644 --- a/net/qrtr/af_qrtr.c +++ b/net/qrtr/af_qrtr.c @@ -884,7 +884,7 @@ static int qrtr_bcast_enqueue(struct qrtr_node *node, struct sk_buff *skb, mutex_lock(&qrtr_node_lock); list_for_each_entry(node, &qrtr_all_nodes, item) { - skbn = skb_clone(skb, GFP_KERNEL); + skbn = pskb_copy(skb, GFP_KERNEL); if (!skbn) break; skb_set_owner_w(skbn, skb->sk); From d2b366c43443a21d9bcf047f3ee1f09cf9792dc4 Mon Sep 17 00:00:00 2001 From: Daniel Golle Date: Tue, 17 Sep 2024 14:49:40 +0100 Subject: [PATCH 10/35] net: phy: aquantia: fix setting active_low bit phy_modify_mmd was used wrongly in aqr_phy_led_active_low_set() resulting in a no-op instead of setting the VEND1_GLOBAL_LED_DRIVE_VDD bit. Correctly set VEND1_GLOBAL_LED_DRIVE_VDD bit. Fixes: 61578f679378 ("net: phy: aquantia: add support for PHY LEDs") Signed-off-by: Daniel Golle Reviewed-by: Russell King (Oracle) Link: https://patch.msgid.link/ab963584b0a7e3b4dac39472a4b82ca264d79630.1726580902.git.daniel@makrotopia.org Signed-off-by: Paolo Abeni --- drivers/net/phy/aquantia/aquantia_leds.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/drivers/net/phy/aquantia/aquantia_leds.c b/drivers/net/phy/aquantia/aquantia_leds.c index 0516ac02c3f81..201c8df93fad9 100644 --- a/drivers/net/phy/aquantia/aquantia_leds.c +++ b/drivers/net/phy/aquantia/aquantia_leds.c @@ -120,7 +120,8 @@ int aqr_phy_led_hw_control_set(struct phy_device *phydev, u8 index, int aqr_phy_led_active_low_set(struct phy_device *phydev, int index, bool enable) { return phy_modify_mmd(phydev, MDIO_MMD_VEND1, AQR_LED_DRIVE(index), - VEND1_GLOBAL_LED_DRIVE_VDD, enable); + VEND1_GLOBAL_LED_DRIVE_VDD, + enable ? VEND1_GLOBAL_LED_DRIVE_VDD : 0); } int aqr_phy_led_polarity_set(struct phy_device *phydev, int index, unsigned long modes) From 6f9defaf99122d1af9c2562181c77bc99be0672d Mon Sep 17 00:00:00 2001 From: Daniel Golle Date: Tue, 17 Sep 2024 14:49:55 +0100 Subject: [PATCH 11/35] net: phy: aquantia: fix applying active_low bit after reset for_each_set_bit was used wrongly in aqr107_config_init() when iterating over LEDs. Drop misleading 'index' variable and call aqr_phy_led_active_low_set() for each set bit representing an LED which is driven by VDD instead of GND pin. Fixes: 61578f679378 ("net: phy: aquantia: add support for PHY LEDs") Signed-off-by: Daniel Golle Reviewed-by: Russell King (Oracle) Link: https://patch.msgid.link/9b1f0cd91f4cda54c8be56b4fe780480baf4aa0f.1726580902.git.daniel@makrotopia.org Signed-off-by: Paolo Abeni --- drivers/net/phy/aquantia/aquantia_main.c | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/drivers/net/phy/aquantia/aquantia_main.c b/drivers/net/phy/aquantia/aquantia_main.c index 57b8b8f400fd4..4d156d406bab9 100644 --- a/drivers/net/phy/aquantia/aquantia_main.c +++ b/drivers/net/phy/aquantia/aquantia_main.c @@ -489,7 +489,7 @@ static int aqr107_config_init(struct phy_device *phydev) { struct aqr107_priv *priv = phydev->priv; u32 led_active_low; - int ret, index = 0; + int ret; /* Check that the PHY interface type is compatible */ if (phydev->interface != PHY_INTERFACE_MODE_SGMII && @@ -516,10 +516,9 @@ static int aqr107_config_init(struct phy_device *phydev) /* Restore LED polarity state after reset */ for_each_set_bit(led_active_low, &priv->leds_active_low, AQR_MAX_LEDS) { - ret = aqr_phy_led_active_low_set(phydev, index, led_active_low); + ret = aqr_phy_led_active_low_set(phydev, led_active_low, true); if (ret) return ret; - index++; } return 0; From ced8e8b8f40accfcce4a2bbd8b150aa76d5eff9a Mon Sep 17 00:00:00 2001 From: Heiner Kallweit Date: Tue, 17 Sep 2024 23:04:46 +0200 Subject: [PATCH 12/35] r8169: add tally counter fields added with RTL8125 RTL8125 added fields to the tally counter, what may result in the chip dma'ing these new fields to unallocated memory. Therefore make sure that the allocated memory area is big enough to hold all of the tally counter values, even if we use only parts of it. Fixes: f1bce4ad2f1c ("r8169: add support for RTL8125") Cc: stable@vger.kernel.org Signed-off-by: Heiner Kallweit Reviewed-by: Simon Horman Link: https://patch.msgid.link/741d26a9-2b2b-485d-91d9-ecb302e345b5@gmail.com Signed-off-by: Paolo Abeni --- drivers/net/ethernet/realtek/r8169_main.c | 27 +++++++++++++++++++++++ 1 file changed, 27 insertions(+) diff --git a/drivers/net/ethernet/realtek/r8169_main.c b/drivers/net/ethernet/realtek/r8169_main.c index 45ac8befba292..3ddba7aa4914b 100644 --- a/drivers/net/ethernet/realtek/r8169_main.c +++ b/drivers/net/ethernet/realtek/r8169_main.c @@ -579,6 +579,33 @@ struct rtl8169_counters { __le32 rx_multicast; __le16 tx_aborted; __le16 tx_underrun; + /* new since RTL8125 */ + __le64 tx_octets; + __le64 rx_octets; + __le64 rx_multicast64; + __le64 tx_unicast64; + __le64 tx_broadcast64; + __le64 tx_multicast64; + __le32 tx_pause_on; + __le32 tx_pause_off; + __le32 tx_pause_all; + __le32 tx_deferred; + __le32 tx_late_collision; + __le32 tx_all_collision; + __le32 tx_aborted32; + __le32 align_errors32; + __le32 rx_frame_too_long; + __le32 rx_runt; + __le32 rx_pause_on; + __le32 rx_pause_off; + __le32 rx_pause_all; + __le32 rx_unknown_opcode; + __le32 rx_mac_error; + __le32 tx_underrun32; + __le32 rx_mac_missed; + __le32 rx_tcam_dropped; + __le32 tdu; + __le32 rdu; }; struct rtl8169_tc_offsets { From 675faf5a14c14a2be0b870db30a70764df81e2df Mon Sep 17 00:00:00 2001 From: KhaiWenTan Date: Wed, 18 Sep 2024 14:14:22 +0800 Subject: [PATCH 13/35] net: stmmac: Fix zero-division error when disabling tc cbs The commit b8c43360f6e4 ("net: stmmac: No need to calculate speed divider when offload is disabled") allows the "port_transmit_rate_kbps" to be set to a value of 0, which is then passed to the "div_s64" function when tc-cbs is disabled. This leads to a zero-division error. When tc-cbs is disabled, the idleslope, sendslope, and credit values the credit values are not required to be configured. Therefore, adding a return statement after setting the txQ mode to DCB when tc-cbs is disabled would prevent a zero-division error. Fixes: b8c43360f6e4 ("net: stmmac: No need to calculate speed divider when offload is disabled") Cc: Co-developed-by: Choong Yong Liang Signed-off-by: Choong Yong Liang Signed-off-by: KhaiWenTan Reviewed-by: Simon Horman Link: https://patch.msgid.link/20240918061422.1589662-1-khai.wen.tan@linux.intel.com Signed-off-by: Paolo Abeni --- drivers/net/ethernet/stmicro/stmmac/stmmac_tc.c | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_tc.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_tc.c index 832998bc020bd..75ad2da1a37f1 100644 --- a/drivers/net/ethernet/stmicro/stmmac/stmmac_tc.c +++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_tc.c @@ -386,6 +386,7 @@ static int tc_setup_cbs(struct stmmac_priv *priv, return ret; priv->plat->tx_queues_cfg[queue].mode_to_use = MTL_QUEUE_DCB; + return 0; } /* Final adjustments for HW */ From 1d63864299cafa7c8cbde56491c9932afdbff7ea Mon Sep 17 00:00:00 2001 From: Paul Barker Date: Wed, 18 Sep 2024 09:18:38 +0100 Subject: [PATCH 14/35] net: ravb: Fix maximum TX frame size for GbEth devices MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The datasheets for all SoCs using the GbEth IP specify a maximum transmission frame size of 1.5 kByte. I've confirmed through internal discussions that support for 1522 byte frames has been validated, which allows us to support the default MTU of 1500 bytes after reserving space for the Ethernet header, frame checksums and an optional VLAN tag. Fixes: 2e95e08ac009 ("ravb: Add rx_max_buf_size to struct ravb_hw_info") Reviewed-by: Niklas Söderlund Reviewed-by: Sergey Shtylyov Signed-off-by: Paul Barker Reviewed-by: Simon Horman Signed-off-by: Paolo Abeni --- drivers/net/ethernet/renesas/ravb.h | 1 + drivers/net/ethernet/renesas/ravb_main.c | 6 +++++- 2 files changed, 6 insertions(+), 1 deletion(-) diff --git a/drivers/net/ethernet/renesas/ravb.h b/drivers/net/ethernet/renesas/ravb.h index 9893c91af1050..a7de5cf6b3174 100644 --- a/drivers/net/ethernet/renesas/ravb.h +++ b/drivers/net/ethernet/renesas/ravb.h @@ -1052,6 +1052,7 @@ struct ravb_hw_info { netdev_features_t net_features; int stats_len; u32 tccr_mask; + u32 tx_max_frame_size; u32 rx_max_frame_size; u32 rx_buffer_size; u32 rx_desc_size; diff --git a/drivers/net/ethernet/renesas/ravb_main.c b/drivers/net/ethernet/renesas/ravb_main.c index c7ec23688d56a..e73cdba585165 100644 --- a/drivers/net/ethernet/renesas/ravb_main.c +++ b/drivers/net/ethernet/renesas/ravb_main.c @@ -2674,6 +2674,7 @@ static const struct ravb_hw_info ravb_gen2_hw_info = { .net_features = NETIF_F_RXCSUM, .stats_len = ARRAY_SIZE(ravb_gstrings_stats), .tccr_mask = TCCR_TSRQ0 | TCCR_TSRQ1 | TCCR_TSRQ2 | TCCR_TSRQ3, + .tx_max_frame_size = SZ_2K, .rx_max_frame_size = SZ_2K, .rx_buffer_size = SZ_2K + SKB_DATA_ALIGN(sizeof(struct skb_shared_info)), @@ -2696,6 +2697,7 @@ static const struct ravb_hw_info ravb_gen3_hw_info = { .net_features = NETIF_F_RXCSUM, .stats_len = ARRAY_SIZE(ravb_gstrings_stats), .tccr_mask = TCCR_TSRQ0 | TCCR_TSRQ1 | TCCR_TSRQ2 | TCCR_TSRQ3, + .tx_max_frame_size = SZ_2K, .rx_max_frame_size = SZ_2K, .rx_buffer_size = SZ_2K + SKB_DATA_ALIGN(sizeof(struct skb_shared_info)), @@ -2721,6 +2723,7 @@ static const struct ravb_hw_info ravb_gen4_hw_info = { .net_features = NETIF_F_RXCSUM, .stats_len = ARRAY_SIZE(ravb_gstrings_stats), .tccr_mask = TCCR_TSRQ0 | TCCR_TSRQ1 | TCCR_TSRQ2 | TCCR_TSRQ3, + .tx_max_frame_size = SZ_2K, .rx_max_frame_size = SZ_2K, .rx_buffer_size = SZ_2K + SKB_DATA_ALIGN(sizeof(struct skb_shared_info)), @@ -2770,6 +2773,7 @@ static const struct ravb_hw_info gbeth_hw_info = { .net_features = NETIF_F_RXCSUM | NETIF_F_HW_CSUM, .stats_len = ARRAY_SIZE(ravb_gstrings_stats_gbeth), .tccr_mask = TCCR_TSRQ0, + .tx_max_frame_size = 1522, .rx_max_frame_size = SZ_8K, .rx_buffer_size = SZ_2K, .rx_desc_size = sizeof(struct ravb_rx_desc), @@ -2981,7 +2985,7 @@ static int ravb_probe(struct platform_device *pdev) priv->avb_link_active_low = of_property_read_bool(np, "renesas,ether-link-active-low"); - ndev->max_mtu = info->rx_max_frame_size - + ndev->max_mtu = info->tx_max_frame_size - (ETH_HLEN + VLAN_HLEN + ETH_FCS_LEN); ndev->min_mtu = ETH_MIN_MTU; From ec8234717db8589078d08b17efa528a235c61f4f Mon Sep 17 00:00:00 2001 From: Paul Barker Date: Wed, 18 Sep 2024 09:18:39 +0100 Subject: [PATCH 15/35] net: ravb: Fix R-Car RX frame size limit MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The RX frame size limit should not be based on the current MTU setting. Instead it should be based on the hardware capabilities. While we're here, improve the description of the receive frame length setting as suggested by Niklas. Fixes: c156633f1353 ("Renesas Ethernet AVB driver proper") Reviewed-by: Sergey Shtylyov Reviewed-by: Niklas Söderlund Signed-off-by: Paul Barker Reviewed-by: Simon Horman Signed-off-by: Paolo Abeni --- drivers/net/ethernet/renesas/ravb_main.c | 12 ++++++++++-- 1 file changed, 10 insertions(+), 2 deletions(-) diff --git a/drivers/net/ethernet/renesas/ravb_main.c b/drivers/net/ethernet/renesas/ravb_main.c index e73cdba585165..d2a6518532f37 100644 --- a/drivers/net/ethernet/renesas/ravb_main.c +++ b/drivers/net/ethernet/renesas/ravb_main.c @@ -555,8 +555,16 @@ static void ravb_emac_init_gbeth(struct net_device *ndev) static void ravb_emac_init_rcar(struct net_device *ndev) { - /* Receive frame limit set register */ - ravb_write(ndev, ndev->mtu + ETH_HLEN + VLAN_HLEN + ETH_FCS_LEN, RFLR); + struct ravb_private *priv = netdev_priv(ndev); + + /* Set receive frame length + * + * The length set here describes the frame from the destination address + * up to and including the CRC data. However only the frame data, + * excluding the CRC, are transferred to memory. To allow for the + * largest frames add the CRC length to the maximum Rx descriptor size. + */ + ravb_write(ndev, priv->info->rx_max_frame_size + ETH_FCS_LEN, RFLR); /* EMAC Mode: PAUSE prohibition; Duplex; RX Checksum; TX; RX */ ravb_write(ndev, ECMR_ZPF | ECMR_DM | From 3b067536daa4842adbf685accf47c899a26367d3 Mon Sep 17 00:00:00 2001 From: Heiner Kallweit Date: Wed, 18 Sep 2024 20:45:15 +0200 Subject: [PATCH 16/35] r8169: add missing MODULE_FIRMWARE entry for RTL8126A rev.b Add a missing MODULE_FIRMWARE entry. Fixes: 69cb89981c7a ("r8169: add support for RTL8126A rev.b") Signed-off-by: Heiner Kallweit Link: https://patch.msgid.link/bb307611-d129-43f5-a7ff-bdb6b4044fce@gmail.com Signed-off-by: Paolo Abeni --- drivers/net/ethernet/realtek/r8169_main.c | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/net/ethernet/realtek/r8169_main.c b/drivers/net/ethernet/realtek/r8169_main.c index 3ddba7aa4914b..305ec19ccef13 100644 --- a/drivers/net/ethernet/realtek/r8169_main.c +++ b/drivers/net/ethernet/realtek/r8169_main.c @@ -708,6 +708,7 @@ MODULE_FIRMWARE(FIRMWARE_8107E_2); MODULE_FIRMWARE(FIRMWARE_8125A_3); MODULE_FIRMWARE(FIRMWARE_8125B_2); MODULE_FIRMWARE(FIRMWARE_8126A_2); +MODULE_FIRMWARE(FIRMWARE_8126A_3); static inline struct device *tp_to_dev(struct rtl8169_private *tp) { From 0cbfd45fbcf0cb26d85c981b91c62fe73cdee01c Mon Sep 17 00:00:00 2001 From: Jiwon Kim Date: Wed, 18 Sep 2024 14:06:02 +0000 Subject: [PATCH 17/35] bonding: Fix unnecessary warnings and logs from bond_xdp_get_xmit_slave() syzbot reported a WARNING in bond_xdp_get_xmit_slave. To reproduce this[1], one bond device (bond1) has xdpdrv, which increases bpf_master_redirect_enabled_key. Another bond device (bond0) which is unsupported by XDP but its slave (veth3) has xdpgeneric that returns XDP_TX. This triggers WARN_ON_ONCE() from the xdp_master_redirect(). To reduce unnecessary warnings and improve log management, we need to delete the WARN_ON_ONCE() and add ratelimit to the netdev_err(). [1] Steps to reproduce: # Needs tx_xdp with return XDP_TX; ip l add veth0 type veth peer veth1 ip l add veth3 type veth peer veth4 ip l add bond0 type bond mode 6 # BOND_MODE_ALB, unsupported by XDP ip l add bond1 type bond # BOND_MODE_ROUNDROBIN by default ip l set veth0 master bond1 ip l set bond1 up # Increases bpf_master_redirect_enabled_key ip l set dev bond1 xdpdrv object tx_xdp.o section xdp_tx ip l set veth3 master bond0 ip l set bond0 up ip l set veth4 up # Triggers WARN_ON_ONCE() from the xdp_master_redirect() ip l set veth3 xdpgeneric object tx_xdp.o section xdp_tx Reported-by: syzbot+c187823a52ed505b2257@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=c187823a52ed505b2257 Fixes: 9e2ee5c7e7c3 ("net, bonding: Add XDP support to the bonding driver") Signed-off-by: Jiwon Kim Signed-off-by: Nikolay Aleksandrov Link: https://patch.msgid.link/20240918140602.18644-1-jiwonaid0@gmail.com Signed-off-by: Paolo Abeni --- drivers/net/bonding/bond_main.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c index b560644ee1b10..b1bffd8e9a958 100644 --- a/drivers/net/bonding/bond_main.c +++ b/drivers/net/bonding/bond_main.c @@ -5610,9 +5610,9 @@ bond_xdp_get_xmit_slave(struct net_device *bond_dev, struct xdp_buff *xdp) break; default: - /* Should never happen. Mode guarded by bond_xdp_check() */ - netdev_err(bond_dev, "Unknown bonding mode %d for xdp xmit\n", BOND_MODE(bond)); - WARN_ON_ONCE(1); + if (net_ratelimit()) + netdev_err(bond_dev, "Unknown bonding mode %d for xdp xmit\n", + BOND_MODE(bond)); return NULL; } From c11a49d58ad229a1be1ebe08a2b68fedf83db6c8 Mon Sep 17 00:00:00 2001 From: Wenbo Li Date: Thu, 19 Sep 2024 16:13:51 +0800 Subject: [PATCH 18/35] virtio_net: Fix mismatched buf address when unmapping for small packets Currently, the virtio-net driver will perform a pre-dma-mapping for small or mergeable RX buffer. But for small packets, a mismatched address without VIRTNET_RX_PAD and xdp_headroom is used for unmapping. That will result in unsynchronized buffers when SWIOTLB is enabled, for example, when running as a TDX guest. This patch unifies the address passed to the virtio core as the address of the virtnet header and fixes the mismatched buffer address. Changes from v2: unify the buf that passed to the virtio core in small and merge mode. Changes from v1: Use ctx to get xdp_headroom. Fixes: 295525e29a5b ("virtio_net: merge dma operations when filling mergeable buffers") Signed-off-by: Wenbo Li Signed-off-by: Jiahui Cen Signed-off-by: Ying Fang Reviewed-by: Xuan Zhuo Link: https://patch.msgid.link/20240919081351.51772-1-liwenbo.martin@bytedance.com Signed-off-by: Paolo Abeni --- drivers/net/virtio_net.c | 10 ++++++++-- 1 file changed, 8 insertions(+), 2 deletions(-) diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c index 6f4781ec2b369..f8131f92a3928 100644 --- a/drivers/net/virtio_net.c +++ b/drivers/net/virtio_net.c @@ -1807,6 +1807,11 @@ static struct sk_buff *receive_small(struct net_device *dev, struct page *page = virt_to_head_page(buf); struct sk_buff *skb; + /* We passed the address of virtnet header to virtio-core, + * so truncate the padding. + */ + buf -= VIRTNET_RX_PAD + xdp_headroom; + len -= vi->hdr_len; u64_stats_add(&stats->bytes, len); @@ -2422,8 +2427,9 @@ static int add_recvbuf_small(struct virtnet_info *vi, struct receive_queue *rq, if (unlikely(!buf)) return -ENOMEM; - virtnet_rq_init_one_sg(rq, buf + VIRTNET_RX_PAD + xdp_headroom, - vi->hdr_len + GOOD_PACKET_LEN); + buf += VIRTNET_RX_PAD + xdp_headroom; + + virtnet_rq_init_one_sg(rq, buf, vi->hdr_len + GOOD_PACKET_LEN); err = virtqueue_add_inbuf_ctx(rq->vq, rq->sg, 1, buf, ctx, gfp); if (err < 0) { From b514c47ebf41a6536551ed28a05758036e6eca7c Mon Sep 17 00:00:00 2001 From: Furong Xu <0x1207@gmail.com> Date: Thu, 19 Sep 2024 20:10:28 +0800 Subject: [PATCH 19/35] net: stmmac: set PP_FLAG_DMA_SYNC_DEV only if XDP is enabled Commit 5fabb01207a2 ("net: stmmac: Add initial XDP support") sets PP_FLAG_DMA_SYNC_DEV flag for page_pool unconditionally, page_pool_recycle_direct() will call page_pool_dma_sync_for_device() on every page even the page is not going to be reused by XDP program. When XDP is not enabled, the page which holds the received buffer will be recycled once the buffer is copied into new SKB by skb_copy_to_linear_data(), then the MAC core will never reuse this page any longer. Always setting PP_FLAG_DMA_SYNC_DEV wastes CPU cycles on unnecessary calling of page_pool_dma_sync_for_device(). After this patch, up to 9% noticeable performance improvement was observed on certain platforms. Fixes: 5fabb01207a2 ("net: stmmac: Add initial XDP support") Signed-off-by: Furong Xu <0x1207@gmail.com> Link: https://patch.msgid.link/20240919121028.1348023-1-0x1207@gmail.com Signed-off-by: Paolo Abeni --- drivers/net/ethernet/stmicro/stmmac/stmmac_main.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c index d3895d7eecfc5..e2140482270a8 100644 --- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c +++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c @@ -2035,7 +2035,7 @@ static int __alloc_dma_rx_desc_resources(struct stmmac_priv *priv, rx_q->queue_index = queue; rx_q->priv_data = priv; - pp_params.flags = PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV; + pp_params.flags = PP_FLAG_DMA_MAP | (xdp_prog ? PP_FLAG_DMA_SYNC_DEV : 0); pp_params.pool_size = dma_conf->dma_rx_size; num_pages = DIV_ROUND_UP(dma_conf->dma_buf_sz, PAGE_SIZE); pp_params.order = ilog2(num_pages); From 04e906839a053f092ef53f4fb2d610983412b904 Mon Sep 17 00:00:00 2001 From: Oliver Neukum Date: Thu, 19 Sep 2024 14:33:42 +0200 Subject: [PATCH 20/35] usbnet: fix cyclical race on disconnect with work queue The work can submit URBs and the URBs can schedule the work. This cycle needs to be broken, when a device is to be stopped. Use a flag to do so. This is a design issue as old as the driver. Signed-off-by: Oliver Neukum Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") CC: stable@vger.kernel.org Link: https://patch.msgid.link/20240919123525.688065-1-oneukum@suse.com Signed-off-by: Paolo Abeni --- drivers/net/usb/usbnet.c | 37 ++++++++++++++++++++++++++++--------- include/linux/usb/usbnet.h | 15 +++++++++++++++ 2 files changed, 43 insertions(+), 9 deletions(-) diff --git a/drivers/net/usb/usbnet.c b/drivers/net/usb/usbnet.c index 18eb5ba436df6..2506aa8c603ec 100644 --- a/drivers/net/usb/usbnet.c +++ b/drivers/net/usb/usbnet.c @@ -464,10 +464,15 @@ static enum skb_state defer_bh(struct usbnet *dev, struct sk_buff *skb, void usbnet_defer_kevent (struct usbnet *dev, int work) { set_bit (work, &dev->flags); - if (!schedule_work (&dev->kevent)) - netdev_dbg(dev->net, "kevent %s may have been dropped\n", usbnet_event_names[work]); - else - netdev_dbg(dev->net, "kevent %s scheduled\n", usbnet_event_names[work]); + if (!usbnet_going_away(dev)) { + if (!schedule_work(&dev->kevent)) + netdev_dbg(dev->net, + "kevent %s may have been dropped\n", + usbnet_event_names[work]); + else + netdev_dbg(dev->net, + "kevent %s scheduled\n", usbnet_event_names[work]); + } } EXPORT_SYMBOL_GPL(usbnet_defer_kevent); @@ -535,7 +540,8 @@ static int rx_submit (struct usbnet *dev, struct urb *urb, gfp_t flags) tasklet_schedule (&dev->bh); break; case 0: - __usbnet_queue_skb(&dev->rxq, skb, rx_start); + if (!usbnet_going_away(dev)) + __usbnet_queue_skb(&dev->rxq, skb, rx_start); } } else { netif_dbg(dev, ifdown, dev->net, "rx: stopped\n"); @@ -843,9 +849,18 @@ int usbnet_stop (struct net_device *net) /* deferred work (timer, softirq, task) must also stop */ dev->flags = 0; - del_timer_sync (&dev->delay); - tasklet_kill (&dev->bh); + del_timer_sync(&dev->delay); + tasklet_kill(&dev->bh); cancel_work_sync(&dev->kevent); + + /* We have cyclic dependencies. Those calls are needed + * to break a cycle. We cannot fall into the gaps because + * we have a flag + */ + tasklet_kill(&dev->bh); + del_timer_sync(&dev->delay); + cancel_work_sync(&dev->kevent); + if (!pm) usb_autopm_put_interface(dev->intf); @@ -1171,7 +1186,8 @@ usbnet_deferred_kevent (struct work_struct *work) status); } else { clear_bit (EVENT_RX_HALT, &dev->flags); - tasklet_schedule (&dev->bh); + if (!usbnet_going_away(dev)) + tasklet_schedule(&dev->bh); } } @@ -1196,7 +1212,8 @@ usbnet_deferred_kevent (struct work_struct *work) usb_autopm_put_interface(dev->intf); fail_lowmem: if (resched) - tasklet_schedule (&dev->bh); + if (!usbnet_going_away(dev)) + tasklet_schedule(&dev->bh); } } @@ -1559,6 +1576,7 @@ static void usbnet_bh (struct timer_list *t) } else if (netif_running (dev->net) && netif_device_present (dev->net) && netif_carrier_ok(dev->net) && + !usbnet_going_away(dev) && !timer_pending(&dev->delay) && !test_bit(EVENT_RX_PAUSED, &dev->flags) && !test_bit(EVENT_RX_HALT, &dev->flags)) { @@ -1606,6 +1624,7 @@ void usbnet_disconnect (struct usb_interface *intf) usb_set_intfdata(intf, NULL); if (!dev) return; + usbnet_mark_going_away(dev); xdev = interface_to_usbdev (intf); diff --git a/include/linux/usb/usbnet.h b/include/linux/usb/usbnet.h index 9f08a584d7078..0b9f1e598e3a6 100644 --- a/include/linux/usb/usbnet.h +++ b/include/linux/usb/usbnet.h @@ -76,8 +76,23 @@ struct usbnet { # define EVENT_LINK_CHANGE 11 # define EVENT_SET_RX_MODE 12 # define EVENT_NO_IP_ALIGN 13 +/* This one is special, as it indicates that the device is going away + * there are cyclic dependencies between tasklet, timer and bh + * that must be broken + */ +# define EVENT_UNPLUG 31 }; +static inline bool usbnet_going_away(struct usbnet *ubn) +{ + return test_bit(EVENT_UNPLUG, &ubn->flags); +} + +static inline void usbnet_mark_going_away(struct usbnet *ubn) +{ + set_bit(EVENT_UNPLUG, &ubn->flags); +} + static inline struct usb_driver *driver_of(struct usb_interface *intf) { return to_usb_driver(intf->dev.driver); From 72ef07554c5dcabb0053a147c4fd221a8e39bcfd Mon Sep 17 00:00:00 2001 From: Willem de Bruijn Date: Thu, 19 Sep 2024 08:43:42 -0400 Subject: [PATCH 21/35] selftests/net: packetdrill: increase timing tolerance in debug mode Some packetdrill tests are flaky in debug mode. As discussed, increase tolerance. We have been doing this for debug builds outside ksft too. Previous setting was 10000. A manual 50 runs in virtme-ng showed two failures that needed 12000. To be on the safe side, Increase to 14000. Link: https://lore.kernel.org/netdev/Zuhhe4-MQHd3EkfN@mini-arch/ Fixes: 1e42f73fd3c2 ("selftests/net: packetdrill: import tcp/zerocopy") Reported-by: Stanislav Fomichev Signed-off-by: Willem de Bruijn Reviewed-by: Simon Horman Acked-by: Stanislav Fomichev Acked-by: Matthieu Baerts (NGI0) Link: https://patch.msgid.link/20240919124412.3014326-1-willemdebruijn.kernel@gmail.com Signed-off-by: Paolo Abeni --- tools/testing/selftests/net/packetdrill/ksft_runner.sh | 9 +++++++-- 1 file changed, 7 insertions(+), 2 deletions(-) diff --git a/tools/testing/selftests/net/packetdrill/ksft_runner.sh b/tools/testing/selftests/net/packetdrill/ksft_runner.sh index 7478c0c0c9aac..4071c133f29e8 100755 --- a/tools/testing/selftests/net/packetdrill/ksft_runner.sh +++ b/tools/testing/selftests/net/packetdrill/ksft_runner.sh @@ -30,12 +30,17 @@ if [ -z "$(which packetdrill)" ]; then exit "$KSFT_SKIP" fi +declare -a optargs +if [[ -n "${KSFT_MACHINE_SLOW}" ]]; then + optargs+=('--tolerance_usecs=14000') +fi + ktap_print_header ktap_set_plan 2 -unshare -n packetdrill ${ipv4_args[@]} $(basename $script) > /dev/null \ +unshare -n packetdrill ${ipv4_args[@]} ${optargs[@]} $(basename $script) > /dev/null \ && ktap_test_pass "ipv4" || ktap_test_fail "ipv4" -unshare -n packetdrill ${ipv6_args[@]} $(basename $script) > /dev/null \ +unshare -n packetdrill ${ipv6_args[@]} ${optargs[@]} $(basename $script) > /dev/null \ && ktap_test_pass "ipv6" || ktap_test_fail "ipv6" ktap_finished From d8f84a9bc7c4e07fdc4edc00f9e868b8db974ccb Mon Sep 17 00:00:00 2001 From: Florian Westphal Date: Tue, 10 Sep 2024 11:38:14 +0200 Subject: [PATCH 22/35] netfilter: nf_nat: don't try nat source port reallocation for reverse dir clash A conntrack entry can be inserted to the connection tracking table if there is no existing entry with an identical tuple in either direction. Example: INITIATOR -> NAT/PAT -> RESPONDER Initiator passes through NAT/PAT ("us") and SNAT is done (saddr rewrite). Then, later, NAT/PAT machine itself also wants to connect to RESPONDER. This will not work if the SNAT done earlier has same IP:PORT source pair. Conntrack table has: ORIGINAL: $IP_INITATOR:$SPORT -> $IP_RESPONDER:$DPORT REPLY: $IP_RESPONDER:$DPORT -> $IP_NAT:$SPORT and new locally originating connection wants: ORIGINAL: $IP_NAT:$SPORT -> $IP_RESPONDER:$DPORT REPLY: $IP_RESPONDER:$DPORT -> $IP_NAT:$SPORT This is handled by the NAT engine which will do a source port reallocation for the locally originating connection that is colliding with an existing tuple by attempting a source port rewrite. This is done even if this new connection attempt did not go through a masquerade/snat rule. There is a rare race condition with connection-less protocols like UDP, where we do the port reallocation even though its not needed. This happens when new packets from the same, pre-existing flow are received in both directions at the exact same time on different CPUs after the conntrack table was flushed (or conntrack becomes active for first time). With strict ordering/single cpu, the first packet creates new ct entry and second packet is resolved as established reply packet. With parallel processing, both packets are picked up as new and both get their own ct entry. In this case, the 'reply' packet (picked up as ORIGINAL) can be mangled by NAT engine because a port collision is detected. This change isn't enough to prevent a packet drop later during nf_conntrack_confirm(), the existing clash resolution strategy will not detect such reverse clash case. This is resolved by a followup patch. Signed-off-by: Florian Westphal Signed-off-by: Pablo Neira Ayuso --- net/netfilter/nf_nat_core.c | 120 +++++++++++++++++++++++++++++++++++- 1 file changed, 118 insertions(+), 2 deletions(-) diff --git a/net/netfilter/nf_nat_core.c b/net/netfilter/nf_nat_core.c index 6d8da6dddf998..d9ea2c26f3098 100644 --- a/net/netfilter/nf_nat_core.c +++ b/net/netfilter/nf_nat_core.c @@ -183,7 +183,35 @@ hash_by_src(const struct net *net, return reciprocal_scale(hash, nf_nat_htable_size); } -/* Is this tuple already taken? (not by us) */ +/** + * nf_nat_used_tuple - check if proposed nat tuple clashes with existing entry + * @tuple: proposed NAT binding + * @ignored_conntrack: our (unconfirmed) conntrack entry + * + * A conntrack entry can be inserted to the connection tracking table + * if there is no existing entry with an identical tuple in either direction. + * + * Example: + * INITIATOR -> NAT/PAT -> RESPONDER + * + * INITIATOR passes through NAT/PAT ("us") and SNAT is done (saddr rewrite). + * Then, later, NAT/PAT itself also connects to RESPONDER. + * + * This will not work if the SNAT done earlier has same IP:PORT source pair. + * + * Conntrack table has: + * ORIGINAL: $IP_INITIATOR:$SPORT -> $IP_RESPONDER:$DPORT + * REPLY: $IP_RESPONDER:$DPORT -> $IP_NAT:$SPORT + * + * and new locally originating connection wants: + * ORIGINAL: $IP_NAT:$SPORT -> $IP_RESPONDER:$DPORT + * REPLY: $IP_RESPONDER:$DPORT -> $IP_NAT:$SPORT + * + * ... which would mean incoming packets cannot be distinguished between + * the existing and the newly added entry (identical IP_CT_DIR_REPLY tuple). + * + * @return: true if the proposed NAT mapping collides with an existing entry. + */ static int nf_nat_used_tuple(const struct nf_conntrack_tuple *tuple, const struct nf_conn *ignored_conntrack) @@ -200,6 +228,94 @@ nf_nat_used_tuple(const struct nf_conntrack_tuple *tuple, return nf_conntrack_tuple_taken(&reply, ignored_conntrack); } +static bool nf_nat_allow_clash(const struct nf_conn *ct) +{ + return nf_ct_l4proto_find(nf_ct_protonum(ct))->allow_clash; +} + +/** + * nf_nat_used_tuple_new - check if to-be-inserted conntrack collides with existing entry + * @tuple: proposed NAT binding + * @ignored_ct: our (unconfirmed) conntrack entry + * + * Same as nf_nat_used_tuple, but also check for rare clash in reverse + * direction. Should be called only when @tuple has not been altered, i.e. + * @ignored_conntrack will not be subject to NAT. + * + * @return: true if the proposed NAT mapping collides with existing entry. + */ +static noinline bool +nf_nat_used_tuple_new(const struct nf_conntrack_tuple *tuple, + const struct nf_conn *ignored_ct) +{ + static const unsigned long uses_nat = IPS_NAT_MASK | IPS_SEQ_ADJUST_BIT; + const struct nf_conntrack_tuple_hash *thash; + const struct nf_conntrack_zone *zone; + struct nf_conn *ct; + bool taken = true; + struct net *net; + + if (!nf_nat_used_tuple(tuple, ignored_ct)) + return false; + + if (!nf_nat_allow_clash(ignored_ct)) + return true; + + /* Initial choice clashes with existing conntrack. + * Check for (rare) reverse collision. + * + * This can happen when new packets are received in both directions + * at the exact same time on different CPUs. + * + * Without SMP, first packet creates new conntrack entry and second + * packet is resolved as established reply packet. + * + * With parallel processing, both packets could be picked up as + * new and both get their own ct entry allocated. + * + * If ignored_conntrack and colliding ct are not subject to NAT then + * pretend the tuple is available and let later clash resolution + * handle this at insertion time. + * + * Without it, the 'reply' packet has its source port rewritten + * by nat engine. + */ + if (READ_ONCE(ignored_ct->status) & uses_nat) + return true; + + net = nf_ct_net(ignored_ct); + zone = nf_ct_zone(ignored_ct); + + thash = nf_conntrack_find_get(net, zone, tuple); + if (unlikely(!thash)) /* clashing entry went away */ + return false; + + ct = nf_ct_tuplehash_to_ctrack(thash); + + /* NB: IP_CT_DIR_ORIGINAL should be impossible because + * nf_nat_used_tuple() handles origin collisions. + * + * Handle remote chance other CPU confirmed its ct right after. + */ + if (thash->tuple.dst.dir != IP_CT_DIR_REPLY) + goto out; + + /* clashing connection subject to NAT? Retry with new tuple. */ + if (READ_ONCE(ct->status) & uses_nat) + goto out; + + if (nf_ct_tuple_equal(&ct->tuplehash[IP_CT_DIR_ORIGINAL].tuple, + &ignored_ct->tuplehash[IP_CT_DIR_REPLY].tuple) && + nf_ct_tuple_equal(&ct->tuplehash[IP_CT_DIR_REPLY].tuple, + &ignored_ct->tuplehash[IP_CT_DIR_ORIGINAL].tuple)) { + taken = false; + goto out; + } +out: + nf_ct_put(ct); + return taken; +} + static bool nf_nat_may_kill(struct nf_conn *ct, unsigned long flags) { static const unsigned long flags_refuse = IPS_FIXED_TIMEOUT | @@ -611,7 +727,7 @@ get_unique_tuple(struct nf_conntrack_tuple *tuple, !(range->flags & NF_NAT_RANGE_PROTO_RANDOM_ALL)) { /* try the original tuple first */ if (nf_in_range(orig_tuple, range)) { - if (!nf_nat_used_tuple(orig_tuple, ct)) { + if (!nf_nat_used_tuple_new(orig_tuple, ct)) { *tuple = *orig_tuple; return; } From a4e6a1031e7769c63d17b8e97d79e25dd7271fd3 Mon Sep 17 00:00:00 2001 From: Florian Westphal Date: Tue, 10 Sep 2024 11:38:15 +0200 Subject: [PATCH 23/35] netfilter: conntrack: add clash resolution for reverse collisions Given existing entry: ORIGIN: a:b -> c:d REPLY: c:d -> a:b And colliding entry: ORIGIN: c:d -> a:b REPLY: a:b -> c:d The colliding ct (and the associated skb) get dropped on insert. Permit this by checking if the colliding entry matches the reply direction. Happens when both ends send packets at same time, both requests are picked up as NEW, rather than NEW for the 'first' and 'ESTABLISHED' for the second packet. This is an esoteric condition, as ruleset must permit NEW connections in either direction and both peers must already have a bidirectional traffic flow at the time conntrack gets enabled. Allow the 'reverse' skb to pass and assign the existing (clashing) entry. While at it, also drop the extra 'dying' check, this is already tested earlier by the calling function. Signed-off-by: Florian Westphal Signed-off-by: Pablo Neira Ayuso --- net/netfilter/nf_conntrack_core.c | 56 ++++++++++++++++++++++++++++--- 1 file changed, 51 insertions(+), 5 deletions(-) diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c index d3cb53b008f5a..7c63fbfb8c1dd 100644 --- a/net/netfilter/nf_conntrack_core.c +++ b/net/netfilter/nf_conntrack_core.c @@ -988,6 +988,56 @@ static void __nf_conntrack_insert_prepare(struct nf_conn *ct) tstamp->start = ktime_get_real_ns(); } +/** + * nf_ct_match_reverse - check if ct1 and ct2 refer to identical flow + * @ct1: conntrack in hash table to check against + * @ct2: merge candidate + * + * returns true if ct1 and ct2 happen to refer to the same flow, but + * in opposing directions, i.e. + * ct1: a:b -> c:d + * ct2: c:d -> a:b + * for both directions. If so, @ct2 should not have been created + * as the skb should have been picked up as ESTABLISHED flow. + * But ct1 was not yet committed to hash table before skb that created + * ct2 had arrived. + * + * Note we don't compare netns because ct entries in different net + * namespace cannot clash to begin with. + * + * @return: true if ct1 and ct2 are identical when swapping origin/reply. + */ +static bool +nf_ct_match_reverse(const struct nf_conn *ct1, const struct nf_conn *ct2) +{ + u16 id1, id2; + + if (!nf_ct_tuple_equal(&ct1->tuplehash[IP_CT_DIR_ORIGINAL].tuple, + &ct2->tuplehash[IP_CT_DIR_REPLY].tuple)) + return false; + + if (!nf_ct_tuple_equal(&ct1->tuplehash[IP_CT_DIR_REPLY].tuple, + &ct2->tuplehash[IP_CT_DIR_ORIGINAL].tuple)) + return false; + + id1 = nf_ct_zone_id(nf_ct_zone(ct1), IP_CT_DIR_ORIGINAL); + id2 = nf_ct_zone_id(nf_ct_zone(ct2), IP_CT_DIR_REPLY); + if (id1 != id2) + return false; + + id1 = nf_ct_zone_id(nf_ct_zone(ct1), IP_CT_DIR_REPLY); + id2 = nf_ct_zone_id(nf_ct_zone(ct2), IP_CT_DIR_ORIGINAL); + + return id1 == id2; +} + +static int nf_ct_can_merge(const struct nf_conn *ct, + const struct nf_conn *loser_ct) +{ + return nf_ct_match(ct, loser_ct) || + nf_ct_match_reverse(ct, loser_ct); +} + /* caller must hold locks to prevent concurrent changes */ static int __nf_ct_resolve_clash(struct sk_buff *skb, struct nf_conntrack_tuple_hash *h) @@ -999,11 +1049,7 @@ static int __nf_ct_resolve_clash(struct sk_buff *skb, loser_ct = nf_ct_get(skb, &ctinfo); - if (nf_ct_is_dying(ct)) - return NF_DROP; - - if (((ct->status & IPS_NAT_DONE_MASK) == 0) || - nf_ct_match(ct, loser_ct)) { + if (nf_ct_can_merge(ct, loser_ct)) { struct net *net = nf_ct_net(ct); nf_conntrack_get(&ct->ct_general); From a57856c0bbc238779e56ec9e48a7ba8e06d8bebf Mon Sep 17 00:00:00 2001 From: Florian Westphal Date: Tue, 10 Sep 2024 11:38:16 +0200 Subject: [PATCH 24/35] selftests: netfilter: add reverse-clash resolution test case Add test program that is sending UDP packets in both directions and check that packets arrive without source port modification. Signed-off-by: Florian Westphal Signed-off-by: Pablo Neira Ayuso --- .../testing/selftests/net/netfilter/Makefile | 2 + .../net/netfilter/conntrack_reverse_clash.c | 125 ++++++++++++++++++ .../net/netfilter/conntrack_reverse_clash.sh | 51 +++++++ 3 files changed, 178 insertions(+) create mode 100644 tools/testing/selftests/net/netfilter/conntrack_reverse_clash.c create mode 100755 tools/testing/selftests/net/netfilter/conntrack_reverse_clash.sh diff --git a/tools/testing/selftests/net/netfilter/Makefile b/tools/testing/selftests/net/netfilter/Makefile index d13fb5ea3e894..98535c60b1958 100644 --- a/tools/testing/selftests/net/netfilter/Makefile +++ b/tools/testing/selftests/net/netfilter/Makefile @@ -13,6 +13,7 @@ TEST_PROGS += conntrack_ipip_mtu.sh TEST_PROGS += conntrack_tcp_unreplied.sh TEST_PROGS += conntrack_sctp_collision.sh TEST_PROGS += conntrack_vrf.sh +TEST_PROGS += conntrack_reverse_clash.sh TEST_PROGS += ipvs.sh TEST_PROGS += nf_conntrack_packetdrill.sh TEST_PROGS += nf_nat_edemux.sh @@ -36,6 +37,7 @@ TEST_GEN_PROGS = conntrack_dump_flush TEST_GEN_FILES = audit_logread TEST_GEN_FILES += connect_close nf_queue +TEST_GEN_FILES += conntrack_reverse_clash TEST_GEN_FILES += sctp_collision include ../../lib.mk diff --git a/tools/testing/selftests/net/netfilter/conntrack_reverse_clash.c b/tools/testing/selftests/net/netfilter/conntrack_reverse_clash.c new file mode 100644 index 0000000000000..507930cee8cb6 --- /dev/null +++ b/tools/testing/selftests/net/netfilter/conntrack_reverse_clash.c @@ -0,0 +1,125 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Needs something like: + * + * iptables -t nat -A POSTROUTING -o nomatch -j MASQUERADE + * + * so NAT engine attaches a NAT null-binding to each connection. + * + * With unmodified kernels, child or parent will exit with + * "Port number changed" error, even though no port translation + * was requested. + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#define LEN 512 +#define PORT 56789 +#define TEST_TIME 5 + +static void die(const char *e) +{ + perror(e); + exit(111); +} + +static void die_port(uint16_t got, uint16_t want) +{ + fprintf(stderr, "Port number changed, wanted %d got %d\n", want, ntohs(got)); + exit(1); +} + +static int udp_socket(void) +{ + static const struct timeval tv = { + .tv_sec = 1, + }; + int fd = socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP); + + if (fd < 0) + die("socket"); + + setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv)); + return fd; +} + +int main(int argc, char *argv[]) +{ + struct sockaddr_in sa1 = { + .sin_family = AF_INET, + }; + struct sockaddr_in sa2 = { + .sin_family = AF_INET, + }; + int s1, s2, status; + time_t end, now; + socklen_t plen; + char buf[LEN]; + bool child; + + sa1.sin_port = htons(PORT); + sa2.sin_port = htons(PORT + 1); + + s1 = udp_socket(); + s2 = udp_socket(); + + inet_pton(AF_INET, "127.0.0.11", &sa1.sin_addr); + inet_pton(AF_INET, "127.0.0.12", &sa2.sin_addr); + + if (bind(s1, (struct sockaddr *)&sa1, sizeof(sa1)) < 0) + die("bind 1"); + if (bind(s2, (struct sockaddr *)&sa2, sizeof(sa2)) < 0) + die("bind 2"); + + child = fork() == 0; + + now = time(NULL); + end = now + TEST_TIME; + + while (now < end) { + struct sockaddr_in peer; + socklen_t plen = sizeof(peer); + + now = time(NULL); + + if (child) { + if (sendto(s1, buf, LEN, 0, (struct sockaddr *)&sa2, sizeof(sa2)) != LEN) + continue; + + if (recvfrom(s2, buf, LEN, 0, (struct sockaddr *)&peer, &plen) < 0) + die("child recvfrom"); + + if (peer.sin_port != htons(PORT)) + die_port(peer.sin_port, PORT); + } else { + if (sendto(s2, buf, LEN, 0, (struct sockaddr *)&sa1, sizeof(sa1)) != LEN) + continue; + + if (recvfrom(s1, buf, LEN, 0, (struct sockaddr *)&peer, &plen) < 0) + die("parent recvfrom"); + + if (peer.sin_port != htons((PORT + 1))) + die_port(peer.sin_port, PORT + 1); + } + } + + if (child) + return 0; + + wait(&status); + + if (WIFEXITED(status)) + return WEXITSTATUS(status); + + return 1; +} diff --git a/tools/testing/selftests/net/netfilter/conntrack_reverse_clash.sh b/tools/testing/selftests/net/netfilter/conntrack_reverse_clash.sh new file mode 100755 index 0000000000000..a24c896347a88 --- /dev/null +++ b/tools/testing/selftests/net/netfilter/conntrack_reverse_clash.sh @@ -0,0 +1,51 @@ +#!/bin/bash +# SPDX-License-Identifier: GPL-2.0 + +source lib.sh + +cleanup() +{ + cleanup_all_ns +} + +checktool "nft --version" "run test without nft" +checktool "conntrack --version" "run test without conntrack" + +trap cleanup EXIT + +setup_ns ns0 + +# make loopback connections get nat null bindings assigned +ip netns exec "$ns0" nft -f - </dev/null + now=$(date +%s) + done +} + +do_flush & + +if ip netns exec "$ns0" ./conntrack_reverse_clash; then + echo "PASS: No SNAT performed for null bindings" +else + echo "ERROR: SNAT performed without any matching snat rule" + exit 1 +fi + +exit 0 From 7e37e0eacd22c41e354e4b5d6d448b13a201954a Mon Sep 17 00:00:00 2001 From: Antonio Ojea Date: Thu, 12 Sep 2024 06:17:54 +0000 Subject: [PATCH 25/35] selftests: netfilter: nft_tproxy.sh: add tcp tests The TPROXY functionality is widely used, however, there are only mptcp selftests covering this feature. The selftests represent the most common scenarios and can also be used as selfdocumentation of the feature. UDP and TCP testcases are split in different files because of the different nature of the protocols, specially due to the challenges that present to reliable test UDP due to the connectionless nature of the protocol. UDP only covers the scenarios involving the prerouting hook. The UDP tests are signfinicantly slower than the TCP ones, hence they use a larger timeout, it takes 20 seconds to run the full UDP suite on a 48 vCPU Intel(R) Xeon(R) CPU @2.60GHz. Signed-off-by: Antonio Ojea Signed-off-by: Pablo Neira Ayuso --- .../testing/selftests/net/netfilter/Makefile | 2 + tools/testing/selftests/net/netfilter/config | 1 + .../selftests/net/netfilter/nft_tproxy_tcp.sh | 358 ++++++++++++++++++ .../selftests/net/netfilter/nft_tproxy_udp.sh | 262 +++++++++++++ 4 files changed, 623 insertions(+) create mode 100755 tools/testing/selftests/net/netfilter/nft_tproxy_tcp.sh create mode 100755 tools/testing/selftests/net/netfilter/nft_tproxy_udp.sh diff --git a/tools/testing/selftests/net/netfilter/Makefile b/tools/testing/selftests/net/netfilter/Makefile index 98535c60b1958..e6c9e777feadc 100644 --- a/tools/testing/selftests/net/netfilter/Makefile +++ b/tools/testing/selftests/net/netfilter/Makefile @@ -27,6 +27,8 @@ TEST_PROGS += nft_nat.sh TEST_PROGS += nft_nat_zones.sh TEST_PROGS += nft_queue.sh TEST_PROGS += nft_synproxy.sh +TEST_PROGS += nft_tproxy_tcp.sh +TEST_PROGS += nft_tproxy_udp.sh TEST_PROGS += nft_zones_many.sh TEST_PROGS += rpath.sh TEST_PROGS += xt_string.sh diff --git a/tools/testing/selftests/net/netfilter/config b/tools/testing/selftests/net/netfilter/config index b2dd4db452150..c5fe7b34eaf19 100644 --- a/tools/testing/selftests/net/netfilter/config +++ b/tools/testing/selftests/net/netfilter/config @@ -81,6 +81,7 @@ CONFIG_NFT_QUEUE=m CONFIG_NFT_QUOTA=m CONFIG_NFT_REDIR=m CONFIG_NFT_SYNPROXY=m +CONFIG_NFT_TPROXY=m CONFIG_VETH=m CONFIG_VLAN_8021Q=m CONFIG_XFRM_USER=m diff --git a/tools/testing/selftests/net/netfilter/nft_tproxy_tcp.sh b/tools/testing/selftests/net/netfilter/nft_tproxy_tcp.sh new file mode 100755 index 0000000000000..e208fb03eeb76 --- /dev/null +++ b/tools/testing/selftests/net/netfilter/nft_tproxy_tcp.sh @@ -0,0 +1,358 @@ +#!/bin/bash +# +# This tests tproxy on the following scenario: +# +# +------------+ +# +-------+ | nsrouter | +-------+ +# |ns1 |.99 .1| |.1 .99| ns2| +# | eth0|---------------|veth0 veth1|------------------|eth0 | +# | | 10.0.1.0/24 | | 10.0.2.0/24 | | +# +-------+ dead:1::/64 | veth2 | dead:2::/64 +-------+ +# +------------+ +# |.1 +# | +# | +# | +-------+ +# | .99| ns3| +# +------------------------|eth0 | +# 10.0.3.0/24 | | +# dead:3::/64 +-------+ +# +# The tproxy implementation acts as an echo server so the client +# must receive the same message it sent if it has been proxied. +# If is not proxied the servers return PONG_NS# with the number +# of the namespace the server is running. +# +# shellcheck disable=SC2162,SC2317 + +source lib.sh +ret=0 +timeout=5 + +cleanup() +{ + ip netns pids "$ns1" | xargs kill 2>/dev/null + ip netns pids "$ns2" | xargs kill 2>/dev/null + ip netns pids "$ns3" | xargs kill 2>/dev/null + ip netns pids "$nsrouter" | xargs kill 2>/dev/null + + cleanup_all_ns +} + +checktool "nft --version" "test without nft tool" +checktool "socat -h" "run test without socat" + +trap cleanup EXIT +setup_ns ns1 ns2 ns3 nsrouter + +if ! ip link add veth0 netns "$nsrouter" type veth peer name eth0 netns "$ns1" > /dev/null 2>&1; then + echo "SKIP: No virtual ethernet pair device support in kernel" + exit $ksft_skip +fi +ip link add veth1 netns "$nsrouter" type veth peer name eth0 netns "$ns2" +ip link add veth2 netns "$nsrouter" type veth peer name eth0 netns "$ns3" + +ip -net "$nsrouter" link set veth0 up +ip -net "$nsrouter" addr add 10.0.1.1/24 dev veth0 +ip -net "$nsrouter" addr add dead:1::1/64 dev veth0 nodad + +ip -net "$nsrouter" link set veth1 up +ip -net "$nsrouter" addr add 10.0.2.1/24 dev veth1 +ip -net "$nsrouter" addr add dead:2::1/64 dev veth1 nodad + +ip -net "$nsrouter" link set veth2 up +ip -net "$nsrouter" addr add 10.0.3.1/24 dev veth2 +ip -net "$nsrouter" addr add dead:3::1/64 dev veth2 nodad + +ip -net "$ns1" link set eth0 up +ip -net "$ns2" link set eth0 up +ip -net "$ns3" link set eth0 up + +ip -net "$ns1" addr add 10.0.1.99/24 dev eth0 +ip -net "$ns1" addr add dead:1::99/64 dev eth0 nodad +ip -net "$ns1" route add default via 10.0.1.1 +ip -net "$ns1" route add default via dead:1::1 + +ip -net "$ns2" addr add 10.0.2.99/24 dev eth0 +ip -net "$ns2" addr add dead:2::99/64 dev eth0 nodad +ip -net "$ns2" route add default via 10.0.2.1 +ip -net "$ns2" route add default via dead:2::1 + +ip -net "$ns3" addr add 10.0.3.99/24 dev eth0 +ip -net "$ns3" addr add dead:3::99/64 dev eth0 nodad +ip -net "$ns3" route add default via 10.0.3.1 +ip -net "$ns3" route add default via dead:3::1 + +ip netns exec "$nsrouter" sysctl net.ipv6.conf.all.forwarding=1 > /dev/null +ip netns exec "$nsrouter" sysctl net.ipv4.conf.veth0.forwarding=1 > /dev/null +ip netns exec "$nsrouter" sysctl net.ipv4.conf.veth1.forwarding=1 > /dev/null +ip netns exec "$nsrouter" sysctl net.ipv4.conf.veth2.forwarding=1 > /dev/null + +test_ping() { + if ! ip netns exec "$ns1" ping -c 1 -q 10.0.2.99 > /dev/null; then + return 1 + fi + + if ! ip netns exec "$ns1" ping -c 1 -q dead:2::99 > /dev/null; then + return 2 + fi + + if ! ip netns exec "$ns1" ping -c 1 -q 10.0.3.99 > /dev/null; then + return 1 + fi + + if ! ip netns exec "$ns1" ping -c 1 -q dead:3::99 > /dev/null; then + return 2 + fi + + return 0 +} + +test_ping_router() { + if ! ip netns exec "$ns1" ping -c 1 -q 10.0.2.1 > /dev/null; then + return 3 + fi + + if ! ip netns exec "$ns1" ping -c 1 -q dead:2::1 > /dev/null; then + return 4 + fi + + return 0 +} + + +listener_ready() +{ + local ns="$1" + local port="$2" + local proto="$3" + ss -N "$ns" -ln "$proto" -o "sport = :$port" | grep -q "$port" +} + +test_tproxy() +{ + local traffic_origin="$1" + local ip_proto="$2" + local expect_ns1_ns2="$3" + local expect_ns1_ns3="$4" + local expect_nsrouter_ns2="$5" + local expect_nsrouter_ns3="$6" + + # derived variables + local testname="test_${ip_proto}_tcp_${traffic_origin}" + local socat_ipproto + local ns1_ip + local ns2_ip + local ns3_ip + local ns2_target + local ns3_target + local nftables_subject + local ip_command + + # socat 1.8.0 has a bug that requires to specify the IP family to bind (fixed in 1.8.0.1) + case $ip_proto in + "ip") + socat_ipproto="-4" + ns1_ip=10.0.1.99 + ns2_ip=10.0.2.99 + ns3_ip=10.0.3.99 + ns2_target="tcp:$ns2_ip:8080" + ns3_target="tcp:$ns3_ip:8080" + nftables_subject="ip daddr $ns2_ip tcp dport 8080" + ip_command="ip" + ;; + "ip6") + socat_ipproto="-6" + ns1_ip=dead:1::99 + ns2_ip=dead:2::99 + ns3_ip=dead:3::99 + ns2_target="tcp:[$ns2_ip]:8080" + ns3_target="tcp:[$ns3_ip]:8080" + nftables_subject="ip6 daddr $ns2_ip tcp dport 8080" + ip_command="ip -6" + ;; + *) + echo "FAIL: unsupported protocol" + exit 255 + ;; + esac + + case $traffic_origin in + # to capture the local originated traffic we need to mark the outgoing + # traffic so the policy based routing rule redirects it and can be processed + # in the prerouting chain. + "local") + nftables_rules=" +flush ruleset +table inet filter { + chain divert { + type filter hook prerouting priority 0; policy accept; + $nftables_subject tproxy $ip_proto to :12345 meta mark set 1 accept + } + chain output { + type route hook output priority 0; policy accept; + $nftables_subject meta mark set 1 accept + } +}" + ;; + "forward") + nftables_rules=" +flush ruleset +table inet filter { + chain divert { + type filter hook prerouting priority 0; policy accept; + $nftables_subject tproxy $ip_proto to :12345 meta mark set 1 accept + } +}" + ;; + *) + echo "FAIL: unsupported parameter for traffic origin" + exit 255 + ;; + esac + + # shellcheck disable=SC2046 # Intended splitting of ip_command + ip netns exec "$nsrouter" $ip_command rule add fwmark 1 table 100 + ip netns exec "$nsrouter" $ip_command route add local "${ns2_ip}" dev lo table 100 + echo "$nftables_rules" | ip netns exec "$nsrouter" nft -f /dev/stdin + + timeout "$timeout" ip netns exec "$nsrouter" socat "$socat_ipproto" tcp-listen:12345,fork,ip-transparent SYSTEM:"cat" 2>/dev/null & + local tproxy_pid=$! + + timeout "$timeout" ip netns exec "$ns2" socat "$socat_ipproto" tcp-listen:8080,fork SYSTEM:"echo PONG_NS2" 2>/dev/null & + local server2_pid=$! + + timeout "$timeout" ip netns exec "$ns3" socat "$socat_ipproto" tcp-listen:8080,fork SYSTEM:"echo PONG_NS3" 2>/dev/null & + local server3_pid=$! + + busywait "$BUSYWAIT_TIMEOUT" listener_ready "$nsrouter" 12345 "-t" + busywait "$BUSYWAIT_TIMEOUT" listener_ready "$ns2" 8080 "-t" + busywait "$BUSYWAIT_TIMEOUT" listener_ready "$ns3" 8080 "-t" + + local result + # request from ns1 to ns2 (forwarded traffic) + result=$(echo I_M_PROXIED | ip netns exec "$ns1" socat -t 2 -T 2 STDIO "$ns2_target") + if [ "$result" == "$expect_ns1_ns2" ] ;then + echo "PASS: tproxy test $testname: ns1 got reply \"$result\" connecting to ns2" + else + echo "ERROR: tproxy test $testname: ns1 got reply \"$result\" connecting to ns2, not \"${expect_ns1_ns2}\" as intended" + ret=1 + fi + + # request from ns1 to ns3(forwarded traffic) + result=$(echo I_M_PROXIED | ip netns exec "$ns1" socat -t 2 -T 2 STDIO "$ns3_target") + if [ "$result" = "$expect_ns1_ns3" ] ;then + echo "PASS: tproxy test $testname: ns1 got reply \"$result\" connecting to ns3" + else + echo "ERROR: tproxy test $testname: ns1 got reply \"$result\" connecting to ns3, not \"$expect_ns1_ns3\" as intended" + ret=1 + fi + + # request from nsrouter to ns2 (localy originated traffic) + result=$(echo I_M_PROXIED | ip netns exec "$nsrouter" socat -t 2 -T 2 STDIO "$ns2_target") + if [ "$result" == "$expect_nsrouter_ns2" ] ;then + echo "PASS: tproxy test $testname: nsrouter got reply \"$result\" connecting to ns2" + else + echo "ERROR: tproxy test $testname: nsrouter got reply \"$result\" connecting to ns2, not \"$expect_nsrouter_ns2\" as intended" + ret=1 + fi + + # request from nsrouter to ns3 (localy originated traffic) + result=$(echo I_M_PROXIED | ip netns exec "$nsrouter" socat -t 2 -T 2 STDIO "$ns3_target") + if [ "$result" = "$expect_nsrouter_ns3" ] ;then + echo "PASS: tproxy test $testname: nsrouter got reply \"$result\" connecting to ns3" + else + echo "ERROR: tproxy test $testname: nsrouter got reply \"$result\" connecting to ns3, not \"$expect_nsrouter_ns3\" as intended" + ret=1 + fi + + # cleanup + kill "$tproxy_pid" "$server2_pid" "$server3_pid" 2>/dev/null + # shellcheck disable=SC2046 # Intended splitting of ip_command + ip netns exec "$nsrouter" $ip_command rule del fwmark 1 table 100 + ip netns exec "$nsrouter" $ip_command route flush table 100 +} + + +test_ipv4_tcp_forward() +{ + local traffic_origin="forward" + local ip_proto="ip" + local expect_ns1_ns2="I_M_PROXIED" + local expect_ns1_ns3="PONG_NS3" + local expect_nsrouter_ns2="PONG_NS2" + local expect_nsrouter_ns3="PONG_NS3" + + test_tproxy "$traffic_origin" \ + "$ip_proto" \ + "$expect_ns1_ns2" \ + "$expect_ns1_ns3" \ + "$expect_nsrouter_ns2" \ + "$expect_nsrouter_ns3" +} + +test_ipv4_tcp_local() +{ + local traffic_origin="local" + local ip_proto="ip" + local expect_ns1_ns2="I_M_PROXIED" + local expect_ns1_ns3="PONG_NS3" + local expect_nsrouter_ns2="I_M_PROXIED" + local expect_nsrouter_ns3="PONG_NS3" + + test_tproxy "$traffic_origin" \ + "$ip_proto" \ + "$expect_ns1_ns2" \ + "$expect_ns1_ns3" \ + "$expect_nsrouter_ns2" \ + "$expect_nsrouter_ns3" +} + +test_ipv6_tcp_forward() +{ + local traffic_origin="forward" + local ip_proto="ip6" + local expect_ns1_ns2="I_M_PROXIED" + local expect_ns1_ns3="PONG_NS3" + local expect_nsrouter_ns2="PONG_NS2" + local expect_nsrouter_ns3="PONG_NS3" + + test_tproxy "$traffic_origin" \ + "$ip_proto" \ + "$expect_ns1_ns2" \ + "$expect_ns1_ns3" \ + "$expect_nsrouter_ns2" \ + "$expect_nsrouter_ns3" +} + +test_ipv6_tcp_local() +{ + local traffic_origin="local" + local ip_proto="ip6" + local expect_ns1_ns2="I_M_PROXIED" + local expect_ns1_ns3="PONG_NS3" + local expect_nsrouter_ns2="I_M_PROXIED" + local expect_nsrouter_ns3="PONG_NS3" + + test_tproxy "$traffic_origin" \ + "$ip_proto" \ + "$expect_ns1_ns2" \ + "$expect_ns1_ns3" \ + "$expect_nsrouter_ns2" \ + "$expect_nsrouter_ns3" +} + +if test_ping; then + # queue bypass works (rules were skipped, no listener) + echo "PASS: ${ns1} can reach ${ns2}" +else + echo "FAIL: ${ns1} cannot reach ${ns2}: $ret" 1>&2 + exit $ret +fi + +test_ipv4_tcp_forward +test_ipv4_tcp_local +test_ipv6_tcp_forward +test_ipv6_tcp_local + +exit $ret diff --git a/tools/testing/selftests/net/netfilter/nft_tproxy_udp.sh b/tools/testing/selftests/net/netfilter/nft_tproxy_udp.sh new file mode 100755 index 0000000000000..d16de13fe5a75 --- /dev/null +++ b/tools/testing/selftests/net/netfilter/nft_tproxy_udp.sh @@ -0,0 +1,262 @@ +#!/bin/bash +# +# This tests tproxy on the following scenario: +# +# +------------+ +# +-------+ | nsrouter | +-------+ +# |ns1 |.99 .1| |.1 .99| ns2| +# | eth0|---------------|veth0 veth1|------------------|eth0 | +# | | 10.0.1.0/24 | | 10.0.2.0/24 | | +# +-------+ dead:1::/64 | veth2 | dead:2::/64 +-------+ +# +------------+ +# |.1 +# | +# | +# | +-------+ +# | .99| ns3| +# +------------------------|eth0 | +# 10.0.3.0/24 | | +# dead:3::/64 +-------+ +# +# The tproxy implementation acts as an echo server so the client +# must receive the same message it sent if it has been proxied. +# If is not proxied the servers return PONG_NS# with the number +# of the namespace the server is running. +# shellcheck disable=SC2162,SC2317 + +source lib.sh +ret=0 +# UDP is slow +timeout=15 + +cleanup() +{ + ip netns pids "$ns1" | xargs kill 2>/dev/null + ip netns pids "$ns2" | xargs kill 2>/dev/null + ip netns pids "$ns3" | xargs kill 2>/dev/null + ip netns pids "$nsrouter" | xargs kill 2>/dev/null + + cleanup_all_ns +} + +checktool "nft --version" "test without nft tool" +checktool "socat -h" "run test without socat" + +trap cleanup EXIT +setup_ns ns1 ns2 ns3 nsrouter + +if ! ip link add veth0 netns "$nsrouter" type veth peer name eth0 netns "$ns1" > /dev/null 2>&1; then + echo "SKIP: No virtual ethernet pair device support in kernel" + exit $ksft_skip +fi +ip link add veth1 netns "$nsrouter" type veth peer name eth0 netns "$ns2" +ip link add veth2 netns "$nsrouter" type veth peer name eth0 netns "$ns3" + +ip -net "$nsrouter" link set veth0 up +ip -net "$nsrouter" addr add 10.0.1.1/24 dev veth0 +ip -net "$nsrouter" addr add dead:1::1/64 dev veth0 nodad + +ip -net "$nsrouter" link set veth1 up +ip -net "$nsrouter" addr add 10.0.2.1/24 dev veth1 +ip -net "$nsrouter" addr add dead:2::1/64 dev veth1 nodad + +ip -net "$nsrouter" link set veth2 up +ip -net "$nsrouter" addr add 10.0.3.1/24 dev veth2 +ip -net "$nsrouter" addr add dead:3::1/64 dev veth2 nodad + +ip -net "$ns1" link set eth0 up +ip -net "$ns2" link set eth0 up +ip -net "$ns3" link set eth0 up + +ip -net "$ns1" addr add 10.0.1.99/24 dev eth0 +ip -net "$ns1" addr add dead:1::99/64 dev eth0 nodad +ip -net "$ns1" route add default via 10.0.1.1 +ip -net "$ns1" route add default via dead:1::1 + +ip -net "$ns2" addr add 10.0.2.99/24 dev eth0 +ip -net "$ns2" addr add dead:2::99/64 dev eth0 nodad +ip -net "$ns2" route add default via 10.0.2.1 +ip -net "$ns2" route add default via dead:2::1 + +ip -net "$ns3" addr add 10.0.3.99/24 dev eth0 +ip -net "$ns3" addr add dead:3::99/64 dev eth0 nodad +ip -net "$ns3" route add default via 10.0.3.1 +ip -net "$ns3" route add default via dead:3::1 + +ip netns exec "$nsrouter" sysctl net.ipv6.conf.all.forwarding=1 > /dev/null +ip netns exec "$nsrouter" sysctl net.ipv4.conf.veth0.forwarding=1 > /dev/null +ip netns exec "$nsrouter" sysctl net.ipv4.conf.veth1.forwarding=1 > /dev/null +ip netns exec "$nsrouter" sysctl net.ipv4.conf.veth2.forwarding=1 > /dev/null + +test_ping() { + if ! ip netns exec "$ns1" ping -c 1 -q 10.0.2.99 > /dev/null; then + return 1 + fi + + if ! ip netns exec "$ns1" ping -c 1 -q dead:2::99 > /dev/null; then + return 2 + fi + + if ! ip netns exec "$ns1" ping -c 1 -q 10.0.3.99 > /dev/null; then + return 1 + fi + + if ! ip netns exec "$ns1" ping -c 1 -q dead:3::99 > /dev/null; then + return 2 + fi + + return 0 +} + +test_ping_router() { + if ! ip netns exec "$ns1" ping -c 1 -q 10.0.2.1 > /dev/null; then + return 3 + fi + + if ! ip netns exec "$ns1" ping -c 1 -q dead:2::1 > /dev/null; then + return 4 + fi + + return 0 +} + + +listener_ready() +{ + local ns="$1" + local port="$2" + local proto="$3" + ss -N "$ns" -ln "$proto" -o "sport = :$port" | grep -q "$port" +} + +test_tproxy_udp_forward() +{ + local ip_proto="$1" + + local expect_ns1_ns2="I_M_PROXIED" + local expect_ns1_ns3="PONG_NS3" + local expect_nsrouter_ns2="PONG_NS2" + local expect_nsrouter_ns3="PONG_NS3" + + # derived variables + local testname="test_${ip_proto}_udp_forward" + local socat_ipproto + local ns1_ip + local ns2_ip + local ns3_ip + local ns1_ip_port + local ns2_ip_port + local ns3_ip_port + local ip_command + + # socat 1.8.0 has a bug that requires to specify the IP family to bind (fixed in 1.8.0.1) + case $ip_proto in + "ip") + socat_ipproto="-4" + ns1_ip=10.0.1.99 + ns2_ip=10.0.2.99 + ns3_ip=10.0.3.99 + ns1_ip_port="$ns1_ip:18888" + ns2_ip_port="$ns2_ip:8080" + ns3_ip_port="$ns3_ip:8080" + ip_command="ip" + ;; + "ip6") + socat_ipproto="-6" + ns1_ip=dead:1::99 + ns2_ip=dead:2::99 + ns3_ip=dead:3::99 + ns1_ip_port="[$ns1_ip]:18888" + ns2_ip_port="[$ns2_ip]:8080" + ns3_ip_port="[$ns3_ip]:8080" + ip_command="ip -6" + ;; + *) + echo "FAIL: unsupported protocol" + exit 255 + ;; + esac + + # shellcheck disable=SC2046 # Intended splitting of ip_command + ip netns exec "$nsrouter" $ip_command rule add fwmark 1 table 100 + ip netns exec "$nsrouter" $ip_command route add local "$ns2_ip" dev lo table 100 + ip netns exec "$nsrouter" nft -f /dev/stdin </dev/null & + local tproxy_pid=$! + + timeout "$timeout" ip netns exec "$ns2" socat "$socat_ipproto" udp-listen:8080,fork SYSTEM:"echo PONG_NS2" 2>/dev/null & + local server2_pid=$! + + timeout "$timeout" ip netns exec "$ns3" socat "$socat_ipproto" udp-listen:8080,fork SYSTEM:"echo PONG_NS3" 2>/dev/null & + local server3_pid=$! + + busywait "$BUSYWAIT_TIMEOUT" listener_ready "$nsrouter" 12345 "-u" + busywait "$BUSYWAIT_TIMEOUT" listener_ready "$ns2" 8080 "-u" + busywait "$BUSYWAIT_TIMEOUT" listener_ready "$ns3" 8080 "-u" + + local result + # request from ns1 to ns2 (forwarded traffic) + result=$(echo I_M_PROXIED | ip netns exec "$ns1" socat -t 2 -T 2 STDIO udp:"$ns2_ip_port",sourceport=18888) + if [ "$result" == "$expect_ns1_ns2" ] ;then + echo "PASS: tproxy test $testname: ns1 got reply \"$result\" connecting to ns2" + else + echo "ERROR: tproxy test $testname: ns1 got reply \"$result\" connecting to ns2, not \"${expect_ns1_ns2}\" as intended" + ret=1 + fi + + # request from ns1 to ns3 (forwarded traffic) + result=$(echo I_M_PROXIED | ip netns exec "$ns1" socat -t 2 -T 2 STDIO udp:"$ns3_ip_port") + if [ "$result" = "$expect_ns1_ns3" ] ;then + echo "PASS: tproxy test $testname: ns1 got reply \"$result\" connecting to ns3" + else + echo "ERROR: tproxy test $testname: ns1 got reply \"$result\" connecting to ns3, not \"$expect_ns1_ns3\" as intended" + ret=1 + fi + + # request from nsrouter to ns2 (localy originated traffic) + result=$(echo I_M_PROXIED | ip netns exec "$nsrouter" socat -t 2 -T 2 STDIO udp:"$ns2_ip_port") + if [ "$result" == "$expect_nsrouter_ns2" ] ;then + echo "PASS: tproxy test $testname: nsrouter got reply \"$result\" connecting to ns2" + else + echo "ERROR: tproxy test $testname: nsrouter got reply \"$result\" connecting to ns2, not \"$expect_nsrouter_ns2\" as intended" + ret=1 + fi + + # request from nsrouter to ns3 (localy originated traffic) + result=$(echo I_M_PROXIED | ip netns exec "$nsrouter" socat -t 2 -T 2 STDIO udp:"$ns3_ip_port") + if [ "$result" = "$expect_nsrouter_ns3" ] ;then + echo "PASS: tproxy test $testname: nsrouter got reply \"$result\" connecting to ns3" + else + echo "ERROR: tproxy test $testname: nsrouter got reply \"$result\" connecting to ns3, not \"$expect_nsrouter_ns3\" as intended" + ret=1 + fi + + # cleanup + kill "$tproxy_pid" "$server2_pid" "$server3_pid" 2>/dev/null + # shellcheck disable=SC2046 # Intended splitting of ip_command + ip netns exec "$nsrouter" $ip_command rule del fwmark 1 table 100 + ip netns exec "$nsrouter" $ip_command route flush table 100 +} + + +if test_ping; then + # queue bypass works (rules were skipped, no listener) + echo "PASS: ${ns1} can reach ${ns2}" +else + echo "FAIL: ${ns1} cannot reach ${ns2}: $ret" 1>&2 + exit $ret +fi + +test_tproxy_udp_forward "ip" +test_tproxy_udp_forward "ip6" + +exit $ret From 2cadd3b17738a9160597e17b267876d6a10b7be5 Mon Sep 17 00:00:00 2001 From: Andy Shevchenko Date: Tue, 10 Sep 2024 11:35:33 +0300 Subject: [PATCH 26/35] netfilter: ctnetlink: Guard possible unused functions Some of the functions may be unused (CONFIG_NETFILTER_NETLINK_GLUE_CT=n and CONFIG_NF_CONNTRACK_EVENTS=n), it prevents kernel builds with clang, `make W=1` and CONFIG_WERROR=y: net/netfilter/nf_conntrack_netlink.c:657:22: error: unused function 'ctnetlink_acct_size' [-Werror,-Wunused-function] 657 | static inline size_t ctnetlink_acct_size(const struct nf_conn *ct) | ^~~~~~~~~~~~~~~~~~~ net/netfilter/nf_conntrack_netlink.c:667:19: error: unused function 'ctnetlink_secctx_size' [-Werror,-Wunused-function] 667 | static inline int ctnetlink_secctx_size(const struct nf_conn *ct) | ^~~~~~~~~~~~~~~~~~~~~ net/netfilter/nf_conntrack_netlink.c:683:22: error: unused function 'ctnetlink_timestamp_size' [-Werror,-Wunused-function] 683 | static inline size_t ctnetlink_timestamp_size(const struct nf_conn *ct) | ^~~~~~~~~~~~~~~~~~~~~~~~ Fix this by guarding possible unused functions with ifdeffery. See also commit 6863f5643dd7 ("kbuild: allow Clang to find unused static inline functions for W=1 build"). Signed-off-by: Andy Shevchenko Reviewed-by: Simon Horman Signed-off-by: Pablo Neira Ayuso --- net/netfilter/nf_conntrack_netlink.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/net/netfilter/nf_conntrack_netlink.c b/net/netfilter/nf_conntrack_netlink.c index 123e2e933e9b2..8fd2b9e392a7e 100644 --- a/net/netfilter/nf_conntrack_netlink.c +++ b/net/netfilter/nf_conntrack_netlink.c @@ -652,7 +652,6 @@ static size_t ctnetlink_proto_size(const struct nf_conn *ct) return len + len4; } -#endif static inline size_t ctnetlink_acct_size(const struct nf_conn *ct) { @@ -690,6 +689,7 @@ static inline size_t ctnetlink_timestamp_size(const struct nf_conn *ct) return 0; #endif } +#endif #ifdef CONFIG_NF_CONNTRACK_EVENTS static size_t ctnetlink_nlmsg_size(const struct nf_conn *ct) From aa758763be6ddcc1c500c6e4e8a15d604e8eadba Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?=E8=B0=A2=E8=87=B4=E9=82=A6=20=28XIE=20Zhibang=29?= Date: Thu, 12 Sep 2024 11:59:33 +0000 Subject: [PATCH 27/35] docs: tproxy: ignore non-transparent sockets in iptables MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The iptables example was added in commit d2f26037a38a (netfilter: Add documentation for tproxy, 2008-10-08), but xt_socket 'transparent' option was added in commit a31e1ffd2231 (netfilter: xt_socket: added new revision of the 'socket' match supporting flags, 2009-06-09). Now add the 'transparent' option to the iptables example to ignore non-transparent sockets, which is also consistent with the nft example. Signed-off-by: 谢致邦 (XIE Zhibang) Signed-off-by: Pablo Neira Ayuso --- Documentation/networking/tproxy.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/Documentation/networking/tproxy.rst b/Documentation/networking/tproxy.rst index 00dc3a1a66b48..7f7c1ff6f159e 100644 --- a/Documentation/networking/tproxy.rst +++ b/Documentation/networking/tproxy.rst @@ -17,7 +17,7 @@ The idea is that you identify packets with destination address matching a local socket on your box, set the packet mark to a certain value:: # iptables -t mangle -N DIVERT - # iptables -t mangle -A PREROUTING -p tcp -m socket -j DIVERT + # iptables -t mangle -A PREROUTING -p tcp -m socket --transparent -j DIVERT # iptables -t mangle -A DIVERT -j MARK --set-mark 1 # iptables -t mangle -A DIVERT -j ACCEPT From 642c89c475419b4d0c0d90e29d9c1a0e4351f379 Mon Sep 17 00:00:00 2001 From: Phil Sutter Date: Thu, 12 Sep 2024 14:21:33 +0200 Subject: [PATCH 28/35] netfilter: nf_tables: Keep deleted flowtable hooks until after RCU Documentation of list_del_rcu() warns callers to not immediately free the deleted list item. While it seems not necessary to use the RCU-variant of list_del() here in the first place, doing so seems to require calling kfree_rcu() on the deleted item as well. Fixes: 3f0465a9ef02 ("netfilter: nf_tables: dynamically allocate hooks per net_device in flowtables") Signed-off-by: Phil Sutter Signed-off-by: Pablo Neira Ayuso --- net/netfilter/nf_tables_api.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c index 57259b5f3ef58..042080aeb46c1 100644 --- a/net/netfilter/nf_tables_api.c +++ b/net/netfilter/nf_tables_api.c @@ -9207,7 +9207,7 @@ static void nf_tables_flowtable_destroy(struct nft_flowtable *flowtable) flowtable->data.type->setup(&flowtable->data, hook->ops.dev, FLOW_BLOCK_UNBIND); list_del_rcu(&hook->list); - kfree(hook); + kfree_rcu(hook, rcu); } kfree(flowtable->name); module_put(flowtable->data.type->owner); From fc56878ca1c288e49b5cbb43860a5938e3463654 Mon Sep 17 00:00:00 2001 From: Simon Horman Date: Mon, 16 Sep 2024 10:50:34 +0100 Subject: [PATCH 29/35] netfilter: nf_reject: Fix build warning when CONFIG_BRIDGE_NETFILTER=n If CONFIG_BRIDGE_NETFILTER is not enabled, which is the case for x86_64 defconfig, then building nf_reject_ipv4.c and nf_reject_ipv6.c with W=1 using gcc-14 results in the following warnings, which are treated as errors: net/ipv4/netfilter/nf_reject_ipv4.c: In function 'nf_send_reset': net/ipv4/netfilter/nf_reject_ipv4.c:243:23: error: variable 'niph' set but not used [-Werror=unused-but-set-variable] 243 | struct iphdr *niph; | ^~~~ cc1: all warnings being treated as errors net/ipv6/netfilter/nf_reject_ipv6.c: In function 'nf_send_reset6': net/ipv6/netfilter/nf_reject_ipv6.c:286:25: error: variable 'ip6h' set but not used [-Werror=unused-but-set-variable] 286 | struct ipv6hdr *ip6h; | ^~~~ cc1: all warnings being treated as errors Address this by reducing the scope of these local variables to where they are used, which is code only compiled when CONFIG_BRIDGE_NETFILTER enabled. Compile tested and run through netfilter selftests. Reported-by: Andy Shevchenko Closes: https://lore.kernel.org/netfilter-devel/20240906145513.567781-1-andriy.shevchenko@linux.intel.com/ Signed-off-by: Simon Horman Signed-off-by: Pablo Neira Ayuso --- net/ipv4/netfilter/nf_reject_ipv4.c | 10 ++++------ net/ipv6/netfilter/nf_reject_ipv6.c | 5 ++--- 2 files changed, 6 insertions(+), 9 deletions(-) diff --git a/net/ipv4/netfilter/nf_reject_ipv4.c b/net/ipv4/netfilter/nf_reject_ipv4.c index 04504b2b51df5..87fd945a0d27a 100644 --- a/net/ipv4/netfilter/nf_reject_ipv4.c +++ b/net/ipv4/netfilter/nf_reject_ipv4.c @@ -239,9 +239,8 @@ static int nf_reject_fill_skb_dst(struct sk_buff *skb_in) void nf_send_reset(struct net *net, struct sock *sk, struct sk_buff *oldskb, int hook) { - struct sk_buff *nskb; - struct iphdr *niph; const struct tcphdr *oth; + struct sk_buff *nskb; struct tcphdr _oth; oth = nf_reject_ip_tcphdr_get(oldskb, &_oth, hook); @@ -266,14 +265,12 @@ void nf_send_reset(struct net *net, struct sock *sk, struct sk_buff *oldskb, nskb->mark = IP4_REPLY_MARK(net, oldskb->mark); skb_reserve(nskb, LL_MAX_HEADER); - niph = nf_reject_iphdr_put(nskb, oldskb, IPPROTO_TCP, - ip4_dst_hoplimit(skb_dst(nskb))); + nf_reject_iphdr_put(nskb, oldskb, IPPROTO_TCP, + ip4_dst_hoplimit(skb_dst(nskb))); nf_reject_ip_tcphdr_put(nskb, oldskb, oth); if (ip_route_me_harder(net, sk, nskb, RTN_UNSPEC)) goto free_nskb; - niph = ip_hdr(nskb); - /* "Never happens" */ if (nskb->len > dst_mtu(skb_dst(nskb))) goto free_nskb; @@ -290,6 +287,7 @@ void nf_send_reset(struct net *net, struct sock *sk, struct sk_buff *oldskb, */ if (nf_bridge_info_exists(oldskb)) { struct ethhdr *oeth = eth_hdr(oldskb); + struct iphdr *niph = ip_hdr(nskb); struct net_device *br_indev; br_indev = nf_bridge_get_physindev(oldskb, net); diff --git a/net/ipv6/netfilter/nf_reject_ipv6.c b/net/ipv6/netfilter/nf_reject_ipv6.c index dedee264b8f6c..69a78550261fb 100644 --- a/net/ipv6/netfilter/nf_reject_ipv6.c +++ b/net/ipv6/netfilter/nf_reject_ipv6.c @@ -283,7 +283,6 @@ void nf_send_reset6(struct net *net, struct sock *sk, struct sk_buff *oldskb, const struct tcphdr *otcph; unsigned int otcplen, hh_len; const struct ipv6hdr *oip6h = ipv6_hdr(oldskb); - struct ipv6hdr *ip6h; struct dst_entry *dst = NULL; struct flowi6 fl6; @@ -339,8 +338,7 @@ void nf_send_reset6(struct net *net, struct sock *sk, struct sk_buff *oldskb, nskb->mark = fl6.flowi6_mark; skb_reserve(nskb, hh_len + dst->header_len); - ip6h = nf_reject_ip6hdr_put(nskb, oldskb, IPPROTO_TCP, - ip6_dst_hoplimit(dst)); + nf_reject_ip6hdr_put(nskb, oldskb, IPPROTO_TCP, ip6_dst_hoplimit(dst)); nf_reject_ip6_tcphdr_put(nskb, oldskb, otcph, otcplen); nf_ct_attach(nskb, oldskb); @@ -355,6 +353,7 @@ void nf_send_reset6(struct net *net, struct sock *sk, struct sk_buff *oldskb, */ if (nf_bridge_info_exists(oldskb)) { struct ethhdr *oeth = eth_hdr(oldskb); + struct ipv6hdr *ip6h = ipv6_hdr(nskb); struct net_device *br_indev; br_indev = nf_bridge_get_physindev(oldskb, net); From e1f1ee0e9ad8cbe660f5c104e791c5f1a7cf4c31 Mon Sep 17 00:00:00 2001 From: Simon Horman Date: Mon, 16 Sep 2024 16:14:41 +0100 Subject: [PATCH 30/35] netfilter: ctnetlink: compile ctnetlink_label_size with CONFIG_NF_CONNTRACK_EVENTS Only provide ctnetlink_label_size when it is used, which is when CONFIG_NF_CONNTRACK_EVENTS is configured. Flagged by clang-18 W=1 builds as: .../nf_conntrack_netlink.c:385:19: warning: unused function 'ctnetlink_label_size' [-Wunused-function] 385 | static inline int ctnetlink_label_size(const struct nf_conn *ct) | ^~~~~~~~~~~~~~~~~~~~ The condition on CONFIG_NF_CONNTRACK_LABELS being removed by this patch guards compilation of non-trivial implementations of ctnetlink_dump_labels() and ctnetlink_label_size(). However, this is not necessary as each of these functions will always return 0 if CONFIG_NF_CONNTRACK_LABELS is not defined as each function starts with the equivalent of: struct nf_conn_labels *labels = nf_ct_labels_find(ct); if (!labels) return 0; And nf_ct_labels_find always returns NULL if CONFIG_NF_CONNTRACK_LABELS is not enabled. So I believe that the compiler optimises the code away in such cases anyway. Found by inspection. Compile tested only. Originally splitted in two patches, Pablo Neira Ayuso collapsed them and added Fixes: tag. Fixes: 0ceabd83875b ("netfilter: ctnetlink: deliver labels to userspace") Link: https://lore.kernel.org/netfilter-devel/20240909151712.GZ2097826@kernel.org/ Signed-off-by: Simon Horman Signed-off-by: Pablo Neira Ayuso --- net/netfilter/nf_conntrack_netlink.c | 7 ++----- 1 file changed, 2 insertions(+), 5 deletions(-) diff --git a/net/netfilter/nf_conntrack_netlink.c b/net/netfilter/nf_conntrack_netlink.c index 8fd2b9e392a7e..6a1239433830f 100644 --- a/net/netfilter/nf_conntrack_netlink.c +++ b/net/netfilter/nf_conntrack_netlink.c @@ -382,7 +382,7 @@ static int ctnetlink_dump_secctx(struct sk_buff *skb, const struct nf_conn *ct) #define ctnetlink_dump_secctx(a, b) (0) #endif -#ifdef CONFIG_NF_CONNTRACK_LABELS +#ifdef CONFIG_NF_CONNTRACK_EVENTS static inline int ctnetlink_label_size(const struct nf_conn *ct) { struct nf_conn_labels *labels = nf_ct_labels_find(ct); @@ -391,6 +391,7 @@ static inline int ctnetlink_label_size(const struct nf_conn *ct) return 0; return nla_total_size(sizeof(labels->bits)); } +#endif static int ctnetlink_dump_labels(struct sk_buff *skb, const struct nf_conn *ct) @@ -411,10 +412,6 @@ ctnetlink_dump_labels(struct sk_buff *skb, const struct nf_conn *ct) return 0; } -#else -#define ctnetlink_dump_labels(a, b) (0) -#define ctnetlink_label_size(a) (0) -#endif #define master_tuple(ct) &(ct->master->tuplehash[IP_CT_DIR_ORIGINAL].tuple) From 4ffcf5ca81c3b83180473eb0d3c010a1a7c6c4de Mon Sep 17 00:00:00 2001 From: Pablo Neira Ayuso Date: Tue, 17 Sep 2024 23:07:46 +0200 Subject: [PATCH 31/35] netfilter: nf_tables: use rcu chain hook list iterator from netlink dump path Lockless iteration over hook list is possible from netlink dump path, use rcu variant to iterate over the hook list as is done with flowtable hooks. Fixes: b9703ed44ffb ("netfilter: nf_tables: support for adding new devices to an existing netdev chain") Reported-by: Phil Sutter Signed-off-by: Pablo Neira Ayuso --- net/netfilter/nf_tables_api.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c index 042080aeb46c1..8f073e6c772a5 100644 --- a/net/netfilter/nf_tables_api.c +++ b/net/netfilter/nf_tables_api.c @@ -1849,7 +1849,7 @@ static int nft_dump_basechain_hook(struct sk_buff *skb, int family, if (!hook_list) hook_list = &basechain->hook_list; - list_for_each_entry(hook, hook_list, list) { + list_for_each_entry_rcu(hook, hook_list, list) { if (!first) first = hook; From 69e687cea79fc99a17dfb0116c8644b9391b915e Mon Sep 17 00:00:00 2001 From: Pablo Neira Ayuso Date: Wed, 18 Sep 2024 14:19:45 +0200 Subject: [PATCH 32/35] netfilter: nf_tables: missing objects with no memcg accounting Several ruleset objects are still not using GFP_KERNEL_ACCOUNT for memory accounting, update them. This includes: - catchall elements - compat match large info area - log prefix - meta secctx - numgen counters - pipapo set backend datastructure - tunnel private objects Fixes: 33758c891479 ("memcg: enable accounting for nft objects") Signed-off-by: Pablo Neira Ayuso --- net/netfilter/nf_tables_api.c | 2 +- net/netfilter/nft_compat.c | 6 +++--- net/netfilter/nft_log.c | 2 +- net/netfilter/nft_meta.c | 2 +- net/netfilter/nft_numgen.c | 2 +- net/netfilter/nft_set_pipapo.c | 13 +++++++------ net/netfilter/nft_tunnel.c | 5 +++-- 7 files changed, 17 insertions(+), 15 deletions(-) diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c index 8f073e6c772a5..a24fe62650a75 100644 --- a/net/netfilter/nf_tables_api.c +++ b/net/netfilter/nf_tables_api.c @@ -6684,7 +6684,7 @@ static int nft_setelem_catchall_insert(const struct net *net, } } - catchall = kmalloc(sizeof(*catchall), GFP_KERNEL); + catchall = kmalloc(sizeof(*catchall), GFP_KERNEL_ACCOUNT); if (!catchall) return -ENOMEM; diff --git a/net/netfilter/nft_compat.c b/net/netfilter/nft_compat.c index 52cdfee17f73f..7ca4f0d21fe2a 100644 --- a/net/netfilter/nft_compat.c +++ b/net/netfilter/nft_compat.c @@ -535,7 +535,7 @@ nft_match_large_init(const struct nft_ctx *ctx, const struct nft_expr *expr, struct xt_match *m = expr->ops->data; int ret; - priv->info = kmalloc(XT_ALIGN(m->matchsize), GFP_KERNEL); + priv->info = kmalloc(XT_ALIGN(m->matchsize), GFP_KERNEL_ACCOUNT); if (!priv->info) return -ENOMEM; @@ -808,7 +808,7 @@ nft_match_select_ops(const struct nft_ctx *ctx, goto err; } - ops = kzalloc(sizeof(struct nft_expr_ops), GFP_KERNEL); + ops = kzalloc(sizeof(struct nft_expr_ops), GFP_KERNEL_ACCOUNT); if (!ops) { err = -ENOMEM; goto err; @@ -898,7 +898,7 @@ nft_target_select_ops(const struct nft_ctx *ctx, goto err; } - ops = kzalloc(sizeof(struct nft_expr_ops), GFP_KERNEL); + ops = kzalloc(sizeof(struct nft_expr_ops), GFP_KERNEL_ACCOUNT); if (!ops) { err = -ENOMEM; goto err; diff --git a/net/netfilter/nft_log.c b/net/netfilter/nft_log.c index 5defe6e4fd982..e355881379957 100644 --- a/net/netfilter/nft_log.c +++ b/net/netfilter/nft_log.c @@ -163,7 +163,7 @@ static int nft_log_init(const struct nft_ctx *ctx, nla = tb[NFTA_LOG_PREFIX]; if (nla != NULL) { - priv->prefix = kmalloc(nla_len(nla) + 1, GFP_KERNEL); + priv->prefix = kmalloc(nla_len(nla) + 1, GFP_KERNEL_ACCOUNT); if (priv->prefix == NULL) return -ENOMEM; nla_strscpy(priv->prefix, nla, nla_len(nla) + 1); diff --git a/net/netfilter/nft_meta.c b/net/netfilter/nft_meta.c index 8c8eb14d647b0..05cd1e6e6a2f6 100644 --- a/net/netfilter/nft_meta.c +++ b/net/netfilter/nft_meta.c @@ -952,7 +952,7 @@ static int nft_secmark_obj_init(const struct nft_ctx *ctx, if (tb[NFTA_SECMARK_CTX] == NULL) return -EINVAL; - priv->ctx = nla_strdup(tb[NFTA_SECMARK_CTX], GFP_KERNEL); + priv->ctx = nla_strdup(tb[NFTA_SECMARK_CTX], GFP_KERNEL_ACCOUNT); if (!priv->ctx) return -ENOMEM; diff --git a/net/netfilter/nft_numgen.c b/net/netfilter/nft_numgen.c index 7d29db7c2ac0f..bd058babfc820 100644 --- a/net/netfilter/nft_numgen.c +++ b/net/netfilter/nft_numgen.c @@ -66,7 +66,7 @@ static int nft_ng_inc_init(const struct nft_ctx *ctx, if (priv->offset + priv->modulus - 1 < priv->offset) return -EOVERFLOW; - priv->counter = kmalloc(sizeof(*priv->counter), GFP_KERNEL); + priv->counter = kmalloc(sizeof(*priv->counter), GFP_KERNEL_ACCOUNT); if (!priv->counter) return -ENOMEM; diff --git a/net/netfilter/nft_set_pipapo.c b/net/netfilter/nft_set_pipapo.c index eb4c4a4ac7ace..7be342b495f5f 100644 --- a/net/netfilter/nft_set_pipapo.c +++ b/net/netfilter/nft_set_pipapo.c @@ -663,7 +663,7 @@ static int pipapo_realloc_mt(struct nft_pipapo_field *f, check_add_overflow(rules, extra, &rules_alloc)) return -EOVERFLOW; - new_mt = kvmalloc_array(rules_alloc, sizeof(*new_mt), GFP_KERNEL); + new_mt = kvmalloc_array(rules_alloc, sizeof(*new_mt), GFP_KERNEL_ACCOUNT); if (!new_mt) return -ENOMEM; @@ -936,7 +936,7 @@ static void pipapo_lt_bits_adjust(struct nft_pipapo_field *f) return; } - new_lt = kvzalloc(lt_size + NFT_PIPAPO_ALIGN_HEADROOM, GFP_KERNEL); + new_lt = kvzalloc(lt_size + NFT_PIPAPO_ALIGN_HEADROOM, GFP_KERNEL_ACCOUNT); if (!new_lt) return; @@ -1212,7 +1212,7 @@ static int pipapo_realloc_scratch(struct nft_pipapo_match *clone, scratch = kzalloc_node(struct_size(scratch, map, bsize_max * 2) + NFT_PIPAPO_ALIGN_HEADROOM, - GFP_KERNEL, cpu_to_node(i)); + GFP_KERNEL_ACCOUNT, cpu_to_node(i)); if (!scratch) { /* On failure, there's no need to undo previous * allocations: this means that some scratch maps have @@ -1427,7 +1427,7 @@ static struct nft_pipapo_match *pipapo_clone(struct nft_pipapo_match *old) struct nft_pipapo_match *new; int i; - new = kmalloc(struct_size(new, f, old->field_count), GFP_KERNEL); + new = kmalloc(struct_size(new, f, old->field_count), GFP_KERNEL_ACCOUNT); if (!new) return NULL; @@ -1457,7 +1457,7 @@ static struct nft_pipapo_match *pipapo_clone(struct nft_pipapo_match *old) new_lt = kvzalloc(src->groups * NFT_PIPAPO_BUCKETS(src->bb) * src->bsize * sizeof(*dst->lt) + NFT_PIPAPO_ALIGN_HEADROOM, - GFP_KERNEL); + GFP_KERNEL_ACCOUNT); if (!new_lt) goto out_lt; @@ -1470,7 +1470,8 @@ static struct nft_pipapo_match *pipapo_clone(struct nft_pipapo_match *old) if (src->rules > 0) { dst->mt = kvmalloc_array(src->rules_alloc, - sizeof(*src->mt), GFP_KERNEL); + sizeof(*src->mt), + GFP_KERNEL_ACCOUNT); if (!dst->mt) goto out_mt; diff --git a/net/netfilter/nft_tunnel.c b/net/netfilter/nft_tunnel.c index 60a76e6e348e7..5c6ed68cc6e05 100644 --- a/net/netfilter/nft_tunnel.c +++ b/net/netfilter/nft_tunnel.c @@ -509,13 +509,14 @@ static int nft_tunnel_obj_init(const struct nft_ctx *ctx, return err; } - md = metadata_dst_alloc(priv->opts.len, METADATA_IP_TUNNEL, GFP_KERNEL); + md = metadata_dst_alloc(priv->opts.len, METADATA_IP_TUNNEL, + GFP_KERNEL_ACCOUNT); if (!md) return -ENOMEM; memcpy(&md->u.tun_info, &info, sizeof(info)); #ifdef CONFIG_DST_CACHE - err = dst_cache_init(&md->u.tun_info.dst_cache, GFP_KERNEL); + err = dst_cache_init(&md->u.tun_info.dst_cache, GFP_KERNEL_ACCOUNT); if (err < 0) { metadata_dst_free(md); return err; From 8af79d3edb5fd2dce35ea0a71595b6d4f9962350 Mon Sep 17 00:00:00 2001 From: Florian Westphal Date: Wed, 18 Sep 2024 15:13:39 +0200 Subject: [PATCH 33/35] netfilter: nfnetlink_queue: remove old clash resolution logic For historical reasons there are two clash resolution spots in netfilter, one in nfnetlink_queue and one in conntrack core. nfnetlink_queue one was added first: If a colliding entry is found, NAT NAT transformation is reversed by calling nat engine again with altered tuple. See commit 368982cd7d1b ("netfilter: nfnetlink_queue: resolve clash for unconfirmed conntracks") for details. One problem is that nf_reroute() won't take an action if the queueing doesn't occur in the OUTPUT hook, i.e. when queueing in forward or postrouting, packet will be sent via the wrong path. Another problem is that the scenario addressed (2nd UDP packet sent with identical addresses while first packet is still being processed) can also occur without any nfqueue involvement due to threaded resolvers doing A and AAAA requests back-to-back. This lead us to add clash resolution logic to the conntrack core, see commit 6a757c07e51f ("netfilter: conntrack: allow insertion of clashing entries"). Instead of fixing the nfqueue based logic, lets remove it and let conntrack core handle this instead. Retain the ->update hook for sake of nfqueue based conntrack helpers. We could axe this hook completely but we'd have to split confirm and helper logic again, see commit ee04805ff54a ("netfilter: conntrack: make conntrack userspace helpers work again"). This SHOULD NOT be backported to kernels earlier than v5.6; they lack adequate clash resolution handling. Patch was originally written by Pablo Neira Ayuso. Reported-by: Antonio Ojea Closes: https://bugzilla.netfilter.org/show_bug.cgi?id=1766 Signed-off-by: Florian Westphal Tested-by: Antonio Ojea Signed-off-by: Pablo Neira Ayuso --- include/linux/netfilter.h | 4 -- net/netfilter/nf_conntrack_core.c | 85 ------------------------------- net/netfilter/nf_nat_core.c | 1 - 3 files changed, 90 deletions(-) diff --git a/include/linux/netfilter.h b/include/linux/netfilter.h index 2683b2b77612a..2b8aac2c70ada 100644 --- a/include/linux/netfilter.h +++ b/include/linux/netfilter.h @@ -376,15 +376,11 @@ int nf_route(struct net *net, struct dst_entry **dst, struct flowi *fl, struct nf_conn; enum nf_nat_manip_type; struct nlattr; -enum ip_conntrack_dir; struct nf_nat_hook { int (*parse_nat_setup)(struct nf_conn *ct, enum nf_nat_manip_type manip, const struct nlattr *attr); void (*decode_session)(struct sk_buff *skb, struct flowi *fl); - unsigned int (*manip_pkt)(struct sk_buff *skb, struct nf_conn *ct, - enum nf_nat_manip_type mtype, - enum ip_conntrack_dir dir); void (*remove_nat_bysrc)(struct nf_conn *ct); }; diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c index 7c63fbfb8c1dd..9db3e2b0b1c34 100644 --- a/net/netfilter/nf_conntrack_core.c +++ b/net/netfilter/nf_conntrack_core.c @@ -2197,80 +2197,6 @@ static void nf_conntrack_attach(struct sk_buff *nskb, const struct sk_buff *skb) nf_conntrack_get(skb_nfct(nskb)); } -static int __nf_conntrack_update(struct net *net, struct sk_buff *skb, - struct nf_conn *ct, - enum ip_conntrack_info ctinfo) -{ - const struct nf_nat_hook *nat_hook; - struct nf_conntrack_tuple_hash *h; - struct nf_conntrack_tuple tuple; - unsigned int status; - int dataoff; - u16 l3num; - u8 l4num; - - l3num = nf_ct_l3num(ct); - - dataoff = get_l4proto(skb, skb_network_offset(skb), l3num, &l4num); - if (dataoff <= 0) - return NF_DROP; - - if (!nf_ct_get_tuple(skb, skb_network_offset(skb), dataoff, l3num, - l4num, net, &tuple)) - return NF_DROP; - - if (ct->status & IPS_SRC_NAT) { - memcpy(tuple.src.u3.all, - ct->tuplehash[IP_CT_DIR_ORIGINAL].tuple.src.u3.all, - sizeof(tuple.src.u3.all)); - tuple.src.u.all = - ct->tuplehash[IP_CT_DIR_ORIGINAL].tuple.src.u.all; - } - - if (ct->status & IPS_DST_NAT) { - memcpy(tuple.dst.u3.all, - ct->tuplehash[IP_CT_DIR_ORIGINAL].tuple.dst.u3.all, - sizeof(tuple.dst.u3.all)); - tuple.dst.u.all = - ct->tuplehash[IP_CT_DIR_ORIGINAL].tuple.dst.u.all; - } - - h = nf_conntrack_find_get(net, nf_ct_zone(ct), &tuple); - if (!h) - return NF_ACCEPT; - - /* Store status bits of the conntrack that is clashing to re-do NAT - * mangling according to what it has been done already to this packet. - */ - status = ct->status; - - nf_ct_put(ct); - ct = nf_ct_tuplehash_to_ctrack(h); - nf_ct_set(skb, ct, ctinfo); - - nat_hook = rcu_dereference(nf_nat_hook); - if (!nat_hook) - return NF_ACCEPT; - - if (status & IPS_SRC_NAT) { - unsigned int verdict = nat_hook->manip_pkt(skb, ct, - NF_NAT_MANIP_SRC, - IP_CT_DIR_ORIGINAL); - if (verdict != NF_ACCEPT) - return verdict; - } - - if (status & IPS_DST_NAT) { - unsigned int verdict = nat_hook->manip_pkt(skb, ct, - NF_NAT_MANIP_DST, - IP_CT_DIR_ORIGINAL); - if (verdict != NF_ACCEPT) - return verdict; - } - - return NF_ACCEPT; -} - /* This packet is coming from userspace via nf_queue, complete the packet * processing after the helper invocation in nf_confirm(). */ @@ -2334,17 +2260,6 @@ static int nf_conntrack_update(struct net *net, struct sk_buff *skb) if (!ct) return NF_ACCEPT; - if (!nf_ct_is_confirmed(ct)) { - int ret = __nf_conntrack_update(net, skb, ct, ctinfo); - - if (ret != NF_ACCEPT) - return ret; - - ct = nf_ct_get(skb, &ctinfo); - if (!ct) - return NF_ACCEPT; - } - return nf_confirm_cthelper(skb, ct, ctinfo); } diff --git a/net/netfilter/nf_nat_core.c b/net/netfilter/nf_nat_core.c index d9ea2c26f3098..4085c436e3062 100644 --- a/net/netfilter/nf_nat_core.c +++ b/net/netfilter/nf_nat_core.c @@ -1324,7 +1324,6 @@ static const struct nf_nat_hook nat_hook = { #ifdef CONFIG_XFRM .decode_session = __nf_nat_decode_session, #endif - .manip_pkt = nf_nat_manip_pkt, .remove_nat_bysrc = nf_nat_cleanup_conntrack, }; From e306e3739d9a35c89176281f9ff6c600fcc859a4 Mon Sep 17 00:00:00 2001 From: Florian Westphal Date: Wed, 18 Sep 2024 15:16:33 +0200 Subject: [PATCH 34/35] kselftest: add test for nfqueue induced conntrack race The netfilter race happens when two packets with the same tuple are DNATed and enqueued with nfqueue in the postrouting hook. Once one of the packet is reinjected it may be DNATed again to a different destination, but the conntrack entry remains the same and the return packet was dropped. Based on earlier patch from Antonio Ojea. Link: https://bugzilla.netfilter.org/show_bug.cgi?id=1766 Co-developed-by: Antonio Ojea Signed-off-by: Antonio Ojea Signed-off-by: Florian Westphal Signed-off-by: Pablo Neira Ayuso --- .../selftests/net/netfilter/nft_queue.sh | 92 ++++++++++++++++++- 1 file changed, 91 insertions(+), 1 deletion(-) diff --git a/tools/testing/selftests/net/netfilter/nft_queue.sh b/tools/testing/selftests/net/netfilter/nft_queue.sh index d66e3c4dfec62..a9d109fcc15c2 100755 --- a/tools/testing/selftests/net/netfilter/nft_queue.sh +++ b/tools/testing/selftests/net/netfilter/nft_queue.sh @@ -31,7 +31,7 @@ modprobe -q sctp trap cleanup EXIT -setup_ns ns1 ns2 nsrouter +setup_ns ns1 ns2 ns3 nsrouter TMPFILE0=$(mktemp) TMPFILE1=$(mktemp) @@ -48,6 +48,7 @@ if ! ip link add veth0 netns "$nsrouter" type veth peer name eth0 netns "$ns1" > exit $ksft_skip fi ip link add veth1 netns "$nsrouter" type veth peer name eth0 netns "$ns2" +ip link add veth2 netns "$nsrouter" type veth peer name eth0 netns "$ns3" ip -net "$nsrouter" link set veth0 up ip -net "$nsrouter" addr add 10.0.1.1/24 dev veth0 @@ -57,8 +58,13 @@ ip -net "$nsrouter" link set veth1 up ip -net "$nsrouter" addr add 10.0.2.1/24 dev veth1 ip -net "$nsrouter" addr add dead:2::1/64 dev veth1 nodad +ip -net "$nsrouter" link set veth2 up +ip -net "$nsrouter" addr add 10.0.3.1/24 dev veth2 +ip -net "$nsrouter" addr add dead:3::1/64 dev veth2 nodad + ip -net "$ns1" link set eth0 up ip -net "$ns2" link set eth0 up +ip -net "$ns3" link set eth0 up ip -net "$ns1" addr add 10.0.1.99/24 dev eth0 ip -net "$ns1" addr add dead:1::99/64 dev eth0 nodad @@ -70,6 +76,11 @@ ip -net "$ns2" addr add dead:2::99/64 dev eth0 nodad ip -net "$ns2" route add default via 10.0.2.1 ip -net "$ns2" route add default via dead:2::1 +ip -net "$ns3" addr add 10.0.3.99/24 dev eth0 +ip -net "$ns3" addr add dead:3::99/64 dev eth0 nodad +ip -net "$ns3" route add default via 10.0.3.1 +ip -net "$ns3" route add default via dead:3::1 + load_ruleset() { local name=$1 local prio=$2 @@ -473,6 +484,83 @@ EOF check_output_files "$TMPINPUT" "$TMPFILE1" "sctp output" } +udp_listener_ready() +{ + ss -S -N "$1" -uln -o "sport = :12345" | grep -q 12345 +} + +output_files_written() +{ + test -s "$1" && test -s "$2" +} + +test_udp_ct_race() +{ + ip netns exec "$nsrouter" nft -f /dev/stdin < "$TMPFILE1" + :> "$TMPFILE2" + + timeout 10 ip netns exec "$ns2" socat UDP-LISTEN:12345,fork OPEN:"$TMPFILE1",trunc & + local rpid1=$! + + timeout 10 ip netns exec "$ns3" socat UDP-LISTEN:12345,fork OPEN:"$TMPFILE2",trunc & + local rpid2=$! + + ip netns exec "$nsrouter" ./nf_queue -q 12 -d 1000 & + local nfqpid=$! + + busywait "$BUSYWAIT_TIMEOUT" udp_listener_ready "$ns2" + busywait "$BUSYWAIT_TIMEOUT" udp_listener_ready "$ns3" + busywait "$BUSYWAIT_TIMEOUT" nf_queue_wait "$nsrouter" 12 + + # Send two packets, one should end up in ns1, other in ns2. + # This is because nfqueue will delay packet for long enough so that + # second packet will not find existing conntrack entry. + echo "Packet 1" | ip netns exec "$ns1" socat STDIN UDP-DATAGRAM:10.6.6.6:12345,bind=0.0.0.0:55221 + echo "Packet 2" | ip netns exec "$ns1" socat STDIN UDP-DATAGRAM:10.6.6.6:12345,bind=0.0.0.0:55221 + + busywait 10000 output_files_written "$TMPFILE1" "$TMPFILE2" + + kill "$nfqpid" + + if ! ip netns exec "$nsrouter" bash -c 'conntrack -L -p udp --dport 12345 2>/dev/null | wc -l | grep -q "^1"'; then + echo "FAIL: Expected One udp conntrack entry" + ip netns exec "$nsrouter" conntrack -L -p udp --dport 12345 + ret=1 + fi + + if ! ip netns exec "$nsrouter" nft delete table inet udpq; then + echo "FAIL: Could not delete udpq table" + ret=1 + return + fi + + NUMLINES1=$(wc -l < "$TMPFILE1") + NUMLINES2=$(wc -l < "$TMPFILE2") + + if [ "$NUMLINES1" -ne 1 ] || [ "$NUMLINES2" -ne 1 ]; then + ret=1 + echo "FAIL: uneven udp packet distribution: $NUMLINES1 $NUMLINES2" + echo -n "$TMPFILE1: ";cat "$TMPFILE1" + echo -n "$TMPFILE2: ";cat "$TMPFILE2" + return + fi + + echo "PASS: both udp receivers got one packet each" +} + test_queue_removal() { read tainted_then < /proc/sys/kernel/tainted @@ -512,6 +600,7 @@ EOF ip netns exec "$nsrouter" sysctl net.ipv6.conf.all.forwarding=1 > /dev/null ip netns exec "$nsrouter" sysctl net.ipv4.conf.veth0.forwarding=1 > /dev/null ip netns exec "$nsrouter" sysctl net.ipv4.conf.veth1.forwarding=1 > /dev/null +ip netns exec "$nsrouter" sysctl net.ipv4.conf.veth2.forwarding=1 > /dev/null load_ruleset "filter" 0 @@ -549,6 +638,7 @@ test_tcp_localhost_connectclose test_tcp_localhost_requeue test_sctp_forward test_sctp_output +test_udp_ct_race # should be last, adds vrf device in ns1 and changes routes test_icmp_vrf From fc786304ad9803e8bb86b8599bc64d1c1746c75f Mon Sep 17 00:00:00 2001 From: Phil Sutter Date: Thu, 19 Sep 2024 14:40:00 +0200 Subject: [PATCH 35/35] selftests: netfilter: Avoid hanging ipvs.sh If the client can't reach the server, the latter remains listening forever. Kill it after 5s of waiting. Fixes: 867d2190799a ("selftests: netfilter: add ipvs test script") Signed-off-by: Phil Sutter Signed-off-by: Pablo Neira Ayuso --- tools/testing/selftests/net/netfilter/ipvs.sh | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/tools/testing/selftests/net/netfilter/ipvs.sh b/tools/testing/selftests/net/netfilter/ipvs.sh index 4ceee9fb39495..d3edb16cd4b3f 100755 --- a/tools/testing/selftests/net/netfilter/ipvs.sh +++ b/tools/testing/selftests/net/netfilter/ipvs.sh @@ -97,7 +97,7 @@ cleanup() { } server_listen() { - ip netns exec "$ns2" socat -u -4 TCP-LISTEN:8080,reuseaddr STDOUT > "${outfile}" & + ip netns exec "$ns2" timeout 5 socat -u -4 TCP-LISTEN:8080,reuseaddr STDOUT > "${outfile}" & server_pid=$! sleep 0.2 }