Merge branch 'hv_netvsc-fix-error-nvsp_rndis_pkt_complete-error-status-2'

Michael Kelley says:

====================
hv_netvsc: Fix error "nvsp_rndis_pkt_complete error status: 2"

Starting with commit dca5161 in the 6.3 kernel, the Linux driver
for Hyper-V synthetic networking (netvsc) occasionally reports
"nvsp_rndis_pkt_complete error status: 2".[1] This error indicates
that Hyper-V has rejected a network packet transmit request from the
guest, and the outgoing network packet is dropped. Higher level
network protocols presumably recover and resend the packet so there is
no functional error, but performance is slightly impacted. Commit
dca5161 is not the cause of the error -- it only added reporting
of an error that was already happening without any notice. The error
has presumably been present since the netvsc driver was originally
introduced into Linux.

This patch set fixes the root cause of the problem, which is that the
netvsc driver in Linux may send an incorrectly formatted VMBus message
to Hyper-V when transmitting the network packet. The incorrect
formatting occurs when the rndis header of the VMBus message crosses a
page boundary due to how the Linux skb head memory is aligned. In such
a case, two PFNs are required to describe the location of the rndis
header, even though they are contiguous in guest physical address
(GPA) space. Hyper-V requires that the two PFNs be in a single "GPA range"
data structure, but current netvsc code puts each PFN in its own GPA
range, which Hyper-V rejects as an error in the case of the rndis
header.
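
As an illustration, the sketch below contrasts the two encodings for an
rndis header that straddles a page boundary. The struct is a simplified
stand-in for the real hv_mpb_array/hv_page_buffer definitions in
include/linux/hyperv.h, and all offset/length/PFN values are made up.

#include <stdint.h>
#include <stdio.h>

struct gpa_range {			/* simplified illustration only */
	uint32_t offset;		/* offset into the first page   */
	uint32_t len;			/* total bytes in the range     */
	uint32_t pfn_cnt;
	uint64_t pfn[2];		/* sequential guest PFNs        */
};

int main(void)
{
	/* A 0x100-byte rndis header at offset 0xF80 of PFN 0x1234 spills
	 * 0x80 bytes into PFN 0x1235.
	 *
	 * What Hyper-V expects: ONE range naming both contiguous PFNs.
	 */
	struct gpa_range good = {
		.offset = 0xF80, .len = 0x100,
		.pfn_cnt = 2, .pfn = { 0x1234, 0x1235 },
	};

	/* What current netvsc sends: TWO single-PFN ranges, which Hyper-V
	 * rejects for the rndis header (completion status 2).
	 */
	struct gpa_range bad[2] = {
		{ .offset = 0xF80, .len = 0x80, .pfn_cnt = 1, .pfn = { 0x1234 } },
		{ .offset = 0x000, .len = 0x80, .pfn_cnt = 1, .pfn = { 0x1235 } },
	};

	printf("good: 1 range, %u pfns\n", good.pfn_cnt);
	printf("bad:  2 ranges, %u + %u pfns\n", bad[0].pfn_cnt, bad[1].pfn_cnt);
	return 0;
}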

The incorrect formatting occurs only for larger packets that netvsc
must transmit via a VMBus "GPA Direct" message. There's no problem
when netvsc transmits a smaller packet by copying it into a pre-
allocated send buffer slot because the pre-allocated slots don't have
page crossing issues.

After commit 14ad6ed in the 6.14 kernel, the error occurs much
more frequently in VMs with 16 or more vCPUs. It may occur every few
seconds, or even more frequently, in an ssh session that outputs a lot
of text. Commit 14ad6ed subtly changes how skb head memory is
allocated, making it much more likely that the rndis header will cross
a page boundary when the vCPU count is 16 or more.  The changes in
commit 14ad6ed are perfectly valid -- they just had the side
effect of making the netvsc bug more prominent.

One fix is to check for adjacent PFNs in vmbus_sendpacket_pagebuffer()
and just combine them into a single GPA range. Such a fix is very
contained. But conceptually it is fixing the problem at the wrong
level. So this patch set takes the broader approach of maintaining
the already known grouping of contiguous PFNs at a higher level in
the netvsc driver code, and propagating that grouping down to the
creation of the VMBus message to send to Hyper-V. Maintaining the
grouping fixes this problem, and has the added benefit of allowing
netvsc_dma_map() to make fewer calls to dma_map_single() to do bounce
buffering in CoCo VMs.
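
For comparison, the narrower fix mentioned above might have looked roughly
like the sketch below -- a hypothetical helper that is not part of this
series (which instead removes vmbus_sendpacket_pagebuffer() entirely). It
coalesces adjacent single-page hv_page_buffer entries that describe
physically contiguous memory before the GPA ranges are built:

static u32 coalesce_page_buffers(const struct hv_page_buffer *pb, u32 count,
				 struct hv_page_buffer *merged)
{
	u32 out = 0;
	u32 i;

	for (i = 0; i < count; i++) {
		/* Fold pb[i] into the previous output entry when it starts
		 * at offset 0 of the next sequential PFN and the previous
		 * input entry ends exactly on a page boundary. Assumes each
		 * input entry spans at most one page, as the old
		 * fill_pg_buf() guaranteed.
		 */
		if (i > 0 && pb[i].offset == 0 &&
		    pb[i].pfn == pb[i - 1].pfn + 1 &&
		    pb[i - 1].offset + pb[i - 1].len == HV_HYP_PAGE_SIZE)
			merged[out - 1].len += pb[i].len;
		else
			merged[out++] = pb[i];
	}

	return out;
}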

Patch 1 is a preparatory change to allow vmbus_sendpacket_mpb_desc()
to specify multiple GPA ranges. In current code
vmbus_sendpacket_mpb_desc() is used only by the storvsc synthetic SCSI
driver, and it always creates a single GPA range.

Patch 2 updates the netvsc driver to use vmbus_sendpacket_mpb_desc()
instead of vmbus_sendpacket_pagebuffer(). Because the higher levels of
netvsc still don't group contiguous PFNs, this patch is functionally
neutral. The VMBus message to Hyper-V still has many GPA ranges, each
with a single PFN. But it lays the groundwork for the next patch.

Patch 3 changes the higher levels of netvsc to preserve the already
known grouping of contiguous PFNs. When the contiguous groupings are
passed to vmbus_sendpacket_mpb_desc(), GPA ranges containing multiple
PFNs are produced, as expected by Hyper-V. This is the point at which the
core problem is fixed.
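
A worked example of the fixed path (all values made up): the rndis header
is recorded as a single hv_page_buffer entry even when it crosses a page
boundary, and netvsc_build_mpb_array() expands that entry into one GPA
range with sequential PFNs. The standalone sketch below mirrors that
expansion, using the same rounding as the kernel's HVPFN_UP():

#include <stdint.h>
#include <stdio.h>

#define HV_HYP_PAGE_SIZE   4096u
#define HV_HYP_PAGE_SHIFT  12
/* same rounding as the kernel's HVPFN_UP() */
#define HVPFN_UP(x)  (((x) + HV_HYP_PAGE_SIZE - 1) >> HV_HYP_PAGE_SHIFT)

int main(void)
{
	/* Hypothetical placement: a 0x100-byte rndis header starting at
	 * offset 0xF80 of guest PFN 0x1234, crossing into PFN 0x1235.
	 */
	uint32_t offset = 0xF80, len = 0x100;
	uint64_t pfn = 0x1234;
	uint32_t npfns = HVPFN_UP(offset + len);	/* 2 */

	printf("one GPA range: offset=%#x len=%#x pfns:", offset, len);
	for (uint32_t j = 0; j < npfns; j++)
		printf(" %#llx", (unsigned long long)(pfn + j));
	printf("\n");	/* -> 0x1234 0x1235, as Hyper-V requires */
	return 0;
}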

Patches 4 and 5 remove code that is no longer necessary after the
previous patches.

These changes provide a net reduction of about 65 lines of code, which
is an added benefit.

These changes have been tested in normal VMs, in SEV-SNP and TDX CoCo
VMs, and in Dv6-series VMs where the netvsp implementation is in the
OpenHCL paravisor instead of the Hyper-V host.

These changes are built against kernel version 6.15-rc6.

[1] https://bugzilla.kernel.org/show_bug.cgi?id=217503
====================

Link: https://patch.msgid.link/20250513000604.1396-1-mhklinux@outlook.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski committed May 15, 2025
2 parents bf449f3 + 45a442f commit 09db7a4
Showing 7 changed files with 83 additions and 146 deletions.
65 changes: 3 additions & 62 deletions drivers/hv/channel.c
@@ -1077,68 +1077,10 @@ int vmbus_sendpacket(struct vmbus_channel *channel, void *buffer,
EXPORT_SYMBOL(vmbus_sendpacket);

/*
* vmbus_sendpacket_pagebuffer - Send a range of single-page buffer
* packets using a GPADL Direct packet type. This interface allows you
* to control notifying the host. This will be useful for sending
* batched data. Also the sender can control the send flags
* explicitly.
*/
int vmbus_sendpacket_pagebuffer(struct vmbus_channel *channel,
struct hv_page_buffer pagebuffers[],
u32 pagecount, void *buffer, u32 bufferlen,
u64 requestid)
{
int i;
struct vmbus_channel_packet_page_buffer desc;
u32 descsize;
u32 packetlen;
u32 packetlen_aligned;
struct kvec bufferlist[3];
u64 aligned_data = 0;

if (pagecount > MAX_PAGE_BUFFER_COUNT)
return -EINVAL;

/*
* Adjust the size down since vmbus_channel_packet_page_buffer is the
* largest size we support
*/
descsize = sizeof(struct vmbus_channel_packet_page_buffer) -
((MAX_PAGE_BUFFER_COUNT - pagecount) *
sizeof(struct hv_page_buffer));
packetlen = descsize + bufferlen;
packetlen_aligned = ALIGN(packetlen, sizeof(u64));

/* Setup the descriptor */
desc.type = VM_PKT_DATA_USING_GPA_DIRECT;
desc.flags = VMBUS_DATA_PACKET_FLAG_COMPLETION_REQUESTED;
desc.dataoffset8 = descsize >> 3; /* in 8-bytes granularity */
desc.length8 = (u16)(packetlen_aligned >> 3);
desc.transactionid = VMBUS_RQST_ERROR; /* will be updated in hv_ringbuffer_write() */
desc.reserved = 0;
desc.rangecount = pagecount;

for (i = 0; i < pagecount; i++) {
desc.range[i].len = pagebuffers[i].len;
desc.range[i].offset = pagebuffers[i].offset;
desc.range[i].pfn = pagebuffers[i].pfn;
}

bufferlist[0].iov_base = &desc;
bufferlist[0].iov_len = descsize;
bufferlist[1].iov_base = buffer;
bufferlist[1].iov_len = bufferlen;
bufferlist[2].iov_base = &aligned_data;
bufferlist[2].iov_len = (packetlen_aligned - packetlen);

return hv_ringbuffer_write(channel, bufferlist, 3, requestid, NULL);
}
EXPORT_SYMBOL_GPL(vmbus_sendpacket_pagebuffer);

/*
* vmbus_sendpacket_multipagebuffer - Send a multi-page buffer packet
* vmbus_sendpacket_mpb_desc - Send one or more multi-page buffer packets
* using a GPADL Direct packet type.
* The buffer includes the vmbus descriptor.
* The desc argument must include space for the VMBus descriptor. The
* rangecount field must already be set.
*/
int vmbus_sendpacket_mpb_desc(struct vmbus_channel *channel,
struct vmbus_packet_mpb_array *desc,
@@ -1160,7 +1102,6 @@ int vmbus_sendpacket_mpb_desc(struct vmbus_channel *channel,
desc->length8 = (u16)(packetlen_aligned >> 3);
desc->transactionid = VMBUS_RQST_ERROR; /* will be updated in hv_ringbuffer_write() */
desc->reserved = 0;
desc->rangecount = 1;

bufferlist[0].iov_base = desc;
bufferlist[0].iov_len = desc_size;
13 changes: 12 additions & 1 deletion drivers/net/hyperv/hyperv_net.h
@@ -158,7 +158,6 @@ struct hv_netvsc_packet {
u8 cp_partial; /* partial copy into send buffer */

u8 rmsg_size; /* RNDIS header and PPI size */
u8 rmsg_pgcnt; /* page count of RNDIS header and PPI */
u8 page_buf_cnt;

u16 q_idx;
@@ -893,6 +892,18 @@ struct nvsp_message {
sizeof(struct nvsp_message))
#define NETVSC_MIN_IN_MSG_SIZE sizeof(struct vmpacket_descriptor)

/* Maximum # of contiguous data ranges that can make up a transmitted packet.
* Typically it's the max SKB fragments plus 2 for the rndis packet and the
* linear portion of the SKB. But if MAX_SKB_FRAGS is large, the value may
* need to be limited to MAX_PAGE_BUFFER_COUNT, which is the max # of entries
* in a GPA direct packet sent to netvsp over VMBus.
*/
#if MAX_SKB_FRAGS + 2 < MAX_PAGE_BUFFER_COUNT
#define MAX_DATA_RANGES (MAX_SKB_FRAGS + 2)
#else
#define MAX_DATA_RANGES MAX_PAGE_BUFFER_COUNT
#endif

/* Estimated requestor size:
* out_ring_size/min_out_msg_size + in_ring_size/min_in_msg_size
*/
57 changes: 48 additions & 9 deletions drivers/net/hyperv/netvsc.c
@@ -953,8 +953,7 @@ static void netvsc_copy_to_send_buf(struct netvsc_device *net_device,
+ pend_size;
int i;
u32 padding = 0;
u32 page_count = packet->cp_partial ? packet->rmsg_pgcnt :
packet->page_buf_cnt;
u32 page_count = packet->cp_partial ? 1 : packet->page_buf_cnt;
u32 remain;

/* Add padding */
@@ -1055,6 +1054,42 @@ static int netvsc_dma_map(struct hv_device *hv_dev,
return 0;
}

/* Build an "array" of mpb entries describing the data to be transferred
* over VMBus. After the desc header fields, each "array" entry is variable
* size, and each entry starts after the end of the previous entry. The
* "offset" and "len" fields for each entry imply the size of the entry.
*
* The pfns are in HV_HYP_PAGE_SIZE, because all communication with Hyper-V
* uses that granularity, even if the system page size of the guest is larger.
* Each entry in the input "pb" array must describe a contiguous range of
* guest physical memory so that the pfns are sequential if the range crosses
* a page boundary. The offset field must be < HV_HYP_PAGE_SIZE.
*/
static inline void netvsc_build_mpb_array(struct hv_page_buffer *pb,
u32 page_buffer_count,
struct vmbus_packet_mpb_array *desc,
u32 *desc_size)
{
struct hv_mpb_array *mpb_entry = &desc->range;
int i, j;

for (i = 0; i < page_buffer_count; i++) {
u32 offset = pb[i].offset;
u32 len = pb[i].len;

mpb_entry->offset = offset;
mpb_entry->len = len;

for (j = 0; j < HVPFN_UP(offset + len); j++)
mpb_entry->pfn_array[j] = pb[i].pfn + j;

mpb_entry = (struct hv_mpb_array *)&mpb_entry->pfn_array[j];
}

desc->rangecount = page_buffer_count;
*desc_size = (char *)mpb_entry - (char *)desc;
}

static inline int netvsc_send_pkt(
struct hv_device *device,
struct hv_netvsc_packet *packet,
@@ -1097,20 +1132,24 @@

packet->dma_range = NULL;
if (packet->page_buf_cnt) {
struct vmbus_channel_packet_page_buffer desc;
u32 desc_size;

if (packet->cp_partial)
pb += packet->rmsg_pgcnt;
pb++;

ret = netvsc_dma_map(ndev_ctx->device_ctx, packet, pb);
if (ret) {
ret = -EAGAIN;
goto exit;
}

ret = vmbus_sendpacket_pagebuffer(out_channel,
pb, packet->page_buf_cnt,
&nvmsg, sizeof(nvmsg),
req_id);

netvsc_build_mpb_array(pb, packet->page_buf_cnt,
(struct vmbus_packet_mpb_array *)&desc,
&desc_size);
ret = vmbus_sendpacket_mpb_desc(out_channel,
(struct vmbus_packet_mpb_array *)&desc,
desc_size, &nvmsg, sizeof(nvmsg), req_id);
if (ret)
netvsc_dma_unmap(ndev_ctx->device_ctx, packet);
} else {
@@ -1259,7 +1298,7 @@ int netvsc_send(struct net_device *ndev,
packet->send_buf_index = section_index;

if (packet->cp_partial) {
packet->page_buf_cnt -= packet->rmsg_pgcnt;
packet->page_buf_cnt--;
packet->total_data_buflen = msd_len + packet->rmsg_size;
} else {
packet->page_buf_cnt = 0;
62 changes: 14 additions & 48 deletions drivers/net/hyperv/netvsc_drv.c
@@ -326,43 +326,10 @@ static u16 netvsc_select_queue(struct net_device *ndev, struct sk_buff *skb,
return txq;
}

static u32 fill_pg_buf(unsigned long hvpfn, u32 offset, u32 len,
struct hv_page_buffer *pb)
{
int j = 0;

hvpfn += offset >> HV_HYP_PAGE_SHIFT;
offset = offset & ~HV_HYP_PAGE_MASK;

while (len > 0) {
unsigned long bytes;

bytes = HV_HYP_PAGE_SIZE - offset;
if (bytes > len)
bytes = len;
pb[j].pfn = hvpfn;
pb[j].offset = offset;
pb[j].len = bytes;

offset += bytes;
len -= bytes;

if (offset == HV_HYP_PAGE_SIZE && len) {
hvpfn++;
offset = 0;
j++;
}
}

return j + 1;
}

static u32 init_page_array(void *hdr, u32 len, struct sk_buff *skb,
struct hv_netvsc_packet *packet,
struct hv_page_buffer *pb)
{
u32 slots_used = 0;
char *data = skb->data;
int frags = skb_shinfo(skb)->nr_frags;
int i;

@@ -371,28 +338,27 @@ static u32 init_page_array(void *hdr, u32 len, struct sk_buff *skb,
* 2. skb linear data
* 3. skb fragment data
*/
slots_used += fill_pg_buf(virt_to_hvpfn(hdr),
offset_in_hvpage(hdr),
len,
&pb[slots_used]);

pb[0].offset = offset_in_hvpage(hdr);
pb[0].len = len;
pb[0].pfn = virt_to_hvpfn(hdr);
packet->rmsg_size = len;
packet->rmsg_pgcnt = slots_used;

slots_used += fill_pg_buf(virt_to_hvpfn(data),
offset_in_hvpage(data),
skb_headlen(skb),
&pb[slots_used]);
pb[1].offset = offset_in_hvpage(skb->data);
pb[1].len = skb_headlen(skb);
pb[1].pfn = virt_to_hvpfn(skb->data);

for (i = 0; i < frags; i++) {
skb_frag_t *frag = skb_shinfo(skb)->frags + i;
struct hv_page_buffer *cur_pb = &pb[i + 2];
u64 pfn = page_to_hvpfn(skb_frag_page(frag));
u32 offset = skb_frag_off(frag);

slots_used += fill_pg_buf(page_to_hvpfn(skb_frag_page(frag)),
skb_frag_off(frag),
skb_frag_size(frag),
&pb[slots_used]);
cur_pb->offset = offset_in_hvpage(offset);
cur_pb->len = skb_frag_size(frag);
cur_pb->pfn = pfn + (offset >> HV_HYP_PAGE_SHIFT);
}
return slots_used;
return frags + 2;
}

static int count_skb_frag_slots(struct sk_buff *skb)
@@ -483,7 +449,7 @@ static int netvsc_xmit(struct sk_buff *skb, struct net_device *net, bool xdp_tx)
struct net_device *vf_netdev;
u32 rndis_msg_size;
u32 hash;
struct hv_page_buffer pb[MAX_PAGE_BUFFER_COUNT];
struct hv_page_buffer pb[MAX_DATA_RANGES];

/* If VF is present and up then redirect packets to it.
* Skip the VF if it is marked down or has no carrier.
24 changes: 5 additions & 19 deletions drivers/net/hyperv/rndis_filter.c
@@ -225,8 +225,7 @@ static int rndis_filter_send_request(struct rndis_device *dev,
struct rndis_request *req)
{
struct hv_netvsc_packet *packet;
struct hv_page_buffer page_buf[2];
struct hv_page_buffer *pb = page_buf;
struct hv_page_buffer pb;
int ret;

/* Setup the packet to send it */
@@ -235,27 +234,14 @@
packet->total_data_buflen = req->request_msg.msg_len;
packet->page_buf_cnt = 1;

pb[0].pfn = virt_to_phys(&req->request_msg) >>
HV_HYP_PAGE_SHIFT;
pb[0].len = req->request_msg.msg_len;
pb[0].offset = offset_in_hvpage(&req->request_msg);

/* Add one page_buf when request_msg crossing page boundary */
if (pb[0].offset + pb[0].len > HV_HYP_PAGE_SIZE) {
packet->page_buf_cnt++;
pb[0].len = HV_HYP_PAGE_SIZE -
pb[0].offset;
pb[1].pfn = virt_to_phys((void *)&req->request_msg
+ pb[0].len) >> HV_HYP_PAGE_SHIFT;
pb[1].offset = 0;
pb[1].len = req->request_msg.msg_len -
pb[0].len;
}
pb.pfn = virt_to_phys(&req->request_msg) >> HV_HYP_PAGE_SHIFT;
pb.len = req->request_msg.msg_len;
pb.offset = offset_in_hvpage(&req->request_msg);

trace_rndis_send(dev->ndev, 0, &req->request_msg);

rcu_read_lock_bh();
ret = netvsc_send(dev->ndev, packet, NULL, pb, NULL, false);
ret = netvsc_send(dev->ndev, packet, NULL, &pb, NULL, false);
rcu_read_unlock_bh();

return ret;
1 change: 1 addition & 0 deletions drivers/scsi/storvsc_drv.c
@@ -1819,6 +1819,7 @@ static int storvsc_queuecommand(struct Scsi_Host *host, struct scsi_cmnd *scmnd)
return SCSI_MLQUEUE_DEVICE_BUSY;
}

payload->rangecount = 1;
payload->range.len = length;
payload->range.offset = offset_in_hvpg;

Expand Down
7 changes: 0 additions & 7 deletions include/linux/hyperv.h
@@ -1161,13 +1161,6 @@ extern int vmbus_sendpacket(struct vmbus_channel *channel,
enum vmbus_packet_type type,
u32 flags);

extern int vmbus_sendpacket_pagebuffer(struct vmbus_channel *channel,
struct hv_page_buffer pagebuffers[],
u32 pagecount,
void *buffer,
u32 bufferlen,
u64 requestid);

extern int vmbus_sendpacket_mpb_desc(struct vmbus_channel *channel,
struct vmbus_packet_mpb_array *mpb,
u32 desc_size,
