Skip to content

Commit

Permalink
Merge branch 'splice-net-switch-over-users-of-sendpage-and-remove-it'
Browse files Browse the repository at this point in the history
David Howells says:

====================
splice, net: Switch over users of sendpage() and remove it

Here's the final set of patches towards the removal of sendpage.  All the
drivers that use sendpage() get switched over to using sendmsg() with
MSG_SPLICE_PAGES.

The following changes are made:

 (1) Make the protocol drivers behave according to MSG_MORE, not
     MSG_SENDPAGE_NOTLAST.  The latter is restricted to turning on MSG_MORE
     in the sendpage() wrappers.

 (2) Fix ocfs2 to allocate its global protocol buffers with folio_alloc()
     rather than kzalloc() so as not to invoke the !sendpage_ok warning in
     skb_splice_from_iter().

 (3) Make ceph/rds, skb_send_sock, dlm, nvme, smc, ocfs2, drbd and iscsi
     use sendmsg(), not sendpage and make them specify MSG_MORE instead of
     MSG_SENDPAGE_NOTLAST.

 (4) Kill off sendpage and clean up MSG_SENDPAGE_NOTLAST.

Link: https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/commit/?id=51c78a4d532efe9543a4df019ff405f05c6157f6 # part 1
Link: https://lore.kernel.org/r/20230616161301.622169-1-dhowells@redhat.com/ # v1
Link: https://lore.kernel.org/r/20230617121146.716077-1-dhowells@redhat.com/ # v2
Link: https://lore.kernel.org/r/20230620145338.1300897-1-dhowells@redhat.com/ # v3
====================

Link: https://lore.kernel.org/r/20230623225513.2732256-1-dhowells@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  • Loading branch information
Jakub Kicinski committed Jun 24, 2023
2 parents b545a13 + b848b26 commit 9ae440b
Show file tree
Hide file tree
Showing 88 changed files with 230 additions and 748 deletions.
10 changes: 5 additions & 5 deletions Documentation/bpf/map_sockmap.rst
Original file line number Diff line number Diff line change
Expand Up @@ -240,11 +240,11 @@ offsets into ``msg``, respectively.
If a program of type ``BPF_PROG_TYPE_SK_MSG`` is run on a ``msg`` it can only
parse data that the (``data``, ``data_end``) pointers have already consumed.
For ``sendmsg()`` hooks this is likely the first scatterlist element. But for
calls relying on the ``sendpage`` handler (e.g., ``sendfile()``) this will be
the range (**0**, **0**) because the data is shared with user space and by
default the objective is to avoid allowing user space to modify data while (or
after) BPF verdict is being decided. This helper can be used to pull in data
and to set the start and end pointers to given values. Data will be copied if
calls relying on MSG_SPLICE_PAGES (e.g., ``sendfile()``) this will be the
range (**0**, **0**) because the data is shared with user space and by default
the objective is to avoid allowing user space to modify data while (or after)
BPF verdict is being decided. This helper can be used to pull in data and to
set the start and end pointers to given values. Data will be copied if
necessary (i.e., if data was not linear and if start and end pointers do not
point to the same chunk).

Expand Down
2 changes: 0 additions & 2 deletions Documentation/filesystems/locking.rst
Original file line number Diff line number Diff line change
Expand Up @@ -521,8 +521,6 @@ prototypes::
int (*fsync) (struct file *, loff_t start, loff_t end, int datasync);
int (*fasync) (int, struct file *, int);
int (*lock) (struct file *, int, struct file_lock *);
ssize_t (*sendpage) (struct file *, struct page *, int, size_t,
loff_t *, int);
unsigned long (*get_unmapped_area)(struct file *, unsigned long,
unsigned long, unsigned long, unsigned long);
int (*check_flags)(int);
Expand Down
1 change: 0 additions & 1 deletion Documentation/filesystems/vfs.rst
Original file line number Diff line number Diff line change
Expand Up @@ -1086,7 +1086,6 @@ This describes how the VFS can manipulate an open file. As of kernel
int (*fsync) (struct file *, loff_t, loff_t, int datasync);
int (*fasync) (int, struct file *, int);
int (*lock) (struct file *, int, struct file_lock *);
ssize_t (*sendpage) (struct file *, struct page *, int, size_t, loff_t *, int);
unsigned long (*get_unmapped_area)(struct file *, unsigned long, unsigned long, unsigned long, unsigned long);
int (*check_flags)(int);
int (*flock) (struct file *, int, struct file_lock *);
Expand Down
4 changes: 2 additions & 2 deletions Documentation/networking/scaling.rst
Original file line number Diff line number Diff line change
Expand Up @@ -269,8 +269,8 @@ a single application thread handles flows with many different flow hashes.
rps_sock_flow_table is a global flow table that contains the *desired* CPU
for flows: the CPU that is currently processing the flow in userspace.
Each table value is a CPU index that is updated during calls to recvmsg
and sendmsg (specifically, inet_recvmsg(), inet_sendmsg(), inet_sendpage()
and tcp_splice_read()).
and sendmsg (specifically, inet_recvmsg(), inet_sendmsg() and
tcp_splice_read()).

When the scheduler moves a thread to a new CPU while it has outstanding
receive packets on the old CPU, packets may arrive out of order. To
Expand Down
28 changes: 0 additions & 28 deletions crypto/af_alg.c
Original file line number Diff line number Diff line change
Expand Up @@ -482,7 +482,6 @@ static const struct proto_ops alg_proto_ops = {
.listen = sock_no_listen,
.shutdown = sock_no_shutdown,
.mmap = sock_no_mmap,
.sendpage = sock_no_sendpage,
.sendmsg = sock_no_sendmsg,
.recvmsg = sock_no_recvmsg,

Expand Down Expand Up @@ -1106,33 +1105,6 @@ int af_alg_sendmsg(struct socket *sock, struct msghdr *msg, size_t size,
}
EXPORT_SYMBOL_GPL(af_alg_sendmsg);

/**
* af_alg_sendpage - sendpage system call handler
* @sock: socket of connection to user space to write to
* @page: data to send
* @offset: offset into page to begin sending
* @size: length of data
* @flags: message send/receive flags
*
* This is a generic implementation of sendpage to fill ctx->tsgl_list.
*/
ssize_t af_alg_sendpage(struct socket *sock, struct page *page,
int offset, size_t size, int flags)
{
struct bio_vec bvec;
struct msghdr msg = {
.msg_flags = flags | MSG_SPLICE_PAGES,
};

if (flags & MSG_SENDPAGE_NOTLAST)
msg.msg_flags |= MSG_MORE;

bvec_set_page(&bvec, page, size, offset);
iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bvec, 1, size);
return sock_sendmsg(sock, &msg);
}
EXPORT_SYMBOL_GPL(af_alg_sendpage);

/**
* af_alg_free_resources - release resources required for crypto request
* @areq: Request holding the TX and RX SGL
Expand Down
22 changes: 4 additions & 18 deletions crypto/algif_aead.c
Original file line number Diff line number Diff line change
Expand Up @@ -9,10 +9,10 @@
* The following concept of the memory management is used:
*
* The kernel maintains two SGLs, the TX SGL and the RX SGL. The TX SGL is
* filled by user space with the data submitted via sendpage. Filling up
* the TX SGL does not cause a crypto operation -- the data will only be
* tracked by the kernel. Upon receipt of one recvmsg call, the caller must
* provide a buffer which is tracked with the RX SGL.
* filled by user space with the data submitted via sendmsg (maybe with
* MSG_SPLICE_PAGES). Filling up the TX SGL does not cause a crypto operation
* -- the data will only be tracked by the kernel. Upon receipt of one recvmsg
* call, the caller must provide a buffer which is tracked with the RX SGL.
*
* During the processing of the recvmsg operation, the cipher request is
* allocated and prepared. As part of the recvmsg operation, the processed
Expand Down Expand Up @@ -370,7 +370,6 @@ static struct proto_ops algif_aead_ops = {

.release = af_alg_release,
.sendmsg = aead_sendmsg,
.sendpage = af_alg_sendpage,
.recvmsg = aead_recvmsg,
.poll = af_alg_poll,
};
Expand Down Expand Up @@ -422,18 +421,6 @@ static int aead_sendmsg_nokey(struct socket *sock, struct msghdr *msg,
return aead_sendmsg(sock, msg, size);
}

static ssize_t aead_sendpage_nokey(struct socket *sock, struct page *page,
int offset, size_t size, int flags)
{
int err;

err = aead_check_key(sock);
if (err)
return err;

return af_alg_sendpage(sock, page, offset, size, flags);
}

static int aead_recvmsg_nokey(struct socket *sock, struct msghdr *msg,
size_t ignored, int flags)
{
Expand Down Expand Up @@ -461,7 +448,6 @@ static struct proto_ops algif_aead_ops_nokey = {

.release = af_alg_release,
.sendmsg = aead_sendmsg_nokey,
.sendpage = aead_sendpage_nokey,
.recvmsg = aead_recvmsg_nokey,
.poll = af_alg_poll,
};
Expand Down
2 changes: 0 additions & 2 deletions crypto/algif_rng.c
Original file line number Diff line number Diff line change
Expand Up @@ -174,7 +174,6 @@ static struct proto_ops algif_rng_ops = {
.bind = sock_no_bind,
.accept = sock_no_accept,
.sendmsg = sock_no_sendmsg,
.sendpage = sock_no_sendpage,

.release = af_alg_release,
.recvmsg = rng_recvmsg,
Expand All @@ -192,7 +191,6 @@ static struct proto_ops __maybe_unused algif_rng_test_ops = {
.mmap = sock_no_mmap,
.bind = sock_no_bind,
.accept = sock_no_accept,
.sendpage = sock_no_sendpage,

.release = af_alg_release,
.recvmsg = rng_test_recvmsg,
Expand Down
14 changes: 0 additions & 14 deletions crypto/algif_skcipher.c
Original file line number Diff line number Diff line change
Expand Up @@ -194,7 +194,6 @@ static struct proto_ops algif_skcipher_ops = {

.release = af_alg_release,
.sendmsg = skcipher_sendmsg,
.sendpage = af_alg_sendpage,
.recvmsg = skcipher_recvmsg,
.poll = af_alg_poll,
};
Expand Down Expand Up @@ -246,18 +245,6 @@ static int skcipher_sendmsg_nokey(struct socket *sock, struct msghdr *msg,
return skcipher_sendmsg(sock, msg, size);
}

static ssize_t skcipher_sendpage_nokey(struct socket *sock, struct page *page,
int offset, size_t size, int flags)
{
int err;

err = skcipher_check_key(sock);
if (err)
return err;

return af_alg_sendpage(sock, page, offset, size, flags);
}

static int skcipher_recvmsg_nokey(struct socket *sock, struct msghdr *msg,
size_t ignored, int flags)
{
Expand Down Expand Up @@ -285,7 +272,6 @@ static struct proto_ops algif_skcipher_ops_nokey = {

.release = af_alg_release,
.sendmsg = skcipher_sendmsg_nokey,
.sendpage = skcipher_sendpage_nokey,
.recvmsg = skcipher_recvmsg_nokey,
.poll = af_alg_poll,
};
Expand Down
12 changes: 8 additions & 4 deletions drivers/block/drbd/drbd_main.c
Original file line number Diff line number Diff line change
Expand Up @@ -1540,6 +1540,8 @@ static int _drbd_send_page(struct drbd_peer_device *peer_device, struct page *pa
int offset, size_t size, unsigned msg_flags)
{
struct socket *socket = peer_device->connection->data.socket;
struct msghdr msg = { .msg_flags = msg_flags, };
struct bio_vec bvec;
int len = size;
int err = -EIO;

Expand All @@ -1549,15 +1551,17 @@ static int _drbd_send_page(struct drbd_peer_device *peer_device, struct page *pa
* put_page(); and would cause either a VM_BUG directly, or
* __page_cache_release a page that would actually still be referenced
* by someone, leading to some obscure delayed Oops somewhere else. */
if (drbd_disable_sendpage || !sendpage_ok(page))
return _drbd_no_send_page(peer_device, page, offset, size, msg_flags);
if (!drbd_disable_sendpage && sendpage_ok(page))
msg.msg_flags |= MSG_NOSIGNAL | MSG_SPLICE_PAGES;

msg_flags |= MSG_NOSIGNAL;
drbd_update_congested(peer_device->connection);
do {
int sent;

sent = socket->ops->sendpage(socket, page, offset, len, msg_flags);
bvec_set_page(&bvec, page, offset, len);
iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bvec, 1, len);

sent = sock_sendmsg(socket, &msg);
if (sent <= 0) {
if (sent == -EAGAIN) {
if (we_should_drop_the_connection(peer_device->connection, socket))
Expand Down
5 changes: 2 additions & 3 deletions drivers/infiniband/sw/siw/siw_qp_tx.c
Original file line number Diff line number Diff line change
Expand Up @@ -325,8 +325,7 @@ static int siw_tcp_sendpages(struct socket *s, struct page **page, int offset,
{
struct bio_vec bvec;
struct msghdr msg = {
.msg_flags = (MSG_MORE | MSG_DONTWAIT | MSG_SENDPAGE_NOTLAST |
MSG_SPLICE_PAGES),
.msg_flags = (MSG_MORE | MSG_DONTWAIT | MSG_SPLICE_PAGES),
};
struct sock *sk = s->sk;
int i = 0, rv = 0, sent = 0;
Expand All @@ -335,7 +334,7 @@ static int siw_tcp_sendpages(struct socket *s, struct page **page, int offset,
size_t bytes = min_t(size_t, PAGE_SIZE - offset, size);

if (size + offset <= PAGE_SIZE)
msg.msg_flags &= ~MSG_SENDPAGE_NOTLAST;
msg.msg_flags &= ~MSG_MORE;

tcp_rate_check_app_limited(sk);
bvec_set_page(&bvec, page[i], bytes, offset);
Expand Down
2 changes: 0 additions & 2 deletions drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls.h
Original file line number Diff line number Diff line change
Expand Up @@ -569,8 +569,6 @@ int chtls_sendmsg(struct sock *sk, struct msghdr *msg, size_t size);
int chtls_recvmsg(struct sock *sk, struct msghdr *msg,
size_t len, int flags, int *addr_len);
void chtls_splice_eof(struct socket *sock);
int chtls_sendpage(struct sock *sk, struct page *page,
int offset, size_t size, int flags);
int send_tx_flowc_wr(struct sock *sk, int compl,
u32 snd_nxt, u32 rcv_nxt);
void chtls_tcp_push(struct sock *sk, int flags);
Expand Down
14 changes: 0 additions & 14 deletions drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls_io.c
Original file line number Diff line number Diff line change
Expand Up @@ -1246,20 +1246,6 @@ void chtls_splice_eof(struct socket *sock)
release_sock(sk);
}

int chtls_sendpage(struct sock *sk, struct page *page,
int offset, size_t size, int flags)
{
struct msghdr msg = { .msg_flags = flags | MSG_SPLICE_PAGES, };
struct bio_vec bvec;

if (flags & MSG_SENDPAGE_NOTLAST)
msg.msg_flags |= MSG_MORE;

bvec_set_page(&bvec, page, size, offset);
iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bvec, 1, size);
return chtls_sendmsg(sk, &msg, size);
}

static void chtls_select_window(struct sock *sk)
{
struct chtls_sock *csk = rcu_dereference_sk_user_data(sk);
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -607,7 +607,6 @@ static void __init chtls_init_ulp_ops(void)
chtls_cpl_prot.shutdown = chtls_shutdown;
chtls_cpl_prot.sendmsg = chtls_sendmsg;
chtls_cpl_prot.splice_eof = chtls_splice_eof;
chtls_cpl_prot.sendpage = chtls_sendpage;
chtls_cpl_prot.recvmsg = chtls_recvmsg;
chtls_cpl_prot.setsockopt = chtls_setsockopt;
chtls_cpl_prot.getsockopt = chtls_getsockopt;
Expand Down
49 changes: 27 additions & 22 deletions drivers/nvme/host/tcp.c
Original file line number Diff line number Diff line change
Expand Up @@ -997,25 +997,28 @@ static int nvme_tcp_try_send_data(struct nvme_tcp_request *req)
u32 h2cdata_left = req->h2cdata_left;

while (true) {
struct bio_vec bvec;
struct msghdr msg = {
.msg_flags = MSG_DONTWAIT | MSG_SPLICE_PAGES,
};
struct page *page = nvme_tcp_req_cur_page(req);
size_t offset = nvme_tcp_req_cur_offset(req);
size_t len = nvme_tcp_req_cur_length(req);
bool last = nvme_tcp_pdu_last_send(req, len);
int req_data_sent = req->data_sent;
int ret, flags = MSG_DONTWAIT;
int ret;

if (last && !queue->data_digest && !nvme_tcp_queue_more(queue))
flags |= MSG_EOR;
msg.msg_flags |= MSG_EOR;
else
flags |= MSG_MORE | MSG_SENDPAGE_NOTLAST;
msg.msg_flags |= MSG_MORE;

if (sendpage_ok(page)) {
ret = kernel_sendpage(queue->sock, page, offset, len,
flags);
} else {
ret = sock_no_sendpage(queue->sock, page, offset, len,
flags);
}
if (!sendpage_ok(page))
msg.msg_flags &= ~MSG_SPLICE_PAGES,

bvec_set_page(&bvec, page, len, offset);
iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bvec, 1, len);
ret = sock_sendmsg(queue->sock, &msg);
if (ret <= 0)
return ret;

Expand Down Expand Up @@ -1054,22 +1057,24 @@ static int nvme_tcp_try_send_cmd_pdu(struct nvme_tcp_request *req)
{
struct nvme_tcp_queue *queue = req->queue;
struct nvme_tcp_cmd_pdu *pdu = nvme_tcp_req_cmd_pdu(req);
struct bio_vec bvec;
struct msghdr msg = { .msg_flags = MSG_DONTWAIT | MSG_SPLICE_PAGES, };
bool inline_data = nvme_tcp_has_inline_data(req);
u8 hdgst = nvme_tcp_hdgst_len(queue);
int len = sizeof(*pdu) + hdgst - req->offset;
int flags = MSG_DONTWAIT;
int ret;

if (inline_data || nvme_tcp_queue_more(queue))
flags |= MSG_MORE | MSG_SENDPAGE_NOTLAST;
msg.msg_flags |= MSG_MORE;
else
flags |= MSG_EOR;
msg.msg_flags |= MSG_EOR;

if (queue->hdr_digest && !req->offset)
nvme_tcp_hdgst(queue->snd_hash, pdu, sizeof(*pdu));

ret = kernel_sendpage(queue->sock, virt_to_page(pdu),
offset_in_page(pdu) + req->offset, len, flags);
bvec_set_virt(&bvec, (void *)pdu + req->offset, len);
iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bvec, 1, len);
ret = sock_sendmsg(queue->sock, &msg);
if (unlikely(ret <= 0))
return ret;

Expand All @@ -1093,6 +1098,8 @@ static int nvme_tcp_try_send_data_pdu(struct nvme_tcp_request *req)
{
struct nvme_tcp_queue *queue = req->queue;
struct nvme_tcp_data_pdu *pdu = nvme_tcp_req_data_pdu(req);
struct bio_vec bvec;
struct msghdr msg = { .msg_flags = MSG_DONTWAIT | MSG_MORE, };
u8 hdgst = nvme_tcp_hdgst_len(queue);
int len = sizeof(*pdu) - req->offset + hdgst;
int ret;
Expand All @@ -1101,13 +1108,11 @@ static int nvme_tcp_try_send_data_pdu(struct nvme_tcp_request *req)
nvme_tcp_hdgst(queue->snd_hash, pdu, sizeof(*pdu));

if (!req->h2cdata_left)
ret = kernel_sendpage(queue->sock, virt_to_page(pdu),
offset_in_page(pdu) + req->offset, len,
MSG_DONTWAIT | MSG_MORE | MSG_SENDPAGE_NOTLAST);
else
ret = sock_no_sendpage(queue->sock, virt_to_page(pdu),
offset_in_page(pdu) + req->offset, len,
MSG_DONTWAIT | MSG_MORE);
msg.msg_flags |= MSG_SPLICE_PAGES;

bvec_set_virt(&bvec, (void *)pdu + req->offset, len);
iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bvec, 1, len);
ret = sock_sendmsg(queue->sock, &msg);
if (unlikely(ret <= 0))
return ret;

Expand Down
Loading

0 comments on commit 9ae440b

Please sign in to comment.