io_uring/rsrc: get rid of per-ring io_rsrc_node list
Work in progress, but get rid of the per-ring serialization of resource
nodes, like registered buffers and files. The main issue is that one
node can otherwise hold up a number of other nodes from getting freed,
which is especially a problem for file resource nodes and networked
workloads where some descriptors may not see activity for a long time.

As an example, instantiate an io_uring ring fd and create a sparse
registered file table; even 2 entries will do. Then create a socket and
register it as fixed file 0, F0 (the "listener"). The number of open
files in the app is now 5, with 0/1/2 being the usual stdin/out/err, 3
being the ring fd, and 4 being the socket. Now add an operation on the
socket that uses slot 0. Finally, loop N times, where each iteration
creates a new socket, registers it as a fixed file, unregisters it, and
then closes the socket. This is roughly what a basic accept loop looks
like.

At the end of this loop, it's not unreasonable to expect that there
would still be 5 open files. Each socket created and registered in the
loop is also unregistered and closed. But since the listener socket
registered first still has references to its resource node due to still
being active, each subsequent socket unregistration is stuck behind it
for reclaim. Hence 5 + N files are still open at that point, where the
N extra files are awaiting a final put held up by the listener socket.

Rewrite the io_rsrc_node handling to NOT rely on serialization. Struct
io_kiocb now gets explicit resource nodes assigned, with each holding a
reference to the parent node. A parent node is either of type FILE or
BUFFER, which are the two types of nodes that exist. A request can have
two nodes assigned, if it's using both registered files and buffers.
Since request issue and task_work completion both happen under the
ring's private lock, no atomics are needed to handle these references;
it's a simple unlocked inc/dec. As before, the registered buffer and
file tables each hold a reference to their registered nodes. The final
put of a node removes it and frees the underlying resource, e.g. unmaps
the buffer or puts the file.

Outside of removing the stall in resource reclaim described above, it
has the following advantages:

1) It's a lot simpler than the previous scheme, and easier to follow.
   No need for specific quiesce handling anymore.

2) There are no resource node allocations in the fast path, all of that
   happens at resource registration time.

3) The structs related to resource handling can all get simplified
   quite a bit, like io_rsrc_node and io_rsrc_data. io_rsrc_put can
   go away completely.

4) Handling of resource tags is much simpler, and doesn't require
   persistent storage: just copy them in one-by-one at registration
   time and assign each to its resource node.

The only real downside is that a request is now explicitly limited to
pinning 2 resources, one file and one buffer, where before just
assigning a resource node to a request would pin all of them. The upside
is that it's easier to follow now, as an individual resource is
explicitly referenced and assigned to the request.

With this in place, the example above ends the loop with exactly 5 open
files, not 5 + N.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe committed Nov 2, 2024
1 parent e410ffc commit 7029acd
Showing 13 changed files with 270 additions and 465 deletions.
10 changes: 3 additions & 7 deletions include/linux/io_uring_types.h
@@ -56,7 +56,7 @@ struct io_wq_work {
};

struct io_file_table {
struct io_fixed_file *files;
struct io_rsrc_node **nodes;
unsigned long *bitmap;
unsigned int alloc_hint;
};
@@ -264,7 +264,6 @@ struct io_ring_ctx {
* Fixed resources fast path, should be accessed only under
* uring_lock, and updated through io_uring_register(2)
*/
struct io_rsrc_node *rsrc_node;
atomic_t cancel_seq;

/*
@@ -277,7 +276,7 @@ struct io_ring_ctx {
struct io_wq_work_list iopoll_list;

struct io_file_table file_table;
struct io_mapped_ubuf **user_bufs;
struct io_rsrc_node **user_bufs;
unsigned nr_user_files;
unsigned nr_user_bufs;

@@ -372,10 +371,7 @@ struct io_ring_ctx {
struct io_rsrc_data *buf_data;

/* protected by ->uring_lock */
struct list_head rsrc_ref_list;
struct io_alloc_cache rsrc_node_cache;
struct wait_queue_head rsrc_quiesce_wq;
unsigned rsrc_quiesce;

u32 pers_next;
struct xarray personalities;
@@ -642,7 +638,7 @@ struct io_kiocb {
__poll_t apoll_events;
};

struct io_rsrc_node *rsrc_node;
struct io_rsrc_node *rsrc_nodes[2];

atomic_t refs;
bool cancel_seq_set;
2 changes: 1 addition & 1 deletion io_uring/fdinfo.c
@@ -176,7 +176,7 @@ __cold void io_uring_show_fdinfo(struct seq_file *m, struct file *file)
}
seq_printf(m, "UserBufs:\t%u\n", ctx->nr_user_bufs);
for (i = 0; has_lock && i < ctx->nr_user_bufs; i++) {
struct io_mapped_ubuf *buf = ctx->user_bufs[i];
struct io_mapped_ubuf *buf = ctx->user_bufs[i]->buf;

seq_printf(m, "%5u: 0x%llx/%u\n", i, buf->ubuf, buf->len);
}
52 changes: 19 additions & 33 deletions io_uring/filetable.c
@@ -38,14 +38,14 @@ static int io_file_bitmap_get(struct io_ring_ctx *ctx)

bool io_alloc_file_tables(struct io_file_table *table, unsigned nr_files)
{
table->files = kvcalloc(nr_files, sizeof(table->files[0]),
GFP_KERNEL_ACCOUNT);
if (unlikely(!table->files))
table->nodes = kvmalloc_array(nr_files, sizeof(struct io_rsrc_node *),
GFP_KERNEL_ACCOUNT | __GFP_ZERO);
if (unlikely(!table->nodes))
return false;

table->bitmap = bitmap_zalloc(nr_files, GFP_KERNEL_ACCOUNT);
if (unlikely(!table->bitmap)) {
kvfree(table->files);
kvfree(table->nodes);
return false;
}

@@ -54,18 +54,17 @@ bool io_alloc_file_tables(struct io_file_table *table, unsigned nr_files)

void io_free_file_tables(struct io_file_table *table)
{
kvfree(table->files);
kvfree(table->nodes);
bitmap_free(table->bitmap);
table->files = NULL;
table->nodes = NULL;
table->bitmap = NULL;
}

static int io_install_fixed_file(struct io_ring_ctx *ctx, struct file *file,
u32 slot_index)
__must_hold(&req->ctx->uring_lock)
{
struct io_fixed_file *file_slot;
int ret;
struct io_rsrc_node *node;

if (io_is_uring_fops(file))
return -EBADF;
@@ -74,22 +73,18 @@ static int io_install_fixed_file(struct io_ring_ctx *ctx, struct file *file,
if (slot_index >= ctx->nr_user_files)
return -EINVAL;

slot_index = array_index_nospec(slot_index, ctx->nr_user_files);
file_slot = io_fixed_file_slot(&ctx->file_table, slot_index);

if (file_slot->file_ptr) {
ret = io_queue_rsrc_removal(ctx->file_data, slot_index,
io_slot_file(file_slot));
if (ret)
return ret;
node = io_rsrc_node_alloc(ctx, IORING_RSRC_FILE);
if (!node)
return -ENOMEM;

file_slot->file_ptr = 0;
} else {
slot_index = array_index_nospec(slot_index, ctx->nr_user_files);
if (ctx->file_table.nodes[slot_index])
io_put_rsrc_node(ctx->file_table.nodes[slot_index]);
else
io_file_bitmap_set(&ctx->file_table, slot_index);
}

*io_get_tag_slot(ctx->file_data, slot_index) = 0;
io_fixed_file_set(file_slot, file);
ctx->file_table.nodes[slot_index] = node;
io_fixed_file_set(node, file);
return 0;
}

@@ -134,25 +129,16 @@ int io_fixed_fd_install(struct io_kiocb *req, unsigned int issue_flags,

int io_fixed_fd_remove(struct io_ring_ctx *ctx, unsigned int offset)
{
struct io_fixed_file *file_slot;
int ret;

if (unlikely(!ctx->file_data))
return -ENXIO;
if (offset >= ctx->nr_user_files)
return -EINVAL;

offset = array_index_nospec(offset, ctx->nr_user_files);
file_slot = io_fixed_file_slot(&ctx->file_table, offset);
if (!file_slot->file_ptr)
if (!ctx->file_table.nodes[offset])
return -EBADF;

ret = io_queue_rsrc_removal(ctx->file_data, offset,
io_slot_file(file_slot));
if (ret)
return ret;

file_slot->file_ptr = 0;
io_put_rsrc_node(ctx->file_table.nodes[offset]);
ctx->file_table.nodes[offset] = NULL;
io_file_bitmap_clear(&ctx->file_table, offset);
return 0;
}
25 changes: 12 additions & 13 deletions io_uring/filetable.h
@@ -34,36 +34,35 @@ static inline void io_file_bitmap_set(struct io_file_table *table, int bit)
table->alloc_hint = bit + 1;
}

static inline struct io_fixed_file *
io_fixed_file_slot(struct io_file_table *table, unsigned i)
{
return &table->files[i];
}

#define FFS_NOWAIT 0x1UL
#define FFS_ISREG 0x2UL
#define FFS_MASK ~(FFS_NOWAIT|FFS_ISREG)

static inline unsigned int io_slot_flags(struct io_fixed_file *slot)
static inline unsigned int io_slot_flags(struct io_rsrc_node *node)
{
return (slot->file_ptr & ~FFS_MASK) << REQ_F_SUPPORT_NOWAIT_BIT;

return (node->file_ptr & ~FFS_MASK) << REQ_F_SUPPORT_NOWAIT_BIT;
}

static inline struct file *io_slot_file(struct io_fixed_file *slot)
static inline struct file *io_slot_file(struct io_rsrc_node *node)
{
return (struct file *)(slot->file_ptr & FFS_MASK);
return (struct file *)(node->file_ptr & FFS_MASK);
}

static inline struct file *io_file_from_index(struct io_file_table *table,
int index)
{
return io_slot_file(io_fixed_file_slot(table, index));
struct io_rsrc_node *node = table->nodes[index];

if (node)
return io_slot_file(node);
return NULL;
}

static inline void io_fixed_file_set(struct io_fixed_file *file_slot,
static inline void io_fixed_file_set(struct io_rsrc_node *node,
struct file *file)
{
file_slot->file_ptr = (unsigned long)file |
node->file_ptr = (unsigned long)file |
(io_file_get_flags(file) >> REQ_F_SUPPORT_NOWAIT_BIT);
}

38 changes: 12 additions & 26 deletions io_uring/io_uring.c
@@ -333,15 +333,13 @@ static __cold struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p)
mutex_init(&ctx->uring_lock);
init_waitqueue_head(&ctx->cq_wait);
init_waitqueue_head(&ctx->poll_wq);
init_waitqueue_head(&ctx->rsrc_quiesce_wq);
spin_lock_init(&ctx->completion_lock);
spin_lock_init(&ctx->timeout_lock);
INIT_WQ_LIST(&ctx->iopoll_list);
INIT_LIST_HEAD(&ctx->io_buffers_comp);
INIT_LIST_HEAD(&ctx->defer_list);
INIT_LIST_HEAD(&ctx->timeout_list);
INIT_LIST_HEAD(&ctx->ltimeout_list);
INIT_LIST_HEAD(&ctx->rsrc_ref_list);
init_llist_head(&ctx->work_llist);
INIT_LIST_HEAD(&ctx->tctx_list);
ctx->submit_state.free_list.next = NULL;
@@ -1415,7 +1413,7 @@ static void io_free_batch_list(struct io_ring_ctx *ctx,
io_clean_op(req);
}
io_put_file(req);
io_put_rsrc_node(ctx, req->rsrc_node);
io_req_put_rsrc_nodes(req);
io_put_task(req->task);

node = req->comp_list.next;
@@ -1878,19 +1876,20 @@ inline struct file *io_file_get_fixed(struct io_kiocb *req, int fd,
unsigned int issue_flags)
{
struct io_ring_ctx *ctx = req->ctx;
struct io_fixed_file *slot;
struct io_rsrc_node *node;
struct file *file = NULL;

io_ring_submit_lock(ctx, issue_flags);

if (unlikely((unsigned int)fd >= ctx->nr_user_files))
goto out;
fd = array_index_nospec(fd, ctx->nr_user_files);
slot = io_fixed_file_slot(&ctx->file_table, fd);
if (!req->rsrc_node)
__io_req_set_rsrc_node(req, ctx);
req->flags |= io_slot_flags(slot);
file = io_slot_file(slot);
node = ctx->file_table.nodes[fd];
if (node) {
io_req_assign_rsrc_node(req, node);
req->flags |= io_slot_flags(node);
file = io_slot_file(node);
}
out:
io_ring_submit_unlock(ctx, issue_flags);
return file;
@@ -2036,7 +2035,8 @@ static int io_init_req(struct io_ring_ctx *ctx, struct io_kiocb *req,
req->flags = (__force io_req_flags_t) sqe_flags;
req->cqe.user_data = READ_ONCE(sqe->user_data);
req->file = NULL;
req->rsrc_node = NULL;
req->rsrc_nodes[IORING_RSRC_FILE] = NULL;
req->rsrc_nodes[IORING_RSRC_BUFFER] = NULL;
req->task = current;
req->cancel_seq_set = false;

@@ -2718,15 +2718,10 @@ static void io_req_caches_free(struct io_ring_ctx *ctx)
static __cold void io_ring_ctx_free(struct io_ring_ctx *ctx)
{
io_sq_thread_finish(ctx);
/* __io_rsrc_put_work() may need uring_lock to progress, wait w/o it */
if (WARN_ON_ONCE(!list_empty(&ctx->rsrc_ref_list)))
return;

mutex_lock(&ctx->uring_lock);
if (ctx->buf_data)
__io_sqe_buffers_unregister(ctx);
if (ctx->file_data)
__io_sqe_files_unregister(ctx);
io_sqe_buffers_unregister(ctx);
io_sqe_files_unregister(ctx);
io_cqring_overflow_kill(ctx);
io_eventfd_unregister(ctx);
io_alloc_cache_free(&ctx->apoll_cache, kfree);
@@ -2743,11 +2738,6 @@ static __cold void io_ring_ctx_free(struct io_ring_ctx *ctx)
if (ctx->submitter_task)
put_task_struct(ctx->submitter_task);

/* there are no registered resources left, nobody uses it */
if (ctx->rsrc_node)
io_rsrc_node_destroy(ctx, ctx->rsrc_node);

WARN_ON_ONCE(!list_empty(&ctx->rsrc_ref_list));
WARN_ON_ONCE(!list_empty(&ctx->ltimeout_list));

io_alloc_cache_free(&ctx->rsrc_node_cache, kfree);
@@ -3729,10 +3719,6 @@ static __cold int io_uring_create(unsigned entries, struct io_uring_params *p,
if (ret)
goto err;

ret = io_rsrc_init(ctx);
if (ret)
goto err;

p->features = IORING_FEAT_SINGLE_MMAP | IORING_FEAT_NODROP |
IORING_FEAT_SUBMIT_STABLE | IORING_FEAT_RW_CUR_POS |
IORING_FEAT_CUR_PERSONALITY | IORING_FEAT_FAST_POLL |
11 changes: 6 additions & 5 deletions io_uring/net.c
@@ -1342,24 +1342,25 @@ static int io_send_zc_import(struct io_kiocb *req, unsigned int issue_flags)

if (sr->flags & IORING_RECVSEND_FIXED_BUF) {
struct io_ring_ctx *ctx = req->ctx;
struct io_mapped_ubuf *imu;
struct io_rsrc_node *node;
int idx;

ret = -EFAULT;
io_ring_submit_lock(ctx, issue_flags);
if (sr->buf_index < ctx->nr_user_bufs) {
idx = array_index_nospec(sr->buf_index, ctx->nr_user_bufs);
imu = READ_ONCE(ctx->user_bufs[idx]);
io_req_set_rsrc_node(sr->notif, ctx);
node = ctx->user_bufs[idx];
io_req_assign_rsrc_node(sr->notif, node);
ret = 0;
}
io_ring_submit_unlock(ctx, issue_flags);

if (unlikely(ret))
return ret;

ret = io_import_fixed(ITER_SOURCE, &kmsg->msg.msg_iter, imu,
(u64)(uintptr_t)sr->buf, sr->len);
ret = io_import_fixed(ITER_SOURCE, &kmsg->msg.msg_iter,
node->buf, (u64)(uintptr_t)sr->buf,
sr->len);
if (unlikely(ret))
return ret;
kmsg->msg.sg_from_iter = io_sg_from_iter;
6 changes: 3 additions & 3 deletions io_uring/nop.c
@@ -61,15 +61,15 @@ int io_nop(struct io_kiocb *req, unsigned int issue_flags)
}
if (nop->flags & IORING_NOP_FIXED_BUFFER) {
struct io_ring_ctx *ctx = req->ctx;
struct io_mapped_ubuf *imu;
struct io_rsrc_node *node;
int idx;

ret = -EFAULT;
io_ring_submit_lock(ctx, issue_flags);
if (nop->buffer < ctx->nr_user_bufs) {
idx = array_index_nospec(nop->buffer, ctx->nr_user_bufs);
imu = READ_ONCE(ctx->user_bufs[idx]);
io_req_set_rsrc_node(req, ctx);
node = READ_ONCE(ctx->user_bufs[idx]);
io_req_assign_rsrc_node(req, node);
ret = 0;
}
io_ring_submit_unlock(ctx, issue_flags);
3 changes: 2 additions & 1 deletion io_uring/notif.c
@@ -117,7 +117,8 @@ struct io_kiocb *io_alloc_notif(struct io_ring_ctx *ctx)
notif->file = NULL;
notif->task = current;
io_get_task_refs(1);
notif->rsrc_node = NULL;
notif->rsrc_nodes[IORING_RSRC_FILE] = NULL;
notif->rsrc_nodes[IORING_RSRC_BUFFER] = NULL;

nd = io_notif_to_data(notif);
nd->zc_report = false;
