Work in progress, but get rid of the per-ring serialization of resource
nodes, like registered buffers and files. Main issue here is that one
node can otherwise hold up a bunch of other nodes from getting freed,
which is especially a problem for file resource nodes and networked
workloads where some descriptors may not see activity in a long time.
As an example, instantiate an io_uring ring fd and create a sparse
registered file table. Even 2 will do. Then create a socket and register
it as fixed file 0, F0. The number of open files in the app is now 5,
with 0/1/2 being the usual stdin/out/err, 3 being the ring fd, and 4
being the socket. Register this socket (eg "the listener") in slot 0 of
the registered file table. Now add an operation on the socket that uses
slot 0. Finally, loop N times, where each loop creates a new socket,
registers said socket as a file, then unregisters the socket, and
finally closes the socket. This is roughly similar to what a basic
accept loop would look like.
At the end of this loop, it's not unreasonable to expect that there
would still be 5 open files. Each socket created and registered in the
loop is also unregistered and closed. But since the listener socket
registered first still has references to its resource node due to still
being active, each subsequent socket unregistration is stuck behind it
for reclaim. Hence 5 + N files are still open at that point, where N is
awaiting the final put held up by the listener socket.
Rewrite the io_rsrc_node handling to NOT rely on serialization. Struct
io_kiocb now gets explicit resource nodes assigned, with each holding a
reference to the parent node. A parent node is either of type FILE or
BUFFER, which are the two types of nodes that exist. A request can have
two nodes assigned, if it's using both registered files and buffers.
Since request issue and task_work completion is both under the ring
private lock, no atomics are needed to handle these references. It's a
simple unlocked inc/dec. As before, the registered buffer or file table
each hold a reference as well to the registered nodes. Final put of the
node will remove the node and free the underlying resource, eg unmap the
buffer or put the file.
Outside of removing the stall in resource reclaim described above, it
has the following advantages:
1) It's a lot simpler than the previous scheme, and easier to follow.
No need to specific quiesce handling anymore.
2) There are no resource node allocations in the fast path, all of that
happens at resource registration time.
3) The structs related to resource handling can all get simplified
quite a bit, like io_rsrc_node and io_rsrc_data. io_rsrc_put can
go away completely.
4) Handling of resource tags is much simpler, and doesn't require
persistent storage as it can simply get assigned up front at
registration time. Just copy them in one-by-one at registration time
and assign to the resource node.
The only real downside is that a request is now explicitly limited to
pinning 2 resources, one file and one buffer, where before just
assigning a resource node to a request would pin all of them. The upside
is that it's easier to follow now, as an individual resource is
explicitly referenced and assigned to the request.
With this in place, the above mentioned example will be using exactly 5
files at the end of the loop, not N.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Doesn't matter right now as there's still some bytes left for it, but
let's prepare for the io_kiocb potentially growing and add a specific
freeptr offset for it.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Generally applications have 1 or a few waits of waiting, yet they pass
in a struct io_uring_getevents_arg every time. This needs to get copied
and, in turn, the timeout value needs to get copied.
Rather than do this for every invocation, allow the application to
register a fixed set of wait regions that can simply be indexed when
asking the kernel to wait on events.
At ring setup time, the application can register a number of these wait
regions and initialize region/index 0 upfront:
struct io_uring_reg_wait *reg;
reg = io_uring_setup_reg_wait(ring, nr_regions, &ret);
/* set timeout and mark as set, sigmask/sigmask_sz as needed */
reg->ts.tv_sec = 0;
reg->ts.tv_nsec = 100000;
reg->flags = IORING_REG_WAIT_TS;
where nr_regions >= 1 && nr_regions <= PAGE_SIZE / sizeof(*reg). The
above initializes index 0, but 63 other regions can be initialized,
if needed. Now, instead of doing:
struct __kernel_timespec timeout = { .tv_nsec = 100000, };
io_uring_submit_and_wait_timeout(ring, &cqe, nr, &t, NULL);
to wait for events for each submit_and_wait, or just wait, operation, it
can just reference the above region at offset 0 and do:
io_uring_submit_and_wait_reg(ring, &cqe, nr, 0);
to achieve the same goal of waiting 100usec without needing to copy
both struct io_uring_getevents_arg (24b) and struct __kernel_timeout
(16b) for each invocation. Struct io_uring_reg_wait looks as follows:
struct io_uring_reg_wait {
struct __kernel_timespec ts;
__u32 min_wait_usec;
__u32 flags;
__u64 sigmask;
__u32 sigmask_sz;
__u32 pad[3];
__u64 pad2[2];
};
embedding the timeout itself in the region, rather than passing it as
a pointer as well. Note that the signal mask is still passed as a
pointer, both for compatability reasons, but also because there doesn't
seem to be a lot of high frequency waits scenarios that involve setting
and resetting the signal mask for each wait.
The application is free to modify any region before a wait call, or it
can use keep multiple regions with different settings to avoid needing to
modify the same one for wait calls. Up to a page size of regions is mapped
by default, allowing PAGE_SIZE / 64 available regions for use.
The registered region must fit within a page. On a 4kb page size system,
that allows for 64 wait regions if a full page is used, as the size of
struct io_uring_reg_wait is 64b. The region registered must be aligned
to io_uring_reg_wait in size. It's valid to register less than 64
entries.
In network performance testing with zero-copy, this reduced the time
spent waiting on the TX side from 3.12% to 0.3% and the RX side from 4.4%
to 0.3%.
Wait regions are fixed for the lifetime of the ring - once registered,
they are persistent until the ring is torn down. The regions support
minimum wait timeout as well as the regular waits.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In scenarios where a high frequency of wait events are seen, the copy
of the struct io_uring_getevents_arg is quite noticeable in the
profiles in terms of time spent. It can be seen as up to 3.5-4.5%.
Rewrite the copy-in logic, saving about 0.5% of the time.
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This avoids intermediate storage for turning a __kernel_timespec
user pointer into an on-stack struct timespec64, only then to turn it
into a ktime_t.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Once a ring has been created, the size of the CQ and SQ rings are fixed.
Usually this isn't a problem on the SQ ring side, as it merely controls
the available number of requests that can be submitted in a single
system call, and there's rarely a need to change that.
For the CQ ring, it's a different story. For most efficient use of
io_uring, it's important that the CQ ring never overflows. This means
that applications must size it for the worst case scenario, which can
be wasteful.
Add IORING_REGISTER_RESIZE_RINGS, which allows an application to resize
the existing rings. It takes a struct io_uring_params argument, the same
one which is used to setup the ring initially, and resizes rings
according to the sizes given.
Certain properties are always inherited from the original ring setup,
like SQE128/CQE32 and other setup options. The implementation only
allows flag associated with how the CQ ring is sized and clamped.
Existing unconsumed SQE and CQE entries are copied as part of the
process. If either the SQ or CQ resized destination ring cannot hold the
entries already present in the source rings, then the operation is failed
with -EOVERFLOW. Any register op holds ->uring_lock, which prevents new
submissions, and the internal mapping holds the completion lock as well
across moving CQ ring state.
To prevent races between mmap and ring resizing, add a mutex that's
solely used to serialize ring resize and mmap. mmap_sem can't be used
here, as as fork'ed process may be doing mmaps on the ring as well.
The ctx->resize_lock is held across mmap operations, and the resize
will grab it before swapping out the already mapped new data.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Abstract out a io_uring_fill_params() helper, which fills out the
necessary bits of struct io_uring_params. Add it to io_uring.h as well,
in preparation for having another internal user of it.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In preparation for needing this somewhere else, move the definitions
for the maximum CQ and SQ ring size into io_uring.h. Make the
rings_size() helper available as well, and have it take just the setup
flags argument rather than the fill ring pointer. That's all that is
needed.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We have too many helpers posting CQEs, instead of tracing completion
events before filling in a CQE and thus having to pass all the data,
set the CQE first, pass it to the tracing helper and let it extract
everything it needs.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/b83c1ca9ee5aed2df0f3bb743bf5ed699cce4c86.1729267437.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Convert to using kvmalloc/kfree() for the hash tables, and while at it,
make it handle low memory situations better.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
io_uring maintains two hash lists of inflight requests:
1) ctx->cancel_table_locked. This is used when the caller has the
ctx->uring_lock held already. This is only an issue side parameter,
as removal or task_work will always have it held.
2) ctx->cancel_table. This is used when the issuer does NOT have the
ctx->uring_lock held, and relies on the table spinlocks for access.
However, it's pretty trivial to simply grab the lock in the one spot
where we care about it, for insertion. With that, we can kill the
unlocked table (and get rid of the _locked postfix for the other one).
Signed-off-by: Jens Axboe <axboe@kernel.dk>
some of those used to be needed, some had been cargo-culted for
no reason...
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Apply __force cast to restricted io_req_flags_t type to fix
the following sparse warning:
io_uring/io_uring.c:2026:23: sparse: warning: cast to restricted io_req_flags_t
No functional changes intended.
Signed-off-by: Min-Hua Chen <minhuadotchen@gmail.com>
Link: https://lore.kernel.org/r/20240922104132.157055-1-minhuadotchen@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Exit the percpu ref when cache init fails to free the data memory with
in struct percpu_ref.
Fixes: 206aefde4f ("io_uring: reduce/pack size of io_ring_ctx")
Signed-off-by: Guixin Liu <kanie@linux.alibaba.com>
Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de>
Link: https://lore.kernel.org/r/20240923100512.64638-1-kanie@linux.alibaba.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmbvv30QHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpj3+EACs346FzM8PlZe1GxBZ6OnQX80blwoldAxC
+Abl5xjoJKUgA7rY3lJVBRNR6olA/4I2VD3g8b3RT6lpd/oKzPFg7FOj5Dc/oN+c
Fo6C7zZdr8caokpL4pfwgyG8ZNssQgRg8e0kRSw8A7AMo1zUazqAXtxjRzeMEOLC
1kWRYGdHCbVjx+hRIyX6KKP427Z5nXvcqFOC0BOpd5jDNYVh9WjNNyUE7trkGJ7o
1cjlpaaOURS0yU/4hue6tRnM8LDjaImyTyISvBWzKfKvpc19K1alOQNvHIoIeiBQ
5MgCNkSpbRmUTrydYVEQXl0Cia2d5+0KQsavUB9nZ8M++NftbRr/i26xT8ReZzXI
NjaedDF+MyOKeJaft2ZeKH8GgWolysMBa4e89CveRxosa/6gwHCkkB4UK9b3gaBB
Fij1zh/7fIVG7Tz8yNUDyGe6DzOEol1bn1KnL35/9nuCCRnSAM0vRPwJSkurlQ8B
PqVUS3BArn+LQZmSZ3HJVKOHv2QAY8etqWizvVmu4DB9Ar+uZ6Ur2uwfMN9JAODP
Fm2qVvxS73QlrvisdbnVbTzqBnqh3Rs4mb5my/gCWO1s67qtu3abSJCSzcnyxQdd
yBMDegJxTNv6DErNjPEF4qDODwSTIzswr//kOeLns1EtDGfrK8nxUfIKPQUwLSTO
Y7h2ru83uA==
=goTY
-----END PGP SIGNATURE-----
Merge tag 'for-6.12/io_uring-20240922' of git://git.kernel.dk/linux
Pull more io_uring updates from Jens Axboe:
"Mostly just a set of fixes in here, or little changes that didn't get
included in the initial pull request. This contains:
- Move the SQPOLL napi polling outside the submission lock (Olivier)
- Rename of the "copy buffers" API that got added in the 6.12 merge
window. There's really no copying going on, it's just referencing
the buffers. After a bit of consideration, decided that it was
better to simply rename this to avoid potential confusion (me)
- Shrink struct io_mapped_ubuf from 48 to 32 bytes, by changing it to
start + len tracking rather than having start / end in there, and
by removing the caching of folio_mask when we can just calculate it
from folio_shift when we need it (me)
- Fixes for the SQPOLL affinity checking (me, Felix)
- Fix for how cqring waiting checks for the presence of task_work.
Just check it directly rather than check for a specific
notification mechanism (me)
- Tweak to how request linking is represented in tracing (me)
- Fix a syzbot report that deliberately sets up a huge list of
overflow entries, and then hits rcu stalls when flushing this list.
Just check for the need to preempt, and drop/reacquire locks in the
loop. There's no state maintained over the loop itself, and each
entry is yanked from head-of-list (me)"
* tag 'for-6.12/io_uring-20240922' of git://git.kernel.dk/linux:
io_uring: check if we need to reschedule during overflow flush
io_uring: improve request linking trace
io_uring: check for presence of task_work rather than TIF_NOTIFY_SIGNAL
io_uring/sqpoll: do the napi busy poll outside the submission block
io_uring: clean up a type in io_uring_register_get_file()
io_uring/sqpoll: do not put cpumask on stack
io_uring/sqpoll: retain test for whether the CPU is valid
io_uring/rsrc: change ubuf->ubuf_end to length tracking
io_uring/rsrc: get rid of io_mapped_ubuf->folio_mask
io_uring: rename "copy buffers" to "clone buffers"
In terms of normal application usage, this list will always be empty.
And if an application does overflow a bit, it'll have a few entries.
However, nothing obviously prevents syzbot from running a test case
that generates a ton of overflow entries, and then flushing them can
take quite a while.
Check for needing to reschedule while flushing, and drop our locks and
do so if necessary. There's no state to maintain here as overflows
always prune from head-of-list, hence it's fine to drop and reacquire
the locks at the end of the loop.
Link: https://lore.kernel.org/io-uring/66ed061d.050a0220.29194.0053.GAE@google.com/
Reported-by: syzbot+5fca234bd7eb378ff78e@syzkaller.appspotmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Right now any link trace is listed as being linked after the head
request in the chain, but it's more useful to note explicitly which
request a given new request is chained to. Change the link trace to dump
the tail request so that chains are immediately apparent when looking at
traces.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If some part of the kernel adds task_work that needs executing, in terms
of signaling it'll generally use TWA_SIGNAL or TWA_RESUME. Those two
directly translate to TIF_NOTIFY_SIGNAL or TIF_NOTIFY_RESUME, and can
be used for a variety of use case outside of task_work.
However, io_cqring_wait_schedule() only tests explicitly for
TIF_NOTIFY_SIGNAL. This means it can miss if task_work got added for
the task, but used a different kind of signaling mechanism (or none at
all). Normally this doesn't matter as any task_work will be run once
the task exits to userspace, except if:
1) The ring is setup with DEFER_TASKRUN
2) The local work item may generate normal task_work
For condition 2, this can happen when closing a file and it's the final
put of that file, for example. This can cause stalls where a task is
waiting to make progress inside io_cqring_wait(), but there's nothing else
that will wake it up. Hence change the "should we schedule or loop around"
check to check for the presence of task_work explicitly, rather than just
TIF_NOTIFY_SIGNAL as the mechanism. While in there, also change the
ordering of what type of task_work first in terms of ordering, to both
make it consistent with other task_work runs in io_uring, but also to
better handle the case of defer task_work generating normal task_work,
like in the above example.
Reported-by: Jan Hendrik Farr <kernel@jfarr.cc>
Link: https://github.com/axboe/liburing/issues/1235
Cc: stable@vger.kernel.org
Fixes: 846072f16e ("io_uring: mimimise io_cqring_wait_schedule")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEe7vIQRWZI0iWSE3xu+CwddJFiJoFAmbn5g0ACgkQu+CwddJF
iJq+Uwf/aqnLNEpjUBzwUUhSojCpPnTtiyjv+AILTxoSTHmbu8OvN0W79+Rpbdmk
O4QapAK+BCs+VL2VATwCCufcJ75Z78txO+buQE0DgwluFTIYZ+IwpUMPsK04ln6A
FD1/uvP1QFx60heqcp2c4zWFBUpg4DE6ufx2A5kieO268lFcWLxyVlcdgRU79ZCt
uAcV2yDLk3GvPGfxZwPKEmZUo/FmuSoBv0XgT+eWxmTu/R7hcpFse49OyjBH8Tvb
8d/RCIFgXOr8dTIjtds7eenwB/is4TkRlctezEQ0jO9/JwL/BVOgXZjD1qCtNWqz
is4TWK7VV+vdq1RD+0xC2hV/+uGEwQ==
=+WAm
-----END PGP SIGNATURE-----
Merge tag 'slab-for-6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab
Pull slab updates from Vlastimil Babka:
"This time it's mostly refactoring and improving APIs for slab users in
the kernel, along with some debugging improvements.
- kmem_cache_create() refactoring (Christian Brauner)
Over the years have been growing new parameters to
kmem_cache_create() where most of them are needed only for a small
number of caches - most recently the rcu_freeptr_offset parameter.
To avoid adding new parameters to kmem_cache_create() and adjusting
all its callers, or creating new wrappers such as
kmem_cache_create_rcu(), we can now pass extra parameters using the
new struct kmem_cache_args. Not explicitly initialized fields
default to values interpreted as unused.
kmem_cache_create() is for now a wrapper that works both with the
new form: kmem_cache_create(name, object_size, args, flags) and the
legacy form: kmem_cache_create(name, object_size, align, flags,
ctor)
- kmem_cache_destroy() waits for kfree_rcu()'s in flight (Vlastimil
Babka, Uladislau Rezki)
Since SLOB removal, kfree() is allowed for freeing objects
allocated by kmem_cache_create(). By extension kfree_rcu() as
allowed as well, which can allow converting simple call_rcu()
callbacks that only do kmem_cache_free(), as there was never a
kmem_cache_free_rcu() variant. However, for caches that can be
destroyed e.g. on module removal, the cache owners knew to issue
rcu_barrier() first to wait for the pending call_rcu()'s, and this
is not sufficient for pending kfree_rcu()'s due to its internal
batching optimizations. Ulad has provided a new
kvfree_rcu_barrier() and to make the usage less error-prone,
kmem_cache_destroy() calls it. Additionally, destroying
SLAB_TYPESAFE_BY_RCU caches now again issues rcu_barrier()
synchronously instead of using an async work, because the past
motivation for async work no longer applies. Users of custom
call_rcu() callbacks should however keep calling rcu_barrier()
before cache destruction.
- Debugging use-after-free in SLAB_TYPESAFE_BY_RCU caches (Jann Horn)
Currently, KASAN cannot catch UAFs in such caches as it is legal to
access them within a grace period, and we only track the grace
period when trying to free the underlying slab page. The new
CONFIG_SLUB_RCU_DEBUG option changes the freeing of individual
object to be RCU-delayed, after which KASAN can poison them.
- Delayed memcg charging (Shakeel Butt)
In some cases, the memcg is uknown at allocation time, such as
receiving network packets in softirq context. With
kmem_cache_charge() these may be now charged later when the user
and its memcg is known.
- Misc fixes and improvements (Pedro Falcato, Axel Rasmussen,
Christoph Lameter, Yan Zhen, Peng Fan, Xavier)"
* tag 'slab-for-6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab: (34 commits)
mm, slab: restore kerneldoc for kmem_cache_create()
io_uring: port to struct kmem_cache_args
slab: make __kmem_cache_create() static inline
slab: make kmem_cache_create_usercopy() static inline
slab: remove kmem_cache_create_rcu()
file: port to struct kmem_cache_args
slab: create kmem_cache_create() compatibility layer
slab: port KMEM_CACHE_USERCOPY() to struct kmem_cache_args
slab: port KMEM_CACHE() to struct kmem_cache_args
slab: remove rcu_freeptr_offset from struct kmem_cache
slab: pass struct kmem_cache_args to do_kmem_cache_create()
slab: pull kmem_cache_open() into do_kmem_cache_create()
slab: pass struct kmem_cache_args to create_cache()
slab: port kmem_cache_create_usercopy() to struct kmem_cache_args
slab: port kmem_cache_create_rcu() to struct kmem_cache_args
slab: port kmem_cache_create() to struct kmem_cache_args
slab: add struct kmem_cache_args
slab: s/__kmem_cache_create/do_kmem_cache_create/g
memcg: add charging of already allocated slab objects
mm/slab: Optimize the code logic in find_mergeable()
...
When an io_uring request needs blocking context we offload it to the
io_uring's thread pool called io-wq. We can get there off ->uring_cmd
by returning -EAGAIN, but there is no straightforward way of doing that
from an asynchronous callback. Add a helper that would transfer a
command to a blocking context.
Note, we do an extra hop via task_work before io_queue_iowq(), that's a
limitation of io_uring infra we have that can likely be lifted later
if that would ever become a problem.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/f735f807d7c8ba50c9452c69dfe5d3e9e535037b.1726072086.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In preparation for needing the consumed length, pass in the length being
completed. Unused right now, but will be used when it is possible to
partially consume a buffer.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Expose min_wait_usec in io_uring_getevents_arg, replacing the pad member
that is currently in there. The value is in usecs, which is explained in
the name as well.
Note that if min_wait_usec and a normal timeout is used in conjunction,
the normal timeout is still relative to the base time. For example, if
min_wait_usec is set to 100 and the normal timeout is 1000, the max
total time waited is still 1000. This also means that if the normal
timeout is shorter than min_wait_usec, then only the min_wait_usec will
take effect.
See previous commit for an explanation of how this works.
IORING_FEAT_MIN_TIMEOUT is added as a feature flag for this, as
applications doing submit_and_wait_timeout() style operations will
generally not see the -EINVAL from the wait side as they return the
number of IOs submitted. Only if no IOs are submitted will the -EINVAL
bubble back up to the application.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Waiting for events with io_uring has two knobs that can be set:
1) The number of events to wake for
2) The timeout associated with the event
Waiting will abort when either of those conditions are met, as expected.
This adds support for a third event, which is associated with the number
of events to wait for. Applications generally like to handle batches of
completions, and right now they'd set a number of events to wait for and
the timeout for that. If no events have been received but the timeout
triggers, control is returned to the application and it can wait again.
However, if the application doesn't have anything to do until events are
reaped, then it's possible to make this waiting more efficient.
For example, the application may have a latency time of 50 usecs and
wanting to handle a batch of 8 requests at the time. If it uses 50 usecs
as the timeout, then it'll be doing 20K context switches per second even
if nothing is happening.
This introduces the notion of min batch wait time. If the min batch wait
time expires, then we'll return to userspace if we have any events at all.
If none are available, the general wait time is applied. Any request
arriving after the min batch wait time will cause waiting to stop and
return control to the application.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In preparation for expanding how we handle waits, move the actual
schedule and schedule_timeout() handling into a helper.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Rather than need to pass in 2 or 3 separate arguments, add a struct
to encapsulate the timeout and sigset_t parts of waiting. In preparation
for adding another argument for waiting.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add a new registration opcode IORING_REGISTER_CLOCK, which allows the
user to select which clock id it wants to use with CQ waiting timeouts.
It only allows a subset of all posix clocks and currently supports
CLOCK_MONOTONIC and CLOCK_BOOTTIME.
Suggested-by: Lewis Baker <lewissbaker@gmail.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/98f2bc8a3c36cdf8f0e6a275245e81e903459703.1723039801.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In addition to current relative timeouts for the waiting loop, where the
timespec argument specifies the maximum time it can wait for, add
support for the absolute mode, with the value carrying a CLOCK_MONOTONIC
absolute time until which we should return control back to the user.
Suggested-by: Lewis Baker <lewissbaker@gmail.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/4d5b74d67ada882590b2e42aa3aa7117bbf6b55f.1723039801.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Remove io_napi_adjust_timeout() and move the adjustments out of the
common path into __io_napi_busy_loop(). Now the limit it's calculated
based on struct io_wait_queue::timeout, for which we query current time
another time. The overhead shouldn't be a problem, it's a polling path,
however that can be optimised later by additionally saving the delta
time value in io_cqring_wait().
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/88e14686e245b3b42ff90a3c4d70895d48676206.1723039801.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There is a difference in how io_queue_sqe and io_wq_submit_work treat
error codes they get from io_issue_sqe. The first one fails anything
unknown but latter only fails when the code is negative.
It doesn't make sense to have this discrepancy, align them to the
io_queue_sqe behaviour.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/c550e152bf4a290187f91a4322ddcb5d6d1f2c73.1721819383.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
io_uring_cancel_generic() should retry if any state changes like a
request is completed, however in case of a task exit it only goes for
another loop and avoids schedule() if any tracked (i.e. REQ_F_INFLIGHT)
request got completed.
Let's assume we have a non-tracked request executing in iowq and a
tracked request linked to it. Let's also assume
io_uring_cancel_generic() fails to find and cancel the request, i.e.
via io_run_local_work(), which may happen as io-wq has gaps.
Next, the request logically completes, io-wq still hold a ref but queues
it for completion via tw, which happens in
io_uring_try_cancel_requests(). After, right before prepare_to_wait()
io-wq puts the request, grabs the linked one and tries executes it, e.g.
arms polling. Finally the cancellation loop calls prepare_to_wait(),
there are no tw to run, no tracked request was completed, so the
tctx_inflight() check passes and the task is put to indefinite sleep.
Cc: stable@vger.kernel.org
Fixes: 3f48cf18f8 ("io_uring: unify files and task cancel")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/acac7311f4e02ce3c43293f8f1fda9c705d158f1.1721819383.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmaTgusQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpr+1EAC4I7pRAM341sfmhe/9QQKMM8VzGwy5Tlr1
AFLO3BujRTl6X8S9fQjIjN1coW6u4F42I19+vVlxqvB7CUnqt9VWpexEjxe4K0FR
R+hIZW+fWV9K/eMrcsLcI7oReN5kIihHOzzy3wz0rENoGB5dCl6JAZMHDUCSqP0/
ZJJQ5ut8ah20Y/myHnzP5o4TfdE7nGo73Di2YoE2g3KqeX/dlAKW9+5hqKzzrHhM
2U25k/6KLy0ROzKpy2qW0QRE3pT5udoHLK2ue9+XwXF8JWVTlfVkHBzGY7NstyyT
z07SEzW1q4xV1HdCwGDAU7cL2NJMRXSG0p2WZTm8QyaVTdsZQvEx08GLsVdLvFH5
Gg+oOaxVE+INzW+/Lwz7lFHgq6XEjdAlEAOXDtGkZoni6Rt6iCzFCW6RTf/guy8o
Cub7tatMyegxai9+FTN/oFVoydRR0tsMf0OHrWnLOperh9CaxAwXvmKFeT/UTwiB
KIuIOJop7aThJbiV42a/xwTrEjNMZRv6uVBBEtJX3rxpmIhqTbjcAv9rKMmgtLMk
s6yX1MvYdOLhhEDyoUBX0dJdEETBf3KbnYIwi8kb4Sbkw/ZDgnkmSxFysom61wUF
byAFEpah3ZFR8aES0uNKUE6UHK6i5qqp0Za/n6gA927E/WGCU9ndaS+01gyknog0
8FqFYwruHQ==
=50CO
-----END PGP SIGNATURE-----
Merge tag 'for-6.11/io_uring-20240714' of git://git.kernel.dk/linux
Pull io_uring updates from Jens Axboe:
"Here are the io_uring updates queued up for 6.11.
Nothing major this time around, various minor improvements and
cleanups/fixes. This contains:
- Add bind/listen opcodes. Main motivation is to support direct
descriptors, to avoid needing a regular fd just for doing these two
operations (Gabriel)
- Probe fixes (Gabriel)
- Treat io-wq work flags as atomics. Not fixing a real issue, but may
as well and it silences a KCSAN warning (me)
- Cleanup of rsrc __set_current_state() usage (me)
- Add 64-bit for {m,f}advise operations (me)
- Improve performance of data ring messages (me)
- Fix for ring message overflow posting (Pavel)
- Fix for freezer interaction with TWA_NOTIFY_SIGNAL. Not strictly an
io_uring thing, but since TWA_NOTIFY_SIGNAL was originally added
for faster task_work signaling for io_uring, bundling it with this
pull (Pavel)
- Add Pavel as a co-maintainer
- Various cleanups (me, Thorsten)"
* tag 'for-6.11/io_uring-20240714' of git://git.kernel.dk/linux: (28 commits)
io_uring/net: check socket is valid in io_bind()/io_listen()
kernel: rerun task_work while freezing in get_signal()
io_uring/io-wq: limit retrying worker initialisation
io_uring/napi: Remove unnecessary s64 cast
io_uring/net: cleanup io_recv_finish() bundle handling
io_uring/msg_ring: fix overflow posting
MAINTAINERS: change Pavel Begunkov from io_uring reviewer to maintainer
io_uring/msg_ring: use kmem_cache_free() to free request
io_uring/msg_ring: check for dead submitter task
io_uring/msg_ring: add an alloc cache for io_kiocb entries
io_uring/msg_ring: improve handling of target CQE posting
io_uring: add io_add_aux_cqe() helper
io_uring: add remote task_work execution helper
io_uring/msg_ring: tighten requirement for remote posting
io_uring: Allocate only necessary memory in io_probe
io_uring: Fix probe of disabled operations
io_uring: Introduce IORING_OP_LISTEN
io_uring: Introduce IORING_OP_BIND
net: Split a __sys_listen helper for io_uring
net: Split a __sys_bind helper for io_uring
...
The caller of io_cqring_event_overflow() should be holding the
completion_lock, which is violated by io_msg_tw_complete. There
is only one caller of io_add_aux_cqe(), so just add locking there
for now.
WARNING: CPU: 0 PID: 5145 at io_uring/io_uring.c:703 io_cqring_event_overflow+0x442/0x660 io_uring/io_uring.c:703
RIP: 0010:io_cqring_event_overflow+0x442/0x660 io_uring/io_uring.c:703
<TASK>
__io_post_aux_cqe io_uring/io_uring.c:816 [inline]
io_add_aux_cqe+0x27c/0x320 io_uring/io_uring.c:837
io_msg_tw_complete+0x9d/0x4d0 io_uring/msg_ring.c:78
io_fallback_req_func+0xce/0x1c0 io_uring/io_uring.c:256
process_one_work kernel/workqueue.c:3224 [inline]
process_scheduled_works+0xa2c/0x1830 kernel/workqueue.c:3305
worker_thread+0x86d/0xd40 kernel/workqueue.c:3383
kthread+0x2f0/0x390 kernel/kthread.c:389
ret_from_fork+0x4b/0x80 arch/x86/kernel/process.c:144
ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244
</TASK>
Fixes: f33096a3c9 ("io_uring: add io_add_aux_cqe() helper")
Reported-by: syzbot+f7f9c893345c5c615d34@syzkaller.appspotmail.com
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/c7350d07fefe8cce32b50f57665edbb6355ea8c1.1719927398.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Before SQPOLL was transitioned to managing its own task_work, the core
used TWA_SIGNAL_NO_IPI to ensure that task_work was processed. If not,
we can't be sure that all task_work is processed at SQPOLL thread exit
time.
Fixes: af5d68f889 ("io_uring/sqpoll: manage task_work privately")
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
With slab accounting, allocating and freeing memory has considerable
overhead. Add a basic alloc cache for the io_kiocb allocations that
msg_ring needs to do. Unlike other caches, this one is used by the
sender, grabbing it from the remote ring. When the remote ring gets
the posted completion, it'll free it locally. Hence it is separately
locked, using ctx->msg_lock.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This helper will post a CQE, and can be called from task_work where we
now that the ctx is already properly locked and that deferred
completions will get flushed later on.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
All our task_work handling is targeted at the state in the io_kiocb
itself, which is what it is being used for. However, MSG_RING rolls its
own task_work handling, ignoring how that is usually done.
In preparation for switching MSG_RING to be able to use the normal
task_work handling, add io_req_task_work_add_remote() which allows the
caller to pass in the target io_ring_ctx.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The work flags can be set/accessed from different tasks, both the
originator of the request, and the io-wq workers. While modifications
aren't concurrent, it still makes KMSAN unhappy. There's no real
downside to just making the flag reading/manipulation use proper
atomics here.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
__io_submit_flush_completions() assigns ctx->submit_state to a local
variable and uses it in all but one spot, switch that forgotten
statement to using 'state' as well.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This is pretty nicely abstracted already, but let's move it to a separate
file rather than have it in the main io_uring file. With that, we can
also move the io_ev_fd struct and enum out of global scope.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In some ways, it just "happens to work" currently with using the ops
field for both the free and signaling bit. But it depends on ordering
of operations in terms of freeing and signaling. Clean it up and use the
usual refs == 0 under RCU read side lock to determine if the ev_fd is
still valid, and use the reference to gate the freeing as well.
Fixes: 21a091b970 ("io_uring: signal registered eventfd to process deferred task work")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Only the current owner of a request is allowed to write into req->flags.
Hence, the cancellation path should never touch it. Add a new field
instead of the flag, move it into the 3rd cache line because it should
always be initialised. poll_refs can move further as polling is an
involved process anyway.
It's a minimal patch, in the future we can and should find a better
place for it and remove now unused REQ_F_CANCEL_SEQ.
Fixes: 521223d7c2 ("io_uring/cancel: don't default to setting req->work.cancel_seq")
Cc: stable@vger.kernel.org
Reported-by: Li Shi <sl1589472800@gmail.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/6827b129f8f0ad76fa9d1f0a773de938b240ffab.1718323430.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Since the 5.12 kernel release, nobody has been passing NULL as the
sq_offset pointer. Remove the checks for it being NULL or not, it will
always be valid.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmY/YdYQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpnmVEADBq8QT9Oa3HTIONHwxjmGMOalr7PSrBP89
S6Inv/l+3xDlyolyLh1HIXUC84iS9Ihi2pNC3dZct4fNcpA99H0CFaHDGwZ5rVri
MrFaubZAps1qSzeypqEq3zWGKVUoaYWaOKhuOjye5Ei2tKymbguhDKl1WiKibD21
E9qOYbhSUFdub/xtx9Rv4BS05QW5bHZ2Y/tTFqB8MY4JUsdb9g/deVZkyGUQYRSd
40mDallRldjQQTQ8iU4H6/ORdGIN/90aLPbmzMdFtQcymnmRyid3rOEwhwWYe4NO
ljnI8m1SJQilZz1d5oHBXBB5QubVptY1JWxbk8GQCSmOU5wrCq+ARCJXUtBXwniJ
K4VFsGm9MkZcc5vsIwIzvsrk8DODla6EVo/jyDy8iFceZcNWfVxdwa5NS67V/6QT
macbF785XDsmA5E4UjslbZqU047w+A5N1yazcZWzMk0coJDeB8AtsA1/C2WZOm8p
HVoiAzsqt81hvPItnjCyZluL/YW+BKeOTnq04QbpQKcJpZBzszO4ZLtuD+IXkE69
8ZZPGFPnPS4ZMQojKkwsBr+Yo65S18oBDkib36mr2lsdnoWTpGq47C7ScUDBbqGm
iI7U8tYMnVVkQQHVVmGI4KOr5/4lxxp8398kqCaxfW3D5BQhbtUOF/OBjBHj1ZSV
9aZx87CyhA==
=DwAV
-----END PGP SIGNATURE-----
Merge tag 'for-6.10/io_uring-20240511' of git://git.kernel.dk/linux
Pull io_uring updates from Jens Axboe:
- Greatly improve send zerocopy performance, by enabling coalescing of
sent buffers.
MSG_ZEROCOPY already does this with send(2) and sendmsg(2), but the
io_uring side did not. In local testing, the crossover point for send
zerocopy being faster is now around 3000 byte packets, and it
performs better than the sync syscall variants as well.
This feature relies on a shared branch with net-next, which was
pulled into both branches.
- Unification of how async preparation is done across opcodes.
Previously, opcodes that required extra memory for async retry would
allocate that as needed, using on-stack state until that was the
case. If async retry was needed, the on-stack state was adjusted
appropriately for a retry and then copied to the allocated memory.
This led to some fragile and ugly code, particularly for read/write
handling, and made storage retries more difficult than they needed to
be. Allocate the memory upfront, as it's cheap from our pools, and
use that state consistently both initially and also from the retry
side.
- Move away from using remap_pfn_range() for mapping the rings.
This is really not the right interface to use and can cause lifetime
issues or leaks. Additionally, it means the ring sq/cq arrays need to
be physically contigious, which can cause problems in production with
larger rings when services are restarted, as memory can be very
fragmented at that point.
Move to using vm_insert_page(s) for the ring sq/cq arrays, and apply
the same treatment to mapped ring provided buffers. This also helps
unify the code we have dealing with allocating and mapping memory.
Hard to see in the diffstat as we're adding a few features as well,
but this kills about ~400 lines of code from the codebase as well.
- Add support for bundles for send/recv.
When used with provided buffers, bundles support sending or receiving
more than one buffer at the time, improving the efficiency by only
needing to call into the networking stack once for multiple sends or
receives.
- Tweaks for our accept operations, supporting both a DONTWAIT flag for
skipping poll arm and retry if we can, and a POLLFIRST flag that the
application can use to skip the initial accept attempt and rely
purely on poll for triggering the operation. Both of these have
identical flags on the receive side already.
- Make the task_work ctx locking unconditional.
We had various code paths here that would do a mix of lock/trylock
and set the task_work state to whether or not it was locked. All of
that goes away, we lock it unconditionally and get rid of the state
flag indicating whether it's locked or not.
The state struct still exists as an empty type, can go away in the
future.
- Add support for specifying NOP completion values, allowing it to be
used for error handling testing.
- Use set/test bit for io-wq worker flags. Not strictly needed, but
also doesn't hurt and helps silence a KCSAN warning.
- Cleanups for io-wq locking and work assignments, closing a tiny race
where cancelations would not be able to find the work item reliably.
- Misc fixes, cleanups, and improvements
* tag 'for-6.10/io_uring-20240511' of git://git.kernel.dk/linux: (97 commits)
io_uring: support to inject result for NOP
io_uring: fail NOP if non-zero op flags is passed in
io_uring/net: add IORING_ACCEPT_POLL_FIRST flag
io_uring/net: add IORING_ACCEPT_DONTWAIT flag
io_uring/filetable: don't unnecessarily clear/reset bitmap
io_uring/io-wq: Use set_bit() and test_bit() at worker->flags
io_uring/msg_ring: cleanup posting to IOPOLL vs !IOPOLL ring
io_uring: Require zeroed sqe->len on provided-buffers send
io_uring/notif: disable LAZY_WAKE for linked notifs
io_uring/net: fix sendzc lazy wake polling
io_uring/msg_ring: reuse ctx->submitter_task read using READ_ONCE instead of re-reading it
io_uring/rw: reinstate thread check for retries
io_uring/notif: implement notification stacking
io_uring/notif: simplify io_notif_flush()
net: add callback for setting a ubuf_info to skb
net: extend ubuf_info callback to ops structure
io_uring/net: support bundles for recv
io_uring/net: support bundles for send
io_uring/kbuf: add helpers for getting/peeking multiple buffers
io_uring/net: add provided buffer support for IORING_OP_SEND
...
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZj3HuwAKCRCRxhvAZXjc
orYvAQCZOr68uJaEaXAArYTdnMdQ6HIzG+FVlwrqtrhz0BV07wEAqgmtSR9XKh+L
0+DNepg4R8PZOHH371eSSsLNRCUCkAs=
=SVsU
-----END PGP SIGNATURE-----
Merge tag 'vfs-6.10.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull misc vfs updates from Christian Brauner:
"This contains the usual miscellaneous features, cleanups, and fixes
for vfs and individual fses.
Features:
- Free up FMODE_* bits. I've freed up bits 6, 7, 8, and 24. That
means we now have six free FMODE_* bits in total (but bit #6
already got used for FMODE_WRITE_RESTRICTED)
- Add FOP_HUGE_PAGES flag (follow-up to FMODE_* cleanup)
- Add fd_raw cleanup class so we can make use of automatic cleanup
provided by CLASS(fd_raw, f)(fd) for O_PATH fds as well
- Optimize seq_puts()
- Simplify __seq_puts()
- Add new anon_inode_getfile_fmode() api to allow specifying f_mode
instead of open-coding it in multiple places
- Annotate struct file_handle with __counted_by() and use
struct_size()
- Warn in get_file() whether f_count resurrection from zero is
attempted (epoll/drm discussion)
- Folio-sophize aio
- Export the subvolume id in statx() for both btrfs and bcachefs
- Relax linkat(AT_EMPTY_PATH) requirements
- Add F_DUPFD_QUERY fcntl() allowing to compare two file descriptors
for dup*() equality replacing kcmp()
Cleanups:
- Compile out swapfile inode checks when swap isn't enabled
- Use (1 << n) notation for FMODE_* bitshifts for clarity
- Remove redundant variable assignment in fs/direct-io
- Cleanup uses of strncpy in orangefs
- Speed up and cleanup writeback
- Move fsparam_string_empty() helper into header since it's currently
open-coded in multiple places
- Add kernel-doc comments to proc_create_net_data_write()
- Don't needlessly read dentry->d_flags twice
Fixes:
- Fix out-of-range warning in nilfs2
- Fix ecryptfs overflow due to wrong encryption packet size
calculation
- Fix overly long line in xfs file_operations (follow-up to FMODE_*
cleanup)
- Don't raise FOP_BUFFER_{R,W}ASYNC for directories in xfs (follow-up
to FMODE_* cleanup)
- Don't call xfs_file_open from xfs_dir_open (follow-up to FMODE_*
cleanup)
- Fix stable offset api to prevent endless loops
- Fix afs file server rotations
- Prevent xattr node from overflowing the eraseblock in jffs2
- Move fdinfo PTRACE_MODE_READ procfs check into the .permission()
operation instead of .open() operation since this caused userspace
regressions"
* tag 'vfs-6.10.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (39 commits)
afs: Fix fileserver rotation getting stuck
selftests: add F_DUPDFD_QUERY selftests
fcntl: add F_DUPFD_QUERY fcntl()
file: add fd_raw cleanup class
fs: WARN when f_count resurrection is attempted
seq_file: Simplify __seq_puts()
seq_file: Optimize seq_puts()
proc: Move fdinfo PTRACE_MODE_READ check into the inode .permission operation
fs: Create anon_inode_getfile_fmode()
xfs: don't call xfs_file_open from xfs_dir_open
xfs: drop fop_flags for directories
xfs: fix overly long line in the file_operations
shmem: Fix shmem_rename2()
libfs: Add simple_offset_rename() API
libfs: Fix simple_offset_rename_exchange()
jffs2: prevent xattr node from overflowing the eraseblock
vfs, swap: compile out IS_SWAPFILE() on swapless configs
vfs: relax linkat() AT_EMPTY_PATH - aka flink() - requirements
fs/direct-io: remove redundant assignment to variable retval
fs/dcache: Re-use value stored to dentry->d_flags instead of re-reading
...
Allowing retries for everything is arguably the right thing to do, now
that every command type is async read from the start. But it's exposed a
few issues around missing check for a retry (which cca6571381 exposed),
and the fixup commit for that isn't necessarily 100% sound in terms of
iov_iter state.
For now, just revert these two commits. This unfortunately then re-opens
the fact that -EAGAIN can get bubbled to userspace for some cases where
the kernel very well could just sanely retry them. But until we have all
the conditions covered around that, we cannot safely enable that.
This reverts commit df604d2ad4.
This reverts commit cca6571381.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If IORING_OP_RECV is used with provided buffers, the caller may also set
IORING_RECVSEND_BUNDLE to turn it into a multi-buffer recv. This grabs
buffers available and receives into them, posting a single completion for
all of it.
This can be used with multishot receive as well, or without it.
Now that both send and receive support bundles, add a feature flag for
it as well. If IORING_FEAT_RECVSEND_BUNDLE is set after registering the
ring, then the kernel supports bundles for recv and send.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
A previous commit removed the checking on whether or not it was possible
to retry a request, since it's now possible to retry any of them. This
would previously have caused the request to have been ended with an error,
but now the retry condition can simply get lost instead.
Cleanup the retry handling and always just punt it to task_work, which
will queue it with io-wq appropriately.
Reported-by: Changhui Zhong <czhong@redhat.com>
Tested-by: Ming Lei <ming.lei@redhat.com>
Fixes: cca6571381 ("io_uring/rw: cleanup retry path")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
A previous consolidation cleanup missed handling the case where the ring
is dying, and __io_cqring_overflow_flush() doesn't flush entries if the
CQ ring is already full. This is fine for the normal CQE overflow
flushing, but if the ring is going away, we need to flush everything,
even if it means simply freeing the overflown entries.
Fixes: 6c948ec44b29 ("io_uring: consolidate overflow flushing")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The only caller doesn't handle the return value of io_put_kbuf_comp(), so
change its return type into void.
Also follow Jens's suggestion to rename it as io_put_kbuf_drop().
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20240407132759.4056167-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
io_req_put_rsrc_locked() is a weird shim function around
io_req_put_rsrc(). All calls to io_req_put_rsrc() require holding
->uring_lock, so we can just use it directly.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/a195bc78ac3d2c6fbaea72976e982fe51e50ecdd.1712331455.git.asml.silence@gmail.com
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
io_req_complete_post() was a sole user of ->locked_free_list, but
since we just gutted the function, the cache is not used anymore and
can be removed.
->locked_free_list served as an asynhronous counterpart of the main
request (i.e. struct io_kiocb) cache for all unlocked cases like io-wq.
Now they're all forced to be completed into the main cache directly,
off of the normal completion path or via io_free_req().
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/7bffccd213e370abd4de480e739d8b08ab6c1326.1712331455.git.asml.silence@gmail.com
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
io_req_complete_post() is now io-wq only and shouldn't be used outside
of it, i.e. it relies that io-wq holds a ref for the request as
explained in a comment below. Let's add a warning to enforce the
assumption and make sure nobody would try to do anything weird.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/1013b60c35d431d0698cafbc53c06f5917348c20.1712331455.git.asml.silence@gmail.com
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Since commit 8f6c829491fe ("io_uring: remove struct io_tw_state::locked"),
io_req_complete_post() is only called from io-wq submit work, where the
request reference is guaranteed to be grabbed and won't drop to zero
in io_req_complete_post().
Kill the dead code, meantime add req_ref_put() to put the reference.
Cc: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/1d8297e2046553153e763a52574f0e0f4d512f86.1712331455.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Move the related code from io_uring.c into memmap.c. No functional
changes in this patch, just cleaning it up a bit now that the full
transition is done.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Rather than use remap_pfn_range() for this and manually free later,
switch to using vm_insert_page() and have it Just Work.
This requires a bit of effort on the mmap lookup side, as the ctx
uring_lock isn't held, which otherwise protects buffer_lists from being
torn down, and it's not safe to grab from mmap context that would
introduce an ABBA deadlock between the mmap lock and the ctx uring_lock.
Instead, lookup the buffer_list under RCU, as the the list is RCU freed
already. Use the existing reference count to determine whether it's
possible to safely grab a reference to it (eg if it's not zero already),
and drop that reference when done with the mapping. If the mmap
reference is the last one, the buffer_list and the associated memory can
go away, since the vma insertion has references to the inserted pages at
that point.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Move it into io_uring.c where it belongs, and use it in there as well
rather than have two implementations of this.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This is the last holdout which does odd page checking, convert it to
vmap just like what is done for the non-mmap path.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Rather than use remap_pfn_range() for this and manually free later,
switch to using vm_insert_pages() and have it Just Work.
If possible, allocate a single compound page that covers the range that
is needed. If that works, then we can just use page_address() on that
page. If we fail to get a compound page, allocate single pages and use
vmap() to map them into the kernel virtual address space.
This just covers the rings/sqes, the other remaining user of the mmap
remap_pfn_range() user will be converted separately. Once that is done,
we can kill the old alloc/free code.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The allocator will generally return memory in order, but
__io_alloc_req_refill() then adds them to a stack and we'll extract them
in the opposite order. This obviously isn't a huge deal, but:
1) it makes debugging easier when they are in order
2) keeping them in-order is the right thing to do
3) reduces the code for adding them to the stack
Just add them in reverse to the stack.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This should be plenty, rather than the default of 128, and matches what
we have on the rsrc and futex side as well.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently lists are being used to manage this, but best practice is
usually to have these in an array instead as that it cheaper to manage.
Outside of that detail, games are also played with KASAN as the list
is inside the cached entry itself.
Finally, all users of this need a struct io_cache_entry embedded in
their struct, which is union'ized with something else in there that
isn't used across the free -> realloc cycle.
Get rid of all of that, and simply have it be an array. This will not
change the memory used, as we're just trading an 8-byte member entry
for the per-elem array size.
This reduces the overhead of the recycled allocations, and it reduces
the amount of code code needed to support recycling to about half of
what it currently is.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
It's now unused, drop the code related to it. This includes the
io_issue_defs->manual alloc field.
While in there, and since ->async_size is now being used a bit more
frequently and in the issue path, move it to io_issue_defs[].
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Basic conversion ensuring async_data is allocated off the prep path. Adds
a basic alloc cache as well, as passthrough IO can be quite high in rate.
Tested-by: Anuj Gupta <anuj20.g@samsung.com>
Reviewed-by: Anuj Gupta <anuj20.g@samsung.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
read/write requests try to put everything on the stack, and then alloc
and copy if a retry is needed. This necessitates a bunch of nasty code
that deals with intermediate state.
Get rid of this, and have the prep side setup everything that is needed
upfront, which greatly simplifies the opcode handlers.
This includes adding an alloc cache for io_async_rw, to make it cheap
to handle.
In terms of cost, this should be basically free and transparent. For
the worst case of {READ,WRITE}_FIXED which didn't need it before,
performance is unaffected in the normal peak workload that is being
used to test that. Still runs at 122M IOPS.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
For historical reasons these were special cased, as they were the only
ones that needed cancelation. But now we handle cancelations generally,
and hence there's no need to check for these in
io_ring_ctx_wait_and_kill() when io_uring_try_cancel_requests() handles
both these and the rest as well.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Make io_req_complete_post() to push all IORING_SETUP_IOPOLL requests
to task_work, it's much cleaner and should normally happen. We couldn't
do it before because there was a possibility of looping in
complete_post() -> tw -> complete_post() -> ...
Also, unexport the function and inline __io_req_complete_post().
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Tested-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/ea19c032ace3e0dd96ac4d991a063b0188037014.1710799188.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
task_work execution is now always locked, and we shouldn't get into
io_req_complete_post() from them. That means that complete_post() is
always called out of the original task context and we don't even need to
check current.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Tested-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/24ec27f27db0d8f58c974d8118dca1d345314ddc.1710799188.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
io_post_aux_cqe(), which is used for multishot requests, delays
completions by putting CQEs into a temporary array for the purpose
completion lock/flush batching.
DEFER_TASKRUN doesn't need any locking, so for it we can put completions
directly into the CQ and defer post completion handling with a flag.
That leaves !DEFER_TASKRUN, which is not that interesting / hot for
multishot requests, so have conditional locking with deferred flush
for them.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Tested-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/b1d05a81fd27aaa2a07f9860af13059e7ad7a890.1710799188.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The restriction on multishot execution context disallowing io-wq is
driven by rules of io_fill_cqe_req_aux(), it should only be called in
the master task context, either from the syscall path or in task_work.
Since task_work now always takes the ctx lock implying
IO_URING_F_COMPLETE_DEFER, we can just assume that the function is
always called with its defer argument set to true.
Kill the argument. Also rename the function for more consistency as
"fill" in CQE related functions was usually meant for raw interfaces
only copying data into the CQ without any locking, waking the user
and other accounting "post" functions take care of.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Tested-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/93423d106c33116c7d06bf277f651aa68b427328.1710799188.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
ctx is always locked for task_work now, so get rid of struct
io_tw_state::locked. Note I'm stopping one step before removing
io_tw_state altogether, which is not empty, because it still serves the
purpose of indicating which function is a tw callback and forcing users
not to invoke them carelessly out of a wrong context. The removal can
always be done later.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Tested-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/e95e1ea116d0bfa54b656076e6a977bc221392a4.1710799188.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We can run normal task_work without locking the ctx, however we try to
lock anyway and most handlers prefer or require it locked. It might have
been interesting to multi-submitter ring with high contention completing
async read/write requests via task_work, however that will still need to
go through io_req_complete_post() and potentially take the lock for
rsrc node putting or some other case.
In other words, it's hard to care about it, so alawys force the locking.
The case described would also because of various io_uring caches.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Tested-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/6ae858f2ef562e6ed9f13c60978c0d48926954ba.1710799188.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
kiocb_done() should care to specifically redirecting requests to io-wq.
Remove the hopping to tw to then queue an io-wq, return -EAGAIN and let
the core code io_uring handle offloading.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Tested-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/413564e550fe23744a970e1783dfa566291b0e6f.1710799188.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There's a bunch of flags that are purely based on what the file
operations support while also never being conditionally set or unset.
IOW, they're not subject to change for individual files. Imho, such
flags don't need to live in f_mode they might as well live in the fops
structs itself. And the fops struct already has that lonely
mmap_supported_flags member. We might as well turn that into a generic
fop_flags member and move a few flags from FMODE_* space into FOP_*
space. That gets us four FMODE_* bits back and the ability for new
static flags that are about file ops to not have to live in FMODE_*
space but in their own FOP_* space. It's not the most beautiful thing
ever but it gets the job done. Yes, there'll be an additional pointer
chase but hopefully that won't matter for these flags.
I suspect there's a few more we can move into there and that we can also
redirect a bunch of new flag suggestions that follow this pattern into
the fop_flags field instead of f_mode.
Link: https://lore.kernel.org/r/20240328-gewendet-spargel-aa60a030ef74@brauner
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Christian Brauner <brauner@kernel.org>
This bug was introduced in commit 950e79dd73 ("io_uring: minor
io_cqring_wait() optimization"), which was made in preparation for
adc8682ec6 ("io_uring: Add support for napi_busy_poll"). The latter
got reverted in cb31821673 ("Revert "io_uring: Add support for
napi_busy_poll""), so simply undo the former as well.
Cc: stable@vger.kernel.org
Fixes: 950e79dd73 ("io_uring: minor io_cqring_wait() optimization")
Signed-off-by: Alexey Izbyshev <izbyshev@ispras.ru>
Link: https://lore.kernel.org/r/20240405125551.237142-1-izbyshev@ispras.ru
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If we look up the kbuf, ensure that it doesn't get unregistered until
after we're done with it. Since we're inside mmap, we cannot safely use
the io_uring lock. Rely on the fact that we can lookup the buffer list
under RCU now and grab a reference to it, preventing it from being
unregistered until we're done with it. The lookup returns the
io_buffer_list directly with it referenced.
Cc: stable@vger.kernel.org # v6.4+
Fixes: 5cf4f52e6d ("io_uring: free io_buffer_list entries via RCU")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Just rely on the xarray for any kind of bgid. This simplifies things, and
it really doesn't bring us much, if anything.
Cc: stable@vger.kernel.org # v6.4+
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Rather than use the system unbound event workqueue, use an io_uring
specific one. This avoids dependencies with the tty, which also uses
the system_unbound_wq, and issues flushes of said workqueue from inside
its poll handling.
Cc: stable@vger.kernel.org
Reported-by: Rasmus Karlsson <rasmus.karlsson@pajlada.com>
Tested-by: Rasmus Karlsson <rasmus.karlsson@pajlada.com>
Tested-by: Iskren Chernev <me@iskren.info>
Link: https://github.com/axboe/liburing/issues/1113
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Do the same check for direct io-wq execution for multishot requests that
commit 2a975d426c did for the inline execution, and disable multishot
mode (and revert to single shot) if the file type doesn't support NOWAIT,
and isn't opened in O_NONBLOCK mode. For multishot to work properly, it's
a requirement that nonblocking read attempts can be done.
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If failure happens before the opcode prep handler is called, ensure that
we clear the opcode specific area of the request, which holds data
specific to that request type. This prevents errors where opcode
handlers either don't get to clear per-request private data since prep
isn't even called.
Reported-and-tested-by: syzbot+f8e9a371388aa62ecab4@syzkaller.appspotmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Looking at the error path of __io_uaddr_map, if we fail after pinning
the pages for any reasons, ret will be set to -EINVAL and the error
handler won't properly release the pinned pages.
I didn't manage to trigger it without forcing a failure, but it can
happen in real life when memory is heavily fragmented.
Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de>
Fixes: 223ef47431 ("io_uring: don't allow IORING_SETUP_NO_MMAP rings on highmem pages")
Link: https://lore.kernel.org/r/20240313213912.1920-1-krisman@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We make a few cancellation judgements based on ctx->rings, so let's
zero it afer deallocation for IORING_SETUP_NO_MMAP just like it's
done with the mmap case. Likely, it's not a real problem, but zeroing
is safer and better tested.
Cc: stable@vger.kernel.org
Fixes: 03d89a2de2 ("io_uring: support for user allocated memory for rings/sqes")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/9ff6cdf91429b8a51699c210e1f6af6ea3f8bdcf.1710255382.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This kind of state is per-syscall, and since we're doing the waiting off
entering the io_uring_enter(2) syscall, there's no way that iowait can
already be set for this case. Simplify it by setting it if we need to,
and always clearing it to 0 when done.
Fixes: 7b72d661f1 ("io_uring: gate iowait schedule on having pending requests")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We disallow DEFER_TASKRUN multishots from running by io-wq, which is
checked by individual opcodes in the issue path. We can consolidate all
it in io_wq_submit_work() at the same time moving the checks out of the
hot path.
Suggested-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/e492f0f11588bb5aa11d7d24e6f53b7c7628afdb.1709905727.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>