linux

Commit Graph

Author	SHA1	Message	Date
Jens Axboe	d7718a9d25	io_uring: use poll driven retry for files that support it Currently io_uring tries any request in a non-blocking manner, if it can, and then retries from a worker thread if we get -EAGAIN. Now that we have a new and fancy poll based retry backend, use that to retry requests if the file supports it. This means that, for example, an IORING_OP_RECVMSG on a socket no longer requires an async thread to complete the IO. If we get -EAGAIN reading from the socket in a non-blocking manner, we arm a poll handler for notification on when the socket becomes readable. When it does, the pending read is executed directly by the task again, through the io_uring task work handlers. Not only is this faster and more efficient, it also means we're not generating potentially tons of async threads that just sit and block, waiting for the IO to complete. The feature is marked with IORING_FEAT_FAST_POLL, meaning that async pollable IO is fast, and that poll<link>other_op is fast as well. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-03-02 14:06:38 -07:00
Jens Axboe	8a72758c51	io_uring: mark requests that we can do poll async in io_op_defs Add a pollin/pollout field to the request table, and have commands that we can safely poll for properly marked. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-03-02 14:06:37 -07:00
Jens Axboe	b41e98524e	io_uring: add per-task callback handler For poll requests, it's not uncommon to link a read (or write) after the poll to execute immediately after the file is marked as ready. Since the poll completion is called inside the waitqueue wake up handler, we have to punt that linked request to async context. This slows down the processing, and actually means it's faster to not use a link for this use case. We also run into problems if the completion_lock is contended, as we're doing a different lock ordering than the issue side is. Hence we have to do trylock for completion, and if that fails, go async. Poll removal needs to go async as well, for the same reason. eventfd notification needs special case as well, to avoid stack blowing recursion or deadlocks. These are all deficiencies that were inherited from the aio poll implementation, but I think we can do better. When a poll completes, simply queue it up in the task poll list. When the task completes the list, we can run dependent links inline as well. This means we never have to go async, and we can remove a bunch of code associated with that, and optimizations to try and make that run faster. The diffstat speaks for itself. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-03-02 14:06:36 -07:00
Jens Axboe	c2f2eb7d2c	io_uring: store io_kiocb in wait->private Store the io_kiocb in the private field instead of the poll entry, this is in preparation for allowing multiple waitqueues. No functional changes in this patch. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-03-02 14:06:34 -07:00
Pavel Begunkov	5eae861990	io_uring: remove IO_WQ_WORK_CB IO_WQ_WORK_CB is used only for linked timeouts, which will be armed before the work setup (i.e. mm, override creds, etc). The setup shouldn't take long, so it's ok to arm it a bit later and get rid of IO_WQ_WORK_CB. Make io-wq call work->func() only once, callbacks will handle the rest. i.e. the linked timeout handler will do the actual issue. And as a bonus, it removes an extra indirect call. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-03-02 14:06:29 -07:00
Pavel Begunkov	02d27d8953	io_uring: extract kmsg copy helper io_recvmsg() and io_sendmsg() duplicate nonblock -EAGAIN finilising part, so add helper for that. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-03-02 14:04:41 -07:00
Pavel Begunkov	b0a20349f2	io_uring: clean io_poll_complete Deduplicate call to io_cqring_fill_event(), plain and easy Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-03-02 14:04:37 -07:00
Pavel Begunkov	7d67af2c01	io_uring: add splice(2) support Add support for splice(2). - output file is specified as sqe->fd, so it's handled by generic code - hash_reg_file handled by generic code as well - len is 32bit, but should be fine - the fd_in is registered file, when SPLICE_F_FD_IN_FIXED is set, which is a splice flag (i.e. sqe->splice_flags). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-03-02 14:04:37 -07:00
Pavel Begunkov	8da11c1994	io_uring: add interface for getting files Preparation without functional changes. Adds io_get_file(), that allows to grab files not only into req->file. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-03-02 14:04:37 -07:00
Pavel Begunkov	bcaec089c5	io_uring: remove req->in_async req->in_async is not really needed, it only prevents propagation of @nxt for fast not-blocked submissions. Remove it. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-03-02 14:04:29 -07:00
Pavel Begunkov	deb6dc0544	io_uring: don't do full *prep_worker() from io-wq io_prep_async_worker() called io_wq_assign_next() do many useless checks: io_req_work_grab_env() was already called during prep, and @do_hashed is not ever used. Add io_prep_next_work() -- simplified version, that can be called io-wq. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-03-02 14:04:28 -07:00
Pavel Begunkov	5ea6216116	io_uring: don't call work.func from sync ctx Many operations define custom work.func before getting into an io-wq. There are several points against: - it calls io_wq_assign_next() from outside io-wq, that may be confusing - sync context would go unnecessary through io_req_cancelled() - prototypes are quite different, so work!=old_work looks strange - makes async/sync responsibilities fuzzy - adds extra overhead Don't call generic path and io-wq handlers from each other, but use helpers instead Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-03-02 14:04:24 -07:00
Jens Axboe	e441d1cf20	io_uring: io_accept() should hold on to submit reference on retry Don't drop an early reference, hang on to it and let the caller drop it. This makes it behave more like "regular" requests. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-03-02 14:04:24 -07:00
Jens Axboe	29de5f6a35	io_uring: consider any io_read/write -EAGAIN as final If the -EAGAIN happens because of a static condition, then a poll or later retry won't fix it. We must call it again from blocking condition. Play it safe and ensure that any -EAGAIN condition from read or write must retry from async context. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-03-02 14:04:24 -07:00
Jens Axboe	d876836204	io_uring: fix 32-bit compatability with sendmsg/recvmsg We must set MSG_CMSG_COMPAT if we're in compatability mode, otherwise the iovec import for these commands will not do the right thing and fail the command with -EINVAL. Found by running the test suite compiled as 32-bit. Cc: stable@vger.kernel.org Fixes: `aa1fa28fc7` ("io_uring: add support for recvmsg()") Fixes: `0fa03c624d` ("io_uring: add support for sendmsg()") Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-02-27 14:17:49 -07:00
Tobias Klauser	bebdb65e07	io_uring: define and set show_fdinfo only if procfs is enabled Follow the pattern used with other *_show_fdinfo functions and only define and use io_uring_show_fdinfo and its helper functions if CONFIG_PROC_FS is set. Fixes: `87ce955b24` ("io_uring: add ->show_fdinfo() for the io_uring file descriptor") Signed-off-by: Tobias Klauser <tklauser@distanz.ch> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-02-27 06:56:21 -07:00
Jens Axboe	dd3db2a34c	io_uring: drop file set ref put/get on switch Dan reports that he triggered a warning on ring exit doing some testing: percpu ref (io_file_data_ref_zero) <= 0 (0) after switching to atomic WARNING: CPU: 3 PID: 0 at lib/percpu-refcount.c:160 percpu_ref_switch_to_atomic_rcu+0xe8/0xf0 Modules linked in: CPU: 3 PID: 0 Comm: swapper/3 Not tainted 5.6.0-rc3+ #5648 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014 RIP: 0010:percpu_ref_switch_to_atomic_rcu+0xe8/0xf0 Code: e7 ff 55 e8 eb d2 80 3d bd 02 d2 00 00 75 8b 48 8b 55 d8 48 c7 c7 e8 70 e6 81 c6 05 a9 02 d2 00 01 48 8b 75 e8 e8 3a d0 c5 ff <0f> 0b e9 69 ff ff ff 90 55 48 89 fd 53 48 89 f3 48 83 ec 28 48 83 RSP: 0018:ffffc90000110ef8 EFLAGS: 00010292 RAX: 0000000000000045 RBX: 7fffffffffffffff RCX: 0000000000000000 RDX: 0000000000000045 RSI: ffffffff825be7a5 RDI: ffffffff825bc32c RBP: ffff8881b75eac38 R08: 000000042364b941 R09: 0000000000000045 R10: ffffffff825beb40 R11: ffffffff825be78a R12: 0000607e46005aa0 R13: ffff888107dcdd00 R14: 0000000000000000 R15: 0000000000000009 FS: 0000000000000000(0000) GS:ffff8881b9d80000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f49e6a5ea20 CR3: 00000001b747c004 CR4: 00000000001606e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <IRQ> rcu_core+0x1e4/0x4d0 __do_softirq+0xdb/0x2f1 irq_exit+0xa0/0xb0 smp_apic_timer_interrupt+0x60/0x140 apic_timer_interrupt+0xf/0x20 </IRQ> RIP: 0010:default_idle+0x23/0x170 Code: ff eb ab cc cc cc cc 0f 1f 44 00 00 41 54 55 53 65 8b 2d 10 96 92 7e 0f 1f 44 00 00 e9 07 00 00 00 0f 00 2d 21 d0 51 00 fb f4 <65> 8b 2d f6 95 92 7e 0f 1f 44 00 00 5b 5d 41 5c c3 65 8b 05 e5 95 Turns out that this is due to percpu_ref_switch_to_atomic() only grabbing a reference to the percpu refcount if it's not already in atomic mode. io_uring drops a ref and re-gets it when switching back to percpu mode. We attempt to protect against this with the FFD_F_ATOMIC bit, but that isn't reliable. We don't actually need to juggle these refcounts between atomic and percpu switch, we can just do them when we've switched to atomic mode. This removes the need for FFD_F_ATOMIC, which wasn't reliable. Fixes: `05f3fb3c53` ("io_uring: avoid ring quiesce for fixed file set unregister and update") Reported-by: Dan Melnic <dmm@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-02-26 10:53:33 -07:00
Jens Axboe	3a9015988b	io_uring: import_single_range() returns 0/-ERROR Unlike the other core import helpers, import_single_range() returns 0 on success, not the length imported. This means that links that depend on the result of non-vec based IORING_OP_{READ,WRITE} that were added for 5.5 get errored when they should not be. Fixes: `3a6820f2bb` ("io_uring: add non-vectored read/write commands") Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-02-26 07:06:57 -07:00
Jens Axboe	2a44f46781	io_uring: pick up link work on submit reference drop If work completes inline, then we should pick up a dependent link item in __io_queue_sqe() as well. If we don't do so, we're forced to go async with that item, which is suboptimal. This also fixes an issue with io_put_req_find_next(), which always looks up the next work item. That should only be done if we're dropping the last reference to the request, to prevent multiple lookups of the same work item. Outside of being a fix, this also enables a good cleanup series for 5.7, where we never have to pass 'nxt' around or into the work handlers. Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-02-26 07:05:30 -07:00
Xiaoguang Wang	bdcd3eab2a	io_uring: fix poll_list race for SETUP_IOPOLL\|SETUP_SQPOLL After making ext4 support iopoll method: let ext4_file_operations's iopoll method be iomap_dio_iopoll(), we found fio can easily hang in fio_ioring_getevents() with below fio job: rm -f testfile; sync; sudo fio -name=fiotest -filename=testfile -iodepth=128 -thread -rw=write -ioengine=io_uring -hipri=1 -sqthread_poll=1 -direct=1 -bs=4k -size=10G -numjobs=8 -runtime=2000 -group_reporting with IORING_SETUP_SQPOLL and IORING_SETUP_IOPOLL enabled. There are two issues that results in this hang, one reason is that when IORING_SETUP_SQPOLL and IORING_SETUP_IOPOLL are enabled, fio does not use io_uring_enter to get completed events, it relies on kernel io_sq_thread to poll for completed events. Another reason is that there is a race: when io_submit_sqes() in io_sq_thread() submits a batch of sqes, variable 'inflight' will record the number of submitted reqs, then io_sq_thread will poll for reqs which have been added to poll_list. But note, if some previous reqs have been punted to io worker, these reqs will won't be in poll_list timely. io_sq_thread() will only poll for a part of previous submitted reqs, and then find poll_list is empty, reset variable 'inflight' to be zero. If app just waits these deferred reqs and does not wake up io_sq_thread again, then hang happens. For app that entirely relies on io_sq_thread to poll completed requests, let io_iopoll_req_issued() wake up io_sq_thread properly when adding new element to poll_list, and when io_sq_thread prepares to sleep, check whether poll_list is empty again, if not empty, continue to poll. Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-02-25 08:40:43 -07:00
Jens Axboe	41726c9a50	io_uring: fix personality idr leak We somehow never free the idr, even though we init it for every ctx. Free it when the rest of the ring data is freed. Fixes: `071698e13a` ("io_uring: allow registering credentials") Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-02-24 08:31:51 -07:00
Jens Axboe	193155c8c9	io_uring: handle multiple personalities in link chains If we have a chain of requests and they don't all use the same credentials, then the head of the chain will be issued with the credentails of the tail of the chain. Ensure __io_queue_sqe() overrides the credentials, if they are different. Once we do that, we can clean up the creds handling as well, by only having io_submit_sqe() do the lookup of a personality. It doesn't need to assign it, since __io_queue_sqe() now always does the right thing. Fixes: `75c6a03904` ("io_uring: support using a registered personality for commands") Reported-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-02-23 19:46:13 -07:00
Xiaoguang Wang	c7849be9cc	io_uring: fix __io_iopoll_check deadlock in io_sq_thread Since commit `a3a0e43fd7` ("io_uring: don't enter poll loop if we have CQEs pending"), if we already events pending, we won't enter poll loop. In case SETUP_IOPOLL and SETUP_SQPOLL are both enabled, if app has been terminated and don't reap pending events which are already in cq ring, and there are some reqs in poll_list, io_sq_thread will enter __io_iopoll_check(), and find pending events, then return, this loop will never have a chance to exit. I have seen this issue in fio stress tests, to fix this issue, let io_sq_thread call io_iopoll_getevents() with argument 'min' being zero, and remove __io_iopoll_check(). Fixes: `a3a0e43fd7` ("io_uring: don't enter poll loop if we have CQEs pending") Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-02-22 07:45:03 -07:00
Stefano Garzarella	7143b5ac57	io_uring: prevent sq_thread from spinning when it should stop This patch drops 'cur_mm' before calling cond_resched(), to prevent the sq_thread from spinning even when the user process is finished. Before this patch, if the user process ended without closing the io_uring fd, the sq_thread continues to spin until the 'sq_thread_idle' timeout ends. In the worst case where the 'sq_thread_idle' parameter is bigger than INT_MAX, the sq_thread will spin forever. Fixes: `6c271ce2f1` ("io_uring: add submission polling") Signed-off-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-02-21 09:16:10 -07:00
Pavel Begunkov	929a3af90f	io_uring: fix use-after-free by io_cleanup_req() io_cleanup_req() should be called before req->io is freed, and so shouldn't be after __io_free_req() -> __io_req_aux_free(). Also, it will be ignored for in io_free_req_many(), which use __io_req_aux_free(). Place cleanup_req() into __io_req_aux_free(). Fixes: `99bc4c3853` ("io_uring: fix iovec leaks") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-02-18 17:12:23 -07:00
Dan Carpenter	297a31e3e8	io_uring: remove unnecessary NULL checks The "kmsg" pointer can't be NULL and we have already dereferenced it so a check here would be useless. Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-02-18 11:22:02 -07:00
Pavel Begunkov	7fbeb95d0f	io_uring: add missing io_req_cancelled() fallocate_finish() is missing cancellation check. Add it. It's safe to do that, as only flags setup and sqe fields copy are done before it gets into __io_fallocate(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-02-16 10:09:37 -07:00
Jens Axboe	2ca10259b4	io_uring: prune request from overflow list on flush Carter reported an issue where he could produce a stall on ring exit, when we're cleaning up requests that match the given file table. For this particular test case, a combination of a few things caused the issue: - The cq ring was overflown - The request being canceled was in the overflow list The combination of the above means that the cq overflow list holds a reference to the request. The request is canceled correctly, but since the overflow list holds a reference to it, the final put won't happen. Since the final put doesn't happen, the request remains in the inflight. Hence we never finish the cancelation flush. Fix this by removing requests from the overflow list if we're canceling them. Cc: stable@vger.kernel.org # 5.5 Reported-by: Carter Li 李通洲 <carter.li@eoitek.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-02-13 17:25:01 -07:00
Jens Axboe	b537916ca5	io_uring: retain sockaddr_storage across send/recvmsg async punt Jonas reports that he sometimes sees -97/-22 error returns from sendmsg, if it gets punted async. This is due to not retaining the sockaddr_storage between calls. Include that in the state we copy when going async. Cc: stable@vger.kernel.org # 5.3+ Reported-by: Jonas Bonn <jonas@norrbonn.se> Tested-by: Jonas Bonn <jonas@norrbonn.se> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-02-09 11:32:10 -07:00
Jens Axboe	6ab231448f	io_uring: cancel pending async work if task exits Normally we cancel all work we track, but for untracked work we could leave the async worker behind until that work completes. This is totally fine, but does leave resources pending after the task is gone until that work completes. Cancel work that this task queued up when it goes away. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-02-09 09:55:38 -07:00
Pavel Begunkov	0bdbdd08a8	io_uring: fix openat/statx's filename leak As in the previous patch, make openat*_prep() and statx_prep() handle double preparation to avoid resource leakage. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-02-08 13:07:00 -07:00
Pavel Begunkov	5f798beaf3	io_uring: fix double prep iovec leak Requests may be prepared multiple times with ->io allocated (i.e. async prepared). Preparation functions don't handle it and forget about previously allocated resources. This may happen in case of: - spurious defer_check - non-head (i.e. async prepared) request executed in sync (via nxt). Make the handlers check, whether they already allocated resources, which is true IFF REQ_F_NEED_CLEANUP is set. Cc: stable@vger.kernel.org # 5.5 Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-02-08 13:07:00 -07:00
Pavel Begunkov	a93b33312f	io_uring: fix async close() with f_op->flush() First, io_close() misses filp_close() and io_cqring_add_event(), when f_op->flush is defined. That's because in this case it will io_queue_async_work() itself not grabbing files, so the corresponding chunk in io_close_finish() won't be executed. Second, when submitted through io_wq_submit_work(), it will do filp_close() and *_add_event() twice: first inline in io_close(), and the second one in call to io_close_finish() from io_close(). The second one will also fire, because it was submitted async through generic path, and so have grabbed files. And the last nice thing is to remove this weird pilgrimage with checking work/old_work and casting it to nxt. Just use a helper instead. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-02-08 13:07:00 -07:00
Jens Axboe	0b5faf6ba7	io_uring: allow AT_FDCWD for non-file openat/openat2/statx Don't just check for dirfd == -1, we should allow AT_FDCWD as well for relative lookups. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-02-08 13:07:00 -07:00
Jens Axboe	ff002b3018	io_uring: grab ->fs as part of async preparation This passes it in to io-wq, so it assumes the right fs_struct when executing async work that may need to do lookups. Cc: stable@vger.kernel.org # 5.3+ Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-02-08 13:07:00 -07:00
Jens Axboe	faac996ccd	io_uring: retry raw bdev writes if we hit -EOPNOTSUPP For non-blocking issue, we set IOCB_NOWAIT in the kiocb. However, on a raw block device, this yields an -EOPNOTSUPP return, as non-blocking writes aren't supported. Turn this -EOPNOTSUPP into -EAGAIN, so we retry from blocking context with IOCB_NOWAIT cleared. Cc: stable@vger.kernel.org # 5.5 Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-02-08 13:06:58 -07:00
Pavel Begunkov	8fef80bf56	io_uring: add cleanup for openat()/statx() openat() and statx() may have allocated ->open.filename, which should be be put. Add cleanup handlers for them. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-02-08 13:06:58 -07:00
Pavel Begunkov	99bc4c3853	io_uring: fix iovec leaks Allocated iovec is freed only in io_{read,write,send,recv)(), and just leaves it if an error occured. There are plenty of such cases: - cancellation of non-head requests - fail grabbing files in __io_queue_sqe() - set REQ_F_NOWAIT and returning in __io_queue_sqe() Add REQ_F_NEED_CLEANUP, which will force such requests with custom allocated resourses go through cleanup handlers on put. Cc: stable@vger.kernel.org # 5.5 Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-02-08 13:06:58 -07:00
Pavel Begunkov	e96e977992	io_uring: remove unused struct io_async_open struct io_async_open is unused, remove it. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-02-08 13:06:58 -07:00
Stefano Garzarella	63e5d81f72	io_uring: flush overflowed CQ events in the io_uring_poll() In io_uring_poll() we must flush overflowed CQ events before to check if there are CQ events available, to avoid missing events. We call the io_cqring_events() that checks and flushes any overflow and returns the number of CQ events available. Signed-off-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-02-08 13:06:58 -07:00
Jens Axboe	cf3040ca55	io_uring: statx/openat/openat2 don't support fixed files All of these opcodes take a directory file descriptor. We can't easily support fixed files for these operations, and the use case for that probably isn't all that clear (or sensible) anyway. Disable IOSQE_FIXED_FILE for these operations. Reported-by: Stefan Metzmacher <metze@samba.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-02-08 13:06:33 -07:00
Pavel Begunkov	1e95081cb5	io_uring: fix deferred req iovec leak After defer, a request will be prepared, that includes allocating iovec if needed, and then submitted through io_wq_submit_work() but not custom handler (e.g. io_rw_async()/io_sendrecv_async()). However, it'll leak iovec, as it's in io-wq and the code goes as follows: io_read() { if (!io_wq_current_is_worker()) kfree(iovec); } Put all deallocation logic in io_{read,write,send,recv}(), which will leave the memory, if going async with -EAGAIN. It also fixes a leak after failed io_alloc_async_ctx() in io_{recv,send}_msg(). Cc: stable@vger.kernel.org # 5.5 Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-02-06 13:58:57 -07:00
Randy Dunlap	e1d85334d6	io_uring: fix 1-bit bitfields to be unsigned Make bitfields of size 1 bit be unsigned (since there is no room for the sign bit). This clears up the sparse warnings: CHECK ../fs/io_uring.c ../fs/io_uring.c:207:50: error: dubious one-bit signed bitfield ../fs/io_uring.c:208:55: error: dubious one-bit signed bitfield ../fs/io_uring.c:209:63: error: dubious one-bit signed bitfield ../fs/io_uring.c:210:54: error: dubious one-bit signed bitfield ../fs/io_uring.c:211:57: error: dubious one-bit signed bitfield Found by sight and then verified with sparse. Fixes: `69b3e54613` ("io_uring: change io_ring_ctx bool fields into bit fields") Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: io-uring@vger.kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-02-06 13:41:00 -07:00
Pavel Begunkov	1cb1edb2f5	io_uring: get rid of delayed mm check Fail fast if can't grab mm, so past that requests always have an mm when required. This allows us to remove req->user altogether. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-02-06 12:53:10 -07:00
Linus Torvalds	c1ef57a3a3	io_uring-5.6-2020-02-05 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl47MicQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpplgD/4wyOfyMQ601AaiXyBmG6lx7UV7kBBaDWAb tlDEh/EWIioejMlYJC8UslLtrlxJS8jKCVJNOAz5zB9V6McLtHNxXNY5pRr4MRrc 2ztxFHuvy8s+LyztGxBh3DA+bT5UrMR/r6uu6Guh2TatFUZr4IOvBUBb6VeP9O1Z sECCkzWZcmIq2gNSh7Dpxr31KdMQo7xngyMhFMh3CHBnDVZN6WX4ugNBJNb71MpY ELH3SRY2uX15dlhatO5UYuAknJOA1VvlulYVWCuBj4UPyH0AAUJQiZJVEPwldCNL qE4cS80Q5EMAFw32cOW/oyl8Z6oFQO5nwFQ+YPPhaZscjMsRteuqnt6qYSgXHJal ze4mUBO9Z1byc9Gex1V5SHZSLzVw3HfgznSUfZrm+Tj2UkocJSaYtS9CzXR8x7tE tD8ev4P3EH+axm4oUSWoA4Bro9eGgkV07ok2mCnxb9rJoV0JNHzUmVSzjF4G9HGK GosVRRS4I4/nHIZQ3KTKp6apLOAn7SPTUkxqb0/M8qbRXZqQYylWhPsL2Q8aBgvT 8pQ2sIQ5AgOmzGKqKRofxbhIh8G+6Ddz97A+Omt47zLb8ccsoatXfEli7mMjtH4P W/aUE0O8Kstma8gZN4LUxrnqKGncDVJMolozFyt5dWc9bIpxX0SmpDdiRzqyN1fw k9L4Ox6hxg== =RzPL -----END PGP SIGNATURE----- Merge tag 'io_uring-5.6-2020-02-05' of git://git.kernel.dk/linux-block Pull io_uring updates from Jens Axboe: "Some later fixes for io_uring: - Small cleanup series from Pavel - Belt and suspenders build time check of sqe size and layout (Stefan) - Addition of ->show_fdinfo() on request of Jann Horn, to aid in understanding mapped personalities - eventfd recursion/deadlock fix, for both io_uring and aio - Fixup for send/recv handling - Fixup for double deferral of read/write request - Fix for potential double completion event for close request - Adjust fadvise advice async/inline behavior - Fix for shutdown hang with SQPOLL thread - Fix for potential use-after-free of fixed file table" * tag 'io_uring-5.6-2020-02-05' of git://git.kernel.dk/linux-block: io_uring: cleanup fixed file data table references io_uring: spin for sq thread to idle on shutdown aio: prevent potential eventfd recursion on poll io_uring: put the flag changing code in the same spot io_uring: iterate req cache backwards io_uring: punt even fadvise() WILLNEED to async context io_uring: fix sporadic double CQE entry for close io_uring: remove extra ->file check io_uring: don't map read/write iovec potentially twice io_uring: use the proper helpers for io_send/recv io_uring: prevent potential eventfd recursion on poll eventfd: track eventfd_signal() recursion depth io_uring: add BUILD_BUG_ON() to assert the layout of struct io_uring_sqe io_uring: add ->show_fdinfo() for the io_uring file descriptor	2020-02-06 06:33:17 +00:00
Jens Axboe	2faf852d1b	io_uring: cleanup fixed file data table references syzbot reports a use-after-free in io_ring_file_ref_switch() when it tries to switch back to percpu mode. When we put the final reference to the table by calling percpu_ref_kill_and_confirm(), we don't want the zero reference to queue async work for flushing the potentially queued up items. We currently do a few flush_work(), but they merely paper around the issue, since the work item may not have been queued yet depending on the when the percpu-ref callback gets run. Coming into the file unregister, we know we have the ring quiesced. io_ring_file_ref_switch() can check for whether or not the ref is dying or not, and not queue anything async at that point. Once the ref has been confirmed killed, flush any potential items manually. Reported-by: syzbot+7caeaea49c2c8a591e3d@syzkaller.appspotmail.com Fixes: `05f3fb3c53` ("io_uring: avoid ring quiesce for fixed file set unregister and update") Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-02-04 20:04:18 -07:00
Jens Axboe	df069d80c8	io_uring: spin for sq thread to idle on shutdown As part of io_uring shutdown, we cancel work that is pending and won't necessarily complete on its own. That includes requests like poll commands and timeouts. If we're using SQPOLL for kernel side submission and we shutdown the ring immediately after queueing such work, we can race with the sqthread doing the submission. This means we may miss cancelling some work, which results in the io_uring shutdown hanging forever. Cc: stable@vger.kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-02-04 16:48:34 -07:00
Pavel Begunkov	3e577dcd73	io_uring: put the flag changing code in the same spot Both iocb_flags() and kiocb_set_rw_flags() are inline and modify kiocb->ki_flags. Place them close, so they can be potentially better optimised. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-02-03 17:27:47 -07:00
Pavel Begunkov	6c8a313469	io_uring: iterate req cache backwards Grab requests from cache-array from the end, so can get by only free_reqs. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-02-03 17:27:47 -07:00
Jens Axboe	3e69426da2	io_uring: punt even fadvise() WILLNEED to async context Andres correctly points out that read-ahead can block, if it needs to read in meta data (or even just through the page cache page allocations). Play it safe for now and just ensure WILLNEED is also punted to async context. While in there, allow the file settings hints from non-blocking context. They don't need to start/do IO, and we can safely do them inline. Fixes: `4840e418c2` ("io_uring: add IORING_OP_FADVISE") Reported-by: Andres Freund <andres@anarazel.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-02-03 17:27:47 -07:00
Jens Axboe	1a417f4e61	io_uring: fix sporadic double CQE entry for close We punt close to async for the final fput(), but we log the completion even before that even in that case. We rely on the request not having a files table assigned to detect what the final async close should do. However, if we punt the async queue to __io_queue_sqe(), we'll get ->files assigned and this makes io_close_finish() think it should both close the filp again (which does no harm) AND log a new CQE event for this request. This causes duplicate CQEs. Queue the request up for async manually so we don't grab files needlessly and trigger this condition. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-02-03 17:27:47 -07:00
Pavel Begunkov	9250f9ee19	io_uring: remove extra ->file check It won't ever get into io_prep_rw() when req->file haven't been set in io_req_set_file(), hence remove the check. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-02-03 17:27:47 -07:00
Jens Axboe	5d204bcfa0	io_uring: don't map read/write iovec potentially twice If we have a read/write that is deferred, we already setup the async IO context for that request, and mapped it. When we later try and execute the request and we get -EAGAIN, we don't want to attempt to re-map it. If we do, we end up with garbage in the iovec, which typically leads to an -EFAULT or -EINVAL completion. Cc: stable@vger.kernel.org # 5.5 Reported-by: Dan Melnic <dmm@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-02-03 17:27:47 -07:00
Jens Axboe	0b7b21e42b	io_uring: use the proper helpers for io_send/recv Don't use the recvmsg/sendmsg helpers, use the same helpers that the recv(2) and send(2) system calls use. Reported-by: 李通洲 <carter.li@eoitek.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-02-03 17:27:47 -07:00
Jens Axboe	f0b493e6b9	io_uring: prevent potential eventfd recursion on poll If we have nested or circular eventfd wakeups, then we can deadlock if we run them inline from our poll waitqueue wakeup handler. It's also possible to have very long chains of notifications, to the extent where we could risk blowing the stack. Check the eventfd recursion count before calling eventfd_signal(). If it's non-zero, then punt the signaling to async context. This is always safe, as it takes us out-of-line in terms of stack and locking context. Cc: stable@vger.kernel.org # 5.1+ Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-02-03 17:27:47 -07:00
John Hubbard	f1f6a7dd9b	mm, tree-wide: rename put_user_page() to unpin_user_page() In order to provide a clearer, more symmetric API for pinning and unpinning DMA pages. This way, pin_user_pages() calls match up with unpin_user_pages() calls, and the API is a lot closer to being self-explanatory. Link: http://lkml.kernel.org/r/20200107224558.2362728-23-jhubbard@nvidia.com Signed-off-by: John Hubbard <jhubbard@nvidia.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Alex Williamson <alex.williamson@redhat.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Björn Töpel <bjorn.topel@intel.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Daniel Vetter <daniel.vetter@ffwll.ch> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Hans Verkuil <hverkuil-cisco@xs4all.nl> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Jason Gunthorpe <jgg@mellanox.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jens Axboe <axboe@kernel.dk> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Kirill A. Shutemov <kirill@shutemov.name> Cc: Leon Romanovsky <leonro@mellanox.com> Cc: Mauro Carvalho Chehab <mchehab@kernel.org> Cc: Mike Rapoport <rppt@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2020-01-31 10:30:38 -08:00
John Hubbard	2113b05d03	fs/io_uring: set FOLL_PIN via pin_user_pages() Convert fs/io_uring to use the new pin_user_pages() call, which sets FOLL_PIN. Setting FOLL_PIN is now required for code that requires tracking of pinned pages, and therefore for any code that calls put_user_page(). In partial anticipation of this work, the io_uring code was already calling put_user_page() instead of put_page(). Therefore, in order to convert from the get_user_pages()/put_page() model, to the pin_user_pages()/put_user_page() model, the only change required here is to change get_user_pages() to pin_user_pages(). Link: http://lkml.kernel.org/r/20200107224558.2362728-17-jhubbard@nvidia.com Signed-off-by: John Hubbard <jhubbard@nvidia.com> Reviewed-by: Jens Axboe <axboe@kernel.dk> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Alex Williamson <alex.williamson@redhat.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Björn Töpel <bjorn.topel@intel.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Daniel Vetter <daniel.vetter@ffwll.ch> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Hans Verkuil <hverkuil-cisco@xs4all.nl> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Jason Gunthorpe <jgg@mellanox.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Kirill A. Shutemov <kirill@shutemov.name> Cc: Leon Romanovsky <leonro@mellanox.com> Cc: Mauro Carvalho Chehab <mchehab@kernel.org> Cc: Mike Rapoport <rppt@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2020-01-31 10:30:37 -08:00
Stefan Metzmacher	d7f62e825f	io_uring: add BUILD_BUG_ON() to assert the layout of struct io_uring_sqe With nesting of anonymous unions and structs it's hard to review layout changes. It's better to ask the compiler for these things. Signed-off-by: Stefan Metzmacher <metze@samba.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-30 12:40:37 -07:00
Jens Axboe	87ce955b24	io_uring: add ->show_fdinfo() for the io_uring file descriptor It can be hard to know exactly what is registered with the ring. Especially for credentials, it'd be handy to be able to see which ones are registered, what personalities they have, and what the ID of each of them is. This adds support for showing information registered in the ring from the fdinfo of the io_uring fd. Here's an example from a test case that registers 4 files (two of them sparse), 4 buffers, and 2 personalities: pos: 0 flags: 02000002 mnt_id: 14 UserFiles: 4 0: file-no-1 1: file-no-2 2: <none> 3: <none> UserBufs: 4 0: 0x563817c46000/128 1: 0x563817c47000/256 2: 0x563817c48000/512 3: 0x563817c49000/1024 Personalities: 1 Uid: 0 0 0 0 Gid: 0 0 0 0 Groups: 0 CapEff: 0000003fffffffff 2 Uid: 0 0 0 0 Gid: 0 0 0 0 Groups: 0 CapEff: 0000003fffffffff Suggested-by: Jann Horn <jannh@google.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-30 12:40:35 -07:00
Linus Torvalds	896f8d23d0	for-5.6/io_uring-vfs-2020-01-29 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl4yEegQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpn5ZD/4/WlXs2cUDgg1C65bzZFO4qvevm+VkXmsk GbyrnFstRekvSH01/ZQxlyDVKS8Wux0XIJ6OArCh1047LvL1bEE5dvOW5iIiwa/r grjQuwFAzIPsE2fgcAO17BKIUzq2Z96+hwDzH7dw0i32yBuLvNmY/1SxcCHKfPut uzGyp7t3/2dIHbpWILRndMYe0O9j9ubmOMvKyKTwy723yDEafsUoqu2mlpigzTq4 2i+DbYBIAd8qmLqG/m3e+vOt9xodJ2Q0hlO+v6DcP2SKXU64Hb/N98HadR//aWP9 41DBXqs+dvDBcu3Jxb80PFUTiOQZECJivkns5cNcjuSXmNkOuQhDQR5K372AHmR9 m6e6FSBxwej8HselAZCI6yu9uBKd0i+MM4FnFs/O73QGYx2ayXsEXp/Jad9xiYgW pC5XJTSqJQhPE0AYYEOzHPPcBLBcpvXHkvmGKdjkNb8OLhhgh2S/YG0DNC+8ABXr j1uIe/n3kJEEmOanUyiitGyLmDq+mXd7aCVKJL/J0KiGD8Gkc1avAZ1ZrTQgjujY FqqBFawO8gv3g0L4WMI8JI+HJGMnA488obet6UKm9+l/Z/urEpXzDAKf/W/vnx2B LD0FSA0bCh1tyO6JU+avFwHlwShtV7/rx/OhrmCK7CCYKtZCA2IEctxyr8U+PBIv DtwIMTYTsA== =ZZUI -----END PGP SIGNATURE----- Merge tag 'for-5.6/io_uring-vfs-2020-01-29' of git://git.kernel.dk/linux-block Pull io_uring updates from Jens Axboe: - Support for various new opcodes (fallocate, openat, close, statx, fadvise, madvise, openat2, non-vectored read/write, send/recv, and epoll_ctl) - Faster ring quiesce for fileset updates - Optimizations for overflow condition checking - Support for max-sized clamping - Support for probing what opcodes are supported - Support for io-wq backend sharing between "sibling" rings - Support for registering personalities - Lots of little fixes and improvements * tag 'for-5.6/io_uring-vfs-2020-01-29' of git://git.kernel.dk/linux-block: (64 commits) io_uring: add support for epoll_ctl(2) eventpoll: support non-blocking do_epoll_ctl() calls eventpoll: abstract out epoll_ctl() handler io_uring: fix linked command file table usage io_uring: support using a registered personality for commands io_uring: allow registering credentials io_uring: add io-wq workqueue sharing io-wq: allow grabbing existing io-wq io_uring/io-wq: don't use static creds/mm assignments io-wq: make the io_wq ref counted io_uring: fix refcounting with batched allocations at OOM io_uring: add comment for drain_next io_uring: don't attempt to copy iovec for READ/WRITE io_uring: honor IOSQE_ASYNC for linked reqs io_uring: prep req when do IOSQE_ASYNC io_uring: use labeled array init in io_op_defs io_uring: optimise sqe-to-req flags translation io_uring: remove REQ_F_IO_DRAINED io_uring: file switch work needs to get flushed on exit io_uring: hide uring_fd in ctx ...	2020-01-29 18:53:37 -08:00
Jens Axboe	3e4827b05d	io_uring: add support for epoll_ctl(2) This adds IORING_OP_EPOLL_CTL, which can perform the same work as the epoll_ctl(2) system call. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-29 15:46:09 -07:00
Jens Axboe	f86cd20c94	io_uring: fix linked command file table usage We're not consistent in how the file table is grabbed and assigned if we have a command linked that requires the use of it. Add ->file_table to the io_op_defs[] array, and use that to determine when to grab the table instead of having the handlers set it if they need to defer. This also means we can kill the IO_WQ_WORK_NEEDS_FILES flag. We always initialize work->files, so io-wq can just check for that. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-29 13:46:44 -07:00
Jens Axboe	75c6a03904	io_uring: support using a registered personality for commands For personalities previously registered via IORING_REGISTER_PERSONALITY, allow any command to select them. This is done through setting sqe->personality to the id returned from registration, and then flagging sqe->flags with IOSQE_PERSONALITY. Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-28 17:45:31 -07:00
Jens Axboe	071698e13a	io_uring: allow registering credentials If an application wants to use a ring with different kinds of credentials, it can register them upfront. We don't lookup credentials, the credentials of the task calling IORING_REGISTER_PERSONALITY is used. An 'id' is returned for the application to use in subsequent personality support. Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-28 17:44:44 -07:00
Pavel Begunkov	24369c2e3b	io_uring: add io-wq workqueue sharing If IORING_SETUP_ATTACH_WQ is set, it expects wq_fd in io_uring_params to be a valid io_uring fd io-wq of which will be shared with the newly created io_uring instance. If the flag is set but it can't share io-wq, it fails. This allows creation of "sibling" io_urings, where we prefer to keep the SQ/CQ private, but want to share the async backend to minimize the amount of overhead associated with having multiple rings that belong to the same backend. Reported-by: Jens Axboe <axboe@kernel.dk> Reported-by: Daurnimator <quae@daurnimator.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-28 17:44:41 -07:00
Jens Axboe	cccf0ee834	io_uring/io-wq: don't use static creds/mm assignments We currently setup the io_wq with a static set of mm and creds. Even for a single-use io-wq per io_uring, this is suboptimal as we have may have multiple enters of the ring. For sharing the io-wq backend, it doesn't work at all. Switch to passing in the creds and mm when the work item is setup. This means that async work is no longer deferred to the io_uring mm and creds, it is done with the current mm and creds. Flag this behavior with IORING_FEAT_CUR_PERSONALITY, so applications know they can rely on the current personality (mm and creds) being the same for direct issue and async issue. Reviewed-by: Stefan Metzmacher <metze@samba.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-28 17:44:20 -07:00
Pavel Begunkov	9466f43741	io_uring: fix refcounting with batched allocations at OOM In case of out of memory the second argument of percpu_ref_put_many() in io_submit_sqes() may evaluate into "nr - (-EAGAIN)", that is clearly wrong. Fixes: `2b85edfc0c` ("io_uring: batch getting pcpu references") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-27 15:36:30 -07:00
Pavel Begunkov	8cdf2193a3	io_uring: add comment for drain_next Draining the middle of a link is tricky, so leave a comment there Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-27 15:36:30 -07:00
Jens Axboe	980ad26304	io_uring: don't attempt to copy iovec for READ/WRITE For the non-vectored variant of READV/WRITEV, we don't need to setup an async io context, and we flag that appropriately in the io_op_defs array. However, in fixing this for the 5.5 kernel in commit `74566df3a7` we didn't have these opcodes, so the check there was added just for the READ_FIXED and WRITE_FIXED opcodes. Replace that check with just a single check for needing async context, that covers all four of these read/write variants that don't use an iovec. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-27 15:36:29 -07:00
Jens Axboe	ebe1002621	io_uring: don't cancel all work on process exit If we're sharing the ring across forks, then one process exiting means that we cancel ALL work and prevent future work. This is overly restrictive. As long as we cancel the work associated with the files from the current task, it's safe to let others persist. Normal fd close on exit will still wait (and cancel) pending work. Fixes: `fcb323cc53` ("io_uring: io_uring: add support for async work inheriting files") Reported-by: Andres Freund <andres@anarazel.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-26 10:17:12 -07:00
Jens Axboe	73e08e711d	Revert "io_uring: only allow submit from owning task" This ends up being too restrictive for tasks that willingly fork and share the ring between forks. Andres reports that this breaks his postgresql work. Since we're close to 5.5 release, revert this change for now. Cc: stable@vger.kernel.org Fixes: `44d282796f` ("io_uring: only allow submit from owning task") Reported-by: Andres Freund <andres@anarazel.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-26 09:56:05 -07:00
Pavel Begunkov	86a761f81e	io_uring: honor IOSQE_ASYNC for linked reqs REQ_F_FORCE_ASYNC is checked only for the head of a link. Fix it. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-22 13:57:48 -07:00
Pavel Begunkov	1118591ab8	io_uring: prep req when do IOSQE_ASYNC Whenever IOSQE_ASYNC is set, requests will be punted to async without getting into io_issue_req() and without proper preparation done (e.g. io_req_defer_prep()). Hence they will be left uninitialised. Prepare them before punting. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-22 13:57:46 -07:00
Pavel Begunkov	0463b6c58e	io_uring: use labeled array init in io_op_defs Don't rely on implicit ordering of IORING_OP_ and explicitly place them at a right place in io_op_defs. Now former comments are now a part of the code and won't ever outdate. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-20 17:04:07 -07:00
Pavel Begunkov	6b47ee6eca	io_uring: optimise sqe-to-req flags translation For each IOSQE_* flag there is a corresponding REQ_F_* flag. And there is a repetitive pattern of their translation: e.g. if (sqe->flags & SQE_FLAG) req->flags \|= REQ_F_FLAG Use same numeric values/bits for them and copy instead of manual handling. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-20 17:04:07 -07:00
Pavel Begunkov	87987898a1	io_uring: remove REQ_F_IO_DRAINED A request can get into the defer list only once, there is no need for marking it as drained, so remove it. This probably was left after extracting __need_defer() for use in timeouts. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-20 17:04:07 -07:00
Jens Axboe	e46a7950d3	io_uring: file switch work needs to get flushed on exit We currently flush early, but if we have something in progress and a new switch is scheduled, we need to ensure to flush after our teardown as well. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-20 17:04:07 -07:00
Pavel Begunkov	b14cca0c84	io_uring: hide uring_fd in ctx req->ring_fd and req->ring_file are used only during the prep stage during submission, which is is protected by mutex. There is no need to store them per-request, place them in ctx. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-20 17:04:06 -07:00
Pavel Begunkov	0791015837	io_uring: remove extra check in __io_commit_cqring __io_commit_cqring() is almost always called when there is a change in the rings, so the check is rather pessimising. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-20 17:04:06 -07:00
Pavel Begunkov	711be0312d	io_uring: optimise use of ctx->drain_next Move setting ctx->drain_next to the only place it could be set, when it got linked non-head requests. The same for checking it, it's interesting only for a head of a link or a non-linked request. No functional changes here. This removes some code from the common path and also removes REQ_F_DRAIN_LINK flag, as it doesn't need it anymore. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-20 17:04:06 -07:00
Jens Axboe	66f4af93da	io_uring: add support for probing opcodes The application currently has no way of knowing if a given opcode is supported or not without having to try and issue one and see if we get -EINVAL or not. And even this approach is fraught with peril, as maybe we're getting -EINVAL due to some fields being missing, or maybe it's just not that easy to issue that particular command without doing some other leg work in terms of setup first. This adds IORING_REGISTER_PROBE, which fills in a structure with info on what it supported or not. This will work even with sparse opcode fields, which may happen in the future or even today if someone backports specific features to older kernels. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-20 17:04:06 -07:00
Jens Axboe	10fef4bebf	io_uring: account fixed file references correctly in batch We can't assume that the whole batch has fixed files in it. If it's a mix, or none at all, then we can end up doing a ref put that either messes up accounting, or causes an oops if we have no fixed files at all. Also ensure we free requests properly between inflight accounted and normal requests. Fixes: 82c721577011 ("io_uring: extend batch freeing to cover more cases") Reported-by: Dmitrii Dolgov <9erthalion6@gmail.com> Reported-by: Pavel Begunkov <asml.silence@gmail.com> Tested-by: Dmitrii Dolgov <9erthalion6@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-20 17:04:06 -07:00
Jens Axboe	354420f705	io_uring: add opcode to issue trace event For some test apps at least, user_data is just zeroes. So it's not a good way to tell what the command actually is. Add the opcode to the issue trace point. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-20 17:04:06 -07:00
Jens Axboe	cebdb98617	io_uring: add support for IORING_OP_OPENAT2 Add support for the new openat2(2) system call. It's trivial to do, as we can have openat(2) just be wrapped around it. Suggested-by: Stefan Metzmacher <metze@samba.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-20 17:04:04 -07:00
Jens Axboe	f8748881b1	io_uring: remove 'fname' from io_open structure We only use it internally in the prep functions for both statx and openat, so we don't need it to be persistent across the request. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-20 17:04:04 -07:00
Jens Axboe	c12cedf24e	io_uring: add 'struct open_how' to the openat request context We'll need this for openat2(2) support, remove flags and mode from the existing io_open struct. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-20 17:04:04 -07:00
Jens Axboe	f2842ab5b7	io_uring: enable option to only trigger eventfd for async completions If an application is using eventfd notifications with poll to know when new SQEs can be issued, it's expecting the following read/writes to complete inline. And with that, it knows that there are events available, and don't want spurious wakeups on the eventfd for those requests. This adds IORING_REGISTER_EVENTFD_ASYNC, which works just like IORING_REGISTER_EVENTFD, except it only triggers notifications for events that happen from async completions (IRQ, or io-wq worker completions). Any completions inline from the submission itself will not trigger notifications. Suggested-by: Mark Papadakis <markuspapadakis@icloud.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-20 17:04:04 -07:00
Jens Axboe	69b3e54613	io_uring: change io_ring_ctx bool fields into bit fields In preparation for adding another one, which would make us spill into another long (and hence bump the size of the ctx), change them to bit fields. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-20 17:04:04 -07:00
Jens Axboe	c150368b49	io_uring: file set registration should use interruptible waits If an application attempts to register a set with unbounded requests pending, we can be stuck here forever if they don't complete. We can make this wait interruptible, and just abort if we get signaled. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-20 17:04:04 -07:00
YueHaibing	96fd84d83a	io_uring: Remove unnecessary null check Null check kfree is redundant, so remove it. This is detected by coccinelle. Signed-off-by: YueHaibing <yuehaibing@huawei.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-20 17:04:04 -07:00
Jens Axboe	fddafacee2	io_uring: add support for send(2) and recv(2) This adds IORING_OP_SEND for send(2) support, and IORING_OP_RECV for recv(2) support. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-20 17:04:04 -07:00
Pavel Begunkov	2550878f84	io_uring: remove extra io_wq_current_is_worker() io_wq workers use io_issue_sqe() to forward sqes and never io_queue_sqe(). Remove extra check for io_wq_current_is_worker() Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-20 17:04:04 -07:00
Pavel Begunkov	caf582c652	io_uring: optimise commit_sqring() for common case It should be pretty rare to not submitting anything when there is something in the ring. No need to keep heuristics for this case. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-20 17:04:04 -07:00
Pavel Begunkov	ee7d46d9db	io_uring: optimise head checks in io_get_sqring() A user may ask to submit more than there is in the ring, and then io_uring will submit as much as it can. However, in the last iteration it will allocate an io_kiocb and immediately free it. It could do better and adjust @to_submit to what is in the ring. And since the ring's head is already checked here, there is no need to do it in the loop, spamming with smp_load_acquire()'s barriers Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-20 17:04:02 -07:00
Pavel Begunkov	9ef4f12489	io_uring: clamp to_submit in io_submit_sqes() Make io_submit_sqes() to clamp @to_submit itself. It removes duplicated code and prepares for following changes. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-20 17:04:02 -07:00
Jens Axboe	8110c1a621	io_uring: add support for IORING_SETUP_CLAMP Some applications like to start small in terms of ring size, and then ramp up as needed. This is a bit tricky to do currently, since we don't advertise the max ring size. This adds IORING_SETUP_CLAMP. If set, and the values for SQ or CQ ring size exceed what we support, then clamp them at the max values instead of returning -EINVAL. Since we return the chosen ring sizes after setup, no further changes are needed on the application side. io_uring already changes the ring sizes if the application doesn't ask for power-of-two sizes, for example. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-20 17:04:02 -07:00
Jens Axboe	c6ca97b30c	io_uring: extend batch freeing to cover more cases Currently we only batch free if fixed files are used, no links, no aux data, etc. This extends the batch freeing to only exclude the linked case and fallback case, and make io_free_req_many() handle the other cases just fine. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-20 17:04:02 -07:00
Jens Axboe	8237e04598	io_uring: wrap multi-req freeing in struct req_batch This cleans up the code a bit, and it allows us to build on top of the multi-req freeing. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-20 17:04:02 -07:00
Pavel Begunkov	2b85edfc0c	io_uring: batch getting pcpu references percpu_ref_tryget() has its own overhead. Instead getting a reference for each request, grab a bunch once per io_submit_sqes(). ~5% throughput boost for a "submit and wait 128 nops" benchmark. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> __io_req_free_empty() -> __io_req_do_free() Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-20 17:04:02 -07:00
Jens Axboe	c1ca757bd6	io_uring: add IORING_OP_MADVISE This adds support for doing madvise(2) through io_uring. We assume that any operation can block, and hence punt everything async. This could be improved, but hard to make bullet proof. The async punt ensures it's safe. Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-20 17:04:02 -07:00
Jens Axboe	4840e418c2	io_uring: add IORING_OP_FADVISE This adds support for doing fadvise through io_uring. We assume that WILLNEED doesn't block, but that DONTNEED may block. Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-20 17:04:01 -07:00
Jens Axboe	ba04291eb6	io_uring: allow use of offset == -1 to mean file position This behaves like preadv2/pwritev2 with offset == -1, it'll use (and update) the current file position. This obviously comes with the caveat that if the application has multiple read/writes in flight, then the end result will not be as expected. This is similar to threads sharing a file descriptor and doing IO using the current file position. Since this feature isn't easily detectable by doing a read or write, add a feature flags, IORING_FEAT_RW_CUR_POS, to allow applications to detect presence of this feature. Reported-by: 李通洲 <carter.li@eoitek.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-20 17:03:59 -07:00
Jens Axboe	3a6820f2bb	io_uring: add non-vectored read/write commands For uses cases that don't already naturally have an iovec, it's easier (or more convenient) to just use a buffer address + length. This is particular true if the use case is from languages that want to create a memory safe abstraction on top of io_uring, and where introducing the need for the iovec may impose an ownership issue. For those cases, they currently need an indirection buffer, which means allocating data just for this purpose. Add basic read/write that don't require the iovec. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-20 17:03:59 -07:00
Jens Axboe	e94f141bd2	io_uring: improve poll completion performance For busy IORING_OP_POLL_ADD workloads, we can have enough contention on the completion lock that we fail the inline completion path quite often as we fail the trylock on that lock. Add a list for deferred completions that we can use in that case. This helps reduce the number of async offloads we have to do, as if we get multiple completions in a row, we'll piggy back on to the poll_llist instead of having to queue our own offload. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-20 17:03:59 -07:00
Jens Axboe	ad3eb2c89f	io_uring: split overflow state into SQ and CQ side We currently check ->cq_overflow_list from both SQ and CQ context, which causes some bouncing of that cache line. Add separate bits of state for this instead, so that the SQ side can check using its own state, and likewise for the CQ side. This adds ->sq_check_overflow with the SQ state, and ->cq_check_overflow with the CQ state. If we hit an overflow condition, both of these bits are set. Likewise for overflow flush clear, we clear both bits. For the fast path of just checking if there's an overflow condition on either the SQ or CQ side, we can use our own private bit for this. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-20 17:03:59 -07:00
Jens Axboe	d3656344fe	io_uring: add lookup table for various opcode needs We currently have various switch statements that check if an opcode needs a file, mm, etc. These are hard to keep in sync as opcodes are added. Add a struct io_op_def that holds all of this information, so we have just one spot to update when opcodes are added. This also enables us to NOT allocate req->io if a deferred command doesn't need it, and corrects some mistakes we had in terms of what commands need mm context. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-20 17:03:59 -07:00
Jens Axboe	add7b6b85a	io_uring: remove two unnecessary function declarations __io_free_req() and io_double_put_req() aren't used before they are defined, so we can kill these two forwards. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-20 17:03:59 -07:00
Pavel Begunkov	32fe525b6d	io_uring: move *queue_link_head() from common path Move io_queue_link_head() to links handling code in io_submit_sqe(), so it wouldn't need extra checks and would have better data locality. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-20 17:03:59 -07:00
Pavel Begunkov	9d76377f7e	io_uring: rename prev to head Calling "prev" a head of a link is a bit misleading. Rename it Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-20 17:03:59 -07:00
Jens Axboe	ce35a47a3a	io_uring: add IOSQE_ASYNC io_uring defaults to always doing inline submissions, if at all possible. But for larger copies, even if the data is fully cached, that can take a long time. Add an IOSQE_ASYNC flag that the application can set on the SQE - if set, it'll ensure that we always go async for those kinds of requests. Use the io-wq IO_WQ_WORK_CONCURRENT flag to ensure we get the concurrency we desire for this case. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-20 17:03:59 -07:00
Jens Axboe	eddc7ef52a	io_uring: add support for IORING_OP_STATX This provides support for async statx(2) through io_uring. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-20 17:03:54 -07:00
Jens Axboe	05f3fb3c53	io_uring: avoid ring quiesce for fixed file set unregister and update We currently fully quiesce the ring before an unregister or update of the fixed fileset. This is very expensive, and we can be a bit smarter about this. Add a percpu refcount for the file tables as a whole. Grab a percpu ref when we use a registered file, and put it on completion. This is cheap to do. Upon removal of a file from a set, switch the ref count to atomic mode. When we hit zero ref on the completion side, then we know we can drop the previously registered files. When the old files have been dropped, switch the ref back to percpu mode for normal operation. Since there's a period between doing the update and the kernel being done with it, add a IORING_OP_FILES_UPDATE opcode that can perform the same action. The application knows the update has completed when it gets the CQE for it. Between doing the update and receiving this completion, the application must continue to use the unregistered fd if submitting IO on this particular file. This takes the runtime of test/file-register from liburing from 14s to about 0.7s. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-20 17:03:50 -07:00
Jens Axboe	b5dba59e0c	io_uring: add support for IORING_OP_CLOSE This works just like close(2), unsurprisingly. We remove the file descriptor and post the completion inline, then offload the actual (potential) last file put to async context. Mark the async part of this work as uncancellable, as we really must guarantee that the latter part of the close is run. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-20 17:01:53 -07:00
Jens Axboe	0c9d5ccd26	io-wq: add support for uncancellable work Not all work can be cancelled, some of it we may need to guarantee that it runs to completion. Allow the caller to set IO_WQ_WORK_NO_CANCEL on work that must not be cancelled. Note that the caller work function must also check for IO_WQ_WORK_NO_CANCEL on work that is marked IO_WQ_WORK_CANCEL. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-20 17:01:53 -07:00
Jens Axboe	15b71abe7b	io_uring: add support for IORING_OP_OPENAT This works just like openat(2), except it can be performed async. For the normal case of a non-blocking path lookup this will complete inline. If we have to do IO to perform the open, it'll be done from async context. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-20 17:01:53 -07:00
Jens Axboe	d63d1b5edb	io_uring: add support for fallocate() This exposes fallocate(2) through io_uring. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-20 17:01:53 -07:00
Eugene Syromiatnikov	1292e972ff	io_uring: fix compat for IORING_REGISTER_FILES_UPDATE fds field of struct io_uring_files_update is problematic with regards to compat user space, as pointer size is different in 32-bit, 32-on-64-bit, and 64-bit user space. In order to avoid custom handling of compat in the syscall implementation, make fds __u64 and use u64_to_user_ptr in order to retrieve it. Also, align the field naturally and check that no garbage is passed there. Fixes: `c3a31e6056` ("io_uring: add support for IORING_REGISTER_FILES_UPDATE") Signed-off-by: Eugene Syromiatnikov <esyr@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-20 17:00:44 -07:00
Jens Axboe	44d282796f	io_uring: only allow submit from owning task If the credentials or the mm doesn't match, don't allow the task to submit anything on behalf of this ring. The task that owns the ring can pass the file descriptor to another task, but we don't want to allow that task to submit an SQE that then assumes the ring mm and creds if it needs to go async. Cc: stable@vger.kernel.org Suggested-by: Stefan Metzmacher <metze@samba.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-16 21:43:24 -07:00
Jens Axboe	11ba820bf1	io_uring: ensure workqueue offload grabs ring mutex for poll list A previous commit moved the locking for the async sqthread, but didn't take into account that the io-wq workers still need it. We can't use req->in_async for this anymore as both the sqthread and io-wq workers set it, gate the need for locking on io_wq_current_is_worker() instead. Fixes: `8a4955ff1c` ("io_uring: sqthread should grab ctx->uring_lock for submissions") Reported-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-15 21:51:17 -07:00
Bijan Mottahedeh	797f3f535d	io_uring: clear req->result always before issuing a read/write request req->result is cleared when io_issue_sqe() calls io_read/write_pre() routines. Those routines however are not called when the sqe argument is NULL, which is the case when io_issue_sqe() is called from io_wq_submit_work(). io_issue_sqe() may then examine a stale result if a polled request had previously failed with -EAGAIN: if (ctx->flags & IORING_SETUP_IOPOLL) { if (req->result == -EAGAIN) return -EAGAIN; io_iopoll_req_issued(req); } and in turn cause a subsequently completed request to be re-issued in io_wq_submit_work(). Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-15 21:36:13 -07:00
Jens Axboe	78912934f4	io_uring: be consistent in assigning next work from handler If we pass back dependent work in case of links, we need to always ensure that we call the link setup and work prep handler. If not, we might be missing some setup for the next work item. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-14 22:09:06 -07:00
Jens Axboe	74566df3a7	io_uring: don't setup async context for read/write fixed We don't need it, and if we have it, then the retry handler will attempt to copy the non-existent iovec with the inline iovec, with a segment count that doesn't make sense. Fixes: `f67676d160` ("io_uring: ensure async punted read/write requests copy iovec") Reported-by: Jonathan Lemon <jonathan.lemon@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-13 19:25:29 -07:00
Jens Axboe	eacc6dfaea	io_uring: remove punt of short reads to async context We currently punt any short read on a regular file to async context, but this fails if the short read is due to running into EOF. This is especially problematic since we only do the single prep for commands now, as we don't reset kiocb->ki_pos. This can result in a 4k read on a 1k file returning zero, as we detect the short read and then retry from async context. At the time of retry, the position is now 1k, and we end up reading nothing, and hence return 0. Instead of trying to patch around the fact that short reads can be legitimate and won't succeed in case of retry, remove the logic to punt a short read to async context. Simply return it. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-07 13:08:56 -07:00
Jens Axboe	3529d8c2b3	io_uring: pass in 'sqe' to the prep handlers This moves the prep handlers outside of the opcode handlers, and allows us to pass in the sqe directly. If the sqe is non-NULL, it means that the request should be prepared for the first time. With the opcode handlers not having access to the sqe at all, we are guaranteed that the prep handler has setup the request fully by the time we get there. As before, for opcodes that need to copy in more data then the io_kiocb allows for, the io_async_ctx holds that info. If a prep handler is invoked with req->io set, it must use that to retain information for later. Finally, we can remove io_kiocb->sqe as well. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-12-20 10:04:50 -07:00
Jens Axboe	06b76d44ba	io_uring: standardize the prep methods We currently have a mix of use cases. Most of the newer ones are pretty uniform, but we have some older ones that use different calling calling conventions. This is confusing. For the opcodes that currently rely on the req->io->sqe copy saving them from reuse, add a request type struct in the io_kiocb command union to store the data they need. Prepare for all opcodes having a standard prep method, so we can call it in a uniform fashion and outside of the opcode handler. This is in preparation for passing in the 'sqe' pointer, rather than storing it in the io_kiocb. Once we have uniform prep handlers, we can leave all the prep work to that part, and not even pass in the sqe to the opcode handler. This ensures that we don't reuse sqe data inadvertently. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-12-20 10:04:22 -07:00
Jens Axboe	26a61679f1	io_uring: read 'count' for IORING_OP_TIMEOUT in prep handler Add the count field to struct io_timeout, and ensure the prep handler has read it. Timeout also needs an async context always, set it up in the prep handler if we don't have one. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-12-20 09:55:33 -07:00
Jens Axboe	e47293fdf9	io_uring: move all prep state for IORING_OP_{SEND,RECV}_MGS to prep handler Add struct io_sr_msg in our io_kiocb per-command union, and ensure that the send/recvmsg prep handlers have grabbed what they need from the SQE by the time prep is done. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-12-20 09:55:23 -07:00
Jens Axboe	3fbb51c18f	io_uring: move all prep state for IORING_OP_CONNECT to prep handler Add struct io_connect in our io_kiocb per-command union, and ensure that io_connect_prep() has grabbed what it needs from the SQE. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-12-20 09:52:48 -07:00
Jens Axboe	9adbd45d6d	io_uring: add and use struct io_rw for read/writes Put the kiocb in struct io_rw, and add the addr/len for the request as well. Use the kiocb->private field for the buffer index for fixed reads and writes. Any use of kiocb->ki_filp is flipped to req->file. It's the same thing, and less confusing. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-12-20 09:52:45 -07:00
Jens Axboe	d55e5f5b70	io_uring: use u64_to_user_ptr() consistently We use it in some spots, but not consistently. Convert the rest over, makes it easier to read as well. No functional changes in this patch. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-12-20 08:36:50 -07:00
Jens Axboe	fd6c2e4c06	io_uring: io_wq_submit_work() should not touch req->rw I've been chasing a weird and obscure crash that was userspace stack corruption, and finally narrowed it down to a bit flip that made a stack address invalid. io_wq_submit_work() unconditionally flips the req->rw.ki_flags IOCB_NOWAIT bit, but since it's a generic work handler, this isn't valid. Normal read/write operations own that part of the request, on other types it could be something else. Move the IOCB_NOWAIT clear to the read/write handlers where it belongs. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-12-18 12:19:41 -07:00
Pavel Begunkov	7c504e6520	io_uring: don't wait when under-submitting There is no reliable way to submit and wait in a single syscall, as io_submit_sqes() may under-consume sqes (in case of an early error). Then it will wait for not-yet-submitted requests, deadlocking the user in most cases. Don't wait/poll if can't submit all sqes Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-12-18 10:01:49 -07:00
Jens Axboe	e781573e2f	io_uring: warn about unhandled opcode Now that we have all the opcodes handled in terms of command prep and SQE reuse, add a printk_once() to warn about any potentially new and unhandled ones. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-12-17 19:57:27 -07:00
Jens Axboe	d625c6ee49	io_uring: read opcode and user_data from SQE exactly once If we defer a request, we can't be reading the opcode again. Ensure that the user_data and opcode fields are stable. For the user_data we already have a place for it, for the opcode we can fill a one byte hold and store that as well. For both of them, assign them when we originally read the SQE in io_get_sqring(). Any code that uses sqe->opcode or sqe->user_data is switched to req->opcode and req->user_data. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-12-17 19:57:27 -07:00
Jens Axboe	b29472ee7b	io_uring: make IORING_OP_TIMEOUT_REMOVE deferrable If we defer this command as part of a link, we have to make sure that the SQE data has been read upfront. Integrate the timeout remove op into the prep handling to make it safe for SQE reuse. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-12-17 19:57:27 -07:00
Jens Axboe	fbf23849b1	io_uring: make IORING_OP_CANCEL_ASYNC deferrable If we defer this command as part of a link, we have to make sure that the SQE data has been read upfront. Integrate the async cancel op into the prep handling to make it safe for SQE reuse. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-12-17 19:57:27 -07:00
Jens Axboe	0969e783e3	io_uring: make IORING_POLL_ADD and IORING_POLL_REMOVE deferrable If we defer these commands as part of a link, we have to make sure that the SQE data has been read upfront. Integrate the poll add/remove into the prep handling to make it safe for SQE reuse. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-12-17 19:57:27 -07:00
Pavel Begunkov	ffbb8d6b76	io_uring: make HARDLINK imply LINK The rules are as follows, if IOSQE_IO_HARDLINK is specified, then it's a link and there is no need to set IOSQE_IO_LINK separately, though it could be there. Add proper check and ensure that IOSQE_IO_HARDLINK implies IOSQE_IO_LINK. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-12-17 19:57:27 -07:00
Jens Axboe	8ed8d3c3bc	io_uring: any deferred command must have stable sqe data We're currently not retaining sqe data for accept, fsync, and sync_file_range. None of these commands need data outside of what is directly provided, hence it can't go stale when the request is deferred. However, it can get reused, if an application reuses SQE entries. Ensure that we retain the information we need and only read the sqe contents once, off the submission path. Most of this is just moving code into a prep and finish function. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-12-17 19:57:20 -07:00
Jens Axboe	fc4df999e2	io_uring: remove 'sqe' parameter to the OP helpers that take it We pass in req->sqe for all of them, no need to pass it in as the request is always passed in. This is a necessary prep patch to be able to cleanup/fix the request prep path. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-12-17 19:57:20 -07:00
Jens Axboe	b7bb4f7da0	io_uring: fix pre-prepped issue with force_nonblock == true Some of these code paths assume that any force_nonblock == true issue is not prepped, but that's not true if we did prep as part of link setup earlier. Check if we already have an async context allocate before setting up a new one. Cleanup the async context setup in general, we have a lot of duplicated code there. Fixes: `03b1230ca1` ("io_uring: ensure async punted sendmsg/recvmsg requests copy data") Fixes: `f67676d160` ("io_uring: ensure async punted read/write requests copy iovec") Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-12-17 19:57:20 -07:00
Jens Axboe	0b416c3e13	io_uring: fix sporadic -EFAULT from IORING_OP_RECVMSG If we have to punt the recvmsg to async context, we copy all the context. But since the iovec used can be either on-stack (if small) or dynamically allocated, if it's on-stack, then we need to ensure we reset the iov pointer. If we don't, then we're reusing old stack data, and that can lead to -EFAULTs if things get overwritten. Ensure we retain the right pointers for the iov, and free it as well if we end up having to go beyond UIO_FASTIOV number of vectors. Fixes: `03b1230ca1` ("io_uring: ensure async punted sendmsg/recvmsg requests copy data") Reported-by: 李通洲 <carter.li@eoitek.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-12-15 22:12:47 -07:00
Brian Gianforcaro	d195a66e36	io_uring: fix stale comment and a few typos - Fix a few typos found while reading the code. - Fix stale io_get_sqring comment referencing s->sqe, the 's' parameter was renamed to 'req', but the comment still holds. Signed-off-by: Brian Gianforcaro <b.gianfo@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-12-15 14:49:30 -07:00
Jens Axboe	9e3aa61ae3	io_uring: ensure we return -EINVAL on unknown opcode If we submit an unknown opcode and have fd == -1, io_op_needs_file() will return true as we default to needing a file. Then when we go and assign the file, we find the 'fd' invalid and return -EBADF. We really should be returning -EINVAL for that case, as we normally do for unsupported opcodes. Change io_op_needs_file() to have the following return values: 0 - does not need a file 1 - does need a file < 0 - error value and use this to pass back the right value for this invalid case. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-12-11 16:02:32 -07:00
Jens Axboe	10d5934557	io_uring: add sockets to list of files that support non-blocking issue In chasing a performance issue between using IORING_OP_RECVMSG and IORING_OP_READV on sockets, tracing showed that we always punt the socket reads to async offload. This is due to io_file_supports_async() not checking for S_ISSOCK on the inode. Since sockets supports the O_NONBLOCK (or MSG_DONTWAIT) flag just fine, add sockets to the list of file types that we can do a non-blocking issue to. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-12-10 16:33:23 -07:00
Jens Axboe	53108d476a	io_uring: only hash regular files for async work execution We hash regular files to avoid having multiple threads hammer on the inode mutex, but it should not be needed on other types of files (like sockets). Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-12-10 16:33:23 -07:00
Jens Axboe	4a0a7a1874	io_uring: run next sqe inline if possible One major use case of linked commands is the ability to run the next link inline, if at all possible. This is done correctly for async offload, but somewhere along the line we lost the ability to do so when we were able to complete a request without having to punt it. Ensure that we do so correctly. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-12-10 16:33:23 -07:00
Jens Axboe	392edb45b2	io_uring: don't dynamically allocate poll data This essentially reverts commit `e944475e69`. For high poll ops workloads, like TAO, the dynamic allocation of the wait_queue entry for IORING_OP_POLL_ADD adds considerable extra overhead. Go back to embedding the wait_queue_entry, but keep the usage of wait->private for the pointer stashing. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-12-10 16:33:23 -07:00
Jens Axboe	d96885658d	io_uring: deferred send/recvmsg should assign iov Don't just assign it from the main call path, that can miss the case when we're called from issue deferral. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-12-10 16:33:23 -07:00
Jens Axboe	8a4955ff1c	io_uring: sqthread should grab ctx->uring_lock for submissions We use the mutex to guard against registered file updates, for instance. Ensure we're safe in accessing that state against concurrent updates. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-12-10 16:33:23 -07:00
Jens Axboe	4e88d6e779	io_uring: allow unbreakable links Some commands will invariably end in a failure in the sense that the completion result will be less than zero. One such example is timeouts that don't have a completion count set, they will always complete with -ETIME unless cancelled. For linked commands, we sever links and fail the rest of the chain if the result is less than zero. Since we have commands where we know that will happen, add IOSQE_IO_HARDLINK as a stronger link that doesn't sever regardless of the completion result. Note that the link will still sever if we fail submitting the parent request, hard links are only resilient in the presence of completion results for requests that did submit correctly. Cc: stable@vger.kernel.org # v5.4 Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Reported-by: 李通洲 <carter.li@eoitek.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-12-10 16:33:06 -07:00
LimingWu	0b4295b5e2	io_uring: fix a typo in a comment thatn -> than. Signed-off-by: Liming Wu <19092205@suning.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-12-05 07:59:37 -07:00
Pavel Begunkov	4493233edc	io_uring: hook all linked requests via link_list Links are created by chaining requests through req->list with an exception that head uses req->link_list. (e.g. link_list->list->list) Because of that, io_req_link_next() needs complex splicing to advance. Link them all through list_list. Also, it seems to be simpler and more consistent IMHO. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-12-05 06:54:52 -07:00
Pavel Begunkov	2e6e1fde32	io_uring: fix error handling in io_queue_link_head In case of an error io_submit_sqe() drops a request and continues without it, even if the request was a part of a link. Not only it doesn't cancel links, but also may execute wrong sequence of actions. Stop consuming sqes, and let the user handle errors. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-12-05 06:54:51 -07:00
Jens Axboe	78076bb64a	io_uring: use hash table for poll command lookups We recently changed this from a single list to an rbtree, but for some real life workloads, the rbtree slows down the submission/insertion case enough so that it's the top cycle consumer on the io_uring side. In testing, using a hash table is a more well rounded compromise. It is fast for insertion, and as long as it's sized appropriately, it works well for the cancellation case as well. Running TAO with a lot of network sockets, this removes io_poll_req_insert() from spending 2% of the CPU cycles. Reported-by: Dan Melnic <dmm@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-12-04 20:12:58 -07:00
Jens Axboe	2d28390aff	io_uring: ensure deferred timeouts copy necessary data If we defer a timeout, we should ensure that we copy the timespec when we have consumed the sqe. This is similar to commit `f67676d160` for read/write requests. We already did this correctly for timeouts deferred as links, but do it generally and use the infrastructure added by commit `1a6b74fc87` instead of having the timeout deferral use its own. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-12-04 11:12:08 -07:00
Jens Axboe	901e59bba9	io_uring: allow IO_SQE_* flags on IORING_OP_TIMEOUT There's really no reason why we forbid things like link/drain etc on regular timeout commands. Enable the usual SQE flags on timeouts. Reported-by: 李通洲 <carter.li@eoitek.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-12-04 10:34:03 -07:00
Jens Axboe	87f80d623c	io_uring: handle connect -EINPROGRESS like -EAGAIN Right now we return it to userspace, which means the application has to poll for the socket to be writeable. Let's just treat it like -EAGAIN and have io_uring handle it internally, this makes it much easier to use. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-12-03 11:23:54 -07:00
Jackie Liu	22efde5998	io_uring: remove parameter ctx of io_submit_state_start Parameter ctx we have never used, clean it up. Signed-off-by: Jackie Liu <liuyun01@kylinos.cn> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-12-03 07:04:32 -07:00
Jens Axboe	da8c969069	io_uring: mark us with IORING_FEAT_SUBMIT_STABLE If this flag is set, applications can be certain that any data for async offload has been consumed when the kernel has consumed the SQE. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-12-03 07:04:32 -07:00
Jens Axboe	f499a021ea	io_uring: ensure async punted connect requests copy data Just like commit `f67676d160` for read/write requests, this one ensures that the sockaddr data has been copied for IORING_OP_CONNECT if we need to punt the request to async context. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-12-03 07:04:30 -07:00
Jens Axboe	03b1230ca1	io_uring: ensure async punted sendmsg/recvmsg requests copy data Just like commit `f67676d160` for read/write requests, this one ensures that the msghdr data is fully copied if we need to punt a recvmsg or sendmsg system call to async context. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-12-03 07:03:35 -07:00
Jens Axboe	f67676d160	io_uring: ensure async punted read/write requests copy iovec Currently we don't copy the iovecs when we punt to async context. This can be problematic for applications that store the iovec on the stack, as they often assume that it's safe to let the iovec go out of scope as soon as IO submission has been called. This isn't always safe, as we will re-copy the iovec once we're in async context. Make this 100% safe by copying the iovec just once. With this change, applications may safely store the iovec on the stack for all cases. Reported-by: 李通洲 <carter.li@eoitek.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-12-02 21:33:25 -07:00
Jens Axboe	1a6b74fc87	io_uring: add general async offload context Right now we just copy the sqe for async offload, but we want to store more context across an async punt. In preparation for doing so, put the sqe copy inside a structure that we can expand. With this pointer added, we can get rid of REQ_F_FREE_SQE, as that is now indicated by whether req->io is NULL or not. No functional changes in this patch. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-12-02 18:49:33 -07:00
Jens Axboe	441cdbd544	io_uring: transform send/recvmsg() -ERESTARTSYS to -EINTR We should never return -ERESTARTSYS to userspace, transform it into -EINTR. Cc: stable@vger.kernel.org # v5.3+ Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-12-02 18:49:10 -07:00
Jens Axboe	0b8c0ec7ee	io_uring: use current task creds instead of allocating a new one syzbot reports: kasan: CONFIG_KASAN_INLINE enabled kasan: GPF could be caused by NULL-ptr deref or user memory access general protection fault: 0000 [#1] PREEMPT SMP KASAN CPU: 0 PID: 9217 Comm: io_uring-sq Not tainted 5.4.0-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 RIP: 0010:creds_are_invalid kernel/cred.c:792 [inline] RIP: 0010:__validate_creds include/linux/cred.h:187 [inline] RIP: 0010:override_creds+0x9f/0x170 kernel/cred.c:550 Code: ac 25 00 81 fb 64 65 73 43 0f 85 a3 37 00 00 e8 17 ab 25 00 49 8d 7c 24 10 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 <0f> b6 04 02 84 c0 74 08 3c 03 0f 8e 96 00 00 00 41 8b 5c 24 10 bf RSP: 0018:ffff88809c45fda0 EFLAGS: 00010202 RAX: dffffc0000000000 RBX: 0000000043736564 RCX: ffffffff814f3318 RDX: 0000000000000002 RSI: ffffffff814f3329 RDI: 0000000000000010 RBP: ffff88809c45fdb8 R08: ffff8880a3aac240 R09: ffffed1014755849 R10: ffffed1014755848 R11: ffff8880a3aac247 R12: 0000000000000000 R13: ffff888098ab1600 R14: 0000000000000000 R15: 0000000000000000 FS: 0000000000000000(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007ffd51c40664 CR3: 0000000092641000 CR4: 00000000001406f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: io_sq_thread+0x1c7/0xa20 fs/io_uring.c:3274 kthread+0x361/0x430 kernel/kthread.c:255 ret_from_fork+0x24/0x30 arch/x86/entry/entry_64.S:352 Modules linked in: ---[ end trace f2e1a4307fbe2245 ]--- RIP: 0010:creds_are_invalid kernel/cred.c:792 [inline] RIP: 0010:__validate_creds include/linux/cred.h:187 [inline] RIP: 0010:override_creds+0x9f/0x170 kernel/cred.c:550 Code: ac 25 00 81 fb 64 65 73 43 0f 85 a3 37 00 00 e8 17 ab 25 00 49 8d 7c 24 10 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 <0f> b6 04 02 84 c0 74 08 3c 03 0f 8e 96 00 00 00 41 8b 5c 24 10 bf RSP: 0018:ffff88809c45fda0 EFLAGS: 00010202 RAX: dffffc0000000000 RBX: 0000000043736564 RCX: ffffffff814f3318 RDX: 0000000000000002 RSI: ffffffff814f3329 RDI: 0000000000000010 RBP: ffff88809c45fdb8 R08: ffff8880a3aac240 R09: ffffed1014755849 R10: ffffed1014755848 R11: ffff8880a3aac247 R12: 0000000000000000 R13: ffff888098ab1600 R14: 0000000000000000 R15: 0000000000000000 FS: 0000000000000000(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007ffd51c40664 CR3: 0000000092641000 CR4: 00000000001406f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 which is caused by slab fault injection triggering a failure in prepare_creds(). We don't actually need to create a copy of the creds as we're not modifying it, we just need a reference on the current task creds. This avoids the failure case as well, and propagates the const throughout the stack. Fixes: `181e448d87` ("io_uring: async workers should inherit the user creds") Reported-by: syzbot+5320383e16029ba057ff@syzkaller.appspotmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-12-02 08:50:00 -07:00
Linus Torvalds	31764f1b6d	for-linus-20191129 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl3h2LsQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpvrOD/43g08VvkLexb6DvO8kTkyxPSUsPZml292H yowv1Rij9x/7Y9MTOmcnEAV9QofRPY9nkmEq1KR+iqGQnLZsKFLrDVxQSgkhU7HJ 2wRE2DCPYTfIyIfS0TF2DdVjh3Q90tPKZbVkiOxvy9R+yS6hmTur0t731OERsMAh QgkY8zJP04XonXNwZSch9QYdpf9XGu+W83Pc0ecmootimJHVhFV8J9111dessod0 h82Epbbbc8bg/CBWCA4fDm4EZ8doMHdAHYpw60WDXPoLF9zISvg5QG+pVjKrLkVx pqgTt5Kwd/EkqurRw3sH+8A6p0ACLrUiKffX2cXk8ScdTXMGD5stmdF5cMLSikFY yJgGHOTBHdjV4T2gB1TQ0rlnnb1VtHoJm5XvPviRaSQt7irzh4EWwhYiL7JlTAC3 etCbzMPn8Sm+Ns+/zObuOmVTQvyE/+PngToxDRpgzDd4pSbMPR502iL3gvSbMDjw BCfGKFRkBYAB1SSjMwi+l9hiX2F+jTnHSvrm69HUz5K2zvWl4hsd2wClExuwoCQ+ UFXULCJZFCKbAcgEVh2OQX9JVg7fUv5GhIymEPBWFDAODwtX1XJK6IxmQhRh8owV AxBFnNdpgRcBzyy+c+2cM4JOVcm8bV1s6eYP0UyV+EieD5OcBdq7GH5YBFzOztJM SLMbjQQ/7w== =hnf4 -----END PGP SIGNATURE----- Merge tag 'for-linus-20191129' of git://git.kernel.dk/linux-block Pull block fixes from Jens Axboe: "I wasn't going to send this one off so soon, but unfortunately one of the fixes from the previous pull broke the build on some archs. So I'm sending this sooner rather than later. This contains: - Add highmem.h include for io_uring, because of the kmap() additions from last round. For some reason the build bot didn't spot this even though it sat for days. - Three minor ';' removals - Add support for the Beurer CD-on-a-chip device - Make io_uring work on MMU-less archs" * tag 'for-linus-20191129' of git://git.kernel.dk/linux-block: io_uring: fix missing kmap() declaration on powerpc ataflop: Remove unneeded semicolon block: sunvdc: Remove unneeded semicolon drbd: Remove unneeded semicolon io_uring: add mapping support for NOMMU archs sr_vendor: support Beurer GL50 evo CD-on-a-chip devices. cdrom: respect device capabilities during opening action	2019-12-01 18:26:56 -08:00
Jens Axboe	aa4c396775	io_uring: fix missing kmap() declaration on powerpc Christophe reports that current master fails building on powerpc with this error: CC fs/io_uring.o fs/io_uring.c: In function ‘loop_rw_iter’: fs/io_uring.c:1628:21: error: implicit declaration of function ‘kmap’ [-Werror=implicit-function-declaration] iovec.iov_base = kmap(iter->bvec->bv_page) ^ fs/io_uring.c:1628:19: warning: assignment makes pointer from integer without a cast [-Wint-conversion] iovec.iov_base = kmap(iter->bvec->bv_page) ^ fs/io_uring.c:1643:4: error: implicit declaration of function ‘kunmap’ [-Werror=implicit-function-declaration] kunmap(iter->bvec->bv_page); ^ which is caused by a missing highmem.h include. Fix it by including it. Fixes: `311ae9e159` ("io_uring: fix dead-hung for non-iter fixed rw") Reported-by: Christophe Leroy <christophe.leroy@c-s.fr> Tested-by: Christophe Leroy <christophe.leroy@c-s.fr> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-29 10:14:00 -07:00
Roman Penyaev	6c5c240e41	io_uring: add mapping support for NOMMU archs That is a bit weird scenario but I find it interesting to run fio loads using LKL linux, where MMU is disabled. Probably other real archs which run uClinux can also benefit from this patch. Signed-off-by: Roman Penyaev <rpenyaev@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-28 10:08:02 -07:00
Jens Axboe	e944475e69	io_uring: make poll->wait dynamically allocated In the quest to bring io_kiocb down to 3 cachelines, this one does the trick. Make the wait_queue_entry for the poll command come out of kmalloc instead of embedding it in struct io_poll_iocb, as the latter is the largest member of io_kiocb. Once we trim this down a bit, we're back at a healthy 192 bytes for struct io_kiocb. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-26 15:02:56 -07:00
Pavel Begunkov	7d00916555	io_uring: cleanup io_import_fixed() Clean io_import_fixed() call site and make it return proper type. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-26 15:02:56 -07:00
Pavel Begunkov	cf6fd4bd55	io_uring: inline struct sqe_submit There is no point left in keeping struct sqe_submit. Inline it into struct io_kiocb, so any req->submit.field is now just req->field - moves initialisation of ring_file into io_get_req() - removes duplicated req->sequence. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-26 15:02:56 -07:00
Pavel Begunkov	cc42e0ac17	io_uring: store timeout's sqe->off in proper place Timeouts' sequence offset (i.e. sqe->off) is stored in req->submit.sequence under a false name. Keep it in timeout.data instead. The unused space for sequence will be reclaimed in the following patches. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-26 15:02:56 -07:00
Hrvoje Zeba	8042d6ce8c	io_uring: remove superfluous check for sqe->off in io_accept() This field contains a pointer to addrlen and checking to see if it's set returns -EINVAL if the caller sets addr & addrlen pointers. Fixes: `17f2fe35d0` ("io_uring: add support for IORING_OP_ACCEPT") Signed-off-by: Hrvoje Zeba <zeba.hrvoje@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-25 19:56:11 -07:00
Jens Axboe	181e448d87	io_uring: async workers should inherit the user creds If we don't inherit the original task creds, then we can confuse users like fuse that pass creds in the request header. See link below on identical aio issue. Link: https://lore.kernel.org/linux-fsdevel/26f0d78e-99ca-2f1b-78b9-433088053a61@scylladb.com/T/#u Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-25 19:56:11 -07:00
Jens Axboe	576a347b7a	io-wq: have io_wq_create() take a 'data' argument We currently pass in 4 arguments outside of the bounded size. In preparation for adding one more argument, let's bundle them up in a struct to make it more readable. No functional changes in this patch. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-25 19:56:11 -07:00
Pavel Begunkov	311ae9e159	io_uring: fix dead-hung for non-iter fixed rw Read/write requests to devices without implemented read/write_iter using fixed buffers can cause general protection fault, which totally hangs a machine. io_import_fixed() initialises iov_iter with bvec, but loop_rw_iter() accesses it as iovec, dereferencing random address. kmap() page by page in this case Cc: stable@vger.kernel.org Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-25 19:56:11 -07:00
Jens Axboe	f8e85cf255	io_uring: add support for IORING_OP_CONNECT This allows an application to call connect() in an async fashion. Like other opcodes, we first try a non-blocking connect, then punt to async context if we have to. Note that we can still return -EINPROGRESS, and in that case the caller should use IORING_OP_POLL_ADD to do an async wait for completion of the connect request (just like for regular connect(2), except we can do it async here too). Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-25 19:56:11 -07:00
Jens Axboe	c4a2ed72c9	io_uring: only return -EBUSY for submit on non-flushed backlog We return -EBUSY on submit when we have a CQ ring overflow backlog, but that can be a bit problematic if the application is using pure userspace poll of the CQ ring. For that case, if the ring briefly overflowed and we have pending entries in the backlog, the submit flushes the backlog successfully but still returns -EBUSY. If we're able to fully flush the CQ ring backlog, let the submission proceed. Reported-by: Dan Melnic <dmm@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-25 19:56:11 -07:00
Pavel Begunkov	f9bd67f69a	io_uring: only !null ptr to io_issue_sqe() Pass only non-null @nxt to io_issue_sqe() and handle it at the caller's side. And propagate it. - kiocb_done() is only called from io_read() and io_write(), which are only called from io_issue_sqe(), so it's @nxt != NULL - io_put_req_find_next() is called either with explicitly non-null local nxt, or from one of the functions in io_issue_sqe() switch (or their callees). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-25 19:56:11 -07:00
Pavel Begunkov	b18fdf71e0	io_uring: simplify io_req_link_next() "if (nxt)" is always true, as it was checked in the while's condition. io_wq_current_is_worker() is unnecessary, as non-async callers don't pass nxt, so io_queue_async_work() will be called for them anyway. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-25 19:56:11 -07:00
Pavel Begunkov	944e58bfed	io_uring: pass only !null to io_req_find_next() Make io_req_find_next() and io_req_link_next() to accept only non-null nxt, and handle it in callers. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-25 19:56:11 -07:00
Pavel Begunkov	70cf9f3270	io_uring: remove io_free_req_find_next() There is only one one-liner user of io_free_req_find_next(). Inline it. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-25 19:56:11 -07:00
Pavel Begunkov	9835d6fafb	io_uring: add likely/unlikely in io_get_sqring() The number of SQEs to submit is specified by a user, so io_get_sqring() in most of the cases succeeds. Hint compilers about that. Checking ASM genereted by gcc 9.2.0 for x64, there is one branch misprediction. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-25 19:56:10 -07:00
Pavel Begunkov	d732447fed	io_uring: rename __io_submit_sqe() __io_submit_sqe() is issuing requests, so call it as such. Moreover, it ends by calling io_iopoll_req_issued(). Rename it and make terminology clearer. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-25 19:56:10 -07:00
Jens Axboe	915967f69c	io_uring: improve trace_io_uring_defer() trace point We don't have shadow requests anymore, so get rid of the shadow argument. Add the user_data argument, as that's often useful to easily match up requests, instead of having to look at request pointers. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-25 19:56:10 -07:00
Pavel Begunkov	1b4a51b6d0	io_uring: drain next sqe instead of shadowing There's an issue with the shadow drain logic in that we drop the completion lock after deciding to defer a request, then re-grab it later and assume that the state is still the same. In the mean time, someone else completing a request could have found and issued it. This can cause a stall in the queue, by having a shadow request inserted that nobody is going to drain. Additionally, if we fail allocating the shadow request, we simply ignore the drain. Instead of using a shadow request, defer the next request/link instead. This also has the following advantages: - removes semi-duplicated code - doesn't allocate memory for shadows - works better if only the head marked for drain - doesn't need complex synchronisation On the flip side, it removes the shadow->seq == last_drain_in_in_link->seq optimization. That shouldn't be a common case, and can always be added back, if needed. Fixes: `4fe2c96315` ("io_uring: add support for link with drain") Cc: Jackie Liu <liuyun01@kylinos.cn> Reported-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-25 19:56:10 -07:00
Jens Axboe	b76da70fc3	io_uring: close lookup gap for dependent next work When we find new work to process within the work handler, we queue the linked timeout before we have issued the new work. This can be problematic for very short timeouts, as we have a window where the new work isn't visible. Allow the work handler to store a callback function for this in the work item, and flag it with IO_WQ_WORK_CB if the caller has done so. If that is set, then io-wq will call the callback when it has setup the new work item. Reported-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-25 19:56:10 -07:00
Jens Axboe	4d7dd46297	io_uring: allow finding next link independent of req reference count We currently try and start the next link when we put the request, and only if we were going to free it. This means that the optimization to continue executing requests from the same context often fails, as we're not putting the final reference. Add REQ_F_LINK_NEXT to keep track of this, and allow io_uring to find the next request more efficiently. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-25 19:56:06 -07:00
Jens Axboe	eb065d301e	io_uring: io_allocate_scq_urings() should return a sane state We currently rely on the ring destroy on cleaning things up in case of failure, but io_allocate_scq_urings() can leave things half initialized if only parts of it fails. Be nice and return with either everything setup in success, or return an error with things nicely cleaned up. Reported-by: syzbot+0d818c0d39399188f393@syzkaller.appspotmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-25 19:56:06 -07:00
Pavel Begunkov	bbad27b2f6	io_uring: Always REQ_F_FREE_SQE for allocated sqe Always mark requests with allocated sqe and deallocate it in __io_free_req(). It's easier to follow and doesn't add edge cases. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-25 19:56:06 -07:00
Jens Axboe	5d960724b0	io_uring: io_fail_links() should only consider first linked timeout We currently clear the linked timeout field if we cancel such a timeout, but we should only attempt to cancel if it's the first one we see. Others should simply be freed like other requests, as they haven't been started yet. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-25 19:56:06 -07:00
Pavel Begunkov	09fbb0a83e	io_uring: Fix leaking linked timeouts let have a dependant link: REQ -> LINK_TIMEOUT -> LINK_TIMEOUT 1. submission stage: submission references for REQ and LINK_TIMEOUT are dropped. So, references respectively (1,1,2) 2. io_put(REQ) + FAIL_LINKS stage: calls io_fail_links(), which for all linked timeouts will call cancel_timeout() and drop 1 reference. So, references after: (0,0,1). That's a leak. Make it treat only the first linked timeout as such, and pass others through __io_double_put_req(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-25 19:56:06 -07:00
Pavel Begunkov	f70193d6d8	io_uring: remove redundant check Pass any IORING_OP_LINK_TIMEOUT request further, where it will eventually fail in io_issue_sqe(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-25 19:56:06 -07:00
Pavel Begunkov	d3b35796b1	io_uring: break links for failed defer If io_req_defer() failed, it needs to cancel a dependant link. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-25 19:56:06 -07:00
Jens Axboe	fba38c272a	io_uring: request cancellations should break links We currently don't explicitly break links if a request is cancelled, but we should. Add explicitly link breakage for all types of request cancellations that we support. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-25 19:56:05 -07:00
Jens Axboe	b0dd8a4126	io_uring: correct poll cancel and linked timeout expiration completion Currently a poll request fills a completion entry of 0, even if it got cancelled. This is odd, and it makes it harder to support with chains. Ensure that it returns -ECANCELED in the completions events if it got cancelled, and furthermore ensure that the linked timeout that triggered it completes with -ETIME if we did indeed trigger the completions through a timeout. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-25 19:56:05 -07:00
Jens Axboe	e0e328c4b3	io_uring: remove dead REQ_F_SEQ_PREV flag With the conversion to io-wq, we no longer use that flag. Kill it. Fixes: `561fb04a6a` ("io_uring: replace workqueue usage with io-wq") Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-25 19:56:05 -07:00
Jens Axboe	94ae5e77a9	io_uring: fix sequencing issues with linked timeouts We have an issue with timeout links that are deeper in the submit chain, because we only handle it upfront, not from later submissions. Move the prep + issue of the timeout link to the async work prep handler, and do it normally for non-async queue. If we validate and prepare the timeout links upfront when we first see them, there's nothing stopping us from supporting any sort of nesting. Fixes: `2665abfd75` ("io_uring: add support for linked SQE timeouts") Reported-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-25 19:56:05 -07:00
Jens Axboe	ad8a48acc2	io_uring: make req->timeout be dynamically allocated There are a few reasons for this: - As a prep to improving the linked timeout logic - io_timeout is the biggest member in the io_kiocb opcode union This also enables a few cleanups, like unifying the timer setup between IORING_OP_TIMEOUT and IORING_OP_LINK_TIMEOUT, and not needing multiple arguments to the link/prep helpers. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-25 19:56:01 -07:00

... 2 3 4 5 6 ...

548 Commits