Commit Graph

124 Commits

Author SHA1 Message Date
Pavel Begunkov 8418f22a53 io-wq: refactor *_get_acct()
Extract a helper for io_work_get_acct() and io_wqe_get_acct() to avoid
duplication.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-11 17:41:59 -06:00
Pavel Begunkov c60eb049f4 io-wq: cancel unbounded works on io-wq destroy
WARNING: CPU: 5 PID: 227 at fs/io_uring.c:8578 io_ring_exit_work+0xe6/0x470
RIP: 0010:io_ring_exit_work+0xe6/0x470
Call Trace:
 process_one_work+0x206/0x400
 worker_thread+0x4a/0x3d0
 kthread+0x129/0x170
 ret_from_fork+0x22/0x30

INFO: task lfs-openat:2359 blocked for more than 245 seconds.
task:lfs-openat      state:D stack:    0 pid: 2359 ppid:     1 flags:0x00000004
Call Trace:
 ...
 wait_for_completion+0x8b/0xf0
 io_wq_destroy_manager+0x24/0x60
 io_wq_put_and_exit+0x18/0x30
 io_uring_clean_tctx+0x76/0xa0
 __io_uring_files_cancel+0x1b9/0x2e0
 do_exit+0xc0/0xb40
 ...

Even after io-wq destroy has been issued io-wq worker threads will
continue executing all left work items as usual, and may hang waiting
for I/O that won't ever complete (aka unbounded).

[<0>] pipe_read+0x306/0x450
[<0>] io_iter_do_read+0x1e/0x40
[<0>] io_read+0xd5/0x330
[<0>] io_issue_sqe+0xd21/0x18a0
[<0>] io_wq_submit_work+0x6c/0x140
[<0>] io_worker_handle_work+0x17d/0x400
[<0>] io_wqe_worker+0x2c0/0x330
[<0>] ret_from_fork+0x22/0x30

Cancel all unbounded I/O instead of executing them. This changes the
user visible behaviour, but that's inevitable as io-wq is not per task.

Suggested-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/cd4b543154154cba055cf86f351441c2174d7f71.1617842918.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-08 13:33:17 -06:00
Pavel Begunkov 696ee88a7c io_uring/io-wq: protect against sprintf overflow
task_pid may be large enough to not fit into the left space of
TASK_COMM_LEN-sized buffers and overflow in sprintf. We not so care
about uniqueness, so replace it with safer snprintf().

Reported-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/1702c6145d7e1c46fbc382f28334c02e1a3d3994.1617267273.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-01 09:21:18 -06:00
Jens Axboe dbe1bdbb39 io_uring: handle signals for IO threads like a normal thread
We go through various hoops to disallow signals for the IO threads, but
there's really no reason why we cannot just allow them. The IO threads
never return to userspace like a normal thread, and hence don't go through
normal signal processing. Instead, just check for a pending signal as part
of the work loop, and call get_signal() to handle it for us if anything
is pending.

With that, we can support receiving signals, including special ones like
SIGSTOP.

Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-27 14:09:07 -06:00
Jens Axboe f5d2d23bf0 io-wq: fix race around pending work on teardown
syzbot reports that it's triggering the warning condition on having
pending work on shutdown:

WARNING: CPU: 1 PID: 12346 at fs/io-wq.c:1061 io_wq_destroy fs/io-wq.c:1061 [inline]
WARNING: CPU: 1 PID: 12346 at fs/io-wq.c:1061 io_wq_put+0x153/0x260 fs/io-wq.c:1072
Modules linked in:
CPU: 1 PID: 12346 Comm: syz-executor.5 Not tainted 5.12.0-rc2-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
RIP: 0010:io_wq_destroy fs/io-wq.c:1061 [inline]
RIP: 0010:io_wq_put+0x153/0x260 fs/io-wq.c:1072
Code: 8d e8 71 90 ea 01 49 89 c4 41 83 fc 40 7d 4f e8 33 4d 97 ff 42 80 7c 2d 00 00 0f 85 77 ff ff ff e9 7a ff ff ff e8 1d 4d 97 ff <0f> 0b eb b9 8d 6b ff 89 ee 09 de bf ff ff ff ff e8 18 51 97 ff 09
RSP: 0018:ffffc90001ebfb08 EFLAGS: 00010293
RAX: ffffffff81e16083 RBX: ffff888019038040 RCX: ffff88801e86b780
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000040
RBP: 1ffff1100b2f8a80 R08: ffffffff81e15fce R09: ffffed100b2f8a82
R10: ffffed100b2f8a82 R11: 0000000000000000 R12: 0000000000000000
R13: dffffc0000000000 R14: ffff8880597c5400 R15: ffff888019038000
FS:  00007f8dcd89c700(0000) GS:ffff8880b9c00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000055e9a054e160 CR3: 000000001dfb8000 CR4: 00000000001506f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 io_uring_clean_tctx+0x1b7/0x210 fs/io_uring.c:8802
 __io_uring_files_cancel+0x13c/0x170 fs/io_uring.c:8820
 io_uring_files_cancel include/linux/io_uring.h:47 [inline]
 do_exit+0x258/0x2340 kernel/exit.c:780
 do_group_exit+0x168/0x2d0 kernel/exit.c:922
 get_signal+0x1734/0x1ef0 kernel/signal.c:2773
 arch_do_signal_or_restart+0x3c/0x610 arch/x86/kernel/signal.c:811
 handle_signal_work kernel/entry/common.c:147 [inline]
 exit_to_user_mode_loop kernel/entry/common.c:171 [inline]
 exit_to_user_mode_prepare+0xac/0x1e0 kernel/entry/common.c:208
 __syscall_exit_to_user_mode_work kernel/entry/common.c:290 [inline]
 syscall_exit_to_user_mode+0x48/0x180 kernel/entry/common.c:301
 entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x465f69

which shouldn't happen, but seems to be possible due to a race on whether
or not the io-wq manager sees a fatal signal first, or whether the io-wq
workers do. If we race with queueing work and then send a fatal signal to
the owning task, and the io-wq worker sees that before the manager sets
IO_WQ_BIT_EXIT, then it's possible to have the worker exit and leave work
behind.

Just turn the WARN_ON_ONCE() into a cancelation condition instead.

Reported-by: syzbot+77a738a6bc947bf639ca@syzkaller.appspotmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-25 10:16:12 -06:00
Jens Axboe 0b8cfa974d io_uring: don't use {test,clear}_tsk_thread_flag() for current
Linus correctly points out that this is both unnecessary and generates
much worse code on some archs as going from current to thread_info is
actually backwards - and obviously just wasteful, since the thread_info
is what we care about.

Since io_uring only operates on current for these operations, just use
test_thread_flag() instead. For io-wq, we can further simplify and use
tracehook_notify_signal() to handle the TIF_NOTIFY_SIGNAL work and clear
the flag. The latter isn't an actual bug right now, but it may very well
be in the future if we place other work items under TIF_NOTIFY_SIGNAL.

Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/io-uring/CAHk-=wgYhNck33YHKZ14mFB5MzTTk8gqXHcfj=RWTAXKwgQJgg@mail.gmail.com/
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-21 14:16:08 -06:00
Jens Axboe 00ddff431a io-wq: ensure task is running before processing task_work
Mark the current task as running if we need to run task_work from the
io-wq threads as part of work handling. If that is the case, then return
as such so that the caller can appropriately loop back and reset if it
was part of a going-to-sleep flush.

Fixes: 3bfe610669 ("io-wq: fork worker threads from original task")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-21 09:41:14 -06:00
Jens Axboe 16efa4fce3 io_uring: allow IO worker threads to be frozen
With the freezer using the proper signaling to notify us of when it's
time to freeze a thread, we can re-enable normal freezer usage for the
IO threads. Ensure that SQPOLL, io-wq, and the io-wq manager call
try_to_freeze() appropriately, and remove the default setting of
PF_NOFREEZE from create_io_thread().

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-12 20:26:13 -07:00
Jens Axboe e22bc9b481 kernel: make IO threads unfreezable by default
The io-wq threads were already marked as no-freeze, but the manager was
not. On resume, we perpetually have signal_pending() being true, and
hence the manager will loop and spin 100% of the time.

Just mark the tasks created by create_io_thread() as PF_NOFREEZE by
default, and remove any knowledge of it in io-wq and io_uring.

Reported-by: Kevin Locke <kevin@kevinlocke.name>
Tested-by: Kevin Locke <kevin@kevinlocke.name>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-10 07:28:43 -07:00
yangerkun 70e3512509 io-wq: fix ref leak for req in case of exit cancelations
do_work such as io_wq_submit_work that cancel the work may leave a ref of
req as 1 if we have links. Fix it by call io_run_cancel.

Fixes: 4fb6ac3262 ("io-wq: improve manager/worker handling over exec")
Signed-off-by: yangerkun <yangerkun@huawei.com>
Link: https://lore.kernel.org/r/20210309030410.3294078-1-yangerkun@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-10 07:28:42 -07:00
Jens Axboe cc20e3fec6 io-wq: remove unused 'user' member of io_wq
Previous patches killed the last user of this, now it's just a dead member
in the struct. Get rid of it.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-10 07:28:42 -07:00
Pavel Begunkov 678eeba481 io-wq: warn on creating manager while exiting
Add a simple warning making sure that nobody tries to create a new
manager while we're under IO_WQ_BIT_EXIT. That can potentially happen
due to racy work submission after final put.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-07 14:12:43 -07:00
Jens Axboe 886d0137f1 io-wq: fix race in freeing 'wq' and worker access
Ran into a use-after-free on the main io-wq struct, wq. It has a worker
ref and completion event, but the manager itself isn't holding a
reference. This can lead to a race where the manager thinks there are
no workers and exits, but a worker is being added. That leads to the
following trace:

BUG: KASAN: use-after-free in io_wqe_worker+0x4c0/0x5e0
Read of size 8 at addr ffff888108baa8a0 by task iou-wrk-3080422/3080425

CPU: 5 PID: 3080425 Comm: iou-wrk-3080422 Not tainted 5.12.0-rc1+ #110
Hardware name: Micro-Star International Co., Ltd. MS-7C60/TRX40 PRO 10G (MS-7C60), BIOS 1.60 05/13/2020
Call Trace:
 dump_stack+0x90/0xbe
 print_address_description.constprop.0+0x67/0x28d
 ? io_wqe_worker+0x4c0/0x5e0
 kasan_report.cold+0x7b/0xd4
 ? io_wqe_worker+0x4c0/0x5e0
 __asan_load8+0x6d/0xa0
 io_wqe_worker+0x4c0/0x5e0
 ? io_worker_handle_work+0xc00/0xc00
 ? recalc_sigpending+0xe5/0x120
 ? io_worker_handle_work+0xc00/0xc00
 ? io_worker_handle_work+0xc00/0xc00
 ret_from_fork+0x1f/0x30

Allocated by task 3080422:
 kasan_save_stack+0x23/0x60
 __kasan_kmalloc+0x80/0xa0
 kmem_cache_alloc_node_trace+0xa0/0x480
 io_wq_create+0x3b5/0x600
 io_uring_alloc_task_context+0x13c/0x380
 io_uring_add_task_file+0x109/0x140
 __x64_sys_io_uring_enter+0x45f/0x660
 do_syscall_64+0x32/0x80
 entry_SYSCALL_64_after_hwframe+0x44/0xae

Freed by task 3080422:
 kasan_save_stack+0x23/0x60
 kasan_set_track+0x20/0x40
 kasan_set_free_info+0x24/0x40
 __kasan_slab_free+0xe8/0x120
 kfree+0xa8/0x400
 io_wq_put+0x14a/0x220
 io_wq_put_and_exit+0x9a/0xc0
 io_uring_clean_tctx+0x101/0x140
 __io_uring_files_cancel+0x36e/0x3c0
 do_exit+0x169/0x1340
 __x64_sys_exit+0x34/0x40
 do_syscall_64+0x32/0x80
 entry_SYSCALL_64_after_hwframe+0x44/0xae

Have the manager itself hold a reference, and now both drop points drop
and complete if we hit zero, and the manager can unconditionally do a
wait_for_completion() instead of having a race between reading the ref
count and waiting if it was non-zero.

Fixes: fb3a1f6c74 ("io-wq: have manager wait for all workers to exit")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-06 10:57:01 -07:00
Jens Axboe 09ca6c40c2 io-wq: kill hashed waitqueue before manager exits
If we race with shutting down the io-wq context and someone queueing
a hashed entry, then we can exit the manager with it armed. If it then
triggers after the manager has exited, we can have a use-after-free where
io_wqe_hash_wake() attempts to wake a now gone manager process.

Move the killing of the hashed write queue into the manager itself, so
that we know we've killed it before the task exits.

Fixes: e941894eae ("io-wq: make buffered file write hashed work map per-ctx")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-05 08:44:09 -07:00
Jens Axboe 46fe18b16c io_uring: move to using create_io_thread()
This allows us to do task creation and setup without needing to use
completions to try and synchronize with the starting thread. Get rid of
the old io_wq_fork_thread() wrapper, and the 'wq' and 'worker' startup
completion events - we can now do setup before the task is running.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-05 08:43:01 -07:00
Jens Axboe f01272541d io-wq: ensure all pending work is canceled on exit
If we race on shutting down the io-wq, then we should ensure that any
work that was queued after workers shutdown is canceled. Harden the
add work check a bit too, checking for IO_WQ_BIT_EXIT and cancel if
it's set.

Add a WARN_ON() for having any work before we kill the io-wq context.

Reported-by: syzbot+91b4b56ead187d35c9d3@syzkaller.appspotmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-04 06:38:10 -07:00
Jens Axboe e4b4a13f49 io_uring: ensure that threads freeze on suspend
Alex reports that his system fails to suspend using 5.12-rc1, with the
following dump:

[  240.650300] PM: suspend entry (deep)
[  240.650748] Filesystems sync: 0.000 seconds
[  240.725605] Freezing user space processes ...
[  260.739483] Freezing of tasks failed after 20.013 seconds (3 tasks refusing to freeze, wq_busy=0):
[  260.739497] task:iou-mgr-446     state:S stack:    0 pid:  516 ppid:   439 flags:0x00004224
[  260.739504] Call Trace:
[  260.739507]  ? sysvec_apic_timer_interrupt+0xb/0x81
[  260.739515]  ? pick_next_task_fair+0x197/0x1cde
[  260.739519]  ? sysvec_reschedule_ipi+0x2f/0x6a
[  260.739522]  ? asm_sysvec_reschedule_ipi+0x12/0x20
[  260.739525]  ? __schedule+0x57/0x6d6
[  260.739529]  ? del_timer_sync+0xb9/0x115
[  260.739533]  ? schedule+0x63/0xd5
[  260.739536]  ? schedule_timeout+0x219/0x356
[  260.739540]  ? __next_timer_interrupt+0xf1/0xf1
[  260.739544]  ? io_wq_manager+0x73/0xb1
[  260.739549]  ? io_wq_create+0x262/0x262
[  260.739553]  ? ret_from_fork+0x22/0x30
[  260.739557] task:iou-mgr-517     state:S stack:    0 pid:  522 ppid:   439 flags:0x00004224
[  260.739561] Call Trace:
[  260.739563]  ? sysvec_apic_timer_interrupt+0xb/0x81
[  260.739566]  ? pick_next_task_fair+0x16f/0x1cde
[  260.739569]  ? sysvec_apic_timer_interrupt+0xb/0x81
[  260.739571]  ? asm_sysvec_apic_timer_interrupt+0x12/0x20
[  260.739574]  ? __schedule+0x5b7/0x6d6
[  260.739578]  ? del_timer_sync+0x70/0x115
[  260.739581]  ? schedule_timeout+0x211/0x356
[  260.739585]  ? __next_timer_interrupt+0xf1/0xf1
[  260.739588]  ? io_wq_check_workers+0x15/0x11f
[  260.739592]  ? io_wq_manager+0x69/0xb1
[  260.739596]  ? io_wq_create+0x262/0x262
[  260.739600]  ? ret_from_fork+0x22/0x30
[  260.739603] task:iou-wrk-517     state:S stack:    0 pid:  523 ppid:   439 flags:0x00004224
[  260.739607] Call Trace:
[  260.739609]  ? __schedule+0x5b7/0x6d6
[  260.739614]  ? schedule+0x63/0xd5
[  260.739617]  ? schedule_timeout+0x219/0x356
[  260.739621]  ? __next_timer_interrupt+0xf1/0xf1
[  260.739624]  ? task_thread.isra.0+0x148/0x3af
[  260.739628]  ? task_thread_unbound+0xa/0xa
[  260.739632]  ? task_thread_bound+0x7/0x7
[  260.739636]  ? ret_from_fork+0x22/0x30
[  260.739647] OOM killer enabled.
[  260.739648] Restarting tasks ... done.
[  260.740077] PM: suspend exit

Play nice and ensure that any thread we create will call try_to_freeze()
at an opportune time so that memory suspend can proceed. For the io-wq
worker threads, mark them as PF_NOFREEZE. They could potentially be
blocked for a long time.

Reported-by: Alex Xu (Hello71) <alex_y_xu@yahoo.ca>
Tested-by: Alex Xu (Hello71) <alex_y_xu@yahoo.ca>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-04 06:38:09 -07:00
Jens Axboe dc7bbc9ef3 io-wq: fix error path leak of buffered write hash map
The 'err' path should include the hash put, we already grabbed a reference
once we get that far.

Fixes: e941894eae ("io-wq: make buffered file write hashed work map per-ctx")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-04 06:37:59 -07:00
Jens Axboe 5730b27e84 io_uring: move cred assignment into io_issue_sqe()
If we move it in there, then we no longer have to care about it in io-wq.
This means we can drop the cred handling in io-wq, and we can drop the
REQ_F_WORK_INITIALIZED flag and async init functions as that was the last
user of it since we moved to the new workers. Then we can also drop
io_wq_work->creds, and just hold the personality u16 in there instead.

Suggested-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-04 06:36:28 -07:00
Jens Axboe afcc4015d1 io-wq: provide an io_wq_put_and_exit() helper
If we put the io-wq from io_uring, we really want it to exit. Provide
a helper that does that for us. Couple that with not having the manager
hold a reference to the 'wq' and the normal SQPOLL exit will tear down
the io-wq context appropriate.

On the io-wq side, our wq context is per task, so only the task itself
is manipulating ->manager and hence it's safe to check and clear without
any extra locking. We just need to ensure that the manager task stays
around, in case it exits.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-04 06:34:39 -07:00
Jens Axboe 470ec4ed8c io-wq: fix double put of 'wq' in error path
We are already freeing the wq struct in both spots, so don't put it and
get it freed twice.

Reported-by: syzbot+7bf785eedca35ca05501@syzkaller.appspotmail.com
Fixes: 4fb6ac3262 ("io-wq: improve manager/worker handling over exec")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-04 06:34:00 -07:00
Jens Axboe d364d9e5db io-wq: wait for manager exit on wq destroy
The manager waits for the workers, hence the manager is always valid if
workers are running. Now also have wq destroy wait for the manager on
exit, so we now everything is gone.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-04 06:33:58 -07:00
Jens Axboe dbf996202e io-wq: rename wq->done completion to wq->started
This is a leftover from a different use cases, it's used to wait for
the manager to startup. Rename it as such.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-04 06:32:54 -07:00
Jens Axboe 613eeb600e io-wq: don't ask for a new worker if we're exiting
If we're in the process of shutting down the async context, then don't
create new workers if we already have at least the fixed one.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-04 06:32:53 -07:00
Jens Axboe fb3a1f6c74 io-wq: have manager wait for all workers to exit
Instead of having to wait separately on workers and manager, just have
the manager wait on the workers. We use an atomic_t for the reference
here, as we need to start at 0 and allow increment from that. Since the
number of workers is naturally capped by the allowed nr of processes,
and that uses an int, there is no risk of overflow.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-04 06:32:33 -07:00
Jens Axboe 65d4302317 io-wq: wait for worker startup when forking a new one
We need to have our worker count updated before continuing, to avoid
cases where we repeatedly think we need a new worker, but a fork is
already in progress.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-01 10:21:06 -07:00
Jens Axboe d6ce7f6761 io-wq: remove now unused IO_WQ_BIT_ERROR
This flag is now dead, remove it.

Fixes: 1cbd9c2bcf ("io-wq: don't create any IO workers upfront")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-25 10:19:43 -07:00
Jens Axboe 4fb6ac3262 io-wq: improve manager/worker handling over exec
exec will cancel any threads, including the ones that io-wq is using. This
isn't a problem, in fact we'd prefer it to be that way since it means we
know that any async work cancels naturally without having to handle it
proactively.

But it does mean that we need to setup a new manager, as the manager and
workers are gone. Handle this at queue time, and cancel work if we fail.
Since the manager can go away without us noticing, ensure that the manager
itself holds a reference to the 'wq' as well. Rename io_wq_destroy() to
io_wq_put() to reflect that.

In the future we can now simplify exec cancelation handling, for now just
leave it the same.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-25 10:17:09 -07:00
Jens Axboe e941894eae io-wq: make buffered file write hashed work map per-ctx
Before the io-wq thread change, we maintained a hash work map and lock
per-node per-ring. That wasn't ideal, as we really wanted it to be per
ring. But now that we have per-task workers, the hash map ends up being
just per-task. That'll work just fine for the normal case of having
one task use a ring, but if you share the ring between tasks, then it's
considerably worse than it was before.

Make the hash map per ctx instead, which provides full per-ctx buffered
write serialization on hashed writes.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-25 09:23:47 -07:00
Jens Axboe eb2de9418d io-wq: fix race around io_worker grabbing
There's a small window between lookup dropping the reference to the
worker and calling wake_up_process() on the worker task, where the worker
itself could have exited. We ensure that the worker struct itself is
valid, but worker->task may very well be gone by the time we issue the
wakeup.

Fix the race by using a completion triggered by the reference going to
zero, and having exit wait for that completion before proceeding.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-23 20:33:41 -07:00
Jens Axboe 8b3e78b595 io-wq: fix races around manager/worker creation and task exit
These races have always been there, they are just more apparent now that
we do early cancel of io-wq when the task exits.

Ensure that the io-wq manager sets task state correctly to not miss
wakeups for task creation. This is important if we get a wakeup after
having marked ourselves as TASK_INTERRUPTIBLE. If we do end up creating
workers, then we flip the state back to running, making the subsequent
schedule() a no-op. Also increment the wq ref count before forking the
thread, to avoid a use-after-free.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-23 20:33:38 -07:00
Jens Axboe 728f13e730 io-wq: remove nr_process accounting
We're now just using fork like we would from userspace, so there's no
need to try and impose extra restrictions or accounting on the user
side of things. That's already being done for us. That also means we
don't have to pass in the user_struct anymore, that's correctly inherited
through ->creds on fork.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-23 20:33:26 -07:00
Jens Axboe 843bbfd49f io-wq: make io_wq_fork_thread() available to other users
We want to use this in io_uring proper as well, for the SQPOLL thread.
Rename it from fork_thread() to io_wq_fork_thread(), and make it
available through the io-wq.h header.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-21 17:25:22 -07:00
Jens Axboe bf1daa4bfc io-wq: only remove worker from free_list, if it was there
If the worker isn't on the free_list, don't attempt to delete it.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-21 17:25:22 -07:00
Jens Axboe 4379bf8bd7 io_uring: remove io_identity
We are no longer grabbing state, so no need to maintain an IO identity
that we COW if there are changes.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-21 17:25:22 -07:00
Jens Axboe c6d77d92b7 io-wq: worker idling always returns false
Remove the bool return, and the checking for it in the caller.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-21 17:25:22 -07:00
Jens Axboe 3bfe610669 io-wq: fork worker threads from original task
Instead of using regular kthread kernel threads, create kernel threads
that are like a real thread that the task would create. This ensures that
we get all the context that we need, without having to carry that state
around. This greatly reduces the code complexity, and the risk of missing
state for a given request type.

With the move away from kthread, we can also dump everything related to
assigned state to the new threads.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-21 17:25:22 -07:00
Jens Axboe 958234d5ec io-wq: don't pass 'wqe' needlessly around
Just grab it from the worker itself, which we're already passing in.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-21 17:25:22 -07:00
Jens Axboe 3b094e727d io-wq: get rid of wq->use_refs
We don't support attach anymore, so doesn't make sense to carry the
use_refs reference count. Get rid of it.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-21 17:25:22 -07:00
Jens Axboe 1cbd9c2bcf io-wq: don't create any IO workers upfront
When the manager thread starts up, it creates a worker per node for
the given context. Just let these get created dynamically, like we do
for adding further workers.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-21 17:25:22 -07:00
Jens Axboe 7c25c0d16e io_uring: remove the need for relying on an io-wq fallback worker
We hit this case when the task is exiting, and we need somewhere to
do background cleanup of requests. Instead of relying on the io-wq
task manager to do this work for us, just stuff it somewhere where
we can safely run it ourselves directly.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-21 17:25:22 -07:00
Jens Axboe e06aa2e94f io-wq: clear out worker ->fs and ->files
By default, kernel threads have init_fs and init_files assigned. In the
past, this has triggered security problems, as commands that don't ask
for (and hence don't get assigned) fs/files from the originating task
can then attempt path resolution etc with access to parts of the system
they should not be able to.

Rather than add checks in the fs code for misuse, just set these to
NULL. If we do attempt to use them, then the resulting code will oops
rather than provide access to something that it should not permit.

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-12 14:02:54 -07:00
Pavel Begunkov 5280f7e530 io_uring/io-wq: return 2-step work swap scheme
Saving one lock/unlock for io-wq is not super important, but adds some
ugliness in the code. More important, atomic decs not turning it to zero
for some archs won't give the right ordering/barriers so the
io_steal_work() may pretty easily get subtly and completely broken.

Return back 2-step io-wq work exchange and clean it up.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-04 08:05:46 -07:00
Jens Axboe 4014d943cb io_uring/io-wq: kill off now unused IO_WQ_WORK_NO_CANCEL
It's no longer used as IORING_OP_CLOSE got rid for the need of flagging
it as uncancelable, kill it of.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-01 10:02:43 -07:00
Jens Axboe 446bc1c207 io-wq: kill now unused io_wq_cancel_all()
io_uring no longer issues full cancelations on the io-wq, so remove any
remnants of this code and the IO_WQ_BIT_CANCEL flag.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-20 10:47:42 -07:00
Pavel Begunkov f6edbabb83 io_uring: always batch cancel in *cancel_files()
Instead of iterating over each request and cancelling it individually in
io_uring_cancel_files(), try to cancel all matching requests and use
->inflight_list only to check if there anything left.

In many cases it should be faster, and we can reuse a lot of code from
task cancellation.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-09 12:04:00 -07:00
Jens Axboe 3dd1680d14 io-wq: cancel request if it's asking for files and we don't have them
This can't currently happen, but will be possible shortly. Handle missing
files just like we do not being able to grab a needed mm, and mark the
request as needing cancelation.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-11-04 10:22:56 -07:00
Jens Axboe 43c01fbefd io-wq: re-set NUMA node affinities if CPUs come online
We correctly set io-wq NUMA node affinities when the io-wq context is
setup, but if an entire node CPU set is offlined and then brought back
online, the per node affinities are broken. Ensure that we set them
again whenever a CPU comes online. This ensures that we always track
the right node affinity. The usual cpuhp notifiers are used to drive it.

Reported-by: Zhang Qiang <qiang.zhang@windriver.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-22 09:02:50 -06:00
Jens Axboe 69228338c9 io_uring: unify fsize with def->work_flags
This one was missed in the earlier conversion, should be included like
any of the other IO identity flags. Make sure we restore to RLIM_INIFITY
when dropping the personality again.

Fixes: 98447d65b4 ("io_uring: move io identity items into separate struct")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-20 16:03:13 -06:00
Jens Axboe 4ea33a976b io-wq: inherit audit loginuid and sessionid
Make sure the async io-wq workers inherit the loginuid and sessionid from
the original task, and restore them to unset once we're done with the
async work item.

While at it, disable the ability for kernel threads to write to their own
loginuid.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-17 09:25:47 -06:00