commit d1868328dec5ae2cf210111025fcbc71f78dd5ca upstream.
ida_alloc_range(..., min, max, ...) returns values from min to max,
inclusive.
So, NR_EXT_DEVT is a valid idx returned by blk_alloc_ext_minor().
This is an issue because in device_add_disk(), this value is used in:
ddev->devt = MKDEV(disk->major, disk->first_minor);
and NR_EXT_DEVT is '(1 << MINORBITS)'.
So, should 'disk->first_minor' be NR_EXT_DEVT, it would overflow.
Fixes: 22ae8ce8b8 ("block: simplify bdev/disk lookup in blkdev_get")
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/cc17199798312406b90834e433d2cefe8266823d.1648306232.git.christophe.jaillet@wanadoo.fr
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 15729ff8143f8135b03988a100a19e66d7cb7ecd ]
A crash [1] happened to be triggered in conjunction with commit
2d52c58b9c ("block, bfq: honor already-setup queue merges"). The
latter was then reverted by commit ebc69e897e ("Revert "block, bfq:
honor already-setup queue merges""). Yet, the reverted commit was not
the one introducing the bug. In fact, it actually triggered a UAF
introduced by a different commit, and now fixed by commit d29bd41428
("block, bfq: reset last_bfqq_created on group change").
So, there is no point in keeping commit 2d52c58b9c ("block, bfq:
honor already-setup queue merges") out. This commit restores it.
[1] https://bugzilla.kernel.org/show_bug.cgi?id=214503
Reported-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Link: https://lore.kernel.org/r/20211125181510.15004-1-paolo.valente@linaro.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit ab552fcb17cc9e4afe0e4ac4df95fc7b30e8490a ]
KASAN reports a use-after-free report when doing normal scsi-mq test
[69832.239032] ==================================================================
[69832.241810] BUG: KASAN: use-after-free in bfq_dispatch_request+0x1045/0x44b0
[69832.243267] Read of size 8 at addr ffff88802622ba88 by task kworker/3:1H/155
[69832.244656]
[69832.245007] CPU: 3 PID: 155 Comm: kworker/3:1H Not tainted 5.10.0-10295-g576c6382529e #8
[69832.246626] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
[69832.249069] Workqueue: kblockd blk_mq_run_work_fn
[69832.250022] Call Trace:
[69832.250541] dump_stack+0x9b/0xce
[69832.251232] ? bfq_dispatch_request+0x1045/0x44b0
[69832.252243] print_address_description.constprop.6+0x3e/0x60
[69832.253381] ? __cpuidle_text_end+0x5/0x5
[69832.254211] ? vprintk_func+0x6b/0x120
[69832.254994] ? bfq_dispatch_request+0x1045/0x44b0
[69832.255952] ? bfq_dispatch_request+0x1045/0x44b0
[69832.256914] kasan_report.cold.9+0x22/0x3a
[69832.257753] ? bfq_dispatch_request+0x1045/0x44b0
[69832.258755] check_memory_region+0x1c1/0x1e0
[69832.260248] bfq_dispatch_request+0x1045/0x44b0
[69832.261181] ? bfq_bfqq_expire+0x2440/0x2440
[69832.262032] ? blk_mq_delay_run_hw_queues+0xf9/0x170
[69832.263022] __blk_mq_do_dispatch_sched+0x52f/0x830
[69832.264011] ? blk_mq_sched_request_inserted+0x100/0x100
[69832.265101] __blk_mq_sched_dispatch_requests+0x398/0x4f0
[69832.266206] ? blk_mq_do_dispatch_ctx+0x570/0x570
[69832.267147] ? __switch_to+0x5f4/0xee0
[69832.267898] blk_mq_sched_dispatch_requests+0xdf/0x140
[69832.268946] __blk_mq_run_hw_queue+0xc0/0x270
[69832.269840] blk_mq_run_work_fn+0x51/0x60
[69832.278170] process_one_work+0x6d4/0xfe0
[69832.278984] worker_thread+0x91/0xc80
[69832.279726] ? __kthread_parkme+0xb0/0x110
[69832.280554] ? process_one_work+0xfe0/0xfe0
[69832.281414] kthread+0x32d/0x3f0
[69832.282082] ? kthread_park+0x170/0x170
[69832.282849] ret_from_fork+0x1f/0x30
[69832.283573]
[69832.283886] Allocated by task 7725:
[69832.284599] kasan_save_stack+0x19/0x40
[69832.285385] __kasan_kmalloc.constprop.2+0xc1/0xd0
[69832.286350] kmem_cache_alloc_node+0x13f/0x460
[69832.287237] bfq_get_queue+0x3d4/0x1140
[69832.287993] bfq_get_bfqq_handle_split+0x103/0x510
[69832.289015] bfq_init_rq+0x337/0x2d50
[69832.289749] bfq_insert_requests+0x304/0x4e10
[69832.290634] blk_mq_sched_insert_requests+0x13e/0x390
[69832.291629] blk_mq_flush_plug_list+0x4b4/0x760
[69832.292538] blk_flush_plug_list+0x2c5/0x480
[69832.293392] io_schedule_prepare+0xb2/0xd0
[69832.294209] io_schedule_timeout+0x13/0x80
[69832.295014] wait_for_common_io.constprop.1+0x13c/0x270
[69832.296137] submit_bio_wait+0x103/0x1a0
[69832.296932] blkdev_issue_discard+0xe6/0x160
[69832.297794] blk_ioctl_discard+0x219/0x290
[69832.298614] blkdev_common_ioctl+0x50a/0x1750
[69832.304715] blkdev_ioctl+0x470/0x600
[69832.305474] block_ioctl+0xde/0x120
[69832.306232] vfs_ioctl+0x6c/0xc0
[69832.306877] __se_sys_ioctl+0x90/0xa0
[69832.307629] do_syscall_64+0x2d/0x40
[69832.308362] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[69832.309382]
[69832.309701] Freed by task 155:
[69832.310328] kasan_save_stack+0x19/0x40
[69832.311121] kasan_set_track+0x1c/0x30
[69832.311868] kasan_set_free_info+0x1b/0x30
[69832.312699] __kasan_slab_free+0x111/0x160
[69832.313524] kmem_cache_free+0x94/0x460
[69832.314367] bfq_put_queue+0x582/0x940
[69832.315112] __bfq_bfqd_reset_in_service+0x166/0x1d0
[69832.317275] bfq_bfqq_expire+0xb27/0x2440
[69832.318084] bfq_dispatch_request+0x697/0x44b0
[69832.318991] __blk_mq_do_dispatch_sched+0x52f/0x830
[69832.319984] __blk_mq_sched_dispatch_requests+0x398/0x4f0
[69832.321087] blk_mq_sched_dispatch_requests+0xdf/0x140
[69832.322225] __blk_mq_run_hw_queue+0xc0/0x270
[69832.323114] blk_mq_run_work_fn+0x51/0x60
[69832.323942] process_one_work+0x6d4/0xfe0
[69832.324772] worker_thread+0x91/0xc80
[69832.325518] kthread+0x32d/0x3f0
[69832.326205] ret_from_fork+0x1f/0x30
[69832.326932]
[69832.338297] The buggy address belongs to the object at ffff88802622b968
[69832.338297] which belongs to the cache bfq_queue of size 512
[69832.340766] The buggy address is located 288 bytes inside of
[69832.340766] 512-byte region [ffff88802622b968, ffff88802622bb68)
[69832.343091] The buggy address belongs to the page:
[69832.344097] page:ffffea0000988a00 refcount:1 mapcount:0 mapping:0000000000000000 index:0xffff88802622a528 pfn:0x26228
[69832.346214] head:ffffea0000988a00 order:2 compound_mapcount:0 compound_pincount:0
[69832.347719] flags: 0x1fffff80010200(slab|head)
[69832.348625] raw: 001fffff80010200 ffffea0000dbac08 ffff888017a57650 ffff8880179fe840
[69832.354972] raw: ffff88802622a528 0000000000120008 00000001ffffffff 0000000000000000
[69832.356547] page dumped because: kasan: bad access detected
[69832.357652]
[69832.357970] Memory state around the buggy address:
[69832.358926] ffff88802622b980: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[69832.360358] ffff88802622ba00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[69832.361810] >ffff88802622ba80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[69832.363273] ^
[69832.363975] ffff88802622bb00: fb fb fb fb fb fb fb fb fb fb fb fb fb fc fc fc
[69832.375960] ffff88802622bb80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[69832.377405] ==================================================================
In bfq_dispatch_requestfunction, it may have function call:
bfq_dispatch_request
__bfq_dispatch_request
bfq_select_queue
bfq_bfqq_expire
__bfq_bfqd_reset_in_service
bfq_put_queue
kmem_cache_free
In this function call, in_serv_queue has beed expired and meet the
conditions to free. In the function bfq_dispatch_request, the address
of in_serv_queue pointing to has been released. For getting the value
of idle_timer_disabled, it will get flags value from the address which
in_serv_queue pointing to, then the problem of use-after-free happens;
Fix the problem by check in_serv_queue == bfqd->in_service_queue, to
get the value of idle_timer_disabled if in_serve_queue is equel to
bfqd->in_service_queue. If the space of in_serv_queue pointing has
been released, this judge will aviod use-after-free problem.
And if in_serv_queue may be expired or finished, the idle_timer_disabled
will be false which would not give effects to bfq_update_dispatch_stats.
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Zhang Wensheng <zhangwensheng5@huawei.com>
Link: https://lore.kernel.org/r/20220303070334.3020168-1-zhangwensheng5@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit bcd2be763252f3a4d5fc4d6008d4d96c601ee74b ]
The return value is ioprio * BFQ_WEIGHT_CONVERSION_COEFF or 0.
What we want is ioprio or 0.
Correct this by changing the calculation.
Signed-off-by: Yahu Gao <gaoyahu19@gmail.com>
Acked-by: Paolo Valente <paolo.valente@linaro.org>
Link: https://lore.kernel.org/r/20220107065859.25689-1-gaoyahu19@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 0f69288253e9fc7c495047720e523b9f1aba5712 ]
kobjects aren't supposed to be deleted before their child kobjects are
deleted. Apparently this is usually benign; however, a WARN will be
triggered if one of the child kobjects has a named attribute group:
sysfs group 'modes' not found for kobject 'crypto'
WARNING: CPU: 0 PID: 1 at fs/sysfs/group.c:278 sysfs_remove_group+0x72/0x80
...
Call Trace:
sysfs_remove_groups+0x29/0x40 fs/sysfs/group.c:312
__kobject_del+0x20/0x80 lib/kobject.c:611
kobject_cleanup+0xa4/0x140 lib/kobject.c:696
kobject_release lib/kobject.c:736 [inline]
kref_put include/linux/kref.h:65 [inline]
kobject_put+0x53/0x70 lib/kobject.c:753
blk_crypto_sysfs_unregister+0x10/0x20 block/blk-crypto-sysfs.c:159
blk_unregister_queue+0xb0/0x110 block/blk-sysfs.c:962
del_gendisk+0x117/0x250 block/genhd.c:610
Fix this by moving the kobject_del() and the corresponding
kobject_uevent() to the correct place.
Fixes: 2c2086afc2 ("block: Protect less code with sysfs_lock in blk_{un,}register_queue()")
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220124215938.2769-3-ebiggers@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit f122d103b564e5fb7c82de902c6f8f6cbdf50ec9 ]
Don't need to do blkg_iostat_set for top blkg iostat on each CPU,
so move it after percpu stat aggregation.
Fixes: ef45fe470e ("blk-cgroup: show global disk stats in root cgroup io.stat")
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20220213085902.88884-1-zhouchengming@bytedance.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit 6b2b04590b51aa4cf395fcd185ce439cab5961dc upstream.
blk-iocost and iolatency are cgroup aware rq-qos policies but they didn't
disable merges across different cgroups. This obviously can lead to
accounting and control errors but more importantly to priority inversions -
e.g. an IO which belongs to a higher priority cgroup or IO class may end up
getting throttled incorrectly because it gets merged to an IO issued from a
low priority cgroup.
Fix it by adding blk_cgroup_mergeable() which is called from merge paths and
rejects cross-cgroup and cross-issue_as_root merges.
Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: d706751215 ("block: introduce blk-iolatency io controller")
Cc: stable@vger.kernel.org # v4.19+
Cc: Josef Bacik <jbacik@fb.com>
Link: https://lore.kernel.org/r/Yi/eE/6zFNyWJ+qd@slm.duckdns.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit daaca3522a8e67c46e39ef09c1d542e866f85f3b upstream.
blkcg_init_queue() may add rq qos structures to request queue, previously
blk_cleanup_queue() calls rq_qos_exit() to release them, but commit
8e141f9eb8 ("block: drain file system I/O on del_gendisk")
moves rq_qos_exit() into del_gendisk(), so memory leak is caused
because queues may not have disk, such as un-present scsi luns, nvme
admin queue, ...
Fixes the issue by adding rq_qos_exit() to blk_cleanup_queue() back.
BTW, v5.18 won't need this patch any more since we move
blkcg_init_queue()/blkcg_exit_queue() into disk allocation/release
handler, and patches have been in for-5.18/block.
Cc: Christoph Hellwig <hch@lst.de>
Cc: stable@vger.kernel.org
Fixes: 8e141f9eb8 ("block: drain file system I/O on del_gendisk")
Reported-by: syzbot+b42749a851a47a0f581b@syzkaller.appspotmail.com
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220314043018.177141-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit b81e0c2372e65e5627864ba034433b64b2fc73f5 upstream.
Drop various include not actually used in genhd.h itself, and
move the remaning includes closer together.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20210920123328.1399408-15-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Reported-by: Sudip Mukherjee <sudipm.mukherjee@gmail.com>a
Reported-by: "H. Nikolaus Schaller" <hns@goldelico.com>
Reported-by: Guenter Roeck <linux@roeck-us.net>
Cc: "Maciej W. Rozycki" <macro@orcam.me.uk>
[ resolves MIPS build failure by luck, root cause needs to be fixed in
Linus's tree properly, but this is needed for now to fix the build - gregkh ]
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit cc8f7fe1f5eab010191aa4570f27641876fa1267 ]
Add __GFP_ZERO flag for alloc_page in function bio_copy_kern to initialize
the buffer of a bio.
Signed-off-by: Haimin Zhang <tcs.kernel@gmail.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220216084038.15635-1-tcs.kernel@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit 7a5428dcb7902700b830e912feee4e845df7c019 upstream.
Various block drivers call blk_set_queue_dying to mark a disk as dead due
to surprise removal events, but since commit 8e141f9eb8 that doesn't
work given that the GD_DEAD flag needs to be set to stop I/O.
Replace the driver calls to blk_set_queue_dying with a new (and properly
documented) blk_mark_disk_dead API, and fold blk_set_queue_dying into the
only remaining caller.
Fixes: 8e141f9eb8 ("block: drain file system I/O on del_gendisk")
Reported-by: Markus Blöchl <markus.bloechl@ipetronik.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Link: https://lore.kernel.org/r/20220217075231.1140-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit e92bc4cd34de2ce454bdea8cd198b8067ee4e123 upstream.
Now that we disable wbt by set WBT_STATE_OFF_DEFAULT in
wbt_disable_default() when switch elevator to bfq. And when
we remove scsi device, wbt will be enabled by wbt_enable_default.
If it become false positive between wbt_wait() and wbt_track()
when submit write request.
The following is the scenario that triggered the problem.
T1 T2 T3
elevator_switch_mq
bfq_init_queue
wbt_disable_default <= Set
rwb->enable_state (OFF)
Submit_bio
blk_mq_make_request
rq_qos_throttle
<= rwb->enable_state (OFF)
scsi_remove_device
sd_remove
del_gendisk
blk_unregister_queue
elv_unregister_queue
wbt_enable_default
<= Set rwb->enable_state (ON)
q_qos_track
<= rwb->enable_state (ON)
^^^^^^ this request will mark WBT_TRACKED without inflight add and will
lead to drop rqw->inflight to -1 in wbt_done() which will trigger IO hung.
Fix this by move wbt_enable_default() from elv_unregister to
bfq_exit_queue(). Only re-enable wbt when bfq exit.
Fixes: 76a8040817 ("blk-wbt: make sure throttle is enabled properly")
Remove oneline stale comment, and kill one oneshot local variable.
Signed-off-by: Ming Lei <ming.lei@rehdat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/linux-block/20211214133103.551813-1-qiulaibin@huawei.com/
Signed-off-by: Laibin Qiu <qiulaibin@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit b13e0c71856817fca67159b11abac350e41289f5 upstream.
Commit 309a62fa3a ("bio-integrity: bio_integrity_advance must update
integrity seed") added code to update the integrity seed value when
advancing a bio. However, it failed to take into account that the
integrity interval might be larger than the 512-byte block layer
sector size. This broke bio splitting on PI devices with 4KB logical
blocks.
The seed value should be advanced by bio_integrity_intervals() and not
the number of sectors.
Cc: Dmitry Monakhov <dmonakhov@openvz.org>
Cc: stable@vger.kernel.org
Fixes: 309a62fa3a ("bio-integrity: bio_integrity_advance must update integrity seed")
Tested-by: Dmitry Ivanov <dmitry.ivanov2@hpe.com>
Reported-by: Alexey Lyashkov <alexey.lyashkov@hpe.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20220204034209.4193-1-martin.petersen@oracle.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 3ee859e384d453d6ac68bfd5971f630d9fa46ad3 upstream.
bio_truncate() clears the buffer outside of last block of bdev, however
current bio_truncate() is using the wrong offset of page. So it can
return the uninitialized data.
This happened when both of truncated/corrupted FS and userspace (via
bdev) are trying to read the last of bdev.
Reported-by: syzbot+ac94ae5f68b84197f41c@syzkaller.appspotmail.com
Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/875yqt1c9g.fsf@mail.parknet.co.jp
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit e45c47d1f94e0cc7b6b079fdb4bcce2995e2adc4 upstream.
bio_start_io_acct_time() interface is like bio_start_io_acct() that
allows start_time to be passed in. This gives drivers the ability to
defer starting accounting until after IO is issued (but possibily not
entirely due to bio splitting).
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Link: https://lore.kernel.org/r/20220128155841.39644-2-snitzer@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 46cdc45acb089c811d9a54fd50af33b96e5fae9d upstream.
A previous commit added this feature, but it inadvertently used the wrong
variable to show/store the setting from/to, victimized by copy/paste. Fix
it up so that the async_depth sysfs interface reads and writes from the
right setting.
Fixes: 07757588e5 ("block/mq-deadline: Reserve 25% of scheduler tags for synchronous requests")
Link: https://bugzilla.kernel.org/show_bug.cgi?id=215485
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit e338924bd05d6e71574bc13e310c89e10e49a8a5 ]
ioctl(fd, LOOP_CTL_ADD, 1048576) causes
sysfs: cannot create duplicate filename '/dev/block/7:0'
message because such request is treated as if ioctl(fd, LOOP_CTL_ADD, 0)
due to MINORMASK == 1048575. Verify that all minor numbers for that device
fit in the minor range.
Reported-by: wangyangbo <wangyangbo@uniontech.com>
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/b1b19379-23ee-5379-0eb5-94bf5f79f1b4@i-love.sakura.ne.jp
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 6e1fcab00a23f7fe9f4fe9704905a790efa1eeab ]
John Garry reported a deadlock that occurs when trying to access a
runtime-suspended SATA device. For obscure reasons, the rescan procedure
causes the link to be hard-reset, which disconnects the device.
The rescan tries to carry out a runtime resume when accessing the device.
scsi_rescan_device() holds the SCSI device lock and won't release it until
it can put commands onto the device's block queue. This can't happen until
the queue is successfully runtime-resumed or the device is unregistered.
But the runtime resume fails because the device is disconnected, and
__scsi_remove_device() can't do the unregistration because it can't get the
device lock.
The best way to resolve this deadlock appears to be to allow the block
queue to start running again even after an unsuccessful runtime resume.
The idea is that the driver or the SCSI error handler will need to be able
to use the queue to resolve the runtime resume failure.
This patch removes the err argument to blk_post_runtime_resume() and makes
the routine act as though the resume was successful always. This fixes the
deadlock.
Link: https://lore.kernel.org/r/1639999298-244569-4-git-send-email-chenxiang66@hisilicon.com
Fixes: e27829dc92 ("scsi: serialize ->rescan against ->remove")
Reported-and-tested-by: John Garry <john.garry@huawei.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Alan Stern <stern@rowland.harvard.edu>
Signed-off-by: Xiang Chen <chenxiang66@hisilicon.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 99d8690aae4b2f0d1d90075de355ac087f820a66 ]
One device_add is called disk->ev will be freed by disk_release, so we
should free it twice. Fix this by allocating disk->ev after device_add
so that the extra local unwinding can be removed entirely.
Based on an earlier patch from Tetsuo Handa.
Reported-by: syzbot <syzbot+28a66a9fbc621c939000@syzkaller.appspotmail.com>
Tested-by: syzbot <syzbot+28a66a9fbc621c939000@syzkaller.appspotmail.com>
Fixes: 83cbce9574 ("block: add error handling for device_add_disk / add_disk")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211221161851.788424-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit c65e6fd460b4df796ecd6ea22e132076ed1f2820 ]
Commit 7cc4ffc555 ("block, bfq: put reqs of waker and woken in
dispatch list") added a condition to bfq_insert_request() which added
waker's requests directly to dispatch list. The rationale was that
completing waker's IO is needed to get more IO for the current queue.
Although this rationale is valid, there is a hole in it. The waker does
not necessarily serve the IO only for the current queue and maybe it's
current IO is not needed for current queue to make progress. Furthermore
injecting IO like this completely bypasses any service accounting within
bfq and thus we do not properly track how much service is waker's queue
getting or that the waker is actually doing any IO. Depending on the
conditions this can result in the waker getting too much or too few
service.
Consider for example the following job file:
[global]
directory=/mnt/repro/
rw=write
size=8g
time_based
runtime=30
ramp_time=10
blocksize=1m
direct=0
ioengine=sync
[slowwriter]
numjobs=1
prioclass=2
prio=7
fsync=200
[fastwriter]
numjobs=1
prioclass=2
prio=0
fsync=200
Despite processes have very different IO priorities, they get the same
about of service. The reason is that bfq identifies these processes as
having waker-wakee relationship and once that happens, IO from
fastwriter gets injected during slowwriter's time slice. As a result bfq
is not aware that fastwriter has any IO to do and constantly schedules
only slowwriter's queue. Thus fastwriter is forced to compete with
slowwriter's IO all the time instead of getting its share of time based
on IO priority.
Drop the special injection condition from bfq_insert_request(). As a
result, requests will be tracked and queued in a normal way and on next
dispatch bfq_select_queue() can decide whether the waker's inserted
requests should be injected during the current queue's timeslice or not.
Fixes: 7cc4ffc555 ("block, bfq: put reqs of waker and woken in dispatch list")
Acked-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20211125133645.27483-8-jack@suse.cz
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit edaa26334c117a584add6053f48d63a988d25a6e upstream.
The donation calculation logic assumes that the donor has non-zero
after-donation hweight, so the lowest active hweight a donating cgroup can
have is 2 so that it can donate 1 while keeping the other 1 for itself.
Earlier, we only donated from cgroups with sizable surpluses so this
condition was always true. However, with the precise donation algorithm
implemented, f1de2439ec ("blk-iocost: revamp donation amount
determination") made the donation amount calculation exact enabling even low
hweight cgroups to donate.
This means that in rare occasions, a cgroup with active hweight of 1 can
enter donation calculation triggering the following warning and then a
divide-by-zero oops.
WARNING: CPU: 4 PID: 0 at block/blk-iocost.c:1928 transfer_surpluses.cold+0x0/0x53 [884/94867]
...
RIP: 0010:transfer_surpluses.cold+0x0/0x53
Code: 92 ff 48 c7 c7 28 d1 ab b5 65 48 8b 34 25 00 ae 01 00 48 81 c6 90 06 00 00 e8 8b 3f fe ff 48 c7 c0 ea ff ff ff e9 95 ff 92 ff <0f> 0b 48 c7 c7 30 da ab b5 e8 71 3f fe ff 4c 89 e8 4d 85 ed 74 0
4
...
Call Trace:
<IRQ>
ioc_timer_fn+0x1043/0x1390
call_timer_fn+0xa1/0x2c0
__run_timers.part.0+0x1ec/0x2e0
run_timer_softirq+0x35/0x70
...
iocg: invalid donation weights in /a/b: active=1 donating=1 after=0
Fix it by excluding cgroups w/ active hweight < 2 from donating. Excluding
these extreme low hweight donations shouldn't affect work conservation in
any meaningful way.
Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: f1de2439ec ("blk-iocost: revamp donation amount determination")
Cc: stable@vger.kernel.org # v5.10+
Link: https://lore.kernel.org/r/Ybfh86iSvpWKxhVM@slm.duckdns.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit e6a59aac8a8713f335a37d762db0dbe80e7f6d38 upstream.
do_each_pid_thread(PIDTYPE_PGID) can race with a concurrent
change_pid(PIDTYPE_PGID) that can move the task from one hlist
to another while iterating. Serialize ioprio_get to take
the tasklist_lock in this case, just like it's set counterpart.
Fixes: d69b78ba1d (ioprio: grab rcu_read_lock in sys_ioprio_{set,get}())
Acked-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Link: https://lore.kernel.org/r/20211210182058.43417-1-dave@stgolabs.net
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 245a489e81e13dd55ae46d27becf6d5901eb7828 upstream.
elevator_init_mq() is only called before adding disk, when there isn't
any FS I/O, only passthrough requests can be queued, so freezing queue
plus canceling dispatch work is enough to drain any dispatch activities,
then we can avoid synchronize_srcu() in blk_mq_quiesce_queue().
Long boot latency issue can be fixed in case of lots of disks added
during booting.
Fixes: 737eb78e82 ("block: Delay default elevator initialization")
Reported-by: yangerkun <yangerkun@huawei.com>
Cc: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20211117115502.1600950-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit b781d8db580c058ecd54ed7d5dde7f8270b25f5b ]
KASAN reports a use-after-free report when doing block test:
==================================================================
[10050.967049] BUG: KASAN: use-after-free in
submit_bio_checks+0x1539/0x1550
[10050.977638] Call Trace:
[10050.978190] dump_stack+0x9b/0xce
[10050.979674] print_address_description.constprop.6+0x3e/0x60
[10050.983510] kasan_report.cold.9+0x22/0x3a
[10050.986089] submit_bio_checks+0x1539/0x1550
[10050.989576] submit_bio_noacct+0x83/0xc80
[10050.993714] submit_bio+0xa7/0x330
[10050.994435] mpage_readahead+0x380/0x500
[10050.998009] read_pages+0x1c1/0xbf0
[10051.002057] page_cache_ra_unbounded+0x4c2/0x6f0
[10051.007413] do_page_cache_ra+0xda/0x110
[10051.008207] force_page_cache_ra+0x23d/0x3d0
[10051.009087] page_cache_sync_ra+0xca/0x300
[10051.009970] generic_file_buffered_read+0xbea/0x2130
[10051.012685] generic_file_read_iter+0x315/0x490
[10051.014472] blkdev_read_iter+0x113/0x1b0
[10051.015300] aio_read+0x2ad/0x450
[10051.023786] io_submit_one+0xc8e/0x1d60
[10051.029855] __se_sys_io_submit+0x125/0x350
[10051.033442] do_syscall_64+0x2d/0x40
[10051.034156] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[10051.048733] Allocated by task 18598:
[10051.049482] kasan_save_stack+0x19/0x40
[10051.050263] __kasan_kmalloc.constprop.1+0xc1/0xd0
[10051.051230] kmem_cache_alloc+0x146/0x440
[10051.052060] mempool_alloc+0x125/0x2f0
[10051.052818] bio_alloc_bioset+0x353/0x590
[10051.053658] mpage_alloc+0x3b/0x240
[10051.054382] do_mpage_readpage+0xddf/0x1ef0
[10051.055250] mpage_readahead+0x264/0x500
[10051.056060] read_pages+0x1c1/0xbf0
[10051.056758] page_cache_ra_unbounded+0x4c2/0x6f0
[10051.057702] do_page_cache_ra+0xda/0x110
[10051.058511] force_page_cache_ra+0x23d/0x3d0
[10051.059373] page_cache_sync_ra+0xca/0x300
[10051.060198] generic_file_buffered_read+0xbea/0x2130
[10051.061195] generic_file_read_iter+0x315/0x490
[10051.062189] blkdev_read_iter+0x113/0x1b0
[10051.063015] aio_read+0x2ad/0x450
[10051.063686] io_submit_one+0xc8e/0x1d60
[10051.064467] __se_sys_io_submit+0x125/0x350
[10051.065318] do_syscall_64+0x2d/0x40
[10051.066082] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[10051.067455] Freed by task 13307:
[10051.068136] kasan_save_stack+0x19/0x40
[10051.068931] kasan_set_track+0x1c/0x30
[10051.069726] kasan_set_free_info+0x1b/0x30
[10051.070621] __kasan_slab_free+0x111/0x160
[10051.071480] kmem_cache_free+0x94/0x460
[10051.072256] mempool_free+0xd6/0x320
[10051.072985] bio_free+0xe0/0x130
[10051.073630] bio_put+0xab/0xe0
[10051.074252] bio_endio+0x3a6/0x5d0
[10051.074984] blk_update_request+0x590/0x1370
[10051.075870] scsi_end_request+0x7d/0x400
[10051.076667] scsi_io_completion+0x1aa/0xe50
[10051.077503] scsi_softirq_done+0x11b/0x240
[10051.078344] blk_mq_complete_request+0xd4/0x120
[10051.079275] scsi_mq_done+0xf0/0x200
[10051.080036] virtscsi_vq_done+0xbc/0x150
[10051.080850] vring_interrupt+0x179/0x390
[10051.081650] __handle_irq_event_percpu+0xf7/0x490
[10051.082626] handle_irq_event_percpu+0x7b/0x160
[10051.083527] handle_irq_event+0xcc/0x170
[10051.084297] handle_edge_irq+0x215/0xb20
[10051.085122] asm_call_irq_on_stack+0xf/0x20
[10051.085986] common_interrupt+0xae/0x120
[10051.086830] asm_common_interrupt+0x1e/0x40
==================================================================
Bio will be checked at beginning of submit_bio_noacct(). If bio needs
to be throttled, it will start the timer and stop submit bio directly.
Bio will submit in blk_throtl_dispatch_work_fn() when the timer expires.
But in the current process, if bio is throttled, it will still set bio
issue->value by blkcg_bio_issue_init(). This is redundant and may cause
the above use-after-free.
CPU0 CPU1
submit_bio
submit_bio_noacct
submit_bio_checks
blk_throtl_bio()
<=mod_timer(&sq->pending_timer
blk_throtl_dispatch_work_fn
submit_bio_noacct() <= bio have
throttle tag, will throw directly
and bio issue->value will be set
here
bio_endio()
bio_put()
bio_free() <= free this bio
blkcg_bio_issue_init(bio)
<= bio has been freed and
will lead to UAF
return BLK_QC_T_NONE
Fix this by remove extra blkcg_bio_issue_init.
Fixes: e439bedf6b (blkcg: consolidate bio_issue_init() to be a part of core)
Signed-off-by: Laibin Qiu <qiulaibin@huawei.com>
Link: https://lore.kernel.org/r/20211112093354.3581504-1-qiulaibin@huawei.com
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit 86399ea071099ec8ee0a83ac9ad67f7df96a50ad upstream.
When BLKRESETZONE ioctl and data read race, the data read leaves stale
page cache. The commit e511350590 ("block: Discard page cache of zone
reset target range") added page cache truncation to avoid stale page
cache after the ioctl. However, the stale page cache still can be read
during the reset zone operation for the ioctl. To avoid the stale page
cache completely, hold invalidate_lock of the block device file mapping.
Fixes: e511350590 ("block: Discard page cache of zone reset target range")
Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Cc: stable@vger.kernel.org # v5.15
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20211111085238.942492-1-shinichiro.kawasaki@wdc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 35e4c6c1a2fc2eb11b9306e95cda1fa06a511948 upstream.
When BLKZEROOUT ioctl and data read race, the data read leaves stale
page cache. To avoid the stale page cache, hold invalidate_lock of the
block device file mapping. The stale page cache is observed when
blktests test case block/009 is modified to call "blkdiscard -z" command
and repeated hundreds of times.
This patch can be applied back to the stable kernel version v5.15.y.
Rework is required for older stable kernels.
Fixes: 22dd6d3566 ("block: invalidate the page cache when issuing BLKZEROOUT")
Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Cc: stable@vger.kernel.org # v5.15
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20211109104723.835533-3-shinichiro.kawasaki@wdc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 7607c44c157d343223510c8ffdf7206fdd2a6213 upstream.
When BLKDISCARD ioctl and data read race, the data read leaves stale
page cache. To avoid the stale page cache, hold invalidate_lock of the
block device file mapping. The stale page cache is observed when
blktests test case block/009 is repeated hundreds of times.
This patch can be applied back to the stable kernel version v5.15.y
with slight patch edit. Rework is required for older stable kernels.
Fixes: 351499a172 ("block: Invalidate cache on discard v2")
Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Cc: stable@vger.kernel.org # v5.15
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20211109104723.835533-2-shinichiro.kawasaki@wdc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit fe7d064fa3faec5d8157029fb8720b4fddc9e1e8 ]
Commit 83cbce9574 ("block: add error handling for device_add_disk /
add_disk") added error handling to device_add_disk(), however the goto
label for the kobject_create_and_add() failure did not set the return
value correctly, and so we can end up in a situation where
kobject_create_and_add() fails but we report success.
Fixes: 83cbce9574 ("block: add error handling for device_add_disk / add_disk")
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211103164023.1384821-1-mcgrof@kernel.org
[axboe: fold in followup fix from Wu Bo <wubo40@huawei.com>]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 037057a5a979c7eeb2ee5d12cf4c24b805192c75 ]
This check is meant to catch cases where a requeue is attempted on a
request that is still inserted. It's never really been useful to catch any
misuse, and now it's actively wrong. Outside of that, this should not be a
BUG_ON() to begin with.
Remove the check as it's now causing active harm, as requeue off the plug
path will trigger it even though the request state is just fine.
Reported-by: Yi Zhang <yi.zhang@redhat.com>
Link: https://lore.kernel.org/linux-block/CAHj4cs80zAUc2grnCZ015-2Rvd-=gXRfB_dFKy=RTm+wRo09HQ@mail.gmail.com/
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit ba0ffdd8ce48ad7f7e85191cd29f9674caca3745 ]
Particularly for NVMe with efficient deferred submission for many
requests, there are nice benefits to be seen by bumping the default max
plug count from 16 to 32. This is especially true for virtualized setups,
where the submit part is more expensive. But can be noticed even on
native hardware.
Reduce the multiple queue factor from 4 to 2, since we're changing the
default size.
While changing it, move the defines into the block layer private header.
These aren't values that anyone outside of the block layer uses, or
should use.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Commit a33df75c63 ("block: use an xarray for disk->part_tbl") modified
the method to check partition existence in host-aware zoned block
devices from disk_has_partitions() helper function call to empty check
of xarray disk->part_tbl. However, disk->part_tbl always has single
entry for disk->part0 and never becomes empty. This resulted in the
host-aware zoned devices always judged to have partitions, and it made
the sysfs queue/zoned attribute to be "none" instead of "host-aware"
regardless of partition existence in the devices.
This also caused DEBUG_LOCKS_WARN_ON(lock->magic != lock) for
sdkp->rev_mutex in scsi layer when the kernel detects host-aware zoned
device. Since block layer handled the host-aware zoned devices as non-
zoned devices, scsi layer did not have chance to initialize the mutex
for zone revalidation. Therefore, the warning was triggered.
To fix the issues, call the helper function disk_has_partitions() in
place of disk->part_tbl empty check. Since the function was removed with
the commit a33df75c63, reimplement it to walk through entries in the
xarray disk->part_tbl.
Fixes: a33df75c63 ("block: use an xarray for disk->part_tbl")
Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Cc: stable@vger.kernel.org # v5.14+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211026060115.753746-1-shinichiro.kawasaki@wdc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When dispatching a zone append write request to a SCSI zoned block device,
if the target zone of the request is already locked, the device driver will
return BLK_STS_ZONE_RESOURCE and the request will be pushed back to the
hctx dipatch queue. The queue will be marked as RESTART in
dd_finish_request() and restarted in __blk_mq_free_request(). However, this
restart applies to the hctx of the completed request. If the requeued
request is on a different hctx, dispatch will no be retried until another
request is submitted or the next periodic queue run triggers, leading to up
to 30 seconds latency for the requeued request.
Fix this problem by scheduling a queue restart similarly to the
BLK_STS_RESOURCE case or when we cannot get the budget.
Also, consolidate the checks into the "need_resource" variable to simplify
the condition.
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Niklas Cassel <Niklas.Cassel@wdc.com>
Link: https://lore.kernel.org/r/20211026165127.4151055-1-naohiro.aota@wdc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Before removing disk from sysfs, userspace still may change queue via
sysfs, such as switching elevator or setting wbt latency, both may
reinitialize wbt, then the warning in blk_free_queue_stats() will be
triggered since rq_qos_exit() is moved to del_gendisk().
Fixes the issue by moving draining queue & tearing down after disk is
removed from sysfs, at that time no one can come into queue's
store()/show().
Reported-by: Yi Zhang <yi.zhang@redhat.com>
Tested-by: Yi Zhang <yi.zhang@redhat.com>
Fixes: 8e141f9eb8 ("block: drain file system I/O on del_gendisk")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20211026101204.2897166-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmFzfzwQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpqvoEACZB9dFKiYyFcv6X6ARAhfKDuE4ImaJ/hLI
VCt4M52P06nqywPQ7iyZMRWR/EVKW8ADDJiiAeXy3mDBgMqD6O27u894i1JP06sn
5gHcSvfH1b9PNFMUw04aI3iXXFUzQU7pwn5z6o2nXlA4onPVW7vwTp8fcHd41Kep
LqvZJbihW/iF9d0Wjs9LqPRBWXchtsVyxiNDBgC+kx5IYFn+oTnZOhlxw8ZiT/KH
0v8FIq9HY+5n6UP7InZF2gtIQUyDTR5L1zKKvJu5LDoHvcNlM6Ke0m3DVPcgP79D
2kKGyHOGfqC9Gr37qqOjgKqRO/Z/9SCvG39dmocAd/hh3AfUgKpDQs3HgLyx7ECT
aRAe5n0XbfIVcHX1XaOc8cGrszan9YhJvt/dMCmkjaG/3hASlzl2kV4QF3f5IVjx
oMgB1Kj8kyu6SqG8mCCjyGCxPpzNq8lVplJRlpifoz+ID/+hgt03aDoYVfPZkDRL
nf4VdQCRSl3ZEXkHy1j6l6Nb2UgNEZP1B3a/9onSyBJ/WYqSfFMXrx29PSirz7m7
x4jGOJvdqtNx09zjWHXc/d+I8BEXp4JDXe0GH0OHMiwCwz5PoMo99HRb+IuffKjR
lWl4EimH0bfzOA/3vFr5TigfqbnDJ7HCRrGsodQX8gJhVaxVTWxeZG+7Y9qkLqnD
JGlZeMQ37w==
=uhGw
-----END PGP SIGNATURE-----
Merge tag 'block-5.15-2021-10-22' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
"Fix for the cgroup code not ussing irq safe stats updates, and one fix
for an error handling condition in add_partition()"
* tag 'block-5.15-2021-10-22' of git://git.kernel.dk/linux-block:
block: fix incorrect references to disk objects
blk-cgroup: blk_cgroup_bio_start() should use irq-safe operations on blkg->iostat_cpu
When adding partitions to the disk, the reference count of the disk
object is increased. then alloc partition device and called
device_add(), if the device_add() return error, the reference
count of the disk object will be reduced twice, at put_device(pdev)
and put_disk(disk). this leads to the end of the object's life cycle
prematurely, and trigger following calltrace.
__init_work+0x2d/0x50 kernel/workqueue.c:519
synchronize_rcu_expedited+0x3af/0x650 kernel/rcu/tree_exp.h:847
bdi_remove_from_list mm/backing-dev.c:938 [inline]
bdi_unregister+0x17f/0x5c0 mm/backing-dev.c:946
release_bdi+0xa1/0xc0 mm/backing-dev.c:968
kref_put include/linux/kref.h:65 [inline]
bdi_put+0x72/0xa0 mm/backing-dev.c:976
bdev_free_inode+0x11e/0x220 block/bdev.c:408
i_callback+0x3f/0x70 fs/inode.c:226
rcu_do_batch kernel/rcu/tree.c:2508 [inline]
rcu_core+0x76d/0x16c0 kernel/rcu/tree.c:2743
__do_softirq+0x1d7/0x93b kernel/softirq.c:558
invoke_softirq kernel/softirq.c:432 [inline]
__irq_exit_rcu kernel/softirq.c:636 [inline]
irq_exit_rcu+0xf2/0x130 kernel/softirq.c:648
sysvec_apic_timer_interrupt+0x93/0xc0
making disk is NULL when calling put_disk().
Reported-by: Hao Sun <sunhao.th@gmail.com>
Signed-off-by: Zqiang <qiang.zhang1211@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211018103422.2043-1-qiang.zhang1211@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmFsIqAQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgppbBEACDewLUv7bg1VFIGdroRN51OGiOv1oV+8HP
ruY7O9CPtV7wcb3lA1Zy9igICuzuC5culHjbRJrNIUeTdWCQHCFk/sfKSD6VGMoT
cFTqpKxV7M3vYr9G2m5TFWgY2mfS+I5fxyDZxK2z2esHCFw6TZ7A5W13xScVXKP+
QdNFSlTrGkpggsSIEeHApG+NLsIecnkT4qzm8zPfUodUtQ3A8JMjQjnYUFEAWfWv
l9x9zDIzaGjPtXf5soFEvmdh1ALh3WWiYb1kIwK1FeP/PYX0JV/3zCMgqOwpK+4b
69OM3Q0NPHvu2TgSRK+ghekAtz5qgPDMCrzdhSgLYJEL/PGAOboqjrB9E+wWoEjd
IKrYLx4Xao2TUZLJF2y34hHfODGdasx7d+wS191UpVFEZHFhDhIaazZ2rDd5xnQK
LdzQw1JQF/igJovHauhSkGFIdJWBSDneLQoMimBnitZlsWARUmFSZej34FFRLZsW
8ZXfqipn/x+fh4sQ/HdEfWxnGHtveDpU+0Ka5bMUe/tJ9RPtmn/Ye7nFjYecC6NY
4UzFSNn+4e9DpHaDuP3I/eA1YBmVlcB5Hum3ve7X6ovwpjArYg3dgJOEi8uCZjfb
hdMANmkVptcPiEO9njEHhC7S8+Nm3t+8o3qQceN81j6Vcjgzt/Y/n3Z6UkKeSlkn
Ila+cZI1oA==
=J/e4
-----END PGP SIGNATURE-----
Merge tag 'block-5.15-2021-10-17' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
"Bigger than usual for this point in time, the majority is fixing some
issues around BDI lifetimes with the move from the request_queue to
the disk in this release. In detail:
- Series on draining fs IO for del_gendisk() (Christoph)
- NVMe pull request via Christoph:
- fix the abort command id (Keith Busch)
- nvme: fix per-namespace chardev deletion (Adam Manzanares)
- brd locking scope fix (Tetsuo)
- BFQ fix (Paolo)"
* tag 'block-5.15-2021-10-17' of git://git.kernel.dk/linux-block:
block, bfq: reset last_bfqq_created on group change
block: warn when putting the final reference on a registered disk
brd: reduce the brd_devices_mutex scope
kyber: avoid q->disk dereferences in trace points
block: keep q_usage_counter in atomic mode after del_gendisk
block: drain file system I/O on del_gendisk
block: split bio_queue_enter from blk_queue_enter
block: factor out a blk_try_enter_queue helper
block: call submit_bio_checks under q_usage_counter
nvme: fix per-namespace chardev deletion
block/rnbd-clt-sysfs: fix a couple uninitialized variable bugs
nvme-pci: Fix abort command id
Since commit 430a67f9d6 ("block, bfq: merge bursts of newly-created
queues"), BFQ maintains a per-group pointer to the last bfq_queue
created. If such a queue, say bfqq, happens to move to a different
group, then bfqq is no more a valid last bfq_queue created for its
previous group. That pointer must then be cleared. Not resetting such
a pointer may also cause UAF, if bfqq happens to also be freed after
being moved to a different group. This commit performs this missing
reset. As such it fixes commit 430a67f9d6 ("block, bfq: merge bursts
of newly-created queues").
Such a missing reset is most likely the cause of the crash reported in [1].
With some analysis, we found that this crash was due to the
above UAF. And such UAF did go away with this commit applied [1].
Anyway, before this commit, that crash happened to be triggered in
conjunction with commit 2d52c58b9c ("block, bfq: honor already-setup
queue merges"). The latter was then reverted by commit ebc69e897e
("Revert "block, bfq: honor already-setup queue merges""). Yet commit
2d52c58b9c ("block, bfq: honor already-setup queue merges") contains
no error related with the above UAF, and can then be restored.
[1] https://bugzilla.kernel.org/show_bug.cgi?id=214503
Fixes: 430a67f9d6 ("block, bfq: merge bursts of newly-created queues")
Tested-by: Grzegorz Kowal <custos.mentis@gmail.com>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Link: https://lore.kernel.org/r/20211015144336.45894-2-paolo.valente@linaro.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Warn when the last reference on a live disk is put without calling
del_gendisk first. There are some BDI related bug reports that look
like a case of this, so make sure we have the proper instrumentation
to catch it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211014130231.1468538-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
q->disk becomes invalid after the gendisk is removed. Work around this
by caching the dev_t for the tracepoints. The real fix would be to
properly tear down the I/O schedulers with the gendisk, but that is
a much more invasive change.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211012093301.GA27795@lst.de
Tested-by: Yi Zhang <yi.zhang@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Don't switch back to percpu mode to avoid the double RCU grace period
when tearing down SCSI devices. After removing the disk only passthrough
commands can be send anyway.
Suggested-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Darrick J. Wong <djwong@kernel.org>
Link: https://lore.kernel.org/r/20210929071241.934472-6-hch@lst.de
Tested-by: Yi Zhang <yi.zhang@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>