linux

Commit Graph

Author	SHA1	Message	Date
Bryan Gurney	72d7df4c80	dm dust: add limited write failure mode Add a limited write failure mode which allows a write to a block to fail a specified amount of times, prior to remapping. The "addbadblock" message is extended to allow specifying the limited number of times a write fails. Example: add bad block on block 60, with 5 write failures: dmsetup message 0 dust1 addbadblock 60 5 The write failure counter will be printed for newly added bad blocks. Signed-off-by: Bryan Gurney <bgurney@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2019-11-05 15:25:34 -05:00
Bryan Gurney	cc7a7fb3b6	dm dust: change ret to r in dust_map_read and dust_map In the dust_map_read() and dust_map() functions, change the return code variable "ret" to "r", to match the convention of the other device-mapper targets. Signed-off-by: Bryan Gurney <bgurney@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2019-11-05 14:56:08 -05:00
Bryan Gurney	6ec1be5015	dm dust: change result vars to r Change the "result" variables to "r" in dust_status() and dust_message(). Signed-off-by: Bryan Gurney <bgurney@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2019-11-05 14:55:31 -05:00
Mikulas Patocka	26b924b93c	dm cache: replace spin_lock_irqsave with spin_lock_irq If we are in a place where it is known that interrupts are enabled, functions spin_lock_irq/spin_unlock_irq should be used instead of spin_lock_irqsave/spin_unlock_irqrestore. spin_lock_irq and spin_unlock_irq are faster because they don't need to push and pop the flags register. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2019-11-05 14:53:04 -05:00
Mikulas Patocka	235bc86160	dm bio prison: replace spin_lock_irqsave with spin_lock_irq Replace spin_lock_irqsave/irqrestore with spin_lock_irq/spin_unlock_irq. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2019-11-05 14:53:03 -05:00
Mikulas Patocka	8e0c9dacc3	dm thin: replace spin_lock_irqsave with spin_lock_irq If we are in a place where it is known that interrupts are enabled, functions spin_lock_irq/spin_unlock_irq should be used instead of spin_lock_irqsave/spin_unlock_irqrestore. spin_lock_irq and spin_unlock_irq are faster because they don't need to push and pop the flags register. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2019-11-05 14:38:26 -05:00
Nikos Tsironis	52c67d416b	dm clone: add bucket_lock_irq/bucket_unlock_irq helpers Introduce bucket_lock_irq() and bucket_unlock_irq() helpers and use them in places where it is known that interrupts are enabled. Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2019-11-05 14:36:34 -05:00
Mikulas Patocka	6ca43ed837	dm clone: replace spin_lock_irqsave with spin_lock_irq If we are in a place where it is known that interrupts are enabled, functions spin_lock_irq/spin_unlock_irq should be used instead of spin_lock_irqsave/spin_unlock_irqrestore. spin_lock_irq and spin_unlock_irq are faster because they don't need to push and pop the flags register. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2019-11-05 14:35:42 -05:00
Maged Mokhtar	c1005322ff	dm writecache: handle REQ_FUA Call writecache_flush() on REQ_FUA in writecache_map(). Cc: stable@vger.kernel.org # 4.18+ Signed-off-by: Maged Mokhtar <mmokhtar@petasan.org> Acked-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2019-11-05 14:21:40 -05:00
Mikulas Patocka	8dd85873a0	dm writecache: fix uninitialized variable warning This fixes coverity warning CID 1454301. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2019-11-05 14:11:44 -05:00
Gustavo A. R. Silva	8adeac3be0	dm stripe: use struct_size() in kmalloc() One of the more common cases of allocation size calculations is finding the size of a structure that has a zero-sized array at the end, along with memory for some number of elements for that array. For example: struct stripe_c { ... struct stripe stripe[0]; }; In this case alloc_context() and dm_array_too_big() are removed and replaced by the direct use of the struct_size() helper in kmalloc(). Notice that open-coded form is prone to type mistakes. This code was detected with the help of Coccinelle. Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2019-11-05 14:09:59 -05:00
Heinz Mauelshagen	53be73a5d7	dm raid: streamline rs_get_progress() and its raid_status() caller side Pass already deciphered state into rs_get_progress, simplify recovery offset definition and combine two st_resync, st_reshape conditionals into one as is already the case with st_check and st_repair. Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2019-11-05 14:04:37 -05:00
Heinz Mauelshagen	f9f3ee9130	dm raid: simplify rs_setup_recovery call chain rs_setup_recovery() sets the starting recovery offset. Drop superfluous rs_setup_recovery() and replace with __rs_setup_recovery(). Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2019-11-05 14:02:52 -05:00
Heinz Mauelshagen	99273d9e6e	dm raid: to ensure resynchronization, perform raid set grow in preresume This fixes a flaw causing raid set extensions not to be synchronized in case the MD bitmap resize required additional pages to be allocated. Also share resize code in the raid constructor between new size changes and those occuring during recovery. Bump the target version to define the change and document it in Documentation/admin-guide/device-mapper/dm-raid.rst. Reported-by: Steve D <steved424@gmail.com> Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2019-11-05 14:02:26 -05:00
Heinz Mauelshagen	22c992e1a8	dm raid: change rs_set_dev_and_array_sectors API and callers Add a size argument to rs_set_dev_and_array_sectors as prerequisite to fixing grown device resynchronization not occuring when new MD bitmap pages have to be allocated as a result of the extension in a follwup patch. Also avoid code duplication by using rs_set_rdev_sectors in the aforementioned function. Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2019-11-05 14:02:00 -05:00
Mike Snitzer	6ba01df72b	dm table: do not allow request-based DM to stack on partitions Partitioned request-based devices cannot be used as underlying devices for request-based DM because no partition offsets are added to each incoming request. As such, until now, stacking on partitioned devices would _always_ result in data corruption (e.g. wiping the partition table, writing to other partitions, etc). Fix this by disallowing request-based stacking on partitions. While at it, since all .request_fn support has been removed from block core, remove legacy dm-table code that differentiated between blk-mq and .request_fn request-based. Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2019-11-05 11:22:52 -05:00
Yufen Yu	6a5cb53aaa	md: no longer compare spare disk superblock events in super_load We have a test case as follow: mdadm -CR /dev/md1 -l 1 -n 4 /dev/sd[a-d] \ --assume-clean --bitmap=internal mdadm -S /dev/md1 mdadm -A /dev/md1 /dev/sd[b-c] --run --force mdadm --zero /dev/sda mdadm /dev/md1 -a /dev/sda echo offline > /sys/block/sdc/device/state echo offline > /sys/block/sdb/device/state sleep 5 mdadm -S /dev/md1 echo running > /sys/block/sdb/device/state echo running > /sys/block/sdc/device/state mdadm -A /dev/md1 /dev/sd[a-c] --run --force When we readd /dev/sda to the array, it started to do recovery. After offline the other two disks in md1, the recovery have been interrupted and superblock update info cannot be written to the offline disks. While the spare disk (/dev/sda) can continue to update superblock info. After stopping the array and assemble it, we found the array run fail, with the follow kernel message: [ 172.986064] md: kicking non-fresh sdb from array! [ 173.004210] md: kicking non-fresh sdc from array! [ 173.022383] md/raid1:md1: active with 0 out of 4 mirrors [ 173.022406] md1: failed to create bitmap (-5) [ 173.023466] md: md1 stopped. Since both sdb and sdc have the value of 'sb->events' smaller than that in sda, they have been kicked from the array. However, the only remained disk sda is in 'spare' state before stop and it cannot be added to conf->mirrors[] array. In the end, raid array assemble and run fail. In fact, we can use the older disk sdb or sdc to assemble the array. That means we should not choose the 'spare' disk as the fresh disk in analyze_sbs(). To fix the problem, we do not compare superblock events when it is a spare disk, as same as validate_super. Signed-off-by: Yufen Yu <yuyufen@huawei.com> Signed-off-by: Song Liu <songliubraving@fb.com>	2019-10-24 15:22:40 -07:00
David Jeffery	775d78319f	md: improve handling of bio with REQ_PREFLUSH in md_flush_request() If pers->make_request fails in md_flush_request(), the bio is lost. To fix this, pass back a bool to indicate if the original make_request call should continue to handle the I/O and instead of assuming the flush logic will push it to completion. Convert md_flush_request to return a bool and no longer calls the raid driver's make_request function. If the return is true, then the md flush logic has or will complete the bio and the md make_request call is done. If false, then the md make_request function needs to keep processing like it is a normal bio. Let the original call to md_handle_request handle any need to retry sending the bio to the raid driver's make_request function should it be needed. Also mark md_flush_request and the make_request function pointer as __must_check to issue warnings should these critical return values be ignored. Fixes: `2bc13b83e6` ("md: batch flush requests.") Cc: stable@vger.kernel.org # # v4.19+ Cc: NeilBrown <neilb@suse.com> Signed-off-by: David Jeffery <djeffery@redhat.com> Reviewed-by: Xiao Ni <xni@redhat.com> Signed-off-by: Song Liu <songliubraving@fb.com>	2019-10-24 15:22:40 -07:00
Guoqing Jiang	fadcbd2901	md/bitmap: avoid race window between md_bitmap_resize and bitmap_file_clear_bit We need to move "spin_lock_irq(&bitmap->counts.lock)" before unmap previous storage, otherwise panic like belows could happen as follows. [ 902.353802] sdl: detected capacity change from 1077936128 to 3221225472 [ 902.616948] general protection fault: 0000 [#1] SMP [snip] [ 902.618588] CPU: 12 PID: 33698 Comm: md0_raid1 Tainted: G O 4.14.144-1-pserver #4.14.144-1.1~deb10 [ 902.618870] Hardware name: Supermicro SBA-7142G-T4/BHQGE, BIOS 3.00 10/24/2012 [ 902.619120] task: ffff9ae1860fc600 task.stack: ffffb52e4c704000 [ 902.619301] RIP: 0010:bitmap_file_clear_bit+0x90/0xd0 [md_mod] [ 902.619464] RSP: 0018:ffffb52e4c707d28 EFLAGS: 00010087 [ 902.619626] RAX: ffe8008b0d061000 RBX: ffff9ad078c87300 RCX: 0000000000000000 [ 902.619792] RDX: ffff9ad986341868 RSI: 0000000000000803 RDI: ffff9ad078c87300 [ 902.619986] RBP: ffff9ad0ed7a8000 R08: 0000000000000000 R09: 0000000000000000 [ 902.620154] R10: ffffb52e4c707ec0 R11: ffff9ad987d1ed44 R12: ffff9ad0ed7a8360 [ 902.620320] R13: 0000000000000003 R14: 0000000000060000 R15: 0000000000000800 [ 902.620487] FS: 0000000000000000(0000) GS:ffff9ad987d00000(0000) knlGS:0000000000000000 [ 902.620738] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 902.620901] CR2: 000055ff12aecec0 CR3: 0000001005207000 CR4: 00000000000406e0 [ 902.621068] Call Trace: [ 902.621256] bitmap_daemon_work+0x2dd/0x360 [md_mod] [ 902.621429] ? find_pers+0x70/0x70 [md_mod] [ 902.621597] md_check_recovery+0x51/0x540 [md_mod] [ 902.621762] raid1d+0x5c/0xeb0 [raid1] [ 902.621939] ? try_to_del_timer_sync+0x4d/0x80 [ 902.622102] ? del_timer_sync+0x35/0x40 [ 902.622265] ? schedule_timeout+0x177/0x360 [ 902.622453] ? call_timer_fn+0x130/0x130 [ 902.622623] ? find_pers+0x70/0x70 [md_mod] [ 902.622794] ? md_thread+0x94/0x150 [md_mod] [ 902.622959] md_thread+0x94/0x150 [md_mod] [ 902.623121] ? wait_woken+0x80/0x80 [ 902.623280] kthread+0x119/0x130 [ 902.623437] ? kthread_create_on_node+0x60/0x60 [ 902.623600] ret_from_fork+0x22/0x40 [ 902.624225] RIP: bitmap_file_clear_bit+0x90/0xd0 [md_mod] RSP: ffffb52e4c707d28 Because mdadm was running on another cpu to do resize, so bitmap_resize was called to replace bitmap as below shows. PID: 38801 TASK: ffff9ad074a90e00 CPU: 0 COMMAND: "mdadm" [exception RIP: queued_spin_lock_slowpath+56] [snip] -- <NMI exception stack> -- #5 [ffffb52e60f17c58] queued_spin_lock_slowpath at ffffffff9c0b27b8 #6 [ffffb52e60f17c58] bitmap_resize at ffffffffc0399877 [md_mod] #7 [ffffb52e60f17d30] raid1_resize at ffffffffc0285bf9 [raid1] #8 [ffffb52e60f17d50] update_size at ffffffffc038a31a [md_mod] #9 [ffffb52e60f17d70] md_ioctl at ffffffffc0395ca4 [md_mod] And the procedure to keep resize bitmap safe is allocate new storage space, then quiesce, copy bits, replace bitmap, and re-start. However the daemon (bitmap_daemon_work) could happen even the array is quiesced, which means when bitmap_file_clear_bit is triggered by raid1d, then it thinks it should be fine to access store->filemap since counts->lock is held, but resize could change the storage without the protection of the lock. Cc: Jack Wang <jinpu.wang@cloud.ionos.com> Cc: NeilBrown <neilb@suse.com> Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com> Signed-off-by: Song Liu <songliubraving@fb.com>	2019-10-24 15:22:40 -07:00
Dan Carpenter	e3fc3f3d09	md/raid0: Fix an error message in raid0_make_request() The first argument to WARN() is supposed to be a condition. The original code will just print the mdname() instead of the full warning message. Fixes: `c84a1372df` ("md/raid0: avoid RAID0 data corruption due to layout confusion.") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Song Liu <songliubraving@fb.com>	2019-10-24 15:22:40 -07:00
Linus Torvalds	d418d07005	for-linus-2019-10-18 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl2qbF0QHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgptsuEADEKL8pta74uy50pl0t8l9fZ++U+wdIeEIW 9uumpOEPnI2GpkG1sOyKWK6tl8InQLw6pAquP9MoT2BHXqFHk7NIgtvk67lwQeoc dRwklVfvOLAdnKzyfODqE9Fh9BgczZIuOLzgdtNqrPKqgJfFRCwN94Kj/r2tYuy7 v+riK3A49u12dOLtjU6ciNgZ0m1iUX9s0+PFYVUXtJHU/1OYToQaKP+sgWiue0Ca VJP/L4MLYD0a7tfd92WAK7xWLsYWTDw1Gg20hXH/tV+IIDQ5+OXhu2s6PuqI7c0y cZqWHQHBDkZMQvT8+V+YqZtEa+xwVCom51prJEPasmdq3fGx+2sDC1HQiySao1ML wfFxZvFvY9fm6M7p2xsSNEcOmamrx1aLLyNSbjIvAqLUDYJWWS56BHsKyTU5Z+Jp RA9dpq8iR6ISaIAcFf0IB0pJSv1HEeHyo/ixlALqezBFJaMdhWy/M+dEbWKtix9M s19ozcpe+omN9+O0anlLtzKNgj2Xnjiwuu8mhVcqn6uG/p6GUOup+lNvTW/fig3I JBH8kObjYXL181V9rYVqFutnuqcf2HYqMvV2vzAmg4LYnPVUmU7HMj8zEpxc4N+f Evd77j0wXmY9S+4JERxaqQZuvKBEIkvM1rkk3N4NbNghfa7QL4aW+I9cWtuelPC2 E+DK7if0Gg== =rvkw -----END PGP SIGNATURE----- Merge tag 'for-linus-2019-10-18' of git://git.kernel.dk/linux-block Pull block fixes from Jens Axboe: - NVMe pull request from Keith that address deadlocks, double resets, memory leaks, and other regression. - Fixup elv_support_iosched() for bio based devices (Damien) - Fixup for the ahci PCS quirk (Dan) - Socket O_NONBLOCK handling fix for io_uring (me) - Timeout sequence io_uring fixes (yangerkun) - MD warning fix for parameter default_layout (Song) - blkcg activation fixes (Tejun) - blk-rq-qos node deletion fix (Tejun) * tag 'for-linus-2019-10-18' of git://git.kernel.dk/linux-block: nvme-pci: Set the prp2 correctly when using more than 4k page io_uring: fix logic error in io_timeout io_uring: fix up O_NONBLOCK handling for sockets md/raid0: fix warning message for parameter default_layout libata/ahci: Fix PCS quirk application blk-rq-qos: fix first node deletion of rq_qos_del() blkcg: Fix multiple bugs in blkcg_activate_policy() io_uring: consider the overflow of sequence for timeout req nvme-tcp: fix possible leakage during error flow nvmet-loop: fix possible leakage during error flow block: Fix elv_support_iosched() nvme-tcp: Initialize sk->sk_ll_usec only with NET_RX_BUSY_POLL nvme: Wait for reset state when required nvme: Prevent resets during paused controller state nvme: Restart request timers in resetting state nvme: Remove ADMIN_ONLY state nvme-pci: Free tagset if no IO queues nvme: retain split access workaround for capability reads nvme: fix possible deadlock when nvme_update_formats fails	2019-10-18 22:29:36 -04:00
Mikulas Patocka	13bd677a47	dm cache: fix bugs when a GFP_NOWAIT allocation fails GFP_NOWAIT allocation can fail anytime - it doesn't wait for memory being available and it fails if the mempool is exhausted and there is not enough memory. If we go down this path: map_bio -> mg_start -> alloc_migration -> mempool_alloc(GFP_NOWAIT) we can see that map_bio() doesn't check the return value of mg_start(), and the bio is leaked. If we go down this path: map_bio -> mg_start -> mg_lock_writes -> alloc_prison_cell -> dm_bio_prison_alloc_cell_v2 -> mempool_alloc(GFP_NOWAIT) -> mg_lock_writes -> mg_complete the bio is ended with an error - it is unacceptable because it could cause filesystem corruption if the machine ran out of memory temporarily. Change GFP_NOWAIT to GFP_NOIO, so that the mempool code will properly wait until memory becomes available. mempool_alloc with GFP_NOIO can't fail, so remove the code paths that deal with allocation failure. Cc: stable@vger.kernel.org Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2019-10-17 11:13:50 -04:00
Song Liu	3874d73e06	md/raid0: fix warning message for parameter default_layout The message should match the parameter, i.e. raid0.default_layout. Fixes: `c84a1372df` ("md/raid0: avoid RAID0 data corruption due to layout confusion.") Cc: NeilBrown <neilb@suse.de> Reported-by: Ivan Topolsky <doktor.yak@gmail.com> Signed-off-by: Song Liu <songliubraving@fb.com>	2019-10-16 09:43:02 -07:00
Mikulas Patocka	b21555786f	dm snapshot: rework COW throttling to fix deadlock Commit `721b1d98fb` ("dm snapshot: Fix excessive memory usage and workqueue stalls") introduced a semaphore to limit the maximum number of in-flight kcopyd (COW) jobs. The implementation of this throttling mechanism is prone to a deadlock: 1. One or more threads write to the origin device causing COW, which is performed by kcopyd. 2. At some point some of these threads might reach the s->cow_count semaphore limit and block in down(&s->cow_count), holding a read lock on _origins_lock. 3. Someone tries to acquire a write lock on _origins_lock, e.g., snapshot_ctr(), which blocks because the threads at step (2) already hold a read lock on it. 4. A COW operation completes and kcopyd runs dm-snapshot's completion callback, which ends up calling pending_complete(). pending_complete() tries to resubmit any deferred origin bios. This requires acquiring a read lock on _origins_lock, which blocks. This happens because the read-write semaphore implementation gives priority to writers, meaning that as soon as a writer tries to enter the critical section, no readers will be allowed in, until all writers have completed their work. So, pending_complete() waits for the writer at step (3) to acquire and release the lock. This writer waits for the readers at step (2) to release the read lock and those readers wait for pending_complete() (the kcopyd thread) to signal the s->cow_count semaphore: DEADLOCK. The above was thoroughly analyzed and documented by Nikos Tsironis as part of his initial proposal for fixing this deadlock, see: https://www.redhat.com/archives/dm-devel/2019-October/msg00001.html Fix this deadlock by reworking COW throttling so that it waits without holding any locks. Add a variable 'in_progress' that counts how many kcopyd jobs are running. A function wait_for_in_progress() will sleep if 'in_progress' is over the limit. It drops _origins_lock in order to avoid the deadlock. Reported-by: Guruswamy Basavaiah <guru2018@gmail.com> Reported-by: Nikos Tsironis <ntsironis@arrikto.com> Reviewed-by: Nikos Tsironis <ntsironis@arrikto.com> Tested-by: Nikos Tsironis <ntsironis@arrikto.com> Fixes: `721b1d98fb` ("dm snapshot: Fix excessive memory usage and workqueue stalls") Cc: stable@vger.kernel.org # v5.0+ Depends-on: 4a3f111a73a8c ("dm snapshot: introduce account_start_copy() and account_end_copy()") Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2019-10-10 09:46:05 -04:00
Mikulas Patocka	a2f83e8b0c	dm snapshot: introduce account_start_copy() and account_end_copy() This simple refactoring moves code for modifying the semaphore cow_count into separate functions to prepare for changes that will extend these methods to provide for a more sophisticated mechanism for COW throttling. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Reviewed-by: Nikos Tsironis <ntsironis@arrikto.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2019-10-10 09:45:48 -04:00
YueHaibing	0a005856d3	dm clone: Make __hash_find static drivers/md/dm-clone-target.c:594:34: warning: symbol '__hash_find' was not declared. Should it be static? Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: YueHaibing <yuehaibing@huawei.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2019-10-08 14:04:54 -04:00
Linus Torvalds	2e959dd87a	for-5.4/post-2019-09-24 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl2J8xQQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpujgD/94s9GGKN8JShxCpT0YNuWyyFF5gNlaimQU RSGAwnv2YUgEGNSUOPpcaj5FAYhTfYzbqoHlE+jytA2U5KXTOhc5Z85QV+TY4HPs I03xczYuYD/uX0QuF00zU2+6eV3lETELPiBARbfEQdHfm72iwurweHzlh4dfhbxW P7UA/cKixXWF2CH9wg5347Ll93nD24f2pi8BUyLJi/xpdlaRrN11Ii8AzNlRmq52 VRxURuogl98W89F6EV2VhPGFgUEYHY2Ot7II2OqqV+jmjHDQW9y5hximzINOqkxs bQwo5J+WrDSPoqwl8+db2k7QQjAl1XKDAHmCwz+7J/BoOgZj8/M1FMBwzita+5x+ UqxEYe7k+2G3w2zuhBrq03BypU8pwqFep/QI0cCCPaHs4J5QnkVOScEqd6iV/C3T FPvMvqDf7MrElghj4Qa2IZlh/CgqmLG5NUEz8E40cXkdiP+E+eK9ZY2Uwx2XhBrm 7Gl+SpG5DxWqqJeRNVWjFwM4p5L+01NtwDbTjZ1rsf+mCW5cNsy/L9B4UpPz4HxW coAs0y/Ce+ZhCopIXZ4jLDBoTG9yoVg8EcyfaHKD2Zz0mUFxa2xm+LvXKeT49qqx xuodpKD3fiuM7h9Xgv+cDsmn8Rr8gSeXEGV7qzpudmkxbp6IVg/yG5hC/dM921GR EVrRtUIwdw== =aAPP -----END PGP SIGNATURE----- Merge tag 'for-5.4/post-2019-09-24' of git://git.kernel.dk/linux-block Pull more block updates from Jens Axboe: "Some later additions that weren't quite done for the first pull request, and also a few fixes that have arrived since. This contains: - Kill silly pktcdvd warning on attempting to register a non-scsi passthrough device (me) - Use symbolic constants for the block t10 protection types, and switch to handling it in core rather than in the drivers (Max) - libahci platform missing node put fix (Nishka) - Small series of fixes for BFQ (Paolo) - Fix possible nbd crash (Xiubo)" * tag 'for-5.4/post-2019-09-24' of git://git.kernel.dk/linux-block: block: drop device references in bsg_queue_rq() block: t10-pi: fix -Wswitch warning pktcdvd: remove warning on attempting to register non-passthrough dev ata: libahci_platform: Add of_node_put() before loop exit nbd: fix possible page fault for nbd disk nbd: rename the runtime flags as NBD_RT_ prefixed block, bfq: push up injection only after setting service time block, bfq: increase update frequency of inject limit block, bfq: reduce upper bound for inject limit to max_rq_in_driver+1 block, bfq: update inject limit only after injection occurred block: centralize PI remapping logic to the block layer block: use symbolic constants for t10_pi type	2019-09-24 16:31:50 -07:00
Linus Torvalds	3e414b5bd2	- crypto and DM crypt advances that allow the crypto API to reclaim implementation details that do not belong in DM crypt. The wrapper template for ESSIV generation that was factored out will also be used by fscrypt in the future. - Add root hash pkcs#7 signature verification to the DM verity target. - Add a new "clone" DM target that allows for efficient remote replication of a device. - Enhance DM bufio's cache to be tailored to each client based on use. Clients that make heavy use of the cache get more of it, and those that use less have reduced cache usage. - Add a new DM_GET_TARGET_VERSION ioctl to allow userspace to query the version number of a DM target (even if the associated module isn't yet loaded). - Fix invalid memory access in DM zoned target. - Fix the max_discard_sectors limit advertised by the DM raid target; it was mistakenly storing the limit in bytes rather than sectors. - Small optimizations and cleanups in DM writecache target. - Various fixes and cleanups in DM core, DM raid1 and space map portion of DM persistent data library. -----BEGIN PGP SIGNATURE----- iQFHBAABCAAxFiEEJfWUX4UqZ4x1O2wixSPxCi2dA1oFAl2D7ycTHHNuaXR6ZXJA cmVkaGF0LmNvbQAKCRDFI/EKLZ0DWp9QCACwTkVGzPGMCbAaCVlCACo8B5JyY4OO FNxucqUlt1MHKuBbzJd4XwNGlLg68xjMUKVPYPlgina7TaDl+wvlTbHchaJS8nak x1zyhDSywy0F9f6HHiXJi/vshmAfa0xnIM6fQXVPM346S6xf9u7hqOJQMCrdvY92 w4FhuW9nVt5xizo8iC/3LzoWbhrWncT7dyZUZtG3/tmglhkEK7QwctlgQxcD7tXg H1lhntQzHzpxQAVBefWWdw7ubuDd6XCHuQMaxRhyR++c62P3eKDR8ck9hhd3hZKv E481gtxcsjKuYLxwULjqFJZaNFitWFNMJ7gppQyKRqCzn2zlGAL6npl8 =m6zD -----END PGP SIGNATURE----- Merge tag 'for-5.4/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper updates from Mike Snitzer: - crypto and DM crypt advances that allow the crypto API to reclaim implementation details that do not belong in DM crypt. The wrapper template for ESSIV generation that was factored out will also be used by fscrypt in the future. - Add root hash pkcs#7 signature verification to the DM verity target. - Add a new "clone" DM target that allows for efficient remote replication of a device. - Enhance DM bufio's cache to be tailored to each client based on use. Clients that make heavy use of the cache get more of it, and those that use less have reduced cache usage. - Add a new DM_GET_TARGET_VERSION ioctl to allow userspace to query the version number of a DM target (even if the associated module isn't yet loaded). - Fix invalid memory access in DM zoned target. - Fix the max_discard_sectors limit advertised by the DM raid target; it was mistakenly storing the limit in bytes rather than sectors. - Small optimizations and cleanups in DM writecache target. - Various fixes and cleanups in DM core, DM raid1 and space map portion of DM persistent data library. * tag 'for-5.4/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (22 commits) dm: introduce DM_GET_TARGET_VERSION dm bufio: introduce a global cache replacement dm bufio: remove old-style buffer cleanup dm bufio: introduce a global queue dm bufio: refactor adjust_total_allocated dm bufio: call adjust_total_allocated from __link_buffer and __unlink_buffer dm: add clone target dm raid: fix updating of max_discard_sectors limit dm writecache: skip writecache_wait for pmem mode dm stats: use struct_size() helper dm crypt: omit parsing of the encapsulated cipher dm crypt: switch to ESSIV crypto API template crypto: essiv - create wrapper template for ESSIV generation dm space map common: remove check for impossible sm_find_free() return value dm raid1: use struct_size() with kzalloc() dm writecache: optimize performance by sorting the blocks for writeback_all dm writecache: add unlikely for getting two block with same LBA dm writecache: remove unused member pointer in writeback_struct dm zoned: fix invalid memory access dm verity: add root hash pkcs#7 signature verification ...	2019-09-21 10:40:37 -07:00
Max Gurtovoy	54d4e6ab91	block: centralize PI remapping logic to the block layer Currently t10_pi_prepare/t10_pi_complete functions are called during the NVMe and SCSi layers command preparetion/completion, but their actual place should be the block layer since T10-PI is a general data integrity feature that is used by block storage protocols. Introduce .prepare_fn and .complete_fn callbacks within the integrity profile that each type can implement according to its needs. Suggested-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Suggested-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Max Gurtovoy <maxg@mellanox.com> Fixed to not call queue integrity functions if BLK_DEV_INTEGRITY isn't defined in the config. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-09-17 20:03:49 -06:00
Linus Torvalds	7ad67ca553	for-5.4/block-2019-09-16 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl1/no0QHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpmo9EACFXMbdNmEEUMyRSdOkVLlr7ZlTyQi1tLpB YESDPxdBfybzpi0qa8JSaysGIfvSkSjmSAqBqrWPmASOSOL6CK4bbA4fTYbgPplk XeHUdgGiG34oCQUn8Xil5reYaTm7I6LQWnWTpVa5fIhAyUYaGJL+987ykoGmpQmB Dvf3YSc+8H0RTp9PCMVd6UCGPkZbVlLImGad3PF5ULvTEaE4RCXC2aiAgh0p1l5A J2CkRZ+/mio3zN2O4YN7VdPGfr1Wo1iZ834xbIGLegv1miHXagFk7jwTcC7zIt5t oSnJnqIg3iCe7SpWt4Bkzw/zy/2UqaspifbCMgw8vychlViVRUHFO5h85Yboo7kQ OMLEQPcwjm6dTHv5h1iXF9LW1O7NoiYmmgvApU9uOo1HUrl1X7PZ3JEfUsVHxkOO T4D5igf0Krsl1eAbiwEUQzy7vFZ8PlRHqrHgK+fkyotzHu1BJR7OQkYygEfGFOB/ EfMxplGDpmibYGuWCwDX2bPAmLV3SPUQENReHrfPJRDt5TD1UkFpVGv/PLLhbr0p cLYI78DKpDSigBpVMmwq5nTYpnex33eyDTTA8C0sakcsdzdmU5qv30y3wm4nTiep f6gZo6IMXwRg/rCgVVrd9SKQAr/8wEzVlsDW3qyi2pVT8sHIgm0tFv7paihXGdDV xsKgmTrQQQ== =Qt+h -----END PGP SIGNATURE----- Merge tag 'for-5.4/block-2019-09-16' of git://git.kernel.dk/linux-block Pull block updates from Jens Axboe: - Two NVMe pull requests: - ana log parse fix from Anton - nvme quirks support for Apple devices from Ben - fix missing bio completion tracing for multipath stack devices from Hannes and Mikhail - IP TOS settings for nvme rdma and tcp transports from Israel - rq_dma_dir cleanups from Israel - tracing for Get LBA Status command from Minwoo - Some nvme-tcp cleanups from Minwoo, Potnuri and Myself - Some consolidation between the fabrics transports for handling the CAP register - reset race with ns scanning fix for fabrics (move fabrics commands to a dedicated request queue with a different lifetime from the admin request queue)." - controller reset and namespace scan races fixes - nvme discovery log change uevent support - naming improvements from Keith - multiple discovery controllers reject fix from James - some regular cleanups from various people - Series fixing (and re-fixing) null_blk debug printing and nr_devices checks (André) - A few pull requests from Song, with fixes from Andy, Guoqing, Guilherme, Neil, Nigel, and Yufen. - REQ_OP_ZONE_RESET_ALL support (Chaitanya) - Bio merge handling unification (Christoph) - Pick default elevator correctly for devices with special needs (Damien) - Block stats fixes (Hou) - Timeout and support devices nbd fixes (Mike) - Series fixing races around elevator switching and device add/remove (Ming) - sed-opal cleanups (Revanth) - Per device weight support for BFQ (Fam) - Support for blk-iocost, a new model that can properly account cost of IO workloads. (Tejun) - blk-cgroup writeback fixes (Tejun) - paride queue init fixes (zhengbin) - blk_set_runtime_active() cleanup (Stanley) - Block segment mapping optimizations (Bart) - lightnvm fixes (Hans/Minwoo/YueHaibing) - Various little fixes and cleanups * tag 'for-5.4/block-2019-09-16' of git://git.kernel.dk/linux-block: (186 commits) null_blk: format pr_* logs with pr_fmt null_blk: match the type of parameter nr_devices null_blk: do not fail the module load with zero devices block: also check RQF_STATS in blk_mq_need_time_stamp() block: make rq sector size accessible for block stats bfq: Fix bfq linkage error raid5: use bio_end_sector in r5_next_bio raid5: remove STRIPE_OPS_REQ_PENDING md: add feature flag MD_FEATURE_RAID0_LAYOUT md/raid0: avoid RAID0 data corruption due to layout confusion. raid5: don't set STRIPE_HANDLE to stripe which is in batch list raid5: don't increment read_errors on EILSEQ return nvmet: fix a wrong error status returned in error log page nvme: send discovery log page change events to userspace nvme: add uevent variables for controller devices nvme: enable aen regardless of the presence of I/O queues nvme-fabrics: allow discovery subsystems accept a kato nvmet: Use PTR_ERR_OR_ZERO() in nvmet_init_discovery() nvme: Remove redundant assignment of cq vector nvme: Assign subsys instance from first ctrl ...	2019-09-17 16:57:47 -07:00
Mikulas Patocka	afa179eb60	dm: introduce DM_GET_TARGET_VERSION This commit introduces a new ioctl DM_GET_TARGET_VERSION. It will load a target that is specified in the "name" entry in the parameter structure and return its version. This functionality is intended to be used by cryptsetup, so that it can query kernel capabilities before activating the device. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2019-09-16 10:18:01 -04:00
Mikulas Patocka	6e913b28cd	dm bufio: introduce a global cache replacement This commit introduces a global cache replacement (instead of per-client cleanup). If one bufio client uses the cache heavily and another client is not using it, we want to let the first client use most of the cache. The old algorithm would partition the cache equally betwen the clients and that is sub-optimal. For cache replacement, we use the clock algorithm because it doesn't require taking any lock when the buffer is accessed. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2019-09-13 17:00:21 -04:00
Guoqing Jiang	067df25c83	raid5: use bio_end_sector in r5_next_bio Actually, we calculate bio's end sector here, so use the common way for the purpose. Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com> Signed-off-by: Song Liu <songliubraving@fb.com>	2019-09-13 13:14:43 -07:00
Guoqing Jiang	feb9bf9849	raid5: remove STRIPE_OPS_REQ_PENDING This stripe state is not used anymore after commit `51acbcec6c` ("md: remove CONFIG_MULTICORE_RAID456"), so remove the obsoleted state. gjiang@nb01257:~/md$ grep STRIPE_OPS_REQ_PENDING drivers/md/ -r drivers/md/raid5.c: (1 << STRIPE_OPS_REQ_PENDING) \| drivers/md/raid5.h: STRIPE_OPS_REQ_PENDING, Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com> Signed-off-by: Song Liu <songliubraving@fb.com>	2019-09-13 13:14:39 -07:00
NeilBrown	33f2c35a54	md: add feature flag MD_FEATURE_RAID0_LAYOUT Due to a bug introduced in Linux 3.14 we cannot determine the correctly layout for a multi-zone RAID0 array - there are two possibilities. It is possible to tell the kernel which to chose using a module parameter, but this can be clumsy to use. It would be best if the choice were recorded in the metadata. So add a feature flag for this purpose. If it is set, then the 'layout' field of the superblock is used to determine which layout to use. If this flag is not set, then mddev->layout gets set to -1, which causes the module parameter to be required. Acked-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com> Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Song Liu <songliubraving@fb.com>	2019-09-13 13:10:06 -07:00
NeilBrown	c84a1372df	md/raid0: avoid RAID0 data corruption due to layout confusion. If the drives in a RAID0 are not all the same size, the array is divided into zones. The first zone covers all drives, to the size of the smallest. The second zone covers all drives larger than the smallest, up to the size of the second smallest - etc. A change in Linux 3.14 unintentionally changed the layout for the second and subsequent zones. All the correct data is still stored, but each chunk may be assigned to a different device than in pre-3.14 kernels. This can lead to data corruption. It is not possible to determine what layout to use - it depends which kernel the data was written by. So we add a module parameter to allow the old (0) or new (1) layout to be specified, and refused to assemble an affected array if that parameter is not set. Fixes: `20d0189b10` ("block: Introduce new bio_split()") cc: stable@vger.kernel.org (3.14+) Acked-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com> Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Song Liu <songliubraving@fb.com>	2019-09-13 13:10:05 -07:00
Guoqing Jiang	6ce220dd2f	raid5: don't set STRIPE_HANDLE to stripe which is in batch list If stripe in batch list is set with STRIPE_HANDLE flag, then the stripe could be set with STRIPE_ACTIVE by the handle_stripe function. And if error happens to the batch_head at the same time, break_stripe_batch_list is called, then below warning could happen (the same report in [1]), it means a member of batch list was set with STRIPE_ACTIVE. [7028915.431770] stripe state: 2001 [7028915.431815] ------------[ cut here ]------------ [7028915.431828] WARNING: CPU: 18 PID: 29089 at drivers/md/raid5.c:4614 break_stripe_batch_list+0x203/0x240 [raid456] [...] [7028915.431879] CPU: 18 PID: 29089 Comm: kworker/u82:5 Tainted: G O 4.14.86-1-storage #4.14.86-1.2~deb9 [7028915.431881] Hardware name: Supermicro SSG-2028R-ACR24L/X10DRH-iT, BIOS 3.1 06/18/2018 [7028915.431888] Workqueue: raid5wq raid5_do_work [raid456] [7028915.431890] task: ffff9ab0ef36d7c0 task.stack: ffffb72926f84000 [7028915.431896] RIP: 0010:break_stripe_batch_list+0x203/0x240 [raid456] [7028915.431898] RSP: 0018:ffffb72926f87ba8 EFLAGS: 00010286 [7028915.431900] RAX: 0000000000000012 RBX: ffff9aaa84a98000 RCX: 0000000000000000 [7028915.431901] RDX: 0000000000000000 RSI: ffff9ab2bfa15458 RDI: ffff9ab2bfa15458 [7028915.431902] RBP: ffff9aaa8fb4e900 R08: 0000000000000001 R09: 0000000000002eb4 [7028915.431903] R10: 00000000ffffffff R11: 0000000000000000 R12: ffff9ab1736f1b00 [7028915.431904] R13: 0000000000000000 R14: ffff9aaa8fb4e900 R15: 0000000000000001 [7028915.431906] FS: 0000000000000000(0000) GS:ffff9ab2bfa00000(0000) knlGS:0000000000000000 [7028915.431907] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [7028915.431908] CR2: 00007ff953b9f5d8 CR3: 0000000bf4009002 CR4: 00000000003606e0 [7028915.431909] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [7028915.431910] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [7028915.431910] Call Trace: [7028915.431923] handle_stripe+0x8e7/0x2020 [raid456] [7028915.431930] ? __wake_up_common_lock+0x89/0xc0 [7028915.431935] handle_active_stripes.isra.58+0x35f/0x560 [raid456] [7028915.431939] raid5_do_work+0xc6/0x1f0 [raid456] Also commit `59fc630b8b` ("RAID5: batch adjacent full stripe write") said "If a stripe is added to batch list, then only the first stripe of the list should be put to handle_list and run handle_stripe." So don't set STRIPE_HANDLE to stripe which is already in batch list, otherwise the stripe could be put to handle_list and run handle_stripe, then the above warning could be triggered. [1]. https://www.spinics.net/lists/raid/msg62552.html Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com> Signed-off-by: Song Liu <songliubraving@fb.com>	2019-09-13 13:10:05 -07:00
Nigel Croxon	b76b4715eb	raid5: don't increment read_errors on EILSEQ return While MD continues to count read errors returned by the lower layer. If those errors are -EILSEQ, instead of -EIO, it should NOT increase the read_errors count. When RAID6 is set up on dm-integrity target that detects massive corruption, the leg will be ejected from the array. Even if the issue is correctable with a sector re-write and the array has necessary redundancy to correct it. The leg is ejected because it runs up the rdev->read_errors beyond conf->max_nr_stripes. The return status in dm-drypt when there is a data integrity error is -EILSEQ (BLK_STS_PROTECTION). Signed-off-by: Nigel Croxon <ncroxon@redhat.com> Signed-off-by: Song Liu <songliubraving@fb.com>	2019-09-13 13:10:05 -07:00
Mikulas Patocka	b132ff3332	dm bufio: remove old-style buffer cleanup Remove code that cleans up buffers if the cache size grows over the limit. The next commit will introduce a new global cleanup. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2019-09-13 10:43:48 -04:00
Mikulas Patocka	af53badc0c	dm bufio: introduce a global queue Rename param_spinlock to global_spinlock and introduce a global queue of all used buffers. The queue will be used in the following commits. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2019-09-13 10:43:27 -04:00
Mikulas Patocka	d0a328a385	dm bufio: refactor adjust_total_allocated Refactor adjust_total_allocated() so that it takes a bool argument indicating if it should add or subtract the buffer size. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2019-09-13 10:43:07 -04:00
Mikulas Patocka	26d2ef0cd0	dm bufio: call adjust_total_allocated from __link_buffer and __unlink_buffer Move the call to adjust_total_allocated() to __link_buffer() and __unlink_buffer() so that only used buffers are counted. Reserved buffers are not. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2019-09-13 10:42:27 -04:00
Nikos Tsironis	7431b7835f	dm: add clone target Add the dm-clone target, which allows cloning of arbitrary block devices. dm-clone produces a one-to-one copy of an existing, read-only source device into a writable destination device: It presents a virtual block device which makes all data appear immediately, and redirects reads and writes accordingly. The main use case of dm-clone is to clone a potentially remote, high-latency, read-only, archival-type block device into a writable, fast, primary-type device for fast, low-latency I/O. The cloned device is visible/mountable immediately and the copy of the source device to the destination device happens in the background, in parallel with user I/O. When the cloning completes, the dm-clone table can be removed altogether and be replaced, e.g., by a linear table, mapping directly to the destination device. For further information and examples of how to use dm-clone, please read Documentation/admin-guide/device-mapper/dm-clone.rst Suggested-by: Vangelis Koukis <vkoukis@arrikto.com> Co-developed-by: Ilias Tsitsimpis <iliastsi@arrikto.com> Signed-off-by: Ilias Tsitsimpis <iliastsi@arrikto.com> Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2019-09-12 09:32:31 -04:00
Ming Lei	c8156fc77d	dm raid: fix updating of max_discard_sectors limit Unit of 'chunk_size' is byte, instead of sector, so fix it by setting the queue_limits' max_discard_sectors to rs->md.chunk_sectors. Also, rename chunk_size to chunk_size_bytes. Without this fix, too big max_discard_sectors is applied on the request queue of dm-raid, finally raid code has to split the bio again. This re-split done by raid causes the following nested clone_endio: 1) one big bio 'A' is submitted to dm queue, and served as the original bio 2) one new bio 'B' is cloned from the original bio 'A', and .map() is run on this bio of 'B', and B's original bio points to 'A' 3) raid code sees that 'B' is too big, and split 'B' and re-submit the remainded part of 'B' to dm-raid queue via generic_make_request(). 4) now dm will handle 'B' as new original bio, then allocate a new clone bio of 'C' and run .map() on 'C'. Meantime C's original bio points to 'B'. 5) suppose now 'C' is completed by raid directly, then the following clone_endio() is called recursively: clone_endio(C) ->clone_endio(B) #B is original bio of 'C' ->bio_endio(A) 'A' can be big enough to make hundreds of nested clone_endio(), then stack can be corrupted easily. Fixes: `61697a6abd` ("dm: eliminate 'split_discard_bios' flag from DM target interface") Cc: stable@vger.kernel.org Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2019-09-11 16:18:23 -04:00
Damien Le Moal	737eb78e82	block: Delay default elevator initialization When elevator_init_mq() is called from blk_mq_init_allocated_queue(), the only information known about the device is the number of hardware queues as the block device scan by the device driver is not completed yet for most drivers. The device type and elevator required features are not set yet, preventing to correctly select the default elevator most suitable for the device. This currently affects all multi-queue zoned block devices which default to the "none" elevator instead of the required "mq-deadline" elevator. These drives currently include host-managed SMR disks connected to a smartpqi HBA and null_blk block devices with zoned mode enabled. Upcoming NVMe Zoned Namespace devices will also be affected. Fix this by adding the boolean elevator_init argument to blk_mq_init_allocated_queue() to control the execution of elevator_init_mq(). Two cases exist: 1) elevator_init = false is used for calls to blk_mq_init_allocated_queue() within blk_mq_init_queue(). In this case, a call to elevator_init_mq() is added to __device_add_disk(), resulting in the delayed initialization of the queue elevator after the device driver finished probing the device information. This effectively allows elevator_init_mq() access to more information about the device. 2) elevator_init = true preserves the current behavior of initializing the elevator directly from blk_mq_init_allocated_queue(). This case is used for the special request based DM devices where the device gendisk is created before the queue initialization and device information (e.g. queue limits) is already known when the queue initialization is executed. Additionally, to make sure that the elevator initialization is never done while requests are in-flight (there should be none when the device driver calls device_add_disk()), freeze and quiesce the device request queue before calling blk_mq_init_sched() in elevator_init_mq(). Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-09-05 19:52:34 -06:00
Huaisheng Ye	6d1959138c	dm writecache: skip writecache_wait for pmem mode The array bio_in_progress[2] only have chance to be increased and decreased with ssd mode. For pmem mode, they are not involved at all. So skip writecache_wait_for_ios in writecache_flush for pmem. Suggested-by: Doris Yu <tyu1@lenovo.com> Signed-off-by: Huaisheng Ye <yehs1@lenovo.com> Acked-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2019-09-05 13:22:05 -04:00
Gustavo A. R. Silva	fb16c799b8	dm stats: use struct_size() helper One of the more common cases of allocation size calculations is finding the size of a structure that has a zero-sized array at the end, along with memory for some number of elements for that array. For example: struct dm_stat { ... struct dm_stat_shared stat_shared[0]; }; Make use of the struct_size() helper instead of an open-coded version in order to avoid any potential type mistakes. So, replace the following form: sizeof(struct dm_stat) + (size_t)n_entries * sizeof(struct dm_stat_shared) with: struct_size(s, stat_shared, n_entries) This code was detected with the help of Coccinelle. Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2019-09-04 09:39:22 -04:00
Guoqing Jiang	b0f01ecf29	md/raid5: use bio_end_sector to calculate last_sector Use the common way to get last_sector. Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com> Signed-off-by: Song Liu <songliubraving@fb.com>	2019-09-03 14:52:38 -07:00
Yufen Yu	07f1a6850c	md/raid1: fail run raid1 array when active disk less than one When run test case: mdadm -CR /dev/md1 -l 1 -n 4 /dev/sd[a-d] --assume-clean --bitmap=internal mdadm -S /dev/md1 mdadm -A /dev/md1 /dev/sd[b-c] --run --force mdadm --zero /dev/sda mdadm /dev/md1 -a /dev/sda echo offline > /sys/block/sdc/device/state echo offline > /sys/block/sdb/device/state sleep 5 mdadm -S /dev/md1 echo running > /sys/block/sdb/device/state echo running > /sys/block/sdc/device/state mdadm -A /dev/md1 /dev/sd[a-c] --run --force mdadm run fail with kernel message as follow: [ 172.986064] md: kicking non-fresh sdb from array! [ 173.004210] md: kicking non-fresh sdc from array! [ 173.022383] md/raid1:md1: active with 0 out of 4 mirrors [ 173.022406] md1: failed to create bitmap (-5) In fact, when active disk in raid1 array less than one, we need to return fail in raid1_run(). Reviewed-by: NeilBrown <neilb@suse.de> Signed-off-by: Yufen Yu <yuyufen@huawei.com> Signed-off-by: Song Liu <songliubraving@fb.com>	2019-09-03 14:52:03 -07:00
Guilherme G. Piccoli	62f7b1989c	md raid0/linear: Mark array as 'broken' and fail BIOs if a member is gone Currently md raid0/linear are not provided with any mechanism to validate if an array member got removed or failed. The driver keeps sending BIOs regardless of the state of array members, and kernel shows state 'clean' in the 'array_state' sysfs attribute. This leads to the following situation: if a raid0/linear array member is removed and the array is mounted, some user writing to this array won't realize that errors are happening unless they check dmesg or perform one fsync per written file. Despite udev signaling the member device is gone, 'mdadm' cannot issue the STOP_ARRAY ioctl successfully, given the array is mounted. In other words, no -EIO is returned and writes (except direct ones) appear normal. Meaning the user might think the wrote data is correctly stored in the array, but instead garbage was written given that raid0 does stripping (and so, it requires all its members to be working in order to not corrupt data). For md/linear, writes to the available members will work fine, but if the writes go to the missing member(s), it'll cause a file corruption situation, whereas the portion of the writes to the missing devices aren't written effectively. This patch changes this behavior: we check if the block device's gendisk is UP when submitting the BIO to the array member, and if it isn't, we flag the md device as MD_BROKEN and fail subsequent I/Os to that device; a read request to the array requiring data from a valid member is still completed. While flagging the device as MD_BROKEN, we also show a rate-limited warning in the kernel log. A new array state 'broken' was added too: it mimics the state 'clean' in every aspect, being useful only to distinguish if the array has some member missing. We rely on the MD_BROKEN flag to put the array in the 'broken' state. This state cannot be written in 'array_state' as it just shows one or more members of the array are missing but acts like 'clean', it wouldn't make sense to write it. With this patch, the filesystem reacts much faster to the event of missing array member: after some I/O errors, ext4 for instance aborts the journal and prevents corruption. Without this change, we're able to keep writing in the disk and after a machine reboot, e2fsck shows some severe fs errors that demand fixing. This patch was tested in ext4 and xfs filesystems, and requires a 'mdadm' counterpart to handle the 'broken' state. Cc: Song Liu <songliubraving@fb.com> Reviewed-by: NeilBrown <neilb@suse.de> Signed-off-by: Guilherme G. Piccoli <gpiccoli@canonical.com> Signed-off-by: Song Liu <songliubraving@fb.com>	2019-09-03 14:49:28 -07:00
Ard Biesheuvel	b1d1e29639	dm crypt: omit parsing of the encapsulated cipher Only the ESSIV IV generation mode used to use cc->cipher so it could instantiate the bare cipher used to encrypt the IV. However, this is now taken care of by the ESSIV template, and so no users of cc->cipher remain. So remove it altogether. Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Tested-by: Milan Broz <gmazyland@gmail.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2019-09-03 16:46:16 -04:00
Ard Biesheuvel	a1a262b66e	dm crypt: switch to ESSIV crypto API template Replace the explicit ESSIV handling in the dm-crypt driver with calls into the crypto API, which now possesses the capability to perform this processing within the crypto subsystem. Note that we reorder the AEAD cipher_api string parsing with the TFM instantiation: this is needed because cipher_api is mangled by the ESSIV handling, and throws off the parsing of "authenc(" otherwise. Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Tested-by: Milan Broz <gmazyland@gmail.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2019-09-03 16:45:54 -04:00
Kent Overstreet	a22a9602b8	closures: fix a race on wakeup from closure_sync The race was when a thread using closure_sync() notices cl->s->done == 1 before the thread calling closure_put() calls wake_up_process(). Then, it's possible for that thread to return and exit just before wake_up_process() is called - so we're trying to wake up a process that no longer exists. rcu_read_lock() is sufficient to protect against this, as there's an rcu barrier somewhere in the process teardown path. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Acked-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-09-03 08:08:31 -06:00
Dan Carpenter	d66c9920c0	bcache: Fix an error code in bch_dump_read() The copy_to_user() function returns the number of bytes remaining to be copied, but the intention here was to return -EFAULT if the copy fails. Fixes: `cafe563591` ("bcache: A block layer cache") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-09-03 08:08:29 -06:00
Shile Zhang	d55a4ae9e1	bcache: add cond_resched() in __bch_cache_cmp() Read /sys/fs/bcache/<uuid>/cacheN/priority_stats can take very long time with huge cache after long run. Signed-off-by: Shile Zhang <shile.zhang@linux.alibaba.com> Tested-by: Heitor Alves de Siqueira <halves@canonical.com> Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-09-03 08:08:28 -06:00
Nigel Croxon	0009fad033	raid5 improve too many read errors msg by adding limits Often limits can be changed by admin. When discussing such things it helps if you can provide "self-sustained" facts. Also sometimes the admin thinks he changed a limit, but it did not take effect for some reason or he changed the wrong thing. V3: Only pr_warn when Faulty is 0. V2: Add read_errors value to pr_warn. Signed-off-by: Nigel Croxon <ncroxon@redhat.com> Signed-off-by: Song Liu <songliubraving@fb.com>	2019-08-27 12:36:37 -07:00
NeilBrown	9d4b45d6af	md: don't report active array_state until after revalidate_disk() completes. Until revalidate_disk() has completed, the size of a new md array will appear to be zero. So we shouldn't report, through array_state, that the array is active until that time. udev rules check array_state to see if the array is ready. As soon as it appear to be zero, fsck can be run. If it find the size to be zero, it will fail. So add a new flag to provide an interlock between do_md_run() and array_state_show(). This flag is set while do_md_run() is active and it prevents array_state_show() from reporting that the array is active. Before do_md_run() is called, ->pers will be NULL so array is definitely not active. After do_md_run() is called, revalidate_disk() will have run and the array will be completely ready. We also move various sysfs_notify*() calls out of md_run() into do_md_run() after MD_NOT_READY is cleared. This ensure the information is ready before the notification is sent. Prior to v4.12, array_state_show() was called with the mddev->reconfig_mutex held, which provided exclusion with do_md_run(). Note that MD_NOT_READY cleared twice. This is deliberate to cover both success and error paths with minimal noise. Fixes: `b7b17c9b67` ("md: remove mddev_lock() from md_attr_show()") Cc: stable@vger.kernel.org (v4.12++) Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Song Liu <songliubraving@fb.com>	2019-08-27 12:36:37 -07:00
NeilBrown	480523feae	md: only call set_in_sync() when it is expected to succeed. Since commit `4ad23a9764` ("MD: use per-cpu counter for writes_pending"), set_in_sync() is substantially more expensive: it can wait for a full RCU grace period which can be 10s of milliseconds. So we should only call it when the cost is justified. md_check_recovery() currently calls set_in_sync() every time it finds anything to do (on non-external active arrays). For an array performing resync or recovery, this will be quite often. Each call will introduce a delay to the md thread, which can noticeable affect IO submission latency. In md_check_recovery() we only need to call set_in_sync() if 'safemode' was non-zero at entry, meaning that there has been not recent IO. So we save this "safemode was nonzero" state, and only call set_in_sync() if it was non-zero. This measurably reduces mean and maximum IO submission latency during resync/recovery. Reported-and-tested-by: Jack Wang <jinpu.wang@cloud.ionos.com> Fixes: `4ad23a9764` ("MD: use per-cpu counter for writes_pending") Cc: stable@vger.kernel.org (v4.12+) Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Song Liu <songliubraving@fb.com>	2019-08-27 12:36:36 -07:00
ZhangXiaoxu	c1499a044d	dm space map common: remove check for impossible sm_find_free() return value The function sm_find_free() just returns -ENOSPC and 0. So remove lone caller's check for some other error. Signed-off-by: ZhangXiaoxu <zhangxiaoxu5@huawei.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2019-08-26 15:39:53 -04:00
Gustavo A. R. Silva	bcd676542c	dm raid1: use struct_size() with kzalloc() One of the more common cases of allocation size calculations is finding the size of a structure that has a zero-sized array at the end, along with memory for some number of elements for that array. For example: struct mirror_set { ... struct mirror mirror[0]; }; size = sizeof(struct mirror_set) + count * sizeof(struct mirror); instance = kzalloc(size, GFP_KERNEL) Instead of leaving these open-coded and prone to type mistakes, we can now use the new struct_size() helper: instance = kzalloc(struct_size(instance, mirror, count), GFP_KERNEL) Notice that, in this case, variable len is not necessary, hence it is removed. This code was detected with the help of Coccinelle. Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2019-08-26 11:05:32 -04:00
Huaisheng Ye	5229b4896e	dm writecache: optimize performance by sorting the blocks for writeback_all During the process of writeback, the blocks, which have been placed in wbl.list for writeback soon, are partially ordered for the contiguous ones. When writeback_all has been set, for most cases, also by default, there will be a lot of blocks in pmem need to writeback at the same time. For this case, we could optimize the performance by sorting all blocks in wbl.list. writecache_writeback doesn't need to get blocks from the tail of wc->lru, whereas from the first rb_node from the rb_tree. The benefit is that, writecache_writeback doesn't need to have any cost to sort the blocks, because of all blocks are incremental originally in rb_tree. There will be a writecache_flush when writeback_all begins to work, that will eliminate duplicate blocks in cache by committed/uncommitted. Testing platform: Thinksystem SR630 with persistent memory. The cache comes from pmem, which has 1006MB size. The origin device is HDD, 2GB of which for using. Testing steps: 1) dmsetup create mycache --table '0 4194304 writecache p /dev/sdb1 /dev/pmem4 4096 0' 2) fio -filename=/dev/mapper/mycache -direct=1 -iodepth=20 -rw=randwrite -ioengine=libaio -bs=4k -loops=1 -size=2g -group_reporting -name=mytest1 3) time dmsetup message /dev/mapper/mycache 0 flush Here is the results below, With the patch: # fio -filename=/dev/mapper/mycache -direct=1 -iodepth=20 -rw=randwrite -ioengine=libaio -bs=4k -loops=1 -size=2g -group_reporting -name=mytest1 iops : min= 1582, max=199470, avg=5305.94, stdev=21273.44, samples=197 # time dmsetup message /dev/mapper/mycache 0 flush real 0m44.020s user 0m0.002s sys 0m0.003s Without the patch: # fio -filename=/dev/mapper/mycache -direct=1 -iodepth=20 -rw=randwrite -ioengine=libaio -bs=4k -loops=1 -size=2g -group_reporting -name=mytest1 iops : min= 1202, max=197650, avg=4968.67, stdev=20480.17, samples=211 # time dmsetup message /dev/mapper/mycache 0 flush real 1m39.221s user 0m0.001s sys 0m0.003s I also have checked the data accuracy with this patch by making EXT4 filesystem on mycache, then mount it for checking md5 of files on that. The test result is positive, with this patch it could save more than half of time when writeback_all. Signed-off-by: Huaisheng Ye <yehs1@lenovo.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2019-08-26 10:59:00 -04:00
Huaisheng Ye	62421b3880	dm writecache: add unlikely for getting two block with same LBA In function writecache_writeback, entries g and f has same original sector only happens at entry f has been committed, but entry g has NOT yet. The probability of this happening is very low in the following 256 blocks at most of entry e. Signed-off-by: Huaisheng Ye <yehs1@lenovo.com> Acked-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2019-08-26 10:54:41 -04:00
Huaisheng Ye	58912dbce6	dm writecache: remove unused member pointer in writeback_struct The stucture member pointer page in writeback_struct never has been used actually. Remove it. Signed-off-by: Huaisheng Ye <yehs1@lenovo.com> Acked-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2019-08-26 10:54:15 -04:00
Mikulas Patocka	0c8e9c2d66	dm zoned: fix invalid memory access Commit `75d66ffb48` ("dm zoned: properly handle backing device failure") triggers a coverity warning: *** CID 1452808: Memory - illegal accesses (USE_AFTER_FREE) /drivers/md/dm-zoned-target.c: 137 in dmz_submit_bio() 131 clone->bi_private = bioctx; 132 133 bio_advance(bio, clone->bi_iter.bi_size); 134 135 refcount_inc(&bioctx->ref); 136 generic_make_request(clone); >>> CID 1452808: Memory - illegal accesses (USE_AFTER_FREE) >>> Dereferencing freed pointer "clone". 137 if (clone->bi_status == BLK_STS_IOERR) 138 return -EIO; 139 140 if (bio_op(bio) == REQ_OP_WRITE && dmz_is_seq(zone)) 141 zone->wp_block += nr_blocks; 142 The "clone" bio may be processed and freed before the check "clone->bi_status == BLK_STS_IOERR" - so this check can access invalid memory. Fixes: `75d66ffb48` ("dm zoned: properly handle backing device failure") Cc: stable@vger.kernel.org Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2019-08-26 10:33:58 -04:00
Jaskaran Khurana	88cd3e6cfa	dm verity: add root hash pkcs#7 signature verification The verification is to support cases where the root hash is not secured by Trusted Boot, UEFI Secureboot or similar technologies. One of the use cases for this is for dm-verity volumes mounted after boot, the root hash provided during the creation of the dm-verity volume has to be secure and thus in-kernel validation implemented here will be used before we trust the root hash and allow the block device to be created. The signature being provided for verification must verify the root hash and must be trusted by the builtin keyring for verification to succeed. The hash is added as a key of type "user" and the description is passed to the kernel so it can look it up and use it for verification. Adds CONFIG_DM_VERITY_VERIFY_ROOTHASH_SIG which can be turned on if root hash verification is needed. Kernel commandline dm_verity module parameter 'require_signatures' will indicate whether to force root hash signature verification (for all dm verity volumes). Signed-off-by: Jaskaran Khurana <jaskarankhurana@linux.microsoft.com> Tested-and-Reviewed-by: Milan Broz <gmazyland@gmail.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2019-08-23 10:13:14 -04:00
Ard Biesheuvel	39d13a1ac4	dm crypt: reuse eboiv skcipher for IV generation Instead of instantiating a separate cipher to perform the encryption needed to produce the IV, reuse the skcipher used for the block data and invoke it one additional time for each block to encrypt a zero vector and use the output as the IV. For CBC mode, this is equivalent to using the bare block cipher, but without the risk of ending up with a non-time invariant implementation of AES when the skcipher itself is time variant (e.g., arm64 without Crypto Extensions has a NEON based time invariant implementation of cbc(aes) but no time invariant implementation of the core cipher other than aes-ti, which is not enabled by default). This approach is a compromise between dm-crypt API flexibility and reducing dependence on parts of the crypto API that should not usually be exposed to other subsystems, such as the bare cipher API. Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Tested-by: Milan Broz <gmazyland@gmail.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2019-08-23 10:13:13 -04:00
Mikulas Patocka	123d87d553	dm: make dm_table_find_target return NULL Currently, if we pass too high sector number to dm_table_find_target, it returns zeroed dm_target structure and callers test if the structure is zeroed with the macro dm_target_is_valid. However, returning NULL is common practice to indicate errors. This patch refactors the dm code, so that dm_table_find_target returns NULL and its callers test the returned value for NULL. The macro dm_target_is_valid is deleted. In alloc_targets, we no longer allocate an extra zeroed target. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2019-08-23 10:13:12 -04:00
Mikulas Patocka	1cfd5d3399	dm table: fix invalid memory accesses with too high sector number If the sector number is too high, dm_table_find_target() should return a pointer to a zeroed dm_target structure (the caller should test it with dm_target_is_valid). However, for some table sizes, the code in dm_table_find_target() that performs btree lookup will access out of bound memory structures. Fix this bug by testing the sector number at the beginning of dm_table_find_target(). Also, add an "inline" keyword to the function dm_table_get_size() because this is a hot path. Fixes: `512875bd96` ("dm: table detect io beyond device") Cc: stable@vger.kernel.org Reported-by: Zhang Tao <kontais@zoho.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2019-08-23 10:11:42 -04:00
ZhangXiaoxu	ae148243d3	dm space map metadata: fix missing store of apply_bops() return value In commit `6096d91af0` ("dm space map metadata: fix occasional leak of a metadata block on resize"), we refactor the commit logic to a new function 'apply_bops'. But when that logic was replaced in out() the return value was not stored. This may lead out() returning a wrong value to the caller. Fixes: `6096d91af0` ("dm space map metadata: fix occasional leak of a metadata block on resize") Cc: stable@vger.kernel.org Signed-off-by: ZhangXiaoxu <zhangxiaoxu5@huawei.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2019-08-22 16:11:24 -04:00
ZhangXiaoxu	e4f9d60138	dm btree: fix order of block initialization in btree_split_beneath When btree_split_beneath() splits a node to two new children, it will allocate two blocks: left and right. If right block's allocation failed, the left block will be unlocked and marked dirty. If this happened, the left block'ss content is zero, because it wasn't initialized with the btree struct before the attempot to allocate the right block. Upon return, when flushing the left block to disk, the validator will fail when check this block. Then a BUG_ON is raised. Fix this by completely initializing the left block before allocating and initializing the right block. Fixes: `4dcb8b57df` ("dm btree: fix leak of bufio-backed block in btree_split_beneath error path") Cc: stable@vger.kernel.org Signed-off-by: ZhangXiaoxu <zhangxiaoxu5@huawei.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2019-08-22 16:11:23 -04:00
Wenwen Wang	dc1a3e8e0c	dm raid: add missing cleanup in raid_ctr() If rs_prepare_reshape() fails, no cleanup is executed, leading to leak of the raid_set structure allocated at the beginning of raid_ctr(). To fix this issue, go to the label 'bad' if the error occurs. Fixes: `11e4723206` ("dm raid: stop keeping raid set frozen altogether") Cc: stable@vger.kernel.org Signed-off-by: Wenwen Wang <wenwen@cs.uga.edu> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2019-08-21 11:47:05 -04:00
Dan Carpenter	e0702d90b7	dm zoned: fix potential NULL dereference in dmz_do_reclaim() This function is supposed to return error pointers so it matches the dmz_get_rnd_zone_for_reclaim() function. The current code could lead to a NULL dereference in dmz_do_reclaim() Fixes: `b234c6d7a7` ("dm zoned: improve error handling in reclaim") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Dmitry Fomichev <dmitry.fomichev@wdc.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2019-08-21 11:29:30 -04:00
Bryan Gurney	08c04c84a5	dm dust: use dust block size for badblocklist index Change the "frontend" dust_remove_block, dust_add_block, and dust_query_block functions to store the "dust block number", instead of the sector number corresponding to the "dust block number". For the "backend" functions dust_map_read and dust_map_write, right-shift by sect_per_block_shift. This fixes the inability to emulate failure beyond the first sector of each "dust block" (for devices with a "dust block size" larger than 512 bytes). Fixes: `e4f3fabd67` ("dm: add dust target") Cc: stable@vger.kernel.org Signed-off-by: Bryan Gurney <bgurney@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2019-08-21 11:27:17 -04:00
Mikulas Patocka	5729b6e5a1	dm integrity: fix a crash due to BUG_ON in __journal_read_write() Fix a crash that was introduced by the commit `724376a04d`. The crash is reported here: https://gitlab.com/cryptsetup/cryptsetup/issues/468 When reading from the integrity device, the function dm_integrity_map_continue calls find_journal_node to find out if the location to read is present in the journal. Then, it calculates how many sectors are consecutively stored in the journal. Then, it locks the range with add_new_range and wait_and_add_new_range. The problem is that during wait_and_add_new_range, we hold no locks (we don't hold ic->endio_wait.lock and we don't hold a range lock), so the journal may change arbitrarily while wait_and_add_new_range sleeps. The code then goes to __journal_read_write and hits BUG_ON(journal_entry_get_sector(je) != logical_sector); because the journal has changed. In order to fix this bug, we need to re-check the journal location after wait_and_add_new_range. We restrict the length to one block in order to not complicate the code too much. Fixes: `724376a04d` ("dm integrity: implement fair range locks") Cc: stable@vger.kernel.org # v4.19+ Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2019-08-15 16:01:57 -04:00
Dmitry Fomichev	ad1bd578bd	dm zoned: fix a few typos Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com> Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2019-08-15 15:57:43 -04:00
Dmitry Fomichev	bae9a0aa33	dm zoned: add SPDX license identifiers Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com> Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2019-08-15 15:57:42 -04:00
Dmitry Fomichev	75d66ffb48	dm zoned: properly handle backing device failure dm-zoned is observed to lock up or livelock in case of hardware failure or some misconfiguration of the backing zoned device. This patch adds a new dm-zoned target function that checks the status of the backing device. If the request queue of the backing device is found to be in dying state or the SCSI backing device enters offline state, the health check code sets a dm-zoned target flag prompting all further incoming I/O to be rejected. In order to detect backing device failures timely, this new function is called in the request mapping path, at the beginning of every reclaim run and before performing any metadata I/O. The proper way out of this situation is to do dmsetup remove <dm-zoned target> and recreate the target when the problem with the backing device is resolved. Fixes: `3b1a94c88b` ("dm zoned: drive-managed zoned block device target") Cc: stable@vger.kernel.org Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com> Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2019-08-15 15:57:42 -04:00
Dmitry Fomichev	d7428c5011	dm zoned: improve error handling in i/o map code Some errors are ignored in the I/O path during queueing chunks for processing by chunk works. Since at least these errors are transient in nature, it should be possible to retry the failed incoming commands. The fix - Errors that can happen while queueing chunks are carried upwards to the main mapping function and it now returns DM_MAPIO_REQUEUE for any incoming requests that can not be properly queued. Error logging/debug messages are added where needed. Fixes: `3b1a94c88b` ("dm zoned: drive-managed zoned block device target") Cc: stable@vger.kernel.org Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com> Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2019-08-15 15:57:41 -04:00
Dmitry Fomichev	b234c6d7a7	dm zoned: improve error handling in reclaim There are several places in reclaim code where errors are not propagated to the main function, dmz_reclaim(). This function is responsible for unlocking zones that might be still locked at the end of any failed reclaim iterations. As the result, some device zones may be left permanently locked for reclaim, degrading target's capability to reclaim zones. This patch fixes these issues as follows - Make sure that dmz_reclaim_buf(), dmz_reclaim_seq_data() and dmz_reclaim_rnd_data() return error codes to the caller. dmz_reclaim() function is renamed to dmz_do_reclaim() to avoid clashing with "struct dmz_reclaim" and is modified to return the error to the caller. dmz_get_zone_for_reclaim() now returns an error instead of NULL pointer and reclaim code checks for that error. Error logging/debug messages are added where necessary. Fixes: `3b1a94c88b` ("dm zoned: drive-managed zoned block device target") Cc: stable@vger.kernel.org Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com> Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2019-08-15 15:57:40 -04:00
Dmitry Fomichev	d1fef41465	dm kcopyd: always complete failed jobs This patch fixes a problem in dm-kcopyd that may leave jobs in complete queue indefinitely in the event of backing storage failure. This behavior has been observed while running 100% write file fio workload against an XFS volume created on top of a dm-zoned target device. If the underlying storage of dm-zoned goes to offline state under I/O, kcopyd sometimes never issues the end copy callback and dm-zoned reclaim work hangs indefinitely waiting for that completion. This behavior was traced down to the error handling code in process_jobs() function that places the failed job to complete_jobs queue, but doesn't wake up the job handler. In case of backing device failure, all outstanding jobs may end up going to complete_jobs queue via this code path and then stay there forever because there are no more successful I/O jobs to wake up the job handler. This patch adds a wake() call to always wake up kcopyd job wait queue for all I/O jobs that fail before dm_io() gets called for that job. The patch also sets the write error status in all sub jobs that are failed because their master job has failed. Fixes: `b73c67c2cb` ("dm kcopyd: add sequential write feature") Cc: stable@vger.kernel.org Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com> Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2019-08-15 15:57:39 -04:00
Mikulas Patocka	cf3591ef83	Revert "dm bufio: fix deadlock with loop device" Revert the commit `bd293d071f`. The proper fix has been made available with commit `d0a255e795` ("loop: set PF_MEMALLOC_NOIO for the worker thread"). Note that the fix offered by commit `bd293d071f` doesn't really prevent the deadlock from occuring - if we look at the stacktrace reported by Junxiao Bi, we see that it hangs in bit_wait_io and not on the mutex - i.e. it has already successfully taken the mutex. Changing the mutex from mutex_lock to mutex_trylock won't help with deadlocks that happen afterwards. PID: 474 TASK: ffff8813e11f4600 CPU: 10 COMMAND: "kswapd0" #0 [ffff8813dedfb938] __schedule at ffffffff8173f405 #1 [ffff8813dedfb990] schedule at ffffffff8173fa27 #2 [ffff8813dedfb9b0] schedule_timeout at ffffffff81742fec #3 [ffff8813dedfba60] io_schedule_timeout at ffffffff8173f186 #4 [ffff8813dedfbaa0] bit_wait_io at ffffffff8174034f #5 [ffff8813dedfbac0] __wait_on_bit at ffffffff8173fec8 #6 [ffff8813dedfbb10] out_of_line_wait_on_bit at ffffffff8173ff81 #7 [ffff8813dedfbb90] __make_buffer_clean at ffffffffa038736f [dm_bufio] #8 [ffff8813dedfbbb0] __try_evict_buffer at ffffffffa0387bb8 [dm_bufio] #9 [ffff8813dedfbbd0] dm_bufio_shrink_scan at ffffffffa0387cc3 [dm_bufio] #10 [ffff8813dedfbc40] shrink_slab at ffffffff811a87ce #11 [ffff8813dedfbd30] shrink_zone at ffffffff811ad778 #12 [ffff8813dedfbdc0] kswapd at ffffffff811ae92f #13 [ffff8813dedfbec0] kthread at ffffffff810a8428 #14 [ffff8813dedfbf50] ret_from_fork at ffffffff81745242 Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Cc: stable@vger.kernel.org Fixes: `bd293d071f` ("dm bufio: fix deadlock with loop device") Depends-on: `d0a255e795` ("loop: set PF_MEMALLOC_NOIO for the worker thread") Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2019-08-15 15:57:39 -04:00
Linus Torvalds	50e73a4a41	for-linus-20190809 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl1NmNIQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpouEEADA1e2HH9AO8QGWBVSeFi5ZdRnEXsJUfNGD d576M9empGI/hkbui0yY2tuodnMwmhUj641GTRC2dzl7RRyGcTMmqtGYPcyczlpU +6Yil3+XVNYTXRpUsKWs32H9aubNZl/L3rCcJImGvMgLW5YtEjAZJFIFIzXWWwDJ aZpTtnOH1+D3HH6HT35xk+aytSYZ7LsZ7X3LI9ZumKOUd2HJZGUkfWLXxgSuuUh/ /WaBEQ9xzDNARmfx9Qb/2wSAE7XOInupPr86fI9dnmXHZ8rwhsvHRIZEvNIIqF5Z KzbF+rGJZ2bizKFpVFdlrIIyfBleFlQGFzYnGrs/+47zAl/CGicqmuAhGcOdGQXd wyj6G66iBcBIQC9hsYhPtglyQSk95tzsPZLZa7/1TlhEKl3mpor4w30NrLz/P5oy gdIivDhKP7aRFzBWw/2O4TOn3HhGxnZWVni6icOHE/pBQLBW12Ulc1nT0SiJ8RAt PYhMCFwz0Xc7pYuEfGZKwNSCIrx4i7f+spWEAt0Fget/KaB993zm+K4OzGbfBv6r FrZ5g/2MTVBJITmJVN2LDysF4EVEOzTULX+bCQOOWC7wmE/poCyNWj8gCpunIZsY V5DAB+cbEw/OLfpdiogXvCcxrbukKHV8AVdMdLYB5yEAuPN0JMqcunMjECxpNAha 1wsv86ljvw== =Au30 -----END PGP SIGNATURE----- Merge tag 'for-linus-20190809' of git://git.kernel.dk/linux-block Pull block fixes from Jens Axboe: - Revert of a bcache patch that caused an oops for some (Coly) - ata rb532 unused warning fix (Gustavo) - AoE kernel crash fix (He) - Error handling fixup for blkdev_get() (Jan) - libata read/write translation and SFF PIO fix (me) - Use after free and error handling fix for O_DIRECT fragments. There's still a nowait + sync oddity in there, we'll nail that start next week. If all else fails, I'll queue a revert of the NOWAIT change. (me) - Loop GFP_KERNEL -> GFP_NOIO deadlock fix (Mikulas) - Two BFQ regression fixes that caused crashes (Paolo) * tag 'for-linus-20190809' of git://git.kernel.dk/linux-block: bcache: Revert "bcache: use sysfs_match_string() instead of __sysfs_match_string()" loop: set PF_MEMALLOC_NOIO for the worker thread bdev: Fixup error handling in blkdev_get() block, bfq: handle NULL return value by bfq_init_rq() block, bfq: move update of waker and woken list to queue freeing block, bfq: reset last_completed_rq_bfqq if the pointed queue is freed block: aoe: Fix kernel crash due to atomic sleep when exiting libata: add SG safety checks in SFF pio transfers libata: have ata_scsi_rw_xlat() fail invalid passthrough requests block: fix O_DIRECT error handling for bio fragments ata: rb532_cf: Fix unused variable warning in rb532_pata_driver_probe	2019-08-09 09:28:18 -07:00
Coly Li	20621fedb2	bcache: Revert "bcache: use sysfs_match_string() instead of __sysfs_match_string()" This reverts commit `89e0341af0`. In drivers/md/bcache/sysfs.c:bch_snprint_string_list(), NULL pointer at the end of list is necessary. Remove the NULL from last element of each lists will cause the following panic, [ 4340.455652] bcache: register_cache() registered cache device nvme0n1 [ 4340.464603] bcache: register_bdev() registered backing device sdk [ 4421.587335] bcache: bch_cached_dev_run() cached dev sdk is running already [ 4421.587348] bcache: bch_cached_dev_attach() Caching sdk as bcache0 on set 354e1d46-d99f-4d8b-870b-078b80dc88a6 [ 5139.247950] general protection fault: 0000 [#1] SMP NOPTI [ 5139.247970] CPU: 9 PID: 5896 Comm: cat Not tainted 4.12.14-95.29-default #1 SLE12-SP4 [ 5139.247988] Hardware name: HPE ProLiant DL380 Gen10/ProLiant DL380 Gen10, BIOS U30 04/18/2019 [ 5139.248006] task: ffff888fb25c0b00 task.stack: ffff9bbacc704000 [ 5139.248021] RIP: 0010:string+0x21/0x70 [ 5139.248030] RSP: 0018:ffff9bbacc707bf0 EFLAGS: 00010286 [ 5139.248043] RAX: ffffffffa7e432e3 RBX: ffff8881c20da02a RCX: ffff0a00ffffff04 [ 5139.248058] RDX: 3f00656863616362 RSI: ffff8881c20db000 RDI: ffffffffffffffff [ 5139.248075] RBP: ffff8881c20db000 R08: 0000000000000000 R09: ffff8881c20da02a [ 5139.248090] R10: 0000000000000004 R11: 0000000000000000 R12: ffff9bbacc707c48 [ 5139.248104] R13: 0000000000000fd6 R14: ffffffffc0665855 R15: ffffffffc0665855 [ 5139.248119] FS: 00007faf253b8700(0000) GS:ffff88903f840000(0000) knlGS:0000000000000000 [ 5139.248137] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 5139.248149] CR2: 00007faf25395008 CR3: 0000000f72150006 CR4: 00000000007606e0 [ 5139.248164] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 5139.248179] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 5139.248193] PKRU: 55555554 [ 5139.248200] Call Trace: [ 5139.248210] vsnprintf+0x1fb/0x510 [ 5139.248221] snprintf+0x39/0x40 [ 5139.248238] bch_snprint_string_list.constprop.15+0x5b/0x90 [bcache] [ 5139.248256] __bch_cached_dev_show+0x44d/0x5f0 [bcache] [ 5139.248270] ? __alloc_pages_nodemask+0xb2/0x210 [ 5139.248284] bch_cached_dev_show+0x2c/0x50 [bcache] [ 5139.248297] sysfs_kf_seq_show+0xbb/0x190 [ 5139.248308] seq_read+0xfc/0x3c0 [ 5139.248317] __vfs_read+0x26/0x140 [ 5139.248327] vfs_read+0x87/0x130 [ 5139.248336] SyS_read+0x42/0x90 [ 5139.248346] do_syscall_64+0x74/0x160 [ 5139.248358] entry_SYSCALL_64_after_hwframe+0x3d/0xa2 [ 5139.248370] RIP: 0033:0x7faf24eea370 [ 5139.248379] RSP: 002b:00007fff82d03f38 EFLAGS: 00000246 ORIG_RAX: 0000000000000000 [ 5139.248395] RAX: ffffffffffffffda RBX: 0000000000020000 RCX: 00007faf24eea370 [ 5139.248411] RDX: 0000000000020000 RSI: 00007faf25396000 RDI: 0000000000000003 [ 5139.248426] RBP: 00007faf25396000 R08: 00000000ffffffff R09: 0000000000000000 [ 5139.248441] R10: 000000007c9d4d41 R11: 0000000000000246 R12: 00007faf25396000 [ 5139.248456] R13: 0000000000000003 R14: 0000000000000000 R15: 0000000000000fff [ 5139.248892] Code: ff ff ff 0f 1f 80 00 00 00 00 49 89 f9 48 89 cf 48 c7 c0 e3 32 e4 a7 48 c1 ff 30 48 81 fa ff 0f 00 00 48 0f 46 d0 48 85 ff 74 45 <44> 0f b6 02 48 8d 42 01 45 84 c0 74 38 48 01 fa 4c 89 cf eb 0e The simplest way to fix is to revert commit `89e0341af0` ("bcache: use sysfs_match_string() instead of __sysfs_match_string()"). This bug was introduced in Linux v5.2, so this fix only applies to Linux v5.2 is enough for stable tree maintainer. Fixes: `89e0341af0` ("bcache: use sysfs_match_string() instead of __sysfs_match_string()") Cc: stable@vger.kernel.org Cc: Alexandru Ardelean <alexandru.ardelean@analog.com> Reported-by: Peifeng Lin <pflin@suse.com> Acked-by: Alexandru Ardelean <alexandru.ardelean@analog.com> Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-08-09 07:37:33 -06:00
Hou Tao	449808a254	raid1: factor out a common routine to handle the completion of sync write It's just code clean-up. Signed-off-by: Hou Tao <houtao1@huawei.com> Signed-off-by: Song Liu <songliubraving@fb.com>	2019-08-07 10:25:02 -07:00
Guoqing Jiang	0d8ed0e9bf	md: don't call spare_active in md_reap_sync_thread if all member devices can't work When add one disk to array, the md_reap_sync_thread is responsible to activate the spare and set In_sync flag for the new member in spare_active(). But if raid1 has one member disk A, and disk B is added to the array. Then we offline A before all the datas are synchronized from A to B, obviously B doesn't have the latest data as A, but B is still marked with In_sync flag. So let's not call spare_active under the condition, otherwise B is still showed with 'U' state which is not correct. Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com> Signed-off-by: Song Liu <songliubraving@fb.com>	2019-08-07 10:25:02 -07:00
Guoqing Jiang	062f5b2ae1	md: don't set In_sync if array is frozen When a disk is added to array, the following path is called in mdadm. Manage_subdevs -> sysfs_freeze_array -> Manage_add -> sysfs_set_str(&info, NULL, "sync_action","idle") Then from kernel side, Manage_add invokes the path (add_new_disk -> validate_super = super_1_validate) to set In_sync flag. Since In_sync means "device is in_sync with rest of array", and the new added disk need to resync thread to help the synchronization of data. And md_reap_sync_thread would call spare_active to set In_sync for the new added disk finally. So don't set In_sync if array is in frozen. Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com> Signed-off-by: Song Liu <songliubraving@fb.com>	2019-08-07 10:25:02 -07:00
Guoqing Jiang	9a567843f7	md: allow last device to be forcibly removed from RAID1/RAID10. When the 'last' device in a RAID1 or RAID10 reports an error, we do not mark it as failed. This would serve little purpose as there is no risk of losing data beyond that which is obviously lost (as there is with RAID5), and there could be other sectors on the device which are readable, and only readable from this device. This in general this maximises access to data. However the current implementation also stops an admin from removing the last device by direct action. This is rarely useful, but in many case is not harmful and can make automation easier by removing special cases. Also, if an attempt to write metadata fails the device must be marked as faulty, else an infinite loop will result, attempting to update the metadata on all non-faulty devices. So add 'fail_last_dev' member to 'struct mddev', then we can bypasses the 'last disk' checks for RAID1 and RAID10, and control the behavior per array by change sysfs node. Signed-off-by: NeilBrown <neilb@suse.de> [add sysfs node for fail_last_dev by Guoqing] Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com> Signed-off-by: Song Liu <songliubraving@fb.com>	2019-08-07 10:25:02 -07:00
Andy Shevchenko	cf89160793	md: Convert to use int_pow() Instead of linear approach to calculate power of 10, use generic int_pow() which does it better. Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: Song Liu <songliubraving@fb.com>	2019-08-07 10:25:02 -07:00
Yufen Yu	7cee6d4e60	md/raid10: end bio when the device faulty Just like raid1, we do not queue write error bio to retry write and acknowlege badblocks, when the device is faulty. Signed-off-by: Yufen Yu <yuyufen@huawei.com> Signed-off-by: Song Liu <songliubraving@fb.com>	2019-08-07 10:25:02 -07:00
Yufen Yu	eeba6809d8	md/raid1: end bio when the device faulty When write bio return error, it would be added to conf->retry_list and wait for raid1d thread to retry write and acknowledge badblocks. In narrow_write_error(), the error bio will be split in the unit of badblock shift (such as one sector) and raid1d thread issues them one by one. Until all of the splited bio has finished, raid1d thread can go on processing other things, which is time consuming. But, there is a scene for error handling that is not necessary. When the device has been set faulty, flush_bio_list() may end bios in pending_bio_list with error status. Since these bios has not been issued to the device actually, error handlding to retry write and acknowledge badblocks make no sense. Even without that scene, when the device is faulty, badblocks info can not be written out to the device. Thus, we also no need to handle the error IO. Signed-off-by: Yufen Yu <yuyufen@huawei.com> Signed-off-by: Song Liu <songliubraving@fb.com>	2019-08-07 10:25:02 -07:00
Xiao Ni	143f6e733b	md/raid6: Set R5_ReadError when there is read failure on parity disk `7471fb77ce` ("md/raid6: Fix anomily when recovering a single device in RAID6.") avoids rereading P when it can be computed from other members. However, this misses the chance to re-write the right data to P. This patch sets R5_ReadError if the re-read fails. Also, when re-read is skipped, we also missed the chance to reset rdev->read_errors to 0. It can fail the disk when there are many read errors on P member disk (other disks don't have read error) V2: upper layer read request don't read parity/Q data. So there is no need to consider such situation. This is Reported-by: kbuild test robot <lkp@intel.com> Fixes: `7471fb77ce` ("md/raid6: Fix anomily when recovering a single device in RAID6.") Cc: <stable@vger.kernel.org> #4.4+ Signed-off-by: Xiao Ni <xni@redhat.com> Signed-off-by: Song Liu <songliubraving@fb.com>	2019-08-07 10:25:02 -07:00
Hou Tao	4675719d0f	raid1: use an int as the return value of raise_barrier() Using a sector_t as the return value is misleading, because raise_barrier() only return 0 or -EINTR. Also add comments for the return values of raise_barrier(). Signed-off-by: Hou Tao <houtao1@huawei.com> Signed-off-by: Song Liu <songliubraving@fb.com>	2019-08-07 10:25:02 -07:00
Ming Lei	226b4fc75c	blk-mq: add callback of .cleanup_rq SCSI maintains its own driver private data hooked off of each SCSI request, and the pridate data won't be freed after scsi_queue_rq() returns BLK_STS_RESOURCE or BLK_STS_DEV_RESOURCE. An upper layer driver (e.g. dm-rq) may need to retry these SCSI requests, before SCSI has fully dispatched them, due to a lower level SCSI driver's resource limitation identified in scsi_queue_rq(). Currently SCSI's per-request private data is leaked when the upper layer driver (dm-rq) frees and then retries these requests in response to BLK_STS_RESOURCE or BLK_STS_DEV_RESOURCE returns from scsi_queue_rq(). This usecase is so specialized that it doesn't warrant training an existing blk-mq interface (e.g. blk_mq_free_request) to allow SCSI to account for freeing its driver private data -- doing so would add an extra branch for handling a special case that all other consumers of SCSI (and blk-mq) won't ever need to worry about. So the most pragmatic way forward is to delegate freeing SCSI driver private data to the upper layer driver (dm-rq). Do so by adding new .cleanup_rq callback and calling a new blk_mq_cleanup_rq() method from dm-rq. A following commit will implement the .cleanup_rq() hook in scsi_mq_ops. Cc: Ewan D. Milne <emilne@redhat.com> Cc: Bart Van Assche <bvanassche@acm.org> Cc: Hannes Reinecke <hare@suse.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Mike Snitzer <snitzer@redhat.com> Cc: dm-devel@redhat.com Cc: <stable@vger.kernel.org> Fixes: `396eaf21ee` ("blk-mq: improve DM's blk-mq IO merging via blk_insert_cloned_request feedback") Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-08-04 21:42:06 -06:00
Mike Snitzer	9c50a98f55	dm table: fix various whitespace issues with recent DAX code Also, rename device_synchronous to device_dax_synchronous. Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2019-07-30 18:59:24 -04:00
Pankaj Gupta	5348deb138	dm table: fix dax_dev NULL dereference in device_synchronous() If a device doesn't support DAX its 'dax_dev' is NULL. Fix device_synchronous() to first check if dax_dev is NULL before dereferencing it. Fixes: `2e9ee0955d` ("dm: enable synchronous dax") Reported-by: jencce.kernel@gmail.com Signed-off-by: Pankaj Gupta <pagupta@redhat.com> Acked-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2019-07-30 18:58:54 -04:00
Linus Torvalds	0441281965	for-linus-20190726 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl07DGAQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgplf5EADOOvOdsz9N/Iw8ZHHHJXCqKR26zZv75G1z 0h1PGC7p0JZQbYFo0Zo7mjiRBGlg6tlXc2d4Gyl94XJKDwjeYTcFDvbvERdYa+MH d2RiFkAfR967Ri4fb+FP5L3mYOQdMJ/zk0xCDHLv/DcxeFLa5a9EJS1+vBSR+AcB 0JpJWuHypGqGmbTaL0z9q2pmx0mgA1ERlWQtkMLrsEr2Vqg/rrjGwe2bGFY00lXc vKtFkpfugKc4zVAPSzC1YZgojfDDpGNEA4QMtxMsEH4hqyMpHhrtUedNY5QrjC0B p9h6aPXXYr2KhGP0grrEytzaYUOzK2crK5h+q+1vu6nOgx2EgmnLM9tBu/LuRH1j uUzKJOa3/AE+bU7uZEsaUerTBsHrgEBa1x8G92obYRnjgW3aCD2CaSbjjBhNxTZ4 1dXyr0DTHFXZmfcfWja5tO26JTPzjwVOrwiRyU0S727UsdVJupoHiYLr5fwaDfgn /Du2I/XWvFtflm5i0ND0sdcX1yRlFiGZ9e45z1QFaFmcteKKWzRBDlC6mQzI/lw3 oc583mhDR3tRtJxow+wn6AuMUehFRh8wj0UhL/MEMjLW8GiqXU5aRtanT+22Xz4L saNDQieeEnV7raMYXMP0qIhkJtrNASmJQos+MOJAEGOWcS2ePIUUio2kSXie+071 BphJd2RamQ== =HIzH -----END PGP SIGNATURE----- Merge tag 'for-linus-20190726' of git://git.kernel.dk/linux-block Pull block fixes from Jens Axboe: - Several io_uring fixes/improvements: - Blocking fix for O_DIRECT (me) - Latter page slowness for registered buffers (me) - Fix poll hang under certain conditions (me) - Defer sequence check fix for wrapped rings (Zhengyuan) - Mismatch in async inc/dec accounting (Zhengyuan) - Memory ordering issue that could cause stall (Zhengyuan) - Track sequential defer in bytes, not pages (Zhengyuan) - NVMe pull request from Christoph - Set of hang fixes for wbt (Josef) - Redundant error message kill for libahci (Ding) - Remove unused blk_mq_sched_started_request() and related ops (Marcos) - drbd dynamic alloc shash descriptor to reduce stack use (Arnd) - blkcg ->pd_stat() non-debug print (Tejun) - bcache memory leak fix (Wei) - Comment fix (Akinobu) - BFQ perf regression fix (Paolo) * tag 'for-linus-20190726' of git://git.kernel.dk/linux-block: (24 commits) io_uring: ensure ->list is initialized for poll commands Revert "nvme-pci: don't create a read hctx mapping without read queues" nvme: fix multipath crash when ANA is deactivated nvme: fix memory leak caused by incorrect subsystem free nvme: ignore subnqn for ADATA SX6000LNP drbd: dynamically allocate shash descriptor block: blk-mq: Remove blk_mq_sched_started_request and started_request bcache: fix possible memory leak in bch_cached_dev_run() io_uring: track io length in async_list based on bytes io_uring: don't use iov_iter_advance() for fixed buffers block: properly handle IOCB_NOWAIT for async O_DIRECT IO blk-mq: allow REQ_NOWAIT to return an error inline io_uring: add a memory barrier before atomic_read rq-qos: use a mb for got_token rq-qos: set ourself TASK_UNINTERRUPTIBLE after we schedule rq-qos: don't reset has_sleepers on spurious wakeups rq-qos: fix missed wake-ups in rq_qos_throttle wait: add wq_has_single_sleeper helper block, bfq: check also in-flight I/O in dispatch plugging block: fix sysfs module parameters directory path in comment ...	2019-07-26 10:32:12 -07:00
Wei Yongjun	5d9e06d60e	bcache: fix possible memory leak in bch_cached_dev_run() memory malloced in bch_cached_dev_run() and should be freed before leaving from the error handling cases, otherwise it will cause memory leak. Fixes: `0b13efecf5` ("bcache: add return value check to bch_cached_dev_run()") Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com> Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-07-22 08:15:17 -06:00
Linus Torvalds	3bfe1fc467	- Fix zone state management race in DM zoned target by eliminating the unnecessary DMZ_ACTIVE state. - A couple fixes for issues the DM snapshot target's optional discard support added during first week of the 5.3 merge. - Increase default size of outstanding IO that is allowed for a each dm-kcopyd client and introduce tunable to allow user adjust. - Update DM core to use printk ratelimiting functions rather than duplicate them and in doing so fix an issue where DMDEBUG_LIMIT() rate limited KERN_DEBUG messages had excessive "callbacks suppressed" messages. -----BEGIN PGP SIGNATURE----- iQFHBAABCAAxFiEEJfWUX4UqZ4x1O2wixSPxCi2dA1oFAl0w2IsTHHNuaXR6ZXJA cmVkaGF0LmNvbQAKCRDFI/EKLZ0DWgPeCACMtDQrHqMTOT7OPDRxSJjZixefzL32 lFi31mjjEb7GoxiS3dBepdJmQiUwROINdGLIGTfBAlH05b/8fgFgE6iCGZ9uzad4 0PNe9q7pbtfQDLXx+mVMjEdK6P/ilmVFXCW8VQpAAeUFL+dwXYHHIbmQZ/rahOz5 8nn6wGBQ/LRRcbV0hBHNXQymIXPxMweMxO3usSuKbfhe7JjRwslThGbZ4KVwjCwl sLl5mEWXwTKUemGXsXFbCbtH/rnZpbaiAkBedT0oV8g8atRBeQyj0vk48htincj7 Uv6xGjJGXuqUkcvQnTx3C1fk3lH5xJb5MTL3WEN0g6fOmyJd1sMd0gd/ =4Rpg -----END PGP SIGNATURE----- Merge tag 'for-5.3/dm-changes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull more device mapper updates from Mike Snitzer: - Fix zone state management race in DM zoned target by eliminating the unnecessary DMZ_ACTIVE state. - A couple fixes for issues the DM snapshot target's optional discard support added during first week of the 5.3 merge. - Increase default size of outstanding IO that is allowed for a each dm-kcopyd client and introduce tunable to allow user adjust. - Update DM core to use printk ratelimiting functions rather than duplicate them and in doing so fix an issue where DMDEBUG_LIMIT() rate limited KERN_DEBUG messages had excessive "callbacks suppressed" messages. * tag 'for-5.3/dm-changes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: dm: use printk ratelimiting functions dm kcopyd: Increase default sub-job size to 512KB dm snapshot: fix oversights in optional discard support dm zoned: fix zone state management race	2019-07-18 14:49:33 -07:00
Linus Torvalds	f8c3500cd1	- virtio_pmem: The new virtio_pmem facility introduces a paravirtualized persistent memory device that allows a guest VM to use DAX mechanisms to access a host-file with host-page-cache. It arranges for MAP_SYNC to be disabled and instead triggers a host fsync() when a 'write-cache flush' command is sent to the virtual disk device. - Miscellaneous small fixups. -----BEGIN PGP SIGNATURE----- iQIcBAABAgAGBQJdMHwpAAoJEB7SkWpmfYgCUYoP/3vcgYBAaXNksyALF0iowPoP z4J0KoaOA1CzRFEQtCWUQa84CWj+XoSewwSeyrIkqKQvx/gghXblK+GVjVzBn0BD hmmiKr8af4DdxfzYdEXJp65cCpIiVMaJiGr20Aj9ObwvWJb4QZbz9q7hnPt6KgiI jVND3BpP3OERb4ZFcibdmJT5foKooMcXVG6+luVe+hc1+ZZQxJBsBaqie4brQIFq j59NX3HfHH2fr1vVwnVH0CO4tgbgYg9wZ2EivGu6wBWvORjrr7KiSSbOYP68EBtd lUoNps+vQtGnfXGwNzAjp1wuknrQYYh4/KMKjep7hiZD39rgyvBpbHbyynKzQCWV REe8cXr/nwphsENvBAUBiqY999EWVIxdT2iaVaSA6K/31JQAC5AFyxVK/P2Ke1SK rvePZ++iLQ1o4phTxQPNlVUqF9jOrFVVICGwMDqaqSkOsD9YKQdFClfOF/1ntlDz V0bs+Y0Pe8AJCd9ESep4X+vHAWRRIb4EQIuwLaX8RJoY+r1fGye9RPthpYYzvXKp DI2iJztFO3anzj2i9htNPUFIaiUmIhzEvG32O2If2yc5FL02hMpHPoFx6vHhe6s3 f8OJ+olsJK+/IIrV8+DHqYvhzylOYIhmRTvIxIxaNDPHkhR1i2RDQ6KKK1YZmsr8 MjAZ+Ym0GadDivs+wcM6 =uAMG -----END PGP SIGNATURE----- Merge tag 'libnvdimm-for-5.3' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm Pull libnvdimm updates from Dan Williams: "Primarily just the virtio_pmem driver: - virtio_pmem The new virtio_pmem facility introduces a paravirtualized persistent memory device that allows a guest VM to use DAX mechanisms to access a host-file with host-page-cache. It arranges for MAP_SYNC to be disabled and instead triggers a host fsync() when a 'write-cache flush' command is sent to the virtual disk device. - Miscellaneous small fixups" * tag 'libnvdimm-for-5.3' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: virtio_pmem: fix sparse warning xfs: disable map_sync for async flush ext4: disable map_sync for async flush dax: check synchronous mapping is supported dm: enable synchronous dax libnvdimm: add dax_dev sync flag virtio-pmem: Add virtio pmem driver libnvdimm: nd_region flush callback support libnvdimm, namespace: Drop uuid_t implementation detail	2019-07-18 10:52:08 -07:00
Nikos Tsironis	c663e04097	dm kcopyd: Increase default sub-job size to 512KB Currently, kcopyd has a sub-job size of 64KB and a maximum number of 8 sub-jobs. As a result, for any kcopyd job, we have a maximum of 512KB of I/O in flight. This upper limit to the amount of in-flight I/O under-utilizes fast devices and results in decreased throughput, e.g., when writing to a snapshotted thin LV with I/O size less than the pool's block size (so COW is performed using kcopyd). Increase kcopyd's default sub-job size to 512KB, so we have a maximum of 4MB of I/O in flight for each kcopyd job. This results in an up to 96% improvement of bandwidth when writing to a snapshotted thin LV, with I/O sizes less than the pool's block size. Also, add dm_mod.kcopyd_subjob_size_kb module parameter to allow users to fine tune the sub-job size of kcopyd. The default value of this parameter is 512KB and the maximum allowed value is 1024KB. We evaluate the performance impact of the change by running the snap_breaking_throughput benchmark, from the device mapper test suite [1]. The benchmark: 1. Creates a 1G thin LV 2. Provisions the thin LV 3. Takes a snapshot of the thin LV 4. Writes to the thin LV with: dd if=/dev/zero of=/dev/vg/thin_lv oflag=direct bs=<I/O size> Running this benchmark with various thin pool block sizes and dd I/O sizes (all combinations triggering the use of kcopyd) we get the following results: +-----------------+-------------+------------------+-----------------+ \| Pool block size \| dd I/O size \| BW before (MB/s) \| BW after (MB/s) \| +-----------------+-------------+------------------+-----------------+ \| 1 MB \| 256 KB \| 242 \| 280 \| \| 1 MB \| 512 KB \| 238 \| 295 \| \| \| \| \| \| \| 2 MB \| 256 KB \| 238 \| 354 \| \| 2 MB \| 512 KB \| 241 \| 380 \| \| 2 MB \| 1 MB \| 245 \| 394 \| \| \| \| \| \| \| 4 MB \| 256 KB \| 248 \| 412 \| \| 4 MB \| 512 KB \| 234 \| 432 \| \| 4 MB \| 1 MB \| 251 \| 474 \| \| 4 MB \| 2 MB \| 257 \| 504 \| \| \| \| \| \| \| 8 MB \| 256 KB \| 239 \| 420 \| \| 8 MB \| 512 KB \| 256 \| 431 \| \| 8 MB \| 1 MB \| 264 \| 467 \| \| 8 MB \| 2 MB \| 264 \| 502 \| \| 8 MB \| 4 MB \| 281 \| 537 \| +-----------------+-------------+------------------+-----------------+ [1] https://github.com/jthornber/device-mapper-test-suite Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2019-07-17 11:24:16 -04:00

1 2 3 4 5 ...

6028 Commits