linux

Commit Graph

Author	SHA1	Message	Date
Junichi Nomura	15b94a6904	dm: fix reload failure of 0 path multipath mapping on blk-mq devices dm-multipath accepts 0 path mapping. # echo '0 2097152 multipath 0 0 0 0' \| dmsetup create newdev Such a mapping can be used to release underlying devices while still holding requests in its queue until working paths come back. However, once the multipath device is created over blk-mq devices, it rejects reloading of 0 path mapping: # echo '0 2097152 multipath 0 0 1 1 queue-length 0 1 1 /dev/sda 1' \ \| dmsetup create mpath1 # echo '0 2097152 multipath 0 0 0 0' \| dmsetup load mpath1 device-mapper: reload ioctl on mpath1 failed: Invalid argument Command failed With following kernel message: device-mapper: ioctl: can't change device type after initial table load. DM tries to inherit the current table type using dm_table_set_type() but it doesn't work as expected because of unnecessary check about whether the target type is hybrid or not. Hybrid type is for targets that work as either request-based or bio-based and not required for blk-mq or non blk-mq checking. Fixes: `65803c2059` ("dm table: train hybrid target type detection to select blk-mq if appropriate") Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-05-29 13:41:16 -04:00
Linus Torvalds	c492e2d464	Assorted fixes for new RAID5 stripe-batching functionality. Unfortunately this functionality was merged a little prematurely. The necessary testing and code review is now complete (or as complete as it can be) and to code passes a variety of tests and looks quite sensible. Also a fix for some recent locking changes - a race was introduced which causes a reshape request to sometimes fail. No data safety issues. -----BEGIN PGP SIGNATURE----- Version: GnuPG v2 iQIVAwUAVWf/sDnsnt1WYoG5AQJ8hQ/+KIGUijacXXUXBE4QuO1DMTkltV61bk6E TJQ6fTuMvXeOuyGm+BoSFTJrOJiP6/PVxl4jnAkjLlvAK/JVKekG0PXv2flmD9EJ udK/g8d2k+4L2O0uiGdGSfOQaEaQ4OQvNmQOP9GF/FXNdyYfZbSJnxG+kzWnStGZ 3LNEMoDok9TiUDVSJ3PgibnUHYr3zNJFjBGszfRW0HqXBRWM5TI6HQ0bWwrm61mQ sIOvFeS7CVOBQWW7zkY3uvz/g7dpuPlXqmDOomF+prKlU320SrpSDDBD2Qg56rXh 8YGAzLPV8R6xB5hjGFnoHtvxF/f5Fntb3WbC5az0zv+q/phDYA9Nd2UN5APemyGB PJuxW4Ojq2DWIAvmf0HQEkvjJlqeugCgIQXJJ8yvIaBXJJjit1jMSEXjolM4vlLh h6Su/hwoyTi9NxdYpFeR6JuyHjzTyrjyBkbW8y12wVQjmDncBdKtieYZX4TvPxVz n7Qrk2bpFhR/icP6eYWCvt6iwU1e+5lXNb/18AYm9bJe5BE5/N1X0azrxbdZT4cl 1DvQw2HAMBGp+nSr+R1lqO4yX+busBZUTYsaGvH4T7Ubs+UjwgTE3tPoevj6w829 0/7r/UPfSn0XFbd5rrPY+bOBsAOIMDG5g3mj7K7+38sVeX9VOVN4sGftS5dWTr9e RQBTZAK0+qI= =Y0Vm -----END PGP SIGNATURE----- Merge tag 'md/4.1-rc5-fixes' of git://neil.brown.name/md Pull m,ore md bugfixes gfrom Neil Brown: "Assorted fixes for new RAID5 stripe-batching functionality. Unfortunately this functionality was merged a little prematurely. The necessary testing and code review is now complete (or as complete as it can be) and to code passes a variety of tests and looks quite sensible. Also a fix for some recent locking changes - a race was introduced which causes a reshape request to sometimes fail. No data safety issues" * tag 'md/4.1-rc5-fixes' of git://neil.brown.name/md: md: fix race when unfreezing sync_action md/raid5: break stripe-batches when the array has failed. md/raid5: call break_stripe_batch_list from handle_stripe_clean_event md/raid5: be more selective about distributing flags across batch. md/raid5: add handle_flags arg to break_stripe_batch_list. md/raid5: duplicate some more handle_stripe_clean_event code in break_stripe_batch_list md/raid5: remove condition test from check_break_stripe_batch_list. md/raid5: Ensure a batch member is not handled prematurely. md/raid5: close race between STRIPE_BIT_DELAY and batching. md/raid5: ensure whole batch is delayed for all required bitmap updates.	2015-05-29 10:35:21 -07:00
Mike Snitzer	e5d8de32cc	dm: fix false warning in free_rq_clone() for unmapped requests When stacking request-based dm device on non blk-mq device and device-mapper target could not map the request (error target is used, multipath target with all paths down, etc), the WARN_ON_ONCE() in free_rq_clone() will trigger when it shouldn't. The warning was added by commit `aa6df8d` ("dm: fix free_rq_clone() NULL pointer when requeueing unmapped request"). But free_rq_clone() with clone->q == NULL is valid usage for the case where dm_kill_unmapped_request() initiates request cleanup. Fix this false warning by just removing the WARN_ON -- it only generated false positives and was never useful in catching the intended case (completing clone request not being mapped e.g. clone->q being NULL). Fixes: `aa6df8d` ("dm: fix free_rq_clone() NULL pointer when requeueing unmapped request") Reported-by: Bart Van Assche <bart.vanassche@sandisk.com> Reported-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-05-29 11:07:36 -04:00
NeilBrown	56ccc1125b	md: fix race when unfreezing sync_action A recent change removed the need for locking around writing to "sync_action" (and various other places), but introduced a subtle race. When e.g. setting 'reshape' on a 'frozen' array, the 'frozen' flag is cleared before 'reshape' is set, so the md thread can get in and start trying recovery - which isn't wanted. So instead of clearing MD_RECOVERY_FROZEN for any command except 'frozen', only clear it when each specific command is parsed. This allows the handling of 'reshape' to clear the bit while a lock is held. Also remove some places where we set MD_RECOVERY_NEEDED, as it is always set on non-error exit of the function. Signed-off-by: NeilBrown <neilb@suse.de> Fixes: `6791875e2e` ("md: make reconfig_mutex optional for writes to md sysfs files.")	2015-05-28 18:04:45 +10:00
NeilBrown	626f2092c8	md/raid5: break stripe-batches when the array has failed. Once the array has too much failure, we need to break stripe-batches up so they can all be dealt with. Signed-off-by: NeilBrown <neilb@suse.de>	2015-05-28 11:48:59 +10:00
NeilBrown	787b76fa37	md/raid5: call break_stripe_batch_list from handle_stripe_clean_event Now that the code in break_stripe_batch_list() is nearly identical to the end of handle_stripe_clean_event, replace the later with a function call. The only remaining difference of any interest is the masking that is applieds to dev[i].flags copied from head_sh. R5_WriteError certainly isn't wanted as it is set per-stripe, not per-patch. R5_Overlap isn't wanted as it is explicitly handled. Signed-off-by: NeilBrown <neilb@suse.de>	2015-05-28 11:47:02 +10:00
NeilBrown	1b956f7a8f	md/raid5: be more selective about distributing flags across batch. When a batch of stripes is broken up, we keep some of the flags that were per-stripe, and copy other flags from the head to all others. This only happens while a stripe is being handled, so many of the flags are irrelevant. The "SYNC_FLAGS" (which I've renamed to make it clear there are several) and STRIPE_DEGRADED are set per-stripe and so need to be preserved. STRIPE_INSYNC is the only flag that is set on the head that needs to be propagated to all others. For safety, add a WARN_ON if others are set, except: STRIPE_HANDLE - this is safe and per-stripe and we are going to set in several cases anyway STRIPE_INSYNC STRIPE_IO_STARTED - this is just a hint and doesn't hurt. STRIPE_ON_PLUG_LIST STRIPE_ON_RELEASE_LIST - It is a point pointless for a batched stripe to be on one of these lists, but it can happen as can be safely ignored. Signed-off-by: NeilBrown <neilb@suse.de>	2015-05-28 11:40:01 +10:00
NeilBrown	3960ce7961	md/raid5: add handle_flags arg to break_stripe_batch_list. When we break a stripe_batch_list we sometimes want to set STRIPE_HANDLE on the individual stripes, and sometimes not. So pass a 'handle_flags' arg. If it is zero, always set STRIPE_HANDLE (on non-head stripes). If not zero, only set it if any of the given flags are present. Signed-off-by: NeilBrown <neilb@suse.de>	2015-05-28 11:39:30 +10:00
NeilBrown	fb642b92c2	md/raid5: duplicate some more handle_stripe_clean_event code in break_stripe_batch_list break_stripe_batch list didn't clear head_sh->batch_head. This was probably a bug. Also clear all R5_Overlap flags and if any were cleared, wake up 'wait_for_overlap'. This isn't always necessary but the worst effect is a little extra checking for code that is waiting on wait_for_overlap. Also, don't use wake_up_nr() because that does the wrong thing if 'nr' is zero, and it number of flags cleared doesn't strongly correlate with the number of threads to wake. Signed-off-by: NeilBrown <neilb@suse.de>	2015-05-28 11:36:25 +10:00
NeilBrown	4e3d62ff49	md/raid5: remove condition test from check_break_stripe_batch_list. handle_stripe_clean_event() contains a chunk of code very similar to check_break_stripe_batch_list(). If we make the latter more like the former, we can end up with just one copy of this code. This first step removed the condition (and the 'check_') part of the name. This has the added advantage of making it clear what check is being performed at the point where the function is called. Signed-off-by: NeilBrown <neilb@suse.de>	2015-05-28 11:36:06 +10:00
NeilBrown	b15a9dbdbf	md/raid5: Ensure a batch member is not handled prematurely. If a stripe is a member of a batch, but not the head, it must not be handled separately from the rest of the batch. 'clear_batch_ready()' handles this requirement to some extent but not completely. If a member is passed to handle_stripe() a second time it returns '0' indicating the stripe can be handled, which is wrong. So add an extra test. Signed-off-by: NeilBrown <neilb@suse.de>	2015-05-28 11:35:47 +10:00
NeilBrown	d0852df543	md/raid5: close race between STRIPE_BIT_DELAY and batching. When we add a write to a stripe we need to make sure the bitmap bit is set. While doing that the stripe is not locked so it could be added to a batch after which further changes to STRIPE_BIT_DELAY and ->bm_seq are ineffective. So we need to hold off adding to a stripe until bitmap_startwrite has completed at least once, and we need to avoid further changes to STRIPE_BIT_DELAY once the stripe has been added to a batch. If a bitmap_startwrite() completes after the stripe was added to a batch, it will not have set the bit, only incremented a counter, so no extra delay of the stripe is needed. Reported-by: Shaohua Li <shli@kernel.org> Signed-off-by: NeilBrown <neilb@suse.de>	2015-05-28 11:34:40 +10:00
NeilBrown	2b6b245742	md/raid5: ensure whole batch is delayed for all required bitmap updates. When we add a stripe to a batch, we need to be sure that head stripe will wait for the bitmap update required for the new stripe. Signed-off-by: NeilBrown <neilb@suse.de>	2015-05-28 11:29:14 +10:00
Mike Snitzer	45714fbed4	dm: requeue from blk-mq dm_mq_queue_rq() using BLK_MQ_RQ_QUEUE_BUSY Use BLK_MQ_RQ_QUEUE_BUSY to requeue a blk-mq request directly from the DM blk-mq device's .queue_rq. This cleans up the previous convoluted handling of request requeueing that would return BLK_MQ_RQ_QUEUE_OK (even though it wasn't) and then run blk_mq_requeue_request() followed by blk_mq_kick_requeue_list(). Also, document that DM blk-mq ontop of old request_fn devices cannot fail in clone_rq() since the clone request is preallocated as part of the pdu. Reported-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-05-27 17:37:23 -04:00
Mike Snitzer	4c6dd53dd3	dm mpath: fix leak of dm_mpath_io structure in blk-mq .queue_rq error path Otherwise kmemleak reported: unreferenced object 0xffff88009b14e2b0 (size 16): comm "fio", pid 4274, jiffies 4294978034 (age 1253.210s) hex dump (first 16 bytes): 40 12 f3 99 01 88 ff ff 00 10 00 00 00 00 00 00 @............... backtrace: [<ffffffff81600029>] kmemleak_alloc+0x49/0xb0 [<ffffffff811679a8>] kmem_cache_alloc+0xf8/0x160 [<ffffffff8111c950>] mempool_alloc_slab+0x10/0x20 [<ffffffff8111cb37>] mempool_alloc+0x57/0x150 [<ffffffffa04d2b61>] __multipath_map.isra.17+0xe1/0x220 [dm_multipath] [<ffffffffa04d2cb5>] multipath_clone_and_map+0x15/0x20 [dm_multipath] [<ffffffffa02889b5>] map_request.isra.39+0xd5/0x220 [dm_mod] [<ffffffffa028b0e4>] dm_mq_queue_rq+0x134/0x240 [dm_mod] [<ffffffff812cccb5>] __blk_mq_run_hw_queue+0x1d5/0x380 [<ffffffff812ccaa5>] blk_mq_run_hw_queue+0xc5/0x100 [<ffffffff812ce350>] blk_sq_make_request+0x240/0x300 [<ffffffff812c0f30>] generic_make_request+0xc0/0x110 [<ffffffff812c0ff2>] submit_bio+0x72/0x150 [<ffffffff811c07cb>] do_blockdev_direct_IO+0x1f3b/0x2da0 [<ffffffff811c166e>] __blockdev_direct_IO+0x3e/0x40 [<ffffffff8120aa1a>] ext4_direct_IO+0x1aa/0x390 Fixes: `e5863d9ad` ("dm: allocate requests in target when stacking on blk-mq devices") Reported-by: Bart Van Assche <bart.vanassche@sandisk.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org # 4.0+	2015-05-27 17:37:22 -04:00
Junichi Nomura	3a1407559a	dm: fix NULL pointer when clone_and_map_rq returns !DM_MAPIO_REMAPPED When stacking request-based DM on blk_mq device, request cloning and remapping are done in a single call to target's clone_and_map_rq(). The clone is allocated and valid only if clone_and_map_rq() returns DM_MAPIO_REMAPPED. The "IS_ERR(clone)" check in map_request() does not cover all the !DM_MAPIO_REMAPPED cases that are possible (E.g. if underlying devices are not ready or unavailable, clone_and_map_rq() may return DM_MAPIO_REQUEUE without ever having established an ERR_PTR). Fix this by explicitly checking for a return that is not DM_MAPIO_REMAPPED in map_request(). Without this fix, DM core may call setup_clone() for a NULL clone and oops like this: BUG: unable to handle kernel NULL pointer dereference at 0000000000000068 IP: [<ffffffff81227525>] blk_rq_prep_clone+0x7d/0x137 ... CPU: 2 PID: 5793 Comm: kdmwork-253:3 Not tainted 4.0.0-nm #1 ... Call Trace: [<ffffffffa01d1c09>] map_tio_request+0xa9/0x258 [dm_mod] [<ffffffff81071de9>] kthread_worker_fn+0xfd/0x150 [<ffffffff81071cec>] ? kthread_parkme+0x24/0x24 [<ffffffff81071cec>] ? kthread_parkme+0x24/0x24 [<ffffffff81071fdd>] kthread+0xe6/0xee [<ffffffff81093a59>] ? put_lock_stats+0xe/0x20 [<ffffffff81071ef7>] ? __init_kthread_worker+0x5b/0x5b [<ffffffff814c2d98>] ret_from_fork+0x58/0x90 [<ffffffff81071ef7>] ? __init_kthread_worker+0x5b/0x5b Fixes: `e5863d9ad` ("dm: allocate requests in target when stacking on blk-mq devices") Reported-by: Bart Van Assche <bart.vanassche@sandisk.com> Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org # 4.0+	2015-05-27 09:48:51 -04:00
Junichi Nomura	4ae9944d13	dm: run queue on re-queue Without kicking queue, requeued request may stay forever in the queue if there are no other I/O activities to the device. The original error had been in v2.6.39 with commit `7eaceaccab` ("block: remove per-queue plugging"), which replaced conditional plugging by periodic runqueue. Commit `9d1deb83d4` in v4.1-rc1 removed the periodic runqueue and the problem started to manifest. Fixes: `9d1deb83d4` ("dm: don't schedule delayed run of the queue if nothing to do") Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-05-26 09:57:36 -04:00
Linus Torvalds	a30ec4b347	md fixes for 4.1-rc4 - one serious RAID0 data corruption - caused by recent bugfix that wasn't reviewed properly. - one raid5 fix in new code (a couple more of those to come). - one little fix to stop static analysis complaining about silly rcu annotation. -----BEGIN PGP SIGNATURE----- Version: GnuPG v2 iQIVAwUAVV6rmDnsnt1WYoG5AQICpQ//VleOWnZERlVQN1nYzGHqIyPUz+dV6x8W RfzccMzUWqaHpZY/BaTMqJxPWgt6lxqqtN5F3u+XMdM+L2Pl0NXoRTqB+FzpLucR d0p76FwbW2IWnycxXUZtGjm8xT942KPdqXIP+LQRKaIhguV9P9th9WbPRvlXFfB6 pO1h1AdxVlsfn6qgBpe49pYoUuNmpwSTTJ8Gvb7YRsHXnmB/lsjQvVsWLiMqPwug +bR0gM07OvTfYenrY82ztu42+xAoVoeu/XhHmPPLEV3S/SA/9kVLgBcWjq2bBjw0 REzR2i+exEtCRe2yvetDX8320IKZcZSOE8BwBB2iUm8OVjplPxw83Bk/kB2+Zr8n VxAa9/vCZPLNCWyzOSi2V471NeFBys9PVkVIroHT5nQZGadkw0JYt1cEbcHSBKS/ NYmJCu1IM0o9DPzn8jW7sXjDcB0WE8VwFMJ09QSiEYQu2bAb7j2f1pcoweN2MeNF qOitGj/luygXg2TBvema9qPL533DUj+eSYyyO7OQBsfkfZSDM5di4bGoZBVw118Q kARaJ3IZffT8gsVpgSTCsviDv5QUcs0izVS1sMKB6/SDJjtLXBMERW+LkrmjBsnB 6+CMmwNZUIzugzBgSPBVozdNQQYjAXL7LNV6Q9YQhE0RK0+10fLLxmpovJR62bRf PWlg23+n4+E= =jXKM -----END PGP SIGNATURE----- Merge tag 'md/4.1-rc4-fixes' of git://neil.brown.name/md Pull md bugfixes from Neil Brown: "I have a few more raid5 bugfixes pending, but I want them to get a bit more review first. In the meantime: - one serious RAID0 data corruption - caused by recent bugfix that wasn't reviewed properly. - one raid5 fix in new code (a couple more of those to come). - one little fix to stop static analysis complaining about silly rcu annotation" * tag 'md/4.1-rc4-fixes' of git://neil.brown.name/md: md/bitmap: remove rcu annotation from pointer arithmetic. md/raid0: fix restore to sector variable in raid0_make_request raid5: fix broken async operation chain	2015-05-22 15:10:07 -07:00
Christoph Hellwig	5f1b670d0b	block, dm: don't copy bios for request clones Currently dm-multipath has to clone the bios for every request sent to the lower devices, which wastes cpu cycles and ties down memory. This patch instead adds a new REQ_CLONE flag that instructs req_bio_endio to not complete bios attached to a request, which we set on clone requests similar to bios in a flush sequence. With this change I/O errors on a path failure only get propagated to dm-multipath, which can then either resubmit the I/O or complete the bios on the original request. I've done some basic testing of this on a Linux target with ALUA support, and it survives path failures during I/O nicely. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-05-22 08:58:57 -06:00
Mike Snitzer	326e1dbb57	block: remove management of bi_remaining when restoring original bi_end_io Commit `c4cf5261` ("bio: skip atomic inc/dec of ->bi_remaining for non-chains") regressed all existing callers that followed this pattern: 1) saving a bio's original bi_end_io 2) wiring up an intermediate bi_end_io 3) restoring the original bi_end_io from intermediate bi_end_io 4) calling bio_endio() to execute the restored original bi_end_io The regression was due to BIO_CHAIN only ever getting set if bio_inc_remaining() is called. For the above pattern it isn't set until step 3 above (step 2 would've needed to establish BIO_CHAIN). As such the first bio_endio(), in step 2 above, never decremented __bi_remaining before calling the intermediate bi_end_io -- leaving __bi_remaining with the value 1 instead of 0. When bio_inc_remaining() occurred during step 3 it brought it to a value of 2. When the second bio_endio() was called, in step 4 above, it should've called the original bi_end_io but it didn't because there was an extra reference that wasn't dropped (due to atomic operations being optimized away since BIO_CHAIN wasn't set upfront). Fix this issue by removing the __bi_remaining management complexity for all callers that use the above pattern -- bio_chain() is the only interface that _needs_ to be concerned with __bi_remaining. For the above pattern callers just expect the bi_end_io they set to get called! Remove bio_endio_nodec() and also remove all bio_inc_remaining() calls that aren't associated with the bio_chain() interface. Also, the bio_inc_remaining() interface has been moved local to bio.c. Fixes: `c4cf5261` ("bio: skip atomic inc/dec of ->bi_remaining for non-chains") Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-05-22 08:58:55 -06:00
NeilBrown	8532e34390	md/bitmap: remove rcu annotation from pointer arithmetic. Evaluating "&mddev->disks" is simple pointer arithmetic, so it does not need 'rcu' annotations - no dereferencing is happening. Also enhance the comment to explain that 'rdev' in that case is not actually a pointer to an rdev. Reported-by: Patrick Marlier <patrick.marlier@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>	2015-05-21 09:14:41 +10:00
Eric Work	a81157768a	md/raid0: fix restore to sector variable in raid0_make_request The variable "sector" in "raid0_make_request()" was improperly updated by a call to "sector_div()" which modifies its first argument in place. Commit `47d68979cc` restored this variable after the call for later re-use. Unfortunetly the restore was done after the referenced variable "bio" was advanced. This lead to the original value and the restored value being different. Here we move this line to the proper place. One observed side effect of this bug was discarding a file though unlinking would cause an unrelated file's contents to be discarded. Signed-off-by: NeilBrown <neilb@suse.de> Fixes: `47d68979cc` ("md/raid0: fix bug with chunksize not a power of 2.") Cc: stable@vger.kernel.org (any that received above backport) URL: https://bugzilla.kernel.org/show_bug.cgi?id=98501	2015-05-21 09:14:25 +10:00
Shaohua Li	487696957e	raid5: fix broken async operation chain ops_run_reconstruct6() doesn't correctly chain asyn operations. The tx returned by async_gen_syndrome should be added as the dependent tx of next stripe. The issue is introduced by commit `59fc630b8b` RAID5: batch adjacent full stripe write Reported-and-tested-by: Maxime Ripard <maxime.ripard@free-electrons.com> Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.de>	2015-05-21 09:14:20 +10:00
Linus Torvalds	c91aa67eed	A few fixes for md. Most of these are related to the new "batched stripe writeout", but there are a few others. -----BEGIN PGP SIGNATURE----- Version: GnuPG v2 iQIVAwUAVVAgfznsnt1WYoG5AQKNsBAAw+yXVnRlLfzTPOhiIolvdmdmck6ghpV7 eTHTbDEVkDo+a+caU01mhkHtL2wEuxGqr4XHTASIM9Xjn2YLhH14HOtBPl6+IddR F58X+SN8/pK4tzc5P42AAbfGfqeJSBPSl49L94+Pd9BsagNYZOUx7c/WK5lKay6O ZQRvI3Bafg7VTzHCw931sN11yY/PS+45zpOentk5TlavRPcZfuVTh1QzSyAUvbtN sAFK2+ISflj8+TW0rLo44c3kkC7ARFKcq5lJt4WsQUntJTjiIROvqvaEP+IUZkGr y+yvCJbskJ6B03XrcuvdkFrp3Qe0yWtHCN6qKhc8AI/c/pcuz24xTYQ6V6LDyAy8 N2z6m1Kri97qxZpOONNh31WOCmVyFmwiz7+54soEeTDg3xgTb60UiZe8T+9ZcxoY sN8ksVSGGapgQUUGYcidbZla6uSWNjZq9hQgDJFUwG3HhvT3E/qiUT9kqAxfPsLE Vxw/KQ5GkW+7ldaPmGuv29hc3THjsiLY5Bx0kBlmgao3fcT61WbNcZPJOQfMxlT7 6qoJSiNoq612TaHj3X06FXhMGE/fwjwjOO+PpULxM47giHcKrVR/qlG4kREYa2vj UgV05rP/S+k3Mce4dHzvUuE57t0ADff0d/mZvhfzERjNdwTMdWgia89XLg7D06rk UGB/o2IwWug= =tapF -----END PGP SIGNATURE----- Merge tag 'md/4.1-rc3-fixes' of git://neil.brown.name/md Pull md bugfixes from Neil Brown: "A few fixes for md. Most of these are related to the new "batched stripe writeout", but there are a few others" * tag 'md/4.1-rc3-fixes' of git://neil.brown.name/md: md/raid5: fix handling of degraded stripes in batches. md/raid5: fix allocation of 'scribble' array. md/raid5: don't record new size if resize_stripes fails. md/raid5: avoid reading parity blocks for full-stripe write to degraded array md/raid5: more incorrect BUG_ON in handle_stripe_fill. md/raid5: new alloc_stripe() to allocate an initialize a stripe. md-raid0: conditional mddev->queue access to suit dm-raid	2015-05-11 10:33:31 -07:00
Linus Torvalds	5d5df5ee7c	Two additional fixes for changes introduced via DM during the 4.1 merge window: The first reverts a dm-crypt change that wasn't correct. The second fixes a device format regression that impacted userspace. -----BEGIN PGP SIGNATURE----- Version: GnuPG v1 iQEcBAABAgAGBQJVTRZOAAoJEMUj8QotnQNaW7EIAJlHn+S8czm1Cb4gWBn7kg+X vzH5NIzr/SpDX8o3R8NBdrB8rgqTm4jQZrptbmgLG+j9XoaQupuFyNCiaAw47v2G P/WYlodNwTkb3I48XjwCRo00MtR3cEJ8ywNzvEJUvgPkgMMIzhieHsVT9L8bZv3n XDs8JzZyF966U0BeCjF4oDAazUrpEvWf0h4C5L47g8C0UQI7aGwYKoSvZm3DAImP awbJbnqtQuoRcI0HISHrjYi1vghgnmJY6aSx3tYSJPTNRkFNqgap7eZrUacicnOH bUVL3snBVebK3JMJhJXgfGW/FeeP9juhEY08JNTOZ5wa6BNuru0GHeqKuI3arHY= =jlAN -----END PGP SIGNATURE----- Merge tag 'dm-4.1-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper fixes from Mike Snitzer: "Two additional fixes for changes introduced via DM during the 4.1 merge window. The first reverts a dm-crypt change that wasn't correct. The second fixes a device format regression that impacted userspace" * tag 'dm-4.1-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: init: fix regression by supporting devices with major:minor:offset format Revert "dm crypt: fix deadlock when async crypto algorithm returns -EBUSY"	2015-05-08 20:38:21 -07:00
Linus Torvalds	1daac193f2	Merge branch 'for-linus' of git://git.kernel.dk/linux-block Pull block fixes from Jens Axboe: "A collection of fixes since the merge window; - fix for a double elevator module release, from Chao Yu. Ancient bug. - the splice() MORE flag fix from Christophe Leroy. - a fix for NVMe, fixing a patch that went in in the merge window. From Keith. - two fixes for blk-mq CPU hotplug handling, from Ming Lei. - bdi vs blockdev lifetime fix from Neil Brown, fixing and oops in md. - two blk-mq fixes from Shaohua, fixing a race on queue stop and a bad merge issue with FUA writes. - division-by-zero fix for writeback from Tejun. - a block bounce page accounting fix, making sure we inc/dec after bouncing so that pre/post IO pages match up. From Wang YanQing" * 'for-linus' of git://git.kernel.dk/linux-block: splice: sendfile() at once fails for big files blk-mq: don't lose requests if a stopped queue restarts blk-mq: fix FUA request hang block: destroy bdi before blockdev is unregistered. block:bounce: fix call inc_\|dec_zone_page_state on different pages confuse value of NR_BOUNCE elevator: fix double release of elevator module writeback: use \|1 instead of +1 to protect against div by zero blk-mq: fix CPU hotplug handling blk-mq: fix race between timeout and CPU hotplug NVMe: Fix VPD B0 max sectors translation	2015-05-08 19:49:35 -07:00
NeilBrown	bb27051f9f	md/raid5: fix handling of degraded stripes in batches. There is no need for special handling of stripe-batches when the array is degraded. There may be if there is a failure in the batch, but STRIPE_DEGRADED does not imply an error. So don't set STRIPE_BATCH_ERR in ops_run_io just because the array is degraded. This actually causes a bug: the STRIPE_DEGRADED flag gets cleared in check_break_stripe_batch_list() and so the bitmap bit gets cleared when it shouldn't. So in check_break_stripe_batch_list(), split the batch up completely - again STRIPE_DEGRADED isn't meaningful. Also don't set STRIPE_BATCH_ERR when there is a write error to a replacement device. This simply removes the replacement device and requires no extra handling. Signed-off-by: NeilBrown <neilb@suse.de>	2015-05-08 18:47:57 +10:00
NeilBrown	738a273806	md/raid5: fix allocation of 'scribble' array. As the new 'scribble' array is sized based on chunk size, we need to make sure the size matches the largest of 'old' and 'new' chunk sizes when the array is undergoing reshape. We also potentially need to resize it even when not resizing the stripe cache, as chunk size can change without changing number of devices. So move the 'resize' code into a separate function, and consider old and new sizes when allocating. Signed-off-by: NeilBrown <neilb@suse.de> Fixes: `46d5b78562` ("raid5: use flex_array for scribble data")	2015-05-08 18:47:48 +10:00
NeilBrown	6e9eac2dce	md/raid5: don't record new size if resize_stripes fails. If any memory allocation in resize_stripes fails we will return -ENOMEM, but in some cases we update conf->pool_size anyway. This means that if we try again, the allocations will be assumed to be larger than they are, and badness results. So only update pool_size if there is no error. This bug was introduced in 2.6.17 and the patch is suitable for -stable. Fixes: `ad01c9e375` ("[PATCH] md: Allow stripes to be expanded in preparation for expanding an array") Cc: stable@vger.kernel.org (v2.6.17+) Signed-off-by: NeilBrown <neilb@suse.de>	2015-05-08 18:47:35 +10:00
NeilBrown	10d82c5f0d	md/raid5: avoid reading parity blocks for full-stripe write to degraded array When performing a reconstruct write, we need to read all blocks that are not being over-written .. except the parity (P and Q) blocks. The code currently reads these (as they are not being over-written!) unnecessarily. Signed-off-by: NeilBrown <neilb@suse.de> Fixes: `ea664c8245` ("md/raid5: need_this_block: tidy/fix last condition.")	2015-05-08 18:47:17 +10:00
NeilBrown	b0c783b323	md/raid5: more incorrect BUG_ON in handle_stripe_fill. It is not incorrect to call handle_stripe_fill() when a batch of full-stripe writes is active. It is, however, a BUG if fetch_block() then decides it needs to actually fetch anything. So move the 'BUG_ON' to where it belongs. Signed-off-by: NeilBrown <neilb@suse.de> Fixes: `59fc630b8b` ("RAID5: batch adjacent full stripe write")	2015-05-08 18:46:52 +10:00
NeilBrown	f18c1a35f6	md/raid5: new alloc_stripe() to allocate an initialize a stripe. The new batch_lock and batch_list fields are being initialized in grow_one_stripe() but not in resize_stripes(). This causes a crash on resize. So separate the core initialization into a new function and call it from both allocation sites. Signed-off-by: NeilBrown <neilb@suse.de> Fixes: `59fc630b8b` ("RAID5: batch adjacent full stripe write")	2015-05-08 18:40:01 +10:00
Heinz Mauelshagen	b6538fe329	md-raid0: conditional mddev->queue access to suit dm-raid This patch is a prerequisite for dm-raid "raid0" support to allow dm-raid to access the MD RAID0 personality doing unconditional accesses to mddev->queue, which is NULL in case of dm-raid stacked on top of MD. Most of the conditional mddev->queue accesses made it to upstream but this missing one, which prohibits md raid0 to set disk stack limits (being done in dm core in case of md underneath dm). Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Tested-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de>	2015-05-08 18:39:40 +10:00
Jens Axboe	dac56212e8	bio: skip atomic inc/dec of ->bi_cnt for most use cases Struct bio has a reference count that controls when it can be freed. Most uses cases is allocating the bio, which then returns with a single reference to it, doing IO, and then dropping that single reference. We can remove this atomic_dec_and_test() in the completion path, if nobody else is holding a reference to the bio. If someone does call bio_get() on the bio, then we flag the bio as now having valid count and that we must properly honor the reference count when it's being put. Tested-by: Robert Elliott <elliott@hp.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-05-05 13:32:49 -06:00
Jens Axboe	c4cf5261f8	bio: skip atomic inc/dec of ->bi_remaining for non-chains Struct bio has an atomic ref count for chained bio's, and we use this to know when to end IO on the bio. However, most bio's are not chained, so we don't need to always introduce this atomic operation as part of ending IO. Add a helper to elevate the bi_remaining count, and flag the bio as now actually needing the decrement at end_io time. Rename the field to __bi_remaining to catch any current users of this doing the incrementing manually. For high IOPS workloads, this reduces the overhead of bio_endio() substantially. Tested-by: Robert Elliott <elliott@hp.com> Acked-by: Kent Overstreet <kent.overstreet@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-05-05 13:32:47 -06:00
Rabin Vincent	c0403ec0bb	Revert "dm crypt: fix deadlock when async crypto algorithm returns -EBUSY" This reverts Linux 4.1-rc1 commit `0618764cb2`. The problem which that commit attempts to fix actually lies in the Freescale CAAM crypto driver not dm-crypt. dm-crypt uses CRYPTO_TFM_REQ_MAY_BACKLOG. This means the the crypto driver should internally backlog requests which arrive when the queue is full and process them later. Until the crypto hw's queue becomes full, the driver returns -EINPROGRESS. When the crypto hw's queue if full, the driver returns -EBUSY, and if CRYPTO_TFM_REQ_MAY_BACKLOG is set, is expected to backlog the request and process it when the hardware has queue space. At the point when the driver takes the request from the backlog and starts processing it, it calls the completion function with a status of -EINPROGRESS. The completion function is called (for a second time, in the case of backlogged requests) with a status/err of 0 when a request is done. Crypto drivers for hardware without hardware queueing use the helpers, crypto_init_queue(), crypto_enqueue_request(), crypto_dequeue_request() and crypto_get_backlog() helpers to implement this behaviour correctly, while others implement this behaviour without these helpers (ccp, for example). dm-crypt (before the patch that needs reverting) uses this API correctly. It queues up as many requests as the hw queues will allow (i.e. as long as it gets back -EINPROGRESS from the request function). Then, when it sees at least one backlogged request (gets -EBUSY), it waits till that backlogged request is handled (completion gets called with -EINPROGRESS), and then continues. The references to af_alg_wait_for_completion() and af_alg_complete() in that commit's commit message are irrelevant because those functions only handle one request at a time, unlink dm-crypt. The problem is that the Freescale CAAM driver, which that commit describes as having being tested with, fails to implement the backlogging behaviour correctly. In cam_jr_enqueue(), if the hardware queue is full, it simply returns -EBUSY without backlogging the request. What the observed deadlock was is not described in the commit message but it is obviously the wait_for_completion() in crypto_convert() where dm-crypto would wait for the completion being called with -EINPROGRESS in the case of backlogged requests. This completion will never be completed due to the bug in the CAAM driver. Commit `0618764cb2` incorrectly made dm-crypt wait for every request, even when the driver/hardware queues are not full, which means that dm-crypt will never see -EBUSY. This means that that commit will cause a performance regression on all crypto drivers which implement the API correctly. Revert it. Correct backlog handling should be implemented in the CAAM driver instead. Cc'ing stable purely because commit `0618764cb2` did. If for some reason a stable@ kernel did pick up commit `0618764cb2` it should get reverted. Signed-off-by: Rabin Vincent <rabin.vincent@axis.com> Reviewed-by: Horia Geanta <horia.geanta@freescale.com> Cc: stable@vger.kernel.org Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-05-05 12:16:43 -04:00
Mike Snitzer	aa6df8dd28	dm: fix free_rq_clone() NULL pointer when requeueing unmapped request Commit `022333427a` ("dm: optimize dm_mq_queue_rq to _not_ use kthread if using pure blk-mq") mistakenly removed free_rq_clone()'s clone->q check before testing clone->q->mq_ops. It was an oversight to discontinue that check for 1 of the 2 use-cases for free_rq_clone(): 1) free_rq_clone() called when an unmapped original request is requeued 2) free_rq_clone() called in the request-based IO completion path The clone->q check made sense for case #1 but not for #2. However, we cannot just reinstate the check as it'd mask a serious bug in the IO completion case #2 -- no in-flight request should have an uninitialized request_queue (basic block layer refcounting _should_ ensure this). The NULL pointer seen for case #1 is detailed here: https://www.redhat.com/archives/dm-devel/2015-April/msg00160.html Fix this free_rq_clone() NULL pointer by simply checking if the mapped_device's type is DM_TYPE_MQ_REQUEST_BASED (clone's queue is blk-mq) rather than checking clone->q->mq_ops. This avoids the need to dereference clone->q, but a WARN_ON_ONCE is added to let us know if an uninitialized clone request is being completed. Reported-by: Bart Van Assche <bart.vanassche@sandisk.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-04-30 10:25:21 -04:00
Christoph Hellwig	3e6180f0c8	dm: only initialize the request_queue once Commit `bfebd1cdb4` ("dm: add full blk-mq support to request-based DM") didn't properly account for the need to short-circuit re-initializing DM's blk-mq request_queue if it was already initialized. Otherwise, reloading a blk-mq request-based DM table (either manually or via multipathd) resulted in errors, see: https://www.redhat.com/archives/dm-devel/2015-April/msg00132.html Fix is to only initialize the request_queue on the initial table load (when the mapped_device type is assigned). This is better than having dm_init_request_based_blk_mq_queue() return early if the queue was already initialized because it elevates the constraint to a more meaningful location in DM core. As such the pre-existing early return in dm_init_request_based_queue() can now be removed. Fixes: `bfebd1cdb4` ("dm: add full blk-mq support to request-based DM") Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-04-30 10:25:21 -04:00
NeilBrown	6cd18e711d	block: destroy bdi before blockdev is unregistered. Because of the peculiar way that md devices are created (automatically when the device node is opened), a new device can be created and registered immediately after the blk_unregister_region(disk_devt(disk), disk->minors); call in del_gendisk(). Therefore it is important that all visible artifacts of the previous device are removed before this call. In particular, the 'bdi'. Since: commit `c4db59d31e` Author: Christoph Hellwig <hch@lst.de> fs: don't reassign dirty inodes to default_backing_dev_info moved the device_unregister(bdi->dev); call from bdi_unregister() to bdi_destroy() it has been quite easy to lose a race and have a new (e.g.) "md127" be created after the blk_unregister_region() call and before bdi_destroy() is ultimately called by the final 'put_disk', which must come after del_gendisk(). The new device finds that the bdi name is already registered in sysfs and complains > [ 9627.630029] WARNING: CPU: 18 PID: 3330 at fs/sysfs/dir.c:31 sysfs_warn_dup+0x5a/0x70() > [ 9627.630032] sysfs: cannot create duplicate filename '/devices/virtual/bdi/9:127' We can fix this by moving the bdi_destroy() call out of blk_release_queue() (which can happen very late when a refcount reaches zero) and into blk_cleanup_queue() - which happens exactly when the md device driver calls it. Then it is only necessary for md to call blk_cleanup_queue() before del_gendisk(). As loop.c devices are also created on demand by opening the device node, we make the same change there. Fixes: `c4db59d31e` Reported-by: Azat Khuzhin <a3at.mail@gmail.com> Cc: Christoph Hellwig <hch@lst.de> Cc: stable@vger.kernel.org (v4.0) Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-04-27 10:27:20 -06:00
Linus Torvalds	474095e46c	md updates for 4.1 Highlights: - "experimental" code for managing md/raid1 across a cluster using DLM. Code is not ready for general use and triggers a WARNING if used. However it is looking good and mostly done and having in mainline will help co-ordinate development. - RAID5/6 can now batch multiple (4K wide) stripe_heads so as to handle a full (chunk wide) stripe as a single unit. - RAID6 can now perform read-modify-write cycles which should help performance on larger arrays: 6 or more devices. - RAID5/6 stripe cache now grows and shrinks dynamically. The value set is used as a minimum. - Resync is now allowed to go a little faster than the 'mininum' when there is competing IO. How much faster depends on the speed of the devices, so the effective minimum should scale with device speed to some extent. -----BEGIN PGP SIGNATURE----- Version: GnuPG v2 iQIVAwUAVTbIxDnsnt1WYoG5AQIgAA//Z9FlEpkHcJJ75WGjrXgGJPyNTfEZOkoz jnD8PpBY2Afp341vMatd0XKErdEAhuPQmJMAa+tbxht6pk77X3fSzXghGA8FEafg yazn5pfBt6NXmepV4vhl/+LNYuWRxxLSA9EDm7wEg+tiO0UEts+a0w2TSXzcT1w+ 30Yi1EjAlTaJ5yjlBHUOtTXWE43D6RKnUr6FMy2dRlnzlFyDlezRMChWo9v6OkQF YqJ20FcvmJdLHY/6Yif3jvpm7eQecMqdCZENTvW/mJI86zqf6E+ToCYS1VNjfDnK ud61iU9eshu4WtNWBG6KLuHBD0grO1NaEL7/S16w1KdNJMhYgiK8WussvIAJEesA 5SlETM7Y/1XFq8puwlAq2/tuPfhZ+TFxnAwce/C3hMTDcYAACnS/R6INFQXqGvy3 nX1NLogrCycX8oqxv3jTFKLVqIVwlkSlHcUGzIWjcfCF37StcXFKI5q862agyg2+ NNocFMuXhPPM1YcB9JJSo2nCsor4e9tTdVEZlFm2B3cc8LJ9BLWUMSoi1h7VK/1g P7psnPIjz7/cdI2TZTFjGTZ0Kvhx/NTYp41AZealDNxeGWUNM+5xGZnUF8QRBc/E 0dGHtEAah834BDQFvNnJtuuh/s+KwbvswjNP+njoBsHjIQIvngDABpOwpIkdqF6r diQ2gUPnHN0= =OHG6 -----END PGP SIGNATURE----- Merge tag 'md/4.1' of git://neil.brown.name/md Pull md updates from Neil Brown: "More updates that usual this time. A few have performance impacts which hould mostly be positive, but RAID5 (in particular) can be very work-load ensitive... We'll have to wait and see. Highlights: - "experimental" code for managing md/raid1 across a cluster using DLM. Code is not ready for general use and triggers a WARNING if used. However it is looking good and mostly done and having in mainline will help co-ordinate development. - RAID5/6 can now batch multiple (4K wide) stripe_heads so as to handle a full (chunk wide) stripe as a single unit. - RAID6 can now perform read-modify-write cycles which should help performance on larger arrays: 6 or more devices. - RAID5/6 stripe cache now grows and shrinks dynamically. The value set is used as a minimum. - Resync is now allowed to go a little faster than the 'mininum' when there is competing IO. How much faster depends on the speed of the devices, so the effective minimum should scale with device speed to some extent" * tag 'md/4.1' of git://neil.brown.name/md: (58 commits) md/raid5: don't do chunk aligned read on degraded array. md/raid5: allow the stripe_cache to grow and shrink. md/raid5: change ->inactive_blocked to a bit-flag. md/raid5: move max_nr_stripes management into grow_one_stripe and drop_one_stripe md/raid5: pass gfp_t arg to grow_one_stripe() md/raid5: introduce configuration option rmw_level md/raid5: activate raid6 rmw feature md/raid6 algorithms: xor_syndrome() for SSE2 md/raid6 algorithms: xor_syndrome() for generic int md/raid6 algorithms: improve test program md/raid6 algorithms: delta syndrome functions raid5: handle expansion/resync case with stripe batching raid5: handle io error of batch list RAID5: batch adjacent full stripe write raid5: track overwrite disk count raid5: add a new flag to track if a stripe can be batched raid5: use flex_array for scribble data md raid0: access mddev->queue (request queue member) conditionally because it is not set when accessed from dm-raid md: allow resync to go faster when there is competing IO. md: remove 'go_faster' option from ->sync_request() ...	2015-04-24 09:28:01 -07:00
Eric Mei	9ffc8f7cb9	md/raid5: don't do chunk aligned read on degraded array. When array is degraded, read data landed on failed drives will result in reading rest of data in a stripe. So a single sequential read would result in same data being read twice. This patch is to avoid chunk aligned read for degraded array. The downside is to involve stripe cache which means associated CPU overhead and extra memory copy. Test Results: Following test are done on a enterprise storage node with Seagate 6T SAS drives and Xeon E5-2648L CPU (10 cores, 1.9Ghz), 10 disks MD RAID6 8+2, chunk size 128 KiB. I use FIO, using direct-io with various bs size, enough queue depth, tested sequential and 100% random read against 3 array config: 1) optimal, as baseline; 2) degraded; 3) degraded with this patch. Kernel version is 4.0-rc3. Each individual test I only did once so there might be some variations, but we just focus on big trend. Sequential Read: bs=(KiB) optimal(MiB/s) degraded(MiB/s) degraded-with-patch (MiB/s) 1024 1608 656 995 512 1624 710 956 256 1635 728 980 128 1636 771 983 64 1612 1119 1000 32 1580 1420 1004 16 1368 688 986 8 768 647 953 4 411 413 850 Random Read: bs=(KiB) optimal(IOPS) degraded(IOPS) degraded-with-patch (IOPS) 1024 163 160 156 512 274 273 272 256 426 428 424 128 576 592 591 64 726 724 726 32 849 848 837 16 900 970 971 8 927 940 929 4 948 940 955 Some notes: * In sequential + optimal, as bs size getting smaller, the FIO thread become CPU bound. * In sequential + degraded, there's big increase when bs is 64K and 32K, I don't have explanation. * In sequential + degraded-with-patch, the MD thread mostly become CPU bound. If you want to we can discuss specific data point in those data. But in general it seems with this patch, we have more predictable and in most cases significant better sequential read performance when array is degraded, and almost no noticeable impact on random read. Performance is a complicated thing, the patch works well for this particular configuration, but may not be universal. For example I imagine testing on all SSD array may have very different result. But I personally think in most cases IO bandwidth is more scarce resource than CPU. Signed-off-by: Eric Mei <eric.mei@seagate.com> Signed-off-by: NeilBrown <neilb@suse.de>	2015-04-22 08:00:43 +10:00
NeilBrown	edbe83ab4c	md/raid5: allow the stripe_cache to grow and shrink. The default setting of 256 stripe_heads is probably much too small for many configurations. So it is best to make it auto-configure. Shrinking the cache under memory pressure is easy. The only interesting part here is that we put a fairly high cost ('seeks') on shrinking the cache as the cost is greater than just having to read more data, it reduces parallelism. Growing the cache on demand needs to be done carefully. If we allow fast growth, that can upset memory balance as lots of dirty memory can quickly turn into lots of memory queued in the stripe_cache. It is important for the raid5 block device to appear congested to allow write-throttling to work. So we only add stripes slowly. We set a flag when an allocation fails because all stripes are in use, allocate at a convenient time when that flag is set, and don't allow it to be set again until at least one stripe_head has been released for re-use. This means that a spurt of requests will only cause one stripe_head to be allocated, but a steady stream of requests will slowly increase the cache size - until memory pressure puts it back again. It could take hours to reach a steady state. The value written to, and displayed in, stripe_cache_size is used as a minimum. The cache can grow above this and shrink back down to it. The actual size is not directly visible, though it can be deduced to some extent by watching stripe_cache_active. Signed-off-by: NeilBrown <neilb@suse.de>	2015-04-22 08:00:43 +10:00
NeilBrown	5423399a84	md/raid5: change ->inactive_blocked to a bit-flag. This allows us to easily add more (atomic) flags. Signed-off-by: NeilBrown <neilb@suse.de>	2015-04-22 08:00:43 +10:00
NeilBrown	486f0644c3	md/raid5: move max_nr_stripes management into grow_one_stripe and drop_one_stripe Rather than adjusting max_nr_stripes whenever {grow,drop}_one_stripe() succeeds, do it inside the functions. Also choose the correct hash to handle next inside the functions. This removes duplication and will help with future new uses of {grow,drop}_one_stripe. This also fixes a minor bug where the "md/raid:%md: allocate XXkB" message always said "0kB". Signed-off-by: NeilBrown <neilb@suse.de>	2015-04-22 08:00:42 +10:00
NeilBrown	a9683a795b	md/raid5: pass gfp_t arg to grow_one_stripe() This is needed for future improvement to stripe cache management. Signed-off-by: NeilBrown <neilb@suse.de>	2015-04-22 08:00:42 +10:00
Markus Stockhausen	d06f191f8e	md/raid5: introduce configuration option rmw_level Depending on the available coding we allow optimized rmw logic for write operations. To support easier testing this patch allows manual control of the rmw/rcw descision through the interface /sys/block/mdX/md/rmw_level. The configuration can handle three levels of control. rmw_level=0: Disable rmw for all RAID types. Hardware assisted P/Q calculation has no implementation path yet to factor in/out chunks of a syndrome. Enforcing this level can be benefical for slow CPUs with hardware syndrome support and fast SSDs. rmw_level=1: Estimate rmw IOs and rcw IOs. Execute rmw only if we will save IOs. This equals the "old" unpatched behaviour and will be the default. rmw_level=2: Execute rmw even if calculated IOs for rmw and rcw are equal. We might have higher CPU consumption because of calculating the parity twice but it can be benefical otherwise. E.g. RAID4 with fast dedicated parity disk/SSD. The option is implemented just to be forward-looking and will ONLY work with this patch! Signed-off-by: Markus Stockhausen <stockhausen@collogia.de> Signed-off-by: NeilBrown <neilb@suse.de>	2015-04-22 08:00:42 +10:00
Markus Stockhausen	584acdd49c	md/raid5: activate raid6 rmw feature Glue it altogehter. The raid6 rmw path should work the same as the already existing raid5 logic. So emulate the prexor handling/flags and split functions as needed. 1) Enable xor_syndrome() in the async layer. 2) Split ops_run_prexor() into RAID4/5 and RAID6 logic. Xor the syndrome at the start of a rmw run as we did it before for the single parity. 3) Take care of rmw run in ops_run_reconstruct6(). Again process only the changed pages to get syndrome back into sync. 4) Enhance set_syndrome_sources() to fill NULL pages if we are in a rmw run. The lower layers will calculate start & end pages from that and call the xor_syndrome() correspondingly. 5) Adapt the several places where we ignored Q handling up to now. Performance numbers for a single E5630 system with a mix of 10 7200k desktop/server disks. 300 seconds random write with 8 threads onto a 3,2TB (10*400GB) RAID6 64K chunk without spare (group_thread_cnt=4) bsize rmw_level=1 rmw_level=0 rmw_level=1 rmw_level=0 skip_copy=1 skip_copy=1 skip_copy=0 skip_copy=0 4K 115 KB/s 141 KB/s 165 KB/s 140 KB/s 8K 225 KB/s 275 KB/s 324 KB/s 274 KB/s 16K 434 KB/s 536 KB/s 640 KB/s 534 KB/s 32K 751 KB/s 1,051 KB/s 1,234 KB/s 1,045 KB/s 64K 1,339 KB/s 1,958 KB/s 2,282 KB/s 1,962 KB/s 128K 2,673 KB/s 3,862 KB/s 4,113 KB/s 3,898 KB/s 256K 7,685 KB/s 7,539 KB/s 7,557 KB/s 7,638 KB/s 512K 19,556 KB/s 19,558 KB/s 19,652 KB/s 19,688 Kb/s Signed-off-by: Markus Stockhausen <stockhausen@collogia.de> Signed-off-by: NeilBrown <neilb@suse.de>	2015-04-22 08:00:42 +10:00
shli@kernel.org	dabc4ec6ba	raid5: handle expansion/resync case with stripe batching expansion/resync can grab a stripe when the stripe is in batch list. Since all stripes in batch list must be in the same state, we can't allow some stripes run into expansion/resync. So we delay expansion/resync for stripe in batch list. Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: NeilBrown <neilb@suse.de>	2015-04-22 08:00:41 +10:00
shli@kernel.org	72ac733015	raid5: handle io error of batch list If io error happens in any stripe of a batch list, the batch list will be split, then normal process will run for the stripes in the list. Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: NeilBrown <neilb@suse.de>	2015-04-22 08:00:41 +10:00
shli@kernel.org	59fc630b8b	RAID5: batch adjacent full stripe write stripe cache is 4k size. Even adjacent full stripe writes are handled in 4k unit. Idealy we should use big size for adjacent full stripe writes. Bigger stripe cache size means less stripes runing in the state machine so can reduce cpu overhead. And also bigger size can cause bigger IO size dispatched to under layer disks. With below patch, we will automatically batch adjacent full stripe write together. Such stripes will be added to the batch list. Only the first stripe of the list will be put to handle_list and so run handle_stripe(). Some steps of handle_stripe() are extended to cover all stripes of the list, including ops_run_io, ops_run_biodrain and so on. With this patch, we have less stripes running in handle_stripe() and we send IO of whole stripe list together to increase IO size. Stripes added to a batch list have some limitations. A batch list can only include full stripe write and can't cross chunk boundary to make sure stripes have the same parity disks. Stripes in a batch list must be in the same state (no written, toread and so on). If a stripe is in a batch list, all new read/write to add_stripe_bio will be blocked to overlap conflict till the batch list is handled. The limitations will make sure stripes in a batch list be in exactly the same state in the life circly. I did test running 160k randwrite in a RAID5 array with 32k chunk size and 6 PCIe SSD. This patch improves around 30% performance and IO size to under layer disk is exactly 32k. I also run a 4k randwrite test in the same array to make sure the performance isn't changed with the patch. Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: NeilBrown <neilb@suse.de>	2015-04-22 08:00:41 +10:00
shli@kernel.org	7a87f43405	raid5: track overwrite disk count Track overwrite disk count, so we can know if a stripe is a full stripe write. Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: NeilBrown <neilb@suse.de>	2015-04-22 08:00:41 +10:00
shli@kernel.org	da41ba6597	raid5: add a new flag to track if a stripe can be batched A freshly new stripe with write request can be batched. Any time the stripe is handled or new read is queued, the flag will be cleared. Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: NeilBrown <neilb@suse.de>	2015-04-22 08:00:41 +10:00
shli@kernel.org	46d5b78562	raid5: use flex_array for scribble data Use flex_array for scribble data. Next patch will batch several stripes together, so scribble data should be able to cover several stripes, so this patch also allocates scribble data for stripes across a chunk. Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: NeilBrown <neilb@suse.de>	2015-04-22 08:00:41 +10:00
Heinz Mauelshagen	753f2856cd	md raid0: access mddev->queue (request queue member) conditionally because it is not set when accessed from dm-raid The patch makes 3 references to mddev->queue in the raid0 personality conditional in order to allow for it to be accessed from dm-raid. Mandatory, because md instances underneath dm-raid don't manage a request queue of their own which'd lead to oopses without the patch. Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Tested-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de>	2015-04-22 08:00:41 +10:00
NeilBrown	ac8fa4196d	md: allow resync to go faster when there is competing IO. When md notices non-sync IO happening while it is trying to resync (or reshape or recover) it slows down to the set minimum. The default minimum might have made sense many years ago but the drives have become faster. Changing the default to match the times isn't really a long term solution. This patch changes the code so that instead of waiting until the speed has dropped to the target, it just waits until pending requests have completed. This means that the delay inserted is a function of the speed of the devices. Testing shows that: - for some loads, the resync speed is unchanged. For those loads increasing the minimum doesn't change the speed either. So this is a good result. To increase resync speed under such loads we would probably need to increase the resync window size. - for other loads, resync speed does increase to a reasonable fraction (e.g. 20%) of maximum possible, and throughput of the load only drops a little bit (e.g. 10%) - for other loads, throughput of the non-sync load drops quite a bit more. These seem to be latency-sensitive loads. So it isn't a perfect solution, but it is mostly an improvement. Signed-off-by: NeilBrown <neilb@suse.de>	2015-04-22 08:00:40 +10:00
NeilBrown	09314799e4	md: remove 'go_faster' option from ->sync_request() This option is not well justified and testing suggests that it hardly ever makes any difference. The comment suggests there might be a need to wait for non-resync activity indicated by ->nr_waiting, however raise_barrier() already waits for all of that. So just remove it to simplify reasoning about speed limiting. This allows us to remove a 'FIXME' comment from raid5.c as that never used the flag. Signed-off-by: NeilBrown <neilb@suse.de>	2015-04-22 08:00:40 +10:00
NeilBrown	50c37b136a	md: don't require sync_min to be a multiple of chunk_size. There is really no need for sync_min to be a multiple of chunk_size, and values read from here often aren't. That means you cannot read a value and expect to be able to write it back later. So remove the chunk_size check, and round down to a multiple of 4K, to be sure everything works with 4K-sector devices. Signed-off-by: NeilBrown <neilb@suse.de>	2015-04-22 08:00:40 +10:00
NeilBrown	d51e4fe6d6	Merge branch 'cluster' into for-next	2015-04-22 08:00:20 +10:00
Goldwyn Rodrigues	97f6cd39da	md-cluster: re-add capabilities When "re-add" is writted to /sys/block/mdXX/md/dev-YYY/state, the clustered md: 1. Sends RE_ADD message with the desc_nr. Nodes receiving the message clear the Faulty bit in their respective rdev->flags. 2. The node initiating re-add, gathers the bitmaps of all nodes and copies them into the local bitmap. It does not clear the bitmap from which it is copying. 3. Initiating node schedules a md recovery to sync the devices. Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: NeilBrown <neilb@suse.de>	2015-04-22 07:59:39 +10:00
Goldwyn Rodrigues	a6da4ef85c	md: re-add a failed disk This adds the capability of re-adding a failed disk by writing "re-add" to /sys/block/mdXX/md/dev-YYY/state. This facilitates adding disks which have encountered a temporary error such as a network disconnection/hiccup in an iSCSI device, or a SAN cable disconnection which has been restored. In such a situation, you do not need to remove and re-add the device. Writing re-add to the failed device's state would add it again to the array and perform the recovery of only the blocks which were written after the device failed. This works for generic md, and is not related to clustering. However, this patch is to ease re-add operations listed above in clustering environments. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: NeilBrown <neilb@suse.de>	2015-04-22 07:59:39 +10:00
Goldwyn Rodrigues	88bcfef7be	md-cluster: remove capabilities This adds "remove" capabilities for the clustered environment. When a user initiates removal of a device from the array, a REMOVE message with disk number in the array is sent to all the nodes which kick the respective device in their own array. This facilitates the removal of failed devices. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: NeilBrown <neilb@suse.de>	2015-04-22 07:59:39 +10:00
Goldwyn Rodrigues	57d051dcca	md: Export and rename find_rdev_nr_rcu This is required by the clustering module (patches to follow) to find the device to remove or re-add. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: NeilBrown <neilb@suse.de>	2015-04-22 07:59:39 +10:00
Goldwyn Rodrigues	fb56dfef4e	md: Export and rename kick_rdev_from_array This export is required for clustering module in order to co-ordinate remove/readd a rdev from all nodes. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: NeilBrown <neilb@suse.de>	2015-04-22 07:59:39 +10:00
Guoqing Jiang	8c58f02e24	md-cluster: correct the num for comparison Since the node num of md-cluster is from zero, and cinfo->slot_number represents the slot num of dlm, no need to check for equality. Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: NeilBrown <neilb@suse.de>	2015-04-22 07:58:31 +10:00
Linus Torvalds	afad97eee4	- Most extensive changes this cycle are the DM core improvements to add full blk-mq support to request-based DM. - disabled by default but user can opt-in with CONFIG_DM_MQ_DEFAULT - depends on some blk-mq changes from Jens' for-4.1/core branch so that explains why this pull is built on linux-block.git - Update DM to use name_to_dev_t() rather than open-coding a less capable device parser. - includes a couple small improvements to name_to_dev_t() that offer stricter constraints that DM's code provided. - Improvements to the dm-cache "mq" cache replacement policy. - A DM crypt crypt_ctr() error path fix and an async crypto deadlock fix. - A small efficiency improvement for DM crypt decryption by leveraging immutable biovecs. - Add error handling modes for corrupted blocks to DM verity. - A new "log-writes" DM target from Josef Bacik that is meant for file system developers to test file system integrity at particular points in the life of a file system. - A few DM log userspace cleanups and fixes. - A few Documentation fixes (for thin, cache, crypt and switch). -----BEGIN PGP SIGNATURE----- Version: GnuPG v1 iQEcBAABAgAGBQJVMGvhAAoJEMUj8QotnQNaOKgH/iuWyEZ+ndWbeEBZs9ZJPEgU sRiKjSaPujkdB+78o+2DfAYjwCchqH1Oy781PqG85asTENVkcjD7LBPaIom/A2ml Z9ZeCzAKbrtB3lZR8QAg4u7+90kkuZv1SeOPMWBZCEV4GYLEpV1J4/f+RgdEs1kT upQcH1fFoQdyHGikwAKXf85Gmi4OrjVb19yoxu2xZnfwT2sCPq0okU4yBxb1B9mL X/FcHUeLS9LJGewEqTD3AvWrk+J5pqj+1EeKY2kRxisp1065TaljYwetgR76Vnfb o0J5eovJVL3AKRTYxvr5JABWD09xq9nG1hVBiuQMN5VmmvxjDKdbBPaKI4dZTt0= =uVjp -----END PGP SIGNATURE----- Merge tag 'dm-4.1-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper updates from Mike Snitzer: - the most extensive changes this cycle are the DM core improvements to add full blk-mq support to request-based DM. - disabled by default but user can opt-in with CONFIG_DM_MQ_DEFAULT - depends on some blk-mq changes from Jens' for-4.1/core branch so that explains why this pull is built on linux-block.git - update DM to use name_to_dev_t() rather than open-coding a less capable device parser. - includes a couple small improvements to name_to_dev_t() that offer stricter constraints that DM's code provided. - improvements to the dm-cache "mq" cache replacement policy. - a DM crypt crypt_ctr() error path fix and an async crypto deadlock fix - a small efficiency improvement for DM crypt decryption by leveraging immutable biovecs - add error handling modes for corrupted blocks to DM verity - a new "log-writes" DM target from Josef Bacik that is meant for file system developers to test file system integrity at particular points in the life of a file system - a few DM log userspace cleanups and fixes - a few Documentation fixes (for thin, cache, crypt and switch) * tag 'dm-4.1-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (34 commits) dm crypt: fix missing error code return from crypt_ctr error path dm crypt: fix deadlock when async crypto algorithm returns -EBUSY dm crypt: leverage immutable biovecs when decrypting on read dm crypt: update URLs to new cryptsetup project page dm: add log writes target dm table: use bool function return values of true/false not 1/0 dm verity: add error handling modes for corrupted blocks dm thin: remove stale 'trim' message documentation dm delay: use msecs_to_jiffies for time conversion dm log userspace base: fix compile warning dm log userspace transfer: match wait_for_completion_timeout return type dm table: fall back to getting device using name_to_dev_t() init: stricter checking of major:minor root= values init: export name_to_dev_t and mark name argument as const dm: add 'use_blk_mq' module param and expose in per-device ro sysfs attr dm: optimize dm_mq_queue_rq to _not_ use kthread if using pure blk-mq dm: add full blk-mq support to request-based DM dm: impose configurable deadline for dm_request_fn's merge heuristic dm sysfs: introduce ability to add writable attributes dm: don't start current request if it would've merged with the previous ...	2015-04-18 08:14:18 -04:00
Wei Yongjun	44c144f9c8	dm crypt: fix missing error code return from crypt_ctr error path Fix to return a negative error code from crypt_ctr()'s optional parameter processing error path. Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-04-16 22:00:50 -04:00
Ben Collins	0618764cb2	dm crypt: fix deadlock when async crypto algorithm returns -EBUSY I suspect this doesn't show up for most anyone because software algorithms typically don't have a sense of being too busy. However, when working with the Freescale CAAM driver it will return -EBUSY on occasion under heavy -- which resulted in dm-crypt deadlock. After checking the logic in some other drivers, the scheme for crypt_convert() and it's callback, kcryptd_async_done(), were not correctly laid out to properly handle -EBUSY or -EINPROGRESS. Fix this by using the completion for both -EBUSY and -EINPROGRESS. Now crypt_convert()'s use of completion is comparable to af_alg_wait_for_completion(). Similarly, kcryptd_async_done() follows the pattern used in af_alg_complete(). Before this fix dm-crypt would lockup within 1-2 minutes running with the CAAM driver. Fix was regression tested against software algorithms on PPC32 and x86_64, and things seem perfectly happy there as well. Signed-off-by: Ben Collins <ben.c@servergy.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org	2015-04-15 12:10:26 -04:00
Mike Snitzer	5977907937	dm crypt: leverage immutable biovecs when decrypting on read Commit `003b5c571` ("block: Convert drivers to immutable biovecs") stopped short of changing dm-crypt to leverage the fact that the biovec array of a bio will no longer be modified. Switch to using bio_clone_fast() when cloning bios for decryption after read. Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-04-15 12:10:25 -04:00
Milan Broz	e44f23b32d	dm crypt: update URLs to new cryptsetup project page Cryptsetup home page moved to GitLab. Also remove link to abandonded Truecrypt page. Signed-off-by: Milan Broz <gmazyland@gmail.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-04-15 12:10:24 -04:00
Josef Bacik	0e9cebe724	dm: add log writes target Introduce a new target that is meant for file system developers to test file system integrity at particular points in the life of a file system. We capture all write requests and associated data and log them to a separate device for later replay. There is a userspace utility to do this replay. The idea behind this is to give file system developers a tool to verify that the file system is always consistent. Signed-off-by: Josef Bacik <jbacik@fb.com> Reviewed-by: Zach Brown <zab@zabbo.net> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-04-15 12:10:24 -04:00
Joe Perches	7f61f5a022	dm table: use bool function return values of true/false not 1/0 Use the normal return values for bool functions. Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-04-15 12:10:23 -04:00
Sami Tolvanen	65ff5b7ddf	dm verity: add error handling modes for corrupted blocks Add device specific modes to dm-verity to specify how corrupted blocks should be handled. The following modes are defined: - DM_VERITY_MODE_EIO is the default behavior, where reading a corrupted block results in -EIO. - DM_VERITY_MODE_LOGGING only logs corrupted blocks, but does not block the read. - DM_VERITY_MODE_RESTART calls kernel_restart when a corrupted block is discovered. In addition, each mode sends a uevent to notify userspace of corruption and to allow further recovery actions. The driver defaults to previous behavior (DM_VERITY_MODE_EIO) and other modes can be enabled with an additional parameter to the verity table. Signed-off-by: Sami Tolvanen <samitolvanen@google.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-04-15 12:10:22 -04:00
Nicholas Mc Guire	aca607ba24	dm delay: use msecs_to_jiffies for time conversion Converting milliseconds to jiffies by "val * HZ / 1000" is technically OK but msecs_to_jiffies(val) is the cleaner solution and handles all corner cases correctly. Signed-off-by: Nicholas Mc Guire <hofrat@osadl.org> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-04-15 12:10:21 -04:00
Nicholas Mc Guire	18cc980ac8	dm log userspace base: fix compile warning This fixes up a compile warning [-Wunused-but-set-variable] - given the comment in userspace_set_region_sync() the non-reporting of errors is intentional so the return value can be dropped to make gcc happy. Also, fix typo in comment. Signed-off-by: Nicholas Mc Guire <hofrat@osadl.org> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-04-15 12:10:20 -04:00
Nicholas Mc Guire	c32a512fdf	dm log userspace transfer: match wait_for_completion_timeout return type Return type of wait_for_completion_timeout() is unsigned long not int. An appropriately named unsigned long is added and the assignment fixed. Signed-off-by: Nicholas Mc Guire <hofrat@osadl.org> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-04-15 12:10:20 -04:00
Dan Ehrenberg	644bda6f34	dm table: fall back to getting device using name_to_dev_t() If a device is used as the root filesystem, it can't be built off of devices which are within the root filesystem (just like command line arguments to root=). For this reason, Linux has a pseudo-filesystem for root= and MD initialization (based on the function name_to_dev_t) which handles different ways of specifying devices including PARTUUID and major:minor. Switch to using name_to_dev_t() in dm_get_device(). Rather than having DM assume that all things which are not major:minor are paths in an already-mounted filesystem, change dm_get_device() to first attempt to look up the device in the filesystem, and if not found it will fall back to using name_to_dev_t(). In terms of backwards compatibility, there are some cases where behavior will be different: - If you have a file in the current working directory named 1:2 and you initialze DM there, then it will try to use that file rather than the disk with that major:minor pair as a backing device. - Similarly for other bdev types which name_to_dev_t() knows how to interpret, the previous behavior was to repeatedly check for the existence of the file (e.g., while waiting for rootfs to come up) but the new behavior is to use the name_to_dev_t() interpretation. For example, if you have a file named /dev/ubiblock0_0 which is a symlink to /dev/sda3, but it is not yet present when DM starts to initialize, then the name_to_dev_t() interpretation will take precedence. These incompatibilities would only show up in really strange setups with bad practices so we shouldn't have to worry about them. Signed-off-by: Dan Ehrenberg <dehrenberg@chromium.org> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-04-15 12:10:19 -04:00
Mike Snitzer	17e149b8f7	dm: add 'use_blk_mq' module param and expose in per-device ro sysfs attr Request-based DM's blk-mq support defaults to off; but a user can easily change the default using the dm_mod.use_blk_mq module/boot option. Also, you can check what mode a given request-based DM device is using with: cat /sys/block/dm-X/dm/use_blk_mq This change enabled further cleanup and reduced work (e.g. the md->io_pool and md->rq_pool isn't created if using blk-mq). Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-04-15 12:10:17 -04:00
Mike Snitzer	022333427a	dm: optimize dm_mq_queue_rq to _not_ use kthread if using pure blk-mq dm_mq_queue_rq() is in atomic context so care must be taken to not sleep -- as such GFP_ATOMIC is used for the md->bs bioset allocations and dm-mpath's call to blk_get_request(). In the future the bioset allocations will hopefully go away (by removing support for partial completions of bios in a cloned request). Also prepare for supporting DM blk-mq ontop of old-style request_fn device(s) if a new dm-mod 'use_blk_mq' parameter is set. The kthread will still be used to queue work if blk-mq is used ontop of old-style request_fn device(s). Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-04-15 12:10:17 -04:00
Mike Snitzer	bfebd1cdb4	dm: add full blk-mq support to request-based DM Commit `e5863d9ad` ("dm: allocate requests in target when stacking on blk-mq devices") served as the first step toward fully utilizing blk-mq in request-based DM -- it enabled stacking an old-style (request_fn) request_queue ontop of the underlying blk-mq device(s). That first step didn't improve performance of DM multipath ontop of fast blk-mq devices (e.g. NVMe) because the top-level old-style request_queue was severely limited by the queue_lock. The second step offered here enables stacking a blk-mq request_queue ontop of the underlying blk-mq device(s). This unlocks significant performance gains on fast blk-mq devices, Keith Busch tested on his NVMe testbed and offered this really positive news: "Just providing a performance update. All my fio tests are getting roughly equal performance whether accessed through the raw block device or the multipath device mapper (~470k IOPS). I could only push ~20% of the raw iops through dm before this conversion, so this latest tree is looking really solid from a performance standpoint." Signed-off-by: Mike Snitzer <snitzer@redhat.com> Tested-by: Keith Busch <keith.busch@intel.com>	2015-04-15 12:10:16 -04:00
Mike Snitzer	0ce65797a7	dm: impose configurable deadline for dm_request_fn's merge heuristic Otherwise, for sequential workloads, the dm_request_fn can allow excessive request merging at the expense of increased service time. Add a per-device sysfs attribute to allow the user to control how long a request, that is a reasonable merge candidate, can be queued on the request queue. The resolution of this request dispatch deadline is in microseconds (ranging from 1 to 100000 usecs), to set a 20us deadline: echo 20 > /sys/block/dm-7/dm/rq_based_seq_io_merge_deadline The dm_request_fn's merge heuristic and associated extra accounting is disabled by default (rq_based_seq_io_merge_deadline is 0). This sysfs attribute is not applicable to bio-based DM devices so it will only ever report 0 for them. By allowing a request to remain on the queue it will block others requests on the queue. But introducing a short dequeue delay has proven very effective at enabling certain sequential IO workloads on really fast, yet IOPS constrained, devices to build up slightly larger IOs -- yielding 90+% throughput improvements. Having precise control over the time taken to wait for larger requests to build affords control beyond that of waiting for certain IO sizes to accumulate (which would require a deadline anyway). This knob will only ever make sense with sequential IO workloads and the particular value used is storage configuration specific. Given the expected niche use-case for when this knob is useful it has been deemed acceptable to expose this relatively crude method for crafting optimal IO on specific storage -- especially given the solution is simple yet effective. In the context of DM multipath, it is advisable to tune this sysfs attribute to a value that offers the best performance for the common case (e.g. if 4 paths are expected active, tune for that; if paths fail then performance may be slightly reduced). Alternatives were explored to have request-based DM autotune this value (e.g. if/when paths fail) but they were quickly deemed too fragile and complex to warrant further design and development time. If this problem proves more common as faster storage emerges we'll have to look at elevating a generic solution into the block core. Tested-by: Shiva Krishna Merla <shivakrishna.merla@netapp.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-04-15 12:10:15 -04:00
Mike Snitzer	b898320d68	dm sysfs: introduce ability to add writable attributes Add DM_ATTR_RW() macro and establish .store method in dm_sysfs_ops. Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-04-15 12:10:15 -04:00
Mike Snitzer	de3ec86dff	dm: don't start current request if it would've merged with the previous Request-based DM's dm_request_fn() is so fast to pull requests off the queue that steps need to be taken to promote merging by avoiding request processing if it makes sense. If the current request would've merged with previous request let the current request stay on the queue longer. Suggested-by: Jens Axboe <axboe@fb.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-04-15 12:10:14 -04:00
Mike Snitzer	d548b34b06	dm: reduce the queue delay used in dm_request_fn from 100ms to 10ms Commit `7eaceaccab` ("block: remove per-queue plugging") didn't justify DM's use of a 100ms delay; such an extended delay is a liability when there is reason to re-kick the queue. Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-04-15 12:10:13 -04:00
Mike Snitzer	9d1deb83d4	dm: don't schedule delayed run of the queue if nothing to do In request-based DM's dm_request_fn(), if blk_peek_request() returns NULL just return. Avoids unnecessary blk_delay_queue(). Reported-by: Jens Axboe <axboe@fb.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-04-15 12:10:13 -04:00
Mike Snitzer	9a0e609e3f	dm: only run the queue on completion if congested or no requests pending On really fast storage it can be beneficial to delay running the request_queue to allow the elevator more opportunity to merge requests. Otherwise, it has been observed that requests are being sent to q->request_fn much quicker than is ideal on IOPS-bound backends. Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-04-15 12:10:12 -04:00
Mike Snitzer	ff36ab3458	dm: remove request-based logic from make_request_fn wrapper The old dm_request() method used for q->make_request_fn had a branch for request-based DM support but it isn't needed given that dm_init_request_based_queue() sets it to the standard blk_queue_bio() anyway. Cleanup dm_init_md_queue() to be DM device-type agnostic and have dm_setup_md_queue() properly finish queue setup based on DM device-type (bio-based vs request-based). A followup block patch can be made to remove the export for blk_queue_bio() now that DM no longer calls it directly. Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-04-15 12:08:48 -04:00
NeilBrown	47d68979cc	md/raid0: fix bug with chunksize not a power of 2. Since commit `20d0189b10` in v3.14-rc1 RAID0 has performed incorrect calculations when the chunksize is not a power of 2. This happens because "sector_div()" modifies its first argument, but this wasn't taken into account in the patch. So restore that first arg before re-using the variable. Reported-by: Joe Landman <joe.landman@gmail.com> Reported-by: Dave Chinner <david@fromorbit.com> Fixes: `20d0189b10` Cc: stable@vger.kernel.org (3.14 and later). Signed-off-by: NeilBrown <neilb@suse.de>	2015-04-10 15:36:31 +10:00
Gu Zheng	74672d069b	md: fix md io stats accounting broken Simon reported the md io stats accounting issue: " I'm seeing "iostat -x -k 1" print this after a RAID1 rebuild on 4.0-rc5. It's not abnormal other than it's 3-disk, with one being SSD (sdc) and the other two being write-mostly: Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util sda 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 sdb 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 sdc 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 md0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 345.00 0.00 0.00 0.00 0.00 100.00 md2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 58779.00 0.00 0.00 0.00 0.00 100.00 md1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 12.00 0.00 0.00 0.00 0.00 100.00 " The cause is commit "18c0b223cf9901727ef3b02da6711ac930b4e5d4" uses the generic_start_io_acct to account the disk stats rather than the open code, but it also introduced the increase to .in_flight[rw] which is needless to md. So we re-use the open code here to fix it. Reported-by: Simon Kirby <sim@hostway.ca> Cc: <stable@vger.kernel.org> 3.19 Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com> Signed-off-by: NeilBrown <neilb@suse.de>	2015-04-08 12:53:00 +10:00
Mike Snitzer	d56b9b28a4	dm: remove request-based DM queue's lld_busy_fn hook DM multipath is the only caller of blk_lld_busy() -- which calls a queue's lld_busy_fn hook. Request-based DM doesn't support stacking multipath devices so there is no reason to register the lld_busy_fn hook on a multipath device's queue using blk_queue_lld_busy(). As such, remove functions dm_lld_busy and dm_table_any_busy_target. Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-03-31 12:03:49 -04:00
Mike Snitzer	52b09914af	dm: remove unnecessary wrapper around blk_lld_busy There is no need for DM to export a wrapper around the already exported blk_lld_busy(). Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-03-31 12:03:49 -04:00
Mike Snitzer	09c2d53101	dm: rename __dm_get_reserved_ios() helper to __dm_get_module_param() __dm_get_module_param() could be useful for future DM module parameters besides those related to "reserved_ios". Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-03-31 12:03:49 -04:00
Joe Thornber	e65ff8703f	dm cache policy mq: try not to writeback data that changed in the last second Writeback takes out a lock on the cache block, so will increase the latency for any concurrent io. This patch works by placing 2 sentinel objects on each level of the multiqueues. Every WRITEBACK_PERIOD the oldest sentinel gets moved to the newest end of the queue level. When looking for writeback work: if less than 25% of the cache is clean: we select the oldest object with the lowest hit count otherwise: we select the oldest object that is not past a writeback sentinel. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-03-31 12:03:48 -04:00
Joe Thornber	fdecee3224	dm cache policy mq: remove unused generation member of struct entry Remove to stop wasting memory. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-03-31 12:03:48 -04:00
Joe Thornber	3e45c91e5c	dm cache policy mq: track entries hit this 'tick' via sentinel objects A sentinel object is placed on each level of the multiqueues. When an object is hit it is requeued behind the sentinel. When the tick is incremented we iterate through all objects behind the sentinel and update the hit_count, then reposition the sentinel at the very back. This saves memory by avoiding tracking the tick explicitly for every struct entry object in the multiqueues. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-03-31 12:03:48 -04:00
Joe Thornber	c74ffc5c63	dm cache policy mq: remove queue_shift_down() queue_shift_down() didn't adjust the hit_counts to the new levels, so it just had the effect of scrambling levels. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-03-31 12:03:48 -04:00
Joe Thornber	75da39bf25	dm cache policy mq: keep track of the number of entries in a multiqueue Small optimisation, now queue_empty() doesn't need to walk all levels of the multiqueue. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-03-31 12:03:48 -04:00
Mike Snitzer	ac1f9ef211	dm log userspace: split flush_entry_pool to be per dirty-log Use a single slab cache to allocate a mempool for each dirty-log. This _should_ eliminate DM's need for io_schedule_timeout() in mempool_alloc(); so io_schedule() should be sufficient now. Also, rename struct flush_entry to dm_dirty_log_flush_entry to allow KMEM_CACHE() to create a meaningful global name for the slab cache. Also, eliminate some holes in struct log_c by rearranging members. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Acked-by: Heinz Mauelshagen <heinzm@redhat.com>	2015-03-31 12:03:47 -04:00
Linus Torvalds	be8a9bc633	. Fix DM core device cleanup regression -- due to a latent race that was exposed by the bdi changes that were introduced during the 4.0 merge. -----BEGIN PGP SIGNATURE----- Version: GnuPG v1 iQEcBAABAgAGBQJVExTfAAoJEMUj8QotnQNaMMAIAOT0IjDILfxuP/m6DLl+Qq89 SWrKWJrTDc4j8yjaehELH5Vm5Y8W2qlxQAmwritOEm+53HdODSv9I8WpG/6eCEkt sx+zwQ0wRF0TK6wrqtXe44WU2Zgi9KTj31Y5nTuOAcEqR7S4JHYmbvuzFv/kSO7T GusA+A1HAp0klGgqxT/mzjz61J/frOCDIXWGFwGPapD+6ZLcHfnCz+Ss4CWhdzvo xCMXkeoCgvT5dGj5HQQ3RbucaeUIIeYsja2IT2h572GRiUkau+7qtg/YyBvdMnw+ mEcdMcIg+Higc9z3ZSFu/U/QxGwRF/AycXp4zziuiW8KHfRa2BA7+iyQ0odUHlA= =Ikkj -----END PGP SIGNATURE----- Merge tag 'dm-4.0-fix-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper fix from Mike Snitzer: "Fix DM core device cleanup regression -- due to a latent race that was exposed by the bdi changes that were introduced during the 4.0 merge" * tag 'dm-4.0-fix-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: dm: fix add_disk() NULL pointer due to race with free_dev()	2015-03-26 14:53:47 -07:00
Goldwyn Rodrigues	124eb761ed	md: Fix bitmap offset calculations The calculations of bitmap offset is incorrect with respect to bits to bytes conversion. Also, remove an irrelevant duplicate message. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: NeilBrown <neilb@suse.de>	2015-03-25 13:07:55 +11:00
Mike Snitzer	63a4f065ec	dm: fix add_disk() NULL pointer due to race with free_dev() Commit `c4db59d31e` ("fs: don't reassign dirty inodes to default_backing_dev_info") exposed DM to a latent race in free_dev() vs add_disk() in relation to management of the device's minor number. Fix this by refactoring free_dev() to match cleanup order of the alloc_dev() error path. Move cleanup of the gendisk, queue, and bdev to _before_ the cleanup of the idr managed minor number. Also, purely due to cleanup that fell out during the free_dev() audit: - adjust dm_blk_close() to access the gendisk's private_data under the _minor_lock spinlock. - move __dm_destroy()'s dm_get_live_table() call out from under the _minor_lock spinlock. Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1202449 Reported-by: Zdenek Kabelac <zkabelac@redhat.com> Reported-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-03-23 18:14:00 -04:00
Linus Torvalds	1b717b1af5	One fix for md in 4.0-rc4 Regression in recent patch causes crash on error path. -----BEGIN PGP SIGNATURE----- Version: GnuPG v2 iQIVAwUAVQyiMDnsnt1WYoG5AQKV4A//SwSZQmnbSTDZ1H5DyERXBkNNKkyFkZCL 9jwl60HaVA5U0N/fP97ku/757QtODYjjHLzlkYqgiiqMBGPFFIKa78D6Su0Z8uJK FtMJQaalhsWOCXHUz2R6DPHR1e9x2mlrJgqvnIbnbFXym1PZJ/fo3ccsLX3qqmNJ F4oFYpy79+yMCUSTn8Sv9czzSE19k5XC/Hz14drWi5ox9zpZmpFbMbQdNaKFa2Uy kphgEdvFNF9C2X+rJeeY91ofz/keSsZVL8HqxLThjxKXsZ+krpywsXa2nX7W00ya zt49VK78PL3v2eIAgf1XaBYO3oyjyzEaVuAkFOktv/LuLKAbFbup2Yi5JtHzieor NoaV9hfgCq1OqxkY8zI8hLcL6HS2vvk8vUXbb2Xqa6X5w6xuyjRMzzAyxqXuG4qm 16jAU2RvITfdwAeNacyzq/xAWBg+jUgeg+lO/Z71O0qS7chGUNzve+hwQOtoeHzP fmEetiZa+gbOQEBZT3Ubsw4L9CqCaYVx7KOiE8Py5nbUVPMz2nF6jjt6WSPOgyWv JrflLIoEE5l6QffnEzcJb3OR920TKZx17/87KvBwct9XH3R0+UgJLFggHrYQAiUA 54CnXVPl0MPq5vtROviX2ZWMThOk1s5HAsMhCBuTdqghtQQdNP7DWmGfg7ys5uUq 152cPd+xGw4= =PM5i -----END PGP SIGNATURE----- Merge tag 'md/4.0-rc4-fix' of git://neil.brown.name/md Pull bugfix for md from Neil Brown: "One fix for md in 4.0-rc4 Regression in recent patch causes crash on error path" * tag 'md/4.0-rc4-fix' of git://neil.brown.name/md: md: fix problems with freeing private data after ->run failure.	2015-03-22 16:38:19 -07:00
Linus Torvalds	da6b9a2049	A handful of stable fixes for DM: - fix thin target to always zero-fill reads to unprovisioned blocks - fix to interlock device destruction's suspend from internal suspends - fix 2 snapshot exception store handover bugs - fix dm-io to cope with DISCARD and WRITE_SAME capabilities changing -----BEGIN PGP SIGNATURE----- Version: GnuPG v1 iQEcBAABAgAGBQJVDIJ8AAoJEMUj8QotnQNaynEIAK4Q3mvOeI6PIwuC2iOWRglJ C0xzTKMR6VDQeM/uMRLeiiqxdPs6b89OdSmJHKJYOrdsusjG0a4IBAO9vDayb6sW AxOlbnpXdoiH3cqQ4dISJIfS9qM49qGM40CAflcLKYsGfLnstVLt+4l9HaWaRrHj b2kAC+I6PDDybFlKD5KsMB56sjzhhqg/lnnwkY2omV4vLHf6RrhVhK5gcEPF7VN0 SJ5HKSNBHQnpYSEECHXkxGvZ/ZKTe9n7q0t6Q3nnTSUFTXS6esgTsK1q2AykADDR 3W1LkrAAFP31AWKPkUwCEe3C8Rk3rCsrq2H14woWX1/UxSXNPlMYCU50ak4TVmc= =Svyk -----END PGP SIGNATURE----- Merge tag 'dm-4.0-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull devicemapper fixes from Mike Snitzer: "A handful of stable fixes for DM: - fix thin target to always zero-fill reads to unprovisioned blocks - fix to interlock device destruction's suspend from internal suspends - fix 2 snapshot exception store handover bugs - fix dm-io to cope with DISCARD and WRITE_SAME capabilities changing" * tag 'dm-4.0-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: dm io: deal with wandering queue limits when handling REQ_DISCARD and REQ_WRITE_SAME dm snapshot: suspend merging snapshot when doing exception handover dm snapshot: suspend origin when doing exception handover dm: hold suspend_lock while suspending device during device deletion dm thin: fix to consistently zero-fill reads to unprovisioned blocks	2015-03-21 11:15:13 -07:00
kbuild test robot	09dd1af2e0	md/cluster: Communication Framework: fix semicolon.cocci warnings drivers/md/md-cluster.c:328:2-3: Unneeded semicolon Removes unneeded semicolon. Generated by: scripts/coccinelle/misc/semicolon.cocci Signed-off-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: NeilBrown <neilb@suse.de>	2015-03-21 10:33:00 +11:00
kbuild test robot	6dc69c9c46	md: recover_bitmaps() can be static drivers/md/md-cluster.c:190:6: sparse: symbol 'recover_bitmaps' was not declared. Should it be static? Signed-off-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: NeilBrown <neilb@suse.de>	2015-03-21 10:33:00 +11:00
Goldwyn Rodrigues	fa8259da0e	md: Fix stray --cluster-confirm crash A --cluster-confirm without an --add (by another node) can crash the kernel. Fix it by guarding it using a state. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: NeilBrown <neilb@suse.de>	2015-03-21 10:33:00 +11:00
NeilBrown	0c35bd4723	md: fix problems with freeing private data after ->run failure. If ->run() fails, it can either free the data structures it allocated, or leave that task to ->free() which will be called on failures. However: md.c calls ->free() even if ->private_data is NULL, which causes problems in some personalities. raid0.c frees the data, but doesn't clear ->private_data, which will become a problem when we fix md.c So better fix both these issues at once. Reported-by: Richard W.M. Jones <rjones@redhat.com> Fixes: `5aa61f427e` URL: https://bugzilla.kernel.org/show_bug.cgi?id=94381 Signed-off-by: NeilBrown <neilb@suse.de>	2015-03-21 09:40:36 +11:00
Stephen Rothwell	3b0e6aacbf	md/bitmap: use sector_div for sector_t divisions neilb: modified to not corrupt ->resync_max_sectors. sector_div usage fixed by Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: NeilBrown <neilb@suse.de>	2015-03-04 13:08:16 +11:00
NeilBrown	935f3d4fc6	md/bitmap: fix incorrect DIV_ROUND_UP usage. DIV_ROUTND_UP doesn't work on "long long", - and it should be sector_t anyway. Signed-off-by: NeilBrown <neilb@suse.de>	2015-03-04 12:54:29 +11:00
Darrick J. Wong	e5db29806b	dm io: deal with wandering queue limits when handling REQ_DISCARD and REQ_WRITE_SAME Since it's possible for the discard and write same queue limits to change while the upper level command is being sliced and diced, fix up both of them (a) to reject IO if the special command is unsupported at the start of the function and (b) read the limits once and let the commands error out on their own if the status happens to change. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org	2015-02-27 14:53:32 -05:00
Mikulas Patocka	09ee96b214	dm snapshot: suspend merging snapshot when doing exception handover The "dm snapshot: suspend origin when doing exception handover" commit fixed a exception store handover bug associated with pending exceptions to the "snapshot-origin" target. However, a similar problem exists in snapshot merging. When snapshot merging is in progress, we use the target "snapshot-merge" instead of "snapshot-origin". Consequently, during exception store handover, we must find the snapshot-merge target and suspend its associated mapped_device. To avoid lockdep warnings, the target must be suspended and resumed without holding _origins_lock. Introduce a dm_hold() function that grabs a reference on a mapped_device, but unlike dm_get(), it doesn't crash if the device has the DMF_FREEING flag set, it returns an error in this case. In snapshot_resume() we grab the reference to the origin device using dm_hold() while holding _origins_lock (_origins_lock guarantees that the device won't disappear). Then we release _origins_lock, suspend the device and grab _origins_lock again. NOTE to stable@ people: When backporting to kernels 3.18 and older, use dm_internal_suspend and dm_internal_resume instead of dm_internal_suspend_fast and dm_internal_resume_fast. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org	2015-02-27 14:53:16 -05:00
Mikulas Patocka	b735fede8d	dm snapshot: suspend origin when doing exception handover In the function snapshot_resume we perform exception store handover. If there is another active snapshot target, the exception store is moved from this target to the target that is being resumed. The problem is that if there is some pending exception, it will point to an incorrect exception store after that handover, causing a crash due to dm-snap-persistent.c:get_exception()'s BUG_ON. This bug can be triggered by repeatedly changing snapshot permissions with "lvchange -p r" and "lvchange -p rw" while there are writes on the associated origin device. To fix this bug, we must suspend the origin device when doing the exception store handover to make sure that there are no pending exceptions: - introduce _origin_hash that keeps track of dm_origin structures. - introduce functions __lookup_dm_origin, __insert_dm_origin and __remove_dm_origin that manipulate the origin hash. - modify snapshot_resume so that it calls dm_internal_suspend_fast() and dm_internal_resume_fast() on the origin device. NOTE to stable@ people: When backporting to kernels 3.12-3.18, use dm_internal_suspend and dm_internal_resume instead of dm_internal_suspend_fast and dm_internal_resume_fast. When backporting to kernels older than 3.12, you need to pick functions dm_internal_suspend and dm_internal_resume from the commit `fd2ed4d252`. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org	2015-02-27 14:49:47 -05:00
Mikulas Patocka	ab7c7bb6f4	dm: hold suspend_lock while suspending device during device deletion __dm_destroy() must take the suspend_lock so that its presuspend and postsuspend calls do not race with an internal suspend. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org	2015-02-27 14:09:23 -05:00
Joe Thornber	5f027a3bf1	dm thin: fix to consistently zero-fill reads to unprovisioned blocks It was always intended that a read to an unprovisioned block will return zeroes regardless of whether the pool is in read-only or read-write mode. thin_bio_map() was inconsistent with its handling of such reads when the pool is in read-only mode, it now properly zero-fills the bios it returns in response to unprovisioned block reads. Eliminate thin_bio_map()'s special read-only mode handling of -ENODATA and just allow the IO to be deferred to the worker which will result in pool->process_bio() handling the IO (which already properly zero-fills reads to unprovisioned blocks). Reported-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org	2015-02-27 09:59:12 -05:00
NeilBrown	ba599aca52	md: fix error paths from bitmap_create. Recent change to bitmap_create mishandles errors. In particular a failure doesn't alway cause 'err' to be set. Signed-off-by: NeilBrown <neilb@suse.de>	2015-02-25 11:44:11 +11:00
NeilBrown	750f199ee8	md: mark some attributes as pre-alloc Since __ATTR_PREALLOC was introduced in v3.19-rc1~78^2~18 it can now be used by md. This ensure that writing to these sysfs attributes will never block due to a memory allocation. Such blocking could become a deadlock if mdmon is trying to reconfigure an array after a failure prior to re-enabling writes. Signed-off-by: NeilBrown <neilb@suse.de>	2015-02-25 11:38:46 +11:00
Eric Mei	16d9cfab93	raid5: check faulty flag for array status during recovery. When we have more than 1 drive failure, it's possible we start rebuild one drive while leaving another faulty drive in array. To determine whether array will be optimal after building, current code only check whether a drive is missing, which could potentially lead to data corruption. This patch is to add checking Faulty flag. Signed-off-by: NeilBrown <neilb@suse.de>	2015-02-25 11:38:26 +11:00
Tomáš Hodek	d1901ef099	md/raid1: fix read balance when a drive is write-mostly. When a drive is marked write-mostly it should only be the target of reads if there is no other option. This behaviour was broken by commit `9dedf60313` md/raid1: read balance chooses idlest disk for SSD which causes a write-mostly device to be preferred is some cases. Restore correct behaviour by checking and setting best_dist_disk and best_pending_disk rather than best_disk. We only need to test one of these as they are both changed from -1 or >=0 at the same time. As we leave min_pending and best_dist unchanged, any non-write-mostly device will appear better than the write-mostly device. Reported-by: Tomáš Hodek <tomas.hodek@volny.cz> Reported-by: Dark Penguin <darkpenguin@yandex.ru> Signed-off-by: NeilBrown <neilb@suse.de> Link: http://marc.info/?l=linux-raid&m=135982797322422 Fixes: `9dedf60313` Cc: stable@vger.kernel.org (3.6+)	2015-02-25 11:37:02 +11:00
Goldwyn Rodrigues	1aee41f637	Add new disk to clustered array Algorithm: 1. Node 1 issues mdadm --manage /dev/mdX --add /dev/sdYY which issues ioctl(ADD_NEW_DISC with disc.state set to MD_DISK_CLUSTER_ADD) 2. Node 1 sends NEWDISK with uuid and slot number 3. Other nodes issue kobject_uevent_env with uuid and slot number (Steps 4,5 could be a udev rule) 4. In userspace, the node searches for the disk, perhaps using blkid -t SUB_UUID="" 5. Other nodes issue either of the following depending on whether the disk was found: ioctl(ADD_NEW_DISK with disc.state set to MD_DISK_CANDIDATE and disc.number set to slot number) ioctl(CLUSTERED_DISK_NACK) 6. Other nodes drop lock on no-new-devs (CR) if device is found 7. Node 1 attempts EX lock on no-new-devs 8. If node 1 gets the lock, it sends METADATA_UPDATED after unmarking the disk as SpareLocal 9. If not (get no-new-dev lock), it fails the operation and sends METADATA_UPDATED 10. Other nodes understand if the device is added or not by reading the superblock again after receiving the METADATA_UPDATED message. Signed-off-by: Lidong Zhong <lzhong@suse.com> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>	2015-02-23 09:59:07 -06:00
Goldwyn Rodrigues	7d49ffcfa3	Read from the first device when an area is resyncing set choose_first true for cluster read in read balance when the area is resyncing. Signed-off-by: Lidong Zhong <lzhong@suse.com> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>	2015-02-23 09:59:07 -06:00
Goldwyn Rodrigues	589a1c4916	Suspend writes in RAID1 if within range If there is a resync going on, all nodes must suspend writes to the range. This is recorded in the suspend_info/suspend_list. If there is an I/O within the ranges of any of the suspend_info, should_suspend will return 1. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>	2015-02-23 09:59:07 -06:00
Goldwyn Rodrigues	e59721ccdc	Resync start/Finish actions When a RESYNC_START message arrives, the node removes the entry with the current slot number and adds the range to the suspend_list. Simlarly, when a RESYNC_FINISHED message is received, node clears entry with respect to the bitmap number. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>	2015-02-23 09:59:07 -06:00
Goldwyn Rodrigues	965400eb61	Send RESYNCING while performing resync start/stop When a resync is initiated, RESYNCING message is sent to all active nodes with the range (lo,hi). When the resync is over, a RESYNCING message is sent with (0,0). A high sector value of zero indicates that the resync is over. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>	2015-02-23 09:59:06 -06:00
Goldwyn Rodrigues	1d7e3e9611	Reload superblock if METADATA_UPDATED is received Re-reads the devices by invalidating the cache. Since we don't write to faulty devices, this is detected using events recorded in the devices. If it is old as compared to the mddev mark it is faulty. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>	2015-02-23 09:59:06 -06:00
Goldwyn Rodrigues	293467aa1f	metadata_update sends message to other nodes - request to send a message - make changes to superblock - send messages telling everyone that the superblock has changed - other nodes all read the superblock - other nodes all ack the messages - updating node release the "I'm sending a message" resource. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>	2015-02-23 09:59:06 -06:00
Goldwyn Rodrigues	601b515c5d	Communication Framework: Sending functions The sending part is split in two functions to make sure atomicity of the operations, such as the MD superblock update. Signed-off-by: Lidong Zhong <lzhong@suse.com> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>	2015-02-23 09:59:06 -06:00
Goldwyn Rodrigues	4664680c38	Communication Framework: Receiving 1. receive status sender receiver receiver ACK:CR ACK:CR ACK:CR 2. sender get EX of TOKEN sender get EX of MESSAGE sender receiver receiver TOKEN:EX ACK:CR ACK:CR MESSAGE:EX ACK:CR 3. sender write LVB. sender down-convert MESSAGE from EX to CR sender try to get EX of ACK [ wait until all receiver has processed the MESSAGE ] [ triggered by bast of ACK ] receiver get CR of MESSAGE receiver read LVB receiver processes the message [ wait finish ] receiver release ACK sender receiver receiver TOKEN:EX MESSAGE:CR MESSAGE:CR MESSAGE:CR ACK:EX 4. sender down-convert ACK from EX to CR sender release MESSAGE sender release TOKEN receiver upconvert to EX of MESSAGE receiver get CR of ACK receiver release MESSAGE sender receiver receiver ACK:CR ACK:CR ACK:CR Signed-off-by: Lidong Zhong <lzhong@suse.com> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>	2015-02-23 09:59:06 -06:00
Goldwyn Rodrigues	4b26a08af9	Perform resync for cluster node failure If bitmap_copy_slot returns hi>0, we need to perform resync. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>	2015-02-23 09:59:06 -06:00
Goldwyn Rodrigues	e94987db2e	Initiate recovery on node failure The DLM informs us in case of node failure with the DLM slot number. cluster_info->recovery_map sets the bit corresponding to the slot number and wakes up the recovery thread. The recovery thread: 1. Derives the slot number from the recovery_map 2. Locks the bitmap corresponding to the slot 3. Copies the set bits to the node-local bitmap Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>	2015-02-23 09:59:05 -06:00
Goldwyn Rodrigues	11dd35daaa	Copy set bits from another slot bitmap_copy_from_slot reads the bitmap from the slot mentioned. It then copies the set bits to the node local bitmap. This is helper function for the resync operation on node failure. bitmap_set_memory_bits() currently assumes it is only run at startup and that they bitmap is currently empty. So if it finds that a region is already marked as dirty, it won't mark it dirty again. Change bitmap_set_memory_bits() to always set the NEEDED_MASK bit if 'needed' is set. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>	2015-02-23 09:59:05 -06:00
Goldwyn Rodrigues	f9209a3235	bitmap_create returns bitmap pointer This is done to have multiple bitmaps open at the same time. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>	2015-02-23 09:57:57 -06:00
Goldwyn Rodrigues	96ae923ab6	Gather on-going resync information of other nodes When a node joins, it does not know of other nodes performing resync. So, each node keeps the resync information in it's LVB. When a new node joins, it reads the LVB of each "online" bitmap. [TODO] The new node attempts to get the PW lock on other bitmap, if it is successful, it reads the bitmap and performs the resync (if required) on it's behalf. If the node does not get the PW, it requests CR and reads the LVB for the resync information. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>	2015-02-23 07:30:11 -06:00
Goldwyn Rodrigues	54519c5f4b	Lock bitmap while joining the cluster Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>	2015-02-23 07:30:11 -06:00
Goldwyn Rodrigues	b97e92574c	Use separate bitmaps for each nodes in the cluster On-disk format: 0 4k 8k 12k ------------------------------------------------------------------- \| idle \| md super \| bm super [0] + bits \| \| bm bits[0, contd] \| bm super[1] + bits \| bm bits[1, contd] \| \| bm super[2] + bits \| bm bits [2, contd] \| bm super[3] + bits \| \| bm bits [3, contd] \| \| \| Bitmap super has a field nodes, which defines the maximum number of nodes the device can use. While reading the bitmap super, if the cluster finds out that the number of nodes is > 0: 1. Requests the md-cluster module. 2. Calls md_cluster_ops->join(), which sets up clustering such as joining DLM lockspace. Since the first time, the first bitmap is read. After the call to the cluster_setup, the bitmap offset is adjusted and the superblock is re-read. This also ensures the bitmap is read the bitmap lock (when bitmap lock is introduced in later patches) Questions: 1. cluster name is repeated in all bitmap supers. Is that okay? Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>	2015-02-23 07:30:11 -06:00
Goldwyn Rodrigues	cf921cc19c	Add node recovery callbacks DLM offers callbacks when a node fails and the lock remastery is performed: 1. recover_prep: called when DLM discovers a node is down 2. recover_slot: called when DLM identifies the node and recovery can start 3. recover_done: called when all nodes have completed recover_slot recover_slot() and recover_done() are also called when the node joins initially in order to inform the node with its slot number. These slot numbers start from one, so we deduct one to make it start with zero which the cluster-md code uses. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>	2015-02-23 07:30:11 -06:00
Goldwyn Rodrigues	ca8895d9bb	Return MD_SB_CLUSTERED if mddev is clustered Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>	2015-02-23 07:28:43 -06:00
Goldwyn Rodrigues	c4ce867fda	Introduce md_cluster_info md_cluster_info stores the cluster information in the MD device. The join() is called when mddev detects it is a clustered device. The main responsibilities are: 1. Setup a DLM lockspace 2. Setup all initial locks such as super block locks and bitmap lock (will come later) The leave() clears up the lockspace and all the locks held. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>	2015-02-23 07:28:42 -06:00
Goldwyn Rodrigues	edb39c9ded	Introduce md_cluster_operations to handle cluster functions This allows dynamic registering of cluster hooks. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>	2015-02-23 07:28:42 -06:00
Goldwyn Rodrigues	47741b7ca7	DLM lock and unlock functions A dlm_lock_resource is a structure which contains all information required for locking using DLM. The init function allocates the lock and acquires the lock in NL mode. The unlock function converts the lock resource to NL mode. This is done to preserve LVB and for faster processing of locks. The lock resource is DLM unlocked only in the lockres_free function, which is the end of life of the lock resource. Signed-off-by: Lidong Zhong <lzhong@suse.com> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>	2015-02-23 07:28:42 -06:00
Goldwyn Rodrigues	8e854e9cfd	Create a separate module for clustering support Tagged as EXPERIMENTAL for now. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>	2015-02-23 07:28:42 -06:00
Goldwyn Rodrigues	183bdf5106	Add number of nodes to bitmap structure for clustering Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>	2015-02-23 07:28:30 -06:00
Linus Torvalds	a911dcdba1	- Significant dm-crypt CPU scalability performance improvements thanks to changes that enable effective use of an unbound workqueue across all available CPUs. A large battery of tests were performed to validate these changes, summary of results is available here: https://www.redhat.com/archives/dm-devel/2015-February/msg00106.html - A few additional stable fixes (to DM core, dm-snapshot and dm-mirror) and a small fix to the dm-space-map-disk. -----BEGIN PGP SIGNATURE----- Version: GnuPG v1 iQEcBAABAgAGBQJU51MsAAoJEMUj8QotnQNan8wH/3Oa2/ofbX8zNYtdDxFBK7W0 pHwVA2CH9yzDDi90X9GsQODmXerMrYf+SHo/RsQTpSnt5WI/4ZVP/1EoNGu6T2Yr XiaKc/Jwo+ScxdhoHSNtyGL5vKeipX6clgz1wcd/4/UBcBAr0Vxj25s8Ta7naK2V hZyWnt3MCGEDQ45NVBmLOU78Cl68LGf8JOUY0l3cd8ehQjGyR9a12GXwRktepslp hw9xu8uYmE93fD+MIe/fUs7W/WvIVgFSskhswlvD3kAFpEAsMczjLGVSRQlBE90L OK7v1HKxiIUJa17g3LmiCFa59ZjrnSHGpOOv9IPAFU/LzFIU47lcFwRFjgfB8po= =8vFi -----END PGP SIGNATURE----- Merge tag 'dm-3.20-changes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull more device mapper changes from Mike Snitzer: - Significant dm-crypt CPU scalability performance improvements thanks to changes that enable effective use of an unbound workqueue across all available CPUs. A large battery of tests were performed to validate these changes, summary of results is available here: https://www.redhat.com/archives/dm-devel/2015-February/msg00106.html - A few additional stable fixes (to DM core, dm-snapshot and dm-mirror) and a small fix to the dm-space-map-disk. * tag 'dm-3.20-changes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: dm snapshot: fix a possible invalid memory access on unload dm: fix a race condition in dm_get_md dm crypt: sort writes dm crypt: add 'submit_from_crypt_cpus' option dm crypt: offload writes to thread dm crypt: remove unused io_pool and _crypt_io_pool dm crypt: avoid deadlock in mempools dm crypt: don't allocate pages for a partial request dm crypt: use unbound workqueue for request processing dm io: reject unsupported DISCARD requests with EOPNOTSUPP dm mirror: do not degrade the mirror on discard error dm space map disk: fix sm_disk_count_is_more_than_one()	2015-02-21 13:28:45 -08:00
Linus Torvalds	b11a278397	Merge branch 'kconfig' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild Pull kconfig updates from Michal Marek: "Yann E Morin was supposed to take over kconfig maintainership, but this hasn't happened. So I'm sending a few kconfig patches that I collected: - Fix for missing va_end in kconfig - merge_config.sh displays used if given too few arguments - s/boolean/bool/ in Kconfig files for consistency, with the plan to only support bool in the future" * 'kconfig' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild: kconfig: use va_end to match corresponding va_start merge_config.sh: Display usage if given too few arguments kconfig: use bool instead of boolean for type definition attributes	2015-02-19 10:36:45 -08:00
Mikulas Patocka	22aa66a3ee	dm snapshot: fix a possible invalid memory access on unload When the snapshot target is unloaded, snapshot_dtr() waits until pending_exceptions_count drops to zero. Then, it destroys the snapshot. Therefore, the function that decrements pending_exceptions_count should not touch the snapshot structure after the decrement. pending_complete() calls free_pending_exception(), which decrements pending_exceptions_count, and then it performs up_write(&s->lock) and it calls retry_origin_bios() which dereferences s->origin. These two memory accesses to the fields of the snapshot may touch the dm_snapshot struture after it is freed. This patch moves the call to free_pending_exception() to the end of pending_complete(), so that the snapshot will not be destroyed while pending_complete() is in progress. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org	2015-02-18 09:41:54 -05:00
Mikulas Patocka	2bec1f4a88	dm: fix a race condition in dm_get_md The function dm_get_md finds a device mapper device with a given dev_t, increases the reference count and returns the pointer. dm_get_md calls dm_find_md, dm_find_md takes _minor_lock, finds the device, tests that the device doesn't have DMF_DELETING or DMF_FREEING flag, drops _minor_lock and returns pointer to the device. dm_get_md then calls dm_get. dm_get calls BUG if the device has the DMF_FREEING flag, otherwise it increments the reference count. There is a possible race condition - after dm_find_md exits and before dm_get is called, there are no locks held, so the device may disappear or DMF_FREEING flag may be set, which results in BUG. To fix this bug, we need to call dm_get while we hold _minor_lock. This patch renames dm_find_md to dm_get_md and changes it so that it calls dm_get while holding the lock. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org	2015-02-18 09:41:19 -05:00
Linus Torvalds	0d695d6d8b	3 bug md fixes for 3.20 yet-another-livelock in raid5, and a problem with write errors to 4K-block devices. -----BEGIN PGP SIGNATURE----- Version: GnuPG v2 iQIVAwUAVOPorznsnt1WYoG5AQKa5A//RmLurotaJvHt8YA8jzlzo2KY/9jxIv8x jF0CQ0Kva056GG27dlkMPBoddrVk7/WaCoz/ICBKgfSTCpiEFekNlExLkeLXrgUb mKljS/HKhP4KQjiy519HiT2v9BRTqZD3l6P2405qVAZGTymwhelAp3y1I5ZIRFOr k2Uo6PNT/XYvzwWAy5iEBrE7zZl3MZJG5RjW2Wq6JKaWMUQaQERu5OhugWEfCI3U yVomILlwiYqC85WLgrVob/OqCSoA3UsIZkZKvKJCP8Y4C9QJzquEeZy4z0swgJq0 FMsu2WqmI3/ZNFvmlSozN05CEUMkZF6E8hGJcT+f1Wa1NE40zrzDkVsVvsGMD8v0 Ek7PTiA3X7AhqRl4Lt2Gs8kfDJ9MIRXzvXJv9UUO7PtVQGUhVdqmcqsodyqA7+lw 63hgSpEUtQTkpeWwcYQY6Impr/6jGxHwQKzFlLltaDvmAeOBd6gjAq2bdfPO5tox FiSoClhor4yTc274+s5ORk9xDh61d7ZbnFJ+QN7YioyaSD2LHYIl9EkVFDe4OMXV q39VzXt+uEYx6hxfmnpjPyPvPiyuDsnXzoTMcaaGsJL4TOazoIEj5AFGmg5i3/yw 6UVjKxJ43SMM7qg3VbgZpenGJwvRkP5mqQwMNDZICXdWulQTMwM5w3AidNqwYAaU oEPAMfZNGvE= =1BWV -----END PGP SIGNATURE----- Merge tag 'md/3.20-fixes' of git://neil.brown.name/md Pull md bugfixes from Neil Brown: "Three bug md fixes for 3.20 yet-another-livelock in raid5, and a problem with write errors to 4K-block devices" * tag 'md/3.20-fixes' of git://neil.brown.name/md: md/raid5: Fix livelock when array is both resyncing and degraded. md/raid10: round up to bdev_logical_block_size in narrow_write_error. md/raid1: round up to bdev_logical_block_size in narrow_write_error	2015-02-17 17:34:21 -08:00
NeilBrown	26ac107378	md/raid5: Fix livelock when array is both resyncing and degraded. Commit a7854487cd7128a30a7f4f5259de9f67d5efb95f: md: When RAID5 is dirty, force reconstruct-write instead of read-modify-write. Causes an RCW cycle to be forced even when the array is degraded. A degraded array cannot support RCW as that requires reading all data blocks, and one may be missing. Forcing an RCW when it is not possible causes a live-lock and the code spins, repeatedly deciding to do something that cannot succeed. So change the condition to only force RCW on non-degraded arrays. Reported-by: Manibalan P <pmanibalan@amiindia.co.in> Bisected-by: Jes Sorensen <Jes.Sorensen@redhat.com> Tested-by: Jes Sorensen <Jes.Sorensen@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de> Fixes: `a7854487cd` Cc: stable@vger.kernel.org (v3.7+)	2015-02-18 11:35:14 +11:00
Mikulas Patocka	b3c5fd3052	dm crypt: sort writes Write requests are sorted in a red-black tree structure and are submitted in the sorted order. In theory the sorting should be performed by the underlying disk scheduler, however, in practice the disk scheduler only accepts and sorts a finite number of requests. To allow the sorting of all requests, dm-crypt needs to implement its own sorting. The overhead associated with rbtree-based sorting is considered negligible so it is not used conditionally. Even on SSD sorting can be beneficial since in-order request dispatch promotes lower latency IO completion to the upper layers. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-02-16 11:11:15 -05:00
Mikulas Patocka	0f5d8e6ee7	dm crypt: add 'submit_from_crypt_cpus' option Make it possible to disable offloading writes by setting the optional 'submit_from_crypt_cpus' table argument. There are some situations where offloading write bios from the encryption threads to a single thread degrades performance significantly. The default is to offload write bios to the same thread because it benefits CFQ to have writes submitted using the same IO context. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-02-16 11:11:15 -05:00
Mikulas Patocka	dc2676210c	dm crypt: offload writes to thread Submitting write bios directly in the encryption thread caused serious performance degradation. On a multiprocessor machine, encryption requests finish in a different order than they were submitted. Consequently, write requests would be submitted in a different order and it could cause severe performance degradation. Move the submission of write requests to a separate thread so that the requests can be sorted before submitting. But this commit improves dm-crypt performance even without having dm-crypt perform request sorting (in particular it enables IO schedulers like CFQ to sort more effectively). Note: it is required that a previous commit ("dm crypt: don't allocate pages for a partial request") be applied before applying this patch. Otherwise, this commit could introduce a crash. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-02-16 11:11:14 -05:00
Mikulas Patocka	94f5e0243c	dm crypt: remove unused io_pool and _crypt_io_pool The previous commit ("dm crypt: don't allocate pages for a partial request") stopped using the io_pool slab mempool and backing _crypt_io_pool kmem cache. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-02-16 11:11:13 -05:00

1 2 3 4 5 ...

3776 Commits