Set mddev queue's max_write_same_sectors to its chunk_sector value (before
disk_stack_limits merges the underlying disk limits.) With that in place,
be sure to handle writes coming down from the block layer that have the
REQ_WRITE_SAME flag set. That flag needs to be copied into any newly cloned
write bio.
Signed-off-by: Joe Lawrence <joe.lawrence@stratus.com>
Acked-by: "Martin K. Petersen" <martin.petersen@oracle.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Fix the warning:
drivers/md/persistent-data/dm-transaction-manager.c:28:1: warning: "HASH_SIZE" redefined
In file included from include/linux/elevator.h:5,
from include/linux/blkdev.h:216,
from drivers/md/persistent-data/dm-block-manager.h:11,
from drivers/md/persistent-data/dm-transaction-manager.h:10,
from drivers/md/persistent-data/dm-transaction-manager.c:6:
include/linux/hashtable.h:22:1: warning: this is the location of the previous definition
Cc: Alasdair Kergon <agk@redhat.com>
Cc: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If an fsync occurs on a read-only array, we need to send a
completion for the IO and may not increment the active IO count.
Otherwise, we hit a bug trace and can't stop the MD array anymore.
By advice of Christoph Hellwig we return success upon a flush
request but we return -EROFS for other writes.
We detect flush requests by checking if the bio has zero sectors.
This patch is suitable to any -stable kernel to which it applies.
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Ben Hutchings <ben@decadent.org.uk>
Cc: NeilBrown <neilb@suse.de>
Cc: stable@vger.kernel.org
Signed-off-by: Sebastian Riemer <sebastian.riemer@profitbricks.com>
Reported-by: Ben Hutchings <ben@decadent.org.uk>
Acked-by: Paul Menzel <paulepanter@users.sourceforge.net>
Signed-off-by: NeilBrown <neilb@suse.de>
support.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
iQIcBAABAgAGBQJRCn+IAAoJEK2W1qbAHj1naQEP/2eXMOslRyws7M6CcsEgpEK9
N2L2hf6bD3xF/04ZSLbHFI6hPe9wDXSL9Vxd+DRLYTSnc0E9WYBXHmE6Eb0L0xK8
m0Iubk/hi7mk6mnMJtpTFT5pazBTPhVz0nXOijguh5U6PW0xL+4ypXe9nrH2jtW0
DvEHFDIPbKcqwplm8nvo/QJ5O3YNQaMifKUtpXF/JWGlCYP4vPk0dJVg9ATbscEV
Fg3kefoJzZM09q3Uvo01wigbj+wRkpBK9+CiyW6XcE0lkOAnFmpvyYerIoAHAK37
Rsw5J4aMPA9U8mggBEtlHBWa0q5utZafHM11lT2ZeFGCXkdn+TSWni6O7ov54xPP
Cd7jx+uNpe/OuLT5YjbCg2IMXgJs+zIZMSeqSj3SrywE0a0EQHECWiXaFmMmrCCJ
TgZtmp/HS1UsdoiHA3v3ZX3AaX4W+mggYp/5md9P1vHyYS9uTlgSVplhwtVgsL23
EsDxNNxODSIFMAMnrXxAV+NBPiQRY42K22hK/RrnWew9roAQHxroIvQDmzzm5ZRL
BqCFW3w/x2loJOZZ6NH/J8IUEoF9RhCK1tOGVjFuAVn30srt3zXb4pOzYeydykT7
m04HaGO7rCBJI75XdVDhm6ozOvV/GhXF2fJOt4qyoX/X6M8YN5i0jfwC3rhKeuOe
U9fyyYoQV37EWIRI9K5Q
=qtaF
-----END PGP SIGNATURE-----
Merge tag 'dm-3.8-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-dm
Pull more device-mapper fixes from Alasdair G Kergon:
"A fix for stacked dm thin devices and a fix for the new dm WRITE SAME
support."
* tag 'dm-3.8-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-dm:
dm: fix write same requests counting
dm thin: fix queue limits stacking
When processing write same requests, fix dm to send the configured
number of WRITE SAME requests to the target rather than the number of
discards, which is not always the same.
Device-mapper WRITE SAME support was introduced by commit
23508a96cd ("dm: add WRITE SAME support").
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Acked-by: Mike Snitzer <snitzer@redhat.com>
thin_io_hints() is blindly copying the queue limits from the thin-pool
which can lead to incorrect limits being set. The fix here simply
deletes the thin_io_hints() hook which leaves the existing stacking
infrastructure to set the limits correctly.
When a thin-pool uses an MD device for the data device a thin device
from the thin-pool must respect MD's constraints about disallowing a bio
from spanning multiple chunks. Otherwise we can see problems. If the raid0
chunksize is 1152K and thin-pool chunksize is 256K I see the following
md/raid0 error (with extra debug tracing added to thin_endio) when
mkfs.xfs is executed against the thin device:
md/raid0:md99: make_request bug: can't convert block across chunks or bigger than 1152k 6688 127
device-mapper: thin: bio sector=2080 err=-5 bi_size=130560 bi_rw=17 bi_vcnt=32 bi_idx=0
This extra DM debugging shows that the failing bio is spanning across
the first and second logical 1152K chunk (sector 2080 + 255 takes the
bio beyond the first chunk's boundary of sector 2304). So the bio
splitting that DM is doing clearly isn't respecting the MD limits.
max_hw_sectors_kb is 127 for both the thin-pool and thin device
(queue_max_hw_sectors returns 255 so we'll excuse sysfs's lack of
precision). So this explains why bi_size is 130560.
But the thin device's max_hw_sectors_kb should be 4 (PAGE_SIZE) given
that it doesn't have a .merge function (for bio_add_page to consult
indirectly via dm_merge_bvec) yet the thin-pool does sit above an MD
device that has a compulsory merge_bvec_fn. This scenario is exactly
why DM must resort to sending single PAGE_SIZE bios to the underlying
layer. Some additional context for this is available in the header for
commit 8cbeb67a ("dm: avoid unsupported spanning of md stripe boundaries").
Long story short, the reason a thin device doesn't properly get
configured to have a max_hw_sectors_kb of 4 (PAGE_SIZE) is that
thin_io_hints() is blindly copying the queue limits from the thin-pool
device directly to the thin device's queue limits.
Fix this by eliminating thin_io_hints. Doing so is safe because the
block layer's queue limits stacking already enables the upper level thin
device to inherit the thin-pool device's discard and minimum_io_size and
optimal_io_size limits that get set in pool_io_hints. But avoiding the
queue limits copy allows the thin and thin-pool limits to be different
where it is important, namely max_hw_sectors_kb.
Reported-by: Daniel Browning <db@kavod.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Before attempting to activate a RAID array, it is checked for sufficient
redundancy. That is, we make sure that there are not too many failed
devices - or devices specified for rebuild - to undermine our ability to
activate the array. The current code performs this check twice - once to
ensure there were not too many devices specified for rebuild by the user
('validate_rebuild_devices') and again after possibly experiencing a failure
to read the superblock ('analyse_superblocks'). Neither of these checks are
sufficient. The first check is done properly but with insufficient
information about the possible failure state of the devices to make a good
determination if the array can be activated. The second check is simply
done wrong in the case of RAID10 because it doesn't account for the
independence of the stripes (i.e. mirror sets). The solution is to use the
properly written check ('validate_rebuild_devices'), but perform the check
after the superblocks have been read and we know which devices have failed.
This gives us one check instead of two and performs it in a location where
it can be done right.
Only RAID10 was affected and it was affected in the following ways:
- the code did not properly catch the condition where a user specified
a device for rebuild that already had a failed device in the same mirror
set. (This condition would, however, be caught at a deeper level in MD.)
- the code triggers a false positive and denies activation when devices in
independent mirror sets have failed - counting the failures as though they
were all in the same set.
The most likely place this error was introduced (or this patch should have
been included) is in commit 4ec1e369 - first introduced in v3.7-rc1.
Consequently this fix should also go in v3.7.y, however there is a
small conflict on the .version in raid_target, so I'll submit a
separate patch to -stable.
Cc: stable@vger.kernel.org
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
bio completion didn't kick block_bio_complete TP. Only dm was
explicitly triggering the TP on IO completion. This makes
block_bio_complete TP useless for tracers which want to know about
bios, and all other bio based drivers skip generating blktrace
completion events.
This patch makes all bio completions via bio_endio() generate
block_bio_complete TP.
* Explicit trace_block_bio_complete() invocation removed from dm and
the trace point is unexported.
* @rq dropped from trace_block_bio_complete(). bios may fly around
w/o queue associated. Verifying and accessing the assocaited queue
belongs to TP probes.
* blktrace now gets both request and bio completions. Make it ignore
bio completions if request completion path is happening.
This makes all bio based drivers generate blktrace completion events
properly and makes the block_bio_complete TP actually useful.
v2: With this change, block_bio_complete TP could be invoked on sg
commands which have bio's with %NULL bi_bdev. Update TP
assignment code to check whether bio->bi_bdev is %NULL before
dereferencing.
Signed-off-by: Tejun Heo <tj@kernel.org>
Original-patch-by: Namhyung Kim <namhyung@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Alasdair Kergon <agk@redhat.com>
Cc: dm-devel@redhat.com
Cc: Neil Brown <neilb@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Of particular note:
- Disable broken WRITE SAME support in all targets except linear and striped.
Use it when kcopyd is zeroing blocks.
- Remove several mempools from targets by moving the data into the bio's new
front_pad area(which dm calls 'per_bio_data').
- Fix a race in thin provisioning if discards are misused.
- Prevent userspace from interfering with the ioctl parameters and
use kmalloc for the data buffer if it's small instead of vmalloc.
- Throttle some annoying error messages when I/O fails.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
iQIcBAABAgAGBQJQ1M5OAAoJEK2W1qbAHj1nrQcP/itnnAw8RNsSHBrFMrL9wVnB
5dmZ1BXPZmEbG+ViU4wzVmRUPHuSHTwhIqH7UFPjyCgbWaz1jaXfpyIxBsxlJi4E
zuGjv46akANMwH0o/aJDRuEIrCnMtjLrMiY2Oq00lJFvATurwYAKSIgmnwRVdAYy
gDehJhaymNtHVjhymu33xEn/hqqkQtUbMDj9o+IZppmAw1aQyNuYnwQu3HvcETuz
/JBcs8isXKIQMJdMLFdGg7lZjLO241UvSwCAeGycKkupHLaYfycumPywgdiNFVUg
L6pQP9RtAQ+H2VBQ1OIVMJxqiXxQ0xHhyxUYIe3reTar+RXoMA0yK+FiJTwSY1cE
Xk0s8x2DXwUyu3Vx7UmvgUXnMgd4TIPITYBYiOAanEF/8Xt0voZn8mzNyyzsyFXy
0u1vMRK+ZK7+QPio9LRh7bgHNK1g5ZyShvwqTMDmtlp+uskaP4iHDDGtVUjFA+Wf
r9Ms0CXPbXIN6laUIT/4L3LJZtyRWB6e8wuCrUWIWWRbjrMPaPnB+/NlckGJ0CHa
P/5r1rmLdneTEZ8Vx/2g3fBJ+H2uNQKhYujjnE0HqtHP+tvjt7ernibyU2QhNBeE
Zy0PXRatY0Xn7UFpn44uJ2qxkWaO5Dloaa4HkWdlWFdR3f/u5MzVjy5mDXLUxkGq
wj2Z3YkjYjy948MViBhD
=yzhS
-----END PGP SIGNATURE-----
Merge tag 'dm-3.8-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-dm
Pull dm update from Alasdair G Kergon:
"Miscellaneous device-mapper fixes, cleanups and performance
improvements.
Of particular note:
- Disable broken WRITE SAME support in all targets except linear and
striped. Use it when kcopyd is zeroing blocks.
- Remove several mempools from targets by moving the data into the
bio's new front_pad area(which dm calls 'per_bio_data').
- Fix a race in thin provisioning if discards are misused.
- Prevent userspace from interfering with the ioctl parameters and
use kmalloc for the data buffer if it's small instead of vmalloc.
- Throttle some annoying error messages when I/O fails."
* tag 'dm-3.8-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-dm: (36 commits)
dm stripe: add WRITE SAME support
dm: remove map_info
dm snapshot: do not use map_context
dm thin: dont use map_context
dm raid1: dont use map_context
dm flakey: dont use map_context
dm raid1: rename read_record to bio_record
dm: move target request nr to dm_target_io
dm snapshot: use per_bio_data
dm verity: use per_bio_data
dm raid1: use per_bio_data
dm: introduce per_bio_data
dm kcopyd: add WRITE SAME support to dm_kcopyd_zero
dm linear: add WRITE SAME support
dm: add WRITE SAME support
dm: prepare to support WRITE SAME
dm ioctl: use kmalloc if possible
dm ioctl: remove PF_MEMALLOC
dm persistent data: improve improve space map block alloc failure message
dm thin: use DMERR_LIMIT for errors
...
Rename stripe_map_discard to stripe_map_range and reuse it for WRITE
SAME bio processing.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
This patch removes map_info from bio-based device mapper targets.
map_info is still used for request-based targets.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Eliminate struct map_info from dm-snap.
map_info->ptr was used in dm-snap to indicate if the bio was tracked.
If map_info->ptr was non-NULL, the bio was linked in tracked_chunk_hash.
This patch removes the use of map_info->ptr. We determine if the bio was
tracked based on hlist_unhashed(&c->node). If hlist_unhashed is true,
the bio is not tracked, if it is false, the bio is tracked.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
This patch removes endio_hook_pool from dm-thin and uses per-bio data instead.
This patch removes any use of map_info in preparation for the next patch
that removes map_info from bio-based device mapper.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Don't use map_info any more in dm-raid1.
map_info was used for writes to hold the region number. For this purpose
we add a new field dm_bio_details to dm_raid1_bio_record.
map_info was used for reads to hold a pointer to dm_raid1_bio_record (if
the pointer was non-NULL, bio details were saved; if the pointer was
NULL, bio details were not saved). We use
dm_raid1_bio_record.details->bi_bdev for this purpose. If bi_bdev is
NULL, details were not saved, if bi_bdev is non-NULL, details were
saved.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Replace map_info with a per-bio structure "struct per_bio_data" in dm-flakey.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Rename struct read_record to bio_record in dm-raid1.
In the following patch, the structure will be used for both read and
write bios, so rename it.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
This patch moves target_request_nr from map_info to dm_target_io and
makes it accessible with dm_bio_get_target_request_nr.
This patch is a preparation for the next patch that removes map_info.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Replace tracked_chunk_pool with per_bio_data in dm-snap.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Replace io_mempool with per_bio_data in dm-verity.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Replace read_record_pool with per_bio_data in dm-raid1.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Introduce a field per_bio_data_size in struct dm_target.
Targets can set this field in the constructor. If a target sets this
field to a non-zero value, "per_bio_data_size" bytes of auxiliary data
are allocated for each bio submitted to the target. These data can be
used for any purpose by the target and help us improve performance by
removing some per-target mempools.
Per-bio data is accessed with dm_per_bio_data. The
argument data_size must be the same as the value per_bio_data_size in
dm_target.
If the target has a pointer to per_bio_data, it can get a pointer to
the bio with dm_bio_from_per_bio_data() function (data_size must be the
same as the value passed to dm_per_bio_data).
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Add WRITE SAME support to dm-io and make it accessible to
dm_kcopyd_zero(). dm_kcopyd_zero() provides an asynchronous interface
whereas the blkdev_issue_write_same() interface is synchronous.
WRITE SAME is a SCSI command that can be leveraged for more efficient
zeroing of a specified logical extent of a device which supports it.
Only a single zeroed logical block is transfered to the target for each
WRITE SAME and the target then writes that same block across the
specified extent.
The dm thin target uses this.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
The linear target can already support WRITE SAME requests so signal
this by setting num_write_same_requests to 1.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
WRITE SAME bios have a payload that contain a single page. When
cloning WRITE SAME bios DM has no need to modify the bi_io_vec
attributes (and doing so would be detrimental). DM need only alter the
start and end of the WRITE SAME bio accordingly.
Rather than duplicate __clone_and_map_discard, factor out a common
function that is also used by __clone_and_map_write_same.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Allow targets to opt in to WRITE SAME support by setting
'num_write_same_requests' in the dm_target structure.
A dm device will only advertise WRITE SAME support if all its
targets and all its underlying devices support it.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
If the parameter buffer is small enough, try to allocate it with kmalloc()
rather than vmalloc().
vmalloc is noticeably slower than kmalloc because it has to manipulate
page tables.
In my tests, on PA-RISC this patch speeds up activation 13 times.
On Opteron this patch speeds up activation by 5%.
This patch introduces a new function free_params() to free the
parameters and this uses new flags that record whether or not vmalloc()
was used and whether or not the input buffer must be wiped after use.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
When allocating memory for the userspace ioctl data, set some
appropriate GPF flags directly instead of using PF_MEMALLOC.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Improve space map error message when unable to allocate a new
metadata block.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Throttle all errors logged from the IO path by dm thin.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Nearly all of persistent-data is in the IO path so throttle error
messages with DMERR_LIMIT to limit the amount logged when
something has gone wrong.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Reinstate a useful error message when the block manager buffer validator fails.
This was mistakenly eliminated when the block manager was converted to use
dm-bufio.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
If the user does not supply a bitmap region_size to the dm raid target,
a reasonable size is computed automatically. If this is not a power of 2,
the md code will report an error later.
This patch catches the problem early and rounds the region_size to the
next power of two.
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Remove unused @data_block parameter from cell_defer.
Change thin_bio_map to use many returns rather than setting a variable.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Rename cell_defer_except() to cell_defer_no_holder() which describes
its function more clearly.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
track_chunk is always called with interrupts enabled. Consequently, we
do not need to save and restore interrupt state in "flags" variable.
This patch changes spin_lock_irqsave to spin_lock_irq and
spin_unlock_irqrestore to spin_unlock_irq.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Use a defined macro DM_ENDIO_INCOMPLETE instead of a numeric constant.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
mempool_alloc can't fail if __GFP_WAIT is specified, so the condition
that tests if read_record is non-NULL is always true.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
If "ignore_discard" is specified when creating the thin pool device then
discard support is disabled for that device. The pool device's status
should reflect this fact rather than stating "no_discard_passdown"
(which implies discards are enabled but passdown is disabled).
Reported-by: Zdenek Kabelac <zkabelac@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
When deleting nested btrees, the code forgets to delete the innermost
btree. The thin-metadata code serendipitously compensates for this by
claiming there is one extra layer in the tree.
This patch corrects both problems.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
When discards are prepared it is best to directly wake the worker that
will process them. The worker will be woken anyway, via periodic
commit, but there is no reason to not wake_worker here.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
There is a race when discard bios and non-discard bios are issued
simultaneously to the same block.
Discard support is expensive for all thin devices precisely because you
have to be careful to quiesce the area you're discarding. DM thin must
handle this conflicting IO pattern (simultaneous non-discard vs discard)
even though a sane application shouldn't be issuing such IO.
The race manifests as follows:
1. A non-discard bio is mapped in thin_bio_map.
This doesn't lock out parallel activity to the same block.
2. A discard bio is issued to the same block as the non-discard bio.
3. The discard bio is locked in a dm_bio_prison_cell in process_discard
to lock out parallel activity against the same block.
4. The non-discard bio's mapping continues and its all_io_entry is
incremented so the bio is accounted for in the thin pool's all_io_ds
which is a dm_deferred_set used to track time locality of non-discard IO.
5. The non-discard bio is finally locked in a dm_bio_prison_cell in
process_bio.
The race can result in deadlock, leaving the block layer hanging waiting
for completion of a discard bio that never completes, e.g.:
INFO: task ruby:15354 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
ruby D ffffffff8160f0e0 0 15354 15314 0x00000000
ffff8802fb08bc58 0000000000000082 ffff8802fb08bfd8 0000000000012900
ffff8802fb08a010 0000000000012900 0000000000012900 0000000000012900
ffff8802fb08bfd8 0000000000012900 ffff8803324b9480 ffff88032c6f14c0
Call Trace:
[<ffffffff814e5a19>] schedule+0x29/0x70
[<ffffffff814e3d85>] schedule_timeout+0x195/0x220
[<ffffffffa06b9bc1>] ? _dm_request+0x111/0x160 [dm_mod]
[<ffffffff814e589e>] wait_for_common+0x11e/0x190
[<ffffffff8107a170>] ? try_to_wake_up+0x2b0/0x2b0
[<ffffffff814e59ed>] wait_for_completion+0x1d/0x20
[<ffffffff81233289>] blkdev_issue_discard+0x219/0x260
[<ffffffff81233e79>] blkdev_ioctl+0x6e9/0x7b0
[<ffffffff8119a65c>] block_ioctl+0x3c/0x40
[<ffffffff8117539c>] do_vfs_ioctl+0x8c/0x340
[<ffffffff8119a547>] ? block_llseek+0x67/0xb0
[<ffffffff811756f1>] sys_ioctl+0xa1/0xb0
[<ffffffff810561f6>] ? sys_rt_sigprocmask+0x86/0xd0
[<ffffffff814ef099>] system_call_fastpath+0x16/0x1b
The thinp-test-suite's test_discard_random_sectors reliably hits this
deadlock on fast SSD storage.
The fix for this race is that the all_io_entry for a bio must be
incremented whilst the dm_bio_prison_cell is held for the bio's
associated virtual and physical blocks. That cell locking wasn't
occurring early enough in thin_bio_map. This patch fixes this.
Care is taken to always call the new function inc_all_io_entry() with
the relevant cells locked, but they are generally unlocked before
calling issue() to try to avoid holding the cells locked across
generic_submit_request.
Also, now that thin_bio_map may lock bios in a cell, process_bio() is no
longer the only thread that will do so. Because of this we must be sure
to use cell_defer_except() to release all non-holder entries, that
were added by the other thread, because they must be deferred.
This patch depends on "dm thin: replace dm_cell_release_singleton with
cell_defer_except".
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Cc: stable@vger.kernel.org
Change existing users of the function dm_cell_release_singleton to share
cell_defer_except instead, and then remove the now-unused function.
Everywhere that calls dm_cell_release_singleton, the bio in question
is the holder of the cell.
If there are no non-holder entries in the cell then cell_defer_except
behaves exactly like dm_cell_release_singleton. Conversely, if there
*are* non-holder entries then dm_cell_release_singleton must not be used
because those entries would need to be deferred.
Consequently, it is safe to replace use of dm_cell_release_singleton
with cell_defer_except.
This patch is a pre-requisite for "dm thin: fix race between
simultaneous io and discards to same block".
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
WRITE SAME bios are not yet handled correctly by device-mapper so
disable their use on device-mapper devices by setting
max_write_same_sectors to zero.
As an example, a ciphertext device is incompatible because the data
gets changed according to the location at which it written and so the
dm crypt target cannot support it.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
Cc: Milan Broz <mbroz@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Abort dm ioctl processing if userspace changes the data_size parameter
after we validated it but before we finished copying the data buffer
from userspace.
The dm ioctl parameters are processed in the following sequence:
1. ctl_ioctl() calls copy_params();
2. copy_params() makes a first copy of the fixed-sized portion of the
userspace parameters into the local variable "tmp";
3. copy_params() then validates tmp.data_size and allocates a new
structure big enough to hold the complete data and copies the whole
userspace buffer there;
4. ctl_ioctl() reads userspace data the second time and copies the whole
buffer into the pointer "param";
5. ctl_ioctl() reads param->data_size without any validation and stores it
in the variable "input_param_size";
6. "input_param_size" is further used as the authoritative size of the
kernel buffer.
The problem is that userspace code could change the contents of user
memory between steps 2 and 4. In particular, the data_size parameter
can be changed to an invalid value after the kernel has validated it.
This lets userspace force the kernel to access invalid kernel memory.
The fix is to ensure that the size has not changed at step 4.
This patch shouldn't have a security impact because CAP_SYS_ADMIN is
required to run this code, but it should be fixed anyway.
Reported-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Cc: stable@kernel.org
This patch fixes a compilation failure on sparc32 by renaming struct node.
struct node is already defined in include/linux/node.h. On sparc32, it
happens to be included through other dependencies and persistent-data
doesn't compile because of conflicting declarations.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Mostly just little fixes. Probably biggest part is
AVX accelerated RAID6 calculations.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.19 (GNU/Linux)
iQIVAwUAUM/w2Dnsnt1WYoG5AQKXlg/9F5juv4CjRkRRFLqZgOPBLmn/s/2Vspgh
2Kv8Jcyixd8jUQNbobZv0ahlJH/iSU61kpOE8QjLbKi5Y42vAbM0ZU2aHJ6nqGZy
HiTI8K+7kTvCK3ZXLcUQ+4oPPBNTcoTZbLWaEOmIqB1ruLddoIR7M9fG3PspVeG0
jijnXR8IfL6mr4YDXnJkEhFrneTysVik05RkKYZKyM/9r3stAoMJ9o0/EFy3OFxb
lO6mLEtvjVArXcnuf1RMCw2YKgki9Y4r73HCplgQsVFvcxcpsya4gFF+lRR5j7cO
/eMYbSQ89iWEYKh1dJ9u1nofc8fX5ia71QQyO1fkO4GXRHXPVIyBgKSbe7SaL6iG
JUMm7idUV2rZGeq3ln3k8Yor4QqHvN1n7pRKKUF+ZdsPoQ1B/TABu+qpsAdo5ZhP
fxDsULsHrzEaxgetd4V8F2Uptca9ni43sMI8mwsvVlA0p6SOzMIyoJLC9xAZpx11
b3H3+7Oje/fasmszBoq5B9uAlSt9XXVN4DDn2q6cX+S96JSX6jcsN1c6cJBO+ZxB
OU6a6P5mnU6HuxU02rspe7G8BeU+ybaonErOW+GdyC4r7M/cImC0dSp0NGHK2211
oqu0xBx/Q/ddTFwKQqa4HzR2ws09+LhKbjdqYIhCEKttIbLIAjf73ARZ19XPSRRX
pDR/ey2CB6E=
=uK52
-----END PGP SIGNATURE-----
Merge tag 'md-3.8' of git://neil.brown.name/md
Pull md update from Neil Brown:
"Mostly just little fixes. Probably biggest part is AVX accelerated
RAID6 calculations."
* tag 'md-3.8' of git://neil.brown.name/md:
md/raid5: add blktrace calls
md/raid5: use async_tx_quiesce() instead of open-coding it.
md: Use ->curr_resync as last completed request when cleanly aborting resync.
lib/raid6: build proper files on corresponding arch
lib/raid6: Add AVX2 optimized gen_syndrome functions
lib/raid6: Add AVX2 optimized recovery functions
md: Update checkpoint of resync/recovery based on time.
md:Add place to update ->recovery_cp.
md.c: re-indent various 'switch' statements.
md: close race between removing and adding a device.
md: removed unused variable in calc_sb_1_csm.
Pull block driver update from Jens Axboe:
"Now that the core bits are in, here are the driver bits for 3.8. The
branch contains:
- A huge pile of drbd bits that were dumped from the 3.7 merge
window. Following that, it was both made perfectly clear that
there is going to be no more over-the-wall pulls and how the
situation on individual pulls can be improved.
- A few cleanups from Akinobu Mita for drbd and cciss.
- Queue improvement for loop from Lukas. This grew into adding a
generic interface for waiting/checking an even with a specific
lock, allowing this to be pulled out of md and now loop and drbd is
also using it.
- A few fixes for xen back/front block driver from Roger Pau Monne.
- Partition improvements from Stephen Warren, allowing partiion UUID
to be used as an identifier."
* 'for-3.8/drivers' of git://git.kernel.dk/linux-block: (609 commits)
drbd: update Kconfig to match current dependencies
drbd: Fix drbdsetup wait-connect, wait-sync etc... commands
drbd: close race between drbd_set_role and drbd_connect
drbd: respect no-md-barriers setting also when changed online via disk-options
drbd: Remove obsolete check
drbd: fixup after wait_even_lock_irq() addition to generic code
loop: Limit the number of requests in the bio list
wait: add wait_event_lock_irq() interface
xen-blkfront: free allocated page
xen-blkback: move free persistent grants code
block: partition: msdos: provide UUIDs for partitions
init: reduce PARTUUID min length to 1 from 36
block: store partition_meta_info.uuid as a string
cciss: use check_signature()
cciss: cleanup bitops usage
drbd: use copy_highpage
drbd: if the replication link breaks during handshake, keep retrying
drbd: check return of kmalloc in receive_uuids
drbd: Broadcast sync progress no more often than once per second
drbd: don't try to clear bits once the disk has failed
...
Pull trivial branch from Jiri Kosina:
"Usual stuff -- comment/printk typo fixes, documentation updates, dead
code elimination."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (39 commits)
HOWTO: fix double words typo
x86 mtrr: fix comment typo in mtrr_bp_init
propagate name change to comments in kernel source
doc: Update the name of profiling based on sysfs
treewide: Fix typos in various drivers
treewide: Fix typos in various Kconfig
wireless: mwifiex: Fix typo in wireless/mwifiex driver
messages: i2o: Fix typo in messages/i2o
scripts/kernel-doc: check that non-void fcts describe their return value
Kernel-doc: Convention: Use a "Return" section to describe return values
radeon: Fix typo and copy/paste error in comments
doc: Remove unnecessary declarations from Documentation/accounting/getdelays.c
various: Fix spelling of "asynchronous" in comments.
Fix misspellings of "whether" in comments.
eisa: Fix spelling of "asynchronous".
various: Fix spelling of "registered" in comments.
doc: fix quite a few typos within Documentation
target: iscsi: fix comment typos in target/iscsi drivers
treewide: fix typo of "suport" in various comments and Kconfig
treewide: fix typo of "suppport" in various comments
...
handle_stripe_expansion contains:
if (tx) {
async_tx_ack(tx);
dma_wait_for_async_tx(tx);
}
which is very similar to the body of async_tx_quiesce(),
except that the later handles an error from dma_wait_for_async_tx()
(admittedly by panicing, but that decision belongs in the dma
code, not the md code).
So just us async_tx_quiesce().
Acked-by: Dan Williams <djbw@fb.com>
Reported-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Signed-off-by: NeilBrown <neilb@suse.de>
If a resync is aborted cleanly, ->curr_resync is a reliable
record of where we got up to.
If there was an error it is less reliable but we always know that
->curr_resync_completed is safe.
So add a flag MD_RECOVERY_ERROR to differentiate between these cases
and set recovery_cp accordingly.
Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
md will current only only checkpoint recovery or resync ever 1/16th
of the device size. As devices get larger this can become a long time
an so a lot of work that might need to be duplicated after a shutdown.
So add a time-based checkpoint. Every 5 minutes limits the amount of
duplicated effort to at most 5 minutes, and has almost zero impact on
performance.
[changelog entry re-written by NeilBrown]
Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
In resyncing, recovery_cp only updated when resync aborted or completed.
But in md drives,many place used it to judge.So add a place to update.
Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
When we remove a device from an md array, the final removal of
the "dev-XX" sys entry is run asynchronously.
If we then re-add that device immediately before the worker thread
gets to run, we can end up trying to add the "dev-XX" sysfs entry back
before it has been removed.
So in both places where we add a device, call
flush_workqueue(md_misc_wq);
before taking the md lock (as holding the md lock can prevent removal
to complete).
Signed-off-by: NeilBrown <neilb@suse.de>
New wait_event{_interruptible}_lock_irq{_cmd} macros added. This commit
moves the private wait_event_lock_irq() macro from MD to regular wait
includes, introduces new macro wait_event_lock_irq_cmd() instead of using
the old method with omitting cmd parameter which is ugly and makes a use
of new macros in the MD. It also introduces the _interruptible_ variant.
The use of new interface is when one have a special lock to protect data
structures used in the condition, or one also needs to invoke "cmd"
before putting it to sleep.
All new macros are expected to be called with the lock taken. The lock
is released before sleep and is reacquired afterwards. We will leave the
macro with the lock held.
Note to DM: IMO this should also fix theoretical race on waitqueue while
using simultaneously wait_event_lock_irq() and wait_event() because of
lack of locking around current state setting and wait queue removal.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Cc: Neil Brown <neilb@suse.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If the raid1 or raid10 unplug function gets called
from a make_request function (which is very possible) when
there are bios on the current->bio_list list, then it will not
be able to successfully call bitmap_unplug() and it could
need to submit more bios and wait for them to complete.
But they won't complete while current->bio_list is non-empty.
So detect that case and handle the unplugging off to another thread
just like we already do when called from within the scheduler.
RAID1 version of bug was introduced in 3.6, so that part of fix is
suitable for 3.6.y. RAID10 part won't apply.
Cc: stable@vger.kernel.org
Reported-by: Torsten Kaiser <just.for.lkml@googlemail.com>
Reported-by: Peter Maloney <peter.maloney@brockmann-consult.de>
Signed-off-by: NeilBrown <neilb@suse.de>
- raid5 discard has problems
- raid10 replacement devices have problems
- bad block lock seqlock usage has problems
- dm-raid doesn't free everything
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.19 (GNU/Linux)
iQIVAwUAUK/PfTnsnt1WYoG5AQJlFBAAry6TrfIEed7Sz1BwY0w1Ofd5ZFt6DCN3
CXc6yi7LQhaMAUYsMcF07BFfuphal0St68vwckFkd1jPShUgruetzsUPLdS1+cql
AKOQZmJegN+yvpf+N6PxER8z0Ju8M0RNVCvgRZB166ujmoEHGf7A564Hby+FINpZ
zk1d5eVtcRL05oV0NbeLaX8bNp42nNx2wwvFtM6NEVF4vwbzGzXkC9ePQ6oERJvQ
Oqsu6F+TzqztIPYk/fbl1Yr/FPVAWXi4dR7KNxs/jHFcnWPi9vKcjjh1jrq46rNy
xQY+y0xW6FlN0uApIKT6NC3UWutgwOGUqRdCRc4LJ1nT6aHVIn5OCIsipgRrlV0O
da5pM+rgIMJK3kyT6NjhtuWuQZE4P4OSOmnq5q81VT9XOKADVsFOfibtrIr8cxYS
c/8mNJVfd+cU58XNKGIEt886DsN+uzWiY8U8HZVckfeVxrBTIPas4ERXlurx+G1D
jhXqK8TuEfi6ILNdBlWPphAr2ytFqWWpQIGXgYGHEIJp5WaUHoEoEblznl1MiRlZ
+tYIYy0SRkcZuxs6nUNF8Or5vFidjvaIFJPjIJwSIhwgzkaV+YFad4GfI7/WgWaq
7VU12MG7UlXLlaGN1Yadvh3jAk7L45DPzWUa/Zgvvtrvvdp3JU7VQhD8d6oc/kxD
3IOrUdAXWxU=
=fznK
-----END PGP SIGNATURE-----
Merge tag 'md-3.7-fixes' of git://neil.brown.name/md
Pull md fixes from NeilBrown:
"Several bug fixes for md in 3.7:
- raid5 discard has problems
- raid10 replacement devices have problems
- bad block lock seqlock usage has problems
- dm-raid doesn't free everything"
* tag 'md-3.7-fixes' of git://neil.brown.name/md:
md/raid10: decrement correct pending counter when writing to replacement.
md/raid10: close race that lose writes lost when replacement completes.
md/raid5: Make sure we clear R5_Discard when discard is finished.
md/raid5: move resolving of reconstruct_state earlier in stripe_handle.
md/raid5: round discard alignment up to power of 2.
md: make sure everything is freed when dm-raid stops an array.
md: Avoid write invalid address if read_seqretry returned true.
md: Reassigned the parameters if read_seqretry returned true in func md_is_badblock.
Request based dm attempts to re-run the request queue off the
request completion path. If used with a driver that potentially does
end_io from its request_fn, we could deadlock trying to recurse
back into request dispatch. Fix this by punting the request queue
run to kblockd.
Tested to fix a quickly reproducible deadlock in such a scenario.
Cc: stable@kernel.org
Acked-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When a write to a replacement device completes, we carefully
and correctly found the rdev that the write actually went to
and the blithely called rdev_dec_pending on the primary rdev,
even if this write was to the replacement.
This means that any writes to an array while a replacement
was ongoing would cause the nr_pending count for the primary
device to go negative, so it could never be removed.
This bug has been present since replacement was introduced in
3.3, so it is suitable for any -stable kernel since then.
Reported-by: "George Spelvin" <linux@horizon.com>
Cc: stable@vger.kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
When a replacement operation completes there is a small window
when the original device is marked 'faulty' and the replacement
still looks like a replacement. The faulty should be removed and
the replacement moved in place very quickly, bit it isn't instant.
So the code write out to the array must handle the possibility that
the only working device for some slot in the replacement - but it
doesn't. If the primary device is faulty it just gives up. This
can lead to corruption.
So make the code more robust: if either the primary or the
replacement is present and working, write to them. Only when
neither are present do we give up.
This bug has been present since replacement was introduced in
3.3, so it is suitable for any -stable kernel since then.
Reported-by: "George Spelvin" <linux@horizon.com>
Cc: stable@vger.kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
commit 9e44476851
MD: raid5 avoid unnecessary zero page for trim
change raid5 to clear R5_Discard when the complete request is
handled rather than when submitting the per-device discard request.
However it did not clear R5_Discard for the parity device.
This means that if the stripe_head was reused before it expired from
the cache, the setting would be wrong and a hang would result.
Also if the R5_Uptodate bit happens to be set, R5_Discard again
won't be cleared. But R5_Uptodate really should be clear at this point.
So make sure R5_Discard is cleared in all cases, and clear
R5_Uptodate when a 'discard' completes.
Signed-off-by: NeilBrown <neilb@suse.de>
stripe_handle.
The chunk of code in stripe_handle which responds to a
*_result value in reconstruct_state is really the completion
of some processing that happened outside of handle_stripe
(possibly asynchronously) and so should be one of the first
things done in handle_stripe().
After the next patch it will be important that it happens before
handle_stripe_clean_event(), as that will clear some dev->flags
bit that this code tests.
Signed-off-by: NeilBrown <neilb@suse.de>
blkdev_issue_discard currently assumes that the granularity
is a power of 2. So in raid5, round the chosen number up to
avoid embarrassment.
Cc: Shaohua Li <shli@kernel.org>
Signed-off-by: NeilBrown <neilb@suse.de>
md_stop() would stop an array, but not free various attached
data structures.
For internal arrays, these are freed later in do_md_stop() or
mddev_put(), but they don't apply for dm-raid arrays.
So get md_stop() to free them, and only all it from dm-raid.
For internal arrays we now call __md_stop.
Reported-by: majianpeng <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
If read_seqretry returned true and bbp was changed, it will write
invalid address which can cause some serious problem.
This bug was introduced by commit v3.0-rc7-130-g2699b67.
So fix is suitable for 3.0.y thru 3.6.y.
Reported-by: zhuwenfeng@kedacom.com
Tested-by: zhuwenfeng@kedacom.com
Cc: stable@vger.kernel.org
Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
This bug was introduced by commit(v3.0-rc7-126-g2230dfe).
So fix is suitable for 3.0.y thru 3.6.y.
Cc: stable@vger.kernel.org
Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Commit 2863b9eb didn't take into account the changes to add TRIM support to
RAID10 (commit 532a2a3fb). That is, when using dm-raid.c to create the
RAID10 arrays, there is no mddev->gendisk or mddev->queue. The code added
to support TRIM simply assumes that mddev->queue is available without
checking. The result is an oops any time dm-raid.c attempts to create a
RAID10 device.
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
setup_conf in raid1.c uses conf->raid_disks before assigning
a value. It is used when including 'Replacement' devices.
The consequence is that assembling an array which contains a
replacement will misbehave and either not include the replacement, or
not include the device being replaced.
Though this doesn't lead directly to data corruption, it could lead to
reduced data safety.
So use mddev->raid_disks, which is initialised, instead.
Bug was introduced by commit c19d57980b
md/raid1: recognise replacements when assembling arrays.
in 3.3, so fix is suitable for 3.3.y thru 3.6.y.
Cc: stable@vger.kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
in:
fe86cdce block: do not artificially constrain max_sectors for stacking drivers
max_sectors defaults to UINT_MAX. md faulty wasn't using
disk_stack_limits(), so inherited this large value as well.
This triggered a bug in XFS when stressed over md_faulty, when
a very large bio_alloc() failed.
That was on an older kernel, and I can't reproduce exactly the
same thing upstream, but I think the fix is appropriate in any
case.
Thanks to Mike Snitzer for pointing out the problem.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
"discard" support, some dm-raid improvements and other assorted
bits and pieces.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.18 (GNU/Linux)
iQIVAwUAUHk6Rjnsnt1WYoG5AQKovQ//Ym0ROo5a6uekb2USLyFSdQH3TC7z0v0+
+kujrgoc4nHZU/vj5yfMvPVomEUsAhHEwTkvvCiXFFHn6cxPzC8ezm8d40xEeISX
qp6i2bPlvGURhsW1tYeD+THtY82/oyzQ4Wa/vaE1sjVLQ+caa2q7kVVgAL9Bj/Kz
aESIZjAuPxQNE1674/KR0EmMFcbpd0z1WDV+ydKlRV5jHCHGYf8OmxOenJFf+V/b
/f9p2u+NUq5BN5WLhThcysO8lPX1Y7GG8IYay3DlSt/crU24R2a2j0qh/BDoK8+t
/DceoHipbIiGxXLVjM7y+1RwPpCh75HJSZQHltPype2Z3iwtwEth9uTkEE3M2h/W
tOQEbOZku0kcgsrys7JBmpkBwkR9oZqq1kDd4YBzqW4PiGVP6z0JRH8QpjjB+mjN
47ODYIZcaEYZ+0Jj8kcVxo3gv4Xj4DWH+auSNZihTVmjQPVqrcy3CAt3CkuDzTkY
34fZVuCDiCetLGCGQKrwfMDnySVy5xOmtC6iWsEY5rExAeb0E+BCzcBvbAXzt+ef
MPDsrxWbo/ZkvpuwXOwLFTccBuRtAsFi7CM4jcow53W6XMnPpdubphNw5nylaEm1
DEzfID58mv8VHWRuW15vr7SbtROjYJkEFCIaEK3oprrRUYftZntIABcknqvcIYR+
/ULNzkRU1w4=
=XRmL
-----END PGP SIGNATURE-----
Merge tag 'md-3.7' of git://neil.brown.name/md
Pull md updates from NeilBrown:
- "discard" support, some dm-raid improvements and other assorted bits
and pieces.
* tag 'md-3.7' of git://neil.brown.name/md: (29 commits)
md: refine reporting of resync/reshape delays.
md/raid5: be careful not to resize_stripes too big.
md: make sure manual changes to recovery checkpoint are saved.
md/raid10: use correct limit variable
md: writing to sync_action should clear the read-auto state.
Subject: [PATCH] md:change resync_mismatches to atomic64_t to avoid races
md/raid5: make sure to_read and to_write never go negative.
md: When RAID5 is dirty, force reconstruct-write instead of read-modify-write.
md/raid5: protect debug message against NULL derefernce.
md/raid5: add some missing locking in handle_failed_stripe.
MD: raid5 avoid unnecessary zero page for trim
MD: raid5 trim support
md/bitmap:Don't use IS_ERR to judge alloc_page().
md/raid1: Don't release reference to device while handling read error.
raid: replace list_for_each_continue_rcu with new interface
add further __init annotations to crypto/xor.c
DM RAID: Fix for "sync" directive ineffectiveness
DM RAID: Fix comparison of index and quantity for "rebuild" parameter
DM RAID: Add rebuild capability for RAID10
DM RAID: Move 'rebuild' checking code to its own function
...
Use the recently-added bio front_pad field to allocate struct dm_target_io.
Prior to this patch, dm_target_io was allocated from a mempool. For each
dm_target_io, there is exactly one bio allocated from a bioset.
This patch merges these two allocations into one allocation: we create a
bioset with front_pad equal to the size of dm_target_io so that every
bio allocated from the bioset has sizeof(struct dm_target_io) bytes
before it. We allocate a bio and use the bytes before the bio as
dm_target_io.
_tio_cache is removed and the tio_pool mempool is now only used for
request-based devices.
This idea was introduced by Kent Overstreet.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: Kent Overstreet <koverstreet@google.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: tj@kernel.org
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Bill Pemberton <wfp5p@viridian.itc.virginia.edu>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
The bio prison code will be useful to other future DM targets so
move it to a separate module.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
The bio prison code will be useful to share with future DM targets.
Prepare to move this code into a separate module, adding a dm prefix
to structures and functions that will be exported.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Support discards when the pool's block size is not a power of 2.
The block layer assumes discard_granularity is a power of 2 (in
blkdev_issue_discard), so we set this to the largest power of 2 that is
a divides into the number of sectors in each block, but never less than
DATA_DEV_BLOCK_SIZE_MIN_SECTORS.
This patch eliminates the "Discard support must be disabled when the
block size is not a power of 2" constraint that was imposed in commit
55f2b8b ("dm thin: support for non power of 2 pool blocksize"). That
commit was incomplete: using a block size that is not a power of 2
shouldn't mean disabling discard support on the device completely.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Convert cpu_to_le32(le32_to_cpu(E1) + E2) to use le32_add_cpu().
dpatch engine is used to auto generate this patch.
(https://github.com/weiyj/dpatch)
Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Use the ACCESS_ONCE macro in dm-bufio and dm-verity where a variable
can be modified asynchronously (through sysfs) and we want to prevent
compiler optimizations that assume that the variable hasn't changed.
(See Documentation/atomic_ops.txt.)
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Use list_move() instead of list_del() + list_add().
spatch with a semantic match was used to find this.
(http://coccinelle.lip6.fr/)
Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
The mpio dereference should be moved below the BUG_ON NULL test
in multipath_end_io().
spatch with a semantic match was used to found this.
(http://coccinelle.lip6.fr/)
Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
If 'resync_max' is set to 0 (as is often done when starting a
reshape, so the mdadm can remain in control during a sensitive
period), and if the reshape request is initially delayed because
another array using the same array is resyncing or reshaping etc,
when user-space cannot easily tell when the delay changes from being
due to a conflicting reshape, to being due to resync_max = 0.
So introduce a new state: (curr_resync == 3) to reflect this, make
sure it is visible both via /proc/mdstat and via the "sync_completed"
sysfs attribute, and ensure that the event transition from one delay
state to the other is properly notified.
Signed-off-by: NeilBrown <neilb@suse.de>
When a RAID5 is reshaping, conf->raid_disks is increased
before mddev->delta_disks becomes zero.
This can result in check_reshape calling resize_stripes with a
number that is too large. This particularly happens
when md_check_recovery calls ->check_reshape().
If we use ->previous_raid_disks, we don't risk this.
Signed-off-by: NeilBrown <neilb@suse.de>
If you make an array bigger but suppress resync of the new region with
mdadm --grow /dev/mdX --size=max --assume-clean
then stop the array before anything is written to it, the effect of
the "--assume-clean" is lost and the array will resync the new space
when restarted.
So ensure that we update the metadata in the case.
Reported-by: Sebastian Riemer <sebastian.riemer@profitbricks.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Clang complains that we are assigning a variable to itself. This should
be using bad_sectors like the similar earlier check does.
Bug has been present since 3.1-rc1. It is minor but could
conceivably cause corruption or other bad behaviour.
Cc: stable@vger.kernel.org
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: NeilBrown <neilb@suse.de>
In some cases array are started in 'read-auto' state where in
nothing gets written to any device until the array is written
to. The purpose of this is to make accidental auto-assembly
of the wrong arrays less of a risk, and to allow arrays to be
started to read suspend-to-disk images without actually changing
anything (as might happen if the array were dirty and a
resync seemed necessary).
Explicitly writing the 'sync_action' for a read-auto array currently
doesn't clear the read-auto state, so the sync action doesn't
happen, which can be confusing.
So allow any successful write to sync_action to clear any read-auto
state.
Reported-by: Alexander Kühn <alexander.kuehn@nagilum.de>
Signed-off-by: NeilBrown <neilb@suse.de>
Now that multiple threads can handle stripes, it is safer to
use an atomic64_t for resync_mismatches, to avoid update races.
Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
to_read and to_write are part of the result of analysing
a stripe before handling it.
Their use is to avoid some loops and tests if the values are
known to be zero. Thus it is not a problem if they are a
little bit larger than they should be.
So decrementing them in handle_failed_stripe serves little value, and
due to races it could cause some loops to be skipped incorrectly.
So remove those decrements.
Reported-by: "Jianpeng Ma" <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
The pr_debug in add_stripe_bio could race with something
changing *bip, so it is best to hold the lock until
after the pr_debug.
Reported-by: "Jianpeng Ma" <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
We really should hold the stripe_lock while accessing
'toread' else we could race with add_stripe_bio and corrupt
a list.
Reported-by: "Jianpeng Ma" <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
We want to avoid zero discarded dev page, because it's useless for discard.
But if we don't zero it, another read/write hit such page in the cache and will
get inconsistent data.
To avoid zero the page, we don't set R5_UPTODATE flag after construction is
done. In this way, discard write request is still issued and finished, but read
will not hit the page. If the stripe gets accessed soon, we need reread the
stripe, but since the chance is low, the reread isn't a big deal.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Discard for raid4/5/6 has limitation. If discard request size is
small, we do discard for one disk, but we need calculate parity and
write parity disk. To correctly calculate parity, zero_after_discard
must be guaranteed. Even it's true, we need do discard for one disk
but write another disks, which makes the parity disks wear out
fast. This doesn't make sense. So an efficient discard for raid4/5/6
should discard all data disks and parity disks, which requires the
write pattern to be (A, A+chunk_size, A+chunk_size*2...). If A's size
is smaller than chunk_size, such pattern is almost impossible in
practice. So in this patch, I only handle the case that A's size
equals to chunk_size. That is discard request should be aligned to
stripe size and its size is multiple of stripe size.
Since we can only handle request with specific alignment and size (or
part of the request fitting stripes), we can't guarantee
zero_after_discard even zero_after_discard is true in low level
drives.
The block layer doesn't send down correctly aligned requests even
correct discard alignment is set, so I must filter out.
For raid4/5/6 parity calculation, if data is 0, parity is 0. So if
zero_after_discard is true for all disks, data is consistent after
discard. Otherwise, data might be lost. Let's consider a scenario:
discard a stripe, write data to one disk and write parity disk. The
stripe could be still inconsistent till then depending on using data
from other data disks or parity disks to calculate new parity. If the
disk is broken, we can't restore it. So in this patch, we only enable
discard support if all disks have zero_after_discard.
If discard fails in one disk, we face the similar inconsistent issue
above. The patch will make discard follow the same path as normal
write request. If discard fails, a resync will be scheduled to make
the data consistent. This isn't good to have extra writes, but data
consistency is important.
If a subsequent read/write request hits raid5 cache of a discarded
stripe, the discarded dev page should have zero filled, so the data is
consistent. This patch will always zero dev page for discarded request
stripe. This isn't optimal because discard request doesn't need such
payload. Next patch will avoid it.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
When we get a read error, we arrange for raid1d to handle it.
Currently we release the reference on the device. This can result
in
conf->mirrors[read_disk].rdev
being NULL in fix_read_error, if the device happens to get removed
before the read error is handled.
So instead keep the reference until the read error has been fully
handled.
Reported-by: hank <pyu@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
This patch replaces list_for_each_continue_rcu() with
list_for_each_entry_continue_rcu() to save a few lines
of code and allow removing list_for_each_continue_rcu().
Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Michael Wang <wangyun@linux.vnet.ibm.com>
Signed-off-by: NeilBrown <neilb@suse.de>
There are two table arguments that can be given to a DM RAID target
that control whether the array is forced to (re)synchronize or skip
initialization: "sync" and "nosync". When "sync" is given, we set
mddev->recovery_cp to 0 in order to cause the device to resynchronize.
This is insufficient if there is a bitmap in use, because the array
will simply look at the bitmap and see that there is no recovery
necessary.
The fix is to skip over the loading of the superblocks when "sync" is
given, causing new superblocks to be written that will force the array
to go through initialization (i.e. synchronization).
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
DM RAID: Fix comparison of index and quantity for "rebuild" parameter
The "rebuild" parameter takes an index argument that starts counting from
zero. The conditional used to validate the index was using '>' rather than
'>=', leaving the door open for an index value that would be 1 too large.
Reported-by: Neil Brown <neilb@suse.de>
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>