Commit Graph

342 Commits

Author SHA1 Message Date
NeilBrown 055d3747db md/raid10: fix failure when trying to repair a read error.
commit 58c54fcca3
     md/raid10: handle further errors during fix_read_error better.

in 3.1 added "r10_sync_page_io" which takes an IO size in sectors.
But we were passing the IO size in bytes!!!
This resulting in bio_add_page failing, and empty request being sent
down, and a consequent BUG_ON in scsi_lib.

[fix missing space in error message at same time]

This fix is suitable for 3.1.y and later.

Cc: stable@vger.kernel.org
Reported-by: Christian Balzer <chibi@gol.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-07-03 15:55:33 +10:00
NeilBrown fc448a18ae md/raid10: Don't try to recovery unmatched (and unused) chunks.
If a RAID10 has an odd number of chunks - as might happen when there
are an odd number of devices - the last chunk has no pair and so is
not mirrored.  We don't store data there, but when recovering the last
device in an array we retry to recover that last chunk from a
non-existent location.  This results in an error, and the recovery
aborts.

When we get to that last chunk we should just stop - there is nothing
more to do anyway.

This bug has been present since the introduction of RAID10, so the
patch is appropriate for any -stable kernel.

Cc: stable@vger.kernel.org
Reported-by: Christian Balzer <chibi@gol.com>
Tested-by: Christian Balzer <chibi@gol.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-07-03 10:37:30 +10:00
NeilBrown aba336bd1d md: raid1/raid10: fix problem with merge_bvec_fn
The new merge_bvec_fn which calls the corresponding function
in subsidiary devices requires that mddev->merge_check_needed
be set if any child has a merge_bvec_fn.

However were were only setting that when a device was hot-added,
not when a device was present from the start.

This bug was introduced in 3.4 so patch is suitable for 3.4.y
kernels.  However that are conflicts in raid10.c so a separate
patch will be needed for 3.4.y.

Cc: stable@vger.kernel.org
Reported-by: Sebastian Riemer <sebastian.riemer@profitbricks.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-05-31 15:56:30 +10:00
NeilBrown 63aced6102 md/raid10: Remove extras after reshape to smaller number of devices.
When a reshape which reduced the number of devices finishes
we must remove the extra devices.

So ensure  that raid10_remove_disk won't try to keep them, and
have raid10_finish_reshape clear the 'in_sync' flag.  Then
remove_and_add_spares will be able to remove them.

Reported-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-05-22 13:55:33 +10:00
NeilBrown bb63a7019d md/raid10: resize bitmap when required during reshape.
If a reshape changes the size of the array, then we can now
update the bitmap to suit - so do so.

Signed-off-by: NeilBrown <neilb@suse.de>
2012-05-22 13:55:28 +10:00
NeilBrown a4a6125a07 md: allow array to be resized while bitmap is present.
Now that bitmaps can be resized, we can allow an array to be resized
while the bitmap is present.

This only covers resizing that involves changing the effective size
of member devices, not resizing that changes the number of devices.

Signed-off-by: NeilBrown <neilb@suse.de>
2012-05-22 13:55:27 +10:00
majianpeng 5fdd2cf826 md/raid10: Fix memleak in r10buf_pool_alloc
If the allocation of rep1_bio fails, we currently don't free the 'bio'
of the same dev.

Reported by kmemleak.

Signed-off-by: majianpeng <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-05-22 13:55:03 +10:00
NeilBrown 3ea7daa5d7 md/raid10: add reshape support
A 'near' or 'offset' lay RAID10 array can be reshaped to a different
'near' or 'offset' layout, a different chunk size, and a different
number of devices.
However the number of copies cannot change.

Unlike RAID5/6, we do not support having user-space backup data that
is being relocated during a 'critical section'.  Rather, the
data_offset of each device must change so that when writing any block
to a new location, it will not over-write any data that is still
'live'.

This means that RAID10 reshape is not supportable on v0.90 metadata.

The different between the old data_offset and the new_offset must be
at least the larger of the chunksize multiplied by offset copies of
each of the old and new layout. (for 'near' mode, offset_copies == 1).

A larger difference of around 64M seems useful for in-place reshapes
as more data can be moved between metadata updates.
Very large differences (e.g. 512M) seem to slow the process down due
to lots of long seeks (on oldish consumer graded devices at least).

Metadata needs to be updated whenever the place we are about to write
to is considered - by the current metadata - to still contain data in
the old layout.

[unbalanced locking fix from Dan Carpenter <dan.carpenter@oracle.com>]

Signed-off-by: NeilBrown <neilb@suse.de>
2012-05-22 13:53:47 +10:00
NeilBrown deb200d085 md/raid10: split out interpretation of layout to separate function.
We will soon be interpreting the layout (and chunksize etc) from
multiple places to support reshape.  So split it out into separate
function.

Signed-off-by: NeilBrown <neilb@suse.de>
2012-05-21 09:28:33 +10:00
NeilBrown f8c9e74ff0 md/raid10: Introduce 'prev' geometry to support reshape.
When RAID10 supports reshape it will need a 'previous' and a 'current'
geometry, so introduce that here.
Use the 'prev' geometry when before the reshape_position, and the
current 'geo' when beyond it.  At other times, use both as
appropriate.

For now, both are identical (And reshape_position is never set).

When we use the 'prev' geometry, we must use the old data_offset.
When we use the current (And a reshape is happening) we must use
the new_data_offset.

Signed-off-by: NeilBrown <neilb@suse.de>
2012-05-21 09:28:33 +10:00
NeilBrown 5cf00fcd3c md/raid10: collect some geometry fields into a dedicated structure.
We will shortly be adding reshape support for RAID10 which will
require it having 2 concurrent geometries (before and after).
To make that easier, collect most geometry fields into 'struct geom'
and access them from there.  Then we will more easily be able to add
a second set of fields.

Note that 'copies' is not in this struct and so cannot be changed.
There is little need to change this number and doing so is a lot
more difficult as it requires reallocating more things.
So leave it out for now.

Signed-off-by: NeilBrown <neilb@suse.de>
2012-05-21 09:28:20 +10:00
NeilBrown c6563a8c38 md: add possibility to change data-offset for devices.
When reshaping we can avoid costly intermediate backup by
changing the 'start' address of the array on the device
(if there is enough room).

So as a first step, allow such a change to be requested
through sysfs, and recorded in v1.x metadata.

(As we didn't previous check that all 'pad' fields were zero,
 we need a new FEATURE flag for this.
 A (belatedly) check that all remaining 'pad' fields are
 zero to avoid a repeat of this)

The new data offset must be requested separately for each device.
This allows each to have a different change in the data offset.
This is not likely to be used often but as data_offset can be
set per-device, new_data_offset should be too.

This patch also removes the 'acknowledged' arg to rdev_set_badblocks as
it is never used and never will be.  At the same time we add a new
arg ('in_new') which is currently always zero but will be used more
soon.

When a reshape finishes we will need to update the data_offset
and rdev->sectors.  So provide an exported function to do that.

Signed-off-by: NeilBrown <neilb@suse.de>
2012-05-21 09:27:00 +10:00
NeilBrown b0d634d568 md/raid10: fix transcription error in calc_sectors conversion.
The old code was
		sector_div(stride, fc);
the new code was
		sector_dir(size, conf->near_copies);

'size' is right (the stride various wasn't really needed), but
'fc' means 'far_copies', and that is an important difference.

Signed-off-by: NeilBrown <neilb@suse.de>
2012-05-19 09:01:13 +10:00
NeilBrown 6508fdbf40 md/raid10: set dev_sectors properly when resizing devices in array.
raid10 stores dev_sectors in 'conf' separately from the one in
'mddev' because it can have a very significant effect on block
addressing and so need to be updated carefully.

However raid10_resize isn't updating it at all!

To update it correctly, we need to make sure it is a proper
multiple of the chunksize taking various details of the layout
in to account.
This calculation is currently done in setup_conf.   So split it
out from there and call it from raid10_resize as well.
Then set conf->dev_sectors properly.

Signed-off-by: NeilBrown <neilb@suse.de>
2012-05-17 10:08:45 +10:00
majianpeng f4380a9158 md/raid1,raid10: Fix calculation of 'vcnt' when processing error recovery.
If r1bio->sectors % 8 != 0,then the memcmp and a later
memcpy will omit the last bio_vec.

This is suitable for any stable kernel since 3.1 when bad-block
management was introduced.

Cc: stable@vger.kernel.org
Signed-off-by: majianpeng <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-04-12 16:04:47 +10:00
NeilBrown 5020ad7d14 md/raid1,raid10: don't compare excess byte during consistency check.
When comparing two pages read from different legs of a mirror, only
compare the bytes that were read, not the whole page.

In most cases we read a whole page, but in some cases with
bad blocks or odd sizes devices we might read fewer than that.

This bug has been present "forever" but at worst it might cause
a report of two many mismatches and generate a little bit
extra resync IO, so there is no need to back-port to -stable
kernels.

Reported-by: majianpeng <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-04-03 15:39:23 +10:00
NeilBrown 006a09a0ae md/raid10 - support resizing some RAID10 arrays.
'resizing' an array in this context means making use of extra
space that has become available in component devices, not adding new
devices.
It also includes shrinking the array to take up less space of
component devices.

This is not supported for array with a 'far' layout.  However
for 'near' and 'offset' layout arrays, adding and removing space at
the end of the devices is easy to support, and this patch provides
that support.

Signed-off-by: NeilBrown <neilb@suse.de>
2012-03-19 12:46:40 +11:00
NeilBrown 050b66152f md/raid10: handle merge_bvec_fn in member devices.
Currently we don't honour merge_bvec_fn in member devices so if there
is one, we force all requests to be single-page at most.
This is not ideal.

So enhance the raid10 merge_bvec_fn to check that function in children
as well.

This introduces a small problem.  There is no locking around calls
the ->merge_bvec_fn and subsequent calls to ->make_request.  So a
device added between these could end up getting a request which
violates its merge_bvec_fn.

Currently the best we can do is synchronize_sched().  This will work
providing no preemption happens.  If there is preemption, we just
have to hope that new devices are largely consistent with old devices.

Signed-off-by: NeilBrown <neilb@suse.de>
2012-03-19 12:46:39 +11:00
NeilBrown dafb20fa34 md: tidy up rdev_for_each usage.
md.h has an 'rdev_for_each()' macro for iterating the rdevs in an
mddev.  However it uses the 'safe' version of list_for_each_entry,
and so requires the extra variable, but doesn't include 'safe' in the
name, which is useful documentation.

Consequently some places use this safe version without needing it, and
many use an explicity list_for_each entry.

So:
 - rename rdev_for_each to rdev_for_each_safe
 - create a new rdev_for_each which uses the plain
   list_for_each_entry,
 - use the 'safe' version only where needed, and convert all other
   list_for_each_entry calls to use rdev_for_each.

Signed-off-by: NeilBrown <neilb@suse.de>
2012-03-19 12:46:39 +11:00
NeilBrown d6b42dcb99 md/raid1,raid10: avoid deadlock during resync/recovery.
If RAID1 or RAID10 is used under LVM or some other stacking
block device, it is possible to enter a deadlock during
resync or recovery.
This can happen if the upper level block device creates
two requests to the RAID1 or RAID10.  The first request gets
processed, blocks recovery and queue requests for underlying
requests in current->bio_list.  A resync request then starts
which will wait for those requests and block new IO.

But then the second request to the RAID1/10 will be attempted
and it cannot progress until the resync request completes,
which cannot progress until the underlying device requests complete,
which are on a queue behind that second request.

So allow that second request to proceed even though there is
a resync request about to start.

This is suitable for any -stable kernel.

Cc: stable@vger.kernel.org
Reported-by: Ray Morris <support@bettercgi.com>
Tested-by: Ray Morris <support@bettercgi.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-03-19 12:46:38 +11:00
NeilBrown dc10c643e8 md: allow re-add to failed arrays.
When an array is failed (some data inaccessible) then there is no
point attempting to add a spare as it could not possibly be recovered.

However that may be value in re-adding a recently removed device.
e.g. if there is a write-intent-bitmap and it is clear, then access
to the data could be restored by this action.

So don't reject a re-add to a failed array for RAID10 and RAID5 (the
only arrays  types that check for a failed array).

Signed-off-by: NeilBrown <neilb@suse.de>
2012-03-19 12:46:37 +11:00
NeilBrown 547414d19f md/raid10: remove unnecessary smp_mb() from end_sync_write
Recent commit 4ca40c2ce0 (md/raid10: Allow replacement device ...)
added an smp_mb in end_sync_write.
This was to close a possible race with raid10_remove_disk.
However there is no such race as it is never attempted to remove a
disk while resync (or recovery) is happening.
so the smp_mb is just noise.

Signed-off-by: NeilBrown <neilb@suse.de>
2012-03-13 11:21:20 +11:00
NeilBrown 7a90484825 md/raid10: fix assembling of arrays with replacement devices.
commit 56a2559bb6 (md/raid10: recognise replacements ...)
changed 'run' to set ->replacement or ->rdev depending on the
'Replacement' status if the device, but it didn't remove the
old unconditional setting of 'rdev'.  So it was largely ineffective.

So remove that now.

Signed-off-by: NeilBrown <neilb@suse.de>
2012-03-06 10:12:45 +11:00
NeilBrown fae8cc5ed0 md/raid10: fix handling of error on last working device in array.
If we get a read error on the last working device in a RAID10 which
contains the target block, then we don't fail the device (which is
good) but we don't abort retries, which is wrong.
We end up in an infinite loop retrying the read on the one device.

This patch fixes the problem in two places:
1/ in raid10_end_read_request we don't even ask for a retry if this
   was the last usable device.  This is efficient but a little racy
   and will sometimes retry when it should not.

2/ in handle_read_error we are careful to exclude any device from
   retry which we tried to mark as faulty (that might have failed if
   it was the last device).  This is race-free but less efficient.

Signed-off-by: NeilBrown <neilb@suse.de>
2012-02-14 11:10:10 +11:00
NeilBrown b7044d41b5 md/raid10: If there is a spare and a want_replacement device, start replacement.
When attempting to add a spare to a RAID10 array, also consider
adding it as a replacement for a want_replacement device.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-12-23 10:17:56 +11:00
NeilBrown 56a2559bb6 md/raid10: recognise replacements when assembling array.
If a Replacement is seen, file it as such.

If we see two replacements (or two normal devices) for the one slot,
abort.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-12-23 10:17:55 +11:00
NeilBrown 4ca40c2ce0 md/raid10: Allow replacement device to be replace old drive.
When recovery finish and spare_active is called, check for a
replace that might have just become fully synced and mark it
as such, marking the original as failed.

Then when the original is removed, move the replacement into
its position.

This means that 'replacement' and spontaneously become NULL in some
situations.  Make sure we check for those.
It also means that 'rdev' and 'replacement' could appear to be
identical - check for that too.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-12-23 10:17:55 +11:00
NeilBrown 24afd80d99 md/raid10: handle recovery of replacement devices.
If there is a replacement device, then recover to it,
reading from any drives - maybe the one being replaced, maybe not.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-12-23 10:17:55 +11:00
NeilBrown 9ad1aefc8a md/raid10: Handle replacement devices during resync.
If we need to resync an array which has replacement devices,
we always write any block checked to every replacement.

If the resync was bitmap-based resync we will then complete the
replacement normally.
If it was a full resync, we mark the replacements as fully recovered
when the resync finishes so no further recovery is needed.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-12-23 10:17:55 +11:00
NeilBrown 475b0321a4 md/raid10: writes should get directed to replacement as well as original.
When writing, we need to submit two writes, one to the original,
and one to the replacements - if there is a replacement.

If the write to the replacement results in a write error we just
fail the device.  We only try to record write errors to the
original.

This only handles writing new data.  Writing for resync/recovery
will come later.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-12-23 10:17:55 +11:00
NeilBrown c8ab903ea9 md/raid10: allow removal of failed replacement devices.
Enhance raid10_remove_disk to be able to remove ->replacement
as well as ->rdev

Signed-off-by: NeilBrown <neilb@suse.de>
2011-12-23 10:17:54 +11:00
NeilBrown abbf098e6e md/raid10: preferentially read from replacement device if possible.
When reading (for array reads, not for recovery etc) we read from the
replacement device if it has recovered far enough.
This requires storing the chosen rdev in the 'r10_bio' so we can make
sure to drop the ref on the right device when the read finishes.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-12-23 10:17:54 +11:00
NeilBrown 96c3fd1f38 md/raid10: change read_balance to return an rdev
It makes more sense to return an rdev than just an index as
read_balance() gets a reference to the rdev and so returning
the pointer make this more idiomatic.

This will be needed in a future patch when we might return
a 'replacement' rdev instead of the main rdev.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-12-23 10:17:54 +11:00
NeilBrown 69335ef3bc md/raid10: prepare data structures for handling replacement.
Allow each slot in the RAID10 to have 2 devices, the want_replacement
and the replacement.

Also an r10bio to have 2 bios, and for resync/recovery allocate the
second bio if there are any replacement devices.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-12-23 10:17:54 +11:00
NeilBrown b8321b68d1 md: change hot_remove_disk to take an rdev rather than a number.
Soon an array will be able to have multiple devices with the
same raid_disk number (an original and a replacement).  So removing
a device based on the number won't work.  So pass the actual device
handle instead.

Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2011-12-23 10:17:51 +11:00
Linus Torvalds 32aaeffbd4 Merge branch 'modsplit-Oct31_2011' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux
* 'modsplit-Oct31_2011' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux: (230 commits)
  Revert "tracing: Include module.h in define_trace.h"
  irq: don't put module.h into irq.h for tracking irqgen modules.
  bluetooth: macroize two small inlines to avoid module.h
  ip_vs.h: fix implicit use of module_get/module_put from module.h
  nf_conntrack.h: fix up fallout from implicit moduleparam.h presence
  include: replace linux/module.h with "struct module" wherever possible
  include: convert various register fcns to macros to avoid include chaining
  crypto.h: remove unused crypto_tfm_alg_modname() inline
  uwb.h: fix implicit use of asm/page.h for PAGE_SIZE
  pm_runtime.h: explicitly requires notifier.h
  linux/dmaengine.h: fix implicit use of bitmap.h and asm/page.h
  miscdevice.h: fix up implicit use of lists and types
  stop_machine.h: fix implicit use of smp.h for smp_processor_id
  of: fix implicit use of errno.h in include/linux/of.h
  of_platform.h: delete needless include <linux/module.h>
  acpi: remove module.h include from platform/aclinux.h
  miscdevice.h: delete unnecessary inclusion of module.h
  device_cgroup.h: delete needless include <linux/module.h>
  net: sch_generic remove redundant use of <linux/module.h>
  net: inet_timewait_sock doesnt need <linux/module.h>
  ...

Fix up trivial conflicts (other header files, and  removal of the ab3550 mfd driver) in
 - drivers/media/dvb/frontends/dibx000_common.c
 - drivers/media/video/{mt9m111.c,ov6650.c}
 - drivers/mfd/ab3550-core.c
 - include/linux/dmaengine.h
2011-11-06 19:44:47 -08:00
Linus Torvalds b4fdcb02f1 Merge branch 'for-3.2/core' of git://git.kernel.dk/linux-block
* 'for-3.2/core' of git://git.kernel.dk/linux-block: (29 commits)
  block: don't call blk_drain_queue() if elevator is not up
  blk-throttle: use queue_is_locked() instead of lockdep_is_held()
  blk-throttle: Take blkcg->lock while traversing blkcg->policy_list
  blk-throttle: Free up policy node associated with deleted rule
  block: warn if tag is greater than real_max_depth.
  block: make gendisk hold a reference to its queue
  blk-flush: move the queue kick into
  blk-flush: fix invalid BUG_ON in blk_insert_flush
  block: Remove the control of complete cpu from bio.
  block: fix a typo in the blk-cgroup.h file
  block: initialize the bounce pool if high memory may be added later
  block: fix request_queue lifetime handling by making blk_queue_cleanup() properly shutdown
  block: drop @tsk from attempt_plug_merge() and explain sync rules
  block: make get_request[_wait]() fail if queue is dead
  block: reorganize throtl_get_tg() and blk_throtl_bio()
  block: reorganize queue draining
  block: drop unnecessary blk_get/put_queue() in scsi_cmd_ioctl() and blk_get_tg()
  block: pass around REQ_* flags instead of broken down booleans during request alloc/free
  block: move blk_throtl prototypes to block/blk.h
  block: fix genhd refcounting in blkio_policy_parse_and_set()
  ...

Fix up trivial conflicts due to "mddev_t" -> "struct mddev" conversion
and making the request functions be of type "void" instead of "int" in
 - drivers/md/{faulty.c,linear.c,md.c,md.h,multipath.c,raid0.c,raid1.c,raid10.c,raid5.c}
 - drivers/staging/zram/zram_drv.c
2011-11-04 17:06:58 -07:00
Paul Gortmaker 056075c764 md: Add module.h to all files using it implicitly
A pending cleanup will mean that module.h won't be implicitly
everywhere anymore.  Make sure the modular drivers in md dir
are actually calling out for <module.h> explicitly in advance.

Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2011-10-31 19:31:18 -04:00
NeilBrown 7fcc7c8acf md/raid10: Fix bug when activating a hot-spare.
This is a fairly serious bug in RAID10.

When a RAID10 array is degraded and a hot-spare is activated, the
spare does not take up the empty slot, but rather replaces the first
working device.
This is likely to make the array non-functional.   It would normally
be possible to recover the data, but that would need care and is not
guaranteed.

This bug was introduced in commit
   2bb77736ae
which first appeared in 3.1.

Cc: stable@kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
2011-10-31 12:59:44 +11:00
NeilBrown d890fa2b05 md: Fix some bugs in recovery_disabled handling.
In 3.0 we changed the way recovery_disabled was handle so that instead
of testing against zero, we test an mddev-> value against a conf->
value.
Two problems:
  1/ one place in raid1 was missed and still sets to '1'.
  2/ We didn't explicitly set the conf-> value at array creation
     time.
     It defaulted to '0' just like the mddev value does so they
     could appear equal and thus disable recovery.
     This did not affect normal 'md' as it calls bind_rdev_to_array
     which changes the mddev value.  However the dmraid interface
     doesn't call this and so doesn't change ->recovery_disabled; so at
     array start all recovery is incorrectly disabled.

So initialise the 'conf' value to one less that the mddev value, so
the will only be the same when explicitly set that way.

Reported-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown  <neilb@suse.de>
2011-10-26 11:54:39 +11:00
Jens Axboe 5c04b426f2 Merge branch 'v3.1-rc10' into for-3.2/core
Conflicts:
	block/blk-core.c
	include/linux/blkdev.h

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-10-19 14:30:42 +02:00
NeilBrown 34db0cd60f md: add proper write-congestion reporting to RAID1 and RAID10.
RAID1 and RAID10 handle write requests by queuing them for handling by
a separate thread.  This is because when a write-intent-bitmap is
active we might need to update the bitmap first, so it is good to
queue a lot of writes, then do one big bitmap update for them all.

However writeback request devices to appear to be congested after a
while so it can make some guesstimate of throughput.  The infinite
queue defeats that (note that RAID5 has already has a finite queue so
it doesn't suffer from this problem).

So impose a limit on the number of pending write requests.  By default
it is 1024 which seems to be generally suitable.  Make it configurable
via module option just in case someone finds a regression.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-10-11 16:50:01 +11:00
NeilBrown 84fc4b56db md: rename "mdk_personality" to "md_personality"
"mdk" doesn't mean anything any more.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-10-11 16:49:58 +11:00
NeilBrown e879a8793f md/raid10: typedef removal: conf_t -> struct r10conf
Signed-off-by: NeilBrown <neilb@suse.de>
2011-10-11 16:49:02 +11:00
NeilBrown e373ab1091 md/raid0: typedef removal: raid0_conf_t -> struct r0conf
Signed-off-by: NeilBrown <neilb@suse.de>
2011-10-11 16:48:59 +11:00
NeilBrown 0f6d02d580 md: remove typedefs: mirror_info_t -> struct mirror_info
Signed-off-by: NeilBrown <neilb@suse.de>
2011-10-11 16:48:46 +11:00
NeilBrown 9f2c9d12bc md: remove typedefs: r10bio_t -> struct r10bio and r1bio_t -> struct r1bio
Signed-off-by: NeilBrown <neilb@suse.de>
2011-10-11 16:48:43 +11:00
NeilBrown fd01b88c75 md: remove typedefs: mddev_t -> struct mddev
Having mddev_t and 'struct mddev_s' is ugly and not preferred

Signed-off-by: NeilBrown <neilb@suse.de>
2011-10-11 16:47:53 +11:00
NeilBrown 3cb0300200 md: removing typedefs: mdk_rdev_t -> struct md_rdev
The typedefs are just annoying. 'mdk' probably refers to 'md_k.h'
which used to be an include file that defined this thing.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-10-11 16:45:26 +11:00
NeilBrown 01f96c0a99 md: Avoid waking up a thread after it has been freed.
Two related problems:

1/ some error paths call "md_unregister_thread(mddev->thread)"
   without subsequently clearing ->thread.  A subsequent call
   to mddev_unlock will try to wake the thread, and crash.

2/ Most calls to md_wakeup_thread are protected against the thread
   disappeared either by:
      - holding the ->mutex
      - having an active request, so something else must be keeping
        the array active.
   However mddev_unlock calls md_wakeup_thread after dropping the
   mutex and without any certainty of an active request, so the
   ->thread could theoretically disappear.
   So we need a spinlock to provide some protections.

So change md_unregister_thread to take a pointer to the thread
pointer, and ensure that it always does the required locking, and
clears the pointer properly.

Reported-by: "Moshe Melnikov" <moshe@zadarastorage.com>
Signed-off-by: NeilBrown <neilb@suse.de>
cc: stable@kernel.org
2011-09-21 15:30:20 +10:00
Christoph Hellwig 5a7bbad27a block: remove support for bio remapping from ->make_request
There is very little benefit in allowing to let a ->make_request
instance update the bios device and sector and loop around it in
__generic_make_request when we can archive the same through calling
generic_make_request from the driver and letting the loop in
generic_make_request handle it.

Note that various drivers got the return value from ->make_request and
returned non-zero values for errors.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: NeilBrown <neilb@suse.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-09-12 12:12:01 +02:00
NeilBrown 079fa166a2 md/raid1,10: Remove use-after-free bug in make_request.
A single request to RAID1 or RAID10 might result in multiple
requests if there are known bad blocks that need to be avoided.

To detect if we need to submit another write request we test:
 	if (sectors_handled < (bio->bi_size >> 9)) {

However this is after we call **_write_done() so the 'bio' no longer
belongs to us - the writes could have completed and the bio freed.

So move the **_write_done call until after the test against
bio->bi_size.

This addresses https://bugzilla.kernel.org/show_bug.cgi?id=41862

Reported-by: Bruno Wolff III <bruno@wolff.to>
Tested-by: Bruno Wolff III <bruno@wolff.to>
Signed-off-by: NeilBrown <neilb@suse.de>
2011-09-10 17:21:23 +10:00
NeilBrown 19d5f834d6 md/raid10: unify handling of write completion.
A write can complete at two different places:
1/ when the last member-device write completes, through
   raid10_end_write_request
2/ in make_request() when we remove the initial bias from ->remaining.

These two should do exactly the same thing and the comment says they
do, but they don't.

So factor the correct code out into a function and call it in both
places.  This makes the code much more similar to RAID1.

The difference is only significant if there is an error, and they
usually take a while, so it is unlikely that there will be an error
already when make_request is completing, so this is unlikely to cause
real problems.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-09-10 17:21:17 +10:00
NeilBrown 58c54fcca3 md/raid10: handle further errors during fix_read_error better.
If we find more read/write errors we should record a bad block before
failing the device.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-07-28 11:39:25 +10:00
NeilBrown 5e5702898e md/raid10: Handle read errors during recovery better.
Currently when we get a read error during recovery, we simply abort
the recovery.

Instead, repeat the read in page-sized blocks.
On successful reads, write to the target.
On read errors, record a bad block on the destination,
and only if that fails do we abort the recovery.

As we now retry reads we need to know where we read from.  This was in
bi_sector but that can be changed during a read attempt.
So store the correct from_addr and to_addr in the r10_bio for later
access.


Signed-off-by: NeilBrown<neilb@suse.de>
2011-07-28 11:39:25 +10:00
NeilBrown e684e41db3 md/raid10: simplify read error handling during recovery.
If a read error is detected during recovery the code currently
fails the read device.
This isn't really necessary.  recovery_request_write will signal
a write error to end_sync_write and it will record a write
error on the destination device which will record a bad block
there or kick it from the array.

So just remove this call to do md_error.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-07-28 11:39:25 +10:00
NeilBrown 1a0b7cd826 md/raid10: record bad blocks due to write errors during resync/recovery.
If we get a write error during resync/recovery don't fail the device
but instead record a bad block.  If that fails we can then fail the
device.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-07-28 11:39:25 +10:00
NeilBrown f84ee364dd md/raid10: attempt to fix read errors during resync/check
We already attempt to fix read errors found during normal IO
and a 'repair' process.
It is best to try to repair them at any time they are found,
so move a test so that during sync and check a read error will
be corrected by over-writing with good data.

If both (all) devices have known bad blocks in the sync section we
won't try to fix even though the bad blocks might not overlap.  That
should be considered later.

Also if we hit a read error during recovery we don't try to fix it.
It would only be possible to fix if there were at least three copies
of data, which is not very common with RAID10.  But it should still
be considered later.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-07-28 11:39:25 +10:00
NeilBrown bd870a16c5 md/raid10: Handle write errors by updating badblock log.
When we get a write error (in the data area, not in metadata),
update the badblock log rather than failing the whole device.

As the write may well be many blocks, we trying writing each
block individually and only log the ones which fail.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-07-28 11:39:24 +10:00
NeilBrown 749c55e942 md/raid10: clear bad-block record when write succeeds.
If we succeed in writing to a block that was recorded as
being bad, we clear the bad-block record.

This requires some delayed handling as the bad-block-list update has
to happen in process-context.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-07-28 11:39:24 +10:00
NeilBrown d4432c23be md/raid10: avoid writing to known bad blocks on known bad drives.
Writing to known bad blocks on drives that have seen a write error
is asking for trouble.  So try to avoid these blocks.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-07-28 11:39:24 +10:00
NeilBrown e875ecea26 md/raid10 record bad blocks as needed during recovery.
When recovering one or more devices, if all the good devices have
bad blocks we should record a bad block on the device being rebuilt.

If this fails, we need to abort the recovery.

To ensure we don't think that we aborted later than we actually did,
we need to move the check for MD_RECOVERY_INTR earlier in md_do_sync,
in particular before mddev->curr_resync is updated.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-07-28 11:39:24 +10:00
NeilBrown 40c356ce5a md/raid10: avoid reading known bad blocks during resync/recovery.
During resync/recovery limit the size of the request to avoid
reading into a bad block that does not start at-or-before the current
read address.

Similarly if there is a bad block at this address, don't allow the
current request to extend beyond the end of that bad block.

Now that we don't ever read from known bad blocks, it is safe to allow
devices with those blocks into the array.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-07-28 11:39:24 +10:00
NeilBrown 8dbed5cebd md/raid10 - avoid reading from known bad blocks - part 3
When attempting to repair a read error, don't read from
devices with a known bad block.

As we are only reading PAGE_SIZE blocks, we don't try to
narrow down to smaller regions in the hope that only part of this
page is bad - it isn't worth the effort.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-07-28 11:39:24 +10:00
NeilBrown 7399c31bc9 md/raid10: avoid reading from known bad blocks - part 2
When redirecting a read error to a different device, we must
again avoid bad blocks and possibly split the request.

Spin_lock typo fixed thanks to Dan Carpenter <error27@gmail.com>

Signed-off-by: NeilBrown <neilb@suse.de>
2011-07-28 11:39:23 +10:00
NeilBrown 856e08e237 md/raid10: avoid reading from known bad blocks - part 1
This patch just covers the basic read path:
 1/ read_balance needs to check for badblocks, and return not only
    the chosen slot, but also how many good blocks are available
    there.
 2/ read submission must be ready to issue multiple reads to
    different devices as different bad blocks on different devices
    could mean that a single large read cannot be served by any one
    device, but can still be served by the array.
    This requires keeping count of the number of outstanding requests
    per bio.  This count is stored in 'bi_phys_segments'

On read error we currently just fail the request if another target
cannot handle the whole request.  Next patch refines that a bit.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-07-28 11:39:23 +10:00
NeilBrown 560f8e5532 md/raid10: Split handle_read_error out from raid10d.
raid10d() is too big and is about to get bigger, so split
handle_read_error() out as a separate function.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-07-28 11:39:23 +10:00
NeilBrown 1294b9c973 md/raid10: simplify/reindent some loops.
When a loop ends with a large if, it can be neater to change the
if to invert the condition and just 'continue'.
Then the body of the if can be indented to a lower level.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-07-28 11:39:23 +10:00
NeilBrown de393cdea6 md: make it easier to wait for bad blocks to be acknowledged.
It is only safe to choose not to write to a bad block if that bad
block is safely recorded in metadata - i.e. if it has been
'acknowledged'.

If it hasn't we need to wait for the acknowledgement.

We support that using rdev->blocked wait and
md_wait_for_blocked_rdev by introducing a new device flag
'BlockedBadBlock'.

This flag is only advisory.
It is cleared whenever we acknowledge a bad block, so that a waiter
can re-check the particular bad blocks that it is interested it.

It should be set by a caller when they find they need to wait.
This (set after test) is inherently racy, but as
md_wait_for_blocked_rdev already has a timeout, losing the race will
have minimal impact.

When we clear "Blocked" was also clear "BlockedBadBlocks" incase it
was set incorrectly (see above race).

We also modify the way we manage 'Blocked' to fit better with the new
handling of 'BlockedBadBlocks' and to make it consistent between
externally managed and internally managed metadata.   This requires
that each raidXd loop checks if the metadata needs to be written and
triggers a write (md_check_recovery) if needed.  Otherwise a queued
write request might cause raidXd to wait for the metadata to write,
and only that thread can write it.

Before writing metadata, we set FaultRecorded for all devices that
are Faulty, then after writing the metadata we clear Blocked for any
device for which the Fault was certainly Recorded.

The 'faulty' device flag now appears in sysfs if the device is faulty
*or* it has unacknowledged bad blocks.  So user-space which does not
understand bad blocks can continue to function correctly.
User space which does, should not assume a device is faulty until it
sees the 'faulty' flag, and then sees the list of unacknowledged bad
blocks is empty.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-07-28 11:31:48 +10:00
NeilBrown 34b343cff4 md: don't allow arrays to contain devices with bad blocks.
As no personality understand bad block lists yet, we must
reject any device that is known to contain bad blocks.
As the personalities get taught, these tests can be removed.

This only applies to raid1/raid5/raid10.
For linear/raid0/multipath/faulty the whole concept of bad blocks
doesn't mean anything so there is no point adding the checks.

Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Namhyung Kim <namhyung@gmail.com>
2011-07-28 11:31:47 +10:00
Namhyung Kim cbea21703b md/raid10: move rdev->corrected_errors counting
Read errors are considered to corrected if write-back and re-read
cycle is finished without further problems. Thus moving the rdev->
corrected_errors counting after the re-reading looks more reasonable
IMHO.

Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2011-07-27 11:00:36 +10:00
NeilBrown 700c721389 md/raid10: Improve decision on whether to fail a device with a read error.
Normally we would fail a device with a READ error.  However if doing
so causes the array to fail, it is better to leave the device
in place and just return the read error to the caller.

The current test for decide if the array will fail is overly
simplistic.
We have a function 'enough' which can tell if the array is failed or
not, so use it to guide the decision.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-07-27 11:00:36 +10:00
NeilBrown 2bb77736ae md/raid10: Make use of new recovery_disabled handling
When we get a read error during recovery, RAID10 previously
arranged for the recovering device to appear to fail so that
the recovery stops and doesn't restart.  This is misleading and wrong.

Instead, make use of the new recovery_disabled handling and mark
the target device and having recovery disabled.

Add appropriate checks in add_disk and remove_disk so that devices
are removed and not re-added when recovery is disabled.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-07-27 11:00:36 +10:00
Christian Dietrich 8bda470e8e md/raid: use printk_ratelimited instead of printk_ratelimit
As per printk_ratelimit comment, it should not be used.

Signed-off-by: Christian Dietrich <christian.dietrich@informatik.uni-erlangen.de>
Signed-off-by: NeilBrown <neilb@suse.de>
2011-07-27 11:00:36 +10:00
Namhyung Kim c65060ad42 md/raid10: share pages between read and write bio's during recovery
When performing a recovery, only first 2 slots in r10_bio are in use,
for read and write respectively. However all of pages in the write bio
are never used and just replaced to read bio's when the read completes.

Get rid of those unused pages and share read pages properly.

Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2011-07-18 17:38:49 +10:00
Namhyung Kim 778ca01852 md/raid10: factor out common bio handling code
When normal-write and sync-read/write bio completes, we should
find out the disk number the bio belongs to. Factor those common
code out to a separate function.

Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2011-07-18 17:38:47 +10:00
Namhyung Kim 2c4193df37 md/raid10: get rid of duplicated conditional expression
Variable 'first' is initialized to zero and updated to @rdev->raid_disk
only if it is greater than 0. Thus condition '>= first' always implies
'>= 0' so the latter is not needed.

Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2011-07-18 17:38:43 +10:00
NeilBrown ab9d47e990 md/raid10: reformat some loops with less indenting.
When a loop ends with an 'if' with a large body, it is neater
to make the if 'continue' on the inverse condition, and then
the body is indented less.

Apply this pattern 3 times, and wrap some other long lines.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-05-11 14:54:41 +10:00
NeilBrown f17ed07c85 md/raid10: remove unused variable.
This variable 'disk' is never used - how odd.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-05-11 14:54:32 +10:00
NeilBrown a8830bcaf3 md/raid10: make more use of 'slot' in raid10d.
Now that we have a 'slot' variable, make better use of it to simplify
some code a little.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-05-11 14:54:19 +10:00
NeilBrown 7c4e06ff2b md/raid10: some tidying up in fix_read_error
Currently the rdev on which a read error happened could be removed
before we perform the fix_error handling.  This requires extra tests
for NULL.

So delay the rdev_dec_pending call until after the call to
fix_read_error so that we can be sure that the rdev still exists.

This allows an 'if' clause to be removed so the body gets re-indented
back one level.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-05-11 14:53:17 +10:00
NeilBrown 56d9912106 md: simplify raid10 read_balance
raid10 read balance has two different loop for looking through
possible devices to chose the best.
Collapse those into one loop and generally make the code more
readable.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-05-11 14:27:03 +10:00
NeilBrown c3b328ac84 md: fix up raid1/raid10 unplugging.
We just need to make sure that an unplug event wakes up the md
thread, which is exactly what mddev_check_plugged does.

Also remove some plug-related code that is no longer needed.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-04-18 18:25:43 +10:00
NeilBrown e1dfa0a297 md: use new plugging interface for RAID IO.
md/raid submits a lot of IO from the various raid threads.
So adding start/finish plug calls to those so that some
plugging happens.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-04-18 18:25:41 +10:00
Lucas De Marchi 25985edced Fix common misspellings
Fixes generated by 'codespell' and manually reviewed.

Signed-off-by: Lucas De Marchi <lucas.demarchi@profusion.mobi>
2011-03-31 11:26:23 -03:00
Martin K. Petersen a91a2785b2 block: Require subsystems to explicitly allocate bio_set integrity mempool
MD and DM create a new bio_set for every metadevice. Each bio_set has an
integrity mempool attached regardless of whether the metadevice is
capable of passing integrity metadata. This is a waste of memory.

Instead we defer the allocation decision to MD and DM since we know at
metadevice creation time whether integrity passthrough is needed or not.

Automatic integrity mempool allocation can then be removed from
bioset_create() and we make an explicit integrity allocation for the
fs_bio_set.

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Reported-by: Zdenek Kabelac <zkabelac@redhat.com>
Acked-by: Mike Snitzer <snizer@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-03-17 11:11:05 +01:00
Jens Axboe 4c63f5646e Merge branch 'for-2.6.39/stack-plug' into for-2.6.39/core
Conflicts:
	block/blk-core.c
	block/blk-flush.c
	drivers/md/raid1.c
	drivers/md/raid10.c
	drivers/md/raid5.c
	fs/nilfs2/btnode.c
	fs/nilfs2/mdt.c

Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-03-10 08:58:35 +01:00
Jens Axboe 7eaceaccab block: remove per-queue plugging
Code has been converted over to the new explicit on-stack plugging,
and delay users have been converted to use the new API for that.
So lets kill off the old plugging along with aops->sync_page().

Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-03-10 08:52:07 +01:00
NeilBrown da9cf5050a md: avoid spinlock problem in blk_throtl_exit
blk_throtl_exit assumes that ->queue_lock still exists,
so make sure that it does.
To do this, we stop redirecting ->queue_lock to conf->device_lock
and leave it pointing where it is initialised - __queue_lock.

As the blk_plug functions check the ->queue_lock is held, we now
take that spin_lock explicitly around the plug functions.  We don't
need the locking, just the warning removal.

This is needed for any kernel with the blk_throtl code, which is
which is 2.6.37 and later.

Cc: stable@kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
2011-02-21 18:25:57 +11:00
Krzysztof Wojcik 02214dc546 FIX: md: process hangs at wait_barrier after 0->10 takeover
Following symptoms were observed:
1. After raid0->raid10 takeover operation we have array with 2
missing disks.
When we add disk for rebuild, recovery process starts as expected
but it does not finish- it stops at about 90%, md126_resync process
hangs in "D" state.
2. Similar behavior is when we have mounted raid0 array and we
execute takeover to raid10. After this when we try to unmount array-
it causes process umount hangs in "D"

In scenarios above processes hang at the same function- wait_barrier
in raid10.c.
Process waits in macro "wait_event_lock_irq" until the
"!conf->barrier" condition will be true.
In scenarios above it never happens.

Reason was that at the end of level_store, after calling pers->run,
we call mddev_resume. This calls pers->quiesce(mddev, 0) with
RAID10, that calls lower_barrier.
However raise_barrier hadn't been called on that 'conf' yet,
so conf->barrier becomes negative, which is bad.

This patch introduces setting conf->barrier=1 after takeover
operation. It prevents to become barrier negative after call
lower_barrier().

Signed-off-by: Krzysztof Wojcik <krzysztof.wojcik@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2011-02-08 11:49:02 +11:00
Jonathan Brassow ccebd4c415 md-new-param-to_sync_page_io
Add new parameter to 'sync_page_io'.

The new parameter allows us to distinguish between metadata and data
operations.  This becomes important later when we add the ability to
use separate devices for data and metadata.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
2011-01-14 09:14:33 +11:00
Joe Perches 067032bc62 md: Fix single printks with multiple KERN_<level>s
Noticed-by: Russell King <linux@arm.linux.org.uk>
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2011-01-14 09:14:33 +11:00
NeilBrown 589a594be1 md: protect against NULL reference when waiting to start a raid10.
When we fail to start a raid10 for some reason, we call
md_unregister_thread to kill the thread that was created.

Unfortunately md_thread() will then make one call into the handler
(raid10d) even though md_wakeup_thread has not been called.  This is
not safe and as md_unregister_thread is called after mddev->private
has been set to NULL, it will definitely cause a NULL dereference.

So fix this at both ends:
 - md_thread should only call the handler if THREAD_WAKEUP has been
   set.
 - raid10 should call md_unregister_thread before setting things
   to NULL just like all the other raid modules do.

This is applicable to 2.6.35 and later.

Cc: stable@kernel.org
Reported-by: "Citizen" <citizen_lee@thecus.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2010-12-09 17:02:14 +11:00
NeilBrown a167f66324 md: use separate bio pool for each md device.
bio_clone and bio_alloc allocate from a common bio pool.
If an md device is stacked with other devices that use this pool, or under
something like swap which uses the pool, then the multiple calls on
the pool can cause deadlocks.

So allocate a local bio pool for each md array and use that rather
than the common pool.

This pool is used both for regular IO and metadata updates.

Signed-off-by: NeilBrown <neilb@suse.de>
2010-10-28 17:36:15 +11:00
NeilBrown 2b193363ef md: change type of first arg to sync_page_io.
Currently sync_page_io takes a 'bdev'.
Every caller passes 'rdev->bdev'.
We will soon want another field out of the rdev in sync_page_io,
So just pass the rdev instead of the bdev out of it.

Signed-off-by: NeilBrown <neilb@suse.de>
2010-10-28 17:36:11 +11:00
NeilBrown 6746557f03 md: use bio_kmalloc rather than bio_alloc when failure is acceptable.
bio_alloc can never fail (as it uses a mempool) but an block
indefinitely, especially if the caller is holding a reference to a
previously allocated bio.

So these to places which both handle failure and hold multiple bios
should not use bio_alloc, they should use bio_kmalloc.

Signed-off-by: NeilBrown <neilb@suse.de>
2010-10-28 17:36:06 +11:00
NeilBrown 4e78064f42 md: Fix possible deadlock with multiple mempool allocations.
It is not safe to allocate from a mempool while holding an item
previously allocated from that mempool as that can deadlock when the
mempool is close to exhaustion.

So don't use a bio list to collect the bios to write to multiple
devices in raid1 and raid10.
Instead queue each bio as it becomes available so an unplug will
activate all previously allocated bios and so a new bio has a chance
of being allocated.

This means we must set the 'remaining' count to '1' before submitting
any requests, then when all are submitted, decrement 'remaining' and
possible handle the write completion at that point.

Reported-by: Torsten Kaiser <just.for.lkml@googlemail.com>
Tested-by: Torsten Kaiser <just.for.lkml@googlemail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2010-10-28 17:34:07 +11:00
NeilBrown 57dab0bdf6 md: use sector_t in bitmap_get_counter
bitmap_get_counter returns the number of sectors covered
by the counter in a pass-by-reference variable.
In some cases this can be very large, so make it a sector_t
for safety.

Signed-off-by: NeilBrown <neilb@suse.de>
2010-10-28 17:32:26 +11:00
Tejun Heo e9c7469bb4 md: implment REQ_FLUSH/FUA support
This patch converts md to support REQ_FLUSH/FUA instead of now
deprecated REQ_HARDBARRIER.  In the core part (md.c), the following
changes are notable.

* Unlike REQ_HARDBARRIER, REQ_FLUSH/FUA don't interfere with
  processing of other requests and thus there is no reason to mark the
  queue congested while FLUSH/FUA is in progress.

* REQ_FLUSH/FUA failures are final and its users don't need retry
  logic.  Retry logic is removed.

* Preflush needs to be issued to all member devices but FUA writes can
  be handled the same way as other writes - their processing can be
  deferred to request_queue of member devices.  md_barrier_request()
  is renamed to md_flush_request() and simplified accordingly.

For linear, raid0 and multipath, the core changes are enough.  raid1,
5 and 10 need the following conversions.

* raid1: Handling of FLUSH/FUA bio's can simply be deferred to
  request_queues of member devices.  Barrier related logic removed.

* raid5: Queue draining logic dropped.  FUA bit is propagated through
  biodrain and stripe resconstruction such that all the updated parts
  of the stripe are written out with FUA writes if any of the dirtying
  writes was FUA.  preread_active_stripes handling in make_request()
  is updated as suggested by Neil Brown.

* raid10: FUA bit needs to be propagated to write clones.

linear, raid0, 1, 5 and 10 tested.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Neil Brown <neilb@suse.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-09-10 12:35:38 +02:00
NeilBrown 2c7d46ec19 md raid-1/10 Fix bio_rw bit manipulations again
commit 7b6d91daee changed the behaviour
of a few variables in raid1 and raid10 from flags to bit-sets, but
left them as type 'bool' so they did not work.

Change them (back) to unsigned long.
(historical note: see 1ef04fefe2)

Signed-off-by: NeilBrown <neilb@suse.de>
Reported-by: Jiri Slaby <jslaby@suse.cz> and many others
2010-08-18 16:16:05 +10:00
NeilBrown 6b96562054 md: provide appropriate return value for spare_active functions.
md_check_recovery expects ->spare_active to return 'true' if any
spares were activated, but none of them do, so the consequent change
in 'degraded' is not notified through sysfs.

So count the number of spares activated, subtract it from 'degraded'
just once, and return it.

Reported-by: Adrian Drzewiecki <adriand@vmware.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2010-08-18 12:04:32 +10:00
Adrian Drzewiecki e6ffbcb6cd md: Notify sysfs when RAID1/5/10 disk is In_sync.
When RAID1 is done syncing disks, it'll update the state
of synced rdevs to In_sync. But it neglected to notify
sysfs that the attribute changed. So any programs that
are waiting for an rdev's state to change will not be
woken.

(raid5/raid10 added by neilb)

Signed-off-by: Adrian Drzewiecki <adriand@vmware.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2010-08-18 11:49:02 +10:00
Linus Torvalds 3d30701b58 Merge branch 'for-linus' of git://neil.brown.name/md
* 'for-linus' of git://neil.brown.name/md: (24 commits)
  md: clean up do_md_stop
  md: fix another deadlock with removing sysfs attributes.
  md: move revalidate_disk() back outside open_mutex
  md/raid10: fix deadlock with unaligned read during resync
  md/bitmap:  separate out loading a bitmap from initialising the structures.
  md/bitmap: prepare for storing write-intent-bitmap via dm-dirty-log.
  md/bitmap: optimise scanning of empty bitmaps.
  md/bitmap: clean up plugging calls.
  md/bitmap: reduce dependence on sysfs.
  md/bitmap: white space clean up and similar.
  md/raid5: export raid5 unplugging interface.
  md/plug: optionally use plugger to unplug an array during resync/recovery.
  md/raid5: add simple plugging infrastructure.
  md/raid5: export is_congested test
  raid5: Don't set read-ahead when there is no queue
  md: add support for raising dm events.
  md: export various start/stop interfaces
  md: split out md_rdev_init
  md: be more careful setting MD_CHANGE_CLEAN
  md/raid5: ensure we create a unique name for kmem_cache when mddev has no gendisk
  ...
2010-08-10 15:38:19 -07:00
Christoph Hellwig 7b6d91daee block: unify flags for struct bio and struct request
Remove the current bio flags and reuse the request flags for the bio, too.
This allows to more easily trace the type of I/O from the filesystem
down to the block driver.  There were two flags in the bio that were
missing in the requests:  BIO_RW_UNPLUG and BIO_RW_AHEAD.  Also I've
renamed two request flags that had a superflous RW in them.

Note that the flags are in bio.h despite having the REQ_ name - as
blkdev.h includes bio.h that is the only way to go for now.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-08-07 18:20:39 +02:00
NeilBrown 51e9ac7703 md/raid10: fix deadlock with unaligned read during resync
If the 'bio_split' path in raid10-read is used while
resync/recovery is happening it is possible to deadlock.
Fix this be elevating ->nr_waiting for the duration of both
parts of the split request.

This fixes a bug that has been present since 2.6.22
but has only started manifesting recently for unknown reasons.
It is suitable for and -stable since then.

Reported-by:  Justin Bronder <jsbronder@gentoo.org>
Tested-by:  Justin Bronder <jsbronder@gentoo.org>
Signed-off-by: NeilBrown <neilb@suse.de>
Cc: stable@kernel.org
2010-08-07 21:17:00 +10:00
Maciej Trela f73ea87375 md: fix raid10 takeover: use new_layout for setup_conf
Use mddev->new_layout in setup_conf.
Also use new_chunk, and don't set ->degraded in takeover().  That
gets set in run()

Signed-off-by: Maciej Trela <maciej.trela@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2010-06-24 13:33:51 +10:00
NeilBrown e93f68a1fc md: fix handling of array level takeover that re-arranges devices.
Most array level changes leave the list of devices largely unchanged,
possibly causing one at the end to become redundant.
However conversions between RAID0 and RAID10 need to renumber
all devices (except 0).

This renumbering is currently being done in the ->run method when the
new personality takes over.  However this is too late as the common
code in md.c might already have invalidated some of the devices if
they had a ->raid_disk number that appeared to high.

Moving it into the ->takeover method is too early as the array is
still active at that time and wrong ->raid_disk numbers could cause
confusion.

So add a ->new_raid_disk field to mdk_rdev_s and use it to communicate
the new raid_disk number.
Now the common code knows exactly which devices need to be renumbered,
and which can be invalidated, and can do it all at a convenient time
when the array is suspend.
It can also update some symlinks in sysfs which previously were not be
updated correctly.

Reported-by: Maciej Trela <maciej.trela@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2010-06-24 13:33:24 +10:00
Prasanna S. Panchamukhi 0544a21db0 md: raid10: Fix null pointer dereference in fix_read_error()
Such NULL pointer dereference can occur when the driver was fixing the
read errors/bad blocks and the disk was physically removed
causing a system crash. This patch check if the
rcu_dereference() returns valid rdev before accessing it in fix_read_error().

Cc: stable@kernel.org
Signed-off-by: Prasanna S. Panchamukhi <prasanna.panchamukhi@riverbed.com>
Signed-off-by: Rob Becker <rbecker@riverbed.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2010-06-24 13:31:03 +10:00
NeilBrown 19fdb9eefb Merge commit '3ff195b011d7decf501a4d55aeed312731094796' into for-linus
Conflicts:
	drivers/md/md.c

- Resolved conflict in md_update_sb
- Added extra 'NULL' arg to new instance of sysfs_get_dirent.

Signed-off-by: NeilBrown <neilb@suse.de>
2010-05-22 08:31:36 +10:00
NeilBrown af3a2cd6b8 md: Fix read balancing in RAID1 and RAID10 on drives > 2TB
read_balance uses a "unsigned long" for a sector number which
will get truncated beyond 2TB.
This will cause read-balancing to be non-optimal, and can cause
data to be read from the 'wrong' branch during a resync.  This has a
very small chance of returning wrong data.

Reported-by: Jordan Russell <jr-list-2010@quo.to>
Cc: stable@kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
2010-05-18 15:28:00 +10:00
NeilBrown 128595ed6f md/raid10: tidy up printk messages.
All raid10 printk messages now start
   md/raid10:md-device-name:

Signed-off-by: NeilBrown <neilb@suse.de>
2010-05-18 15:27:59 +10:00
NeilBrown 21a52c6d05 md: pass mddev to make_request functions rather than request_queue
We used to pass the personality make_request function direct
to the block layer so the first argument had to be a queue.
But now we have the intermediary md_make_request so it makes
at lot more sense to pass a struct mddev_s.
It makes it possible to have an mddev without its own queue too.

Signed-off-by: NeilBrown <neilb@suse.de>
2010-05-18 15:27:55 +10:00
NeilBrown 490773268c md: move io accounting out of personalities into md_make_request
While I generally prefer letting personalities do as much as possible,
given that we have a central md_make_request anyway we may as well use
it to simplify code.
Also this centralises knowledge of ->gendisk which will help later.

Signed-off-by: NeilBrown <neilb@suse.de>
2010-05-18 15:27:52 +10:00
Trela, Maciej dab8b29248 md: Add support for Raid0->Raid10 takeover
Signed-off-by: Maciej Trela <maciej.trela@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2010-05-18 15:27:48 +10:00
NeilBrown 84707f38e7 md: don't use mddev->raid_disks in raid0 or raid10 while array is active.
In a subsequent patch we will make it possible to change
mddev->raid_disks while a RAID0 or RAID10 array is active.  This is
part of the process of reshaping such an array.

This means that we cannot use this value while processes requests
(it is OK to use it during initialisation as we are locked against
changes then).
Both RAID0 and RAID10 have the same value stored in the private data
structure, so use that value instead.

Signed-off-by: NeilBrown <neilb@suse.de>
2010-05-18 15:27:47 +10:00
H Hartley Sweeten 7b92813c3c drivers/md: Remove unnecessary casts of void *
void pointers do not need to be cast to other pointer types.

Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2010-05-18 15:27:46 +10:00
Tejun Heo 5a0e3ad6af include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h
percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files.  percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.

percpu.h -> slab.h dependency is about to be removed.  Prepare for
this change by updating users of gfp and slab facilities include those
headers directly instead of assuming availability.  As this conversion
needs to touch large number of source files, the following script is
used as the basis of conversion.

  http://userweb.kernel.org/~tj/misc/slabh-sweep.py

The script does the followings.

* Scan files for gfp and slab usages and update includes such that
  only the necessary includes are there.  ie. if only gfp is used,
  gfp.h, if slab is used, slab.h.

* When the script inserts a new include, it looks at the include
  blocks and try to put the new include such that its order conforms
  to its surrounding.  It's put in the include block which contains
  core kernel includes, in the same order that the rest are ordered -
  alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
  doesn't seem to be any matching order.

* If the script can't find a place to put a new include (mostly
  because the file doesn't have fitting include block), it prints out
  an error message indicating which .h file needs to be added to the
  file.

The conversion was done in the following steps.

1. The initial automatic conversion of all .c files updated slightly
   over 4000 files, deleting around 700 includes and adding ~480 gfp.h
   and ~3000 slab.h inclusions.  The script emitted errors for ~400
   files.

2. Each error was manually checked.  Some didn't need the inclusion,
   some needed manual addition while adding it to implementation .h or
   embedding .c file was more appropriate for others.  This step added
   inclusions to around 150 files.

3. The script was run again and the output was compared to the edits
   from #2 to make sure no file was left behind.

4. Several build tests were done and a couple of problems were fixed.
   e.g. lib/decompress_*.c used malloc/free() wrappers around slab
   APIs requiring slab.h to be added manually.

5. The script was run on all .h files but without automatically
   editing them as sprinkling gfp.h and slab.h inclusions around .h
   files could easily lead to inclusion dependency hell.  Most gfp.h
   inclusion directives were ignored as stuff from gfp.h was usually
   wildly available and often used in preprocessor macros.  Each
   slab.h inclusion directive was examined and added manually as
   necessary.

6. percpu.h was updated not to include slab.h.

7. Build test were done on the following configurations and failures
   were fixed.  CONFIG_GCOV_KERNEL was turned off for all tests (as my
   distributed build env didn't work with gcov compiles) and a few
   more options had to be turned off depending on archs to make things
   build (like ipr on powerpc/64 which failed due to missing writeq).

   * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
   * powerpc and powerpc64 SMP allmodconfig
   * sparc and sparc64 SMP allmodconfig
   * ia64 SMP allmodconfig
   * s390 SMP allmodconfig
   * alpha SMP allmodconfig
   * um on x86_64 SMP allmodconfig

8. percpu.h modifications were reverted so that it could be applied as
   a separate patch and serve as bisection point.

Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable easily on most builds of
the specific arch.

Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
2010-03-30 22:02:32 +09:00
NeilBrown 627a2d3c29 md: deal with merge_bvec_fn in component devices better.
If a component device has a merge_bvec_fn then as we never call it
we must ensure we never need to.  Currently this is done by setting
max_sector to 1 PAGE, however this does not stop a bio being created
with several sub-page iovecs that would violate the merge_bvec_fn.

So instead set max_segments to 1 and set the segment boundary to the
same as a page boundary to ensure there is only ever one single-page
segment of IO requested at a time.

This can particularly be an issue when 'xen' is used as it is
known to submit multiple small buffers in a single bio.

Signed-off-by: NeilBrown <neilb@suse.de>
Cc: stable@kernel.org
2010-03-16 17:04:24 +11:00
Martin K. Petersen 086fa5ff08 block: Rename blk_queue_max_sectors to blk_queue_max_hw_sectors
The block layer calling convention is blk_queue_<limit name>.
blk_queue_max_sectors predates this practice, leading to some confusion.
Rename the function to appropriately reflect that its intended use is to
set max_hw_sectors.

Also introduce a temporary wrapper for backwards compability.  This can
be removed after the merge window is closed.

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2010-02-26 13:58:08 +01:00
NeilBrown 0efb9e6191 md: add MODULE_DESCRIPTION for all md related modules.
Suggested by  Oren Held <orenhe@il.ibm.com>

Signed-off-by: NeilBrown <neilb@suse.de>
2009-12-14 12:51:41 +11:00
Robert Becker 1e50915fe0 raid: improve MD/raid10 handling of correctable read errors.
We've noticed severe lasting performance degradation of our raid
arrays when we have drives that yield large amounts of media errors.
The raid10 module will queue each failed read for retry, and also
will attempt call fix_read_error() to perform the read recovery.
Read recovery is performed while the array is frozen, so repeated
recovery attempts can degrade the performance of the array for
extended periods of time.

With this patch I propose adding a per md device max number of
corrected read attempts.  Each rdev will maintain a count of
read correction attempts in the rdev->read_errors field (not
used currently for raid10). When we enter fix_read_error()
we'll check to see when the last read error occurred, and
divide the read error count by 2 for every hour since the
last read error. If at that point our read error count
exceeds the read error threshold, we'll fail the raid device.

In addition in this patch I add sysfs nodes (get/set) for
the per md max_read_errors attribute, the rdev->read_errors
attribute, and added some printk's to indicate when
fix_read_error fails to repair an rdev.

For testing I used debugfs->fail_make_request to inject
IO errors to the rdev while doing IO to the raid array.

Signed-off-by: Robert Becker <Rob.Becker@riverbed.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2009-12-14 12:51:41 +11:00
Robert Becker 67b8dc4b06 md/raid10: print more useful messages on device failure.
When we get a read error on a device in a RAID10, and attempting to
repair the error fails, print more useful messages about why it
failed.

Signed-off-by: Robert Becker <Rob.Becker@riverbed.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2009-12-14 12:51:41 +11:00
NeilBrown 9cd30fdc33 md: remove needless setting of thread->timeout in raid10_quiesce
As bitmap_create and bitmap_destroy already set thread->timeout
as appropriate, there is no need to do it in raid10_quiesce.
There is a possible need to wake the thread after the timeout
has been set low, but it is better to do that where the timeout
is actually set low, in bitmap_create.

Signed-off-by: NeilBrown <neilb@suse.de>
2009-12-14 12:51:41 +11:00
NeilBrown 1b04be96f6 md: change daemon_sleep to be in 'jiffies' rather than 'seconds'.
This removes a lot of multiplications by HZ.

Signed-off-by: NeilBrown <neilb@suse.de>
2009-12-14 12:51:41 +11:00
NeilBrown 42a04b5078 md: move offset, daemon_sleep and chunksize out of bitmap structure
... and into bitmap_info.  These are all configuration parameters
that need to be set before the bitmap is created.

Signed-off-by: NeilBrown <neilb@suse.de>
2009-12-14 12:51:41 +11:00
NeilBrown a2826aa92e md: support barrier requests on all personalities.
Previously barriers were only supported on RAID1.  This is because
other levels requires synchronisation across all devices and so needed
a different approach.
Here is that approach.

When a barrier arrives, we send a zero-length barrier to every active
device.  When that completes - and if the original request was not
empty -  we submit the barrier request itself (with the barrier flag
cleared) and then submit a fresh load of zero length barriers.

The barrier request itself is asynchronous, but any subsequent
request will block until the barrier completes.

The reason for clearing the barrier flag is that a barrier request is
allowed to fail.  If we pass a non-empty barrier through a striping
raid level it is conceivable that part of it could succeed and part
could fail.  That would be way too hard to deal with.
So if the first run of zero length barriers succeed, we assume all is
sufficiently well that we send the request and ignore errors in the
second run of barriers.

RAID5 needs extra care as write requests may not have been submitted
to the underlying devices yet.  So we flush the stripe cache before
proceeding with the barrier.

Note that the second set of zero-length barriers are submitted
immediately after the original request is submitted.  Thus when
a personality finds mddev->barrier to be set during make_request,
it should not return from make_request until the corresponding
per-device request(s) have been queued.

That will be done in later patches.

Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Andre Noll <maan@systemlinux.org>
2009-12-14 12:49:49 +11:00
NeilBrown ed9bfdf1a4 md: raid1/raid10: handle allocation errors during array setup.
Both raid1 and raid10 create a mempool during startup.
If the 'alloc' function for this mempool fails, unplug_slaves
is called.
If that happens when the pool is being initialised, unplug_slaves
will try to use the 'conf' structure that isn't filled in yet, and
badness will happen.

So ensure that unplug_slaves doesn't get called unless we know
that the conf structure if fully initialised.

Signed-off-by: NeilBrown <neilb@suse.de>
2009-10-16 15:55:44 +11:00
NeilBrown 1d9d52416c md/raid1/raid10: add a cond_resched
During 'check' of a raid1 or raid10 it is possible for the management
thread to spend a lot of time running 'memcmp' on blocks from
different devices, so make sure the thread has a chance to schedule.
raid5d already has a cond_resched (in process_stripe).

Reported-By: Lee Howard <faxguy@howardsilvan.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2009-10-16 15:55:32 +11:00
Dmitry Monakhov 1ef04fefe2 md: raid-1/10: fix RW bits manipulation
Recently Jens has changed bio_rw_flagged() logic by following
commit 1f98a13f62. Now it returns
bool instead of int. This broke raid1/raid10 RW bits manipulation logic.
One of visible result is BUG_ON triggering due to empty barrier
here scsi_lib.c:1108 scsi_setup_fs_cmnd()

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: NeilBrown <neilb@suse.de>
2009-09-23 18:20:15 +10:00
NeilBrown 3fa841d7e7 md: report device as congested when suspended
This should writeback from coming when the device is temporarily
suspended.

Signed-off-by: NeilBrown <neilb@suse.de>
2009-09-23 18:10:29 +10:00
NeilBrown 0da3c6194e md: Improve name of threads created by md_register_thread
The management thread for raid4,5,6 arrays are all called
mdX_raid5, independent of the actual raid level, which is wrong and
can be confusion.

So change md_register_thread to use the name from the personality
unless no alternate name (like 'resync' or 'reshape') is given.

This is simpler and more correct.

Cc: Jinzc <zhenchengjin@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2009-09-23 18:09:45 +10:00
NeilBrown a9f326ebf2 md: remove sparse waring "symbol xxx shadows an earlier one"
Rename some variable and remove some duplicate definitions
to avoid there warnings.  None of them are actual errors.

Signed-off-by: NeilBrown <neilb@suse.de>
2009-09-23 18:06:41 +10:00
Jens Axboe 1f98a13f62 bio: first step in sanitizing the bio->bi_rw flag testing
Get rid of any functions that test for these bits and make callers
use bio_rw_flagged() directly. Then it is at least directly apparent
what variable and flag they check.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-09-11 14:33:31 +02:00
Andre Noll ac5e7113e7 md: Push down data integrity code to personalities.
This patch replaces md_integrity_check() by two new public functions:
md_integrity_register() and md_integrity_add_rdev() which are both
personality-independent.

md_integrity_register() is called from the ->run and ->hot_remove
methods of all personalities that support data integrity.  The
function iterates over the component devices of the array and
determines if all active devices are integrity capable and if their
profiles match. If this is the case, the common profile is registered
for the mddev via blk_integrity_register().

The second new function, md_integrity_add_rdev() is called from the
->hot_add_disk methods, i.e. whenever a new device is being added
to a raid array. If the new device does not support data integrity,
or has a profile different from the one already registered, data
integrity for the mddev is disabled.

For raid0 and linear, only the call to md_integrity_register() from
the ->run method is necessary.

Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: NeilBrown <neilb@suse.de>
2009-08-03 10:59:47 +10:00
Martin K. Petersen 8f6c2e4b32 md: Use new topology calls to indicate alignment and I/O sizes
Switch MD over to the new disk_stack_limits() function which checks for
aligment and adjusts preferred I/O sizes when stacking.

Also indicate preferred I/O sizes where applicable.

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2009-07-01 11:13:45 +10:00
Andre Noll 8c6ac868b1 md: Push down reconstruction log message to personality code.
Currently, the md layer checks in analyze_sbs() if the raid level
supports reconstruction (mddev->level >= 1) and if reconstruction is
in progress (mddev->recovery_cp != MaxSector).

Move that printk into the personality code of those raid levels that
care (levels 1, 4, 5, 6, 10).

Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: NeilBrown <neilb@suse.de>
2009-06-18 08:48:06 +10:00
Andre Noll 9d8f036362 md: Make mddev->chunk_size sector-based.
This patch renames the chunk_size field to chunk_sectors with the
implied change of semantics.  Since

	is_power_of_2(chunk_size) = is_power_of_2(chunk_sectors << 9)
				  = is_power_of_2(chunk_sectors)

these bits don't need an adjustment for the shift.

Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: NeilBrown <neilb@suse.de>
2009-06-18 08:45:01 +10:00
raz ben yehuda 964e7913b0 md: raid10: chunk size check in run
have raid10 check chunk size in run method instead of in md

Signed-off-by: raziebe@gmail.com
Signed-off-by: NeilBrown <neilb@suse.de>
2009-06-16 17:01:22 +10:00
NeilBrown 070ec55d07 md: remove mddev_to_conf "helper" macro
Having a macro just to cast a void* isn't really helpful.
I would must rather see that we are simply de-referencing ->private,
than have to know what the macro does.

So open code the macro everywhere and remove the pointless cast.

Signed-off-by: NeilBrown <neilb@suse.de>
2009-06-16 16:54:21 +10:00
Martin K. Petersen ae03bf639a block: Use accessor functions for queue limits
Convert all external users of queue limits to using wrapper functions
instead of poking the request queue variables directly.

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-05-22 23:22:54 +02:00
NeilBrown 1805556912 md/raid10: don't clear bitmap during recovery if array will still be degraded.
If we have a raid10 with multiple missing devices, and we recover just
one of these to a spare, then we risk (depending on the bitmap and
array chunk size) clearing bits of the bitmap for which recovery isn't
complete (because a device is still missing).

This can lead to a subsequent "re-add" being recovered without
any IO happening, which would result in loss of data.

This patch takes the safe approach of not clearing bitmap bits
if the array will still be degraded.

This patch is suitable for all active -stable kernels.

Cc: stable@kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
2009-05-07 12:48:10 +10:00
Christoph Hellwig 8f3d8ba20e block: move bio list helpers into bio.h
It's used by DM and MD and generally useful, so move the bio list
helpers into bio.h.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-04-15 08:28:09 +02:00
Dan Williams b522adcde9 md: 'array_size' sysfs attribute
Allow userspace to set the size of the array according to the following
semantics:

1/ size must be <= to the size returned by mddev->pers->size(mddev, 0, 0)
   a) If size is set before the array is running, do_md_run will fail
      if size is greater than the default size
   b) A reshape attempt that reduces the default size to less than the set
      array size should be blocked
2/ once userspace sets the size the kernel will not change it
3/ writing 'default' to this attribute returns control of the size to the
   kernel and reverts to the size reported by the personality

Also, convert locations that need to know the default size from directly
reading ->array_sectors to <pers>_size.  Resync/reshape operations
always follow the default size.

Finally, fixup other locations that read a number of 1k-blocks from
userspace to use strict_blocks_to_sectors() which checks for unsigned
long long to sector_t overflow and blocks to sectors overflow.

Reviewed-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2009-03-31 15:00:31 +11:00
Dan Williams 1f403624bd md: centralize ->array_sectors modifications
Get personalities out of the business of directly modifying
->array_sectors.  Lays groundwork to introduce policy on when
->array_sectors can be modified.

Reviewed-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2009-03-31 14:59:03 +11:00
Dan Williams 80c3a6ce4b md: add 'size' as a personality method
In preparation for giving userspace control over ->array_sectors we need
to be able to retrieve the 'default' size, and the 'anticipated' size
when a reshape is requested.  For personalities that do not reshape emit
a warning if anything but the default size is requested.

In the raid5 case we need to update ->previous_raid_disks to make the
new 'default' size available.

Reviewed-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2009-03-31 14:57:49 +11:00
NeilBrown 409c57f380 md: enable suspend/resume of md devices.
To be able to change the 'level' of an md/raid array, we need to
suspend the device so that no requests are active - then move some
pointers around etc.

The code already keeps counts of active requests and the ->quiesce
function can be used to wait until those counts hit zero.
However the quiesce function blocks new requests once they are all
ready 'inside' the personality module, and that is too late if we want
to replace the personality modules.

So make all md requests come in through a common md_make_request
function that keeps track of how many requests have entered the
modules but may not yet be on the internal reference counts.
Allow md_make_request to be blocked when we want to suspend the
device, and make it possible to wait for all those in-transit requests
to be added to internal lists so that ->quiesce can wait for them.

There is still a problem that when a request completes, we drop the
ref count inside the personality code so there is a short time between
when the refcount hits zero, and when the personality code is no
longer being used.
The personality code never blocks (schedule or spinlock) between
dropping the refcount and exiting the routine, so this should be safe
(as put_module calls synchronize_sched() before unmapping the module
code).

Signed-off-by: NeilBrown <neilb@suse.de>
2009-03-31 14:39:39 +11:00
Andre Noll 58c0fed400 md: Make mddev->size sector-based.
This patch renames the "size" field of struct mddev_s to "dev_sectors"
and stores the number of 512-byte sectors instead of the number of
1K-blocks in it.

All users of that field, including raid levels 1,4-6,10, are adjusted
accordingly. This simplifies the code a bit because it allows to get
rid of a couple of divisions/multiplications by two.

In order to make checkpatch happy, some minor coding style issues
have also been addressed. In particular, size_store() now uses
strict_strtoull() instead of simple_strtoull().

Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: NeilBrown <neilb@suse.de>
2009-03-31 14:33:13 +11:00
NeilBrown 43b2e5d86d md: move md_k.h from include/linux/raid/ to drivers/md/
It really is nicer to keep related code together..

Signed-off-by: NeilBrown <neilb@suse.de>
2009-03-31 14:33:13 +11:00
NeilBrown bff61975b3 md: move lots of #include lines out of .h files and into .c
This makes the includes more explicit, and is preparation for moving
md_k.h to drivers/md/md.h

Remove include/raid/md.h as its only remaining use was to #include
other files.

Signed-off-by: NeilBrown <neilb@suse.de>
2009-03-31 14:33:13 +11:00
Christoph Hellwig ef740c372d md: move headers out of include/linux/raid/
Move the headers with the local structures for the disciplines and
bitmap.h into drivers/md/ so that they are more easily grepable for
hacking and not far away.  md.h is left where it is for now as there
are some uses from the outside.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: NeilBrown <neilb@suse.de>
2009-03-31 14:27:03 +11:00