Commit Graph

1573 Commits

Author SHA1 Message Date
Dan Williams cb3c82992f async_tx: raid6 recovery self test
Port drivers/md/raid6test/test.c to use the async raid6 recovery
routines.  This is meant as a unit test for raid6 acceleration drivers.  In
addition to the 16-drive test case this implements tests for the 4-disk and
5-disk special cases (dma devices can not generically handle less than 2
sources), and adds a test for the D+Q case.

Reviewed-by: Andre Noll <maan@systemlinux.org>
Acked-by: Maciej Sosnowski <maciej.sosnowski@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2009-08-29 19:09:28 -07:00
Dan Williams ad283ea4a3 async_tx: add sum check flags
Replace the flat zero_sum_result with a collection of flags to contain
the P (xor) zero-sum result, and the soon to be utilized Q (raid6 reed
solomon syndrome) zero-sum result.  Use the SUM_CHECK_ namespace instead
of DMA_ since these flags will be used on non-dma-zero-sum enabled
platforms.

Reviewed-by: Andre Noll <maan@systemlinux.org>
Acked-by: Maciej Sosnowski <maciej.sosnowski@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2009-08-29 19:09:26 -07:00
Dan Williams d6f38f31f3 md/raid5,6: add percpu scribble region for buffer lists
Use percpu memory rather than stack for storing the buffer lists used in
parity calculations.  Include space for dma address conversions and pass
that to async_tx via the async_submit_ctl.scribble pointer.

[ Impact: move memory pressure from stack to heap ]

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2009-08-29 19:09:26 -07:00
Dan Williams 36d1c6476b md/raid6: move the spare page to a percpu allocation
In preparation for asynchronous handling of raid6 operations move the
spare page to a percpu allocation to allow multiple simultaneous
synchronous raid6 recovery operations.

Make this allocation cpu hotplug aware to maximize allocation
efficiency.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2009-08-29 19:09:26 -07:00
Chandra Seetharaman 2bfd2e1337 [SCSI] scsi_dh: Use scsi_dh_set_params() in multipath.
Use scsi_dh_set_params() set parameters provided. Save the parameters in
parse_hw_handler() and use it in parse_path().

Reported-by: Eddie Williams <Eddie.Williams@steeleye.com>
Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Tested-by: Eddie Williams <Eddie.Williams@steeleye.com>
Cc: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
Signed-off-by: James Bottomley <James.Bottomley@suse.de>
2009-08-22 17:52:15 -05:00
Linus Torvalds 435a71d9ef Merge branch 'for-linus' of git://neil.brown.name/md
* 'for-linus' of git://neil.brown.name/md:
  Fix new incorrect error return from do_md_stop.
2009-08-18 13:54:08 -07:00
NeilBrown 80ffb3ccea Fix new incorrect error return from do_md_stop.
Recent commit c8c00a6915
changed the exit paths in do_md_stop and was not quite
careful enough.  There is one path were 'err' now needs
to be cleared but it isn't.
So setting an array to readonly (with mdadm --readonly) will
work, but will incorrectly report and error: ENXIO.

Signed-off-by: NeilBrown <neilb@suse.de>
2009-08-18 10:35:26 +10:00
Randy Dunlap 894ef820b1 dm-log-userspace: fix printk format warning
drivers/md/dm-log-userspace-transfer.c:110: warning: format '%lu' expects type 'long unsigned int', but argument 4 has type 'size_t'

Previously posted and acked, but apparently lost.
http://lkml.indiana.edu/hypermail/linux/kernel/0906.2/02074.html

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Cc: dm-devel@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-08-16 08:35:58 -07:00
NeilBrown 4d484a4a7a md: allow upper limit for resync/reshape to be set when array is read-only
Normally we only allow the upper limit for a reshape to be decreased
when the array not performing a sync/recovery/reshape, otherwise there
could be races.  But if an array is part-way through a reshape when it
is assembled the reshape is started immediately leaving no window
to set an upper bound.

If the array is started read-only, the reshape will be suspended until
the array becomes writable, so that provides a window during which it
is perfectly safe to reduce the upper limit of a reshape.

So: allow the upper limit (sync_max) to be reduced even if the reshape
thread is running, as long as the array is still read-only.

Signed-off-by: NeilBrown <neilb@suse.de>
2009-08-13 10:41:50 +10:00
NeilBrown 1a67dde0ab md/raid5: Properly remove excess drives after shrinking a raid5/6
We were removing the drives, from the array, but not
removing symlinks from /sys/.... and not marking the device
as having been removed.

Signed-off-by: NeilBrown <neilb@suse.de>
2009-08-13 10:41:49 +10:00
NeilBrown a639755cf8 md/raid5: make sure a reshape restarts at the correct address.
This "if" don't allow for the possibility that the number of devices
doesn't change, and so sector_nr isn't set correctly in that case.
So change '>' to '>='.

Signed-off-by: NeilBrown <neilb@suse.de>
2009-08-13 10:13:00 +10:00
NeilBrown 67ac6011db md/raid5: allow new reshape modes to be restarted in the middle.
md/raid5 doesn't allow a reshape to restart if it involves writing
over the same part of disk that it would be reading from.
This happens at the beginning of a reshape that increases the number
of devices, at the end of a reshape that decreases the number of
devices, and continuously for a reshape that does not change the
number of devices.

The current code is correct for the "increase number of devices"
case as the critical section at the start is handled by userspace
performing a backup.

It does not work for reducing the number of devices, or the
no-change case.
For 'reducing', we need to invert the test.  For no-change we cannot
really be sure things will be safe, so simply require the array
to be read-only, which is how the user-space code which carefully
starts such arrays works.

Signed-off-by: NeilBrown <neilb@suse.de>
2009-08-13 10:06:24 +10:00
NeilBrown 51d5668cb2 md: never advance 'events' counter by more than 1.
When assembling arrays, md allows two devices to have different event
counts as long as the difference is only '1'.  This is to cope with
a system failure between updating the metadata on two difference
devices.

However there are currently times when we update the event count by
2.  This was done to keep the event count even when the array is clean
and odd when it is dirty, which allows us to avoid writing common
update to spare devices and so allow those spares to go to sleep.

This is bad for the above reason.  So change it to never increase by
two.  This means that the alignment between 'odd/even' and
'clean/dirty' might take a little longer to attain, but that is only a
small cost.  The spares will get a few more updates but that will
still be spared (;-) most updates and can still go to sleep.

Prior to this patch there was a small chance that after a crash an
array would fail to assemble due to the overly large event count
mismatch.

Signed-off-by: NeilBrown <neilb@suse.de>
2009-08-13 09:54:02 +10:00
NeilBrown c8c00a6915 Remove deadlock potential in md_open
A recent commit:
  commit 449aad3e25

introduced the possibility of an A-B/B-A deadlock between
bd_mutex and reconfig_mutex.

__blkdev_get holds bd_mutex while calling md_open which takes
   reconfig_mutex,
do_md_run is always called with reconfig_mutex held, and it now
   takes bd_mutex in the call the revalidate_disk.

This potential deadlock was not caught by lockdep due to the
use of mutex_lock_interruptible_nexted which was introduced
by
   commit d63a5a74de
do avoid a warning of an impossible deadlock.

It is quite possible to split reconfig_mutex in to two locks.
One protects the array data structures while it is being
reconfigured, the other ensures that an array is never even partially
open while it is being deactivated.
In particular, the second lock prevents an open from completing
between the time when do_md_stop checks if there are any active opens,
and the time when the array is either set read-only, or when ->pers is
set to NULL.  So we can be certain that no IO is in flight as the
array is being destroyed.

So create a new lock, open_mutex, just to ensure exclusion between
'open' and 'stop'.

This avoids the deadlock and also avoids the lockdep warning mentioned
in commit d63a5a74d

Reported-by: "Mike Snitzer" <snitzer@gmail.com>
Reported-by: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2009-08-10 12:50:52 +10:00
NeilBrown 449aad3e25 md: Use revalidate_disk to effect changes in size of device.
As revalidate_disk calls check_disk_size_change, it will cause
any capacity change of a gendisk to be propagated to the blockdev
inode.  So use that instead of mucking about with locks and
i_size_write.

Also add a call to revalidate_disk in do_md_run and a few other places
where the gendisk capacity is changed.

Signed-off-by: NeilBrown <neilb@suse.de>
2009-08-03 10:59:58 +10:00
NeilBrown 64bd660b51 md: allow raid5_quiesce to work properly when reshape is happening.
The ->quiesce method is not supposed to stop resync/recovery/reshape,
just normal IO.
But in raid5 we don't have a way to know which stripes are being
used for normal IO and which for resync etc, so we need to wait for
all stripes to be idle to be sure that all writes have completed.

However reshape keeps at least some stripe busy for an extended period
of time, so a call to raid5_quiesce can block for several seconds
needlessly.
So arrange for reshape etc to pause briefly while raid5_quiesce is
trying to quiesce the array so that the active_stripes count can
drop to zero.

Signed-off-by: NeilBrown <neilb@suse.de>
2009-08-03 10:59:58 +10:00
NeilBrown e516402c0d md/raid5: set reshape_position correctly when reshape starts.
As the internal reshape_progress counter is the main driver
for reshape, the fact that reshape_position sometimes starts with the
wrong value has minimal effect.  It is visible in sysfs and that
is all.

Signed-off-by: NeilBrown <neilb@suse.de>
2009-08-03 10:59:57 +10:00
NeilBrown 70471dafe3 md: Handle growth of v1.x metadata correctly.
The v1.x metadata does not have a fixed size and can grow
when devices are added.
If it grows enough to require an extra sector of storage,
we need to update the 'sb_size' to match.

Without this, md can write out an incomplete superblock with a
bad checksum, which will be rejected when trying to re-assemble
the array.

Cc: stable@kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
2009-08-03 10:59:57 +10:00
NeilBrown 3673f305fa md: avoid array overflow with bad v1.x metadata
We trust the 'desc_nr' field in v1.x metadata enough to use it
as an index in an array.  This isn't really safe.
So range-check the value first.

Signed-off-by: NeilBrown <neilb@suse.de>
2009-08-03 10:59:56 +10:00
NeilBrown 3a981b03f3 md: when a level change reduces the number of devices, remove the excess.
When an array is changed from RAID6 to RAID5, fewer drives are
needed.  So any device that is made superfluous by the level
conversion must be marked as not-active.
For the RAID6->RAID5 conversion, this will be a drive which only
has 'Q' blocks on it.

Cc: stable@kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
2009-08-03 10:59:55 +10:00
Andre Noll ac5e7113e7 md: Push down data integrity code to personalities.
This patch replaces md_integrity_check() by two new public functions:
md_integrity_register() and md_integrity_add_rdev() which are both
personality-independent.

md_integrity_register() is called from the ->run and ->hot_remove
methods of all personalities that support data integrity.  The
function iterates over the component devices of the array and
determines if all active devices are integrity capable and if their
profiles match. If this is the case, the common profile is registered
for the mddev via blk_integrity_register().

The second new function, md_integrity_add_rdev() is called from the
->hot_add_disk methods, i.e. whenever a new device is being added
to a raid array. If the new device does not support data integrity,
or has a profile different from the one already registered, data
integrity for the mddev is disabled.

For raid0 and linear, only the call to md_integrity_register() from
the ->run method is necessary.

Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: NeilBrown <neilb@suse.de>
2009-08-03 10:59:47 +10:00
Dan Williams 95fc17aac4 md/raid6: release spare page at ->stop()
Add missing call to safe_put_page from stop() by unifying open coded
raid5_conf_t de-allocation under free_conf().

Cc: <stable@kernel.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2009-07-31 12:39:15 +10:00
Mike Snitzer 5dea271b6d dm table: pass correct dev area size to device_area_is_valid
Incorrect device area lengths are being passed to device_area_is_valid().

The regression appeared in 2.6.31-rc1 through commit
754c5fc7eb.

With the dm-stripe target, the size of the target (ti->len) was used
instead of the stripe_width (ti->len/#stripes).  An example of a
consequent incorrect error message is:

  device-mapper: table: 254:0: sdb too small for target

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-07-23 20:30:42 +01:00
Mike Snitzer a732c207d1 dm: remove queue next_ordered workaround for barriers
This patch removes DM's bio-based vs request-based conditional setting
of next_ordered.  For bio-based DM the next_ordered check is no longer a
concern (as that check is now in the __make_request path).  For
request-based DM the default of QUEUE_ORDERED_NONE is now appropriate.

bio-based DM was changed to work-around the previously misplaced
next_ordered check with this commit:
99360b4c18

request-based DM does not yet support barriers but reacted to the above
bio-based DM change with this commit:
5d67aa2366

The above changes are no longer needed given Neil Brown's recent fix to
put the next_ordered check in the __make_request path:
db64f680ba

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Cc: NeilBrown <neilb@suse.de>
Acked-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Acked-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-07-23 20:30:40 +01:00
Mikulas Patocka 69885683d2 dm raid1: wake kmirrord when requeueing delayed bios after remote recovery
The recent commit 7513c2a761 (dm raid1:
add is_remote_recovering hook for clusters) changed do_writes() to
update the ms->writes list but forgot to wake up kmirrord to process it.

The rule is that when anything is being added on ms->reads, ms->writes
or ms->failures and the list was empty before we must call
wakeup_mirrord (for immediate processing) or delayed_wake (for delayed
processing).  Otherwise the bios could sit on the list indefinitely.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
CC: stable@kernel.org
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-07-23 20:30:37 +01:00
Dan Williams a11034b428 md/raid6: release spare page at ->stop()
Add missing call to safe_put_page from stop() by unifying open coded
raid5_conf_t de-allocation under free_conf().

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2009-07-14 11:48:16 -07:00
Jens Axboe 8aa7e847d8 Fix congestion_wait() sync/async vs read/write confusion
Commit 1faa16d228 accidentally broke
the bdi congestion wait queue logic, causing us to wait on congestion
for WRITE (== 1) when we really wanted BLK_RW_ASYNC (== 0) instead.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-07-10 20:31:53 +02:00
Joe Perches ad361c9884 Remove multiple KERN_ prefixes from printk formats
Commit 5fd29d6ccb ("printk: clean up
handling of log-levels and newlines") changed printk semantics.  printk
lines with multiple KERN_<level> prefixes are no longer emitted as
before the patch.

<level> is now included in the output on each additional use.

Remove all uses of multiple KERN_<level>s in formats.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-07-08 10:30:03 -07:00
Linus Torvalds 2027bd9f92 Merge branch 'for-linus' of git://git.kernel.dk/linux-2.6-block
* 'for-linus' of git://git.kernel.dk/linux-2.6-block:
  cfq-iosched: remove redundant check for NULL cfqq in cfq_set_request()
  blocK: Restore barrier support for md and probably other virtual devices.
  block: get rid of queue-private command filter
  block: Create bip slabs with embedded integrity vectors
  cfq-iosched: get rid of the need for __GFP_NOFAIL in cfq_find_alloc_queue()
  cfq-iosched: move cfqq initialization out of cfq_find_alloc_queue()
  Trivial typo fixes in Documentation/block/data-integrity.txt.
2009-07-01 10:41:09 -07:00
Linus Torvalds 544ae5f96e Merge branch 'for-linus' of git://neil.brown.name/md
* 'for-linus' of git://neil.brown.name/md:
  md: use interruptible wait when duration is controlled by userspace.
  md/raid5: suspend shouldn't affect read requests.
  md: tidy up error paths in md_alloc
  md: fix error path when duplicate name is found on md device creation.
  md: avoid dereferencing NULL pointer when accessing suspend_* sysfs attributes.
  md: Use new topology calls to indicate alignment and I/O sizes
2009-07-01 10:31:26 -07:00
Martin K. Petersen 7878cba9f0 block: Create bip slabs with embedded integrity vectors
This patch restores stacking ability to the block layer integrity
infrastructure by creating a set of dedicated bip slabs.  Each bip slab
has an embedded bio_vec array at the end.  This cuts down on memory
allocations and also simplifies the code compared to the original bvec
version.  Only the largest bip slab is backed by a mempool.  The pool is
contained in the bio_set so stacking drivers can ensure forward
progress.

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@carl.(none)>
2009-07-01 10:56:25 +02:00
NeilBrown e62e58a5ff md: use interruptible wait when duration is controlled by userspace.
User space can set various limits on an md array so that resync waits
when it gets to a certain point, or so that I/O is blocked for a short
while.
When md is waiting against one of these limit, it should use an
interruptible wait so as not to add to the load average, and so are
not to trigger a warning if the wait goes on for too long.

Signed-off-by: NeilBrown <neilb@suse.de>
2009-07-01 13:15:35 +10:00
NeilBrown a5c308d4d1 md/raid5: suspend shouldn't affect read requests.
md allows write to regions on an array to be suspended temporarily.
This allows user-space to participate is aspects of reshape.
In particular, data can be copied with not risk of a race.
We should not be blocking read requests though, so don't.

Cc: stable@kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
2009-07-01 13:15:35 +10:00
NeilBrown 0909dc448c md: tidy up error paths in md_alloc
As the recent bug in md_alloc showed, having a single exit path for
unlocking and putting is a good idea.  So restructure md_alloc to have
a single mutex_unlock and mddev_put, and use gotos where necessary.

Found-by: Jiri Slaby <jirislaby@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2009-07-01 12:27:21 +10:00
NeilBrown 1ec22eb2b4 md: fix error path when duplicate name is found on md device creation.
When an md device is created by name (rather than number) we need to
check that the name is not already in use.  If this check finds a
duplicate, we return an error without dropping the lock or freeing
the newly create mddev.
This patch fixes that.

Cc: stable@kernel.org
Found-by: Jiri Slaby <jirislaby@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2009-07-01 12:27:21 +10:00
NeilBrown b8d966efd9 md: avoid dereferencing NULL pointer when accessing suspend_* sysfs attributes.
If we try to modify one of the md/ sysfs files
  suspend_lo or suspend_hi
when the array is not active, we dereference a NULL.
Protect against that.

Cc: stable@kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
2009-07-01 11:14:04 +10:00
Martin K. Petersen 8f6c2e4b32 md: Use new topology calls to indicate alignment and I/O sizes
Switch MD over to the new disk_stack_limits() function which checks for
aligment and adjusts preferred I/O sizes when stacking.

Also indicate preferred I/O sizes where applicable.

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2009-07-01 11:13:45 +10:00
Mike Snitzer ea9df47cc9 dm table: fix blk_stack_limits arg to use bytes not sectors
The offset passed to blk_stack_limits() must be in bytes not sectors.
Fixes false warnings like the following:
device-mapper: table: 254:1: target device sda6 is misaligned

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Reported-by: Frans Pop <elendil@planet.nl>
Tested-by: Frans Pop <elendil@planet.nl>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-06-30 15:18:17 +01:00
Milan Broz 874d2f61d3 dm exception store: really fix type lookup
Fix exception store name handling.

We need to reference exception store by zero terminated string.

Fixes regression introduced in commit f6bd4eb73c

Cc: Yi Yang <yi.y.yang@intel.com>
Cc: Jonathan Brassow <jbrassow@redhat.com>
Cc: stable@kernel.org
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-06-30 15:18:14 +01:00
Kiyoshi Ueda f40c67f0f7 dm mpath: change to be request based
This patch converts dm-multipath target to request-based from bio-based.

Basically, the patch just converts the I/O unit from struct bio
to struct request.
In the course of the conversion, it also changes the I/O queueing
mechanism.  The change in the I/O queueing is described in details
as follows.

I/O queueing mechanism change
-----------------------------
In I/O submission, map_io(), there is no mechanism change from
bio-based, since the clone request is ready for retry as it is.
However, in I/O complition, do_end_io(), there is a mechanism change
from bio-based, since the clone request is not ready for retry.

In do_end_io() of bio-based, the clone bio has all needed memory
for resubmission.  So the target driver can queue it and resubmit
it later without memory allocations.
The mechanism has almost no overhead.

On the other hand, in do_end_io() of request-based, the clone request
doesn't have clone bios, so the target driver can't resubmit it
as it is.  To resubmit the clone request, memory allocation for
clone bios is needed, and it takes some overheads.
To avoid the overheads just for queueing, the target driver doesn't
queue the clone request inside itself.
Instead, the target driver asks dm core for queueing and remapping
the original request of the clone request, since the overhead for
queueing is just a freeing memory for the clone request.

As a result, the target driver doesn't need to record/restore
the information of the original request for resubmitting
the clone request.  So dm_bio_details in dm_mpath_io is removed.

multipath_busy()
---------------------
The target driver returns "busy", only when the following case:
  o The target driver will map I/Os, if map() function is called
  and
  o The mapped I/Os will wait on underlying device's queue due to
    their congestions, if map() function is called now.

In other cases, the target driver doesn't return "busy".
Otherwise, dm core will keep the I/Os and the target driver can't
do what it wants.
(e.g. the target driver can't map I/Os now, so wants to kill I/Os.)

Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Acked-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-06-22 10:12:37 +01:00
Kiyoshi Ueda 523d9297d4 dm: disable interrupt when taking map_lock
This patch disables interrupt when taking map_lock to avoid
lockdep warnings in request-based dm.

request-based dm takes map_lock after taking queue_lock with
disabling interrupt:
  spin_lock_irqsave(queue_lock)
  q->request_fn() == dm_request_fn()
    => dm_get_table()
         => read_lock(map_lock)
while queue_lock could be (but isn't) taken in interrupt context.

Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Acked-by: Christof Schmitt <christof.schmitt@de.ibm.com>
Acked-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-06-22 10:12:37 +01:00
Kiyoshi Ueda 5d67aa2366 dm: do not set QUEUE_ORDERED_DRAIN if request based
Request-based dm doesn't have barrier support yet.
So we need to set QUEUE_ORDERED_DRAIN only for bio-based dm.
Since the device type is decided at the first table loading time,
the flag set is deferred until then.

Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Acked-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-06-22 10:12:36 +01:00
Kiyoshi Ueda e6ee8c0b76 dm: enable request based option
This patch enables request-based dm.

o Request-based dm and bio-based dm coexist, since there are
  some target drivers which are more fitting to bio-based dm.
  Also, there are other bio-based devices in the kernel
  (e.g. md, loop).
  Since bio-based device can't receive struct request,
  there are some limitations on device stacking between
  bio-based and request-based.

                     type of underlying device
                   bio-based      request-based
   ----------------------------------------------
    bio-based         OK                OK
    request-based     --                OK

  The device type is recognized by the queue flag in the kernel,
  so dm follows that.

o The type of a dm device is decided at the first table binding time.
  Once the type of a dm device is decided, the type can't be changed.

o Mempool allocations are deferred to at the table loading time, since
  mempools for request-based dm are different from those for bio-based
  dm and needed mempool type is fixed by the type of table.

o Currently, request-based dm supports only tables that have a single
  target.  To support multiple targets, we need to support request
  splitting or prevent bio/request from spanning multiple targets.
  The former needs lots of changes in the block layer, and the latter
  needs that all target drivers support merge() function.
  Both will take a time.

Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-06-22 10:12:36 +01:00
Kiyoshi Ueda cec47e3d4a dm: prepare for request based option
This patch adds core functions for request-based dm.

When struct mapped device (md) is initialized, md->queue has
an I/O scheduler and the following functions are used for
request-based dm as the queue functions:
    make_request_fn: dm_make_request()
    pref_fn:         dm_prep_fn()
    request_fn:      dm_request_fn()
    softirq_done_fn: dm_softirq_done()
    lld_busy_fn:     dm_lld_busy()
Actual initializations are done in another patch (PATCH 2).

Below is a brief summary of how request-based dm behaves, including:
  - making request from bio
  - cloning, mapping and dispatching request
  - completing request and bio
  - suspending md
  - resuming md

  bio to request
  ==============
  md->queue->make_request_fn() (dm_make_request()) calls __make_request()
  for a bio submitted to the md.
  Then, the bio is kept in the queue as a new request or merged into
  another request in the queue if possible.

  Cloning and Mapping
  ===================
  Cloning and mapping are done in md->queue->request_fn() (dm_request_fn()),
  when requests are dispatched after they are sorted by the I/O scheduler.

  dm_request_fn() checks busy state of underlying devices using
  target's busy() function and stops dispatching requests to keep them
  on the dm device's queue if busy.
  It helps better I/O merging, since no merge is done for a request
  once it is dispatched to underlying devices.

  Actual cloning and mapping are done in dm_prep_fn() and map_request()
  called from dm_request_fn().
  dm_prep_fn() clones not only request but also bios of the request
  so that dm can hold bio completion in error cases and prevent
  the bio submitter from noticing the error.
  (See the "Completion" section below for details.)

  After the cloning, the clone is mapped by target's map_rq() function
    and inserted to underlying device's queue using
    blk_insert_cloned_request().

  Completion
  ==========
  Request completion can be hooked by rq->end_io(), but then, all bios
  in the request will have been completed even error cases, and the bio
  submitter will have noticed the error.
  To prevent the bio completion in error cases, request-based dm clones
  both bio and request and hooks both bio->bi_end_io() and rq->end_io():
      bio->bi_end_io(): end_clone_bio()
      rq->end_io():     end_clone_request()

  Summary of the request completion flow is below:
  blk_end_request() for a clone request
    => blk_update_request()
       => bio->bi_end_io() == end_clone_bio() for each clone bio
          => Free the clone bio
          => Success: Complete the original bio (blk_update_request())
             Error:   Don't complete the original bio
    => blk_finish_request()
       => rq->end_io() == end_clone_request()
          => blk_complete_request()
             => dm_softirq_done()
                => Free the clone request
                => Success: Complete the original request (blk_end_request())
                   Error:   Requeue the original request

  end_clone_bio() completes the original request on the size of
  the original bio in successful cases.
  Even if all bios in the original request are completed by that
  completion, the original request must not be completed yet to keep
  the ordering of request completion for the stacking.
  So end_clone_bio() uses blk_update_request() instead of
  blk_end_request().
  In error cases, end_clone_bio() doesn't complete the original bio.
  It just frees the cloned bio and gives over the error handling to
  end_clone_request().

  end_clone_request(), which is called with queue lock held, completes
  the clone request and the original request in a softirq context
  (dm_softirq_done()), which has no queue lock, to avoid a deadlock
  issue on submission of another request during the completion:
      - The submitted request may be mapped to the same device
      - Request submission requires queue lock, but the queue lock
        has been held by itself and it doesn't know that

  The clone request has no clone bio when dm_softirq_done() is called.
  So target drivers can't resubmit it again even error cases.
  Instead, they can ask dm core for requeueing and remapping
  the original request in that cases.

  suspend
  =======
  Request-based dm uses stopping md->queue as suspend of the md.
  For noflush suspend, just stops md->queue.

  For flush suspend, inserts a marker request to the tail of md->queue.
  And dispatches all requests in md->queue until the marker comes to
  the front of md->queue.  Then, stops dispatching request and waits
  for the all dispatched requests to complete.
  After that, completes the marker request, stops md->queue and
  wake up the waiter on the suspend queue, md->wait.

  resume
  ======
  Starts md->queue.

Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-06-22 10:12:35 +01:00
Jonthan Brassow f5db4af466 dm raid1: add userspace log
This patch contains a device-mapper mirror log module that forwards
requests to userspace for processing.

The structures used for communication between kernel and userspace are
located in include/linux/dm-log-userspace.h.  Due to the frequency,
diversity, and 2-way communication nature of the exchanges between
kernel and userspace, 'connector' was chosen as the interface for
communication.

The first log implementations written in userspace - "clustered-disk"
and "clustered-core" - support clustered shared storage.   A userspace
daemon (in the LVM2 source code repository) uses openAIS/corosync to
process requests in an ordered fashion with the rest of the nodes in the
cluster so as to prevent log state corruption.  Other implementations
with no association to LVM or openAIS/corosync, are certainly possible.

(Imagine if two machines are writing to the same region of a mirror.
They would both mark the region dirty, but you need a cluster-aware
entity that can handle properly marking the region clean when they are
done.  Otherwise, you might clear the region when the first machine is
done, not the second.)

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Cc: Evgeniy Polyakov <johnpol@2ka.mipt.ru>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-06-22 10:12:35 +01:00
Mike Snitzer 754c5fc7eb dm: calculate queue limits during resume not load
Currently, device-mapper maintains a separate instance of 'struct
queue_limits' for each table of each device.  When the configuration of
a device is to be changed, first its table is loaded and this structure
is populated, then the device is 'resumed' and the calculated
queue_limits are applied.

This places restrictions on how userspace may process related devices,
where it is often advantageous to 'load' tables for several devices
at once before 'resuming' them together.  As the new queue_limits
only take effect after the 'resume', if they are changing and one
device uses another, the latter must be 'resumed' before the former
may be 'loaded'.

This patch moves the calculation of these queue_limits out of
the 'load' operation into 'resume'.  Since we are no longer
pre-calculating this struct, we no longer need to maintain copies
within our dm structs.

dm_set_device_limits() now passes the 'start' of the device's
data area (aka pe_start) as the 'offset' to blk_stack_limits().

init_valid_queue_limits() is replaced by blk_set_default_limits().

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: martin.petersen@oracle.com
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-06-22 10:12:34 +01:00
Mike Snitzer 18d8594dd9 dm log: fix create_log_context to use logical_block_size of log device
create_log_context() must use the logical_block_size from the log disk,
where the I/O happens, not the target's logical_block_size.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-06-22 10:12:33 +01:00
Mike Snitzer af4874e03e dm target:s introduce iterate devices fn
Add .iterate_devices to 'struct target_type' to allow a function to be
called for all devices in a DM target.  Implemented it for all targets
except those in dm-snap.c (origin and snapshot).

(The raid1 version number jumps to 1.12 because we originally reserved
1.1 to 1.11 for 'block_on_error' but ended up using 'handle_errors'
instead.)

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Cc: martin.petersen@oracle.com
2009-06-22 10:12:33 +01:00
Mike Snitzer 1197764e40 dm table: establish queue limits by copying table limits
Copy the table's queue_limits to the DM device's request_queue.  This
properly initializes the queue's topology limits and also avoids having
to track the evolution of 'struct queue_limits' in
dm_table_set_restrictions()

Also fixes a bug that was introduced in dm_table_set_restrictions() via
commit ae03bf639a.  In addition to
establishing 'bounce_pfn' in the queue's limits blk_queue_bounce_limit()
also performs an allocation to setup the ISA DMA pool.  This allocation
resulted in "sleeping function called from invalid context" when called
from dm_table_set_restrictions().

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-06-22 10:12:32 +01:00
Mike Snitzer 5ab97588fb dm table: replace struct io_restrictions with struct queue_limits
Use blk_stack_limits() to stack block limits (including topology) rather
than duplicate the equivalent within Device Mapper.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-06-22 10:12:32 +01:00
Mike Snitzer be6d4305db dm table: validate device logical_block_size
Impose necessary and sufficient conditions on a devices's table such
that any incoming bio which respects its logical_block_size can be
processed successfully.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-06-22 10:12:31 +01:00
Mike Snitzer 02acc3a4fa dm table: ensure targets are aligned to logical_block_size
Ensure I/O is aligned to the logical block size of target devices.

Rename check_device_area() to device_area_is_valid() for clarity and
establish the device limits including the logical block size prior to
calling it.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-06-22 10:12:30 +01:00
Milan Broz 60935eb21d dm ioctl: support cookies for udev
Add support for passing a 32 bit "cookie" into the kernel with the
DM_SUSPEND, DM_DEV_RENAME and DM_DEV_REMOVE ioctls.  The (unsigned)
value of this cookie is returned to userspace alongside the uevents
issued by these ioctls in the variable DM_COOKIE.

This means the userspace process issuing these ioctls can be notified
by udev after udev has completed any actions triggered.

To minimise the interface extension, we pass the cookie into the
kernel in the event_nr field which is otherwise unused when calling
these ioctls.  Incrementing the version number allows userspace to
determine in advance whether or not the kernel supports the cookie.
If the kernel does support this but userspace does not, there should
be no impact as the new variable will just get ignored.

Signed-off-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-06-22 10:12:30 +01:00
Peter Rajnoha 486d220fe4 dm: sysfs add suspended attribute
Add a file named 'suspended' to each device-mapper device directory in
sysfs.  It holds the value 1 while the device is suspended.  Otherwise
it holds 0.

Signed-off-by: Peter Rajnoha <prajnoha@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-06-22 10:12:29 +01:00
Jonthan Brassow 1b6da75459 dm table: improve warning message when devices not freed before destruction
Report any devices forgotten to be freed before a table is destroyed.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-06-22 10:12:29 +01:00
Kiyoshi Ueda f392ba889b dm mpath: add service time load balancer
This patch adds a service time oriented dynamic load balancer,
dm-service-time, which selects the path with the shortest estimated
service time for the incoming I/O.
The service time is estimated by dividing the in-flight I/O size
by a performance value of each path.

The performance value can be given as a table argument at the table
loading time.  If no performance value is given, all paths are
considered equal.

Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-06-22 10:12:28 +01:00
Kiyoshi Ueda fd5e033908 dm mpath: add queue length load balancer
This patch adds a dynamic load balancer, dm-queue-length, which
balances the number of in-flight I/Os across the paths.

The code is based on the patch posted by Stefan Bader:
https://www.redhat.com/archives/dm-devel/2005-October/msg00050.html

Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-06-22 10:12:27 +01:00
Kiyoshi Ueda 02ab823fd1 dm mpath: add start_io and nr_bytes to path selectors
This patch makes two additions to the dm path selector interface for
dynamic load balancers:
  o a new hook, start_io()
  o a new parameter 'nr_bytes' to select_path()/start_io()/end_io()
    to pass the size of the I/O

start_io() is called when a target driver actually submits I/O
to the selected path.
Path selectors can use it to start accounting of the I/O.
(e.g. counting the number of in-flight I/Os.)
The start_io hook is based on the patch posted by Stefan Bader:
https://www.redhat.com/archives/dm-devel/2005-October/msg00050.html

nr_bytes, the size of the I/O, is so path selectors can take the
size of the I/O into account when deciding which path to use.
dm-service-time uses it to estimate service time, for example.
(Added the nr_bytes member to dm_mpath_io instead of using existing
 details.bi_size, since request-based dm patch deletes it.)

Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-06-22 10:12:27 +01:00
Mikulas Patocka 2bd0234525 dm snapshot: use barrier when writing exception store
Send barrier requests when updating the exception area.

Exception area updates need to be ordered w.r.t. data writes, so that
the writes are not reordered in hardware disk cache.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-06-22 10:12:26 +01:00
Mikulas Patocka 51aa322849 dm io: retry after barrier error
If -EOPNOTSUPP was returned and the request was a barrier request, retry it
without barrier.

Retry all regions for now. Barriers are submitted only for one-region requests,
so it doesn't matter.  (In the future, retries can be limited to the actual
regions that failed.)

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-06-22 10:12:26 +01:00
Mikulas Patocka 5af443a7e1 dm io: record eopnotsupp
Add another field, eopnotsupp_bits. It is subset of error_bits, representing
regions that returned -EOPNOTSUPP.  (The bit is set in both error_bits and
eopnotsupp_bits).

This value will be used in further patches.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-06-22 10:12:25 +01:00
Mikulas Patocka 494b3ee7d4 dm snapshot: support barriers
Flush support for dm-snapshot target.

This patch just forwards the flush request to either the origin or the snapshot
device.  (It doesn't flush exception store metadata.)

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-06-22 10:12:25 +01:00
Mikulas Patocka 8627921fa2 dm mpath: support barriers
Flush support for dm-multipath target.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-06-22 10:12:24 +01:00
Mikulas Patocka c927259e34 dm delay: support barriers
Flush support for dm-delay target.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-06-22 10:12:23 +01:00
Mikulas Patocka 647c7db14e dm crypt: support flush
Flush support for dm-crypt target.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-06-22 10:12:23 +01:00
Mikulas Patocka 374bf7e7f6 dm: stripe support flush
Flush support for the stripe target.

This sets ti->num_flush_requests to the number of stripes and
remaps individual flush requests to the appropriate stripe devices.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-06-22 10:12:22 +01:00
Mikulas Patocka 433bcac564 dm: linear support flush
Flush support for the linear target.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-06-22 10:12:22 +01:00
Mikulas Patocka 52b1fd5a27 dm: send empty barriers to targets in dm_flush
Pass empty barrier flushes to the targets in dm_flush().

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-06-22 10:12:21 +01:00
Alasdair G Kergon 9015df24a8 dm: initialise tio in alloc_tio
Move repeated dm_target_io initialisation inside alloc_tio().

Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-06-22 10:12:21 +01:00
Mikulas Patocka f9ab94cee3 dm: introduce num_flush_requests
Introduce num_flush_requests for a target to set to say how many flush
instructions (empty barriers) it wants to receive.  These are sent by
__clone_and_map_empty_barrier with map_info->flush_request going from 0
to (num_flush_requests - 1).

Old targets without flush support won't receive any flush requests.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-06-22 10:12:20 +01:00
Mikulas Patocka 27eaa14975 dm: remove check that prevents mapping empty bios
Remove the check that the size of the cloned bio is not zero because a
subsequent patch needs to send zero-sized barriers down this path.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-06-22 10:12:20 +01:00
Mikulas Patocka fdb9572b73 dm: remove EOPNOTSUPP for barriers
If the underlying device doesn't support barriers and dm receives a
barrier, it waits until all requests on that device drain so it no
longer needs to report -EOPNOTSUPP to the caller.

This patch deals with the confusing situation when moving a volume from
one physical device to another triggers an EOPNOTSUPP on a volume that
didn't report it before.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-06-22 10:12:19 +01:00
Mikulas Patocka 5aa2781d96 dm: store only first barrier error
With the following patches, more than one error can occur during
processing.  Change md->barrier_error so that only the first one is
recorded and returned to the caller.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-06-22 10:12:18 +01:00
Mikulas Patocka 2761e95fe4 dm: process requeue in dm_wq_work
If barrier request was returned with DM_ENDIO_REQUEUE,
requeue it in dm_wq_work instead of dec_pending.

This allows us to correctly handle a situation when some targets
are asking for a requeue and other targets signal an error.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-06-22 10:12:18 +01:00
Mikulas Patocka 531fe96364 dm: make dm_flush return void
Make dm_flush return void.

The first error during flush is stored in md->barrier_error instead.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-06-22 10:12:17 +01:00
Mikulas Patocka 32a926da5a dm: always hold bdev reference
Fix a potential deadlock when creating multiple snapshots by holding a
reference to struct block_device for the whole lifecycle of every dm
device instead of obtaining it independently at each point it is needed.

bdget_disk() was called while the device was being suspended, in
dm_suspend().  However there could be other devices already suspended,
for example when creating additional snapshots of a device. bdget_disk()
can wait for IO and allocate memory resulting in waiting for the
already-suspended device - deadlock.

This patch changes the code so that it gets the reference to struct
block_device when struct mapped_device is allocated and initialized in
alloc_dev() where it is always OK to allocate memory or wait for I/O.
It drops the reference when it is destroyed in free_dev().  Thus there
is no call to bdget_disk() while any device is suspended.

Previously unlock_fs() was called only if bdev was held.  Now it is
called unconditionally, but the superfluous calls are harmless because
it returns immediately if the filesystem was not previously frozen.

This patch also now allows the device size to be changed in a
noflush suspend because the bdev is held.  This has no adverse effect.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-06-22 10:12:17 +01:00
Mikulas Patocka db8fef4fab dm: rename suspended_bdev to bdev
Rename suspended_bdev to bdev.

This patch doesn't change any functionality, just renames the variable.
In the next patch, the variable will be used even for non-suspended device.

(Pre-requisite for the per-target barrier support patches.)

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-06-22 10:12:15 +01:00
Jonathan Brassow f6bd4eb73c dm exception store: fix exstore lookup to be case insensitive
When snapshots are created using 'p' instead of 'P' as the
exception store type, the device-mapper table loading fails.

This patch makes the code case insensitive as intended and fixes some
regressions reported with device-mapper snapshots.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Cc: stable@kernel.org
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-06-22 10:12:15 +01:00
Mikulas Patocka 5657e8fa45 dm: use i_size_read
Use i_size_read() instead of reading i_size.

If someone changes the size of the device simultaneously, i_size_read
is guaranteed to return a valid value (either the old one or the new one).

i_size can return some intermediate invalid value (on 32-bit computers
with 64-bit i_size, the reads to both halves of i_size can be interleaved
with updates to i_size, resulting in garbage being returned).

Cc: stable@kernel.org
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-06-22 10:12:14 +01:00
Mikulas Patocka 8cbeb67ad5 dm: avoid unsupported spanning of md stripe boundaries
A bio that has two or more vector entries, size less than or equal to
page size, that crosses a stripe boundary of an underlying md device is
accepted by device mapper (it conforms to all its limits) but not by the
underlying device.

The fix is: If device mapper selects the one-page maximum request size,
it also needs to set its own q->merge_bvec_fn to reject any bios with
multiple vector entries that span more pages.

The problem was discovered in the following scenario:
  * MD - RAID-0
  * LV on the top of it (raid1, snapshot or striped with chunk
size/stripe larger than RAID-0 stripe)
  * one of the logical volumes is exported to xen domU
  * inside xen domU it is partitioned, the key point is that the partition
must be unaligned on page boundary (fdisk normally aligns the partition to
63 sectors which will trigger it)
  * install the system on the partitioned disk in domU
This causes I/O failures in dom0.
Reference: https://bugzilla.redhat.com/show_bug.cgi?id=223947

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-06-22 10:12:14 +01:00
Mikulas Patocka 53b351f972 dm mpath: flush keventd queue in destructor
The commit fe9cf30eb8 moves dm table event
submission from kmultipath queue to kernel kevent queue to avoid a
deadlock.

There is a possibility of race condition because kevent queue is not flushed
in the multipath destructor. The scenario is:
- some event happens and is queued to keventd
- keventd thread is delayed due to scheuling latency or some other work
- multipath device is destroyed
- keventd now attempts to process work_struct that is residing in already
  released memory.

The patch flushes the keventd queue in multipath constructor.
I've already fixed similar bug in dm-raid1.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Cc: stable@kernel.org
2009-06-22 10:12:13 +01:00
Mikulas Patocka a72986c562 dm raid1: keep retrying alloc if mempool_alloc failed
If the code can't handle allocation failures, use __GFP_NOFAIL so that
in case of memory pressure the allocator will retry indefinitely and
won't return NULL which would cause a crash in the function.

This is still not a correct fix, it may cause a classic deadlock when
memory manager waits for I/O being done and I/O waits for some free memory.
I/O code shouldn't allocate any memory. But in this case it probably
doesn't matter much in practice, people usually do not swap on RAID.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-06-22 10:12:13 +01:00
Chandra Seetharaman e54f77ddda dm mpath: call activate fn for each path in pg_init
Fixed a problem affecting reinstatement of passive paths.

Before we moved the hardware handler from dm to SCSI, it performed a pg_init
for a path group and didn't maintain any state about each path in hardware
handler code.

But in SCSI dh, such state is now maintained, as we want to fail I/O early on a
path if it is not the active path.

All the hardware handlers have a state now and set to active or some form of
inactive.  They have prep_fn() which uses this state to fail the I/O without
it ever being sent to the device.

So in effect when dm-multipath calls scsi_dh_activate(), activate is
sent to only one path and the "state" of that path is changed appropriately
to "active" while other paths in the same path group are never changed
as they never got an "activate".

In order make sure all the paths in a path group gets their state set
properly when a pg_init happens, we need to call scsi_dh_activate() on
all paths in a path group.

Doing this at the hardware handler layer is not a good option as we
want the multipath layer to define the relationship between path and path
groups and not the hardware handler.

Attached patch sends an "activate" on each path in a path group when a
path group is switched. It also sends an activate when a path is reinstated.

Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-06-22 10:12:12 +01:00
Hannes Reinecke a0cf7ea954 dm mpath: change attached scsi_dh
When specifying a different hardware handler via multipath
features we should be able to override the built-in defaults.

The problem here is the hardware table from scsi_dh is compiled
in and cannot be changed from userland. The multipath.conf OTOH
is purely user-defined and, what's more, the user might have a valid
reason for modifying it.
(EG EMC Clariion can well be run in PNR mode even though ALUA is
active, or the user might want to try ALUA on any as-of-yet unknown
devices)

So _not_ allowing multipath to override the device handler setting
will just add to the confusion and makes error tracking even more
difficult.

Signed-off-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-06-22 10:12:11 +01:00
Milan Broz 4d89b7b4e4 dm: sysfs skip output when device is being destroyed
Do not process sysfs attributes when device is being destroyed.

Otherwise code can cause
  BUG_ON(test_bit(DMF_FREEING, &md->flags));
in dm_put() call.

Cc: stable@kernel.org
Signed-off-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-06-22 10:12:11 +01:00
Mikulas Patocka e094f4f15f dm mpath: validate hw_handler argument count
Fix arg count parsing error in hw handlers.

Cc: stable@kernel.org
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-06-22 10:12:10 +01:00
Mikulas Patocka 0e0497c0c0 dm mpath: validate table argument count
The parser reads the argument count as a number but doesn't check that
sufficient arguments are supplied. This command triggers the bug:

dmsetup create mpath --table "0 `blockdev --getsize /dev/mapper/cr0`
    multipath 0 0 2 1 round-robin 1000 0 1 1 /dev/mapper/cr0
    round-robin 0 1 1 /dev/mapper/cr1 1000"
kernel BUG at drivers/md/dm-mpath.c:530!

Cc: stable@kernel.org
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-06-22 10:08:02 +01:00
Linus Torvalds 31583d6acf Merge branch 'for-linus' of git://git.kernel.dk/linux-2.6-block
* 'for-linus' of git://git.kernel.dk/linux-2.6-block:
  Fix kernel-doc parameter name typo in blk-settings.c:
  block: rename CONFIG_LBD to CONFIG_LBDAF
  block: Fix bounce_pfn setting
  hd: stop defining MAJOR_NR
2009-06-19 17:43:04 -07:00
Bartlomiej Zolnierkiewicz 90c699a9ee block: rename CONFIG_LBD to CONFIG_LBDAF
Follow-up to "block: enable by default support for large devices
and files on 32-bit archs".

Rename CONFIG_LBD to CONFIG_LBDAF to:
- allow update of existing [def]configs for "default y" change
- reflect that it is used also for large files support nowadays

Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-06-19 08:08:50 +02:00
Linus Torvalds 9729a6eb58 Merge branch 'for-linus' of git://neil.brown.name/md
* 'for-linus' of git://neil.brown.name/md: (39 commits)
  md/raid5: correctly update sync_completed when we reach max_resync
  md/raid5: add missing call to schedule() after prepare_to_wait()
  md/linear: use call_rcu to free obsolete 'conf' structures.
  md linear: Protecting mddev with rcu locks to avoid races
  md: Move check for bitmap presence to personality code.
  md: remove chunksize rounding from common code.
  md: raid0/linear: ensure device sizes are rounded to chunk size.
  md: move assignment of ->utime so that it never gets skipped.
  md: Push down reconstruction log message to personality code.
  md: merge reconfig and check_reshape methods.
  md: remove unnecessary arguments from ->reconfig method.
  md: raid5: check stripe cache is large enough in start_reshape
  md: raid0: chunk_sectors cleanups.
  md: fix some comments.
  md/raid5: Use is_power_of_2() in raid5_reconfig()/raid6_reconfig().
  md: convert conf->chunk_size and conf->prev_chunk to sectors.
  md: Convert mddev->new_chunk to sectors.
  md: Make mddev->chunk_size sector-based.
  md: raid0 :Enables chunk size other than powers of 2.
  md: prepare for non-power-of-two chunk sizes
  ...
2009-06-18 13:11:50 -07:00
NeilBrown 48606a9f2f md/raid5: correctly update sync_completed when we reach max_resync
At the end of reshape_request we update cyrr_resync_completed
if we are about to pause due to reaching resync_max.
However we update it to the wrong value.  We need to add the
"reshape_sectors" that have just been reshaped.

Signed-off-by: NeilBrown <neilb@suse.de>
2009-06-18 09:14:12 +10:00
Dan Williams 7a3ab90894 md/raid5: add missing call to schedule() after prepare_to_wait()
In the unlikely event that reshape progresses past the current request
while it is waiting for a stripe we need to schedule() before retrying
for 2 reasons:
1/ Prevent list corruption from duplicated list_add() calls without
   intervening list_del().
2/ Give the reshape code a chance to make some progress to resolve the
   conflict.

Cc: <stable@kernel.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2009-06-18 08:50:18 +10:00
NeilBrown 495d357301 md/linear: use call_rcu to free obsolete 'conf' structures.
Current, when we update the 'conf' structure, when adding a
drive to a linear array, we keep the old version around until
the array is finally stopped, as it is not safe to free it
immediately.

Now that we have rcu protection on all accesses to 'conf',
we can use call_rcu to free it more promptly.

Signed-off-by: NeilBrown <neilb@suse.de>
2009-06-18 08:49:42 +10:00
SandeepKsinha af11c397fd md linear: Protecting mddev with rcu locks to avoid races
Due to the lack of memory ordering guarantees, we may have races around
mddev->conf.

In particular, the correct contents of the structure we get from
dereferencing ->private might not be visible to this CPU yet, and
they might not be correct w.r.t mddev->raid_disks.

This patch addresses the problem using rcu protection to avoid
such race conditions.

Signed-off-by: SandeepKsinha <sandeepksinha@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2009-06-18 08:49:35 +10:00
Andre Noll 0894cc3066 md: Move check for bitmap presence to personality code.
If the superblock of a component device indicates the presence of a
bitmap but the corresponding raid personality does not support bitmaps
(raid0, linear, multipath, faulty), then something is seriously wrong
and we'd better refuse to run such an array.

Currently, this check is performed while the superblocks are examined,
i.e. before entering personality code. Therefore the generic md layer
must know which raid levels support bitmaps and which do not.

This patch avoids this layer violation without adding identical code
to various personalities. This is accomplished by introducing a new
public function to md.c, md_check_no_bitmap(), which replaces the
hard-coded checks in the superblock loading functions.

A call to md_check_no_bitmap() is added to the ->run method of each
personality which does not support bitmaps and assembly is aborted
if at least one component device contains a bitmap.

Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: NeilBrown <neilb@suse.de>
2009-06-18 08:49:23 +10:00
NeilBrown 8190e754e0 md: remove chunksize rounding from common code.
It is easiest to round sizes to multiples of chunk size in
the personality code for those personalities which care.
Those personalities now do the rounding, so we can
remove that function from common code.

Also remove the upper bound on the size of a chunk, and the lower
bound on the size of a device (1 chunk), neither of which really buy
us anything.

Signed-off-by: NeilBrown <neilb@suse.de>
2009-06-18 08:48:58 +10:00
NeilBrown 13f2682b72 md: raid0/linear: ensure device sizes are rounded to chunk size.
This is currently ensured by common code, but it is more reliable to
ensure it where it is needed in personality code.
All the other personalities that care already round the size to
the chunk_size.  raid0 and linear are the only hold-outs.

Signed-off-by: NeilBrown <neilb@suse.de>
2009-06-18 08:48:55 +10:00
NeilBrown 1b57f13223 md: move assignment of ->utime so that it never gets skipped.
Currently the assignment to utime gets skipped for 'external'
metadata.  So move it to the top of the function so that it
always gets effected.
This is of largely cosmetic interest.  Nothing actually depends
on ->utime being right for external arrays.
"mdadm --monitor" does use it for 0.90 and 1.x arrays, but with
mdadm-3.0, this is not important for external metadata.

Signed-off-by: NeilBrown <neilb@suse.de>
2009-06-18 08:48:19 +10:00
Andre Noll 8c6ac868b1 md: Push down reconstruction log message to personality code.
Currently, the md layer checks in analyze_sbs() if the raid level
supports reconstruction (mddev->level >= 1) and if reconstruction is
in progress (mddev->recovery_cp != MaxSector).

Move that printk into the personality code of those raid levels that
care (levels 1, 4, 5, 6, 10).

Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: NeilBrown <neilb@suse.de>
2009-06-18 08:48:06 +10:00
NeilBrown 50ac168a6e md: merge reconfig and check_reshape methods.
The difference between these two methods is artificial.
Both check that a pending reshape is valid, and perform any
aspect of it that can be done immediately.
'reconfig' handles chunk size and layout.
'check_reshape' handles raid_disks.

So make them just one method.

Signed-off-by: NeilBrown <neilb@suse.de>
2009-06-18 08:47:55 +10:00
NeilBrown 597a711b69 md: remove unnecessary arguments from ->reconfig method.
Passing the new layout and chunksize as args is not necessary as
the mddev has fields for new_check and new_layout.

This is preparation for combining the check_reshape and reconfig
methods

Signed-off-by: NeilBrown <neilb@suse.de>
2009-06-18 08:47:42 +10:00
NeilBrown 01ee22b496 md: raid5: check stripe cache is large enough in start_reshape
In reshape cases that do not change the number of devices,
start_reshape is called without first calling check_reshape.

Currently, the check that the stripe_cache is large enough is
only done in check_reshape.  It should be in start_reshape too.

Signed-off-by: NeilBrown <neilb@suse.de>
2009-06-18 08:47:20 +10:00
NeilBrown d6e412eaa5 md: raid0: chunk_sectors cleanups.
following the conversion to chunk_sectors, there is room
for cleaning up a little.

Signed-off-by: NeilBrown <neilb@suse.de>
2009-06-18 08:47:00 +10:00
Andre Noll cdc2ae6d6a md: fix some comments.
1/ Raid5 has learned to take over also raid4 and raid6 arrays.
2/ new_chunk in mdp_superblock_1 is in sectors, not bytes.

Signed-off-by: NeilBrown <neilb@suse.de>
2009-06-18 08:46:47 +10:00
Andre Noll 0ba459d262 md/raid5: Use is_power_of_2() in raid5_reconfig()/raid6_reconfig().
Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: NeilBrown <neilb@suse.de>
2009-06-18 08:46:10 +10:00
Andre Noll 09c9e5fa1b md: convert conf->chunk_size and conf->prev_chunk to sectors.
This kills some more shifts.

Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: NeilBrown <neilb@suse.de>
2009-06-18 08:45:55 +10:00
Andre Noll 664e7c413f md: Convert mddev->new_chunk to sectors.
A straight-forward conversion which gets rid of some
multiplications/divisions/shifts. The patch also introduces a couple
of new ones, most of which are due to conf->chunk_size still being
represented in bytes. This will be cleaned up in subsequent patches.

Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: NeilBrown <neilb@suse.de>
2009-06-18 08:45:27 +10:00
Andre Noll 9d8f036362 md: Make mddev->chunk_size sector-based.
This patch renames the chunk_size field to chunk_sectors with the
implied change of semantics.  Since

	is_power_of_2(chunk_size) = is_power_of_2(chunk_sectors << 9)
				  = is_power_of_2(chunk_sectors)

these bits don't need an adjustment for the shift.

Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: NeilBrown <neilb@suse.de>
2009-06-18 08:45:01 +10:00
Linus Torvalds 6fd03301d7 Merge git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core-2.6: (64 commits)
  debugfs: use specified mode to possibly mark files read/write only
  debugfs: Fix terminology inconsistency of dir name to mount debugfs filesystem.
  xen: remove driver_data direct access of struct device from more drivers
  usb: gadget: at91_udc: remove driver_data direct access of struct device
  uml: remove driver_data direct access of struct device
  block/ps3: remove driver_data direct access of struct device
  s390: remove driver_data direct access of struct device
  parport: remove driver_data direct access of struct device
  parisc: remove driver_data direct access of struct device
  of_serial: remove driver_data direct access of struct device
  mips: remove driver_data direct access of struct device
  ipmi: remove driver_data direct access of struct device
  infiniband: ehca: remove driver_data direct access of struct device
  ibmvscsi: gadget: at91_udc: remove driver_data direct access of struct device
  hvcs: remove driver_data direct access of struct device
  xen block: remove driver_data direct access of struct device
  thermal: remove driver_data direct access of struct device
  scsi: remove driver_data direct access of struct device
  pcmcia: remove driver_data direct access of struct device
  PCIE: remove driver_data direct access of struct device
  ...

Manually fix up trivial conflicts due to different direct driver_data
direct access fixups in drivers/block/{ps3disk.c,ps3vram.c}
2009-06-16 12:57:37 -07:00
Li Zefan e212d6f250 block: remove some includings of blktrace_api.h
When porting blktrace to tracepoints, we changed to trace/block.h
for trace prober declarations.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-06-16 11:19:36 +02:00
raz ben yehuda fbb704efb7 md: raid0 :Enables chunk size other than powers of 2.
Maintain two flows, one for pow2 chunk sizes (which uses masks and
shift), and a flow for the general case (which uses sector_div).
This is for the sake of performance.

 - introduce map_sector and is_io_in_chunk_boundary to encapsulate
   those two flows better for raid0_make_request
 - fix blk_mergeable to support the two flows.

Signed-off-by: raziebe@gmail.com
Signed-off-by: NeilBrown <neilb@suse.de>
2009-06-16 17:02:05 +10:00
raz ben yehuda 2ac06c3332 md: prepare for non-power-of-two chunk sizes
Remove chunk size check from md as this is now performed in the run
function in each personality.

Replace chunk size power 2 code calculations by a regular division.

Signed-off-by: raziebe@gmail.com
Signed-off-by: NeilBrown <neilb@suse.de>
2009-06-16 17:01:42 +10:00
raz ben yehuda 740da44918 md: raid5: chunk size check in setup_conf
have raid5 check chunk size in run/reshape method instead of in md

Signed-off-by: raziebe@gmail.com
Signed-off-by: NeilBrown <neilb@suse.de>
2009-06-16 17:01:36 +10:00
raz ben yehuda 964e7913b0 md: raid10: chunk size check in run
have raid10 check chunk size in run method instead of in md

Signed-off-by: raziebe@gmail.com
Signed-off-by: NeilBrown <neilb@suse.de>
2009-06-16 17:01:22 +10:00
raz ben yehuda 92e59b6ba2 md: raid0: chunk size check in raid0_run
have raid0 check chunk size in run method instead of in md.
This is part of a series moving the checks from common code to
the personalities where they belong.

hardsect is short and chunksize is an int, so it is safe to use %.

Signed-off-by: raziebe@gmail.com
Signed-off-by: NeilBrown <neilb@suse.de>
2009-06-16 17:00:57 +10:00
raz ben yehuda 46994191ae md: have raid0 report its formation
Report to the user what are the raid zones

Signed-off-by: raziebe@gmail.com
Signed-off-by: NeilBrown <neilb@suse.de>
2009-06-16 17:00:54 +10:00
raz ben yehuda 1b9614291e md: have raid0 compile with MD_DEBUG on
Because of the removal of the device list from
the strips raid0 did not compile with MD_DEBUG flag on

Signed-off-by: NeilBrown <neilb@suse.de>
2009-06-16 16:57:40 +10:00
Sandeep K Sinha aece3d1f40 md: Binary search in linear raid
Replace the linear search with binary search in which_dev.

Signed-off-by: Sandeep K Sinha <sandeepksinha@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2009-06-16 16:57:08 +10:00
Sandeep K Sinha 4db7cdc859 md: Removing num_sector and replacing start_sector with end_sector
Remove num_sectors from dev_info and replace start_sector with
end_sector.  This makes a lot of comparisons much simpler.

Signed-off-by: Sandeep K Sinha <sandeepksinha@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2009-06-16 16:56:13 +10:00
Sandeep K Sinha 45d4582f21 md: Removal of hash table in linear raid
Get rid of sector_div and hash table for linear raid and replace
with a linear search in which_dev.
The hash table adds a lot of complexity for little if any gain.
Ultimately a binary search will be used which will have smaller
cache foot print, a similar number of memory access, and no
divisions.

Signed-off-by: Sandeep K Sinha <sandeepksinha@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2009-06-16 16:55:26 +10:00
NeilBrown 070ec55d07 md: remove mddev_to_conf "helper" macro
Having a macro just to cast a void* isn't really helpful.
I would must rather see that we are simply de-referencing ->private,
than have to know what the macro does.

So open code the macro everywhere and remove the pointless cast.

Signed-off-by: NeilBrown <neilb@suse.de>
2009-06-16 16:54:21 +10:00
NeilBrown a6b3deafe0 md: raid0: remove setting of segment boundary.
This setting doesn't seem to make sense (half the chunk size??) and
shouldn't be needed.
The segment boundary exported by raid0 should simply be the minimum
of the segment boundary of all component devices.  And we already
get that right.

Signed-off-by: NeilBrown <neilb@suse.de>
2009-06-16 16:54:07 +10:00
NeilBrown b414579f45 md: raid0: remove ->dev pointer from strip_zone structure
If we treat conf->devlist more like a 2 dimensional array,
we can get the devlist for a particular zone simply by indexing
that array, so we don't need to store the pointers to subarrays
in strip_zone.  This makes strip_zone smaller and so (hopefully)
searches faster.

Signed-of-by: NeilBrown <neilb@suse.de>
2009-06-16 16:50:52 +10:00
NeilBrown 49f357a22b md: raid0: remove ->sectors from the strip_zone structure.
storing ->sectors is redundant as is can be computed from the
difference  z->zone_end - (z-1)->zone_end

The one place where it is used, it is just as efficient to use
a zone_end value instead.

And removing it makes strip_zone smaller, so they array of these that
is searched on every request has a better chance to say in cache.

So discard the field and get the value from elsewhere.

Signed-off-by: NeilBrown <neilb@suse.de>
2009-06-16 16:50:35 +10:00
Andre Noll fb5ab4b5d6 md: raid0: Fix a memory leak when stopping a raid0 array.
raid0_stop() removes all references to the raid0 configuration but
misses to free the ->devlist buffer.

This patch closes this leak, removes a pointless initialization and
fixes a coding style issue in raid0_stop().

Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: NeilBrown <neilb@suse.de>
2009-06-16 16:48:19 +10:00
Andre Noll ed7b00380d md: raid0: Allocate all buffers for the raid0 configuration in one function.
Currently the raid0 configuration is allocated in raid0_run() while
the buffers for the strip_zone and the dev_list arrays are allocated
in create_strip_zones(). On errors, all three buffers are freed
in raid0_run().

It's easier and more readable to do the allocation and cleanup within
a single function. So move that code into create_strip_zones().

Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: NeilBrown <neilb@suse.de>
2009-06-16 16:47:36 +10:00
Andre Noll 5568a6035d md: raid0: Make raid0_run() return a proper error code.
Currently raid0_run() always returns -ENOMEM on errors. This is
incorrect as running the array might fail for other reasons, for
example because not all component devices were available.

This patch changes create_strip_zones() so that it returns a proper
error code (either -ENOMEM or -EINVAL) rather than 1 on errors and
makes raid0_run(), its single caller, return that value instead
of -ENOMEM.

Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: NeilBrown <neilb@suse.de>
2009-06-16 16:47:21 +10:00
Andre Noll 8f79cfcdb6 md: raid0: Remove hash spacing and sector shift.
The "sector_shift" and "spacing" fields of struct raid0_private_data
were only used for the hash table lookups. So the removal of the
hash table allows get rid of these fields as well which simplifies
create_strip_zones() and raid0_run() quite a bit.

Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: NeilBrown <neilb@suse.de>
2009-06-16 16:47:10 +10:00
Andre Noll 09770e0b6e md: raid0: Remove hash table.
The raid0 hash table has become unused due to the changes in the
previous patch. This patch removes the hash table allocation and
setup code and kills the hash_table field of struct raid0_private_data.

Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: NeilBrown <neilb@suse.de>
2009-06-16 16:46:48 +10:00
NeilBrown d27a43abd7 md/raid0: two cleanups in create_stripe_zones.
1/ remove current_start.  The same value is available in
     zone->dev_start and storing it separately doesn't gain anything.
2/ rename curr_zone_start to curr_zone_end as we are now more
     focused on the 'end' of each zone.  We end up storing the
     same number though - the old name was a little confusing
     (and what does 'current' mean in this context anyway).

Signed-off-by: NeilBrown <neilb@suse.de>
2009-06-16 16:46:46 +10:00
Andre Noll dc58266385 md: raid0: Replace hash table lookup by looping over all strip_zones.
The number of strip_zones of a raid0 array is bounded by the number of
drives in the array and is in fact much smaller for typical setups. For
example, any raid0 array containing identical disks will have only
a single strip_zone.

Therefore, the hash tables which are used for quickly finding the
strip_zone that holds a particular sector are of questionable value
and add quite a bit of unnecessary complexity.

This patch replaces the hash table lookup by equivalent code which
simply loops over all strip zones to find the zone that holds the
given sector.

In order to make this loop as fast as possible, the zone->start field
of struct strip_zone has been renamed to zone_end, and it now stores
the beginning of the next zone in sectors. This allows to save one
addition in the loop.

Subsequent cleanup patches will remove the hash table structure.

Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: NeilBrown <neilb@suse.de>
2009-06-16 16:18:43 +10:00
Kay Sievers d405640539 Driver Core: misc: add nodename support for misc devices.
This adds support for misc devices to report their requested nodename to
userspace.  It also updates a number of misc drivers to provide the
needed subdirectory and device name to be used for them.

Signed-off-by: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Jan Blunck <jblunck@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-06-15 21:30:25 -07:00
Linus Torvalds c9059598ea Merge branch 'for-2.6.31' of git://git.kernel.dk/linux-2.6-block
* 'for-2.6.31' of git://git.kernel.dk/linux-2.6-block: (153 commits)
  block: add request clone interface (v2)
  floppy: fix hibernation
  ramdisk: remove long-deprecated "ramdisk=" boot-time parameter
  fs/bio.c: add missing __user annotation
  block: prevent possible io_context->refcount overflow
  Add serial number support for virtio_blk, V4a
  block: Add missing bounce_pfn stacking and fix comments
  Revert "block: Fix bounce limit setting in DM"
  cciss: decode unit attention in SCSI error handling code
  cciss: Remove no longer needed sendcmd reject processing code
  cciss: change SCSI error handling routines to work with interrupts enabled.
  cciss: separate error processing and command retrying code in sendcmd_withirq_core()
  cciss: factor out fix target status processing code from sendcmd functions
  cciss: simplify interface of sendcmd() and sendcmd_withirq()
  cciss: factor out core of sendcmd_withirq() for use by SCSI error handling code
  cciss: Use schedule_timeout_uninterruptible in SCSI error handling code
  block: needs to set the residual length of a bidi request
  Revert "block: implement blkdev_readpages"
  block: Fix bounce limit setting in DM
  Removed reference to non-existing file Documentation/PCI/PCI-DMA-mapping.txt
  ...

Manually fix conflicts with tracing updates in:
	block/blk-sysfs.c
	drivers/ide/ide-atapi.c
	drivers/ide/ide-cd.c
	drivers/ide/ide-floppy.c
	drivers/ide/ide-tape.c
	include/trace/events/block.h
	kernel/trace/blktrace.c
2009-06-11 11:10:35 -07:00
Linus Torvalds 8623661180 Merge branch 'tracing-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'tracing-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (244 commits)
  Revert "x86, bts: reenable ptrace branch trace support"
  tracing: do not translate event helper macros in print format
  ftrace/documentation: fix typo in function grapher name
  tracing/events: convert block trace points to TRACE_EVENT(), fix !CONFIG_BLOCK
  tracing: add protection around module events unload
  tracing: add trace_seq_vprint interface
  tracing: fix the block trace points print size
  tracing/events: convert block trace points to TRACE_EVENT()
  ring-buffer: fix ret in rb_add_time_stamp
  ring-buffer: pass in lockdep class key for reader_lock
  tracing: add annotation to what type of stack trace is recorded
  tracing: fix multiple use of __print_flags and __print_symbolic
  tracing/events: fix output format of user stack
  tracing/events: fix output format of kernel stack
  tracing/trace_stack: fix the number of entries in the header
  ring-buffer: discard timestamps that are at the start of the buffer
  ring-buffer: try to discard unneeded timestamps
  ring-buffer: fix bug in ring_buffer_discard_commit
  ftrace: do not profile functions when disabled
  tracing: make trace pipe recognize latency format flag
  ...
2009-06-10 19:53:40 -07:00
Li Zefan 55782138e4 tracing/events: convert block trace points to TRACE_EVENT()
TRACE_EVENT is a more generic way to define tracepoints. Doing so adds
these new capabilities to this tracepoint:

  - zero-copy and per-cpu splice() tracing
  - binary tracing without printf overhead
  - structured logging records exposed under /debug/tracing/events
  - trace events embedded in function tracer output and other plugins
  - user-defined, per tracepoint filter expressions
  ...

Cons:

  - no dev_t info for the output of plug, unplug_timer and unplug_io events.
    no dev_t info for getrq and sleeprq events if bio == NULL.
    no dev_t info for rq_abort,...,rq_requeue events if rq->rq_disk == NULL.

    This is mainly because we can't get the deivce from a request queue.
    But this may change in the future.

  - A packet command is converted to a string in TP_assign, not TP_print.
    While blktrace do the convertion just before output.

    Since pc requests should be rather rare, this is not a big issue.

  - In blktrace, an event can have 2 different print formats, but a TRACE_EVENT
    has a unique format, which means we have some unused data in a trace entry.

    The overhead is minimized by using __dynamic_array() instead of __array().

I've benchmarked the ioctl blktrace vs the splice based TRACE_EVENT tracing:

      dd                   dd + ioctl blktrace       dd + TRACE_EVENT (splice)
1     7.36s, 42.7 MB/s     7.50s, 42.0 MB/s          7.41s, 42.5 MB/s
2     7.43s, 42.3 MB/s     7.48s, 42.1 MB/s          7.43s, 42.4 MB/s
3     7.38s, 42.6 MB/s     7.45s, 42.2 MB/s          7.41s, 42.5 MB/s

So the overhead of tracing is very small, and no regression when using
those trace events vs blktrace.

And the binary output of TRACE_EVENT is much smaller than blktrace:

 # ls -l -h
 -rw-r--r-- 1 root root 8.8M 06-09 13:24 sda.blktrace.0
 -rw-r--r-- 1 root root 195K 06-09 13:24 sda.blktrace.1
 -rw-r--r-- 1 root root 2.7M 06-09 13:25 trace_splice.out

Following are some comparisons between TRACE_EVENT and blktrace:

plug:
  kjournald-480   [000]   303.084981: block_plug: [kjournald]
  kjournald-480   [000]   303.084981:   8,0    P   N [kjournald]

unplug_io:
  kblockd/0-118   [000]   300.052973: block_unplug_io: [kblockd/0] 1
  kblockd/0-118   [000]   300.052974:   8,0    U   N [kblockd/0] 1

remap:
  kjournald-480   [000]   303.085042: block_remap: 8,0 W 102736992 + 8 <- (8,8) 33384
  kjournald-480   [000]   303.085043:   8,0    A   W 102736992 + 8 <- (8,8) 33384

bio_backmerge:
  kjournald-480   [000]   303.085086: block_bio_backmerge: 8,0 W 102737032 + 8 [kjournald]
  kjournald-480   [000]   303.085086:   8,0    M   W 102737032 + 8 [kjournald]

getrq:
  kjournald-480   [000]   303.084974: block_getrq: 8,0 W 102736984 + 8 [kjournald]
  kjournald-480   [000]   303.084975:   8,0    G   W 102736984 + 8 [kjournald]

  bash-2066  [001]  1072.953770:   8,0    G   N [bash]
  bash-2066  [001]  1072.953773: block_getrq: 0,0 N 0 + 0 [bash]

rq_complete:
  konsole-2065  [001]   300.053184: block_rq_complete: 8,0 W () 103669040 + 16 [0]
  konsole-2065  [001]   300.053191:   8,0    C   W 103669040 + 16 [0]

  ksoftirqd/1-7   [001]  1072.953811:   8,0    C   N (5a 00 08 00 00 00 00 00 24 00) [0]
  ksoftirqd/1-7   [001]  1072.953813: block_rq_complete: 0,0 N (5a 00 08 00 00 00 00 00 24 00) 0 + 0 [0]

rq_insert:
  kjournald-480   [000]   303.084985: block_rq_insert: 8,0 W 0 () 102736984 + 8 [kjournald]
  kjournald-480   [000]   303.084986:   8,0    I   W 102736984 + 8 [kjournald]

Changelog from v2 -> v3:

- use the newly introduced __dynamic_array().

Changelog from v1 -> v2:

- use __string() instead of __array() to minimize the memory required
  to store hex dump of rq->cmd().

- support large pc requests.

- add missing blk_fill_rwbs_rq() in block_rq_requeue TRACE_EVENT.

- some cleanups.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
LKML-Reference: <4A2DF669.5070905@cn.fujitsu.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2009-06-09 12:34:23 -04:00
NeilBrown 0e6e0271a2 md/raid5: fix bug in reshape code when chunk_size decreases.
Now that we support changing the chunksize, we calculate
"reshape_sectors" to be the max of number of sectors in old
and new chunk size.
However there is one please where we still use 'chunksize'
rather than 'reshape_sectors'.
This causes a reshape that reduces the size of chunks to freeze.

Signed-off-by: NeilBrown <neilb@suse.de>
2009-06-09 16:32:22 +10:00
NeilBrown a8c906ca3f md/raid5 - avoid deadlocks in get_active_stripe during reshape
md has functionality to 'quiesce' and array so that all pending
IO completed and no new IO starts.  This is used to achieve a
stable state before making internal changes.

Currently this quiescing applies equally to normal IO, resync
IO, and reshape IO.
However there is a problem with applying it to reshape IO.
Reshape can have multiple 'stripe_heads' that must be active together.
If the quiesce come between allocating the first and the last of
such a collection, then we deadlock, as the last will not be allocated
until the quiesce is lifted, the quiesce will not be lifted until the
first (which has been allocated) gets used, and that first cannot be
used until the last is allocated.

It is not necessary to inhibit reshape IO when a quiesce is
requested.  Those places in the code that require a full quiesce will
ensure the reshape thread is not running at all.

So allow reshape requests to get access to new stripe_heads without
being blocked by a 'quiesce'.

This only affects in-place reshapes (i.e. where the array does not
grow or shrink) and these are only newly supported.  So this patch is
not needed in earlier kernels.

Signed-off-by: NeilBrown <neilb@suse.de>
2009-06-09 14:39:59 +10:00
NeilBrown f001a70cdc md/raid5: use conf->raid_disks in preference to mddev->raid_disk
mddev->raid_disks can be changed and any time by a request from
user-space.  It is a suggestion as to what number of raid_disks is
desired.

conf->raid_disks can only be changed by the raid5 module with suitable
locks in place.  It is a statement as to the current number of
raid_disks.

There are two places where the latter should be used, but the former
is used.  This can lead to a crash when reshaping an array.

This patch changes to mddev-> to conf->

Signed-off-by: NeilBrown <neilb@suse.de>
2009-06-09 14:30:31 +10:00
Jens Axboe 9df1bb9b51 Revert "block: Fix bounce limit setting in DM"
This reverts commit a05c0205ba.

DM doesn't need to access the bounce_pfn directly.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-06-09 06:22:57 +02:00
Dan Williams a08abd8ca8 async_tx: structify submission arguments, add scribble
Prepare the api for the arrival of a new parameter, 'scribble'.  This
will allow callers to identify scratchpad memory for dma address or page
address conversions.  As this adds yet another parameter, take this
opportunity to convert the common submission parameters (flags,
dependency, callback, and callback argument) into an object that is
passed by reference.

Also, take this opportunity to fix up the kerneldoc and add notes about
the relevant ASYNC_TX_* flags for each routine.

[ Impact: moves api pass-by-value parameters to a pass-by-reference struct ]

Signed-off-by: Andre Noll <maan@systemlinux.org>
Acked-by: Maciej Sosnowski <maciej.sosnowski@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2009-06-03 14:07:35 -07:00
Dan Williams 88ba2aa586 async_tx: kill ASYNC_TX_DEP_ACK flag
In support of inter-channel chaining async_tx utilizes an ack flag to
gate whether a dependent operation can be chained to another.  While the
flag is not set the chain can be considered open for appending.  Setting
the ack flag closes the chain and flags the descriptor for garbage
collection.  The ASYNC_TX_DEP_ACK flag essentially means "close the
chain after adding this dependency".  Since each operation can only have
one child the api now implicitly sets the ack flag at dependency
submission time.  This removes an unnecessary management burden from
clients of the api.

[ Impact: clean up and enforce one dependency per operation ]

Reviewed-by: Andre Noll <maan@systemlinux.org>
Acked-by: Maciej Sosnowski <maciej.sosnowski@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2009-06-03 14:07:34 -07:00
Martin K. Petersen a05c0205ba block: Fix bounce limit setting in DM
blk_queue_bounce_limit() is more than a wrapper about the request queue
limits.bounce_pfn variable.  Introduce blk_queue_bounce_pfn() which can
be called by stacking drivers that wish to set the bounce limit
explicitly.

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-06-03 09:33:18 +02:00
NeilBrown ed37d83e6a md: raid5: change incorrect usage of 'min' macro to 'min_t'
A recent patch to raid5.c use min on an int and a sector_t.
This isn't allowed.
So change it to min_t(sector_t,x,y).

Signed-off-by: NeilBrown <neilb@suse.de>
2009-05-27 21:39:05 +10:00
NeilBrown b492b852cd md: don't use locked_ioctl.
md has no need for the BKL - it does its own locking.
So md_ioctl doesn't need to be a locked_ioctl.

Signed-off-by: NeilBrown <neilb@suse.de>
2009-05-26 12:57:36 +10:00
NeilBrown 7a91ee1f62 md: don't update curr_resync_completed without also updating reshape_position.
In order for the metadata to always be consistent, we mustn't updated
curr_resync_completed without also updating reshape_position.

The reshape code updates both at the same time.  However since
commit 97e4f42d62
the common md_do_sync will sometimes update curr_resync_completed
but is not in a position to update reshape_position.
So if MD_RECOVERY_RESHAPE is set (indicating that a reshape is
happening, so reshape_position might change), don't update
curr_resync_completed in md_do_sync, leave it to the per-personality
reshape code.

Signed-off-by: NeilBrown <neilb@suse.de>
2009-05-26 12:57:21 +10:00
NeilBrown 848b318236 md: raid5: avoid sector values going negative when testing reshape progress.
As sector_t in unsigned, we cannot afford to let 'safepos' etc go
negative.
So replace
   a -= b;
by
   a -= min(b,a);

Signed-off-by: NeilBrown <neilb@suse.de>
2009-05-26 12:41:08 +10:00
NeilBrown b6a9ce688f md: export 'frozen' resync state through sysfs
The md resync engine has a 'frozen' state which ensures that
no resync/recovery.  This is used to avoid races.

Export this state through the 'sync_action' sysfs attribute
so that user-space can benefit and also avoid some races.

Signed-off-by: NeilBrown <neilb@suse.de>
2009-05-26 09:41:17 +10:00
NeilBrown be51269103 md: bitmap: improve bitmap maintenance code.
The code for checking which bits in the bitmap can be cleared
has 2 problems:
 1/ it repeatedly takes and drops a spinlock, where it would make
    more sense to just hold on to it most of the time.
 2/ it doesn't make use of some opportunities to skip large sections
    of the bitmap

This patch fixes those.  It will only affect CPU consumption, not
correctness.

Signed-off-by: NeilBrown <neilb@suse.de>
2009-05-26 09:41:17 +10:00
NeilBrown 2b69c83924 md: improve errno return when setting array_size
Instead of always returns EINVAL if anything goes wrong
when setting the array size, add the option of
  E2BIG
if the size requested is too large.  This makes it easier
for user-space to be sure what went wrong.

Signed-off-by: NeilBrown <neilb@suse.de>
2009-05-26 09:41:17 +10:00
NeilBrown 62e1e389f8 md: always update level / chunk_size / layout when writing v1.x metadata.
We previously didn't update these fields when writing the metadata
because they could never change.  They can now, so we better write
them.
v0.90 metadata always updated these fields.

Signed-off-by: NeilBrown <neilb@suse.de>
2009-05-26 09:40:59 +10:00
Martin K. Petersen ae03bf639a block: Use accessor functions for queue limits
Convert all external users of queue limits to using wrapper functions
instead of poking the request queue variables directly.

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-05-22 23:22:54 +02:00
Martin K. Petersen e1defc4ff0 block: Do away with the notion of hardsect_size
Until now we have had a 1:1 mapping between storage device physical
block size and the logical block sized used when addressing the device.
With SATA 4KB drives coming out that will no longer be the case.  The
sector size will be 4KB but the logical block size will remain
512-bytes.  Hence we need to distinguish between the physical block size
and the logical ditto.

This patch renames hardsect_size to logical_block_size.

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-05-22 23:22:54 +02:00
Ingo Molnar 1079cac0f4 Merge commit 'v2.6.30-rc6' into tracing/core
Merge reason: we were on an -rc4 base, sync up to -rc6

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-05-18 10:15:35 +02:00
Ingo Molnar 44347d947f Merge branch 'linus' into tracing/core
Merge reason: tracing/core was on a .30-rc1 base and was missing out on
              on a handful of tracing fixes present in .30-rc5-almost.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-05-07 11:17:34 +02:00
NeilBrown c4647292fd md: remove rd%d links immediately after stopping an array.
md maintains link in sys/mdXX/md/ to identify which device has
which role in the array. e.g.
   rd2 -> dev-sda

indicates that the device with role '2' in the array is sda.

These links are only present when the array is active.  They are
created immediately after ->run is called, and so should be removed
immediately after ->stop is called.
However they are currently removed a little bit later, and it is
possible for ->run to be called again, thus adding these links, before
they are removed.

So move the removal earlier so they are consistently only present when
the array is active.

Signed-off-by: NeilBrown <neilb@suse.de>
2009-05-07 12:51:06 +10:00
NeilBrown 5bf2959754 md: remove ability to explicit set an inactive array to 'clean'.
Being able to write 'clean' to an 'array_state' of an inactive array
to activate it in 'clean' mode is both unnecessary and inconvenient.

It is unnecessary because the same can be achieved by writing
'active'.  This activates and array, but it still remains 'clean'
until the first write.

It is inconvenient because writing 'clean' is more often used to
cause an 'active' array to revert to 'clean' mode (thus blocking
any writes until a 'write-pending' is promoted to 'active').

Allowing 'clean' to both activate an array and mark an active array as
clean can lead to races:  One program writes 'clean' to mark the
active array as clean at the same time as another program writes
'inactive' to deactivate (stop) and active array.  Depending on which
writes first, the array could be deactivated and immediately
reactivated which isn't what was desired.

So just disable the use of 'clean' to activate an array.

This avoids a race that can be triggered with mdadm-3.0 and external
metadata, so it suitable for -stable.

Reported-by: Rafal Marszewski <rafal.marszewski@intel.com>
Acked-by: Dan Williams <dan.j.williams@intel.com>
Cc: <stable@kernel.org>
Signed-off-by: NeilBrown <neilb@suse.de>
2009-05-07 12:50:57 +10:00
Jan Engelhardt 110518bccf md: constify VFTs
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Signed-off-by: NeilBrown <neilb@suse.de>
2009-05-07 12:49:37 +10:00
NeilBrown dd71cf6b27 md: tidy up status_resync to handle large arrays.
Two problems in status_resync.
1/ It still used Kilobytes as the basic block unit, while most code
   now uses sectors uniformly.
2/ It doesn't allow for the possibility that max_sectors exceeds
   the range of "unsigned long".

So
 - change "max_blocks" to "max_sectors", and store sector numbers
   in there and in 'resync'
 - Make 'rt' a 'sector_t' so it can temporarily hold the number of
   remaining sectors.
 - use sector_div rather than normal division.
 - change the magic '100' used to preserve precision to '32'.
   + making it a power of 2 makes division easier
   + it doesn't need to be as large as it was chosen when we averaged
     speed over the entire run.  Now we average speed over the last 30
     seconds or so.

Reported-by: "Mario 'BitKoenig' Holbe" <Mario.Holbe@TU-Ilmenau.DE>
Signed-off-by: NeilBrown <neilb@suse.de>
2009-05-07 12:49:35 +10:00
NeilBrown db305e507d md: fix some (more) errors with bitmaps on devices larger than 2TB.
If a write intent bitmap covers more than 2TB, we sometimes work with
values beyond 32bit, so these need to be sector_t.  This patches
add the required casts to some unsigned longs that are being shifted
up.

This will affect any raid10 larger than 2TB, or any raid1/4/5/6 with
member devices that are larger than 2TB.

Signed-off-by: NeilBrown <neilb@suse.de>
Reported-by: "Mario 'BitKoenig' Holbe" <Mario.Holbe@TU-Ilmenau.DE>
Cc: stable@kernel.org
2009-05-07 12:49:06 +10:00
NeilBrown 1805556912 md/raid10: don't clear bitmap during recovery if array will still be degraded.
If we have a raid10 with multiple missing devices, and we recover just
one of these to a spare, then we risk (depending on the bitmap and
array chunk size) clearing bits of the bitmap for which recovery isn't
complete (because a device is still missing).

This can lead to a subsequent "re-add" being recovered without
any IO happening, which would result in loss of data.

This patch takes the safe approach of not clearing bitmap bits
if the array will still be degraded.

This patch is suitable for all active -stable kernels.

Cc: stable@kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
2009-05-07 12:48:10 +10:00
NeilBrown b74fd2826c md: fix loading of out-of-date bitmap.
When md is loading a bitmap which it knows is out of date, it fills
each page with 1s and writes it back out again.  However the
write_page call makes used of bitmap->file_pages and
bitmap->last_page_size which haven't been set correctly yet.  So this
can sometimes fail.

Move the setting of file_pages and last_page_size to before the call
to write_page.

This bug can cause the assembly on an array to fail, thus making the
data inaccessible.  Hence I think it is a suitable candidate for
-stable.

Cc: stable@kernel.org
Reported-by: Vojtech Pavlik <vojtech@suse.cz>
Signed-off-by: NeilBrown <neilb@suse.de>
2009-05-07 12:47:19 +10:00
Alan D. Brunelle 22a7c31a96 blktrace: from-sector redundant in trace_block_remap
Remove redundant from-sector parameter: it's /always/ the bio's sector
passed in.

[ Impact: cleanup ]

Signed-off-by: Alan D. Brunelle <alan.brunelle@hp.com>
Reviewed-by: Li Zefan <lizf@cn.fujitsu.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Jens Axboe <jens.axboe@oracle.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
LKML-Reference: <49FF517C.7000503@hp.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-05-06 14:13:01 +02:00
Linus Torvalds 2edbdd1266 Merge branch 'for-linus' of git://neil.brown.name/md
* 'for-linus' of git://neil.brown.name/md:
  md: support bitmaps on RAID10 arrays larger then 2 terabytes
  md: update sync_completed and reshape_position even more often.
  md: improve usefulness and accuracy of sysfs file md/sync_completed.
  md: allow setting newly added device to 'in_sync' via sysfs.
  md: tiny md.h cleanups
2009-04-20 08:37:37 -07:00
NeilBrown 1f59390339 md: support bitmaps on RAID10 arrays larger then 2 terabytes
.. and other arrays with components larger than 2 terabytes.

We use a "long" rather than a "sector_t" in part of the bitmap
size calculations, which is sad.

Reported-by: "Mario 'BitKoenig' Holbe" <Mario.Holbe@TU-Ilmenau.DE>
Signed-off-by: NeilBrown <neilb@suse.de>
2009-04-20 11:50:24 +10:00
NeilBrown c03f6a1969 md: update sync_completed and reshape_position even more often.
There are circumstances when a user-space process might need to
"oversee" a resync/reshape process.  For example when doing an
in-place reshape of a raid5, it is prudent to take a backup of each
section before reshaping it as this is the only way to provide
safety against an unplanned shutdown (i.e. crash/power failure).

The sync_max sysfs value can be used to stop the resync from
advancing beyond a particular point.
So user-space can:
  suspend IO to the first section and back it up
  set 'sync_max' to the end of the section
  wait for 'sync_completed' to reach that point
  resume IO on the first section and move on to the next section.

However this process requires the kernel and user-space to run in
lock-step which could introduce unnecessary delays.

It would be better if a 'double buffered' approach could be used with
userspace and kernel space working on different sections with the
'next' section always ready when the 'current' section is finished.

One problem with implementing this is that sync_completed is only
guaranteed to be updated when the sync process reaches sync_max.
(it is updated on a time basis at other times, but it is hard to rely
on that).  This defeats some of the double buffering.

With this patch, sync_completed (and reshape_position) get updated as
the current position approaches sync_max, so there is room for
userspace to advance sync_max early without losing updates.

To be precise, sync_completed is updated when the current sync
position reaches half way between the current value of sync_completed
and the value of sync_max.  This will usually be a good time for user
space to update sync_max.

If sync_max does not get updated, the updates to sync_completed
(together with associated metadata updates) will occur at an
exponentially increasing frequency which will get unreasonably fast
(one update every page) immediately before the process hits sync_max
and stops.  So the update rate will be unreasonably fast only for an
insignificant period of time.

Signed-off-by: NeilBrown <neilb@suse.de>
2009-04-17 11:06:30 +10:00
Christoph Hellwig 8f3d8ba20e block: move bio list helpers into bio.h
It's used by DM and MD and generally useful, so move the bio list
helpers into bio.h.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-04-15 08:28:09 +02:00
NeilBrown acb180b0e3 md: improve usefulness and accuracy of sysfs file md/sync_completed.
The sync_completed file reports how much of a resync (or recovery or
reshape) has been completed.
However due to the possibility of out-of-order completion of writes,
it is not certain to be accurate.

We have an internal value - mddev->curr_resync_completed - which is an
accurate value (though it might not always be quite so uptodate).

So:
 - make curr_resync_completed be uptodate a little more often,
   particularly when raid5 reshape updates status in the metadata
 - report curr_resync_completed in the sysfs file
 - allow poll/select to report all updates to md/sync_completed.

This makes sync_completed completed usable by any external metadata
handler that wants to record this status information in its metadata.

Signed-off-by: NeilBrown <neilb@suse.de>
2009-04-14 16:28:34 +10:00
NeilBrown 6d56e27844 md: allow setting newly added device to 'in_sync' via sysfs.
When adding devices to an active array via sysfs, there is currently
no way to mark a device as 'in-sync' which is useful when
incrementally assembling an array.

So add that option.

Signed-off-by: NeilBrown <neilb@suse.de>
2009-04-14 12:01:57 +10:00
Christoph Hellwig 63fe08177f md: tiny md.h cleanups
- update inclusion guard and make sure it covers the whole file
 - remove superflous #ifdef CONFIG_BLOCK
 - make sure all required headers are included so that new users aren't
   required to include others before

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: NeilBrown <neilb@suse.de>
2009-04-14 12:01:53 +10:00
Mikulas Patocka 340cd44451 dm kcopyd: fix callback race
If the thread calling dm_kcopyd_copy is delayed due to scheduling inside
split_job/segment_complete and the subjobs complete before the loop in
split_job completes, the kcopyd callback could be invoked from the
thread that called dm_kcopyd_copy instead of the kcopyd workqueue.

dm_kcopyd_copy -> split_job -> segment_complete -> job->fn()

Snapshots depend on the fact that callbacks are called from the singlethreaded
kcopyd workqueue and expect that there is no racing between individual
callbacks. The racing between callbacks can lead to corruption of exception
store and it can also mean that exception store callbacks are called twice
for the same exception - a likely reason for crashes reported inside
pending_complete() / remove_exception().

This patch fixes two problems:

1. job->fn being called from the thread that submitted the job (see above).

- Fix: hand over the completion callback to the kcopyd thread.

2. job->fn(read_err, write_err, job->context); in segment_complete
reports the error of the last subjob, not the union of all errors.

- Fix: pass job->write_err to the callback to report all error bits
  (it is done already in run_complete_job)

Cc: stable@kernel.org
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-04-09 00:27:17 +01:00
Mikulas Patocka 73830857bc dm kcopyd: prepare for callback race fix
Use a variable in segment_complete() to point to the dm_kcopyd_client
struct and only release job->pages in run_complete_job() if any are
defined.  These changes are needed by the next patch.

Cc: stable@kernel.org
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-04-09 00:27:16 +01:00
Mikulas Patocka af7e466a1a dm: implement basic barrier support
Barriers are submitted to a worker thread that issues them in-order.

The thread is modified so that when it sees a barrier request it waits
for all pending IO before the request then submits the barrier and
waits for it.  (We must wait, otherwise it could be intermixed with
following requests.)

Errors from the barrier request are recorded in a per-device barrier_error
variable. There may be only one barrier request in progress at once.

For now, the barrier request is converted to a non-barrier request when
sending it to the underlying device.

This patch guarantees correct barrier behavior if the underlying device
doesn't perform write-back caching. The same requirement existed before
barriers were supported in dm.

Bottom layer barrier support (sending barriers by target drivers) and
handling devices with write-back caches will be done in further patches.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-04-09 00:27:16 +01:00
Mikulas Patocka 92c639021c dm: remove dm_request loop
Remove queue_io return value and a loop in dm_request.

IO may be submitted to a worker thread with queue_io().  queue_io() sets
DMF_QUEUE_IO_TO_THREAD so that all further IO is queued for the thread. When
the thread finishes its work, it clears DMF_QUEUE_IO_TO_THREAD and from this
point on, requests are submitted from dm_request again. This will be used
for processing barriers.

Remove the loop in dm_request. queue_io() can submit I/Os to the worker thread
even if DMF_QUEUE_IO_TO_THREAD was not set.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-04-09 00:27:15 +01:00
Mikulas Patocka 3b00b2036f dm: rework queueing and suspension
Rework shutting down on suspend and document the associated rules.

Drop write lock in __split_and_process_bio to allow more processing
concurrency.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-04-09 00:27:15 +01:00
Alasdair G Kergon 54d9a1b451 dm: simplify dm_request loop
Refactor the code in dm_request().

Require the new DMF_BLOCK_FOR_SUSPEND flag on readahead bios we will
discard so we don't drop such bios while processing a barrier.

Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-04-09 00:27:14 +01:00
Alasdair G Kergon 1eb787ec18 dm: split DMF_BLOCK_IO flag into two
Split the DMF_BLOCK_IO flag into two.

DMF_BLOCK_IO_FOR_SUSPEND is set when I/O must be blocked while suspending a
device.  DMF_QUEUE_IO_TO_THREAD is set when I/O must be queued to a
worker thread for later processing.

Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-04-09 00:27:14 +01:00
Alasdair G Kergon df12ee9963 dm: rearrange dm_wq_work
Refactor dm_wq_work() to make later patch more readable.

Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-04-09 00:27:13 +01:00
Mikulas Patocka 692d0eb9e0 dm: remove limited barrier support
Prepare for full barrier implementation: first remove the restricted support.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-04-09 00:27:13 +01:00
Martin K. Petersen 9c47008d13 dm: add integrity support
This patch provides support for data integrity passthrough in the device
mapper.

 - If one or more component devices support integrity an integrity
   profile is preallocated for the DM device.

 - If all component devices have compatible profiles the DM device is
   flagged as capable.

 - Handle integrity metadata when splitting and cloning bios.

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-04-09 00:27:12 +01:00
Dan Williams 099f53cb50 async_tx: rename zero_sum to val
'zero_sum' does not properly describe the operation of generating parity
and checking that it validates against an existing buffer.  Change the
name of the operation to 'val' (for 'validate').  This is in
anticipation of the p+q case where it is a requirement to identify the
target parity buffers separately from the source buffers, because the
target parity buffers will not have corresponding pq coefficients.

Reviewed-by: Andre Noll <maan@systemlinux.org>
Acked-by: Maciej Sosnowski <maciej.sosnowski@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2009-04-08 14:28:37 -07:00
Dan Williams fd74ea6588 Merge branch 'dmaengine' into async-tx-raid6 2009-04-08 14:28:13 -07:00
Alexander Beregalov 91a9e99d76 md/raid1: fix build breakage
Fix this build error:

  drivers/md/raid1.c: In function 'raid1_congested':
  drivers/md/raid1.c:589: error: 'BDI_write_congested' undeclared

BDI_write_congested was changed in commit 1faa16d228 ("block: change the
request allocation/congestion logic to be sync/async based")

Signed-off-by: Alexander Beregalov <a.beregalov@gmail.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-06 14:40:07 -07:00
NeilBrown 303a0e11d0 md/raid1 - don't assume newly allocated bvecs are initialised.
Since commit d3f761104b
newly allocated bvecs aren't initialised to NULL, so we have
to be more careful about freeing a bio which only managed
to get a few pages allocated to it.  Otherwise the resync
process crashes.

This patch is appropriate for 2.6.29-stable.

Cc: stable@kernel.org
Cc: "Jens Axboe" <jens.axboe@oracle.com>
Reported-by: Gabriele Tozzi <gabriele@tozzi.eu>
Signed-off-by: NeilBrown <neilb@suse.de>
2009-04-06 14:40:38 +10:00
Linus Torvalds d9b9be024a Merge git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-2.6-dm
* git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-2.6-dm: (36 commits)
  dm: set queue ordered mode
  dm: move wait queue declaration
  dm: merge pushback and deferred bio lists
  dm: allow uninterruptible wait for pending io
  dm: merge __flush_deferred_io into caller
  dm: move bio_io_error into __split_and_process_bio
  dm: rename __split_bio
  dm: remove unnecessary struct dm_wq_req
  dm: remove unnecessary work queue context field
  dm: remove unnecessary work queue type field
  dm: bio list add bio_list_add_head
  dm snapshot: persistent fix dtr cleanup
  dm snapshot: move status to exception store
  dm snapshot: move ctr parsing to exception store
  dm snapshot: use DMEMIT macro for status
  dm snapshot: remove dm_snap header
  dm snapshot: remove dm_snap header use
  dm exception store: move cow pointer
  dm exception store: move chunk_fields
  dm exception store: move dm_target pointer
  ...
2009-04-03 10:02:45 -07:00
Linus Torvalds 223cdea4c4 Merge branch 'for-linus' of git://neil.brown.name/md
* 'for-linus' of git://neil.brown.name/md: (53 commits)
  md/raid5 revise rules for when to update metadata during reshape
  md/raid5: minor code cleanups in make_request.
  md: remove CONFIG_MD_RAID_RESHAPE config option.
  md/raid5: be more careful about write ordering when reshaping.
  md: don't display meaningless values in sysfs files resync_start and sync_speed
  md/raid5: allow layout and chunksize to be changed on active array.
  md/raid5: reshape using largest of old and new chunk size
  md/raid5: prepare for allowing reshape to change layout
  md/raid5: prepare for allowing reshape to change chunksize.
  md/raid5: clearly differentiate 'before' and 'after' stripes during reshape.
  Documentation/md.txt update
  md: allow number of drives in raid5 to be reduced
  md/raid5: change reshape-progress measurement to cope with reshaping backwards.
  md: add explicit method to signal the end of a reshape.
  md/raid5: enhance raid5_size to work correctly with negative delta_disks
  md/raid5: drop qd_idx from r6_state
  md/raid6: move raid6 data processing to raid6_pq.ko
  md: raid5 run(): Fix max_degraded for raid level 4.
  md: 'array_size' sysfs attribute
  md: centralize ->array_sectors modifications
  ...
2009-04-03 09:08:19 -07:00
Mikulas Patocka 99360b4c18 dm: set queue ordered mode
Set queue ordered mode.  It doesn't really matter what we set here
because we don't ever put any requests on the queue.  But we need to set
something other than QUEUE_ORDERED_NONE so that __generic_make_request
passes barrier requests to us.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-04-02 19:55:39 +01:00
Mikulas Patocka b44ebeb017 dm: move wait queue declaration
Move wait queue declaration and unplug to dm_wait_for_completion.

The purpose is to minimize duplicate code in the further patches.

The patch reorders functions a little bit. It doesn't change any
functionality. For proper non-deadlock operation, add_wait_queue must
happen before set_current_state(interruptible) and before the test for
!atomic_read(&md->pending).

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-04-02 19:55:39 +01:00
Mikulas Patocka 022c261100 dm: merge pushback and deferred bio lists
Merge pushback and deferred lists into one list - use deferred list
for both deferred and pushed-back bios.

This will be needed for proper support of barrier bios: it is impossible to
support ordering correctly with two lists because the requests on both lists
will be mixed up.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-04-02 19:55:39 +01:00
Mikulas Patocka 401600dfd3 dm: allow uninterruptible wait for pending io
Allow uninterruptible wait for pending IOs.

Add argument "interruptible" to dm_wait_for_completion that specifies
either interruptible or uninterruptible waiting.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-04-02 19:55:38 +01:00
Mikulas Patocka ef2085870e dm: merge __flush_deferred_io into caller
Merge __flush_deferred_io() into the only caller, dm_wq_work().

There's no need to have a function that has only one caller.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-04-02 19:55:38 +01:00
Mikulas Patocka f0b9a4502b dm: move bio_io_error into __split_and_process_bio
Move the bio_io_error() calls directly into __split_and_process_bio().

This avoids some code duplication in later patches.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-04-02 19:55:38 +01:00
Mikulas Patocka 8a53c28db4 dm: rename __split_bio
Rename __split_bio() to __split_and_process_bio() because it not only splits
the bio to serveral parts, but also submits them to target drivers.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-04-02 19:55:37 +01:00
Mikulas Patocka 53d5914f28 dm: remove unnecessary struct dm_wq_req
Remove struct dm_wq_req and move "work" directly into struct mapped_device.

In the revised implementation, the thread will do just one type of work
(processing the queue).

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-04-02 19:55:37 +01:00
Mikulas Patocka 9a1fb46448 dm: remove unnecessary work queue context field
Remove the context field from struct dm_wq_req because we will no longer
need it.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-04-02 19:55:36 +01:00
Mikulas Patocka 143773965b dm: remove unnecessary work queue type field
Remove "type" field from struct dm_wq_req because we no longer need it
to have more than one value.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-04-02 19:55:36 +01:00
Mikulas Patocka 99c75e3130 dm: bio list add bio_list_add_head
Introduce a function that adds a bio to the head of the list for
use by the patch that will support barriers.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-04-02 19:55:36 +01:00
Jonathan Brassow a32079ce17 dm snapshot: persistent fix dtr cleanup
The persistent exception store destructor does not properly
account for all conditions in which it can be called.  If it
is called after 'ctr' but before 'read_metadata' (e.g. if
something else in 'snapshot_ctr' fails) then it will attempt
to free areas of memory that haven't been allocated yet.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-04-02 19:55:35 +01:00
Jonathan Brassow 1e302a929e dm snapshot: move status to exception store
Let the exception store types print out their status through
the new API, rather than having the snapshot code do it.

Adjust the buffer position to allow for the preceding DMEMIT in the
arguments to type->status().

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-04-02 19:55:35 +01:00
Jonathan Brassow fee1998e9c dm snapshot: move ctr parsing to exception store
First step of having the exception stores parse their own arguments -
generalizing the interface.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-04-02 19:55:34 +01:00
Jonathan Brassow 2e4a31df2b dm snapshot: use DMEMIT macro for status
Use DMEMIT in place of snprintf.  This makes it easier later when
other modules are helping to populate our status output.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-04-02 19:55:34 +01:00
Jonathan Brassow ccc45ea8ae dm snapshot: remove dm_snap header
Move some of the last bits from dm-snap.h into dm-snap.c where they
belong and remove dm-snap.h.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-04-02 19:55:34 +01:00
Jonathan Brassow 71fab00a6b dm snapshot: remove dm_snap header use
Move useful functions out of dm-snap.h and stop using dm-snap.h.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-04-02 19:55:33 +01:00
Jonathan Brassow 49beb2b87a dm exception store: move cow pointer
Move COW device from snapshot to exception store.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-04-02 19:55:33 +01:00
Jonathan Brassow d021684951 dm exception store: move chunk_fields
Move chunk fields from snapshot to exception store.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-04-02 19:55:32 +01:00
Jonathan Brassow 0cea9c7827 dm exception store: move dm_target pointer
Move target pointer from snapshot to exception store.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-04-02 19:55:32 +01:00
Jonathan Brassow 493df71c64 dm exception store: introduce registry
Move exception stores into a registry.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-04-02 19:55:31 +01:00
Jonathan Brassow 7513c2a761 dm raid1: add is_remote_recovering hook for clusters
The logging API needs an extra function to make cluster mirroring
possible.  This new function allows us to check whether a mirror
region is being recovered on another machine in the cluster.  This
helps us prevent simultaneous recovery I/O and process I/O to the
same locations on disk.

Cluster-aware log modules will implement this function.  Single
machine log modules will not.  So, there is no performance
penalty for single machine mirrors.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Acked-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-04-02 19:55:30 +01:00
Jonathan Brassow b2a1146529 dm exception store: separate type from instance
Introduce struct dm_exception_store_type.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-04-02 19:55:30 +01:00
Mike Snitzer ec44ab9d66 dm log: remove struct dm_dirty_log_internal
Remove the 'dm_dirty_log_internal' structure.  The resulting cleanup
eliminates extra memory allocations.  Therefore exposing the internal
list_head to the external 'dm_dirty_log_type' structure is a worthwhile
compromise.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-04-02 19:55:30 +01:00
Mike Snitzer 84e67c9319 dm log: use standard kernel module refcount
Avoid private module usage accounting by removing 'use' from
dm_dirty_log_internal.  The standard module reference counting is
sufficient.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-04-02 19:55:29 +01:00
Johannes Weiner b81d6cf79b dm crypt: use kzfree
Use kzfree() instead of memset() + kfree().

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-04-02 19:55:28 +01:00
Cheng Renquan 45194e4f89 dm target: remove struct tt_internal
The tt_internal is really just a list_head to manage registered target_type
in a double linked list,

Here embed the list_head into target_type directly,
1. to avoid kmalloc/kfree;
2. then tt_internal is really unneeded;

Cc: stable@kernel.org
Signed-off-by: Cheng Renquan <crquan@gmail.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Reviewed-by: Alasdair G Kergon <agk@redhat.com>
2009-04-02 19:55:28 +01:00
Alasdair G Kergon 570b9d968b dm table: fix upgrade mode race
upgrade_mode() sets bdev to NULL temporarily, and does not have any
locking to exclude anything from seeing that NULL.

In dm_table_any_congested() bdev_get_queue() can dereference that NULL and
cause a reported oops.

Fix this by not changing that field during the mode upgrade.

Cc: stable@kernel.org
Cc: Neil Brown <neilb@suse.de>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-04-02 19:55:28 +01:00
Jun'ichi Nomura aea9058801 dm: path selector use module refcount directly
Fix refcount corruption in dm-path-selector

Refcounting with non-atomic ops under shared lock will corrupt the counter
in multi-processor system and may trigger BUG_ON().
Use module refcount.
# same approach as dm-target-use-module-refcount-directly.patch here
# https://www.redhat.com/archives/dm-devel/2008-December/msg00075.html

Typical oops:
  kernel BUG at linux-2.6.29-rc3/drivers/md/dm-path-selector.c:90!
  Pid: 11148, comm: dmsetup Not tainted 2.6.29-rc3-nm #1
  dm_put_path_selector+0x4d/0x61 [dm_multipath]
  Call Trace:
   [<ffffffffa031d3f9>] free_priority_group+0x33/0xb3 [dm_multipath]
   [<ffffffffa031d4aa>] free_multipath+0x31/0x67 [dm_multipath]
   [<ffffffffa031d50d>] multipath_dtr+0x2d/0x32 [dm_multipath]
   [<ffffffffa015d6c2>] dm_table_destroy+0x64/0xd8 [dm_mod]
   [<ffffffffa015b73a>] __unbind+0x46/0x4b [dm_mod]
   [<ffffffffa015b79f>] dm_swap_table+0x60/0x14d [dm_mod]
   [<ffffffffa015f963>] dev_suspend+0xfd/0x177 [dm_mod]
   [<ffffffffa0160250>] dm_ctl_ioctl+0x24c/0x29c [dm_mod]
   [<ffffffff80288cd3>] ? get_page_from_freelist+0x49c/0x61d
   [<ffffffffa015f866>] ? dev_suspend+0x0/0x177 [dm_mod]
   [<ffffffff802bf05c>] vfs_ioctl+0x2a/0x77
   [<ffffffff802bf4f1>] do_vfs_ioctl+0x448/0x4a0
   [<ffffffff802bf5a0>] sys_ioctl+0x57/0x7a
   [<ffffffff8020c05b>] system_call_fastpath+0x16/0x1b

Cc: stable@kernel.org
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-04-02 19:55:27 +01:00
Cheng Renquan 5642b8a61a dm target: use module refcount directly
The tt_internal's 'use' field is superfluous: the module's refcount can do
the work properly.  An acceptable side-effect is that this increases the
reference counts reported by 'lsmod'.

Remove the superfluous test when removing a target module.

[Crash possible without this on SMP - agk]

Cc: stable@kernel.org
Signed-off-by: Cheng Renquan <crquan@gmail.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Reviewed-by: Alasdair G Kergon <agk@redhat.com>
Reviewed-by: Jonathan Brassow <jbrassow@redhat.com>
2009-04-02 19:55:27 +01:00
Mikulas Patocka 35bf659b00 dm snapshot: avoid having two exceptions for the same chunk
We need to check if the exception was completed after dropping the lock.

After regaining the lock, __find_pending_exception checks if the exception
was already placed into &s->pending hash.

But we don't check if the exception was already completed and placed into
&s->complete hash. If the process waiting in alloc_pending_exception was
delayed at this point because of a scheduling latency and the exception
was meanwhile completed, we'd miss that and allocate another pending
exception for already completed chunk.

It would lead to a situation where two records for the same chunk exist
and potential data corruption because multiple snapshot I/Os to the
affected chunk could be redirected to different locations in the
snapshot.

Cc: stable@kernel.org
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-04-02 19:55:26 +01:00
Mikulas Patocka c66213921c dm snapshot: avoid dropping lock in __find_pending_exception
It is uncommon and bug-prone to drop a lock in a function that is called with
the lock held, so this is moved to the caller.

Cc: stable@kernel.org
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-04-02 19:55:25 +01:00
Mikulas Patocka 2913808eb5 dm snapshot: refactor __find_pending_exception
Move looking-up of a pending exception from __find_pending_exception to another
function.

Cc: stable@kernel.org
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-04-02 19:55:25 +01:00
Mikulas Patocka b64b6bf4fd dm io: make sync_io uninterruptible
If someone sends signal to a process performing synchronous dm-io call,
the kernel may crash.

The function sync_io attempts to exit with -EINTR if it has pending signal,
however the structure "io" is allocated on stack, so already submitted io
requests end up touching unallocated stack space and corrupting kernel memory.

sync_io sets its state to TASK_UNINTERRUPTIBLE, so the signal can't break out
of io_schedule() --- however, if the signal was pending before sync_io entered
while (1) loop, the corruption of kernel memory will happen.

There is no way to cancel in-progress IOs, so the best solution is to ignore
signals at this point.

Cc: stable@kernel.org
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-04-02 19:55:24 +01:00
Mikulas Patocka 95f8fac8dc dm raid1: switch read_record from kmalloc to slab to save memory
With my previous patch to save bi_io_vec, the size of dm_raid1_read_record
is significantly increased (the vector list takes 3072 bytes on 32-bit machines
and 4096 bytes on 64-bit machines).

The structure dm_raid1_read_record used to be allocated with kmalloc,
but kmalloc aligns the size on the next power-of-two so an object
slightly greater than 4096 will allocate 8192 bytes of memory and half of
that memory will be wasted.

This patch turns kmalloc into a slab cache which doesn't have this
padding so it will reduce the memory consumed.

Cc: stable@kernel.org
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-04-02 19:55:24 +01:00
Mikulas Patocka a920f6b3ac dm: preserve bi_io_vec when resubmitting bios
Device mapper saves and restores various fields in the bio, but it doesn't save
bi_io_vec.  If the device driver modifies this after a partially successful
request, dm-raid1 and dm-multipath may attempt to resubmit a bio that has
bi_size inconsistent with the size of vector.

To make requests resubmittable in dm-raid1 and dm-multipath, we must save
and restore the bio vector as well.

To reduce the memory overhead involved in this, we do not save the pages in a
vector and use a 16-bit field size if the page size is less than 65536.

Cc: stable@kernel.org
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-04-02 19:55:23 +01:00
NeilBrown c8f517c444 md/raid5 revise rules for when to update metadata during reshape
We currently update the metadata :
 1/ every 3Megabytes
 2/ When the place we will write new-layout data to is recorded in
    the metadata as still containing old-layout data.

Rule one exists to avoid having to re-do too much reshaping in the
face of a crash/restart.  So it should really be time based rather
than size based.  So change it to "every 10 seconds".

Rule two turns out to be too harsh when restriping an array
'in-place', as in that case the metadata much be updates for every
stripe.
For the in-place update, it can only possibly be safe from a crash if
some user-space program data a backup of every e.g. few hundred
stripes before allowing them to be reshaped.  In that case, the
constant metadata update is pointless.
So only update the metadata if the new metadata will report that the
end of the 'old-layout' data is beyond where we are currently
writing 'new-layout' data.

Signed-off-by: NeilBrown <neilb@suse.de>
2009-03-31 15:28:40 +11:00
NeilBrown b0f9ec047b md/raid5: minor code cleanups in make_request.
... and to be certain the that make_request doesn't wait forever,
add a 'wake_up' when ->reshape_progress has been set to MaxSector

Signed-off-by: NeilBrown <neilb@suse.de>
2009-03-31 15:27:18 +11:00
NeilBrown 2cffc4a01d md: remove CONFIG_MD_RAID_RESHAPE config option.
This was only needed when the code was experimental.  Most of it
is well tested now, so the option is no longer useful.

Signed-off-by: NeilBrown <neilb@suse.de>
2009-03-31 15:27:05 +11:00
NeilBrown ab69ae12ce md/raid5: be more careful about write ordering when reshaping.
When we are reshaping an array, it is very important that we read
the data from a particular sector offset before writing new data
at that offset.

In most cases when growing or shrinking an array we read long before
we even consider writing.  But when restriping an array without
changing it size, there is a small possibility that we might have
some data to available write before the read has happened at the same
location.  This would require some stripes to be in cache already.

To guard against this small possibility, we check, before writing,
that the 'old' stripe at the same location is not in the process of
being read.  And we ensure that we mark all 'source' stripes as such
before allowing new 'destination' stripes to proceed.

Signed-off-by: NeilBrown <neilb@suse.de>
2009-03-31 15:26:47 +11:00
NeilBrown d1a7c50369 md: don't display meaningless values in sysfs files resync_start and sync_speed
When no resync if happening, both of these files currently have
meaningless values (is slightly different ways).
Change them to "none" in that case.

Signed-off-by: NeilBrown <neilb@suse.de>
2009-03-31 15:24:32 +11:00
NeilBrown 88ce4930e2 md/raid5: allow layout and chunksize to be changed on active array.
If an array has 3 or more devices, we allow the chunksize or layout
to be changed and when a reshape starts, we use these as the 'new'
values.


Signed-off-by: NeilBrown <neilb@suse.de>
2009-03-31 15:24:23 +11:00
NeilBrown 7a66138107 md/raid5: reshape using largest of old and new chunk size
This ensures that even when old and new stripes are overlapping,
we will try to read all of the old before having to write any
of the new.

Signed-off-by: NeilBrown <neilb@suse.de>
2009-03-31 15:21:40 +11:00
NeilBrown e183eaedd5 md/raid5: prepare for allowing reshape to change layout
Add prev_algo to raid5_conf_t along the same lines as prev_chunk
and previous_raid_disks.

Signed-off-by: NeilBrown <neilb@suse.de>
2009-03-31 15:20:22 +11:00
NeilBrown 784052ecc6 md/raid5: prepare for allowing reshape to change chunksize.
Add "prev_chunk" to raid5_conf_t, similar to "previous_raid_disks", to
remember what the chunk size was before the reshape that is currently
underway.

This seems like duplication with "chunk_size" and "new_chunk" in
mddev_t, and to some extent it is, but there are differences.
The values in mddev_t are always defined and often the same.
The prev* values are only defined if a reshape is underway.

Also (and more significantly) the raid5_conf_t values will be changed
at the same time (inside an appropriate lock) that the reshape is
started by setting reshape_position.  In contrast, the new_chunk value
is set when the sysfs file is written which could be well before the
reshape starts.

Signed-off-by: NeilBrown <neilb@suse.de>
2009-03-31 15:19:07 +11:00
NeilBrown 86b42c713b md/raid5: clearly differentiate 'before' and 'after' stripes during reshape.
During a raid5 reshape, we have some stripes in the cache that are
'before' the reshape (and are still to be processed) and some that are
'after'.  They are currently differentiated by having different
->disks values as the only reshape current supported involves changing
the number of disks.

However we will soon support reshapes that do not change the number
of disks (chunk parity or chunk size).  So make the difference more
explicit with a 'generation' number.

Signed-off-by: NeilBrown <neilb@suse.de>
2009-03-31 15:19:03 +11:00
NeilBrown ec32a2bd35 md: allow number of drives in raid5 to be reduced
When reshaping a raid5 to have fewer devices, we work from the end of
the array to the beginning.
md_do_sync gives addresses to sync_request that go from the beginning
to the end.  So largely ignore them use the internal state variable
"reshape_progress" to keep track of what to do next.

Never allow the size to be reduced below the minimum (4 for raid6,
3 otherwise).

We require that the size of the array has already been reduced before
the array is reshaped to a smaller size.  This is because simply
reducing the size is an easily reversible operation, while the reshape
is immediately destructive and so is not reversible for the blocks at
the ends of the devices.
Thus to reshape an array to have fewer devices, you must first write
an appropriately small size to md/array_size.

When reshape finished, we remove any drives that are no longer
needed and fix up ->degraded.

Signed-off-by: NeilBrown <neilb@suse.de>
2009-03-31 15:17:38 +11:00
NeilBrown fef9c61fdf md/raid5: change reshape-progress measurement to cope with reshaping backwards.
When reducing the number of devices in a raid4/5/6, the reshape
process has to start at the end of the array and work down to the
beginning.  So we need to handle expand_progress and expand_lo
differently.

This patch renames "expand_progress" and "expand_lo" to avoid the
implication that anything is getting bigger (expand->reshape) and
every place they are used, we make sure that they are used the right
way depending on whether delta_disks is positive or negative.

Signed-off-by: NeilBrown <neilb@suse.de>
2009-03-31 15:16:46 +11:00
NeilBrown cea9c22800 md: add explicit method to signal the end of a reshape.
Currently raid5 (the only module that supports restriping)
notices that the reshape has finished be sync_request being
given a large value, and handles any cleanup them.

This patch changes it so md_check_recovery calls into an
explicit finish_reshape method as well.

The clean-up from sync_request can do things that need to be
done promptly, typically things local to the raid5_conf_t
structure.

The "finish_reshape" method is called under the mddev_lock
so it can do things involving reconfiguring the device.

This allows us to get rid of md_set_array_sectors_locked, which
would have caused a deadlock if you tried to stop and array
while a reshape was happening.

Signed-off-by: NeilBrown <neilb@suse.de>
2009-03-31 15:15:05 +11:00
NeilBrown 7ec0547838 md/raid5: enhance raid5_size to work correctly with negative delta_disks
This is the first of four patches which combine to allow md/raid5 to
reduce the number of devices in the array by restriping the data over
a subset of the devices.

If the number of disks in a raid4/5/6 is being reduced, then the
default size must be based on the new number, not the old number
of devices.
In general, it should be based on the smaller of new and old.

Signed-off-by: NeilBrown <neilb@suse.de>
2009-03-31 15:10:36 +11:00
NeilBrown 34e04e87fb md/raid5: drop qd_idx from r6_state
We now have this value in stripe_head so we don't need to duplicate
it.

Signed-off-by: NeilBrown <neilb@suse.de>
2009-03-31 15:10:16 +11:00
Dan Williams f701d589aa md/raid6: move raid6 data processing to raid6_pq.ko
Move the raid6 data processing routines into a standalone module
(raid6_pq) to prepare them to be called from async_tx wrappers and other
non-md drivers/modules.  This precludes a circular dependency of raid456
needing the async modules for data processing while those modules in
turn depend on raid456 for the base level synchronous raid6 routines.

To support this move:
1/ The exportable definitions in raid6.h move to include/linux/raid/pq.h
2/ The raid6_call, recovery calls, and table symbols are exported
3/ Extra #ifdef __KERNEL__ statements to enable the userspace raid6test to
   compile

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2009-03-31 15:09:39 +11:00
Andre Noll 18b0033491 md: raid5 run(): Fix max_degraded for raid level 4.
raid4 allows only one failed disk.

Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: NeilBrown <neilb@suse.de>
2009-03-31 15:00:56 +11:00
Dan Williams b522adcde9 md: 'array_size' sysfs attribute
Allow userspace to set the size of the array according to the following
semantics:

1/ size must be <= to the size returned by mddev->pers->size(mddev, 0, 0)
   a) If size is set before the array is running, do_md_run will fail
      if size is greater than the default size
   b) A reshape attempt that reduces the default size to less than the set
      array size should be blocked
2/ once userspace sets the size the kernel will not change it
3/ writing 'default' to this attribute returns control of the size to the
   kernel and reverts to the size reported by the personality

Also, convert locations that need to know the default size from directly
reading ->array_sectors to <pers>_size.  Resync/reshape operations
always follow the default size.

Finally, fixup other locations that read a number of 1k-blocks from
userspace to use strict_blocks_to_sectors() which checks for unsigned
long long to sector_t overflow and blocks to sectors overflow.

Reviewed-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2009-03-31 15:00:31 +11:00
Dan Williams 1f403624bd md: centralize ->array_sectors modifications
Get personalities out of the business of directly modifying
->array_sectors.  Lays groundwork to introduce policy on when
->array_sectors can be modified.

Reviewed-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2009-03-31 14:59:03 +11:00
Dan Williams 80c3a6ce4b md: add 'size' as a personality method
In preparation for giving userspace control over ->array_sectors we need
to be able to retrieve the 'default' size, and the 'anticipated' size
when a reshape is requested.  For personalities that do not reshape emit
a warning if anything but the default size is requested.

In the raid5 case we need to update ->previous_raid_disks to make the
new 'default' size available.

Reviewed-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2009-03-31 14:57:49 +11:00
Atsushi SAKAI 93ed05e2a5 md: fix typo in FSF address
Hello,

 I found a typo Bosto"m" in FSF address.
And I am checking around linux source code.
Here is the only place which uses Bosto"m" (not Boston).

Signed-off-by: Atsushi SAKAI <sakaia@jp.fujitsu.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2009-03-31 14:57:37 +11:00
NeilBrown fc9739c6d6 md: add takeover support for converting raid6 back into raid5
If a raid6 is still in the layout that comes from converting raid5
into a raid6. this will allow us to convert it back again.

Signed-off-by: NeilBrown <neilb@suse.de>
2009-03-31 14:57:20 +11:00
NeilBrown e9d4758f6e md: add takeover support for raid4 -> raid5 conversion.
Signed-off-by: NeilBrown <neilb@suse.de>
2009-03-31 14:57:09 +11:00
NeilBrown b354603527 md/raid5: allow layout/chunksize to be changed on an active 2-drive raid5.
2-drive raid5's aren't very interesting.  But if you are converting
a raid1 into a raid5, you will at least temporarily have one.  And
that it a good time to set the layout/chunksize for the new RAID5
if you aren't happy with the defaults.

layout and chunksize don't actually affect the placement of data
on a 2-drive raid5, so we just do some internal book-keeping.

Signed-off-by: NeilBrown <neilb@suse.de>
2009-03-31 14:56:41 +11:00
NeilBrown d562b0c431 md: add ->takeover method for raid5 to be able to take over raid1
The RAID1 must have two drives and be a suitable size to
be a multiple of a chunksize that isn't too small.

Signed-off-by: NeilBrown <neilb@suse.de>
2009-03-31 14:39:39 +11:00
NeilBrown 245f46c2c2 md: add ->takeover method to support changing the personality managing an array
Implement this for RAID6 to be able to 'takeover' a RAID5 array.  The
new RAID6 will use a layout which places Q on the last device, and
that device will be missing.
If there are any available spares, one will immediately have Q
recovered onto it.

Signed-off-by: NeilBrown <neilb@suse.de>
2009-03-31 14:39:39 +11:00
NeilBrown 409c57f380 md: enable suspend/resume of md devices.
To be able to change the 'level' of an md/raid array, we need to
suspend the device so that no requests are active - then move some
pointers around etc.

The code already keeps counts of active requests and the ->quiesce
function can be used to wait until those counts hit zero.
However the quiesce function blocks new requests once they are all
ready 'inside' the personality module, and that is too late if we want
to replace the personality modules.

So make all md requests come in through a common md_make_request
function that keeps track of how many requests have entered the
modules but may not yet be on the internal reference counts.
Allow md_make_request to be blocked when we want to suspend the
device, and make it possible to wait for all those in-transit requests
to be added to internal lists so that ->quiesce can wait for them.

There is still a problem that when a request completes, we drop the
ref count inside the personality code so there is a short time between
when the refcount hits zero, and when the personality code is no
longer being used.
The personality code never blocks (schedule or spinlock) between
dropping the refcount and exiting the routine, so this should be safe
(as put_module calls synchronize_sched() before unmapping the module
code).

Signed-off-by: NeilBrown <neilb@suse.de>
2009-03-31 14:39:39 +11:00
NeilBrown e0cf8f045b md: md_unregister_thread should cope with being passed NULL
Mostly md_unregister_thread is only called when we know that the
thread is NULL, but sometimes we need to check first.  It is safer
to put the check inside md_unregister_thread itself.

Signed-off-by: NeilBrown <neilb@suse.de>
2009-03-31 14:39:39 +11:00
NeilBrown 91adb56473 md/raid5: refactor raid5 "run"
.. so that the code to create the private data structures is separate.
This will help with future code to change the level of an active
array.

Signed-off-by: NeilBrown <neilb@suse.de>
2009-03-31 14:39:39 +11:00