Commit Graph

565 Commits

Author SHA1 Message Date
Eric Mei 9ffc8f7cb9 md/raid5: don't do chunk aligned read on degraded array.
When array is degraded, read data landed on failed drives will result in
reading rest of data in a stripe. So a single sequential read would
result in same data being read twice.

This patch is to avoid chunk aligned read for degraded array. The
downside is to involve stripe cache which means associated CPU overhead
and extra memory copy.

Test Results:
Following test are done on a enterprise storage node with Seagate 6T SAS
drives and Xeon E5-2648L CPU (10 cores, 1.9Ghz), 10 disks MD RAID6 8+2,
chunk size 128 KiB.

I use FIO, using direct-io with various bs size, enough queue depth,
tested sequential and 100% random read against 3 array config:
 1) optimal, as baseline;
 2) degraded;
 3) degraded with this patch.
Kernel version is 4.0-rc3.

Each individual test I only did once so there might be some variations,
but we just focus on big trend.

Sequential Read:
  bs=(KiB)  optimal(MiB/s)  degraded(MiB/s)  degraded-with-patch (MiB/s)
   1024       1608            656              995
    512       1624            710              956
    256       1635            728              980
    128       1636            771              983
     64       1612           1119             1000
     32       1580           1420             1004
     16       1368            688              986
      8        768            647              953
      4        411            413              850

Random Read:
  bs=(KiB)  optimal(IOPS)  degraded(IOPS)  degraded-with-patch (IOPS)
   1024        163            160              156
    512        274            273              272
    256        426            428              424
    128        576            592              591
     64        726            724              726
     32        849            848              837
     16        900            970              971
      8        927            940              929
      4        948            940              955

Some notes:
  * In sequential + optimal, as bs size getting smaller, the FIO thread
become CPU bound.
  * In sequential + degraded, there's big increase when bs is 64K and
32K, I don't have explanation.
  * In sequential + degraded-with-patch, the MD thread mostly become CPU
bound.

If you want to we can discuss specific data point in those data. But in
general it seems with this patch, we have more predictable and in most
cases significant better sequential read performance when array is
degraded, and almost no noticeable impact on random read.

Performance is a complicated thing, the patch works well for this
particular configuration, but may not be universal. For example I
imagine testing on all SSD array may have very different result. But I
personally think in most cases IO bandwidth is more scarce resource than
CPU.


Signed-off-by: Eric Mei <eric.mei@seagate.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2015-04-22 08:00:43 +10:00
NeilBrown edbe83ab4c md/raid5: allow the stripe_cache to grow and shrink.
The default setting of 256 stripe_heads is probably
much too small for many configurations.  So it is best to make it
auto-configure.

Shrinking the cache under memory pressure is easy.  The only
interesting part here is that we put a fairly high cost
('seeks') on shrinking the cache as the cost is greater than
just having to read more data, it reduces parallelism.

Growing the cache on demand needs to be done carefully.  If we allow
fast growth, that can upset memory balance as lots of dirty memory can
quickly turn into lots of memory queued in the stripe_cache.
It is important for the raid5 block device to appear congested to
allow write-throttling to work.

So we only add stripes slowly. We set a flag when an allocation
fails because all stripes are in use, allocate at a convenient
time when that flag is set, and don't allow it to be set again
until at least one stripe_head has been released for re-use.

This means that a spurt of requests will only cause one stripe_head
to be allocated, but a steady stream of requests will slowly
increase the cache size - until memory pressure puts it back again.

It could take hours to reach a steady state.

The value written to, and displayed in, stripe_cache_size is
used as a minimum.  The cache can grow above this and shrink back
down to it.  The actual size is not directly visible, though it can
be deduced to some extent by watching stripe_cache_active.

Signed-off-by: NeilBrown <neilb@suse.de>
2015-04-22 08:00:43 +10:00
NeilBrown 5423399a84 md/raid5: change ->inactive_blocked to a bit-flag.
This allows us to easily add more (atomic) flags.

Signed-off-by: NeilBrown <neilb@suse.de>
2015-04-22 08:00:43 +10:00
NeilBrown 486f0644c3 md/raid5: move max_nr_stripes management into grow_one_stripe and drop_one_stripe
Rather than adjusting max_nr_stripes whenever {grow,drop}_one_stripe()
succeeds, do it inside the functions.

Also choose the correct hash to handle next inside the functions.

This removes duplication and will help with future new uses of
{grow,drop}_one_stripe.

This also fixes a minor bug where the "md/raid:%md: allocate XXkB"
message always said "0kB".

Signed-off-by: NeilBrown <neilb@suse.de>
2015-04-22 08:00:42 +10:00
NeilBrown a9683a795b md/raid5: pass gfp_t arg to grow_one_stripe()
This is needed for future improvement to stripe cache management.

Signed-off-by: NeilBrown <neilb@suse.de>
2015-04-22 08:00:42 +10:00
Markus Stockhausen d06f191f8e md/raid5: introduce configuration option rmw_level
Depending on the available coding we allow optimized rmw logic for write
operations. To support easier testing this patch allows manual control
of the rmw/rcw descision through the interface /sys/block/mdX/md/rmw_level.

The configuration can handle three levels of control.

rmw_level=0: Disable rmw for all RAID types. Hardware assisted P/Q
calculation has no implementation path yet to factor in/out chunks of
a syndrome. Enforcing this level can be benefical for slow CPUs with
hardware syndrome support and fast SSDs.

rmw_level=1: Estimate rmw IOs and rcw IOs. Execute rmw only if we will
save IOs. This equals the "old" unpatched behaviour and will be the
default.

rmw_level=2: Execute rmw even if calculated IOs for rmw and rcw are
equal. We might have higher CPU consumption because of calculating the
parity twice but it can be benefical otherwise. E.g. RAID4 with fast
dedicated parity disk/SSD. The option is implemented just to be
forward-looking and will ONLY work with this patch!

Signed-off-by: Markus Stockhausen <stockhausen@collogia.de>
Signed-off-by: NeilBrown <neilb@suse.de>
2015-04-22 08:00:42 +10:00
Markus Stockhausen 584acdd49c md/raid5: activate raid6 rmw feature
Glue it altogehter. The raid6 rmw path should work the same as the
already existing raid5 logic. So emulate the prexor handling/flags
and split functions as needed.

1) Enable xor_syndrome() in the async layer.

2) Split ops_run_prexor() into RAID4/5 and RAID6 logic. Xor the syndrome
at the start of a rmw run as we did it before for the single parity.

3) Take care of rmw run in ops_run_reconstruct6(). Again process only
the changed pages to get syndrome back into sync.

4) Enhance set_syndrome_sources() to fill NULL pages if we are in a rmw
run. The lower layers will calculate start & end pages from that and
call the xor_syndrome() correspondingly.

5) Adapt the several places where we ignored Q handling up to now.

Performance numbers for a single E5630 system with a mix of 10 7200k
desktop/server disks. 300 seconds random write with 8 threads onto a
3,2TB (10*400GB) RAID6 64K chunk without spare (group_thread_cnt=4)

bsize   rmw_level=1   rmw_level=0   rmw_level=1   rmw_level=0
        skip_copy=1   skip_copy=1   skip_copy=0   skip_copy=0
   4K      115 KB/s      141 KB/s      165 KB/s      140 KB/s
   8K      225 KB/s      275 KB/s      324 KB/s      274 KB/s
  16K      434 KB/s      536 KB/s      640 KB/s      534 KB/s
  32K      751 KB/s    1,051 KB/s    1,234 KB/s    1,045 KB/s
  64K    1,339 KB/s    1,958 KB/s    2,282 KB/s    1,962 KB/s
 128K    2,673 KB/s    3,862 KB/s    4,113 KB/s    3,898 KB/s
 256K    7,685 KB/s    7,539 KB/s    7,557 KB/s    7,638 KB/s
 512K   19,556 KB/s   19,558 KB/s   19,652 KB/s   19,688 Kb/s

Signed-off-by: Markus Stockhausen <stockhausen@collogia.de>
Signed-off-by: NeilBrown <neilb@suse.de>
2015-04-22 08:00:42 +10:00
shli@kernel.org dabc4ec6ba raid5: handle expansion/resync case with stripe batching
expansion/resync can grab a stripe when the stripe is in batch list. Since all
stripes in batch list must be in the same state, we can't allow some stripes
run into expansion/resync. So we delay expansion/resync for stripe in batch
list.

Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2015-04-22 08:00:41 +10:00
shli@kernel.org 72ac733015 raid5: handle io error of batch list
If io error happens in any stripe of a batch list, the batch list will be
split, then normal process will run for the stripes in the list.

Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2015-04-22 08:00:41 +10:00
shli@kernel.org 59fc630b8b RAID5: batch adjacent full stripe write
stripe cache is 4k size. Even adjacent full stripe writes are handled in 4k
unit. Idealy we should use big size for adjacent full stripe writes. Bigger
stripe cache size means less stripes runing in the state machine so can reduce
cpu overhead. And also bigger size can cause bigger IO size dispatched to under
layer disks.

With below patch, we will automatically batch adjacent full stripe write
together. Such stripes will be added to the batch list. Only the first stripe
of the list will be put to handle_list and so run handle_stripe(). Some steps
of handle_stripe() are extended to cover all stripes of the list, including
ops_run_io, ops_run_biodrain and so on. With this patch, we have less stripes
running in handle_stripe() and we send IO of whole stripe list together to
increase IO size.

Stripes added to a batch list have some limitations. A batch list can only
include full stripe write and can't cross chunk boundary to make sure stripes
have the same parity disks. Stripes in a batch list must be in the same state
(no written, toread and so on). If a stripe is in a batch list, all new
read/write to add_stripe_bio will be blocked to overlap conflict till the batch
list is handled. The limitations will make sure stripes in a batch list be in
exactly the same state in the life circly.

I did test running 160k randwrite in a RAID5 array with 32k chunk size and 6
PCIe SSD. This patch improves around 30% performance and IO size to under layer
disk is exactly 32k. I also run a 4k randwrite test in the same array to make
sure the performance isn't changed with the patch.

Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2015-04-22 08:00:41 +10:00
shli@kernel.org 7a87f43405 raid5: track overwrite disk count
Track overwrite disk count, so we can know if a stripe is a full stripe write.

Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2015-04-22 08:00:41 +10:00
shli@kernel.org da41ba6597 raid5: add a new flag to track if a stripe can be batched
A freshly new stripe with write request can be batched. Any time the stripe is
handled or new read is queued, the flag will be cleared.

Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2015-04-22 08:00:41 +10:00
shli@kernel.org 46d5b78562 raid5: use flex_array for scribble data
Use flex_array for scribble data. Next patch will batch several stripes
together, so scribble data should be able to cover several stripes, so this
patch also allocates scribble data for stripes across a chunk.

Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2015-04-22 08:00:41 +10:00
NeilBrown 09314799e4 md: remove 'go_faster' option from ->sync_request()
This option is not well justified and testing suggests that
it hardly ever makes any difference.

The comment suggests there might be a need to wait for non-resync
activity indicated by ->nr_waiting, however raise_barrier()
already waits for all of that.

So just remove it to simplify reasoning about speed limiting.

This allows us to remove a 'FIXME' comment from raid5.c as that
never used the flag.

Signed-off-by: NeilBrown <neilb@suse.de>
2015-04-22 08:00:40 +10:00
Eric Mei 16d9cfab93 raid5: check faulty flag for array status during recovery.
When we have more than 1 drive failure, it's possible we start
rebuild one drive while leaving another faulty drive in array.
To determine whether array will be optimal after building, current
code only check whether a drive is missing, which could potentially
lead to data corruption. This patch is to add checking Faulty flag.

Signed-off-by: NeilBrown <neilb@suse.de>
2015-02-25 11:38:26 +11:00
NeilBrown 26ac107378 md/raid5: Fix livelock when array is both resyncing and degraded.
Commit a7854487cd7128a30a7f4f5259de9f67d5efb95f:
  md: When RAID5 is dirty, force reconstruct-write instead of read-modify-write.

Causes an RCW cycle to be forced even when the array is degraded.
A degraded array cannot support RCW as that requires reading all data
blocks, and one may be missing.

Forcing an RCW when it is not possible causes a live-lock and the code
spins, repeatedly deciding to do something that cannot succeed.

So change the condition to only force RCW on non-degraded arrays.

Reported-by: Manibalan P <pmanibalan@amiindia.co.in>
Bisected-by: Jes Sorensen <Jes.Sorensen@redhat.com>
Tested-by: Jes Sorensen <Jes.Sorensen@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Fixes: a7854487cd
Cc: stable@vger.kernel.org (v3.7+)
2015-02-18 11:35:14 +11:00
NeilBrown 6791875e2e md: make reconfig_mutex optional for writes to md sysfs files.
Rather than using mddev_lock() to take the reconfig_mutex
when writing to any md sysfs file, we only take mddev_lock()
in the particular _store() functions that require it.
Admittedly this is most, but it isn't all.

This also allows us to remove special-case handling for new_dev_store
(in md_attr_store).

Signed-off-by: NeilBrown <neilb@suse.de>
2015-02-06 09:32:56 +11:00
NeilBrown 7b1485bab9 md/raid5: use ->lock to protect accessing raid5 sysfs attributes.
It is important that mddev->private isn't freed while
a sysfs attribute function is accessing it.

So use mddev->lock to protect the setting of ->private to NULL, and
take that lock when checking ->private for NULL and de-referencing it
in the sysfs access functions.

This only applies to the read ('show') side of access.  Write
access will be handled separately.

Signed-off-by: NeilBrown <neilb@suse.de>
2015-02-06 09:32:55 +11:00
NeilBrown afa0f557cb md: rename ->stop to ->free
Now that the ->stop function only frees the private data,
rename is accordingly.

Also pass in the private pointer as an arg rather than using
mddev->private.  This flexibility will be useful in level_store().

Finally, don't clear ->private.  It doesn't make sense to clear
it seeing that isn't what we free, and it is no longer necessary
to clear ->private (it was some time ago before  ->to_remove was
introduced).

Setting ->to_remove in ->free() is a bit of a wart, but not a
big problem at the moment.

Signed-off-by: NeilBrown <neilb@suse.de>
2015-02-04 08:35:52 +11:00
NeilBrown 5aa61f427e md: split detach operation out from ->stop.
Each md personality has a 'stop' operation which does two
things:
 1/ it finalizes some aspects of the array to ensure nothing
    is accessing the ->private data
 2/ it frees the ->private data.

All the steps in '1' can apply to all arrays and so can be
performed in common code.

This is useful as in the case where we change the personality which
manages an array (in level_store()), it would be helpful to do
step 1 early, and step 2 later.

So split the 'step 1' functionality out into a new mddev_detach().

Signed-off-by: NeilBrown <neilb@suse.de>
2015-02-04 08:35:52 +11:00
NeilBrown 64590f45dd md: make merge_bvec_fn more robust in face of personality changes.
There is no locking around calls to merge_bvec_fn(), so
it is possible that calls which coincide with a level (or personality)
change could go wrong.

So create a central dispatch point for these functions and use
rcu_read_lock().
If the array is suspended, reject any merge that can be rejected.
If not, we know it is safe to call the function.

Signed-off-by: NeilBrown <neilb@suse.de>
2015-02-04 08:35:52 +11:00
NeilBrown 5c675f83c6 md: make ->congested robust against personality changes.
There is currently no locking around calls to the 'congested'
bdi function.  If called at an awkward time while an array is
being converted from one level (or personality) to another, there
is a tiny chance of running code in an unreferenced module etc.

So add a 'congested' function to the md_personality operations
structure, and call it with appropriate locking from a central
'mddev_congested'.

When the array personality is changing the array will be 'suspended'
so no IO is processed.
If mddev_congested detects this, it simply reports that the
array is congested, which is a safe guess.
As mddev_suspend calls synchronize_rcu(), mddev_congested can
avoid races by included the whole call inside an rcu_read_lock()
region.
This require that the congested functions for all subordinate devices
can be run under rcu_lock.  Fortunately this is the case.

Signed-off-by: NeilBrown <neilb@suse.de>
2015-02-04 08:35:52 +11:00
NeilBrown ea664c8245 md/raid5: need_this_block: tidy/fix last condition.
That last condition is unclear and over cautious.

There are two related issues here.

If a partial write is destined for a missing device, then
either RMW or RCW can work.  We must read all the available
block.  Only then can the missing blocks be calculated, and
then the parity update performed.

If RMW is not an option, then there is a complication even
without partial writes.  If we would need to read a missing
device to perform the reconstruction, then we must first read every
block so the missing device data can be computed.
This is the case for RAID6 (Which currently does not support
RMW) and for times when we don't trust the parity (after a crash)
and so are in the process of resyncing it.

So make these two cases more clear and separate, and perform
the relevant tests more  thoroughly.

Signed-off-by: NeilBrown <neilb@suse.de>
2015-02-04 08:35:51 +11:00
NeilBrown a9d56950f7 md/raid5: need_this_block: start simplifying the last two conditions.
Both the last two cases are only relevant if something has failed and
something needs to be written (but not over-written), and if it is OK
to pre-read blocks at this point.  So factor out those tests and
explain them.

Signed-off-by: NeilBrown <neilb@suse.de>
2015-02-04 08:35:51 +11:00
NeilBrown a79cfe12c6 md/raid5: separate out the easy conditions in need_this_block.
Some of the conditions in need_this_block have very straight
forward motivation.  Separate those out and document them.

Signed-off-by: NeilBrown <neilb@suse.de>
2015-02-04 08:35:51 +11:00
NeilBrown 2c58f06e6f md/raid5: separate large if clause out of fetch_block().
fetch_block() has a very large and hard to read 'if' condition.

Separate it into its own function so that it can be
made more readable.

Signed-off-by: NeilBrown <neilb@suse.de>
2015-02-04 08:35:51 +11:00
Jes Sorensen ad3ab8b608 md: do_release_stripe(): No need to call md_wakeup_thread() twice
67f455486d introduced a call to
md_wakeup_thread() when adding to the delayed_list. However the md
thread is woken up unconditionally just below.

Remove the unnecessary wakeup call.

Signed-off-by: Jes Sorensen <Jes.Sorensen@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2015-02-04 08:35:51 +11:00
NeilBrown b1b02fe97f md/raid5: fix another livelock caused by non-aligned writes.
If a non-page-aligned write is destined for a device which
is missing/faulty, we can deadlock.

As the target device is missing, a read-modify-write cycle
is not possible.
As the write is not for a full-page, a recontruct-write cycle
is not possible.

This should be handled by logic in fetch_block() which notices
there is a non-R5_OVERWRITE write to a missing device, and so
loads all blocks.

However since commit 67f455486d, that code requires
STRIPE_PREREAD_ACTIVE before it will active, and those circumstances
never set STRIPE_PREREAD_ACTIVE.

So: in handle_stripe_dirtying, if neither rmw or rcw was possible,
set STRIPE_DELAYED, which will cause STRIPE_PREREAD_ACTIVE be set
after a suitable delay.

Fixes: 67f455486d
Cc: stable@vger.kernel.org (v3.16+)
Reported-by: Mikulas Patocka <mpatocka@redhat.com>
Tested-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2015-02-02 16:57:17 +11:00
NeilBrown 108cef3aa4 md/raid5: fetch_block must fetch all the blocks handle_stripe_dirtying wants.
It is critical that fetch_block() and handle_stripe_dirtying()
are consistent in their analysis of what needs to be loaded.
Otherwise raid5 can wait forever for a block that won't be loaded.

Currently when writing to a RAID5 that is resyncing, to a location
beyond the resync offset, handle_stripe_dirtying chooses a
reconstruct-write cycle, but fetch_block() assumes a
read-modify-write, and a lockup can happen.

So treat that case just like RAID6, just as we do in
handle_stripe_dirtying.  RAID6 always does reconstruct-write.

This bug was introduced when the behaviour of handle_stripe_dirtying
was changed in 3.7, so the patch is suitable for any kernel since,
though it will need careful merging for some versions.

Cc: stable@vger.kernel.org (v3.7+)
Fixes: a7854487cd
Reported-by: Henry Cai <henryplusplus@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2014-12-03 16:07:58 +11:00
NeilBrown f72ffdd686 md: remove unwanted white space from md.c
My editor shows much of this is RED.

Signed-off-by: NeilBrown <neilb@suse.de>
2014-10-14 13:08:29 +11:00
Markus Stockhausen b8e6a15a1a md/raid5: fix init_stripe() inconsistencies
raid5: fix init_stripe() inconsistencies

1) remove_hash() is not necessary. We will only be called right after
   get_free_stripe(). There we have already a call to remove_hash().

2) Tracing prints out the sector of the freed stripe and not the sector
   that we want to initialize.

Signed-off-by: NeilBrown <neilb@suse.de>
2014-10-14 13:08:28 +11:00
NeilBrown 3fd83717e4 md: use set_bit/clear_bit instead of shift/mask for bi_flags changes.
Using {set,clear}_bit is more consistent than shifting and masking.

No functional change.

Signed-off-by: NeilBrown <neilb@suse.de>
2014-10-09 10:07:04 +11:00
NeilBrown 8e0e99ba64 md/raid5: disable 'DISCARD' by default due to safety concerns.
It has come to my attention (thanks Martin) that 'discard_zeroes_data'
is only a hint.  Some devices in some cases don't do what it
says on the label.

The use of DISCARD in RAID5 depends on reads from discarded regions
being predictably zero.  If a write to a previously discarded region
performs a read-modify-write cycle it assumes that the parity block
was consistent with the data blocks.  If all were zero, this would
be the case.  If some are and some aren't this would not be the case.
This could lead to data corruption after a device failure when
data needs to be reconstructed from the parity.

As we cannot trust 'discard_zeroes_data', ignore it by default
and so disallow DISCARD on all raid4/5/6 arrays.

As many devices are trustworthy, and as there are benefits to using
DISCARD, add a module parameter to over-ride this caution and cause
DISCARD to work if discard_zeroes_data is set.

If a site want to enable DISCARD on some arrays but not on others they
should select DISCARD support at the filesystem level, and set the
raid456 module parameter.
    raid456.devices_handle_discard_safely=Y

As this is a data-safety issue, I believe this patch is suitable for
-stable.
DISCARD support for RAID456 was added in 3.7

Cc: Shaohua Li <shli@kernel.org>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Heinz Mauelshagen <heinzm@redhat.com>
Cc: stable@vger.kernel.org (3.7+)
Acked-by: Martin K. Petersen <martin.petersen@oracle.com>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Fixes: 620125f2bf
Signed-off-by: NeilBrown <neilb@suse.de>
2014-10-02 13:45:00 +10:00
NeilBrown 9c4bdf697c md/raid6: avoid data corruption during recovery of double-degraded RAID6
During recovery of a double-degraded RAID6 it is possible for
some blocks not to be recovered properly, leading to corruption.

If a write happens to one block in a stripe that would be written to a
missing device, and at the same time that stripe is recovering data
to the other missing device, then that recovered data may not be written.

This patch skips, in the double-degraded case, an optimisation that is
only safe for single-degraded arrays.

Bug was introduced in 2.6.32 and fix is suitable for any kernel since
then.  In an older kernel with separate handle_stripe5() and
handle_stripe6() functions the patch must change handle_stripe6().

Cc: stable@vger.kernel.org (2.6.32+)
Fixes: 6c0069c0ae
Cc: Yuri Tikhonov <yur@emcraft.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Reported-by: "Manibalan P" <pmanibalan@amiindia.co.in>
Tested-by: "Manibalan P" <pmanibalan@amiindia.co.in>
Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1090423
Signed-off-by: NeilBrown <neilb@suse.de>
Acked-by: Dan Williams <dan.j.williams@intel.com>
2014-08-18 14:49:46 +10:00
NeilBrown a40687ff73 md/raid5: avoid livelock caused by non-aligned writes.
If a stripe in a raid6 array received a write to each data block while
the array is degraded, and if any of these writes to a missing device
are not page-aligned, then a live-lock happens.

In this case the P and Q blocks need to be read so that the part of
the missing block which is *not* being updated by the write can be
constructed.  Due to a logic error, these blocks are not loaded, so
the update cannot proceed and the stripe is 'handled' repeatedly in an
infinite loop.

This bug is unlikely as most writes are page aligned.  However as it
can lead to a livelock it is suitable for -stable.  It was introduced
in 3.16.

Cc: stable@vger.kernel.org (v3.16)
Fixed: 67f455486d
Signed-off-by: NeilBrown <neilb@suse.de>
2014-08-18 14:49:41 +10:00
Linus Torvalds 8d0304e69d Assorted md fixes for 3.16
Mostly performance improvements with a few corner-case bug fixes.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQIVAwUAU5fevznsnt1WYoG5AQKzNxAAgjZPg6LZgl0wEHJi09ykqLK9dSkglz2B
 RY5IEWPEMcsGJQbSX7fEME5WSHkqKBIHx43xmdzijPCHyWLnF4vQRubNAi/Wo8PC
 yfJyI5hMrG8+0rNKnmXWeG+NLGj7l7j+wkExe27VBlu2+6ATvIUCa34kjOWFDJJx
 xAN9/t22XGly776SiuLXTpMhA0pRWH4sqijJhvM5EqPlsEUjGxvQU/URwUPYzhYr
 iCbvY64uKB1Crh+vn7yP1IUWc77aDOoUPIxO50m0GXOQFU/wbQta1NAFmMeu8tIo
 3QRlGzP+jRJFBV7jLii4nHTMSVxBmD2zMppxVtDoq/egWhyZlJhh2Kd5sUKo2r9V
 8bGQm7e1JlsDsRMVXxSlx3TebZDfJhtnFNhYb6JPlpWM5EYjCzXXfJUl3/6LfKX6
 BKJ800xvabr5TFkVubF9H80LTQxaO+pQoS6PSs8hnwneqtybdbt9o7V59yluMGao
 IwaG/BY4jlY+76YnK32Kbr1eofUNYXBLg45mgSkrhUVqv9HLHe0CDnYAzYs68aur
 zG8a1rhA9IJtBmWwql2Gm8lKOnDFM6onvSmaYntsabJ++eahaNDpL5ck+R4kO8mJ
 pyI1N6GQDjtLdONk4NmH36P+CuCZje6KwRbVXvsJIBA1042tkZfU0zdqgt5QhFZj
 Ma9ZCn9wZFo=
 =MaM/
 -----END PGP SIGNATURE-----

Merge tag 'md/3.16' of git://neil.brown.name/md

Pull md updates from Neil Brown:
 "Assorted md fixes for 3.16

  Mostly performance improvements with a few corner-case bug fixes"

* tag 'md/3.16' of git://neil.brown.name/md:
  raid5: speedup sync_request processing
  md/raid5: deadlock between retry_aligned_read with barrier io
  raid5: add an option to avoid copy data from bio to stripe cache
  md/bitmap: remove confusing code from filemap_get_page.
  raid5: avoid release list until last reference of the stripe
  md: md_clear_badblocks should return an error code on failure.
  md/raid56: Don't perform reads to support writes until stripe is ready.
  md: refuse to change shape of array if it is active but read-only
2014-06-11 08:33:41 -07:00
Eivind Sarto 053f5b6525 raid5: speedup sync_request processing
The raid5 sync_request() processing calls handle_stripe() within the context of
the resync-thread.  The resync-thread issues the first set of read requests
and this adds execution latency and slows down the scheduling of the next
sync_request().
The current rebuild/resync speed of raid5 is not much faster than what
rotational HDDs can sustain.
Testing the following patch on a 6-drive array, I can increase the rebuild
speed from 100 MB/s to 175 MB/s.
The sync_request() now just sets STRIPE_HANDLE and releases the stripe.  This
creates some more parallelism between the resync-thread and raid5 kernel daemon.

Signed-off-by: Eivind Sarto <esarto@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2014-06-10 11:02:01 +10:00
hui jiao 2844dc32ea md/raid5: deadlock between retry_aligned_read with barrier io
A chunk aligned read increases counter active_aligned_reads and
decreases it after sub-device handle it successfully. But when a read
error occurs,  the read redispatched by raid5d, and the
active_aligned_reads will not be decreased until we can grab a stripe
head in retry_aligned_read. Now suppose, a barrier io comes, set
conf->quiesce to 2, and wait until both active_stripes and
active_aligned_reads are zero. The retried chunk aligned read gets
stuck at get_active_stripe waiting until conf->quiesce becomes 0.
Retry_aligned_read and barrier io are waiting each other now.
One possible solution is that we ignore conf->quiesce, let the retried
aligned read finish. I reproduced this deadlock and test this patch on
centos6.0

Signed-off-by: NeilBrown <neilb@suse.de>
2014-06-05 17:18:19 +10:00
Shaohua Li d592a99691 raid5: add an option to avoid copy data from bio to stripe cache
The stripe cache has two goals:
1. cache data, so next time if data can be found in stripe cache, disk access
can be avoided.
2. stable data. data is copied from bio to stripe cache and calculated parity.
data written to disk is from stripe cache, so if upper layer changes bio data,
data written to disk isn't impacted.

In my environment, I can guarantee 2 will not happen. And BDI_CAP_STABLE_WRITES
can guarantee 2 too. For 1, it's not common too. block plug mechanism will
dispatch a bunch of sequentail small requests together. And since I'm using
SSD, I'm using small chunk size. It's rare case stripe cache is really useful.

So I'd like to avoid the copy from bio to stripe cache and it's very helpful
for performance. In my 1M randwrite tests, avoid the copy can increase the
performance more than 30%.

Of course, this shouldn't be enabled by default. It's reported enabling
BDI_CAP_STABLE_WRITES can harm some workloads before, so I added an option to
control it.

Neilb:
  changed BUG_ON to WARN_ON
  Removed some assignments from raid5_build_block which are now not needed.

Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2014-05-29 16:59:47 +10:00
Eivind Sarto cf170f3fa4 raid5: avoid release list until last reference of the stripe
The (lockless) release_list reduces lock contention, but there is excessive
queueing and dequeuing of stripes on this list.  A stripe will currently be
queued on the release_list with a stripe reference count > 1.  This can cause
the raid5 kernel thread(s) to dequeue the stripe and decrement the refcount
without doing any other useful processing of the stripe.  The are two cases
when the stripe can be put on the release_list multiple times before it is
actually handled by the kernel thread(s).
1) make_request() activates the stripe processing in 4k increments.  When a
   write request is large enough to span multiple chunks of a stripe_head, the
   first 4k chunk adds the stripe to the plug list.  The next 4k chunk that is
   processed for the same stripe puts the stripe on the release_list with a
   refcount=2.  This can cause the kernel thread to process and decrement the
   stripe before the stripe us unplugged, which again will put it back on the
   release_list.
2) Whenever IO is scheduled on a stripe (pre-read and/or write), the stripe
   refcount is set to the number of active IO (for each chunk).  The stripe is
   released as each IO complete, and can be queued and dequeued multiple times
   on the release_list, until its refcount finally reached zero.

This simple patch will ensure a stripe is only queued on the release_list when
its refcount=1 and is ready to be handled by the kernel thread(s).  I added some
instrumentation to raid5 and counted the number of times striped were queued on
the release_list for a variety of write IO sizes.  Without this patch the number
of times stripes got queued on the release_list was 100-500% higher than with
the patch.  The excess queuing will increase with the IO size.  The patch also
improved throughput by 5-10%.

Signed-off-by: Eivind Sarto <esarto@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2014-05-29 16:59:46 +10:00
NeilBrown 67f455486d md/raid56: Don't perform reads to support writes until stripe is ready.
If it is found that we need to pre-read some blocks before a write
can succeed, we normally set STRIPE_DELAYED and don't actually perform
the read until STRIPE_PREREAD_ACTIVE subsequently gets set.

However for a degraded RAID6 we currently perform the reads as soon
as we see that a write is pending.  This significantly hurts
throughput.

So:
 - when handle_stripe_dirtying find a block that it wants on a device
   that is failed, set STRIPE_DELAY, instead of doing nothing, and
 - when fetch_block detects that a read might be required to satisfy a
   write, only perform the read if STRIPE_PREREAD_ACTIVE is set,
   and if we would actually need to read something to complete the write.

This also helps RAID5, though less often as RAID5 supports a
read-modify-write cycle.  For RAID5 the read is performed too early
only if the write is not a full 4K aligned write (i.e. no an
R5_OVERWRITE).

Also clean up a couple of horrible bits of formatting.

Reported-by: Patrik Horník <patrik@dsl.sk>
Signed-off-by: NeilBrown <neilb@suse.de>
2014-05-29 16:59:46 +10:00
Peter Zijlstra 4e857c58ef arch: Mass conversion of smp_mb__*()
Mostly scripted conversion of the smp_mb__* barriers.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/n/tip-55dhyhocezdw1dg7u19hmh1u@git.kernel.org
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linux-arch@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-04-18 14:20:48 +02:00
Shaohua Li c7a6d35e46 raid5: fix a race of stripe count check
I hit another BUG_ON with e240c1839d. In __get_priority_stripe(),
stripe count equals to 0 initially. Between atomic_inc and BUG_ON,
get_active_stripe() finds the stripe. So the stripe count isn't 1 any more.

V2: keeps the BUG_ON suggested by Neil.

Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2014-04-17 17:05:28 +10:00
Shaohua Li e240c1839d raid5: get_active_stripe avoids device_lock
For sequential workload (or request size big workload), get_active_stripe can
find cached stripe. In this case, we always hold device_lock, which exposes a
lot of lock contention for such workload. If stripe count isn't 0, we don't
need hold the lock actually, since we just increase its count. And this is the
hot code path for such workload. Unfortunately we must delete the BUG_ON.

Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2014-04-09 14:42:42 +10:00
Shaohua Li 27c0f68f07 raid5: make_request does less prepare wait
In NUMA machine, prepare_to_wait/finish_wait in make_request exposes a
lot of contention for sequential workload (or big request size
workload). For such workload, each bio includes several stripes. So we
can just do prepare_to_wait/finish_wait once for the whold bio instead
of every stripe.  This reduces the lock contention completely for such
workload. Random workload might have the similar lock contention too,
but I didn't see it yet, maybe because my stroage is still not fast
enough.

Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2014-04-09 14:42:38 +10:00
Oleg Nesterov 789b5e0315 md/raid5: Fix CPU hotplug callback registration
Subsystems that want to register CPU hotplug callbacks, as well as perform
initialization for the CPUs that are already online, often do it as shown
below:

	get_online_cpus();

	for_each_online_cpu(cpu)
		init_cpu(cpu);

	register_cpu_notifier(&foobar_cpu_notifier);

	put_online_cpus();

This is wrong, since it is prone to ABBA deadlocks involving the
cpu_add_remove_lock and the cpu_hotplug.lock (when running concurrently
with CPU hotplug operations).

Interestingly, the raid5 code can actually prevent double initialization and
hence can use the following simplified form of callback registration:

	register_cpu_notifier(&foobar_cpu_notifier);

	get_online_cpus();

	for_each_online_cpu(cpu)
		init_cpu(cpu);

	put_online_cpus();

A hotplug operation that occurs between registering the notifier and calling
get_online_cpus(), won't disrupt anything, because the code takes care to
perform the memory allocations only once.

So reorganize the code in raid5 this way to fix the deadlock with callback
registration.

Cc: linux-raid@vger.kernel.org
Cc: stable@vger.kernel.org (v2.6.32+)
Fixes: 36d1c6476b
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
[Srivatsa: Fixed the unregister_cpu_notifier() deadlock, added the
free_scratch_buffer() helper to condense code further and wrote the changelog.]
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2014-02-13 13:46:45 +11:00
Linus Torvalds 53d8ab29f8 Merge branch 'for-3.14/drivers' of git://git.kernel.dk/linux-block
Pull block IO driver changes from Jens Axboe:

 - bcache update from Kent Overstreet.

 - two bcache fixes from Nicholas Swenson.

 - cciss pci init error fix from Andrew.

 - underflow fix in the parallel IDE pg_write code from Dan Carpenter.
   I'm sure the 1 (or 0) users of that are now happy.

 - two PCI related fixes for sx8 from Jingoo Han.

 - floppy init fix for first block read from Jiri Kosina.

 - pktcdvd error return miss fix from Julia Lawall.

 - removal of IRQF_SHARED from the SEGA Dreamcast CD-ROM code from
   Michael Opdenacker.

 - comment typo fix for the loop driver from Olaf Hering.

 - potential oops fix for null_blk from Raghavendra K T.

 - two fixes from Sam Bradshaw (Micron) for the mtip32xx driver, fixing
   an OOM problem and a problem with handling security locked conditions

* 'for-3.14/drivers' of git://git.kernel.dk/linux-block: (47 commits)
  mg_disk: Spelling s/finised/finished/
  null_blk: Null pointer deference problem in alloc_page_buffers
  mtip32xx: Correctly handle security locked condition
  mtip32xx: Make SGL container per-command to eliminate high order dma allocation
  drivers/block/loop.c: fix comment typo in loop_config_discard
  drivers/block/cciss.c:cciss_init_one(): use proper errnos
  drivers/block/paride/pg.c: underflow bug in pg_write()
  drivers/block/sx8.c: remove unnecessary pci_set_drvdata()
  drivers/block/sx8.c: use module_pci_driver()
  floppy: bail out in open() if drive is not responding to block0 read
  bcache: Fix auxiliary search trees for key size > cacheline size
  bcache: Don't return -EINTR when insert finished
  bcache: Improve bucket_prio() calculation
  bcache: Add bch_bkey_equal_header()
  bcache: update bch_bkey_try_merge
  bcache: Move insert_fixup() to btree_keys_ops
  bcache: Convert sorting to btree_keys
  bcache: Convert debug code to btree_keys
  bcache: Convert btree_iter to struct btree_keys
  bcache: Refactor bset_tree sysfs stats
  ...
2014-01-30 11:40:10 -08:00
Linus Torvalds f568849eda Merge branch 'for-3.14/core' of git://git.kernel.dk/linux-block
Pull core block IO changes from Jens Axboe:
 "The major piece in here is the immutable bio_ve series from Kent, the
  rest is fairly minor.  It was supposed to go in last round, but
  various issues pushed it to this release instead.  The pull request
  contains:

   - Various smaller blk-mq fixes from different folks.  Nothing major
     here, just minor fixes and cleanups.

   - Fix for a memory leak in the error path in the block ioctl code
     from Christian Engelmayer.

   - Header export fix from CaiZhiyong.

   - Finally the immutable biovec changes from Kent Overstreet.  This
     enables some nice future work on making arbitrarily sized bios
     possible, and splitting more efficient.  Related fixes to immutable
     bio_vecs:

        - dm-cache immutable fixup from Mike Snitzer.
        - btrfs immutable fixup from Muthu Kumar.

  - bio-integrity fix from Nic Bellinger, which is also going to stable"

* 'for-3.14/core' of git://git.kernel.dk/linux-block: (44 commits)
  xtensa: fixup simdisk driver to work with immutable bio_vecs
  block/blk-mq-cpu.c: use hotcpu_notifier()
  blk-mq: for_each_* macro correctness
  block: Fix memory leak in rw_copy_check_uvector() handling
  bio-integrity: Fix bio_integrity_verify segment start bug
  block: remove unrelated header files and export symbol
  blk-mq: uses page->list incorrectly
  blk-mq: use __smp_call_function_single directly
  btrfs: fix missing increment of bi_remaining
  Revert "block: Warn and free bio if bi_end_io is not set"
  block: Warn and free bio if bi_end_io is not set
  blk-mq: fix initializing request's start time
  block: blk-mq: don't export blk_mq_free_queue()
  block: blk-mq: make blk_sync_queue support mq
  block: blk-mq: support draining mq queue
  dm cache: increment bi_remaining when bi_end_io is restored
  block: fixup for generic bio chaining
  block: Really silence spurious compiler warnings
  block: Silence spurious compiler warnings
  block: Kill bio_pair_split()
  ...
2014-01-30 11:19:05 -08:00
NeilBrown 7da9d450ab md/raid5: close recently introduced race in stripe_head management.
As release_stripe and __release_stripe decrement ->count and then
manipulate ->lru both under ->device_lock, it is important that
get_active_stripe() increments ->count and clears ->lru also under
->device_lock.

However we currently list_del_init ->lru under the lock, but increment
the ->count outside the lock.  This can lead to races and list
corruption.

So move the atomic_inc(&sh->count) up inside the ->device_lock
protected region.

Note that we still increment ->count without device lock in the case
where get_free_stripe() was called, and in fact don't take
->device_lock at all in that path.
This is safe because if the stripe_head can be found by
get_free_stripe, then the hash lock assures us the no-one else could
possibly be calling release_stripe() at the same time.

Fixes: 566c09c534
Cc: stable@vger.kernel.org (3.13)
Reported-and-tested-by: Ian Kumlien <ian.kumlien@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2014-01-22 11:45:03 +11:00
NeilBrown 9f97e4b128 md/raid5: fix long-standing problem with bitmap handling on write failure.
Before a write starts we set a bit in the write-intent bitmap.
When the write completes we clear that bit if the write was successful
to all devices.  However if the write wasn't fully successful we
should not clear the bit.  If the faulty drive is subsequently
re-added, the fact that the bit is still set ensure that we will
re-write the data that is missing.

This logic is mediated by the STRIPE_DEGRADED flag - we only clear the
bitmap bit when this flag is not set.
Currently we correctly set the flag if a write starts when some
devices are failed or missing.  But we do *not* set the flag if some
device failed during the write attempt.
This is wrong and can result in clearing the bit inappropriately.

So: set the flag when a write fails.

This bug has been present since bitmaps were introduces, so the fix is
suitable for any -stable kernel.

Reported-by: Ethan Wilson <ethan.wilson@shiftmail.org>
Cc: stable@vger.kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
2014-01-16 09:35:38 +11:00