There are cases where the kernel will believe that the WRITE SAME
command is supported by a block device which does not, in fact,
support WRITE SAME. This currently happens for SATA drivers behind a
SAS controller, but there are probably a hundred other ways that can
happen, including drive firmware bugs.
After receiving an error for WRITE SAME the block layer will retry the
request as a plain write of zeroes, but mdraid will consider the
failure as fatal and consider the drive failed. This has the effect
that all the mirrors containing a specific set of data are each
offlined in very rapid succession resulting in data loss.
However, just bouncing the request back up to the block layer isn't
ideal either, because the whole initial request-retry sequence should
be inside the write bitmap fence, which probably means that md needs
to do its own conversion of WRITE SAME to write zero.
Until the failure scenario has been sorted out, disable WRITE SAME for
raid1, raid5, and raid10.
[neilb: added raid5]
This patch is appropriate for any -stable since 3.7 when write_same
support was added.
Cc: stable@vger.kernel.org
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
The patch that converted raid5 to use bio_reset() forgot to initialize
bi_vcnt.
Signed-off-by: Kent Overstreet <koverstreet@google.com>
Cc: NeilBrown <neilb@suse.de>
Cc: linux-raid@vger.kernel.org
Tested-by: Ilia Mirkin <imirkin@alum.mit.edu>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pull block core updates from Jens Axboe:
- Major bit is Kents prep work for immutable bio vecs.
- Stable candidate fix for a scheduling-while-atomic in the queue
bypass operation.
- Fix for the hang on exceeded rq->datalen 32-bit unsigned when merging
discard bios.
- Tejuns changes to convert the writeback thread pool to the generic
workqueue mechanism.
- Runtime PM framework, SCSI patches exists on top of these in James'
tree.
- A few random fixes.
* 'for-3.10/core' of git://git.kernel.dk/linux-block: (40 commits)
relay: move remove_buf_file inside relay_close_buf
partitions/efi.c: replace useless kzalloc's by kmalloc's
fs/block_dev.c: fix iov_shorten() criteria in blkdev_aio_read()
block: fix max discard sectors limit
blkcg: fix "scheduling while atomic" in blk_queue_bypass_start
Documentation: cfq-iosched: update documentation help for cfq tunables
writeback: expose the bdi_wq workqueue
writeback: replace custom worker pool implementation with unbound workqueue
writeback: remove unused bdi_pending_list
aoe: Fix unitialized var usage
bio-integrity: Add explicit field for owner of bip_buf
block: Add an explicit bio flag for bios that own their bvec
block: Add bio_alloc_pages()
block: Convert some code to bio_for_each_segment_all()
block: Add bio_for_each_segment_all()
bounce: Refactor __blk_queue_bounce to not use bi_io_vec
raid1: use bio_copy_data()
pktcdvd: Use bio_reset() in disabled code to kill bi_idx usage
pktcdvd: use bio_copy_data()
block: Add bio_copy_data()
...
If we write to a known-bad-block it will be flags as having
a ReadError by analyse_stripe, but the write will proceed anyway
(as it should). Then the read-error handling will kick in an
write again, then re-read.
We don't need that 'write-again', so set R5_ReWrite so it looks like
it has already been done. Then we will just get the re-read, which we
want.
Reported-by: majianpeng <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
As the function call is the most expensive of these tests it should be
done later in the chain so that it can be avoided in some cases.
Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
This reverts commit 3a366e614d.
Wanlong Gao reports that it causes a kernel panic on his machine several
minutes after boot. Reverting it removes the panic.
Jens says:
"It's not quite clear why that is yet, so I think we should just revert
the commit for 3.9 final (which I'm assuming is pretty close).
The wifi is crap at the LSF hotel, so sending this email instead of
queueing up a revert and pull request."
Reported-by: Wanlong Gao <gaowanlong@cn.fujitsu.com>
Requested-by: Jens Axboe <axboe@kernel.dk>
Cc: Tejun Heo <tj@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Tejun writes:
-----
This is the pull request for the earlier patchset[1] with the same
name. It's only three patches (the first one was committed to
workqueue tree) but the merge strategy is a bit involved due to the
dependencies.
* Because the conversion needs features from wq/for-3.10,
block/for-3.10/core is based on rc3, and wq/for-3.10 has conflicts
with rc3, I pulled mainline (rc5) into wq/for-3.10 to prevent those
workqueue conflicts from flaring up in block tree.
* Resolving the issue that Jan and Dave raised about debugging
requires arch-wide changes. The patchset is being worked on[2] but
it'll have to go through -mm after these changes show up in -next,
and not included in this pull request.
The three commits are located in the following git branch.
git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git writeback-workqueue
Pulling it into block/for-3.10/core produces a conflict in
drivers/md/raid5.c between the following two commits.
e3620a3ad5 ("MD RAID5: Avoid accessing gendisk or queue structs when not available")
2f6db2a707 ("raid5: use bio_reset()")
The conflict is trivial - one removes an "if ()" conditional while the
other removes "rbi->bi_next = NULL" right above it. We just need to
remove both. The merged branch is available at
git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git block-test-merge
so that you can use it for verification. The test merge commit has
proper merge description.
While these changes are a bit of pain to route, they make code simpler
and even have, while minute, measureable performance gain[3] even on a
workload which isn't particularly favorable to showing the benefits of
this conversion.
----
Fixed up the conflict.
Conflicts:
drivers/md/raid5.c
Signed-off-by: Jens Axboe <axboe@kernel.dk>
- recent regressions in raid5
- recent regressions in dmraid
- a few instances of CONFIG_MULTICORE_RAID456 linger
Several tagged for -stable
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.19 (GNU/Linux)
iQIVAwUAUUzCwDnsnt1WYoG5AQJKMhAAsi2XhqLC4Dx19J8MTF6+cjfynWCxF2SC
3mMcVZm6yxSowixb1Ht72CyssWdJAi4vgaw0aLNH7b3CbPDZfTSfqLP4tSvyPfod
aDcFDdd/RhHjDpJqZ52Tyc6QzBfyhwu+s9R+a78TSL47ZMjZpz1QpshG8Sm9JYTs
z72VlIZeglzhWmzO1FInsL/oT/Hwr9IfpmJpuXBQQObDn3BgvZLuzZyCi35upqrM
711ei7CKaN0s/jKcWclNRtgUrr10XsgQ6PugOZbli09CC8ushHwvXe/VmxoQFg2+
Sj14YSfYAY+1QpOiuYc+knrWc7CtPGHgUqBzOoYWMxi9Lqpo5xhD1vkRsFhXxMSg
GVnAnh/RXl7bGzGWaRv8twG4vU+qYOlEPNgO6/079AxCOrrNrstYrgjBxBSWuxrB
0UIFQGT69zA5G3cLbIRrXUxO8oIVeEx92YV1TOcgLKP5OXlp/0I8ajnA9b8KoPZa
He04GdPlZMXTLAqq9MaQRdS0XzX8YQDWbUebqe+w5NW46sLbckkmxaNZs7fOYAfG
CNHfeRsLp5v0oNbhNyCDSjxqH6uYwKCdCqmDxo6A+fmjmDruHQmZoAK8YISUtPtx
u4M82jW6Z/xOg4pomxMl4SxzCDhy1pM8PYzyx7Mj82C4XBR8CkrQTP8XD+FQL2Ih
KhId4tJzx6Q=
=Rycs
-----END PGP SIGNATURE-----
Merge tag 'md-3.9-fixes' of git://neil.brown.name/md
Pull md fixes from NeilBrown:
"A few bugfixes for md
- recent regressions in raid5
- recent regressions in dmraid
- a few instances of CONFIG_MULTICORE_RAID456 linger
Several tagged for -stable"
* tag 'md-3.9-fixes' of git://neil.brown.name/md:
md: remove CONFIG_MULTICORE_RAID456 entirely
md/raid5: ensure sync and DISCARD don't happen at the same time.
MD: Prevent sysfs operations on uninitialized kobjects
MD RAID5: Avoid accessing gendisk or queue structs when not available
md/raid5: schedule_construction should abort if nothing to do.
Had to shuffle the code around a bit (where bi_rw and bi_end_io were
set), but shouldn't really be anything tricky here
Signed-off-by: Kent Overstreet <koverstreet@google.com>
CC: Jens Axboe <axboe@kernel.dk>
CC: NeilBrown <neilb@suse.de>
Bunch of places in the code weren't using it where they could be -
this'll reduce the size of the patch that puts bi_sector/bi_size/bi_idx
into a struct bvec_iter.
Signed-off-by: Kent Overstreet <koverstreet@google.com>
CC: Jens Axboe <axboe@kernel.dk>
CC: "Ed L. Cashin" <ecashin@coraid.com>
CC: Nick Piggin <npiggin@kernel.dk>
CC: Jiri Kosina <jkosina@suse.cz>
CC: Jim Paris <jim@jtan.com>
CC: Geoff Levand <geoff@infradead.org>
CC: Alasdair Kergon <agk@redhat.com>
CC: dm-devel@redhat.com
CC: Neil Brown <neilb@suse.de>
CC: Steven Rostedt <rostedt@goodmis.org>
Acked-by: Ed Cashin <ecashin@coraid.com>
Just a little convenience macro - main reason to add it now is preparing
for immutable bio vecs, it'll reduce the size of the patch that puts
bi_sector/bi_size/bi_idx into a struct bvec_iter.
Signed-off-by: Kent Overstreet <koverstreet@google.com>
CC: Jens Axboe <axboe@kernel.dk>
CC: Lars Ellenberg <drbd-dev@lists.linbit.com>
CC: Jiri Kosina <jkosina@suse.cz>
CC: Alasdair Kergon <agk@redhat.com>
CC: dm-devel@redhat.com
CC: Neil Brown <neilb@suse.de>
CC: Martin Schwidefsky <schwidefsky@de.ibm.com>
CC: Heiko Carstens <heiko.carstens@de.ibm.com>
CC: linux-s390@vger.kernel.org
CC: Chris Mason <chris.mason@fusionio.com>
CC: Steven Whitehouse <swhiteho@redhat.com>
Acked-by: Steven Whitehouse <swhiteho@redhat.com>
A number of problems can occur due to races between
resync/recovery and discard.
- if sync_request calls handle_stripe() while a discard is
happening on the stripe, it might call handle_stripe_clean_event
before all of the individual discard requests have completed
(so some devices are still locked, but not all).
Since commit ca64cae960
md/raid5: Make sure we clear R5_Discard when discard is finished.
this will cause R5_Discard to be cleared for the parity device,
so handle_stripe_clean_event() will not be called when the other
devices do become unlocked, so their ->written will not be cleared.
This ultimately leads to a WARN_ON in init_stripe and a lock-up.
- If handle_stripe_clean_event() does clear R5_UPTODATE at an awkward
time for resync, it can lead to s->uptodate being less than disks
in handle_parity_checks5(), which triggers a BUG (because it is
one).
So:
- keep R5_Discard on the parity device until all other devices have
completed their discard request
- make sure we don't try to have a 'discard' and a 'sync' action at
the same time.
This involves a new stripe flag to we know when a 'discard' is
happening, and the use of R5_Overlap on the parity disk so when a
discard is wanted while a sync is active, so we know to wake up
the discard at the appropriate time.
Discard support for RAID5 was added in 3.7, so this is suitable for
any -stable kernel since 3.7.
Cc: stable@vger.kernel.org (v3.7+)
Reported-by: Jes Sorensen <Jes.Sorensen@redhat.com>
Tested-by: Jes Sorensen <Jes.Sorensen@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
MD RAID5: Fix kernel oops when RAID4/5/6 is used via device-mapper
Commit a9add5d (v3.8-rc1) added blktrace calls to the RAID4/5/6 driver.
However, when device-mapper is used to create RAID4/5/6 arrays, the
mddev->gendisk and mddev->queue fields are not setup. Therefore, calling
things like trace_block_bio_remap will cause a kernel oops. This patch
conditionalizes those calls on whether the proper fields exist to make
the calls. (Device-mapper will call trace_block_bio_remap on its own.)
This patch is suitable for the 3.8.y stable kernel.
Cc: stable@vger.kernel.org (v3.8+)
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Since commit 1ed850f356
md/raid5: make sure to_read and to_write never go negative.
It has been possible for handle_stripe_dirtying to be called
when there isn't actually any work to do.
It then calls schedule_reconstruction() which will set R5_LOCKED
on the parity block(s) even when nothing else is happening.
This then causes problems in do_release_stripe().
So add checks to schedule_reconstruction() so that if it doesn't
find anything to do, it just aborts.
This bug was introduced in v3.7, so the patch is suitable
for -stable kernels since then.
Cc: stable@vger.kernel.org (v3.7+)
Reported-by: majianpeng <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
mostly little bugfixes.
Only "feature" is a new RAID10 layout which slightly
improves the number of sets of devices that can concurrently
fail, without data loss.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.19 (GNU/Linux)
iQIVAwUAUTPm+znsnt1WYoG5AQLLsw/+PMqr8roC4twgxTWV1NRbU8NtOcRi9Rj9
uvBS63uYAaLdi/D3UBKFYczmNCu9knuXbcp9SgFDxH7LlthQsWN/GYnif06pPo3w
9Agu5M8c062TJEG1vrnX6FhPO6pNgrWFr3h+CKkTiD3179i9DoQpP8LXQToeyMtI
YRMQf/zCkxYtDvWAP0iwsEWtw8cf+q9I/uGPhQ1L+DnZapXYdbtnqWBRz9q6mrDt
orcGrP41aZHvnOHUaTbwmaorCKkf/Ys4SMaGenrSFpnpQMypt7VgNuwHC59LxvJT
5eiFG/26zIsv7Wk0jv/TvFP5qzUPo0/PFkd5ug0ArvbVRiXS2cMJDwQvMdO1toxD
i5Bb+P9DptadvoWhOTgIpxnG77yRH45wJvyJOk+ZfS1/IO87nCRa3d0yiNOU5e2/
o0VdXPZRr72sdKKTK6kQuYfwCPb+Z2Pz6Q8BJdk6GxlmTXyP6sKhIgwUX86534fE
LrOxfK8qV+GetVu3X02RoX2CyJJRQHXyXmbHuSzXuo/JiOYtDigAydwNZChvf+tf
OoMY9K8vgNbhnGsUG6la7XPvZ+6dZMjdnxp2HB99Ml5A3PWZd75i5T6IHHxIQFbD
C3z9PWTWP+hK4k15DEyjlELtsE9WduGTXG4kUcf328xJ/7lj4VIImVugdCz+1B6z
+HlI6BiLwzY=
=YdVD
-----END PGP SIGNATURE-----
Merge tag 'md-3.9' of git://neil.brown.name/md
Pull md updates from NeilBrown:
"Mostly little bugfixes.
Only "feature" is a new RAID10 layout which slightly improves the
number of sets of devices that can concurrently fail, without data
loss."
* tag 'md-3.9' of git://neil.brown.name/md:
md: expedite metadata update when switching read-auto -> active
md: remove CONFIG_MULTICORE_RAID456
md/raid1,raid10: fix deadlock with freeze_array()
md/raid0: improve error message when converting RAID4-with-spares to RAID0
md: raid0: fix error return from create_stripe_zones.
md: fix two bugs when attempting to resize RAID0 array.
DM RAID: Add support for MD's RAID10 "far" and "offset" algorithms
MD RAID10: Improve redundancy for 'far' and 'offset' algorithms (part 2)
MD RAID10: Improve redundancy for 'far' and 'offset' algorithms (part 1)
MD RAID10: Minor non-functional code changes
md: raid1,10: Handle REQ_WRITE_SAME flag in write bios
md: protect against crash upon fsync on ro array
Pull block IO core bits from Jens Axboe:
"Below are the core block IO bits for 3.9. It was delayed a few days
since my workstation kept crashing every 2-8h after pulling it into
current -git, but turns out it is a bug in the new pstate code (divide
by zero, will report separately). In any case, it contains:
- The big cfq/blkcg update from Tejun and and Vivek.
- Additional block and writeback tracepoints from Tejun.
- Improvement of the should sort (based on queues) logic in the plug
flushing.
- _io() variants of the wait_for_completion() interface, using
io_schedule() instead of schedule() to contribute to io wait
properly.
- Various little fixes.
You'll get two trivial merge conflicts, which should be easy enough to
fix up"
Fix up the trivial conflicts due to hlist traversal cleanups (commit
b67bfe0d42ca: "hlist: drop the node parameter from iterators").
* 'for-3.9/core' of git://git.kernel.dk/linux-block: (39 commits)
block: remove redundant check to bd_openers()
block: use i_size_write() in bd_set_size()
cfq: fix lock imbalance with failed allocations
drivers/block/swim3.c: fix null pointer dereference
block: don't select PERCPU_RWSEM
block: account iowait time when waiting for completion of IO request
sched: add wait_for_completion_io[_timeout]
writeback: add more tracepoints
block: add block_{touch|dirty}_buffer tracepoint
buffer: make touch_buffer() an exported function
block: add @req to bio_{front|back}_merge tracepoints
block: add missing block_bio_complete() tracepoint
block: Remove should_sort judgement when flush blk_plug
block,elevator: use new hashtable implementation
cfq-iosched: add hierarchical cfq_group statistics
cfq-iosched: collect stats from dead cfqgs
cfq-iosched: separate out cfqg_stats_reset() from cfq_pd_reset_stats()
blkcg: make blkcg_print_blkgs() grab q locks instead of blkcg lock
block: RCU free request_queue
blkcg: implement blkg_[rw]stat_recursive_sum() and blkg_[rw]stat_merge()
...
I'm not sure why, but the hlist for each entry iterators were conceived
list_for_each_entry(pos, head, member)
The hlist ones were greedy and wanted an extra parameter:
hlist_for_each_entry(tpos, pos, head, member)
Why did they need an extra pos parameter? I'm not quite sure. Not only
they don't really need it, it also prevents the iterator from looking
exactly like the list iterator, which is unfortunate.
Besides the semantic patch, there was some manual work required:
- Fix up the actual hlist iterators in linux/list.h
- Fix up the declaration of other iterators based on the hlist ones.
- A very small amount of places were using the 'node' parameter, this
was modified to use 'obj->member' instead.
- Coccinelle didn't handle the hlist_for_each_entry_safe iterator
properly, so those had to be fixed up manually.
The semantic patch which is mostly the work of Peter Senna Tschudin is here:
@@
iterator name hlist_for_each_entry, hlist_for_each_entry_continue, hlist_for_each_entry_from, hlist_for_each_entry_rcu, hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh, for_each_busy_worker, ax25_uid_for_each, ax25_for_each, inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each, sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound, hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu, nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each, nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp, for_each_host;
type T;
expression a,c,d,e;
identifier b;
statement S;
@@
-T b;
<+... when != b
(
hlist_for_each_entry(a,
- b,
c, d) S
|
hlist_for_each_entry_continue(a,
- b,
c) S
|
hlist_for_each_entry_from(a,
- b,
c) S
|
hlist_for_each_entry_rcu(a,
- b,
c, d) S
|
hlist_for_each_entry_rcu_bh(a,
- b,
c, d) S
|
hlist_for_each_entry_continue_rcu_bh(a,
- b,
c) S
|
for_each_busy_worker(a, c,
- b,
d) S
|
ax25_uid_for_each(a,
- b,
c) S
|
ax25_for_each(a,
- b,
c) S
|
inet_bind_bucket_for_each(a,
- b,
c) S
|
sctp_for_each_hentry(a,
- b,
c) S
|
sk_for_each(a,
- b,
c) S
|
sk_for_each_rcu(a,
- b,
c) S
|
sk_for_each_from
-(a, b)
+(a)
S
+ sk_for_each_from(a) S
|
sk_for_each_safe(a,
- b,
c, d) S
|
sk_for_each_bound(a,
- b,
c) S
|
hlist_for_each_entry_safe(a,
- b,
c, d, e) S
|
hlist_for_each_entry_continue_rcu(a,
- b,
c) S
|
nr_neigh_for_each(a,
- b,
c) S
|
nr_neigh_for_each_safe(a,
- b,
c, d) S
|
nr_node_for_each(a,
- b,
c) S
|
nr_node_for_each_safe(a,
- b,
c, d) S
|
- for_each_gfn_sp(a, c, d, b) S
+ for_each_gfn_sp(a, c, d) S
|
- for_each_gfn_indirect_valid_sp(a, c, d, b) S
+ for_each_gfn_indirect_valid_sp(a, c, d) S
|
for_each_host(a,
- b,
c) S
|
for_each_host_safe(a,
- b,
c, d) S
|
for_each_mesh_entry(a,
- b,
c, d) S
)
...+>
[akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c]
[akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c]
[akpm@linux-foundation.org: checkpatch fixes]
[akpm@linux-foundation.org: fix warnings]
[akpm@linux-foudnation.org: redo intrusive kvm changes]
Tested-by: Peter Senna Tschudin <peter.senna@gmail.com>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This doesn't seem to actually help and we have an alternate
multi-threading approach waiting in the wings, so just get
rid of this config option and associated code.
As a bonus, we remove one use of CONFIG_EXPERIMENTAL
Cc: Dan Williams <djbw@fb.com>
Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: NeilBrown <neilb@suse.de>
bio completion didn't kick block_bio_complete TP. Only dm was
explicitly triggering the TP on IO completion. This makes
block_bio_complete TP useless for tracers which want to know about
bios, and all other bio based drivers skip generating blktrace
completion events.
This patch makes all bio completions via bio_endio() generate
block_bio_complete TP.
* Explicit trace_block_bio_complete() invocation removed from dm and
the trace point is unexported.
* @rq dropped from trace_block_bio_complete(). bios may fly around
w/o queue associated. Verifying and accessing the assocaited queue
belongs to TP probes.
* blktrace now gets both request and bio completions. Make it ignore
bio completions if request completion path is happening.
This makes all bio based drivers generate blktrace completion events
properly and makes the block_bio_complete TP actually useful.
v2: With this change, block_bio_complete TP could be invoked on sg
commands which have bio's with %NULL bi_bdev. Update TP
assignment code to check whether bio->bi_bdev is %NULL before
dereferencing.
Signed-off-by: Tejun Heo <tj@kernel.org>
Original-patch-by: Namhyung Kim <namhyung@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Alasdair Kergon <agk@redhat.com>
Cc: dm-devel@redhat.com
Cc: Neil Brown <neilb@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Mostly just little fixes. Probably biggest part is
AVX accelerated RAID6 calculations.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.19 (GNU/Linux)
iQIVAwUAUM/w2Dnsnt1WYoG5AQKXlg/9F5juv4CjRkRRFLqZgOPBLmn/s/2Vspgh
2Kv8Jcyixd8jUQNbobZv0ahlJH/iSU61kpOE8QjLbKi5Y42vAbM0ZU2aHJ6nqGZy
HiTI8K+7kTvCK3ZXLcUQ+4oPPBNTcoTZbLWaEOmIqB1ruLddoIR7M9fG3PspVeG0
jijnXR8IfL6mr4YDXnJkEhFrneTysVik05RkKYZKyM/9r3stAoMJ9o0/EFy3OFxb
lO6mLEtvjVArXcnuf1RMCw2YKgki9Y4r73HCplgQsVFvcxcpsya4gFF+lRR5j7cO
/eMYbSQ89iWEYKh1dJ9u1nofc8fX5ia71QQyO1fkO4GXRHXPVIyBgKSbe7SaL6iG
JUMm7idUV2rZGeq3ln3k8Yor4QqHvN1n7pRKKUF+ZdsPoQ1B/TABu+qpsAdo5ZhP
fxDsULsHrzEaxgetd4V8F2Uptca9ni43sMI8mwsvVlA0p6SOzMIyoJLC9xAZpx11
b3H3+7Oje/fasmszBoq5B9uAlSt9XXVN4DDn2q6cX+S96JSX6jcsN1c6cJBO+ZxB
OU6a6P5mnU6HuxU02rspe7G8BeU+ybaonErOW+GdyC4r7M/cImC0dSp0NGHK2211
oqu0xBx/Q/ddTFwKQqa4HzR2ws09+LhKbjdqYIhCEKttIbLIAjf73ARZ19XPSRRX
pDR/ey2CB6E=
=uK52
-----END PGP SIGNATURE-----
Merge tag 'md-3.8' of git://neil.brown.name/md
Pull md update from Neil Brown:
"Mostly just little fixes. Probably biggest part is AVX accelerated
RAID6 calculations."
* tag 'md-3.8' of git://neil.brown.name/md:
md/raid5: add blktrace calls
md/raid5: use async_tx_quiesce() instead of open-coding it.
md: Use ->curr_resync as last completed request when cleanly aborting resync.
lib/raid6: build proper files on corresponding arch
lib/raid6: Add AVX2 optimized gen_syndrome functions
lib/raid6: Add AVX2 optimized recovery functions
md: Update checkpoint of resync/recovery based on time.
md:Add place to update ->recovery_cp.
md.c: re-indent various 'switch' statements.
md: close race between removing and adding a device.
md: removed unused variable in calc_sb_1_csm.
Pull block driver update from Jens Axboe:
"Now that the core bits are in, here are the driver bits for 3.8. The
branch contains:
- A huge pile of drbd bits that were dumped from the 3.7 merge
window. Following that, it was both made perfectly clear that
there is going to be no more over-the-wall pulls and how the
situation on individual pulls can be improved.
- A few cleanups from Akinobu Mita for drbd and cciss.
- Queue improvement for loop from Lukas. This grew into adding a
generic interface for waiting/checking an even with a specific
lock, allowing this to be pulled out of md and now loop and drbd is
also using it.
- A few fixes for xen back/front block driver from Roger Pau Monne.
- Partition improvements from Stephen Warren, allowing partiion UUID
to be used as an identifier."
* 'for-3.8/drivers' of git://git.kernel.dk/linux-block: (609 commits)
drbd: update Kconfig to match current dependencies
drbd: Fix drbdsetup wait-connect, wait-sync etc... commands
drbd: close race between drbd_set_role and drbd_connect
drbd: respect no-md-barriers setting also when changed online via disk-options
drbd: Remove obsolete check
drbd: fixup after wait_even_lock_irq() addition to generic code
loop: Limit the number of requests in the bio list
wait: add wait_event_lock_irq() interface
xen-blkfront: free allocated page
xen-blkback: move free persistent grants code
block: partition: msdos: provide UUIDs for partitions
init: reduce PARTUUID min length to 1 from 36
block: store partition_meta_info.uuid as a string
cciss: use check_signature()
cciss: cleanup bitops usage
drbd: use copy_highpage
drbd: if the replication link breaks during handshake, keep retrying
drbd: check return of kmalloc in receive_uuids
drbd: Broadcast sync progress no more often than once per second
drbd: don't try to clear bits once the disk has failed
...
Pull trivial branch from Jiri Kosina:
"Usual stuff -- comment/printk typo fixes, documentation updates, dead
code elimination."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (39 commits)
HOWTO: fix double words typo
x86 mtrr: fix comment typo in mtrr_bp_init
propagate name change to comments in kernel source
doc: Update the name of profiling based on sysfs
treewide: Fix typos in various drivers
treewide: Fix typos in various Kconfig
wireless: mwifiex: Fix typo in wireless/mwifiex driver
messages: i2o: Fix typo in messages/i2o
scripts/kernel-doc: check that non-void fcts describe their return value
Kernel-doc: Convention: Use a "Return" section to describe return values
radeon: Fix typo and copy/paste error in comments
doc: Remove unnecessary declarations from Documentation/accounting/getdelays.c
various: Fix spelling of "asynchronous" in comments.
Fix misspellings of "whether" in comments.
eisa: Fix spelling of "asynchronous".
various: Fix spelling of "registered" in comments.
doc: fix quite a few typos within Documentation
target: iscsi: fix comment typos in target/iscsi drivers
treewide: fix typo of "suport" in various comments and Kconfig
treewide: fix typo of "suppport" in various comments
...
handle_stripe_expansion contains:
if (tx) {
async_tx_ack(tx);
dma_wait_for_async_tx(tx);
}
which is very similar to the body of async_tx_quiesce(),
except that the later handles an error from dma_wait_for_async_tx()
(admittedly by panicing, but that decision belongs in the dma
code, not the md code).
So just us async_tx_quiesce().
Acked-by: Dan Williams <djbw@fb.com>
Reported-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Signed-off-by: NeilBrown <neilb@suse.de>
New wait_event{_interruptible}_lock_irq{_cmd} macros added. This commit
moves the private wait_event_lock_irq() macro from MD to regular wait
includes, introduces new macro wait_event_lock_irq_cmd() instead of using
the old method with omitting cmd parameter which is ugly and makes a use
of new macros in the MD. It also introduces the _interruptible_ variant.
The use of new interface is when one have a special lock to protect data
structures used in the condition, or one also needs to invoke "cmd"
before putting it to sleep.
All new macros are expected to be called with the lock taken. The lock
is released before sleep and is reacquired afterwards. We will leave the
macro with the lock held.
Note to DM: IMO this should also fix theoretical race on waitqueue while
using simultaneously wait_event_lock_irq() and wait_event() because of
lack of locking around current state setting and wait queue removal.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Cc: Neil Brown <neilb@suse.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
commit 9e44476851
MD: raid5 avoid unnecessary zero page for trim
change raid5 to clear R5_Discard when the complete request is
handled rather than when submitting the per-device discard request.
However it did not clear R5_Discard for the parity device.
This means that if the stripe_head was reused before it expired from
the cache, the setting would be wrong and a hang would result.
Also if the R5_Uptodate bit happens to be set, R5_Discard again
won't be cleared. But R5_Uptodate really should be clear at this point.
So make sure R5_Discard is cleared in all cases, and clear
R5_Uptodate when a 'discard' completes.
Signed-off-by: NeilBrown <neilb@suse.de>
stripe_handle.
The chunk of code in stripe_handle which responds to a
*_result value in reconstruct_state is really the completion
of some processing that happened outside of handle_stripe
(possibly asynchronously) and so should be one of the first
things done in handle_stripe().
After the next patch it will be important that it happens before
handle_stripe_clean_event(), as that will clear some dev->flags
bit that this code tests.
Signed-off-by: NeilBrown <neilb@suse.de>
blkdev_issue_discard currently assumes that the granularity
is a power of 2. So in raid5, round the chosen number up to
avoid embarrassment.
Cc: Shaohua Li <shli@kernel.org>
Signed-off-by: NeilBrown <neilb@suse.de>
"discard" support, some dm-raid improvements and other assorted
bits and pieces.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.18 (GNU/Linux)
iQIVAwUAUHk6Rjnsnt1WYoG5AQKovQ//Ym0ROo5a6uekb2USLyFSdQH3TC7z0v0+
+kujrgoc4nHZU/vj5yfMvPVomEUsAhHEwTkvvCiXFFHn6cxPzC8ezm8d40xEeISX
qp6i2bPlvGURhsW1tYeD+THtY82/oyzQ4Wa/vaE1sjVLQ+caa2q7kVVgAL9Bj/Kz
aESIZjAuPxQNE1674/KR0EmMFcbpd0z1WDV+ydKlRV5jHCHGYf8OmxOenJFf+V/b
/f9p2u+NUq5BN5WLhThcysO8lPX1Y7GG8IYay3DlSt/crU24R2a2j0qh/BDoK8+t
/DceoHipbIiGxXLVjM7y+1RwPpCh75HJSZQHltPype2Z3iwtwEth9uTkEE3M2h/W
tOQEbOZku0kcgsrys7JBmpkBwkR9oZqq1kDd4YBzqW4PiGVP6z0JRH8QpjjB+mjN
47ODYIZcaEYZ+0Jj8kcVxo3gv4Xj4DWH+auSNZihTVmjQPVqrcy3CAt3CkuDzTkY
34fZVuCDiCetLGCGQKrwfMDnySVy5xOmtC6iWsEY5rExAeb0E+BCzcBvbAXzt+ef
MPDsrxWbo/ZkvpuwXOwLFTccBuRtAsFi7CM4jcow53W6XMnPpdubphNw5nylaEm1
DEzfID58mv8VHWRuW15vr7SbtROjYJkEFCIaEK3oprrRUYftZntIABcknqvcIYR+
/ULNzkRU1w4=
=XRmL
-----END PGP SIGNATURE-----
Merge tag 'md-3.7' of git://neil.brown.name/md
Pull md updates from NeilBrown:
- "discard" support, some dm-raid improvements and other assorted bits
and pieces.
* tag 'md-3.7' of git://neil.brown.name/md: (29 commits)
md: refine reporting of resync/reshape delays.
md/raid5: be careful not to resize_stripes too big.
md: make sure manual changes to recovery checkpoint are saved.
md/raid10: use correct limit variable
md: writing to sync_action should clear the read-auto state.
Subject: [PATCH] md:change resync_mismatches to atomic64_t to avoid races
md/raid5: make sure to_read and to_write never go negative.
md: When RAID5 is dirty, force reconstruct-write instead of read-modify-write.
md/raid5: protect debug message against NULL derefernce.
md/raid5: add some missing locking in handle_failed_stripe.
MD: raid5 avoid unnecessary zero page for trim
MD: raid5 trim support
md/bitmap:Don't use IS_ERR to judge alloc_page().
md/raid1: Don't release reference to device while handling read error.
raid: replace list_for_each_continue_rcu with new interface
add further __init annotations to crypto/xor.c
DM RAID: Fix for "sync" directive ineffectiveness
DM RAID: Fix comparison of index and quantity for "rebuild" parameter
DM RAID: Add rebuild capability for RAID10
DM RAID: Move 'rebuild' checking code to its own function
...
When a RAID5 is reshaping, conf->raid_disks is increased
before mddev->delta_disks becomes zero.
This can result in check_reshape calling resize_stripes with a
number that is too large. This particularly happens
when md_check_recovery calls ->check_reshape().
If we use ->previous_raid_disks, we don't risk this.
Signed-off-by: NeilBrown <neilb@suse.de>
Now that multiple threads can handle stripes, it is safer to
use an atomic64_t for resync_mismatches, to avoid update races.
Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
to_read and to_write are part of the result of analysing
a stripe before handling it.
Their use is to avoid some loops and tests if the values are
known to be zero. Thus it is not a problem if they are a
little bit larger than they should be.
So decrementing them in handle_failed_stripe serves little value, and
due to races it could cause some loops to be skipped incorrectly.
So remove those decrements.
Reported-by: "Jianpeng Ma" <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
The pr_debug in add_stripe_bio could race with something
changing *bip, so it is best to hold the lock until
after the pr_debug.
Reported-by: "Jianpeng Ma" <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
We really should hold the stripe_lock while accessing
'toread' else we could race with add_stripe_bio and corrupt
a list.
Reported-by: "Jianpeng Ma" <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
We want to avoid zero discarded dev page, because it's useless for discard.
But if we don't zero it, another read/write hit such page in the cache and will
get inconsistent data.
To avoid zero the page, we don't set R5_UPTODATE flag after construction is
done. In this way, discard write request is still issued and finished, but read
will not hit the page. If the stripe gets accessed soon, we need reread the
stripe, but since the chance is low, the reread isn't a big deal.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Discard for raid4/5/6 has limitation. If discard request size is
small, we do discard for one disk, but we need calculate parity and
write parity disk. To correctly calculate parity, zero_after_discard
must be guaranteed. Even it's true, we need do discard for one disk
but write another disks, which makes the parity disks wear out
fast. This doesn't make sense. So an efficient discard for raid4/5/6
should discard all data disks and parity disks, which requires the
write pattern to be (A, A+chunk_size, A+chunk_size*2...). If A's size
is smaller than chunk_size, such pattern is almost impossible in
practice. So in this patch, I only handle the case that A's size
equals to chunk_size. That is discard request should be aligned to
stripe size and its size is multiple of stripe size.
Since we can only handle request with specific alignment and size (or
part of the request fitting stripes), we can't guarantee
zero_after_discard even zero_after_discard is true in low level
drives.
The block layer doesn't send down correctly aligned requests even
correct discard alignment is set, so I must filter out.
For raid4/5/6 parity calculation, if data is 0, parity is 0. So if
zero_after_discard is true for all disks, data is consistent after
discard. Otherwise, data might be lost. Let's consider a scenario:
discard a stripe, write data to one disk and write parity disk. The
stripe could be still inconsistent till then depending on using data
from other data disks or parity disks to calculate new parity. If the
disk is broken, we can't restore it. So in this patch, we only enable
discard support if all disks have zero_after_discard.
If discard fails in one disk, we face the similar inconsistent issue
above. The patch will make discard follow the same path as normal
write request. If discard fails, a resync will be scheduled to make
the data consistent. This isn't good to have extra writes, but data
consistency is important.
If a subsequent read/write request hits raid5 cache of a discarded
stripe, the discarded dev page should have zero filled, so the data is
consistent. This patch will always zero dev page for discarded request
stripe. This isn't optimal because discard request doesn't need such
payload. Next patch will avoid it.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Change the thread parameter, so the thread can carry extra info. Next patch
will use it.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
commit b17459c050
raid5: add a per-stripe lock
added a spin_lock to the 'stripe_head' struct.
Unfortunately there are two places where this struct is allocated
but the spin lock was only initialised in one of them.
So add the missing spin_lock_init.
Signed-off-by: NeilBrown <neilb@suse.de>
When a replacement device becomes active, we mark the device that it
replaces as 'faulty' so that it can subsequently get removed.
However 'calc_degraded' only pays attention to the primary device, not
the replacement, so the array appears to become degraded, which is
wrong.
So teach 'calc_degraded' to consider any replacement if a primary
device is faulty.
This is suitable for -stable as an incorrect 'degraded' value can
confuse md and could lead to data corruption.
This is only relevant for 3.3 and later.
Cc: stable@vger.kernel.org
Reported-by: Robin Hill <robin@robinhill.me.uk>
Reported-by: John Drescher <drescherjm@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
This reverts commit 895e3c5c58.
While this patch seemed like a good idea and did help some workloads,
it hurts other workloads.
Large sequential O_DIRECT writes were faster,
Small random O_DIRECT writes were slower.
Other changes (batching RAID5 writes) have improved the sequential
writes using a different mechanism, so the net result of this patch
is definitely negative. So revert it.
Reported-by: Shaohua Li <shli@kernel.org>
Tested-by: Jianpeng Ma <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
This contains a few patches that depend on
plugging changes in the block layer so needs to wait
for those.
It also contains a Kconfig fix for the new RAID10 support
in dm-raid.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.18 (GNU/Linux)
iQIVAwUAUBnKUznsnt1WYoG5AQJOQA/+M7RoVnF63+TbGIqdNDotuF8FxvudCZBl
Ou2yG47EOPtWf/RoqPyfpydDgdjyXsk4T5TfXoc0hsXVr4shCYo51uT9K34TMSDJ
2GzGWuyugRJFyvxW7PBgM+zFWlcVdgUGcwsdmIUMtHRz8Q10TqO5fE22RNLkhwOl
fvGCK1KYnQqlG87DbulHWMo22vyZVic8jBqFSw55CPuuFMSJMxCw0rOPUnvk5Q8v
jWzZzuUqrM8iiOxTDHsbCA0IleCbGl/m0tgk02Vj4tkCvz9N/xzQW2se0H6uECiK
k8odbAiNBOh1q135sa7ASrBzxT+JqSiQ25rLheTEzzNxjFv6/NlntXmYu6HB+lD3
DoHAvRjgMxiLCdisW6TJb10NItitXwE/HSpQOVRxyYtINdzmhIDaCccgfN8ZMkho
nmE/uzO+CAoCFpZC2C/nY8D0BZs5fw4hgDAsci66mvs+88dy+SoA4AbyNEMAusOS
tiL8ZEjnYXvxTh3JFaMIaqQd6PkbahmtEtvorwXsUYUdY0ybkcs2FYVksvkgYdyW
WlejOZVurY2i5biqck3UqjesxeJA5TMAlAUQR7vXu1Fa9fYFXZbqJom/KnPRTfek
xerCWPMbhuzmcyEjUOGfjs6GFEnEmRT6Q6fN3CBaQMS2Q/z+6AkTOXKVl5Fhvoyl
aeu1m8nZLuI=
=ovN2
-----END PGP SIGNATURE-----
Merge tag 'md-3.6' of git://neil.brown.name/md
Pull additional md update from NeilBrown:
"This contains a few patches that depend on plugging changes in the
block layer so needed to wait for those.
It also contains a Kconfig fix for the new RAID10 support in dm-raid."
* tag 'md-3.6' of git://neil.brown.name/md:
md/dm-raid: DM_RAID should select MD_RAID10
md/raid1: submit IO from originating thread instead of md thread.
raid5: raid5d handle stripe in batch way
raid5: make_request use batch stripe release
Let raid5d handle stripe in batch way to reduce conf->device_lock locking.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
make_request() does stripe release for every stripe and the stripe usually has
count 1, which makes previous release_stripe() optimization not work. In my
test, this release_stripe() becomes the heaviest pleace to take
conf->device_lock after previous patches applied.
Below patch makes stripe release batch. All the stripes will be released in
unplug. The STRIPE_ON_UNPLUG_LIST bit is to protect concurrent access stripe
lru.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Pull block driver changes from Jens Axboe:
- Making the plugging support for drivers a bit more sane from Neil.
This supersedes the plugging change from Shaohua as well.
- The usual round of drbd updates.
- Using a tail add instead of a head add in the request completion for
ndb, making us find the most completed request more quickly.
- A few floppy changes, getting rid of a duplicated flag and also
running the floppy init async (since it takes forever in boot terms)
from Andi.
* 'for-3.6/drivers' of git://git.kernel.dk/linux-block:
floppy: remove duplicated flag FD_RAW_NEED_DISK
blk: pass from_schedule to non-request unplug functions.
block: stack unplug
blk: centralize non-request unplug handling.
md: remove plug_cnt feature of plugging.
block/nbd: micro-optimization in nbd request completion
drbd: announce FLUSH/FUA capability to upper layers
drbd: fix max_bio_size to be unsigned
drbd: flush drbd work queue before invalidate/invalidate remote
drbd: fix potential access after free
drbd: call local-io-error handler early
drbd: do not reset rs_pending_cnt too early
drbd: reset congestion information before reporting it in /proc/drbd
drbd: report congestion if we are waiting for some userland callback
drbd: differentiate between normal and forced detach
drbd: cleanup, remove two unused global flags
floppy: Run floppy initialization asynchronous
This seemed like a good idea at the time, but after further thought I
cannot see it making a difference other than very occasionally and
testing to try to exercise the case it is most likely to help did not
show any performance difference by removing it.
So remove the counting of active plugs and allow 'pending writes' to
be activated at any time, not just when no plugs are active.
This is only relevant when there is a write-intent bitmap, and the
updating of the bitmap will likely introduce enough delay that
the single-threading of bitmap updates will be enough to collect large
numbers of updates together.
Removing this will make it easier to centralise the unplug code, and
will clear the other for other unplug enhancements which have a
measurable effect.
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
'sync' writes set both REQ_SYNC and REQ_NOIDLE.
O_DIRECT writes set REQ_SYNC but not REQ_NOIDLE.
We currently assume that a REQ_SYNC request will not be followed by
more requests and so set STRIPE_PREREAD_ACTIVE to expedite the
request.
This is appropriate for sync requests, but not for O_DIRECT requests.
So make the setting of STRIPE_PREREAD_ACTIVE conditional on REQ_NOIDLE
rather than REQ_SYNC. This is consistent with the documented meaning
of REQ_NOIDLE:
__REQ_NOIDLE, /* don't anticipate more IO after this one */
Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Because bios will merge at block-layer,so bios-error may caused by other
bio which be merged into to the same request.
Using this flag,it will find exactly error-sector and not do redundant
operation like re-write and re-read.
V0->V1:Using REQ_FLUSH instead REQ_NOMERGE avoid bio merging at block
layer.
Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Add a per-stripe lock to protect stripe specific data. The purpose is to reduce
lock contention of conf->device_lock.
stripe ->toread, ->towrite are protected by per-stripe lock. Accessing bio
list of the stripe is always serialized by this lock, so adding bio to the
lists (add_stripe_bio()) and removing bio from the lists (like
ops_run_biofill()) not race.
If bio in ->read, ->written ... list are not shared by multiple stripes, we
don't need any lock to protect ->read, ->written, because STRIPE_ACTIVE will
protect them. If the bio are shared, there are two protections:
1. bi_phys_segments acts as a reference count
2. traverse the list uses r5_next_bio, which makes traverse never access bio
not belonging to the stripe
Let's have an example:
| stripe1 | stripe2 | stripe3 |
...bio1......|bio2|bio3|....bio4.....
stripe2 has 4 bios, when it's finished, it will decrement bi_phys_segments for
all bios, but only end_bio for bio2 and bio3. bio1->bi_next still points to
bio2, but this doesn't matter. When stripe1 is finished, it will not touch bio2
because of r5_next_bio check. Next time stripe1 will end_bio for bio1 and
stripe3 will end_bio bio4.
before add_stripe_bio() addes a bio to a stripe, we already increament the bio
bi_phys_segments, so don't worry other stripes release the bio.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Neil pointed out the bitmap write optimization in handle_stripe_clean_event()
is unnecessary, because the chance one stripe gets written twice in the mean
time is rare. We can always do a bitmap_startwrite when a write request is
added to a stripe and bitmap_endwrite after write request is done. Delete the
optimization. With it, we can delete some cases of device_lock.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Raid5 overrides bio->bi_phys_segments, accessing it is with device_lock hold,
which is unnecessary, We can make it lockless actually.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
release_stripe() is a place conf->device_lock is heavily contended. We take the
lock even stripe count isn't 1, which isn't required.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
The value returned by "mddev_check_plug" is only valid until the
next 'schedule' as that will unplug things. This could happen at any
call to mempool_alloc.
So just calling mddev_check_plug at the start doesn't really make
sense.
So call it just before, or just after, queuing things for the thread.
As the action that happens at unplug is to wake the thread, this makes
lots of sense.
If we cannot add a plug (which requires a small GFP_ATOMIC alloc) we
wake thread immediately.
RAID5 is a bit different. Requests are queued for the thread and the
thread is woken by release_stripe. So we don't need to wake the
thread on failure.
However the thread doesn't perform certain actions when there is any
active plug, so it is important to install a plug before waking the
thread. So for RAID5 we install the plug *before* queuing the request
and waking the thread.
Without this patch it is possible for raid1 or raid10 to queue a
request without then waking the thread, resulting in the array locking
up.
Also change raid10 to only flush_pending_write when there are not
active plugs, just like raid1.
This patch is suitable for 3.0 or later. I plan to submit it to
-stable, but I'll like to let it spend a few weeks in mainline
first to be sure it is completely safe.
Signed-off-by: NeilBrown <neilb@suse.de>
There isn't locking setting STRIPE_DELAYED and STRIPE_PREREAD_ACTIVE bits, but
the two bits have relationship. A delayed stripe can be moved to hold list only
when preread active stripe count is below IO_THRESHOLD. If a stripe has both
the bits set, such stripe will be in delayed list and preread count not 0,
which will make such stripe never leave delayed list.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
We may not be able to fix a bad block if:
- the array is degraded
- the over-write fails.
In these cases we currently eject the device, but we should
record a bad block if possible.
Signed-off-by: majianpeng <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Having the 'name' arg optional and defaulting to the current
personality name is no necessary and leads to errors, as when
changing the level of an array we can end up using the
name of the old level instead of the new one.
So make it non-optional and always explicitly pass the name
of the level that the array will be.
Reported-by: majianpeng <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
commit 43220aa0f2
md/raid5: fix a hang on device failure.
fixed a hang, but introduced a refcounting in-balance so
that if the presence of bad-blocks ever caused an rdev to
be 'blocked' we would increment the refcount on the rdev and
never decrement it.
So added the needed rdev_dec_pending when md_wait_for_blocked_rdev
is not called.
Reported-by: majianpeng <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
In ops_run_io(), the call to md_wait_for_blocked_rdev will decrement
nr_pending so we lose the reference we hold on the rdev.
So atomic_inc it first to maintain the reference.
This bug was introduced by commit 73e92e51b7
md/raid5. Don't write to known bad block on doubtful devices.
which appeared in 3.0, so patch is suitable for stable kernels since
then.
Cc: stable@vger.kernel.org
Signed-off-by: majianpeng <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
In chunk_aligned_read() we are adding data_offset before calling
is_badblock. But is_badblock also adds data_offset, so that is bad.
So move the addition of data_offset to after the call to
is_badblock.
This bug was introduced by commit 31c176ecdf
md/raid5: avoid reading from known bad blocks.
which first appeared in 3.0. So that patch is suitable for any
-stable kernel from 3.0.y onwards. However it will need minor
revision for most of those (as the comment didn't appear until
recently).
Cc: stable@vger.kernel.org
Signed-off-by: majianpeng <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
If a RAID5 has both a failed device and a device marked as
'WantReplacement', then we should preferentially replace the failed
device.
However the current code replaces whichever is found first.
So split into 2 loops, check fail failed/missing first, and only check
for WantReplacement if nothing is failed or missing.
Reported-by: majianpeng <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
After a reshape which reduced the number of devices we need
to disconnect the extra devices.
The code for this doesn't currently handle 'replacement' devices.
It is very unlikely that such devices will be present, but it is
safest to handle them anyway.
So simplify the handling. Just clear In_sync and leave it
to remove_and_add_spaces (which will be called soon) to do
the real works.
Signed-off-by: NeilBrown <neilb@suse.de>
We always should have allowed this. A raid5 reshape doesn't change
the size of the bitmap, so not need to restrict it.
Also add a test to make sure we don't try to start a reshape on a
failed array.
Signed-off-by: NeilBrown <neilb@suse.de>
Now that bitmaps can be resized, we can allow an array to be resized
while the bitmap is present.
This only covers resizing that involves changing the effective size
of member devices, not resizing that changes the number of devices.
Signed-off-by: NeilBrown <neilb@suse.de>
REQ_SYNC is ignored in current raid5 code. Block layer does use it to do
policy,
for example ioscheduler. This patch adds it.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
The important issue here is incorporating the different in data_offset
into calculations concerning when we might need to over-write data
that is still thought to be valid.
To this end we find the minimum offset difference across all devices
and add that where appropriate.
Signed-off-by: NeilBrown <neilb@suse.de>
As there can now be two different data_offsets - an 'old' and
a 'new' - we need to carefully choose between them.
Signed-off-by: NeilBrown <neilb@suse.de>
When reshaping we can avoid costly intermediate backup by
changing the 'start' address of the array on the device
(if there is enough room).
So as a first step, allow such a change to be requested
through sysfs, and recorded in v1.x metadata.
(As we didn't previous check that all 'pad' fields were zero,
we need a new FEATURE flag for this.
A (belatedly) check that all remaining 'pad' fields are
zero to avoid a repeat of this)
The new data offset must be requested separately for each device.
This allows each to have a different change in the data offset.
This is not likely to be used often but as data_offset can be
set per-device, new_data_offset should be too.
This patch also removes the 'acknowledged' arg to rdev_set_badblocks as
it is never used and never will be. At the same time we add a new
arg ('in_new') which is currently always zero but will be used more
soon.
When a reshape finishes we will need to update the data_offset
and rdev->sectors. So provide an exported function to do that.
Signed-off-by: NeilBrown <neilb@suse.de>
Currently a reshape operation always progresses from the start
of the array to the end unless the number of devices is being
reduced, in which case it progressed in the opposite direction.
To reverse a partial reshape which changes the number of devices
you can stop the array and re-assemble with the raid-disks numbers
reversed and it will undo.
However for a reshape that does not change the number of devices
it is not possible to reverse the reshape in the middle - you have to
wait until it completes.
So add a 'reshape_direction' attribute with is either 'forwards' or
'backwards' and can be explicitly set when delta_disks is zero.
This will become more important when we allow the data_offset to
change in a reshape. Then the explicit statement of what direction is
being used will be more useful.
This can be enabled in raid5 trivially as it already supports
reverse reshape and just needs to use a different trigger to request it.
Signed-off-by: NeilBrown <neilb@suse.de>
When create a raid5 using assume-clean and echo check or repair to
sync_action.Then component disks did not operated IO but the raid
check/resync faster than normal.
Because the judgement in function analyse_stripe():
if (do_recovery ||
sh->sector >= conf->mddev->recovery_cp)
s->syncing = 1;
else
s->replacing = 1;
When check or repair,the recovery_cp == MaxSectore,so syncing equal zero
not one.
This bug was introduced by commit 9a3e1101b8
md/raid5: detect and handle replacements during recovery.
so this patch is suitable for 3.3-stable.
Cc: stable@vger.kernel.org
Signed-off-by: majianpeng <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
1/ We can only treat a known-bad-block like a read-error if we
have the data that belongs in that block. So fix that test.
2/ If we cannot recovery a stripe due to insufficient data,
don't tell "md_done_sync" that the sync failed unless we really
did fail something. If we successfully record bad blocks,
that is success.
Reported-by: "majianpeng" <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
md.h has an 'rdev_for_each()' macro for iterating the rdevs in an
mddev. However it uses the 'safe' version of list_for_each_entry,
and so requires the extra variable, but doesn't include 'safe' in the
name, which is useful documentation.
Consequently some places use this safe version without needing it, and
many use an explicity list_for_each entry.
So:
- rename rdev_for_each to rdev_for_each_safe
- create a new rdev_for_each which uses the plain
list_for_each_entry,
- use the 'safe' version only where needed, and convert all other
list_for_each_entry calls to use rdev_for_each.
Signed-off-by: NeilBrown <neilb@suse.de>
When an array is failed (some data inaccessible) then there is no
point attempting to add a spare as it could not possibly be recovered.
However that may be value in re-adding a recently removed device.
e.g. if there is a write-intent-bitmap and it is clear, then access
to the data could be restored by this action.
So don't reject a re-add to a failed array for RAID10 and RAID5 (the
only arrays types that check for a failed array).
Signed-off-by: NeilBrown <neilb@suse.de>
Now that WantReplacement drives are replaced cleanly, mark a drive
as WantReplacement when we see a write error. It might get failed soon so
the WantReplacement flag is irrelevant, but if the write error is recorded
in the bad block log, we still want to activate any spare that might
be available.
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
When attempting to add a spare to a RAID[456] array, also consider
adding it as a replacement for a want_replacement device.
This requires that common md code attempt hot_add even when the array
is not formally degraded.
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
If a Replacement is seen, file it as such.
If we see two replacements (or two normal devices) for the one slot,
abort.
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
When recovery completes - as reported by a call to ->spare_active,
we clear In_sync on the original and set it on the replacement.
Then when the original gets removed we move the replacement from
'replacement' to 'rdev'.
This could race with other code that is looking at these pointers,
so we use memory barriers and careful ordering to ensure that
a reader might see one device twice, but never no devices.
Then the readers guard against using both devices, which could
only happen when writing.
Signed-off-by: NeilBrown <neilb@suse.de>
During recovery we want to write to the replacement but not
the original. So we have two new flags
- R5_NeedReplace if this stripe has a replacement that needs to
be written at some stage
- R5_WantReplace if NeedReplace, and the data is available, and
a 'sync' has been requested on this stripe.
We also distinguish between 'sync and replace' which need to read
all other devices, and 'replace' which only needs to read the
devices being replaced.
Note that during resync we always write to any replacement device.
It might not need to be written to, but as we don't read to compare,
we have to write to be sure.
Signed-off-by: NeilBrown <neilb@suse.de>
When writing, we need to submit two writes, one to the original, and
one to the replacement - if there is a replacement.
If the write to the replacement results in a write error, we just fail
the device. We only try to record write errors to the original.
When writing for recovery, we shouldn't write to the original. This
will be addressed in a subsequent patch that generally addresses
recovery.
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Enhance raid5_remove_disk to be able to remove ->replacement
as well as ->rdev.
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
If a replacement device is present and has been recovered far enough,
then use it for reading into the stripe cache.
If we get an error we don't try to repair it, we just fail the device.
A replacement device that gives errors does not sound sensible.
This requires removing the setting of R5_ReadError when we get
a read error during a read that bypasses the cache. It was probably
a bad idea anyway as we don't know that every block in the read
caused an error, and it could cause ReadError to be set for the
replacement device, which is bad.
Signed-off-by: NeilBrown <neilb@suse.de>
We current initialise some fields of a bio when preparing a
stripe_head, and again just before submitting the request.
Remove the duplication by only setting the fields that lower level
devices don't touch in raid5_build_block, and only set the changeable
fields in ops_run_io.
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Just enhance data structures to record a second device per slot to be
used as a 'replacement' device, replacing the original.
We also have a second bio in each slot in each stripe_head. This will
only be used when writing to the array - we need to write to both the
original and the replacement at the same time, so will need two bios.
For now, only try using the replacement drive for aligned-reads.
In this case, we prefer the replacement if it has been recovered far
enough, otherwise use the original.
This includes a small enhancement. Previously we would only do
aligned reads if the target device was fully recovered. Now we also
do them if it has recovered far enough.
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Soon an array will be able to have multiple devices with the
same raid_disk number (an original and a replacement). So removing
a device based on the number won't work. So pass the actual device
handle instead.
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
When an array is being reshaped to change the number of devices,
the two halves can be differently degraded. e.g. one could be
missing a device and the other not.
So we need to be more careful about calculating the 'degraded'
attribute.
Instead of just inc/dec at appropriate times, perform a full
re-calculation examining both possible cases. This doesn't happen
often so it not a big cost, and we already have most of the code to
do it.
Signed-off-by: NeilBrown <neilb@suse.de>
While reshaping a degraded array (as when reshaping a RAID0 by first
converting it to a degraded RAID4) we currently get confused about
which devices are in_sync. In most cases we get it right, but in the
region that is being reshaped we need to treat non-failed devices as
in-sync when we have the data but haven't actually written it out yet.
Reported-by: Adam Kwolek <adam.kwolek@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Once a device is failed we really want to completely ignore it.
It should go away soon anyway.
In particular the presence of bad blocks on it should not cause us to
block as we won't be trying to write there anyway.
So as soon as we can check if a device is Faulty, do so and pretend
that it is already gone if it is Faulty.
Signed-off-by: NeilBrown <neilb@suse.de>
All updates that occur under STRIPE_ACTIVE should be globally visible
when STRIPE_ACTIVE clears. test_and_set_bit() implies a barrier, but
clear_bit() does not.
This is suitable for 3.1-stable.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Cc: stable@kernel.org
When the number of failed devices exceeds the allowed number
we must abort any active parity operations (checks or updates) as they
are no longer meaningful, and can lead to a BUG_ON in
handle_parity_checks6.
This bug was introduce by commit 6c0069c0ae
in 2.6.29.
Reported-by: Manish Katiyar <mkatiyar@gmail.com>
Tested-by: Manish Katiyar <mkatiyar@gmail.com>
Acked-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Cc: stable@kernel.org
* 'modsplit-Oct31_2011' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux: (230 commits)
Revert "tracing: Include module.h in define_trace.h"
irq: don't put module.h into irq.h for tracking irqgen modules.
bluetooth: macroize two small inlines to avoid module.h
ip_vs.h: fix implicit use of module_get/module_put from module.h
nf_conntrack.h: fix up fallout from implicit moduleparam.h presence
include: replace linux/module.h with "struct module" wherever possible
include: convert various register fcns to macros to avoid include chaining
crypto.h: remove unused crypto_tfm_alg_modname() inline
uwb.h: fix implicit use of asm/page.h for PAGE_SIZE
pm_runtime.h: explicitly requires notifier.h
linux/dmaengine.h: fix implicit use of bitmap.h and asm/page.h
miscdevice.h: fix up implicit use of lists and types
stop_machine.h: fix implicit use of smp.h for smp_processor_id
of: fix implicit use of errno.h in include/linux/of.h
of_platform.h: delete needless include <linux/module.h>
acpi: remove module.h include from platform/aclinux.h
miscdevice.h: delete unnecessary inclusion of module.h
device_cgroup.h: delete needless include <linux/module.h>
net: sch_generic remove redundant use of <linux/module.h>
net: inet_timewait_sock doesnt need <linux/module.h>
...
Fix up trivial conflicts (other header files, and removal of the ab3550 mfd driver) in
- drivers/media/dvb/frontends/dibx000_common.c
- drivers/media/video/{mt9m111.c,ov6650.c}
- drivers/mfd/ab3550-core.c
- include/linux/dmaengine.h
* 'for-3.2/core' of git://git.kernel.dk/linux-block: (29 commits)
block: don't call blk_drain_queue() if elevator is not up
blk-throttle: use queue_is_locked() instead of lockdep_is_held()
blk-throttle: Take blkcg->lock while traversing blkcg->policy_list
blk-throttle: Free up policy node associated with deleted rule
block: warn if tag is greater than real_max_depth.
block: make gendisk hold a reference to its queue
blk-flush: move the queue kick into
blk-flush: fix invalid BUG_ON in blk_insert_flush
block: Remove the control of complete cpu from bio.
block: fix a typo in the blk-cgroup.h file
block: initialize the bounce pool if high memory may be added later
block: fix request_queue lifetime handling by making blk_queue_cleanup() properly shutdown
block: drop @tsk from attempt_plug_merge() and explain sync rules
block: make get_request[_wait]() fail if queue is dead
block: reorganize throtl_get_tg() and blk_throtl_bio()
block: reorganize queue draining
block: drop unnecessary blk_get/put_queue() in scsi_cmd_ioctl() and blk_get_tg()
block: pass around REQ_* flags instead of broken down booleans during request alloc/free
block: move blk_throtl prototypes to block/blk.h
block: fix genhd refcounting in blkio_policy_parse_and_set()
...
Fix up trivial conflicts due to "mddev_t" -> "struct mddev" conversion
and making the request functions be of type "void" instead of "int" in
- drivers/md/{faulty.c,linear.c,md.c,md.h,multipath.c,raid0.c,raid1.c,raid10.c,raid5.c}
- drivers/staging/zram/zram_drv.c
A pending cleanup will mean that module.h won't be implicitly
everywhere anymore. Make sure the modular drivers in md dir
are actually calling out for <module.h> explicitly in advance.
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
In 3.0 we changed the way recovery_disabled was handle so that instead
of testing against zero, we test an mddev-> value against a conf->
value.
Two problems:
1/ one place in raid1 was missed and still sets to '1'.
2/ We didn't explicitly set the conf-> value at array creation
time.
It defaulted to '0' just like the mddev value does so they
could appear equal and thus disable recovery.
This did not affect normal 'md' as it calls bind_rdev_to_array
which changes the mddev value. However the dmraid interface
doesn't call this and so doesn't change ->recovery_disabled; so at
array start all recovery is incorrectly disabled.
So initialise the 'conf' value to one less that the mddev value, so
the will only be the same when explicitly set that way.
Reported-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
This bug was introduced in 415e72d034
which was in 2.6.36.
There is a small window of time between when a device fails and when
it is removed from the array. During this time we might still read
from it, but we won't write to it - so it is possible that we could
read stale data.
We didn't need the test of 'Faulty' before because the test on
In_sync is sufficient. Since we started allowing reads from the early
part of non-In_sync devices we need a test on Faulty too.
This is suitable for any kernel from 2.6.36 onwards, though the patch
might need a bit of tweaking in 3.0 and earlier.
Cc: stable@kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>