This way we don't need a block_device structure to submit I/O. The
block_device has different life time rules from the gendisk and
request_queue and is usually only available when the block device node
is open. Other callers need to explicitly create one (e.g. the lightnvm
passthrough code, or the new nvme multipathing code).
For the actual I/O path all that we need is the gendisk, which exists
once per block device. But given that the block layer also does
partition remapping we additionally need a partition index, which is
used for said remapping in generic_make_request.
Note that all the block drivers generally want request_queue or
sometimes the gendisk, so this removes a layer of indirection all
over the stack.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The block layer always remaps partitions before calling into the
->make_request methods of drivers. Thus the call to get_start_sect in
in_chunk_boundary will always return 0 and can be removed.
Reviewed-by: Shaohua Li <shli@fb.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Since thread_group worker and raid5d kthread are not in sync, if
worker writes stripe before raid5d then requests will be waiting
for issue_pendig.
Issue observed when building raid5 with ext4, in some build runs
jbd2 would get hung and requests were waiting in the HW engine
waiting to be issued.
Fix this by adding a call to async_tx_issue_pending_all in the
raid5_do_work.
Signed-off-by: Ofer Heifetz <oferh@marvell.com>
Cc: stable@vger.kernel.org
Signed-off-by: Shaohua Li <shli@fb.com>
Since bio_io_error sets bi_status to BLK_STS_IOERR,
and calls bio_endio, so we can use it directly.
And as mentioned by Shaohua, there are also two
places in raid5.c can use bio_io_error either.
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
The raid5 md device is created by the disks which we don't use the total size. For example,
the size of the device is 5G and it just uses 3G of the devices to create one raid5 device.
Then change the chunksize and wait reshape to finish. After reshape finishing stop the raid
and assemble it again. It fails.
mdadm -CR /dev/md0 -l5 -n3 /dev/loop[0-2] --size=3G --chunk=32 --assume-clean
mdadm /dev/md0 --grow --chunk=64
wait reshape to finish
mdadm -S /dev/md0
mdadm -As
The error messages:
[197519.814302] md: loop1 does not have a valid v1.2 superblock, not importing!
[197519.821686] md: md_import_device returned -22
After reshape the data offset is changed. It selects backwards direction in this condition.
In function super_1_load it compares the available space of the underlying device with
sb->data_size. The new data offset gets bigger after reshape. So super_1_load returns -EINVAL.
rdev->sectors is updated in md_finish_reshape. Then sb->data_size is set in super_1_sync based
on rdev->sectors. So add md_finish_reshape in end_reshape.
Signed-off-by: Xiao Ni <xni@redhat.com>
Acked-by: Guoqing Jiang <gqjiang@suse.com>
Cc: stable@vger.kernel.org
Signed-off-by: Shaohua Li <shli@fb.com>
Pull MD update from Shaohua Li:
- fixed deadlock in MD suspend and a potential bug in bio allocation
(Neil Brown)
- fixed signal issue (Mikulas Patocka)
- fixed typo in FailFast test (Guoqing Jiang)
- other trival fixes
* 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md:
MD: fix sleep in atomic
MD: fix a null dereference
md: use a separate bio_set for synchronous IO.
md: change the initialization value for a spare device spot to MD_DISK_ROLE_SPARE
md/raid1: remove unused bio in sync_request_write
md/raid10: fix FailFast test for wrong device
md: don't use flush_signals in userspace processes
md: fix deadlock between mddev_suspend() and md_write_start()
"flags" arguments are often seen as good API design as they allow
easy extensibility.
bioset_create_nobvec() is implemented internally as a variation in
flags passed to __bioset_create().
To support future extension, make the internal structure part of the
API.
i.e. add a 'flags' argument to bioset_create() and discard
bioset_create_nobvec().
Note that the bio_split allocations in drivers/md/raid* do not need
the bvec mempool - they should have used bioset_create_nobvec().
Suggested-by: Christoph Hellwig <hch@infradead.org>
Reviewed-by: Christoph Hellwig <hch@infradead.org>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The function flush_signals clears all pending signals for the process. It
may be used by kernel threads when we need to prepare a kernel thread for
responding to signals. However using this function for an userspaces
processes is incorrect - clearing signals without the program expecting it
can cause misbehavior.
The raid1 and raid5 code uses flush_signals in its request routine because
it wants to prepare for an interruptible wait. This patch drops
flush_signals and uses sigprocmask instead to block all signals (including
SIGKILL) around the schedule() call. The signals are not lost, but the
schedule() call won't respond to them.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@vger.kernel.org
Acked-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
If mddev_suspend() races with md_write_start() we can deadlock
with mddev_suspend() waiting for the request that is currently
in md_write_start() to complete the ->make_request() call,
and md_write_start() waiting for the metadata to be updated
to mark the array as 'dirty'.
As metadata updates done by md_check_recovery() only happen then
the mddev_lock() can be claimed, and as mddev_suspend() is often
called with the lock held, these threads wait indefinitely for each
other.
We fix this by having md_write_start() abort if mddev_suspend()
is happening, and ->make_request() aborts if md_write_start()
aborted.
md_make_request() can detect this abort, decrease the ->active_io
count, and wait for mddev_suspend().
Reported-by: Nix <nix@esperi.org.uk>
Fix: 68866e425be2(MD: no sync IO while suspended)
Cc: stable@vger.kernel.org
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
-----BEGIN PGP SIGNATURE-----
iQEcBAABAgAGBQJZPdbLAAoJEHm+PkMAQRiGx4wH/1nCjfnl6fE8oJ24/1gEAOUh
biFdqJkYZmlLYHVtYfLm4Ueg4adJdg0wx6qM/4RaAzmQVvLfDV34bc1qBf1+P95G
kVF+osWyXrZo5cTwkwapHW/KNu4VJwAx2D1wrlxKDVG5AOrULH1pYOYGOpApEkZU
4N+q5+M0ce0GJpqtUZX+UnI33ygjdDbBxXoFKsr24B7eA0ouGbAJ7dC88WcaETL+
2/7tT01SvDMo0jBSV0WIqlgXwZ5gp3yPGnklC3F4159Yze6VFrzHMKS/UpPF8o8E
W9EbuzwxsKyXUifX2GY348L1f+47glen/1sedbuKnFhP6E9aqUQQJXvEO7ueQl4=
=m2Gx
-----END PGP SIGNATURE-----
Merge tag 'v4.12-rc5' into for-4.13/block
We've already got a few conflicts and upcoming work depends on some of the
changes that have gone into mainline as regression fixes for this series.
Pull in 4.12-rc5 to resolve these conflicts and make it easier on down stream
trees to continue working on 4.13 changes.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Replace bi_error with a new bi_status to allow for a clear conversion.
Note that device mapper overloaded bi_error with a private value, which
we'll have to keep arround at least for now and thus propagate to a
proper blk_status_t value.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
The new per-cpu counter for writes_pending is initialised in
md_alloc(), which is not called by dm-raid.
So dm-raid fails when md_write_start() is called.
Move the initialization to the personality modules
that need it. This way it is always initialised when needed,
but isn't unnecessarily initialized (requiring memory allocation)
when the personality doesn't use writes_pending.
Reported-by: Heinz Mauelshagen <heinzm@redhat.com>
Fixes: 4ad23a9764 ("MD: use per-cpu counter for writes_pending")
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
This makes it possible, with appropriate filesystem support, for a
sysadmin to tell what is affected by the mismatch, and whether
it should be ignored (if it's inside a swap partition, for
instance).
We ratelimit to prevent log flooding: if there are so many
mismatches that ratelimiting is necessary, the individual messages
are relatively unlikely to be important (either the machine is
swapping like crazy or something is very wrong with the disk).
Signed-off-by: Nick Alcock <nick.alcock@oracle.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Currently, sync of raid456 array cannot make progress when hitting
data in writeback r5cache.
This patch fixes this issue by flushing cached data of the stripe
before processing the sync request. This is achived by:
1. In handle_stripe(), do not set STRIPE_SYNCING if the stripe is
in write back cache;
2. In r5c_try_caching_write(), handle the stripe in sync with write
through;
3. In do_release_stripe(), make stripe in sync write out and send
it to the state machine.
Shaohua: explictly set STRIPE_HANDLE after write out completed
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
For the raid456 with writeback cache, when journal device failed during
normal operation, it is still possible to persist all data, as all
pending data is still in stripe cache. However, it is necessary to handle
journal failure gracefully.
During journal failures, the following logic handles the graceful shutdown
of journal:
1. raid5_error() marks the device as Faulty and schedules async work
log->disable_writeback_work;
2. In disable_writeback_work (r5c_disable_writeback_async), the mddev is
suspended, set to write through, and then resumed. mddev_suspend()
flushes all cached stripes;
3. All cached stripes need to be flushed carefully to the RAID array.
This patch fixes issues within the process above:
1. In r5c_update_on_rdev_error() schedule disable_writeback_work for
journal failures;
2. In r5c_disable_writeback_async(), wait for MD_SB_CHANGE_PENDING,
since raid5_error() updates superblock.
3. In handle_stripe(), allow stripes with data in journal (s.injournal > 0)
to make progress during log_failed;
4. In delay_towrite(), if log failed only process data in the cache (skip
new writes in dev->towrite);
5. In __get_priority_stripe(), process loprio_list during journal device
failures.
6. In raid5_remove_disk(), wait for all cached stripes are flushed before
calling log_exit().
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
This essentially reverts commit b5470dc5fc ("md: resolve external
metadata handling deadlock in md_allow_write") with some adjustments.
Since commit 6791875e2e ("md: make reconfig_mutex optional for writes
to md sysfs files.") changing array_state to 'active' does not use
mddev_lock() and will not cause a deadlock with md_allow_write(). This
revert simplifies userspace tools that write to sysfs attributes like
"stripe_cache_size" or "consistency_policy" because it removes the need
for special handling for external metadata arrays, checking for EAGAIN
and retrying the write.
Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
Signed-off-by: Shaohua Li <shli@fb.com>
On mainline, there is no functional difference, just less code, and
symmetric lock/unlock paths.
On PREEMPT_RT builds, this fixes the following warning, seen by
Alexander GQ Gerasiov, due to the sleeping nature of spinlocks.
BUG: sleeping function called from invalid context at kernel/locking/rtmutex.c:993
in_atomic(): 0, irqs_disabled(): 1, pid: 58, name: kworker/u12:1
CPU: 5 PID: 58 Comm: kworker/u12:1 Tainted: G W 4.9.20-rt16-stand6-686 #1
Hardware name: Supermicro SYS-5027R-WRF/X9SRW-F, BIOS 3.2a 10/28/2015
Workqueue: writeback wb_workfn (flush-253:0)
Call Trace:
dump_stack+0x47/0x68
? migrate_enable+0x4a/0xf0
___might_sleep+0x101/0x180
rt_spin_lock+0x17/0x40
add_stripe_bio+0x4e3/0x6c0 [raid456]
? preempt_count_add+0x42/0xb0
raid5_make_request+0x737/0xdd0 [raid456]
Reported-by: Alexander GQ Gerasiov <gq@redlab-i.ru>
Tested-by: Alexander GQ Gerasiov <gq@redlab-i.ru>
Signed-off-by: Julia Cartwright <julia@ni.com>
Signed-off-by: Shaohua Li <shli@fb.com>
We can clear 'WantReplacement' flag directly no
matter it's replacement existed or not since the
semantic is same as before.
Also since the disk is removed from array, then
it is straightforward to remove 'WantReplacement'
flag and the comments in raid10/5 can be removed
as well.
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
chunk_aligned_read() currently uses fs_bio_set - which is meant for
filesystems to use - and loops if multiple splits are needed, which is
not best practice.
As this is only used for READ requests, not writes, it is unlikely
to cause a problem. However it is best to be consistent in how
we split bios, and to follow the pattern used in raid1/raid10.
So create a private bioset, bio_split, and use it to perform a single
split, submitting the remainder to generic_make_request() for later
processing.
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
In case of read-modify-write, partial partity is the same as the result
of ops_run_prexor5(), so we can just copy sh->dev[pd_idx].page into
sh->ppl_page instead of calculating it again.
Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Use resize_stripes() instead of raid5_reset_stripe_cache() to allocate
or free sh->ppl_page at runtime for all stripes in the stripe cache.
raid5_reset_stripe_cache() required suspending the mddev and could
deadlock because of GFP_KERNEL allocations.
Move the 'newsize' check to check_reshape() to allow reallocating the
stripes with the same number of disks. Allocate sh->ppl_page in
alloc_stripe() instead of grow_buffers(). Pass 'struct r5conf *conf' as
a parameter to alloc_stripe() because it is needed to check whether to
allocate ppl_page. Add free_stripe() and use it to free stripes rather
than directly call kmem_cache_free(). Also free sh->ppl_page in
free_stripe().
Set MD_HAS_PPL at the end of ppl_init_log() instead of explicitly
setting it in advance and add another parameter to log_init() to allow
calling ppl_init_log() without the bit set. Don't try to calculate
partial parity or add a stripe to log if it does not have ppl_page set.
Enabling ppl can now be performed without suspending the mddev, because
the log won't be used until new stripes are allocated with ppl_page.
Calling mddev_suspend/resume is still necessary when disabling ppl,
because we want all stripes to finish before stopping the log, but
resize_stripes() can be called after mddev_resume() when ppl is no
longer active.
Suggested-by: NeilBrown <neilb@suse.com>
Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
Signed-off-by: Shaohua Li <shli@fb.com>
When recoverying a single missing/failed device in a RAID6,
those stripes where the Q block is on the missing device are
handled a bit differently. In these cases it is easy to
check that the P block is correct, so we do. This results
in the P block be destroy. Consequently the P block needs
to be read a second time in order to compute Q. This causes
lots of seeks and hurts performance.
It shouldn't be necessary to re-read P as it can be computed
from the DATA. But we only compute blocks on missing
devices, since c337869d95 ("md: do not compute parity
unless it is on a failed drive").
So relax the change made in that commit to allow computing
of the P block in a RAID6 which it is the only missing that
block.
This makes RAID6 recovery run much faster as the disk just
"before" the recovering device is no longer seeking
back-and-forth.
Reported-by-tested-by: Brad Campbell <lists2009@fnarfbargle.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Now that we use the proper REQ_OP_WRITE_ZEROES operation everywhere we can
kill this hack.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Copy & paste from the REQ_OP_WRITE_SAME code.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Currently only dm and md/raid5 bios trigger
trace_block_bio_complete(). Now that we have bio_chain() and
bio_inc_remaining(), it is not possible, in general, for a driver to
know when the bio is really complete. Only bio_endio() knows that.
So move the trace_block_bio_complete() call to bio_endio().
Now trace_block_bio_complete() pairs with trace_block_bio_queue().
Any bio for which a 'queue' event is traced, will subsequently
generate a 'complete' event.
There are a few cases where completion tracing is not wanted.
1/ If blk_update_request() has already generated a completion
trace event at the 'request' level, there is no point generating
one at the bio level too. In this case the bi_sector and bi_size
will have changed, so the bio level event would be wrong
2/ If the bio hasn't actually been queued yet, but is being aborted
early, then a trace event could be confusing. Some filesystems
call bio_endio() but do not want tracing.
3/ The bio_integrity code interposes itself by replacing bi_end_io,
then restoring it and calling bio_endio() again. This would produce
two identical trace events if left like that.
To handle these, we introduce a flag BIO_TRACE_COMPLETION and only
produce the trace event when this is set.
We address point 1 above by clearing the flag in blk_update_request().
We address point 2 above by only setting the flag when
generic_make_request() is called.
We address point 3 above by clearing the flag after generating a
completion event.
When bio_split() is used on a bio, particularly in blk_queue_split(),
there is an extra complication. A new bio is split off the front, and
may be handle directly without going through generic_make_request().
The old bio, which has been advanced, is passed to
generic_make_request(), so it will trigger a trace event a second
time.
Probably the best result when a split happens is to see a single
'queue' event for the whole bio, then multiple 'complete' events - one
for each component. To achieve this was can:
- copy the BIO_TRACE_COMPLETION flag to the new bio in bio_split()
- avoid generating a 'queue' event if BIO_TRACE_COMPLETION is already set.
This way, the split-off bio won't create a queue event, the original
won't either even if it re-submitted to generic_make_request(),
but both will produce completion events, each for their own range.
So if generic_make_request() is called (which generates a QUEUED
event), then bi_endio() will create a single COMPLETE event for each
range that the bio is split into, unless the driver has explicitly
requested it not to.
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
When journal device of an array fails, the array is forced into read-only
mode. To make the array normal without adding another journal device, we
need to remove journal _feature_ from the array.
This patch allows remove journal _feature_ from an array, For journal
existing journal should be either missing or faulty.
To remove journal feature, it is necessary to remove the journal device
first:
mdadm --fail /dev/md0 /dev/sdb
mdadm: set /dev/sdb faulty in /dev/md0
mdadm --remove /dev/md0 /dev/sdb
mdadm: hot removed /dev/sdb from /dev/md0
Then the journal feature can be removed by echoing into the sysfs file:
cat /sys/block/md0/md/consistency_policy
journal
echo resync > /sys/block/md0/md/consistency_policy
cat /sys/block/md0/md/consistency_policy
resync
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
This test on ->writes_pending cannot be safe as the counter
can be incremented at any moment and cannot be locked against.
Change it to test conf->active_stripes, which at least
can be locked against. More changes are still needed.
A future patch will change ->writes_pending, and testing it here will
be very inconvenient.
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
This reverts commit e8d7c33232.
Now that raid5 doesn't abuse bi_phys_segments any more, we no longer
need to impose these limits.
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
When a read request, which bypassed the cache, fails, we need to retry
it through the cache.
This involves attaching it to a sequence of stripe_heads, and it may not
be possible to get all the stripe_heads we need at once.
We do what we can, and record how far we got in ->bi_phys_segments so
we can pick up again later.
There is only ever one bio which may have a non-zero offset stored in
->bi_phys_segments, the one that is either active in the single thread
which calls retry_aligned_read(), or is in conf->retry_read_aligned
waiting for retry_aligned_read() to be called again.
So we only need to store one offset value. This can be in a local
variable passed between remove_bio_from_retry() and
retry_aligned_read(), or in the r5conf structure next to the
->retry_read_aligned pointer.
Storing it there allows the last usage of ->bi_phys_segments to be
removed from md/raid5.c.
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
md/raid5 needs to keep track of how many stripe_heads are processing a
bio so that it can delay calling bio_endio() until all stripe_heads
have completed. It currently uses 16 bits of ->bi_phys_segments for
this purpose.
16 bits is only enough for 256M requests, and it is possible for a
single bio to be larger than this, which causes problems. Also, the
bio struct contains a larger counter, __bi_remaining, which has a
purpose very similar to the purpose of our counter. So stop using
->bi_phys_segments, and instead use __bi_remaining.
This means we don't need to initialize the counter, as our caller
initializes it to '1'. It also means we can call bio_endio() directly
as it tests this counter internally.
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
We currently gather bios that need to be returned into a bio_list
and call bio_endio() on them all together.
The original reason for this was to avoid making the calls while
holding a spinlock.
Locking has changed a lot since then, and that reason is no longer
valid.
So discard return_io() and various return_bi lists, and just call
bio_endio() directly as needed.
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
If a device fails during a write, we must ensure the failure is
recorded in the metadata before the completion of the write is
acknowleged.
Commit c3cce6cda1 ("md/raid5: ensure device failure recorded before
write request returns.") added code for this, but it was
unnecessarily complicated. We already had similar functionality for
handling updates to the bad-block-list, thanks to Commit de393cdea6
("md: make it easier to wait for bad blocks to be acknowledged.")
So revert most of the former commit, and instead avoid collecting
completed writes if MD_CHANGE_PENDING is set. raid5d() will then flush
the metadata and retry the stripe_head.
As this change can leave a stripe_head ready for handling immediately
after handle_active_stripes() returns, we change raid5_do_work() to
pause when MD_CHANGE_PENDING is set, so that it doesn't spin.
We check MD_CHANGE_PENDING *after* analyse_stripe() as it could be set
asynchronously. After analyse_stripe(), we have collected stable data
about the state of devices, which will be used to make decisions.
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
We use md_write_start() to increase the count of pending writes, and
md_write_end() to decrement the count. We currently count bios
submitted to md/raid5. Change it count stripe_heads that a WRITE bio
has been attached to.
So now, raid5_make_request() calls md_write_start() and then
md_write_end() to keep the count elevated during the setup of the
request.
add_stripe_bio() calls md_write_start() for each stripe_head, and the
completion routines always call md_write_end(), instead of only
calling it when raid5_dec_bi_active_stripes() returns 0.
make_discard_request also calls md_write_start/end().
The parallel between md_write_{start,end} and use of bi_phys_segments
can be seen in that:
Whenever we set bi_phys_segments to 1, we now call md_write_start.
Whenever we increment it on non-read requests with
raid5_inc_bi_active_stripes(), we now call md_write_start().
Whenever we decrement bi_phys_segments on non-read requsts with
raid5_dec_bi_active_stripes(), we now call md_write_end().
This reduces our dependence on keeping a per-bio count of active
stripes in bi_phys_segments.
md_write_inc() is added which parallels md_write_start(), but requires
that a write has already been started, and is certain never to sleep.
This can be used inside a spinlocked region when adding to a write
request.
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Allow writing to 'consistency_policy' attribute when the array is
active. Add a new function 'change_consistency_policy' to the
md_personality operations structure to handle the change in the
personality code. Values "ppl" and "resync" are accepted and
turn PPL on and off respectively.
When enabling PPL its location and size should first be set using
'ppl_sector' and 'ppl_size' attributes and a valid PPL header should be
written at this location on each member device.
Enabling or disabling PPL is performed under a suspended array. The
raid5_reset_stripe_cache function frees the stripe cache and allocates
it again in order to allocate or free the ppl_pages for the stripes in
the stripe cache.
Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Add a function to modify the log by removing an rdev when a drive fails
or adding when a spare/replacement is activated as a raid member.
Removing a disk just clears the child log rdev pointer. No new stripes
will be accepted for this child log in ppl_write_stripe() and running io
units will be processed without writing PPL to the device.
Adding a disk sets the child log rdev pointer and writes an empty PPL
header.
Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Load the log from each disk when starting the array and recover if the
array is dirty.
The initial empty PPL is written by mdadm. When loading the log we
verify the header checksum and signature. For external metadata arrays
the signature is verified in userspace, so here we read it from the
header, verifying only if it matches on all disks, and use it later when
writing PPL.
In addition to the header checksum, each header entry also contains a
checksum of its partial parity data. If the header is valid, recovery is
performed for each entry until an invalid entry is found. If the array
is not degraded and recovery using PPL fully succeeds, there is no need
to resync the array because data and parity will be consistent, so in
this case resync will be disabled.
Due to compatibility with IMSM implementations on other systems, we
can't assume that the recovery data block size is always 4K. Writes
generated by MD raid5 don't have this issue, but when recovering PPL
written in other environments it is possible to have entries with
512-byte sector granularity. The recovery code takes this into account
and also the logical sector size of the underlying drives.
Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Implement the calculation of partial parity for a stripe and PPL write
logging functionality. The description of PPL is added to the
documentation. More details can be found in the comments in raid5-ppl.c.
Attach a page for holding the partial parity data to stripe_head.
Allocate it only if mddev has the MD_HAS_PPL flag set.
Partial parity is the xor of not modified data chunks of a stripe and is
calculated as follows:
- reconstruct-write case:
xor data from all not updated disks in a stripe
- read-modify-write case:
xor old data and parity from all updated disks in a stripe
Implement it using the async_tx API and integrate into raid_run_ops().
It must be called when we still have access to old data, so do it when
STRIPE_OP_BIODRAIN is set, but before ops_run_prexor5(). The result is
stored into sh->ppl_page.
Partial parity is not meaningful for full stripe write and is not stored
in the log or used for recovery, so don't attempt to calculate it when
stripe has STRIPE_FULL_WRITE.
Put the PPL metadata structures to md_p.h because userspace tools
(mdadm) will also need to read/write PPL.
Warn about using PPL with enabled disk volatile write-back cache for
now. It can be removed once disk cache flushing before writing PPL is
implemented.
Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Move raid5-cache declarations from raid5.h to raid5-log.h, add inline
wrappers for functions which will be shared with ppl and use them in
raid5 core instead of direct calls to raid5-cache.
Remove unused parameter from r5c_cache_data(), move two duplicated
pr_debug() calls to r5l_init_log().
Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Previous patch (raid5: only dispatch IO from raid5d for harddisk raid)
defers IO dispatching. The goal is to create better IO pattern. At that
time, we don't sort the deffered IO and hope the block layer can do IO
merge and sort. Now the raid5-cache writeback could create large amount
of bios. And if we enable muti-thread for stripe handling, we can't
control when to dispatch IO to raid disks. In a lot of time, we are
dispatching IO which block layer can't do merge effectively.
This patch moves further for the IO dispatching defer. We accumulate
bios, but we don't dispatch all the bios after a threshold is met. This
'dispatch partial portion of bios' stragety allows bios coming in a
large time window are sent to disks together. At the dispatching time,
there is large chance the block layer can merge the bios. To make this
more effective, we dispatch IO in ascending order. This increases
request merge chance and reduces disk seek.
Signed-off-by: Shaohua Li <shli@fb.com>
In raid5-cache writeback mode, we have two types of stripes to handle.
- stripes which aren't cached yet
- stripes which are cached and flushing out to raid disks
Upperlayer is more sensistive to latency of the first type of stripes
generally. But we only one handle list for all these stripes, where the
two types of stripes are mixed together. When reclaim flushes a lot of
stripes, the first type of stripes could be noticeably delayed. On the
other hand, if the log space is tight, we'd like to handle the second
type of stripes faster and free log space.
This patch destinguishes the two types stripes. They are added into
different handle list. When we try to get a stripe to handl, we prefer
the first type of stripes unless log space is tight.
This should have no impact for !writeback case.
Signed-off-by: Shaohua Li <shli@fb.com>
Before this patch, device InJournal will be included in prexor
(SYNDROME_SRC_WANT_DRAIN) but not in reconstruct (SYNDROME_SRC_WRITTEN). So it
will break parity calculation. With srctype == SYNDROME_SRC_WRITTEN, we need
include both dev with non-null ->written and dev with R5_InJournal. This fixes
logic in 1e6d690(md/r5cache: caching phase of r5cache)
Cc: stable@vger.kernel.org (v4.10+)
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
raid1_resize and raid5_resize should also check the
mddev->queue if run underneath dm-raid.
And both set_capacity and revalidate_disk are used in
pers->resize such as raid1, raid10 and raid5. So
move them from personality file to common code.
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
We are going to split <linux/sched/signal.h> out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.
Create a trivial placeholder <linux/sched/signal.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.
Include the new header in the files that are going to need it.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull md updates from Shaohua Li:
"Mainly fixes bugs and improves performance:
- Improve scalability for raid1 from Coly
- Improve raid5-cache read performance, disk efficiency and IO
pattern from Song and me
- Fix a race condition of disk hotplug for linear from Coly
- A few cleanup patches from Ming and Byungchul
- Fix a memory leak from Neil
- Fix WRITE SAME IO failure from me
- Add doc for raid5-cache from me"
* 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md: (23 commits)
md/raid1: fix write behind issues introduced by bio_clone_bioset_partial
md/raid1: handle flush request correctly
md/linear: shutup lockdep warnning
md/raid1: fix a use-after-free bug
RAID1: avoid unnecessary spin locks in I/O barrier code
RAID1: a new I/O barrier implementation to remove resync window
md/raid5: Don't reinvent the wheel but use existing llist API
md: fast clone bio in bio_clone_mddev()
md: remove unnecessary check on mddev
md/raid1: use bio_clone_bioset_partial() in case of write behind
md: fail if mddev->bio_set can't be created
block: introduce bio_clone_bioset_partial()
md: disable WRITE SAME if it fails in underlayer disks
md/raid5-cache: exclude reclaiming stripes in reclaim check
md/raid5-cache: stripe reclaim only counts valid stripes
MD: add doc for raid5-cache
Documentation: move MD related doc into a separate dir
md: ensure md devices are freed before module is unloaded.
md/r5cache: improve journal device efficiency
md/r5cache: enable chunk_aligned_read with write back cache
...
Although llist provides proper APIs, they are not used. Make them used.
Signed-off-by: Byungchul Park <byungchul.park@lge.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Firstly bio_clone_mddev() is used in raid normal I/O and isn't
in resync I/O path.
Secondly all the direct access to bvec table in raid happens on
resync I/O except for write behind of raid1, in which we still
use bio_clone() for allocating new bvec table.
So this patch replaces bio_clone() with bio_clone_fast()
in bio_clone_mddev().
Also kill bio_clone_mddev() and call bio_clone_fast() directly, as
suggested by Christoph Hellwig.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Shaohua Li <shli@fb.com>
stripes which are being reclaimed are still accounted into cached
stripes. The reclaim takes time. r5c_do_reclaim isn't aware of the
stripes and does unnecessary stripe reclaim. In practice, I saw one
stripe is reclaimed one time. This will cause bad IO pattern. Fixing
this by excluding the reclaing stripes in the check.
Cc: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
It is important to be able to flush all stripes in raid5-cache.
Therefore, we need reserve some space on the journal device for
these flushes. If flush operation includes pending writes to the
stripe, we need to reserve (conf->raid_disk + 1) pages per stripe
for the flush out. This reduces the efficiency of journal space.
If we exclude these pending writes from flush operation, we only
need (conf->max_degraded + 1) pages per stripe.
With this patch, when log space is critical (R5C_LOG_CRITICAL=1),
pending writes will be excluded from stripe flush out. Therefore,
we can reduce reserved space for flush out and thus improve journal
device efficiency.
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Chunk aligned read significantly reduces CPU usage of raid456.
However, it is not safe to fully bypass the write back cache.
This patch enables chunk aligned read with write back cache.
For chunk aligned read, we track stripes in write back cache at
a bigger granularity, "big_stripe". Each chunk may contain more
than one stripe (for example, a 256kB chunk contains 64 4kB-page,
so this chunk contain 64 stripes). For chunk_aligned_read, these
stripes are grouped into one big_stripe, so we only need one lookup
for the whole chunk.
For each big_stripe, struct big_stripe_info tracks how many stripes
of this big_stripe are in the write back cache. We count how many
stripes of this big_stripe are in the write back cache. These
counters are tracked in a radix tree (big_stripe_tree).
r5c_tree_index() is used to calculate keys for the radix tree.
chunk_aligned_read() calls r5c_big_stripe_cached() to look up
big_stripe of each chunk in the tree. If this big_stripe is in the
tree, chunk_aligned_read() aborts. This look up is protected by
rcu_read_lock().
It is necessary to remember whether a stripe is counted in
big_stripe_tree. Instead of adding new flag, we reuses existing flags:
STRIPE_R5C_PARTIAL_STRIPE and STRIPE_R5C_FULL_STRIPE. If either of these
two flags are set, the stripe is counted in big_stripe_tree. This
requires moving set_bit(STRIPE_R5C_PARTIAL_STRIPE) to
r5c_try_caching_write(); and moving clear_bit of
STRIPE_R5C_PARTIAL_STRIPE and STRIPE_R5C_FULL_STRIPE to
r5c_finish_stripe_write_out().
Signed-off-by: Song Liu <songliubraving@fb.com>
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
We made raid5 stripe handling multi-thread before. It works well for
SSD. But for harddisk, the multi-threading creates more disk seek, so
not always improve performance. For several hard disks based raid5,
multi-threading is required as raid5d becames a bottleneck especially
for sequential write.
To overcome the disk seek issue, we only dispatch IO from raid5d if the
array is harddisk based. Other threads can still handle stripes, but
can't dispatch IO.
Idealy, we should control IO dispatching order according to IO position
interrnally. Right now we still depend on block layer, which isn't very
efficient sometimes though.
My setup has 9 harddisks, each disk can do around 180M/s sequential
write. So in theory, the raid5 can do 180 * 8 = 1440M/s sequential
write. The test machine uses an ATOM CPU. I measure sequential write
with large iodepth bandwidth to raid array:
without patch: ~600M/s
without patch and group_thread_cnt=4: 750M/s
with patch and group_thread_cnt=4: 950M/s
with patch, group_thread_cnt=4, skip_copy=1: 1150M/s
We are pretty close to the maximum bandwidth in the large iodepth
iodepth case. The performance gap of small iodepth sequential write
between software raid and theory value is still very big though, because
we don't have an efficient pipeline.
Cc: NeilBrown <neilb@suse.com>
Cc: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
We will want to have struct backing_dev_info allocated separately from
struct request_queue. As the first step add pointer to backing_dev_info
to request_queue and convert all users touching it. No functional
changes in this patch.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
write-back cache in degraded mode introduces corner cases to the array.
Although we try to cover all these corner cases, it is safer to just
disable write-back cache when the array is in degraded mode.
In this patch, we disable writeback cache for degraded mode:
1. On device failure, if the array enters degraded mode, raid5_error()
will submit async job r5c_disable_writeback_async to disable
writeback;
2. In r5c_journal_mode_store(), it is invalid to enable writeback in
degraded mode;
3. In r5c_try_caching_write(), stripes with s->failed>0 will be handled
in write-through mode.
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Write back cache requires a complex RMW mechanism, where old data is
read into dev->orig_page for prexor, and then xor is done with
dev->page. This logic is already implemented in the write path.
However, current read path is not awared of this requirement. When
the array is optimal, the RMW is not required, as the data are
read from raid disks. However, when the target stripe is degraded,
complex RMW is required to generate right data.
To keep read path as clean as possible, we handle read path by
flushing degraded, in-journal stripes before processing reads to
missing dev.
Specifically, when there is read requests to a degraded stripe
with data in journal, handle_stripe_fill() calls
r5c_make_stripe_write_out() and exits. Then handle_stripe_dirtying()
will do the complex RMW and flush the stripe to RAID disks. After
that, read requests are handled.
There is one more corner case when there is non-overwrite bio for
the missing (or out of sync) dev. handle_stripe_dirtying() will not
be able to process the non-overwrite bios without constructing the
data in handle_stripe_fill(). This is fixed by delaying non-overwrite
bios in handle_stripe_dirtying(). So handle_stripe_fill() works on
these bios after the stripe is flushed to raid disks.
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
With write back cache, we use orig_page to do prexor. This patch
makes sure we read data into orig_page for it.
Flag R5_OrigPageUPTDODATE is added to show whether orig_page
has the latest data from raid disk.
We introduce a helper function uptodate_for_rmw() to simplify
the a couple conditions in handle_stripe_dirtying().
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
This fixes a build error on certain architectures, such as ppc64.
Fixes: 6995f0b247e("md: takeover should clear unrelated bits")
Signed-off-by: Jes Sorensen <Jes.Sorensen@redhat.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Commit 6995f0b (md: takeover should clear unrelated bits) clear
unrelated bits, but it's quite fragile. To avoid error in the future,
define a macro for unsupported mddev flags for each raid type and use it
to clear unsupported mddev flags. This should be less error-prone.
Suggested-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
The mddev->flags are used for different purposes. There are a lot of
places we check/change the flags without masking unrelated flags, we
could check/change unrelated flags. These usage are most for superblock
write, so spearate superblock related flags. This should make the code
clearer and also fix real bugs.
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
When we change level from raid1 to raid5, the MD_FAILFAST_SUPPORTED bit
will be accidentally set, but raid5 doesn't support it. The same is true
for the MD_HAS_JOURNAL bit.
Fix: 46533ff (md: Use REQ_FAILFAST_* on metadata writes where appropriate)
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Current implementation employ 16bit counter of active stripes in lower
bits of bio->bi_phys_segments. If request is big enough to overflow
this counter bio will be completed and freed too early.
Fortunately this not happens in default configuration because several
other limits prevent that: stripe_cache_size * nr_disks effectively
limits count of active stripes. And small max_sectors_kb at lower
disks prevent that during normal read/write operations.
Overflow easily happens in discard if it's enabled by module parameter
"devices_handle_discard_safely" and stripe_cache_size is set big enough.
This patch limits requests size with 256Mb - 8Kb to prevent overflows.
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: Shaohua Li <shli@kernel.org>
Cc: Neil Brown <neilb@suse.com>
Cc: stable@vger.kernel.org
Signed-off-by: Shaohua Li <shli@fb.com>
RMW of r5c write back cache uses an extra page to store old data for
prexor. handle_stripe_dirtying() allocates this page by calling
alloc_page(). However, alloc_page() may fail.
To handle alloc_page() failures, this patch adds an extra page to
disk_info. When alloc_page fails, handle_stripe() trys to use these
pages. When these pages are used by other stripe (R5C_EXTRA_PAGE_IN_USE),
the stripe is added to delayed_list.
Signed-off-by: Song Liu <songliubraving@fb.com>
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Some drivers often use external bvec table, so introduce
this helper for this case. It is always safe to access the
bio->bi_io_vec in this way for this case.
After converting to this usage, it will becomes a bit easier
to evaluate the remaining direct access to bio->bi_io_vec,
so it can help to prepare for the following multipage bvec
support.
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Fixed up the new O_DIRECT cases.
Signed-off-by: Jens Axboe <axboe@fb.com>
With raid5 cache, we committing data from journal device. When
there is flush request, we need to flush journal device's cache.
This was not needed in raid5 journal, because we will flush the
journal before committing data to raid disks.
This is similar to FUA, except that we also need flush journal for
FUA. Otherwise, corruptions in earlier meta data will stop recovery
from reaching FUA data.
slightly changed the code by Shaohua
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
1. In previous patch, we:
- add new data to r5l_recovery_ctx
- add new functions to recovery write-back cache
The new functions are not used in this patch, so this patch does not
change the behavior of recovery.
2. In this patchpatch, we:
- modify main recovery procedure r5l_recovery_log() to call new
functions
- remove old functions
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
With write cache, journal_mode is the knob to switch between
write-back and write-through.
Below is an example:
root@virt-test:~/# cat /sys/block/md0/md/journal_mode
[write-through] write-back
root@virt-test:~/# echo write-back > /sys/block/md0/md/journal_mode
root@virt-test:~/# cat /sys/block/md0/md/journal_mode
write-through [write-back]
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
There are two limited resources, stripe cache and journal disk space.
For better performance, we priotize reclaim of full stripe writes.
To free up more journal space, we free earliest data on the journal.
In current implementation, reclaim happens when:
1. Periodically (every R5C_RECLAIM_WAKEUP_INTERVAL, 30 seconds) reclaim
if there is no reclaim in the past 5 seconds.
2. when there are R5C_FULL_STRIPE_FLUSH_BATCH (256) cached full stripes,
or cached stripes is enough for a full stripe (chunk size / 4k)
(r5c_check_cached_full_stripe)
3. when there is pressure on stripe cache (r5c_check_stripe_cache_usage)
4. when there is pressure on journal space (r5l_write_stripe, r5c_cache_data)
r5c_do_reclaim() contains new logic of reclaim.
For stripe cache:
When stripe cache pressure is high (more than 3/4 stripes are cached,
or there is empty inactive lists), flush all full stripe. If fewer
than R5C_RECLAIM_STRIPE_GROUP (NR_STRIPE_HASH_LOCKS * 2) full stripes
are flushed, flush some paritial stripes. When stripe cache pressure
is moderate (1/2 to 3/4 of stripes are cached), flush all full stripes.
For log space:
To avoid deadlock due to log space, we need to reserve enough space
to flush cached data. The size of required log space depends on total
number of cached stripes (stripe_in_journal_count). In current
implementation, the writing-out phase automatically include pending
data writes with parity writes (similar to write through case).
Therefore, we need up to (conf->raid_disks + 1) pages for each cached
stripe (1 page for meta data, raid_disks pages for all data and
parity). r5c_log_required_to_flush_cache() calculates log space
required to flush cache. In the following, we refer to the space
calculated by r5c_log_required_to_flush_cache() as
reclaim_required_space.
Two flags are added to r5conf->cache_state: R5C_LOG_TIGHT and
R5C_LOG_CRITICAL. R5C_LOG_TIGHT is set when free space on the log
device is less than 3x of reclaim_required_space. R5C_LOG_CRITICAL
is set when free space on the log device is less than 2x of
reclaim_required_space.
r5c_cache keeps all data in cache (not fully committed to RAID) in
a list (stripe_in_journal_list). These stripes are in the order of their
first appearance on the journal. So the log tail (last_checkpoint)
should point to the journal_start of the first item in the list.
When R5C_LOG_TIGHT is set, r5l_reclaim_thread starts flushing out
stripes at the head of stripe_in_journal. When R5C_LOG_CRITICAL is
set, the state machine only writes data that are already in the
log device (in stripe_in_journal_list).
This patch includes a fix to improve performance by
Shaohua Li <shli@fb.com>.
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
As described in previous patch, write back cache operates in two
phases: caching and writing-out. The caching phase works as:
1. write data to journal
(r5c_handle_stripe_dirtying, r5c_cache_data)
2. call bio_endio
(r5c_handle_data_cached, r5c_return_dev_pending_writes).
Then the writing-out phase is as:
1. Mark the stripe as write-out (r5c_make_stripe_write_out)
2. Calcualte parity (reconstruct or RMW)
3. Write parity (and maybe some other data) to journal device
4. Write data and parity to RAID disks
This patch implements caching phase. The cache is integrated with
stripe cache of raid456. It leverages code of r5l_log to write
data to journal device.
Writing-out phase of the cache is implemented in the next patch.
With r5cache, write operation does not wait for parity calculation
and write out, so the write latency is lower (1 write to journal
device vs. read and then write to raid disks). Also, r5cache will
reduce RAID overhead (multipile IO due to read-modify-write of
parity) and provide more opportunities of full stripe writes.
This patch adds 2 flags to stripe_head.state:
- STRIPE_R5C_PARTIAL_STRIPE,
- STRIPE_R5C_FULL_STRIPE,
Instead of inactive_list, stripes with cached data are tracked in
r5conf->r5c_full_stripe_list and r5conf->r5c_partial_stripe_list.
STRIPE_R5C_FULL_STRIPE and STRIPE_R5C_PARTIAL_STRIPE are flags for
stripes in these lists. Note: stripes in r5c_full/partial_stripe_list
are not considered as "active".
For RMW, the code allocates an extra page for each data block
being updated. This is stored in r5dev->orig_page and the old data
is read into it. Then the prexor calculation subtracts ->orig_page
from the parity block, and the reconstruct calculation adds the
->page data back into the parity block.
r5cache naturally excludes SkipCopy. When the array has write back
cache, async_copy_data() will not skip copy.
There are some known limitations of the cache implementation:
1. Write cache only covers full page writes (R5_OVERWRITE). Writes
of smaller granularity are write through.
2. Only one log io (sh->log_io) for each stripe at anytime. Later
writes for the same stripe have to wait. This can be improved by
moving log_io to r5dev.
3. With writeback cache, read path must enter state machine, which
is a significant bottleneck for some workloads.
4. There is no per stripe checkpoint (with r5l_payload_flush) in
the log, so recovery code has to replay more than necessary data
(sometimes all the log from last_checkpoint). This reduces
availability of the array.
This patch includes a fix proposed by ZhengYuan Liu
<liuzhengyuan@kylinos.cn>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
This patch adds state machine for raid5-cache. With log device, the
raid456 array could operate in two different modes (r5c_journal_mode):
- write-back (R5C_MODE_WRITE_BACK)
- write-through (R5C_MODE_WRITE_THROUGH)
Existing code of raid5-cache only has write-through mode. For write-back
cache, it is necessary to extend the state machine.
With write-back cache, every stripe could operate in two different
phases:
- caching
- writing-out
In caching phase, the stripe handles writes as:
- write to journal
- return IO
In writing-out phase, the stripe behaviors as a stripe in write through
mode R5C_MODE_WRITE_THROUGH.
STRIPE_R5C_CACHING is added to sh->state to differentiate caching and
writing-out phase.
Please note: this is a "no-op" patch for raid5-cache write-through
mode.
The following detailed explanation is copied from the raid5-cache.c:
/*
* raid5 cache state machine
*
* With rhe RAID cache, each stripe works in two phases:
* - caching phase
* - writing-out phase
*
* These two phases are controlled by bit STRIPE_R5C_CACHING:
* if STRIPE_R5C_CACHING == 0, the stripe is in writing-out phase
* if STRIPE_R5C_CACHING == 1, the stripe is in caching phase
*
* When there is no journal, or the journal is in write-through mode,
* the stripe is always in writing-out phase.
*
* For write-back journal, the stripe is sent to caching phase on write
* (r5c_handle_stripe_dirtying). r5c_make_stripe_write_out() kicks off
* the write-out phase by clearing STRIPE_R5C_CACHING.
*
* Stripes in caching phase do not write the raid disks. Instead, all
* writes are committed from the log device. Therefore, a stripe in
* caching phase handles writes as:
* - write to log device
* - return IO
*
* Stripes in writing-out phase handle writes as:
* - calculate parity
* - write pending data and parity to journal
* - write data and parity to raid disks
* - return IO for pending writes
*/
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Move some define and inline functions to raid5.h, so they can be
used in raid5-cache.c
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Revert commit 11367799f3 ("md: Prevent IO hold during accessing to faulty
raid5 array") as it doesn't comply with commit c3cce6cda1 ("md/raid5:
ensure device failure recorded before write request returns."). That change
is not required anymore as the problem is resolved by commit 16f889499a
("md: report 'write_pending' state when array in sync") - read request is
stuck as array state is not reported correctly via sysfs attribute.
Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Remove the WRITE_* and READ_SYNC wrappers, and just use the flags
directly. Where applicable this also drops usage of the
bio_set_op_attrs wrapper.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
Pull MD updates from Shaohua Li:
"This update includes:
- new AVX512 instruction based raid6 gen/recovery algorithm
- a couple of md-cluster related bug fixes
- fix a potential deadlock
- set nonrotational bit for raid array with SSD
- set correct max_hw_sectors for raid5/6, which hopefuly can improve
performance a little bit
- other minor fixes"
* tag 'md/4.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md:
md: set rotational bit
raid6/test/test.c: bug fix: Specify aligned(alignment) attributes to the char arrays
raid5: handle register_shrinker failure
raid5: fix to detect failure of register_shrinker
md: fix a potential deadlock
md/bitmap: fix wrong cleanup
raid5: allow arbitrary max_hw_sectors
lib/raid6: Add AVX512 optimized xor_syndrome functions
lib/raid6/test/Makefile: Add avx512 gen_syndrome and recovery functions
lib/raid6: Add AVX512 optimized recovery functions
lib/raid6: Add AVX512 optimized gen_syndrome functions
md-cluster: make resync lock also could be interruptted
md-cluster: introduce dlm_lock_sync_interruptible to fix tasks hang
md-cluster: convert the completion to wait queue
md-cluster: protect md_find_rdev_nr_rcu with rcu lock
md-cluster: clean related infos of cluster
md: changes for MD_STILL_CLOSED flag
md-cluster: remove some unnecessary dlm_unlock_sync
md-cluster: use FORCEUNLOCK in lockres_free
md-cluster: call md_kick_rdev_from_array once ack failed
Pull CPU hotplug updates from Thomas Gleixner:
"Yet another batch of cpu hotplug core updates and conversions:
- Provide core infrastructure for multi instance drivers so the
drivers do not have to keep custom lists.
- Convert custom lists to the new infrastructure. The block-mq custom
list conversion comes through the block tree and makes the diffstat
tip over to more lines removed than added.
- Handle unbalanced hotplug enable/disable calls more gracefully.
- Remove the obsolete CPU_STARTING/DYING notifier support.
- Convert another batch of notifier users.
The relayfs changes which conflicted with the conversion have been
shipped to me by Andrew.
The remaining lot is targeted for 4.10 so that we finally can remove
the rest of the notifiers"
* 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (46 commits)
cpufreq: Fix up conversion to hotplug state machine
blk/mq: Reserve hotplug states for block multiqueue
x86/apic/uv: Convert to hotplug state machine
s390/mm/pfault: Convert to hotplug state machine
mips/loongson/smp: Convert to hotplug state machine
mips/octeon/smp: Convert to hotplug state machine
fault-injection/cpu: Convert to hotplug state machine
padata: Convert to hotplug state machine
cpufreq: Convert to hotplug state machine
ACPI/processor: Convert to hotplug state machine
virtio scsi: Convert to hotplug state machine
oprofile/timer: Convert to hotplug state machine
block/softirq: Convert to hotplug state machine
lib/irq_poll: Convert to hotplug state machine
x86/microcode: Convert to hotplug state machine
sh/SH-X3 SMP: Convert to hotplug state machine
ia64/mca: Convert to hotplug state machine
ARM/OMAP/wakeupgen: Convert to hotplug state machine
ARM/shmobile: Convert to hotplug state machine
arm64/FP/SIMD: Convert to hotplug state machine
...
register_shrinker() now can fail. When it happens, shrinker.nr_deferred is
null. We use it to determine if unregister_shrinker is required.
Signed-off-by: Shaohua Li <shli@fb.com>
register_shrinker can fail after commit 1d3d4437ea ("vmscan: per-node
deferred work"), we should detect the failure of it, otherwise we may
fail to register shrinker after raid5 configuration was setup successfully.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Shaohua Li <shli@fb.com>
raid5 will split bio to proper size internally, there is no point to use
underlayer disk's max_hw_sectors. In my qemu system, without the change,
the raid5 only receives 128k size bio, which reduces the chance of bio
merge sending to underlayer disks.
Signed-off-by: Shaohua Li <shli@fb.com>
commit 5f9d1fde7d54a5(raid5: fix memory leak of bio integrity data)
moves bio_reset to bio_endio. But it introduces a small race condition.
It does bio_reset after raid5_release_stripe, which could make the
stripe reusable and hence reuse the bio just before bio_reset. Moving
bio_reset before raid5_release_stripe is called should fix the race.
Reported-and-tested-by: Stefan Priebe - Profihost AG <s.priebe@profihost.ag>
Signed-off-by: Shaohua Li <shli@fb.com>
If there aren't enough stripes, reshape will hang. We have a check for
this in new reshape, but miss it for reshape resume, hence we could see
hang in reshape resume. This patch forces enough stripes existed if
reshape resumes.
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Pull MD fixes from Shaohua Li:
"This includes several bug fixes:
- Alexey Obitotskiy fixed a hang for faulty raid5 array with external
management
- Song Liu fixed two raid5 journal related bugs
- Tomasz Majchrzak fixed a bad block recording issue and an
accounting issue for raid10
- ZhengYuan Liu fixed an accounting issue for raid5
- I fixed a potential race condition and memory leak with DIF/DIX
enabled
- other trival fixes"
* tag 'md/4.8-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md:
raid5: avoid unnecessary bio data set
raid5: fix memory leak of bio integrity data
raid10: record correct address of bad block
md-cluster: fix error return code in join()
r5cache: set MD_JOURNAL_CLEAN correctly
md: don't print the same repeated messages about delayed sync operation
md: remove obsolete ret in md_start_sync
md: do not count journal as spare in GET_ARRAY_INFO
md: Prevent IO hold during accessing to faulty raid5 array
MD: hold mddev lock to change bitmap location
raid5: fix incorrectly counter of conf->empty_inactive_list_nr
raid10: increment write counter after bio is split
bio_reset doesn't change bi_io_vec and bi_max_vecs, so we don't need to
set them every time. bi_private will be set before the bio is
dispatched.
Signed-off-by: Shaohua Li <shli@fb.com>
Yi reported a memory leak of raid5 with DIF/DIX enabled disks. raid5
doesn't alloc/free bio, instead it reuses bios. There are two issues in
current code:
1. the code calls bio_init (from
init_stripe->raid5_build_block->bio_init) then bio_reset (ops_run_io).
The bio is reused, so likely there is integrity data attached. bio_init
will clear a pointer to integrity data and makes bio_reset can't release
the data
2. bio_reset is called before dispatching bio. After bio is finished,
it's possible we don't free bio's integrity data (eg, we don't call
bio_reset again)
Both issues will cause memory leak. The patch moves bio_init to stripe
creation and bio_reset to bio end io. This will fix the two issues.
Reported-by: Yi Zhang <yizhan@redhat.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Currently, the code sets MD_JOURNAL_CLEAN when the array has
MD_FEATURE_JOURNAL and the recovery_cp is MaxSector. The array
will be MD_JOURNAL_CLEAN even if the journal device is missing.
With this patch, the MD_JOURNAL_CLEAN is only set when the journal
device presents.
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Since commit 63a4cc2486, bio->bi_rw contains flags in the lower
portion and the op code in the higher portions. This means that
old code that relies on manually setting bi_rw is most likely
going to be broken. Instead of letting that brokeness linger,
rename the member, to force old and out-of-tree code to break
at compile time instead of at runtime.
No intended functional changes in this commit.
Signed-off-by: Jens Axboe <axboe@fb.com>
After array enters in faulty state (e.g. number of failed drives
becomes more then accepted for raid5 level) it sets error flags
(one of this flags is MD_CHANGE_PENDING). For internal metadata
arrays MD_CHANGE_PENDING cleared into md_update_sb, but not for
external metadata arrays. MD_CHANGE_PENDING flag set prevents to
finish all new or non-finished IOs to array and hold them in
pending state. In some cases this can leads to deadlock situation.
For example, we have faulty array (2 of 4 drives failed) and
udev handle array state changes and blkid started (or other
userspace application that used array to read/write) but unable
to finish reads due to IO hold. At the same time we unable to get
exclusive access to array (to stop array in our case) because
another external application still use this array.
Fix makes possible to return IO with errors immediately.
So external application can finish working with array and
give exclusive access to other applications to perform
required management actions with array.
Signed-off-by: Alexey Obitotskiy <aleksey.obitotskiy@intel.com>
Signed-off-by: Shaohua Li <shli@fb.com>
The counter conf->empty_inactive_list_nr is only used for determine if the
raid5 is congested which is deal with in function raid5_congested().
It was increased in get_free_stripe() when conf->inactive_list got to be
empty and decreased in release_inactive_stripe_list() when splice
temp_inactive_list to conf->inactive_list. However, this may have a
problem when raid5_get_active_stripe or stripe_add_to_batch_list was called,
because these two functions may call list_del_init(&sh->lru) to delete sh from
"conf->inactive_list + hash" which may cause "conf->inactive_list + hash" to
be empty when atomic_inc_not_zero(&sh->count) got false. So a check should be
done at these two point and increase empty_inactive_list_nr accordingly.
Otherwise the counter may get to be negative number which would influence
async readahead from VFS.
Signed-off-by: ZhengYuan Liu <liuzhengyuan@kylinos.cn>
Signed-off-by: Shaohua Li <shli@fb.com>
These two are confusing leftover of the old world order, combining
values of the REQ_OP_ and REQ_ namespaces. For callers that don't
special case we mostly just replace bi_rw with bio_data_dir or
op_is_write, except for the few cases where a switch over the REQ_OP_
values makes more sense. Any check for READA is replaced with an
explicit check for REQ_RAHEAD. Also remove the READA alias for
REQ_RAHEAD.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Mike Christie <mchristi@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Every time a device is removed with ->hot_remove_disk() a synchronize_rcu() call is made
which can delay several milliseconds in some case.
If lots of devices fail at once - as could happen with a large RAID10 where one set
of devices are removed all at once - these delays can add up to be very inconcenient.
As failure is not reversible we can check for that first, setting a
separate flag if it is found, and then all synchronize_rcu() once for
all the flagged devices. Then ->hot_remove_disk() function can skip the
synchronize_rcu() step if the flag is set.
fix build error(Shaohua)
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
It is important that we never increment rdev->nr_pending on a Faulty
device as ->hot_remove_disk() assumes that once the Faulty flag is visible
no code will take a new reference.
Some places take a new reference after only check In_sync. This should
be safe as the two are changed together. However to make the code more
obviously safe, add checks for 'Faulty' as well.
Note: the actual rule is:
Never increment nr_pending if Faulty is set and Blocked is clear,
never clear Faulty, and never set Blocked without holding a reference
through nr_pending.
fix build error (Shaohua)
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Being in the middle of resync is no longer protection against failed
rdevs disappearing. So add rcu protection.
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
The rdev could be freed while handle_failed_sync is running, so
rcu protection is needed.
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
To avoid confusion between REQ_OP_FLUSH, which is handled by
request_fn drivers, and upper layers requesting the block layer
perform a flush sequence along with possibly a WRITE, this patch
renames REQ_FLUSH to REQ_PREFLUSH.
Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>