This contains a few patches that depend on
plugging changes in the block layer so needs to wait
for those.
It also contains a Kconfig fix for the new RAID10 support
in dm-raid.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.18 (GNU/Linux)
iQIVAwUAUBnKUznsnt1WYoG5AQJOQA/+M7RoVnF63+TbGIqdNDotuF8FxvudCZBl
Ou2yG47EOPtWf/RoqPyfpydDgdjyXsk4T5TfXoc0hsXVr4shCYo51uT9K34TMSDJ
2GzGWuyugRJFyvxW7PBgM+zFWlcVdgUGcwsdmIUMtHRz8Q10TqO5fE22RNLkhwOl
fvGCK1KYnQqlG87DbulHWMo22vyZVic8jBqFSw55CPuuFMSJMxCw0rOPUnvk5Q8v
jWzZzuUqrM8iiOxTDHsbCA0IleCbGl/m0tgk02Vj4tkCvz9N/xzQW2se0H6uECiK
k8odbAiNBOh1q135sa7ASrBzxT+JqSiQ25rLheTEzzNxjFv6/NlntXmYu6HB+lD3
DoHAvRjgMxiLCdisW6TJb10NItitXwE/HSpQOVRxyYtINdzmhIDaCccgfN8ZMkho
nmE/uzO+CAoCFpZC2C/nY8D0BZs5fw4hgDAsci66mvs+88dy+SoA4AbyNEMAusOS
tiL8ZEjnYXvxTh3JFaMIaqQd6PkbahmtEtvorwXsUYUdY0ybkcs2FYVksvkgYdyW
WlejOZVurY2i5biqck3UqjesxeJA5TMAlAUQR7vXu1Fa9fYFXZbqJom/KnPRTfek
xerCWPMbhuzmcyEjUOGfjs6GFEnEmRT6Q6fN3CBaQMS2Q/z+6AkTOXKVl5Fhvoyl
aeu1m8nZLuI=
=ovN2
-----END PGP SIGNATURE-----
Merge tag 'md-3.6' of git://neil.brown.name/md
Pull additional md update from NeilBrown:
"This contains a few patches that depend on plugging changes in the
block layer so needed to wait for those.
It also contains a Kconfig fix for the new RAID10 support in dm-raid."
* tag 'md-3.6' of git://neil.brown.name/md:
md/dm-raid: DM_RAID should select MD_RAID10
md/raid1: submit IO from originating thread instead of md thread.
raid5: raid5d handle stripe in batch way
raid5: make_request use batch stripe release
Now that DM_RAID supports raid10, it needs to select that code
to ensure it is included.
Cc: Jonathan Brassow <jbrassow@redhat.com>
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
queuing writes to the md thread means that all requests go through the
one processor which may not be able to keep up with very high request
rates.
So use the plugging infrastructure to submit all requests on unplug.
If a 'schedule' is needed, we fall back on the old approach of handing
the requests to the thread for it to handle.
Signed-off-by: NeilBrown <neilb@suse.de>
Let raid5d handle stripe in batch way to reduce conf->device_lock locking.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
make_request() does stripe release for every stripe and the stripe usually has
count 1, which makes previous release_stripe() optimization not work. In my
test, this release_stripe() becomes the heaviest pleace to take
conf->device_lock after previous patches applied.
Below patch makes stripe release batch. All the stripes will be released in
unplug. The STRIPE_ON_UNPLUG_LIST bit is to protect concurrent access stripe
lru.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Pull block driver changes from Jens Axboe:
- Making the plugging support for drivers a bit more sane from Neil.
This supersedes the plugging change from Shaohua as well.
- The usual round of drbd updates.
- Using a tail add instead of a head add in the request completion for
ndb, making us find the most completed request more quickly.
- A few floppy changes, getting rid of a duplicated flag and also
running the floppy init async (since it takes forever in boot terms)
from Andi.
* 'for-3.6/drivers' of git://git.kernel.dk/linux-block:
floppy: remove duplicated flag FD_RAW_NEED_DISK
blk: pass from_schedule to non-request unplug functions.
block: stack unplug
blk: centralize non-request unplug handling.
md: remove plug_cnt feature of plugging.
block/nbd: micro-optimization in nbd request completion
drbd: announce FLUSH/FUA capability to upper layers
drbd: fix max_bio_size to be unsigned
drbd: flush drbd work queue before invalidate/invalidate remote
drbd: fix potential access after free
drbd: call local-io-error handler early
drbd: do not reset rs_pending_cnt too early
drbd: reset congestion information before reporting it in /proc/drbd
drbd: report congestion if we are waiting for some userland callback
drbd: differentiate between normal and forced detach
drbd: cleanup, remove two unused global flags
floppy: Run floppy initialization asynchronous
Pull md updates from NeilBrown.
* 'for-next' of git://neil.brown.name/md:
DM RAID: Add support for MD RAID10
md/RAID1: Add missing case for attempting to repair known bad blocks.
md/raid5: For odirect-write performance, do not set STRIPE_PREREAD_ACTIVE.
md/raid1: don't abort a resync on the first badblock.
md: remove duplicated test on ->openers when calling do_md_stop()
raid5: Add R5_ReadNoMerge flag which prevent bio from merging at block layer
md/raid1: prevent merging too large request
md/raid1: read balance chooses idlest disk for SSD
md/raid1: make sequential read detection per disk based
MD RAID10: Export md_raid10_congested
MD: Move macros from raid1*.h to raid1*.c
MD RAID1: rename mirror_info structure
MD RAID10: rename mirror_info structure
MD RAID10: Fix compiler warning.
raid5: add a per-stripe lock
raid5: remove unnecessary bitmap write optimization
raid5: lockless access raid5 overrided bi_phys_segments
raid5: reduce chance release_stripe() taking device_lock
This will allow md/raid to know why the unplug was called,
and will be able to act according - if !from_schedule it
is safe to perform tasks which could themselves schedule.
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Both md and umem has similar code for getting notified on an
blk_finish_plug event.
Centralize this code in block/ and allow each driver to
provide its distinctive difference.
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This seemed like a good idea at the time, but after further thought I
cannot see it making a difference other than very occasionally and
testing to try to exercise the case it is most likely to help did not
show any performance difference by removing it.
So remove the counting of active plugs and allow 'pending writes' to
be activated at any time, not just when no plugs are active.
This is only relevant when there is a write-intent bitmap, and the
updating of the bitmap will likely introduce enough delay that
the single-threading of bitmap updates will be enough to collect large
numbers of updates together.
Removing this will make it easier to centralise the unplug code, and
will clear the other for other unplug enhancements which have a
measurable effect.
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When doing resync or repair, attempt to correct bad blocks, according
to WriteErrorSeen policy
Signed-off-by: Alex Lyakas <alex.bolshoy@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Merge Andrew's first set of patches:
"Non-MM patches:
- lots of misc bits
- tree-wide have_clk() cleanups
- quite a lot of printk tweaks. I draw your attention to "printk:
convert the format for KERN_<LEVEL> to a 2 byte pattern" which
looks a bit scary. But afaict it's solid.
- backlight updates
- lib/ feature work (notably the addition and use of memweight())
- checkpatch updates
- rtc updates
- nilfs updates
- fatfs updates (partial, still waiting for acks)
- kdump, proc, fork, IPC, sysctl, taskstats, pps, etc
- new fault-injection feature work"
* Merge emailed patches from Andrew Morton <akpm@linux-foundation.org>: (128 commits)
drivers/misc/lkdtm.c: fix missing allocation failure check
lib/scatterlist: do not re-write gfp_flags in __sg_alloc_table()
fault-injection: add tool to run command with failslab or fail_page_alloc
fault-injection: add selftests for cpu and memory hotplug
powerpc: pSeries reconfig notifier error injection module
memory: memory notifier error injection module
PM: PM notifier error injection module
cpu: rewrite cpu-notifier-error-inject module
fault-injection: notifier error injection
c/r: fcntl: add F_GETOWNER_UIDS option
resource: make sure requested range is included in the root range
include/linux/aio.h: cpp->C conversions
fs: cachefiles: add support for large files in filesystem caching
pps: return PTR_ERR on error in device_create
taskstats: check nla_reserve() return
sysctl: suppress kmemleak messages
ipc: use Kconfig options for __ARCH_WANT_[COMPAT_]IPC_PARSE_VERSION
ipc: compat: use signed size_t types for msgsnd and msgrcv
ipc: allow compat IPC version field parsing if !ARCH_WANT_OLD_COMPAT_IPC
ipc: add COMPAT_SHMLBA support
...
Use memweight() to count the total number of bits set in memory area.
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Alasdair Kergon <agk@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
'sync' writes set both REQ_SYNC and REQ_NOIDLE.
O_DIRECT writes set REQ_SYNC but not REQ_NOIDLE.
We currently assume that a REQ_SYNC request will not be followed by
more requests and so set STRIPE_PREREAD_ACTIVE to expedite the
request.
This is appropriate for sync requests, but not for O_DIRECT requests.
So make the setting of STRIPE_PREREAD_ACTIVE conditional on REQ_NOIDLE
rather than REQ_SYNC. This is consistent with the documented meaning
of REQ_NOIDLE:
__REQ_NOIDLE, /* don't anticipate more IO after this one */
Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
If a resync of a RAID1 array with 2 devices finds a known bad block
one device it will neither read from, or write to, that device for
this block offset.
So there will be one read_target (The other device) and zero write
targets.
This condition causes md/raid1 to abort the resync assuming that it
has finished - without known bad blocks this would be true.
When there are no write targets because of the presence of bad blocks
we should only skip over the area covered by the bad block.
RAID10 already gets this right, raid1 doesn't. Or didn't.
As this can cause a 'sync' to abort early and appear to have succeeded
it could lead to some data corruption, so it suitable for -stable.
Cc: stable@vger.kernel.org
Reported-by: Alexander Lyakas <alex.bolshoy@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
do_md_stop tests mddev->openers while holding ->open_mutex,
and fails if this count is too high.
So callers do not need to check mddev->openers and doing so isn't
very meaningful as they don't hold ->open_mutex so the number could
change.
So remove the unnecessary tests on mddev->openers.
These are not called often enough for there to be any gain in
an early test on ->open_mutex to avoid the need for a slightly more
costly mutex_lock call.
Signed-off-by: NeilBrown <neilb@suse.de>
Because bios will merge at block-layer,so bios-error may caused by other
bio which be merged into to the same request.
Using this flag,it will find exactly error-sector and not do redundant
operation like re-write and re-read.
V0->V1:Using REQ_FLUSH instead REQ_NOMERGE avoid bio merging at block
layer.
Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
For SSD, if request size exceeds specific value (optimal io size), request size
isn't important for bandwidth. In such condition, if making request size bigger
will cause some disks idle, the total throughput will actually drop. A good
example is doing a readahead in a two-disk raid1 setup.
So when should we split big requests? We absolutly don't want to split big
request to very small requests. Even in SSD, big request transfer is more
efficient. This patch only considers request with size above optimal io size.
If all disks are busy, is it worth doing a split? Say optimal io size is 16k,
two requests 32k and two disks. We can let each disk run one 32k request, or
split the requests to 4 16k requests and each disk runs two. It's hard to say
which case is better, depending on hardware.
So only consider case where there are idle disks. For readahead, split is
always better in this case. And in my test, below patch can improve > 30%
thoughput. Hmm, not 100%, because disk isn't 100% busy.
Such case can happen not just in readahead, for example, in directio. But I
suppose directio usually will have bigger IO depth and make all disks busy, so
I ignored it.
Note: if the raid uses any hard disk, we don't prevent merging. That will make
performace worse.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
SSD hasn't spindle, distance between requests means nothing. And the original
distance based algorithm sometimes can cause severe performance issue for SSD
raid.
Considering two thread groups, one accesses file A, the other access file B.
The first group will access one disk and the second will access the other disk,
because requests are near from one group and far between groups. In this case,
read balance might keep one disk very busy but the other relative idle. For
SSD, we should try best to distribute requests to as many disks as possible.
There isn't spindle move penality anyway.
With below patch, I can see more than 50% throughput improvement sometimes
depending on workloads.
The only exception is small requests can be merged to a big request which
typically can drive higher throughput for SSD too. Such small requests are
sequential reads. Unlike hard disk, sequential read which can't be merged (for
example direct IO, or read without readahead) can be ignored for SSD. Again
there is no spindle move penality. readahead dispatches small requests and such
requests can be merged.
Last patch can help detect sequential read well, at least if concurrent read
number isn't greater than raid disk number. In that case, distance based
algorithm doesn't work well too.
V2: For hard disk and SSD mixed raid, doesn't use distance based algorithm for
random IO too. This makes the algorithm generic for raid with SSD.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Currently the sequential read detection is global wide. It's natural to make it
per disk based, which can improve the detection for concurrent multiple
sequential reads. And next patch will make SSD read balance not use distance
based algorithm, where this change help detect truly sequential read for SSD.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
md/raid10: Export is_congested test.
In similar fashion to commits
11d8a6e3711ed7242e59
we export the RAID10 congestion checking function so that dm-raid.c can
make use of it and make use of the personality. The 'queue' and 'gendisk'
structures will not be available to the MD code when device-mapper sets
up the device, so we conditionalize access to these fields also.
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
MD RAID1/RAID10: Move some macros from .h file to .c file
There are three macros (IO_BLOCKED,IO_MADE_GOOD,BIO_SPECIAL) which are defined
in both raid1.h and raid10.h. They are only used in there respective .c files.
However, if we wish to make RAID10 accessible to the device-mapper RAID
target (dm-raid.c), then we need to move these macros into the .c files where
they are used so that they do not conflict with each other.
The macros from the two files are identical and could be moved into md.h, but
I chose to leave the duplication and have them remain in the personality
files.
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
MD RAID1: Rename the structure 'mirror_info' to 'raid1_info'
The same structure name ('mirror_info') is used by raid10. Each of these
structures are defined in there respective header files. If dm-raid is
to support both RAID1 and RAID10, the header files will be included and
the structure names must not collide. While only one of these structure
names needs to change, this patch adds consistency to the naming of the
structure.
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
MD RAID10: Rename the structure 'mirror_info' to 'raid10_info'
The same structure name ('mirror_info') is used by raid1. Each of these
structures are defined in there respective header files. If dm-raid is
to support both RAID1 and RAID10, the header files will be included and
the structure names must not collide.
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Commit outstanding metadata before returning the status for a dm thin
pool so that the numbers reported are as up-to-date as possible.
The commit is not performed if the device is suspended or if
the DM_NOFLUSH_FLAG is supplied by userspace and passed to the target
through a new 'status_flags' parameter in the target's dm_status_fn.
The userspace dmsetup tool will support the --noflush flag with the
'dmsetup status' and 'dmsetup wait' commands from version 1.02.76
onwards.
Tested-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Add read-only and fail-io modes to thin provisioning.
If a transaction commit fails the pool's metadata device will transition
to "read-only" mode. If a commit fails once already in read-only mode
the transition to "fail-io" mode occurs.
Once in fail-io mode the pool and all associated thin devices will
report a status of "Fail".
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Introduce dm_pool_abort_metadata to abort the current metadata
transaction. Generally this will only be called when bad things are
happening and dm-thin is trying to roll back to a good state for
read-only mode.
It's complicated by the fact that the metadata device may have failed
completely causing the abort to be unable to read the old transaction.
In this case the metadata object is placed in a 'fail' mode and
everything fails apart from destroying it.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Introduce dm_pool_metadata_set_read_only to put the underlying block
manager into read-only mode.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Introduce dm_bm_set_read_only to switch the block manager into a
read-only mode. To be used when dm-thin degrades due to io errors on
the metadata device.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Reduce the number of metadata commits by using
dm_thin_changed_this_transaction to check if metadata was changed on a
per thin device granularity.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Introduce dm_thin_changed_this_transaction to dm-thin-metadata to publish a
useful bit of information we're already tracking. This will help dm thin
decide when to commit.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Add a parameter to dm_pool_metadata_open to indicate whether or not an
unformatted metadata area should be formatted.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Tidy up error path in __open_metadata and __format_metadata in dm-thin-metadata.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Factor out __check_incompat_features and only call it once when we open
the metadata device rather than at the beginning of every transaction.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Remove some duplicate initialisation of struct dm_pool_metadata.
These pmd fields are initialised by both:
__format_metadata's calls to dm_btree_empty
__write_initial_superblock + __begin_transaction
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Remove 'create' parameter from __create_persistent_data_objects() in dm-thin-metadata.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Move the check for __superblock_all_zeroes from
__create_persistent_data_objects() down to __open_or_format_metadata in
dm-thin-metadata.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Remove nr_blocks arg from __create_persistent_data_objects in dm-thin-metadata.
It was always passed as zero.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Split __open_or_format_metadata into __format_metadata and __open_metadata in
dm-thin-metadata.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Clean up __open_or_format_metadata in dm-thin-metadata by using struct
dm_pool_metadata members to replace local variables.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Zero the unused uuid when initialising the metadata superblock.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Lift the call to __begin_transaction out of __write_initial_superblock in
dm-thin-metadata. Called higher up the call chain now.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Move dm_commit_pool_metadata inline into __write_initial_superblock in dm-thin-metadata.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Factor out __write_initial_superblock and also pull some other initial
creation code out of dm_pool_metadata_open.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Lift some initialisation out of __open_or_format_metadata in dm-thin-metadata.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Factor __destroy_persistent_data_objects out of dm_pool_metadata_close.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Move block manager creation and the check for unformatted metadata into
__create_persistent_data_objects().
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Rename init_pmd to __create_persistent_data_objects in dm-thin-metadata.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Introduce wrappers to handle write locking the superblock
appropriately.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Stop using dm_bm_unlock_move when shadowing blocks in the transaction
manager as an optimisation and remove the function as it is then no
longer used.
Some code, such as the space maps, keeps using on-disk data structures
from the previous transaction. It can do this because blocks won't
be reallocated until the subsequent transaction. Using
dm_bm_unlock_move to copy blocks sounds like a win, but it forces a
synchronous read should the old block be accessed.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Tidy the transaction manager creation functions.
They no longer lock the superblock. Superblock locking is pulled out to
the caller.
Also export dm_bm_write_lock_zero.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Remove an optimisation that tracks whether or not a thin metadata commit
is needed.
If dm_pool_commit_metadata() is called and no changes have been made
to the metadata then this optimisation avoided writing to disk.
Removing because we're going to do something better later.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
This patch introduces a separate struct for the block_manager.
It also uses IS_ERR to check the return value of dm_bufio_client_create
instead of testing incorrectly for NULL.
Prior to this patch a struct dm_block_manager was really an alias for
a struct dm_bufio_client. We want to add some functionality to the
block manager that will require extra fields, so this one to one
mapping is no longer valid.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Factor __setup_btree_details out of init_pmd in dm-thin-metadata.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
The thin provisioning target commits internal metadata on flush. So it
should receive flushes regardless of whether the underlying devices
support them.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Allow targets to override the 'supports flush' calculation.
Set 'flush_supported' if a target needs to receive flushes regardless of
whether or not its underlying devices have support.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Introduce bitmap_index_changed to track whether or not the index changed
then only commit a space map if it did.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Unlock the superblock even if initial dm_bufio_write_dirty_buffers fails.
Also, remove redundant flush calls. dm_bm_flush_and_unlock's calls to
dm_bufio_write_dirty_buffers already result in dm_bufio_issue_flush
being called.
This avoids warnings about unflushed dirty buffers from bufio.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
There's no need to break sharing, triggering a copy, for a write that has no
data (i.e. a flush).
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Fix memory leak in process_prepared_mapping by always freeing
the dm_thin_new_mapping structs from the mapping_pool mempool on
the error paths.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Rename sector to cc_sector in dm-crypt's convert_context struct.
This is preparation for a future patch that merges dm_io and
convert_context which both have a "sector" field.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Store the crypt_config struct pointer directly in struct dm_crypt_io
instead of the dm_target struct pointer.
Target information is never used - only target->private is referenced,
thus we can change it to point directly to struct crypt_config.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Move static dm-crypt cipher data out of per-cpu structure.
Cipher information is static, so it does not have to be in a per-cpu
structure.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
In preparation for RAID10 inclusion in dm-raid, we move the sectors_per_dev
calculation later in the device creation process. This is because we won't
know up-front how many stripes vs how many mirrors there are which will
change the calculation.
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
There are two dm crypt structures that have a field called "pending".
This patch renames them to "cc_pending" and "io_pending" to reduce confusion
and ease searching the code.
Also remove unnecessary initialisation of r in crypt_convert_block().
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
In preparation for RAID10 addition to dm-raid, we change an 'if' conditional
to a 'switch' conditional to make it easier to see what is being checked for
each RAID type.
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
A SCSI device handler might get attached to a device during the
initial device scan. We do not necessarily want to override
this when loading a multipath table, so this patch adds a new
multipath feature argument "retain_attached_hw_handler".
During SCSI device scan all loaded SCSI device handlers will be
consulted for a match (via scsi_dh's provided .match). If a match is
found that device handler will be attached. We need a way to have
userspace multipathd's provided 'hw_handler' not override the already
attached hardware handler.
When specifying the new feature 'retain_attached_hw_handler' multipath
will use the currently attached hardware handler instead of trying to
attach the one specified during table load. If no hardware handler is
attached the specified hardware handler will still be used.
Leverages scsi_dh_attach's ability to increment the scsi_dh's reference
count if the same scsi_dh name is provided when attaching - currently
attached scsi_dh name is determined with scsi_dh_attached_handler_name.
Depends upon commit 7e8a74b177
("[SCSI] scsi_dh: add scsi_dh_attached_handler_name").
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Tested-by: Babu Moger <babu.moger@netapp.com>
Reviewed-by: Chandra Seetharaman <sekharan@us.ibm.com>
Acked-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
dm-thin will be most likely used with a block size that is a power of
two. So it should be optimized for this case.
This patch changes division and modulo operations to shifts and bit
masks if block size is a power of two.
A test that bi_sector is divisible by a block size is removed from
io_overlaps_block. Device mapper never sends bios that span a block
boundary. Consequently, if we tested that bi_size is equivalent to block
size, bi_sector must already be on a block boundary.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
This patch sets the variable "ti->split_discard_requests" for the dm thin
target so that device mapper core splits discard requests on a block
boundary.
Consequently, a discard request that spans multiple blocks is never sent
to dm-thin. The patch also removes some code in process_discard that
deals with discards that span multiple blocks.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
This patch introduces a new variable split_discard_requests. It can be
set by targets so that discard requests are split on max_io_len
boundaries.
When split_discard_requests is not set, discard requests are only split on
boundaries between targets, as was the case before this patch.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Non power of 2 blocksize support is needed to properly align thinp IO
on storage that has non power of 2 optimal IO sizes (e.g. RAID6 10+2).
Use sector_div to support non power of 2 blocksize for the pool's
data device. This provides comparable performance to the power of 2
math that was performed until now (as tested on modern x86_64 hardware).
The kernel currently assumes that limits->discard_granularity is a power
of two so the thin target only enables discard support if the block
size is a power of two.
Eliminate pool structure's 'block_shift', 'offset_mask' and
remaining 4 byte holes.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
dm-stripe is usually used with a chunk size that is a power of two.
Use faster shifts and bit masks in such cases.
stripe_width is already optimized in a similar way.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
There is no technical limitation in device mapper that would prevent the
dm-stripe target from using a stripe size smaller than page size.
This patch removes the limit and makes stripe volumes portable across
architectures with different page size.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Support non-power-of-2 chunk sizes with dm striping for proper alignment
of stripe IO on storage that has non-power-of-2 optimal IO sizes (e.g.
RAID6 10+2).
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Remove the restriction that limits a target's specified maximum incoming
I/O size to be a power of 2.
Rename this setting from 'split_io' to the less-ambiguous 'max_io_len'.
Change it from sector_t to uint32_t, which is plenty big enough, and
introduce a wrapper function dm_set_target_max_io_len() to set it.
Use sector_div() to process it now that it is not necessarily a power of 2.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
The structure stripe_c contains a stripes_mask field. This field is
useless because it can be trivially calculated by subtracting one from
stripes. It is used only at one place. This patch removes it.
The patch also changes ffs(stripes) - 1 to __ffs(stripes).
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
dm-stripe is supposed to ensure that all the space allocated to the
stripes is fully used and that all stripes are the same size. This
patch fixes the test. It checks that device length is divisible by the
chunk size and checks that the resulting quotient is divisible by the
number of stripes (which is equivalent to testing if device length is
divisible by chunk_size * stripes).
Previously, the code only tested that the number of sectors in the target
was divisible by each of the chunk size and the number of stripes
separately, which could leave entire stripes unused.
(A setup that genuinely needs some stripes to be shorter than others
can be created by concatenating striped targets.)
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Provide specific error message strings for two pool_ctr() failure cases
that currently give just "Unknown error".
Reference: test_two_pools_pointing_to_the_same_metadata_fails and
test_different_pool_cant_replace_pool in thinp-test-suite.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Introduce THIN_MAX_CONCURRENT_LOCKS into dm-thin-metadata to
give a name to an otherwise "magic" number.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Remove the pointless label 'out' from __commit_transaction in
dm-thin-metadata.c
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Remove debug space map checker from dm persistent data.
The space map checker is a wrapper for other space maps that double
checks the reference counts are correct. It holds all these reference
counts in memory rather than on disk, so uses a lot of memory and is
thus restricted to small pools.
As yet, this checker hasn't found any issues, but has caused a few of
its own due to people turning it on by default with larger pools.
Removing.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Clean up "warning: dubious: !x & y". Also make it clear that
__snapshotted_since() returns a bool and that dm_thin_lookup_result's
'shared' member is a flag.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Reduce the slab size used for the dm_thin_endio_hook mempool.
Allocation has been seen to fail on machines with smaller amounts
of memory due to fragmentation.
lvm: page allocation failure. order:5, mode:0xd0
device-mapper: table: 253:38: thin-pool: Error creating pool's endio_hook mempool
Cc: stable@vger.kernel.org
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
- avoid a crash in dm-raid1 when discards coincide with mirror recovery;
- avoid discarding shared data that's still needed in dm-thin;
- don't guarantee that discarded blocks will be wiped in dm-raid1.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
iQIcBAABAgAGBQJQCV8qAAoJEK2W1qbAHj1niSAP/2K0RkgWvL0hwuaM+us0oh29
XFou6Tb9pH+//QfKOJuClHeSfZFoHYuevvJPtwTqPlHGONE2YXeBtVmyp0k+BS69
xoaQy+OoZFrEbhxyJFrg+lDcxVGRtvo7x9zegeRf++o/skRfRgAjzyLkI8bk4t3v
c3vSDTVBikJXlTxa+J7EQpeW29DBiky+tIHQQx0+98u2VSlaFFP6MdLr1ROeq7yF
+z3kEXk6qzwL9ZHTWuVCvhi7bw4i18UTrH0wxZuUXWRpz+Va5h7w+/zcQbau6D/s
K+BmlAW/fxzZOW4guFU6pCLlVGU4BsJxUXT55UaP4Dx9UuV59EtIPsDb8/Y/pGMX
t9xnC4GmSOjw52pW2VR2gUJwG/c5mJ9g/mdP6twQzcC4JJ+CYg4Q5lH88qzDqceS
VCrW681nIKIVoja5n1adv6gbZax8hlR/z8ElXrqELDmXk7nKBLOLdDVSXzZ9ceX1
RnvtAZE/zrxcslKHw52Sd37c8YRer/fgx3kQxhXd1nb096DgiWvE/taD/ixjWHQX
Eu1KrQIelvw63/BNNTKYRF7xS0dGKsGNaXWln7cMONG28CnrWG/8f+mp+KG73x5e
Fc8yCONHNbqmf95yx1N0MgfYlZFjBBw0+BtqmR7QVcnG3r4SaSug+F72SPb5nN/B
ZBmwNcSBaaC952+5pMZa
=gbLp
-----END PGP SIGNATURE-----
Merge tag 'dm-3.5-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-dm
Pull device-mapper discard fixes from Alasdair G Kergon:
- avoid a crash in dm-raid1 when discards coincide with mirror
recovery;
- avoid discarding shared data that's still needed in dm-thin;
- don't guarantee that discarded blocks will be wiped in dm-raid1.
* tag 'dm-3.5-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-dm:
dm raid1: set discard_zeroes_data_unsupported
dm thin: do not send discards to shared blocks
dm raid1: fix crash with mirror recovery and discard
We can't guarantee that REQ_DISCARD on dm-mirror zeroes the data even if
the underlying disks support zero on discard. So this patch sets
ti->discard_zeroes_data_unsupported.
For example, if the mirror is in the process of resynchronizing, it may
happen that kcopyd reads a piece of data, then discard is sent on the
same area and then kcopyd writes the piece of data to another leg.
Consequently, the data is not zeroed.
The flag was made available by commit 983c7db347
(dm crypt: always disable discard_zeroes_data).
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@kernel.org
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
When process_discard receives a partial discard that doesn't cover a
full block, it sends this discard down to that block. Unfortunately, the
block can be shared and the discard would corrupt the other snapshots
sharing this block.
This patch detects block sharing and ends the discard with success when
sending it to the shared block.
The above change means that if the device supports discard it can't be
guaranteed that a discard request zeroes data. Therefore, we set
ti->discard_zeroes_data_unsupported.
Thin target discard support with this bug arrived in commit
104655fd4d (dm thin: support discards).
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@kernel.org
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
This patch fixes a crash when a discard request is sent during mirror
recovery.
Firstly, some background. Generally, the following sequence happens during
mirror synchronization:
- function do_recovery is called
- do_recovery calls dm_rh_recovery_prepare
- dm_rh_recovery_prepare uses a semaphore to limit the number
simultaneously recovered regions (by default the semaphore value is 1,
so only one region at a time is recovered)
- dm_rh_recovery_prepare calls __rh_recovery_prepare,
__rh_recovery_prepare asks the log driver for the next region to
recover. Then, it sets the region state to DM_RH_RECOVERING. If there
are no pending I/Os on this region, the region is added to
quiesced_regions list. If there are pending I/Os, the region is not
added to any list. It is added to the quiesced_regions list later (by
dm_rh_dec function) when all I/Os finish.
- when the region is on quiesced_regions list, there are no I/Os in
flight on this region. The region is popped from the list in
dm_rh_recovery_start function. Then, a kcopyd job is started in the
recover function.
- when the kcopyd job finishes, recovery_complete is called. It calls
dm_rh_recovery_end. dm_rh_recovery_end adds the region to
recovered_regions or failed_recovered_regions list (depending on
whether the copy operation was successful or not).
The above mechanism assumes that if the region is in DM_RH_RECOVERING
state, no new I/Os are started on this region. When I/O is started,
dm_rh_inc_pending is called, which increases reg->pending count. When
I/O is finished, dm_rh_dec is called. It decreases reg->pending count.
If the count is zero and the region was in DM_RH_RECOVERING state,
dm_rh_dec adds it to the quiesced_regions list.
Consequently, if we call dm_rh_inc_pending/dm_rh_dec while the region is
in DM_RH_RECOVERING state, it could be added to quiesced_regions list
multiple times or it could be added to this list when kcopyd is copying
data (it is assumed that the region is not on any list while kcopyd does
its jobs). This results in memory corruption and crash.
There already exist bypasses for REQ_FLUSH requests: REQ_FLUSH requests
do not belong to any region, so they are always added to the sync list
in do_writes. dm_rh_inc_pending does not increase count for REQ_FLUSH
requests. In mirror_end_io, dm_rh_dec is never called for REQ_FLUSH
requests. These bypasses avoid the crash possibility described above.
These bypasses were improperly implemented for REQ_DISCARD when
the mirror target gained discard support in commit
5fc2ffeabb (dm raid1: support discard).
In do_writes, REQ_DISCARD requests is always added to the sync queue and
immediately dispatched (even if the region is in DM_RH_RECOVERING). However,
dm_rh_inc and dm_rh_dec is called for REQ_DISCARD resusts. So it violates the
rule that no I/Os are started on DM_RH_RECOVERING regions, and causes the list
corruption described above.
This patch changes it so that REQ_DISCARD requests follow the same path
as REQ_FLUSH. This avoids the crash.
Reference: https://bugzilla.redhat.com/837607
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@kernel.org
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Add a per-stripe lock to protect stripe specific data. The purpose is to reduce
lock contention of conf->device_lock.
stripe ->toread, ->towrite are protected by per-stripe lock. Accessing bio
list of the stripe is always serialized by this lock, so adding bio to the
lists (add_stripe_bio()) and removing bio from the lists (like
ops_run_biofill()) not race.
If bio in ->read, ->written ... list are not shared by multiple stripes, we
don't need any lock to protect ->read, ->written, because STRIPE_ACTIVE will
protect them. If the bio are shared, there are two protections:
1. bi_phys_segments acts as a reference count
2. traverse the list uses r5_next_bio, which makes traverse never access bio
not belonging to the stripe
Let's have an example:
| stripe1 | stripe2 | stripe3 |
...bio1......|bio2|bio3|....bio4.....
stripe2 has 4 bios, when it's finished, it will decrement bi_phys_segments for
all bios, but only end_bio for bio2 and bio3. bio1->bi_next still points to
bio2, but this doesn't matter. When stripe1 is finished, it will not touch bio2
because of r5_next_bio check. Next time stripe1 will end_bio for bio1 and
stripe3 will end_bio bio4.
before add_stripe_bio() addes a bio to a stripe, we already increament the bio
bi_phys_segments, so don't worry other stripes release the bio.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Neil pointed out the bitmap write optimization in handle_stripe_clean_event()
is unnecessary, because the chance one stripe gets written twice in the mean
time is rare. We can always do a bitmap_startwrite when a write request is
added to a stripe and bitmap_endwrite after write request is done. Delete the
optimization. With it, we can delete some cases of device_lock.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Raid5 overrides bio->bi_phys_segments, accessing it is with device_lock hold,
which is unnecessary, We can make it lockless actually.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
release_stripe() is a place conf->device_lock is heavily contended. We take the
lock even stripe count isn't 1, which isn't required.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
commit 4367af5561
md/raid1: clear bad-block record when write succeeds.
Added a 'reschedule_retry' call possibility at the end of
end_sync_write, but didn't add matching code at the end of
sync_request_write. So if the writes complete very quickly, or
scheduling makes it seem that way, then we can miss rescheduling
the request and the resync could hang.
Also commit 73d5c38a95
md: avoid races when stopping resync.
Fix a race condition in this same code in end_sync_write but didn't
make the change in sync_request_write.
This patch updates sync_request_write to fix both of those.
Patch is suitable for 3.1 and later kernels.
Reported-by: Alexander Lyakas <alex.bolshoy@gmail.com>
Original-version-by: Alexander Lyakas <alex.bolshoy@gmail.com>
Cc: stable@vger.kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
md will refuse to stop an array if any other fd (or mounted fs) is
using it.
When any fs is unmounted of when the last open fd is closed all
pending IO will be flushed (e.g. sync_blockdev call in __blkdev_put)
so there will be no pending IO to worry about when the array is
stopped.
However in order to send the STOP_ARRAY ioctl to stop the array one
must first get and open fd on the block device.
If some fd is being used to write to the block device and it is closed
after mdadm open the block device, but before mdadm issues the
STOP_ARRAY ioctl, then there will be no last-close on the md device so
__blkdev_put will not call sync_blockdev.
If this happens, then IO can still be in-flight while md tears down
the array and bad things can happen (use-after-free and subsequent
havoc).
So in the case where do_md_stop is being called from an open file
descriptor, call sync_block after taking the mutex to ensure there
will be no new openers.
This is needed when setting a read-write device to read-only too.
Cc: stable@vger.kernel.org
Reported-by: majianpeng <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
commit c6563a8c38
md: add possibility to change data-offset for devices.
introduced a 'new_data_offset' attribute which should normally
be the same as 'data_offset', but can be explicitly set to a different
value to allow a reshape operation to move the data.
Unfortunately when the 'data_offset' is explicitly set through
sysfs, the new_data_offset is not also set, so the two would become
out-of-sync incorrectly.
One result of this is that trying to set the 'size' after the
'data_offset' would fail because it is not permitted to set the size
when the 'data_offset' and 'new_data_offset' are different - as that
can be confusing.
Consequently when mdadm tried to do this while assembling an IMSM
array it would fail.
This bug was introduced in 3.5-rc1.
Reported-by: Brian Downing <bdowning@lavos.net>
Bisected-by: Brian Downing <bdowning@lavos.net>
Tested-by: Brian Downing <bdowning@lavos.net>
Signed-off-by: NeilBrown <neilb@suse.de>
This bug has been present ever since data-check was introduce
in 2.6.16. However it would only fire if a data-check were
done on a degraded array, which was only possible if the array
has 3 or more devices. This is certainly possible, but is quite
uncommon.
Since hot-replace was added in 3.3 it can happen more often as
the same condition can arise if not all possible replacements are
present.
The problem is that as soon as we submit the last read request, the
'r1_bio' structure could be freed at any time, so we really should
stop looking at it. If the last device is being read from we will
stop looking at it. However if the last device is not due to be read
from, we will still check the bio pointer in the r1_bio, but the
r1_bio might already be free.
So use the read_targets counter to make sure we stop looking for bios
to submit as soon as we have submitted them all.
This fix is suitable for any -stable kernel since 2.6.16.
Cc: stable@vger.kernel.org
Reported-by: Arnold Schulz <arnysch@gmx.net>
Signed-off-by: NeilBrown <neilb@suse.de>
I really shouldn't do important things late in the day. It seems
that I get careless.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.18 (GNU/Linux)
iQIVAwUAT/OCnTnsnt1WYoG5AQIlBxAAsEAYpXhz2m081nY9lYCF4m3CqWoUdsZh
mI/LDVUW6d26fz9uUgTEqv4JmJ3/v2m0PZEVhGiuQF5dkQbMYz/1ddBiaRMM6vWC
pC5+RlcFnYbx62Iayr8/bXYhMZ0lLmxa1oupjEPklchZaJ4+EhRnckbfo+0OFbET
LGe5JFpgtyx5YtHtVJRwAzNkkbHRYQxHuLuX3kMhAONVBmZ+v9c9/0yJZr2TlmYv
wz/khYcMvGGBfr9u60JaIRPpz+b0Hhw9Qh9BhWW4On0wRFytCKktLShTptgtf6BT
ZnXaR2X6vpgoYEJC1nrylyBlLkjnpTUhznWdySZKMGZg8UvZTiPXxN9zZne7HyEN
eosnFx1CiFma4aZyjXvcA2RI7ZnOyKdgnOVbrS29b3z2QpmmvOn/WwnNguFJiupC
WR9ZOpBih/mSCpLQDMkU/F6soXy/vTDcxjfYdLQc8HFeN/BJWT2szH54yD5wtd7G
e4H3Dqs31mxRJG+wVj1XVbcLY9VdHaFwXBf6Y85fqUc3DFJ6y7fX+TEkMIEzOtxt
PQkgZlFSP7QkowMn/3NK1ID3a9Ivjdi3l3fYyDSj582V1iXTJqYGICmibScdJSBJ
ohzkvnsz+fxpEXkQ0jRVN5gstPTVWXS8yrUOF326my60sMiI9Jres+lE9ynrhdZN
OqZjo07ULrU=
=pDob
-----END PGP SIGNATURE-----
Merge tag 'md-3.5-fixes' of git://neil.brown.name/md
Pull raid10 build failure fix from NeilBrown:
"I really shouldn't do important things late in the day. It seems that
I get careless."
* tag 'md-3.5-fixes' of git://neil.brown.name/md:
md/raid10: fix careless build error
build error introduced by commit b357f04a67
That function doesn't get extra args until a later patch. Bother.
Reported-by: Fengguang Wu <wfg@linux.intel.com>
Reported-by: Simon Kirby <sim@hostway.ca>
Reported-by: Tobias Klausmann <tobias.johannes.klausmann@mni.thm.de>
Signed-off-by: NeilBrown <neilb@suse.de>
If CONFIG_DM_DEBUG_SPACE_MAPS is enabled and memory is fragmented and a
sufficiently-large metadata device is used in a thin pool then the space
map checker will fail to allocate the memory it requires.
Switch from kmalloc to vmalloc to allow larger virtually contiguous
allocations for the space map checker's internal count arrays.
Reported-by: Vivek Goyal <vgoyal@redhat.com>
Cc: stable@kernel.org
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
If CONFIG_DM_DEBUG_SPACE_MAPS is enabled and dm_sm_checker_create()
fails, dm_tm_create_internal() would still return success even though it
cleaned up all resources it was supposed to have created. This will
lead to a kernel crash:
general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC
...
RIP: 0010:[<ffffffff81593659>] [<ffffffff81593659>] dm_bufio_get_block_size+0x9/0x20
Call Trace:
[<ffffffff81599bae>] dm_bm_block_size+0xe/0x10
[<ffffffff8159b8b8>] sm_ll_init+0x78/0xd0
[<ffffffff8159c1a6>] sm_ll_new_disk+0x16/0xa0
[<ffffffff8159c98e>] dm_sm_disk_create+0xfe/0x160
[<ffffffff815abf6e>] dm_pool_metadata_open+0x16e/0x6a0
[<ffffffff815aa010>] pool_ctr+0x3f0/0x900
[<ffffffff8158d565>] dm_table_add_target+0x195/0x450
[<ffffffff815904c4>] table_load+0xe4/0x330
[<ffffffff815917ea>] ctl_ioctl+0x15a/0x2c0
[<ffffffff81591963>] dm_ctl_ioctl+0x13/0x20
[<ffffffff8116a4f8>] do_vfs_ioctl+0x98/0x560
[<ffffffff8116aa51>] sys_ioctl+0x91/0xa0
[<ffffffff81869f52>] system_call_fastpath+0x16/0x1b
Fix the space map checker code to return an appropriate ERR_PTR and have
dm_sm_disk_create() and dm_tm_create_internal() check for it with
IS_ERR.
Reported-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Cleanup the shadow table before destroying the transaction manager.
Reference: leak was identified with kmemleak when running
test_discard_random_sectors in the thinp-test-suite.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Userland sometimes sees a corrupt metadata block if metadata is changing
rapidly when a metadata snapshot is reserved for userland, To make the
problem go away, commit before we take the metadata snapshot (which is a
sensible thing to do anyway).
The checksums mean userland spots this corruption immediately so there's
no risk of acting on incorrect data. No corruption exists from the
kernel's point of view, and thin_check passes after pool shutdown.
I believe this is to do with shared blocks at the first level of the
{device, mapping} btree. Prior to the metadata-snap support no sharing
at this level was possible, so this patch is only required after commit
cc8394d86f ("dm thin: provide userspace
access to pool metadata").
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
The value returned by "mddev_check_plug" is only valid until the
next 'schedule' as that will unplug things. This could happen at any
call to mempool_alloc.
So just calling mddev_check_plug at the start doesn't really make
sense.
So call it just before, or just after, queuing things for the thread.
As the action that happens at unplug is to wake the thread, this makes
lots of sense.
If we cannot add a plug (which requires a small GFP_ATOMIC alloc) we
wake thread immediately.
RAID5 is a bit different. Requests are queued for the thread and the
thread is woken by release_stripe. So we don't need to wake the
thread on failure.
However the thread doesn't perform certain actions when there is any
active plug, so it is important to install a plug before waking the
thread. So for RAID5 we install the plug *before* queuing the request
and waking the thread.
Without this patch it is possible for raid1 or raid10 to queue a
request without then waking the thread, resulting in the array locking
up.
Also change raid10 to only flush_pending_write when there are not
active plugs, just like raid1.
This patch is suitable for 3.0 or later. I plan to submit it to
-stable, but I'll like to let it spend a few weeks in mainline
first to be sure it is completely safe.
Signed-off-by: NeilBrown <neilb@suse.de>
We currently only allow a device to be re-added if it appear to be
in-sync. This is overly restrictive as it may be desirable to re-add
a device that is in the middle of recovery.
So remove the test for "InSync" - the test on rdev->raid_disk is
sufficient to ensure that the re-add will succeed.
Reported-by: Alexander Lyakas <alex.bolshoy@gmail.com>
Tested-by: Alexander Lyakas <alex.bolshoy@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
When we added hot_replace we doubled the number of devices
that could be in a RAID1 array. So we doubled how far read_balance
would search. Unfortunately we didn't double the point at which
it looped back to the beginning - so it effectively loops over
all non-replacement disks twice.
This doesn't cause bad behaviour, but it pointless and means we
never read from replacement devices.
Signed-off-by: NeilBrown <neilb@suse.de>
There isn't locking setting STRIPE_DELAYED and STRIPE_PREREAD_ACTIVE bits, but
the two bits have relationship. A delayed stripe can be moved to hold list only
when preread active stripe count is below IO_THRESHOLD. If a stripe has both
the bits set, such stripe will be in delayed list and preread count not 0,
which will make such stripe never leave delayed list.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
We may not be able to fix a bad block if:
- the array is degraded
- the over-write fails.
In these cases we currently eject the device, but we should
record a bad block if possible.
Signed-off-by: majianpeng <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Having the 'name' arg optional and defaulting to the current
personality name is no necessary and leads to errors, as when
changing the level of an array we can end up using the
name of the old level instead of the new one.
So make it non-optional and always explicitly pass the name
of the level that the array will be.
Reported-by: majianpeng <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
commit 58c54fcca3
md/raid10: handle further errors during fix_read_error better.
in 3.1 added "r10_sync_page_io" which takes an IO size in sectors.
But we were passing the IO size in bytes!!!
This resulting in bio_add_page failing, and empty request being sent
down, and a consequent BUG_ON in scsi_lib.
[fix missing space in error message at same time]
This fix is suitable for 3.1.y and later.
Cc: stable@vger.kernel.org
Reported-by: Christian Balzer <chibi@gol.com>
Signed-off-by: NeilBrown <neilb@suse.de>
commit 43220aa0f2
md/raid5: fix a hang on device failure.
fixed a hang, but introduced a refcounting in-balance so
that if the presence of bad-blocks ever caused an rdev to
be 'blocked' we would increment the refcount on the rdev and
never decrement it.
So added the needed rdev_dec_pending when md_wait_for_blocked_rdev
is not called.
Reported-by: majianpeng <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
In ops_run_io(), the call to md_wait_for_blocked_rdev will decrement
nr_pending so we lose the reference we hold on the rdev.
So atomic_inc it first to maintain the reference.
This bug was introduced by commit 73e92e51b7
md/raid5. Don't write to known bad block on doubtful devices.
which appeared in 3.0, so patch is suitable for stable kernels since
then.
Cc: stable@vger.kernel.org
Signed-off-by: majianpeng <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
In chunk_aligned_read() we are adding data_offset before calling
is_badblock. But is_badblock also adds data_offset, so that is bad.
So move the addition of data_offset to after the call to
is_badblock.
This bug was introduced by commit 31c176ecdf
md/raid5: avoid reading from known bad blocks.
which first appeared in 3.0. So that patch is suitable for any
-stable kernel from 3.0.y onwards. However it will need minor
revision for most of those (as the comment didn't appear until
recently).
Cc: stable@vger.kernel.org
Signed-off-by: majianpeng <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
If a RAID5 has both a failed device and a device marked as
'WantReplacement', then we should preferentially replace the failed
device.
However the current code replaces whichever is found first.
So split into 2 loops, check fail failed/missing first, and only check
for WantReplacement if nothing is failed or missing.
Reported-by: majianpeng <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
If a RAID10 has an odd number of chunks - as might happen when there
are an odd number of devices - the last chunk has no pair and so is
not mirrored. We don't store data there, but when recovering the last
device in an array we retry to recover that last chunk from a
non-existent location. This results in an error, and the recovery
aborts.
When we get to that last chunk we should just stop - there is nothing
more to do anyway.
This bug has been present since the introduction of RAID10, so the
patch is appropriate for any -stable kernel.
Cc: stable@vger.kernel.org
Reported-by: Christian Balzer <chibi@gol.com>
Tested-by: Christian Balzer <chibi@gol.com>
Signed-off-by: NeilBrown <neilb@suse.de>
One sparse-warning fix, one bigfix for 3.4-stable
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.18 (GNU/Linux)
iQIVAwUAT86Oqjnsnt1WYoG5AQJ9rhAAuLVImkvxtHHMM7j2E8ZTQ1pWT6JRf6qC
Rsz/s41olPwSEVRuLXpZrle/dSN2l1Ys49FR2u6m+96lM0At2JlkML/Sc4Gszr0g
Oeo8FN+Rv/Sv6Chv7MuWp0z0WOs3ruIR3AYQIo+jnaVzZLLQ2HRN8wupjvpCIZyk
WdPu6t/9G+OtnkFWCC3FDEIyqpghg1TcoK93b1eRFD/ZoPV8yDJ9bba//fDesVVI
OhvUJPqeJ/ow+sA1MzyLhKB6CLPmEob0qxi8++CdnTfx8fwnkYNKlgsxf0WQ8JQ5
GSClKNUpki0yiYWJR6pJrv6+e6WbesX1DriRSRODJLKls/bQKnskaxGx4DUa73BM
DkOUsALfaTfGD5XiXgEjTU2HR+codiqvDavQjWOlHWgwIKB2MYQWIwFLK/T2RSdC
5f30IiM6kHJMZS3lVP2kjfXAfQ10kiTBg7E6btzCO3aso84yxr6Er65skdnlIi5r
q1z7FCnQimfZYjlbuR8EUtdxHdGZkSQbtZ5E7X9dvmUpFstvgGGPr/SuAP3r87kM
LbyRSoDpGk8dXZ5/epY+IKCQGsFZIeTlg+eonjSuVNN8Anr3WAE1VeRmLBQilnXk
hGDLKAZ4v9YwRJWqoY3hewtpcYhCMqNGGk4hPKmJuh37OTOWFQl8sXVk2Pqzy1ap
uIP66qrvvI0=
=VrYL
-----END PGP SIGNATURE-----
Merge tag 'md-3.5-fixes' of git://neil.brown.name/md
Pull two md fixes from NeilBrown:
"One sparse-warning fix, one bugfix for 3.4-stable"
* tag 'md-3.5-fixes' of git://neil.brown.name/md:
md: raid1/raid10: fix problem with merge_bvec_fn
lib/raid6: fix sparse warnings in recovery functions
and provide a simple reserve/release mechanism for userspace tools to
access thin provisioning metadata while the pool is in use.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
iQIcBAABAgAGBQJPyqIdAAoJEK2W1qbAHj1nSPoQAJvAb/6UHufTWC/lufbEyo7t
ft6uwZZ4S/VV1Gdx8V5YXo3rxkVIZj/CV0hiJIDctmDMKGPMlzup39kCgjD/rOUF
mzcFAE8sEr3QEavkfjSWw2RHIIlhnJpvqVnb8nu3p/mSgAB4qYGgaDjBpi+W60PV
aqQSSWgwH1uNhfGDBIxQoJ8OIjjYvKPIf2Ir2FAXam/dNi9chWO9nzFdj3q2LccP
nZir094BDsFac1BF0FYW3J+rgT1FfPO7RRGAQct6WNJ197IZlYWYjKH3XehxnUHE
wgiJmjfUO8vrho1hhWmWDOesKJPPWFN67EQnl5FqAu9itP7c7k8bd7Ay4jWgtZQU
QIx10uiAgAuFUmTdWGK1fLlE8HGKUFINYLp63N5n5NZ4TDJrgo8e7CIID3rvYf/O
EtmL7HzAyztL9Uc6oaXzCK6TgMUtd/ht8OJCDFhjitzQTNjbrfAGz6m+RHnEZyyj
dtOVK7WBlmuKEANl2vDFGuVVF0+MwJLTlvPx1/b/ejFvnHI/R5Wuk9EH7t/DO4LB
nCmiwzB6uWMzU3y3vnZG72AYSF5NTKSvnAl5B8U/0rI1MZU+6PehjeviJNx6ddJN
2YheHBLU4vbBV/LF4XIpaHK2aiHN1ltaKCp8INo3EKhCwpR4ZdlVvnAGU9ocf9+c
qoaFTOP7zGD9zgPeGjoG
=wCpY
-----END PGP SIGNATURE-----
Merge tag 'dm-3.5-changes-1' of git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-dm
Pull device-mapper updates from Alasdair G Kergon:
"Improve multipath's retrying mechanism in some defined circumstances
and provide a simple reserve/release mechanism for userspace tools to
access thin provisioning metadata while the pool is in use."
* tag 'dm-3.5-changes-1' of git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-dm:
dm thin: provide userspace access to pool metadata
dm thin: use slab mempools
dm mpath: allow ioctls to trigger pg init
dm mpath: delay retry of bypassed pg
dm mpath: reduce size of struct multipath
This patch implements two new messages that can be sent to the thin
pool target allowing it to take a snapshot of the _metadata_. This,
read-only snapshot can be accessed by userland, concurrently with the
live target.
Only one metadata snapshot can be held at a time. The pool's status
line will give the block location for the current msnap.
Since version 0.1.5 of the userland thin provisioning tools, the
thin_dump program displays the msnap as follows:
thin_dump -m <msnap root> <metadata dev>
Available here: https://github.com/jthornber/thin-provisioning-tools
Now that userland can access the metadata we can do various things
that have traditionally been kernel side tasks:
i) Incremental backups.
By using metadata snapshots we can work out what blocks have
changed over time. Combined with data snapshots we can ensure
the data doesn't change while we back it up.
A short proof of concept script can be found here:
https://github.com/jthornber/thinp-test-suite/blob/master/incremental_backup_example.rb
ii) Migration of thin devices from one pool to another.
iii) Merging snapshots back into an external origin.
iv) Asyncronous replication.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Use dedicated caches prefixed with a "dm_" name rather than relying on
kmalloc mempools backed by generic slab caches so the memory usage of
thin provisioning (and any leaks) can be accounted for independently.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
After the failure of a group of paths, any alternative paths that
need initialising do not become available until further I/O is sent to
the device. Until this has happened, ioctls return -EAGAIN.
With this patch, new paths are made available in response to an ioctl
too. The processing of the ioctl gets delayed until this has happened.
Instead of returning an error, we submit a work item to kmultipathd
(that will potentially activate the new path) and retry in ten
milliseconds.
Note that the patch doesn't retry an ioctl if the ioctl itself fails due
to a path failure. Such retries should be handled intelligently by the
code that generated the ioctl in the first place, noting that some SCSI
commands should not be retried because they are not idempotent (XOR write
commands). For commands that could be retried, there is a danger that
if the device rejected the SCSI command, the path could be errorneously
marked as failed, and the request would be retried on another path which
might fail too. It can be determined if the failure happens on the
device or on the SCSI controller, but there is no guarantee that all
SCSI drivers set these flags correctly.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
If I/O needs retrying and only bypassed priority groups are available,
set the pg_init_delay_retry flag to wait before retrying.
If, for example, the reason for the bypass is that the controller is
getting reset or there is a firmware upgrade happening, retrying right
away would cause a flood of log messages and retries for what could be a
few seconds or even several minutes.
Signed-off-by: Mike Christie <michaelc@cs.wisc.edu>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Move multipath structure's 'lock' and 'queue_size' members to eliminate
two 4-byte holes. Also use a bit within a single unsigned int for each
existing flag (saves 8-bytes). This allows future flags to be added
without each consuming an unsigned int.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
The new merge_bvec_fn which calls the corresponding function
in subsidiary devices requires that mddev->merge_check_needed
be set if any child has a merge_bvec_fn.
However were were only setting that when a device was hot-added,
not when a device was present from the start.
This bug was introduced in 3.4 so patch is suitable for 3.4.y
kernels. However that are conflicts in raid10.c so a separate
patch will be needed for 3.4.y.
Cc: stable@vger.kernel.org
Reported-by: Sebastian Riemer <sebastian.riemer@profitbricks.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Main features:
- RAID10 arrays can be reshapes - adding and removing devices and
changing chunks (not 'far' array though)
- allow RAID5 arrays to be reshaped with a backup file (not tested
yet, but the priciple works fine for RAID10).
- arrays can be reshaped while a bitmap is present - you no longer
need to remove it first
- SSSE3 support for RAID6 syndrome calculations
and of course a number of minor fixes etc.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.18 (GNU/Linux)
iQIVAwUAT7xXijnsnt1WYoG5AQLvFg/+OGeptY2cRu3HpsNsibvIyfiOYSlDpLo+
2tYzBz2wFiFROfj41aV/PdeqE3xn/RelDmIgt9Apaimeg453O6IdjI9X00fPrgxV
ATWkwWy5ykozbLIsyJYQ/kLPo0NX2KR/TtEim2lwlEjs4bLsF8TGvRa6ylcko0zI
j6cbqVzkCDHXzLk/M6l0UoUaSG1PcjO6M10KBM7bS2sLoxhkn69gT7YTIlFySXW4
epNYSTKyeuSmEUI7L09s5HLf/zPZSp4MipoRIqQYcwk5gvmMNNuLbouDECvZ5BdV
TXxrVVSlh7tFSeoGwYXQXcv/nFg3n53Mc+Nimzo7hhmI5ytRR9Y0c6SwvRBCN7t6
HzapQu+vBqDIPzedH+6r/gk39Auzm60JjGDYHiSdjZCAWefcYUmYm/Iso9JJ/0hg
PVkSfnkgaFUx0GhXS+C9YgPHYlb5DnTCCMrbtQCL65D61D2det3oZtrQPfKIKMlw
SRz2Ls+4o4UhAY7JLYNhONa0mtxhk5VTZ3LH58I9+ZurVyvqrjvCV+neSiCUsRog
jT038/gT5nJ8HPsg5feQ9cS0TbEo92eg3gILy1D5cPTaMZhrV8gq0Ke7xgmBo0+Q
bWh4vxU9SM/96c/umCxcmHymKAFhsMVFbJTg4r9K5atFGNyMegJYedFFEEbQMQI3
u+KRDXHN700=
=q8bc
-----END PGP SIGNATURE-----
Merge tag 'md-3.5' of git://neil.brown.name/md
Pull md updates from NeilBrown:
"It's been a busy cycle for md - lots of fun stuff here.. if you like
this kind of thing :-)
Main features:
- RAID10 arrays can be reshaped - adding and removing devices and
changing chunks (not 'far' array though)
- allow RAID5 arrays to be reshaped with a backup file (not tested
yet, but the priciple works fine for RAID10).
- arrays can be reshaped while a bitmap is present - you no longer
need to remove it first
- SSSE3 support for RAID6 syndrome calculations
and of course a number of minor fixes etc."
* tag 'md-3.5' of git://neil.brown.name/md: (56 commits)
md/bitmap: record the space available for the bitmap in the superblock.
md/raid10: Remove extras after reshape to smaller number of devices.
md/raid5: improve removal of extra devices after reshape.
md: check the return of mddev_find()
MD RAID1: Further conditionalize 'fullsync'
DM RAID: Use md_error() in place of simply setting Faulty bit
DM RAID: Record and handle missing devices
DM RAID: Set recovery flags on resume
md/raid5: Allow reshape while a bitmap is present.
md/raid10: resize bitmap when required during reshape.
md: allow array to be resized while bitmap is present.
md/bitmap: make sure reshape request are reflected in superblock.
md/bitmap: add bitmap_resize function to allow bitmap resizing.
md/bitmap: use DIV_ROUND_UP instead of open-code
md/bitmap: create a 'struct bitmap_counts' substructure of 'struct bitmap'
md/bitmap: make bitmap bitops atomic.
md/bitmap: make _page_attr bitops atomic.
md/bitmap: merge bitmap_file_unmap and bitmap_file_put.
md/bitmap: remove async freeing of bitmap file.
md/bitmap: convert some spin_lock_irqsave to spin_lock_irq
...
Now that bitmaps can grow and shrink it is best if we record
how much space is available. This means that when
we reduce the size of the bitmap we won't "lose" the space
for late when we might want to increase the size of the bitmap
again.
Signed-off-by: NeilBrown <neilb@suse.de>
When a reshape which reduced the number of devices finishes
we must remove the extra devices.
So ensure that raid10_remove_disk won't try to keep them, and
have raid10_finish_reshape clear the 'in_sync' flag. Then
remove_and_add_spares will be able to remove them.
Reported-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: NeilBrown <neilb@suse.de>
After a reshape which reduced the number of devices we need
to disconnect the extra devices.
The code for this doesn't currently handle 'replacement' devices.
It is very unlikely that such devices will be present, but it is
safest to handle them anyway.
So simplify the handling. Just clear In_sync and leave it
to remove_and_add_spaces (which will be called soon) to do
the real works.
Signed-off-by: NeilBrown <neilb@suse.de>
Check the return of mddev_find(), since it may fail due to out of
memeory or out of usable minor number.
The reason I chose -ENODEV instead of -ENOMEM or something else is
md_alloc() function chose that ;)
Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
A RAID1 device does not necessarily need a fullsync if the bitmap can be used instead.
Similar to commit d6b212f4b1 in raid5.c, if a raid1
device can be brought back (i.e. from a transient failure) it shouldn't need a
complete resync. Provided the bitmap is not to old, it will have recorded the areas
of the disk that need recovery.
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
When encountering an error while reading the superblock, call md_error.
We are currently setting the 'Faulty' bit on one of the array devices when an
error is encountered while reading the superblock of a dm-raid array. We should
be calling md_error(), as it handles the error more completely.
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Missing dm-raid devices should be recorded in the superblock
When specifying the devices that compose a DM RAID array, it is possible to denote
failed or missing devices with '-'s. When this occurs, we must record this in the
superblock. We do this by checking if the array position's data device is missing
and then forcing MD to record the superblock by setting 'MD_CHANGE_DEVS' in
'raid_resume'. If we do not cause the superblock to be rewritten by the resume
function, it is possible for a stale superblock to be written by an out-going
in-active table (during 'raid_dtr').
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Properly initialize MD recovery flags when resuming device-mapper devices.
When a device-mapper device is suspended, all I/O must stop. This is done by
calling 'md_stop_writes' and 'mddev_suspend'. These calls in-turn manipulate
the recovery flags - including setting 'MD_RECOVERY_FROZEN'. The DM device
may have been suspended while recovery was not yet complete, so the process
needs to pick-up where it left off. Since 'mddev_resume' does not unset
'MD_RECOVERY_FROZEN' and set 'MD_RECOVERY_NEEDED', we must do it ourselves.
'MD_RECOVERY_NEEDED' can safely be set in 'mddev_resume', but 'MD_RECOVERY_FROZEN'
must be set outside of 'mddev_resume' due to how MD handles RAID reshaping.
(e.g. It is possible for a user to delay reshaping a RAID5->RAID6 by purposefully
setting 'MD_RECOVERY_FROZEN'. Clearing it in 'mddev_resume' would override the
desired behavior.)
Because 'mddev_resume' already unconditionally calls 'md_wakeup_thread(mddev->thread)'
there is no need to make this call from 'raid_resume' since it calls 'mddev_resume'.
Also clean up where level_store calls mddev_resume() - it current
duplicates some of the funcitons of that call. - NB
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
We always should have allowed this. A raid5 reshape doesn't change
the size of the bitmap, so not need to restrict it.
Also add a test to make sure we don't try to start a reshape on a
failed array.
Signed-off-by: NeilBrown <neilb@suse.de>
Now that bitmaps can be resized, we can allow an array to be resized
while the bitmap is present.
This only covers resizing that involves changing the effective size
of member devices, not resizing that changes the number of devices.
Signed-off-by: NeilBrown <neilb@suse.de>
As a reshape may change the sync_size and/or chunk_size, we need
to update these whenever we write out the bitmap superblock.
Signed-off-by: NeilBrown <neilb@suse.de>
This function will allocate the new data structures and copy
bits across from old to new, allowing for the possibility that the
chunksize has changed.
Use the same function for performing the initial allocation
of the structures. This improves test coverage.
When bitmap_resize is used to resize an existing bitmap, it
only copies '1' bits in, not '0' bits.
So when allocating the bitmap, ensure everything is initialised
to ZERO.
Signed-off-by: NeilBrown <neilb@suse.de>
The new "struct bitmap_counts" contains all the fields that are
related to counting the number of active writes in each bitmap chunk.
Having this separate will make it easier to change the chunksize
or overall size of a bitmap atomically.
Signed-off-by: NeilBrown <neilb@suse.de>
Using e.g. set_bit instead of __set_bit and using test_and_clear_bit
allow us to remove some locking and contract other locked ranges.
It is rare that we set or clear a lot of these bits, so gain should
outweigh any cost.
Signed-off-by: NeilBrown <neilb@suse.de>
There functions really do one thing together: release the
'bitmap_storage'. So make them just one function.
Since we removed the locking (previous patch), we don't need to zero
any fields before freeing them, so it all becomes a bit simpler.
Signed-off-by: NeilBrown <neilb@suse.de>
There is no real value in freeing things the moment there is an error.
It is just as good to free the bitmap file and pages when the bitmap
is explicitly removed (and replaced?) or at shutdown.
With this gone, the bitmap will only disappear when the array is
quiescent, so we can remove some locking.
As the 'filemap' doesn't disappear now, include extra checks before
trying to write any of it out.
Also remove the check for "has it disappeared" in
bitmap_daemon_write().
Signed-off-by: NeilBrown <neilb@suse.de>
All of these sites can only be called from process context with
irqs enabled, so using irqsave/irqrestore just adds noise.
Remove it.
Signed-off-by: NeilBrown <neilb@suse.de>
We currently use '&' and '|' which isn't the norm in the kernel
and doesn't allow easy atomicity.
So change to bit numbers and {set,clear,test}_bit.
This allows us to remove a spinlock/unlock (which was dubious anyway)
and some other simplifications.
Signed-off-by: NeilBrown <neilb@suse.de>
Just do single-bit manipulations on bitmap->flags and copy whole
value between that and sb->state.
This will allow next patch which changes how bit manipulations are
performed on bitmap->flags.
This does result in BITMAP_STALE not being set in sb by
bitmap_read_sb, however as the setting is determined by other
information in the 'sb' we do not lose information this way.
Normally, bitmap_load will be called shortly which will clear
BITMAP_STALE anyway.
Signed-off-by: NeilBrown <neilb@suse.de>
This function isn't really needed. It sets or clears a flag in both
bitmap->flags and sb->state.
However both times it is called, bitmap_update_sb is called soon
afterwards which copies bitmap->flags to sb->state.
So just make changes to bitmap->flags, and open-code those rather than
hiding in a function.
Signed-off-by: NeilBrown <neilb@suse.de>
This new 'struct bitmap_storage' reflects the external storage of the
bitmap.
Having this clearly defined will make it easier to change the storage
used while the array is active.
Signed-off-by: NeilBrown <neilb@suse.de>
Most often we have the page number, not the page. And that is what
the *_page_attr() functions really want. So change the arguments to
take that number.
Signed-off-by: NeilBrown <neilb@suse.de>
Instead of allocating pages in read_sb_page, read_page and
bitmap_read_sb, allocate them all in bitmap_init_from disk.
Also replace the hack of calling "attach_page_buffers(page, NULL)" to
ensure that free_buffer() won't complain, by putting a test for
PagePrivate in free_buffer().
Signed-off-by: NeilBrown <neilb@suse.de>
An md bitmap comprises two parts
- internal counting of active writes per 'chunk'.
- external storage of whether there are any active writes on
each chunk
The second requires the first, but the first doesn't require the
second.
Not having backing storage means that the bitmap cannot expedite
resync after a crash, but it still allows us to expedite the recovery
of a recently-removed device.
So: allow a bitmap to exist even if there is no backing device.
In that case we default to 128M chunks.
A particular value of this is that we can remove and re-add a bitmap
(possibly of a different granularity) on a degraded array, and not
lose the information needed to fast-recover the missing device.
We don't actually activate these bitmaps yet - that will come
in a later patch.
Signed-off-by: NeilBrown <neilb@suse.de>
If we are to allow bitmaps to be resized when the array is resized,
we need to know how much space there is.
So create an attribute to store this information and set appropriate
defaults.
It can be set more precisely via sysfs, or future metadata extensions
may allow it to be recorded.
Signed-off-by: NeilBrown <neilb@suse.de>
There are two different 'pending' concepts in the handling of the
write intent bitmap.
Firstly, a 'page' from the bitmap (which container PAGE_SIZE*8 bits)
may have changes (bits cleared) that should be written in due course.
There is no hurry for these and the page will transition from
PENDING to NEEDWRITE and will then be written, though if it ever
becomes DIRTY it will be written much sooner and PENDING will be
cleared.
Secondly, a page of counters - which contains PAGE_SIZE/2 counters, one
for each bit, can usefully have a 'pending' flag which indicates if
any of the counters are low (2 or 1) and ready to be processed by
bitmap_daemon_work(). If this flag is clear we can skip the whole
page.
These two concepts are currently combined in the bitmap-file flag.
This causes a tighter connection between the counters and the bitmap
file than I would like - as I want to add some flexibility to the
bitmap file.
So introduce a new flag with the page-of-counters, and rewrite
bitmap_daemon_work() so that it handles the two different 'pending'
concepts separately.
This also allows us to clear BITMAP_PAGE_PENDING when we write out
a dirty page, which may occasionally reduce the number of times we
write a page.
Signed-off-by: NeilBrown <neilb@suse.de>
REQ_SYNC is ignored in current raid5 code. Block layer does use it to do
policy,
for example ioscheduler. This patch adds it.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
If the allocation of rep1_bio fails, we currently don't free the 'bio'
of the same dev.
Reported by kmemleak.
Signed-off-by: majianpeng <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
When attempting to fix a read error, it is acceptable to read from a
device that is recovering, provided the recovery has got past the
place we are reading from. This makes the test for "can we read from
here" the same as the test in read_balance.
Signed-off-by: majianpeng <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
This ensures that it is always freed - there were case where
we failed to free the page.
Reported-by: majianpeng <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
dm-raid currently open-codes the freeing of some members of
and rdev. It is more maintainable to have it call common code
from md.c which does this for all call-sites.
So remove free_disk_sb to md_rdev_clear, export it, and use it in
dm-raid.c
Signed-off-by: NeilBrown <neilb@suse.de>
A 'near' or 'offset' lay RAID10 array can be reshaped to a different
'near' or 'offset' layout, a different chunk size, and a different
number of devices.
However the number of copies cannot change.
Unlike RAID5/6, we do not support having user-space backup data that
is being relocated during a 'critical section'. Rather, the
data_offset of each device must change so that when writing any block
to a new location, it will not over-write any data that is still
'live'.
This means that RAID10 reshape is not supportable on v0.90 metadata.
The different between the old data_offset and the new_offset must be
at least the larger of the chunksize multiplied by offset copies of
each of the old and new layout. (for 'near' mode, offset_copies == 1).
A larger difference of around 64M seems useful for in-place reshapes
as more data can be moved between metadata updates.
Very large differences (e.g. 512M) seem to slow the process down due
to lots of long seeks (on oldish consumer graded devices at least).
Metadata needs to be updated whenever the place we are about to write
to is considered - by the current metadata - to still contain data in
the old layout.
[unbalanced locking fix from Dan Carpenter <dan.carpenter@oracle.com>]
Signed-off-by: NeilBrown <neilb@suse.de>
We will soon be interpreting the layout (and chunksize etc) from
multiple places to support reshape. So split it out into separate
function.
Signed-off-by: NeilBrown <neilb@suse.de>
When RAID10 supports reshape it will need a 'previous' and a 'current'
geometry, so introduce that here.
Use the 'prev' geometry when before the reshape_position, and the
current 'geo' when beyond it. At other times, use both as
appropriate.
For now, both are identical (And reshape_position is never set).
When we use the 'prev' geometry, we must use the old data_offset.
When we use the current (And a reshape is happening) we must use
the new_data_offset.
Signed-off-by: NeilBrown <neilb@suse.de>
Some resync type operations need to act on the address space of the
device, others on the address space of the array.
This only affects RAID10, so it sets resync_max_sectors to the array
size (it defaults to the device size), and that is currently used for
resync only. However reshape of a RAID10 must be done against the
array size, not device size, so change code to use resync_max_sectors
for both the resync and the reshape cases.
This does not affect RAID5 or RAID1, just RAID10.
Signed-off-by: NeilBrown <neilb@suse.de>
Some code in raid1 and raid10 use sync_page_io to
read/write pages when responding to read errors.
As we will shortly support changing data_offset for
raid10, this function must understand new_data_offset.
So add that understanding.
Signed-off-by: NeilBrown <neilb@suse.de>
We will shortly be adding reshape support for RAID10 which will
require it having 2 concurrent geometries (before and after).
To make that easier, collect most geometry fields into 'struct geom'
and access them from there. Then we will more easily be able to add
a second set of fields.
Note that 'copies' is not in this struct and so cannot be changed.
There is little need to change this number and doing so is a lot
more difficult as it requires reallocating more things.
So leave it out for now.
Signed-off-by: NeilBrown <neilb@suse.de>
The important issue here is incorporating the different in data_offset
into calculations concerning when we might need to over-write data
that is still thought to be valid.
To this end we find the minimum offset difference across all devices
and add that where appropriate.
Signed-off-by: NeilBrown <neilb@suse.de>
As there can now be two different data_offsets - an 'old' and
a 'new' - we need to carefully choose between them.
Signed-off-by: NeilBrown <neilb@suse.de>
When reshaping we can avoid costly intermediate backup by
changing the 'start' address of the array on the device
(if there is enough room).
So as a first step, allow such a change to be requested
through sysfs, and recorded in v1.x metadata.
(As we didn't previous check that all 'pad' fields were zero,
we need a new FEATURE flag for this.
A (belatedly) check that all remaining 'pad' fields are
zero to avoid a repeat of this)
The new data offset must be requested separately for each device.
This allows each to have a different change in the data offset.
This is not likely to be used often but as data_offset can be
set per-device, new_data_offset should be too.
This patch also removes the 'acknowledged' arg to rdev_set_badblocks as
it is never used and never will be. At the same time we add a new
arg ('in_new') which is currently always zero but will be used more
soon.
When a reshape finishes we will need to update the data_offset
and rdev->sectors. So provide an exported function to do that.
Signed-off-by: NeilBrown <neilb@suse.de>
Currently a reshape operation always progresses from the start
of the array to the end unless the number of devices is being
reduced, in which case it progressed in the opposite direction.
To reverse a partial reshape which changes the number of devices
you can stop the array and re-assemble with the raid-disks numbers
reversed and it will undo.
However for a reshape that does not change the number of devices
it is not possible to reverse the reshape in the middle - you have to
wait until it completes.
So add a 'reshape_direction' attribute with is either 'forwards' or
'backwards' and can be explicitly set when delta_disks is zero.
This will become more important when we allow the data_offset to
change in a reshape. Then the explicit statement of what direction is
being used will be more useful.
This can be enabled in raid5 trivially as it already supports
reverse reshape and just needs to use a different trigger to request it.
Signed-off-by: NeilBrown <neilb@suse.de>
A flush request is usually issued in transaction commit code path, so
using GFP_KERNEL to allocate memory for flush request bio falls into
the classic deadlock issue.
This is suitable for any -stable kernel to which it applies as it
avoids a possible deadlock.
Cc: stable@vger.kernel.org
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
When the thin pool target clears the discard_passdown parameter
internally, it incorrectly changes the table line reported to userspace.
This breaks dumb string comparisons on these table lines in generic
userspace device-mapper library code and leads to tables being reloaded
repeatedly when nothing is actually meant to be changing.
This patch corrects this by no longer changing the table line when
discard passdown was disabled.
We can still tell when discard passdown is overridden by looking for the
message "Discard unsupported by data device (sdX): Disabling discard passdown."
This automatic detection is also moved from the 'load' to the 'resume'
so that it is re-evaluated should the properties of underlying devices
change.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Without this patch, recovery will crash
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.18 (GNU/Linux)
iQIVAwUAT7bVGznsnt1WYoG5AQL0ohAAvpF47NkSnuqw9gko25/Ndr5akrIC9kfz
YNzxsd0FvNSEQipGS4KgBcRxnFxbcVkmHL+pN8McmiDxNew1LW2+zdtV47RJngnQ
2slTUiSBLSHcvnOFblBrwIh+XpMdhv4FEMG13Si4tQaETtHp3JiNDr2JqvwbuKbo
aO/nW38DFNjEcB59X+9npknZYaymvavxso4J7iH/ec0Iys2c2h+jX5afewCXnbYr
Q6X2YlySUiUM3AXbgV0QI/w6xDM0TD+WWguHq411pgU1wHMYpHelYYwRGHSU73TJ
061K5WB83Q53A8czeRXK33x0xGr/AhQesTel+mlV3eODHoGSnsYxH8Zja7sV220h
HgTICrSvAFL2cvBz8dqJQ59KYH3yNo4Cie5xzmBxCuX6yyTxizgrQTp7L/HUy7RL
eL1SqNFTRJcwSl3MwmgbL57saBLw+/ejV+og7PNBCXSS9gZg39uJyn1uTqJZTrcL
PuBaizNG9HCYuR/pRTMsQGNWrksqk9r+hxwCJDIGb7xh9SBFI82YtQfK9F5fSB84
vdmIec8yUvwS/Gxhz+I+YB3kqk/nwfFRbVGfBWvJdRBOR0DsZphsBapsJ83LZqFb
VAa2nQL+NoHOweixv6RgiH4Y96SNHdWkvDJjj7R01BUDdpHeHMZuff07Effgqd64
Knk64XvWKQc=
=s7vy
-----END PGP SIGNATURE-----
Merge tag 'md-3.4-fixes' of git://neil.brown.name/md
Pull one more md bugfix from NeilBrown:
"Fix bug in recent fix to RAID10.
Without this patch, recovery will crash"
* tag 'md-3.4-fixes' of git://neil.brown.name/md:
md/raid10: fix transcription error in calc_sectors conversion.
The old code was
sector_div(stride, fc);
the new code was
sector_dir(size, conf->near_copies);
'size' is right (the stride various wasn't really needed), but
'fc' means 'far_copies', and that is an important difference.
Signed-off-by: NeilBrown <neilb@suse.de>
one fixes a bug in the new raid10 resize code so is relevant
to 3.4 only
Other fixes a bug in the use of md by dm-raid, so is relevant
to any kernel with dm-raid support
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.18 (GNU/Linux)
iQIVAwUAT7SBkznsnt1WYoG5AQK3wQ//Q2sPicPHb5MNGTTBpphYo1QWo+l9jFHs
ZDBM+MaiNJg3kBN5ueUU+MENvLcaA5+zoxsGVBXBKyXr70ffqiQcLXyU7fHwrGu3
5MD36p55ZPnq2pemCrp4qdTXEUabmDb+0/R7e5lywnzNdbmCAfh4uYih0VPiaClV
ihq/Ci12TDnezmLjksc09OCquhm0s3zH2BnMCVdmSAkhnXCxTeZ45s/ob71Y2xvj
cJ15SYlAG4t0QCikL5R8pZtkh0h2SuUhufDE09eD8yT4RGO4PHSQ4oHujajftzey
9sB0NGH7Yla8gOXjA+EpzKPaiqtZxJB+1v/bhqA2FoOYAks8VoFfeqgwUbPYE7bk
GIfGB4hFsUXaJo13uzofyJXBIp9mM/J5Sk1VJsiLE85P7wewg6N199B8lpC3lFDw
tMLjfTMJzFOUqZBESjJoxyrc4fairZ9VCUWwpqjuioLO50e+lOi/jQHTspX78e+w
GxgjHp8hh0RqQiTkl7vIz9KVcQIeOTG9uzz61IuDp15cRSrMs6E8gVKoX8gKW9g2
Hec17fdG/H6ZeZa7MB9GzUD4HCj0PRbODQ3/fPhUdsbgtQjOvsVUH8LCRRU0U6cb
YF+qsDFtUF7QT2kNbrs9R6adGj97c2HWUMyRWMQAXGuL5TkstvhrRv/rk1+bv2VG
w7ptbiklj7o=
=9zxe
-----END PGP SIGNATURE-----
Merge tag 'md-3.4-fixes' of git://neil.brown.name/md
Pull two md fixes from NeilBrown:
"One fixes a bug in the new raid10 resize code so is relevant to 3.4
only.
The other fixes a bug in the use of md by dm-raid, so is relevant to
any kernel with dm-raid support"
* tag 'md-3.4-fixes' of git://neil.brown.name/md:
MD: Add del_timer_sync to mddev_suspend (fix nasty panic)
md/raid10: set dev_sectors properly when resizing devices in array.
Use del_timer_sync to remove timer before mddev_suspend finishes.
We don't want a timer going off after an mddev_suspend is called. This is
especially true with device-mapper, since it can call the destructor function
immediately following a suspend. This results in the removal (kfree) of the
structures upon which the timer depends - resulting in a very ugly panic.
Therefore, we add a del_timer_sync to mddev_suspend to prevent this.
Cc: stable@vger.kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
raid10 stores dev_sectors in 'conf' separately from the one in
'mddev' because it can have a very significant effect on block
addressing and so need to be updated carefully.
However raid10_resize isn't updating it at all!
To update it correctly, we need to make sure it is a proper
multiple of the chunksize taking various details of the layout
in to account.
This calculation is currently done in setup_conf. So split it
out from there and call it from raid10_resize as well.
Then set conf->dev_sectors properly.
Signed-off-by: NeilBrown <neilb@suse.de>
Pull networking fixes from David S. Miller:
1) Since we do RCU lookups on ipv4 FIB entries, we have to test if the
entry is dead before returning it to our caller.
2) openvswitch locking and packet validation fixes from Ansis Atteka,
Jesse Gross, and Pravin B Shelar.
3) Fix PM resume locking in IGB driver, from Benjamin Poirier.
4) Fix VLAN header handling in vhost-net and macvtap, from Basil Gor.
5) Revert a bogus network namespace isolation change that was causing
regressions on S390 networking devices.
6) If bonding decides to process and handle a LACPDU frame, we
shouldn't bump the rx_dropped counter. From Jiri Bohac.
7) Fix mis-calculation of available TX space in r8169 driver when doing
TSO, which can lead to crashes and/or hung device. From Julien
Ducourthial.
8) SCTP does not validate cached routes properly in all cases, from
Nicolas Dichtel.
9) Link status interrupt needs to be handled in ks8851 driver, from
Stephen Boyd.
10) Use capable(), not cap_raised(), in connector/userns netlink code.
From Eric W. Biederman via Andrew Morton.
11) Fix pktgen OOPS on module unload, from Eric Dumazet.
12) iwlwifi under-estimates SKB truesizes, also from Eric Dumazet.
13) Cure division by zero in SFC driver, from Ben Hutchings.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (26 commits)
ks8851: Update link status during link change interrupt
macvtap: restore vlan header on user read
vhost-net: fix handle_rx buffer size
bonding: don't increase rx_dropped after processing LACPDUs
connector/userns: replace netlink uses of cap_raised() with capable()
sctp: check cached dst before using it
pktgen: fix crash at module unload
Revert "net: maintain namespace isolation between vlan and real device"
ehea: fix losing of NEQ events when one event occurred early
igb: fix rtnl race in PM resume path
ipv4: Do not use dead fib_info entries.
r8169: fix unsigned int wraparound with TSO
sfc: Fix division by zero when using one RX channel and no SR-IOV
openvswitch: Validation of IPv6 set port action uses IPv4 header
net: compare_ether_addr[_64bits]() has no ordering
cdc_ether: Ignore bogus union descriptor for RNDIS devices
bnx2x: bug fix when loading after SAN boot
e1000: Silence sparse warnings by correcting type
igb, ixgbe: netdev_tx_reset_queue incorrectly called from tx init path
openvswitch: Release rtnl_lock if ovs_vport_cmd_build_info() failed.
...
If the requested scsi_dh module is already loaded then skip
request_module().
Multipath table loads can hang in an unnecessary __request_module.
Reported-by: Ben Marzinski <bmarzins@redhat.com>
Cc: stable@kernel.org
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Fix two places in commit 104655fd4d ("dm thin: support discards") that
didn't use pool->lock to protect against concurrent changes to the
prepared_discards list.
Without this fix, thin_endio() can race with process_discard(), leading
to concurrent list_add()s that result in the processes locking up with
an error like the following:
WARNING: at lib/list_debug.c:32 __list_add+0x8f/0xa0()
...
list_add corruption. next->prev should be prev (ffff880323b96140), but was ffff8801d2c48440. (next=ffff8801d2c485c0).
...
Pid: 17205, comm: kworker/u:1 Tainted: G W O 3.4.0-rc3.snitm+ #1
Call Trace:
[<ffffffff8103ca1f>] warn_slowpath_common+0x7f/0xc0
[<ffffffff8103cb16>] warn_slowpath_fmt+0x46/0x50
[<ffffffffa04f6ce6>] ? bio_detain+0xc6/0x210 [dm_thin_pool]
[<ffffffff8124ff3f>] __list_add+0x8f/0xa0
[<ffffffffa04f70d2>] process_discard+0x2a2/0x2d0 [dm_thin_pool]
[<ffffffffa04f6a78>] ? remap_and_issue+0x38/0x50 [dm_thin_pool]
[<ffffffffa04f7c3b>] process_deferred_bios+0x7b/0x230 [dm_thin_pool]
[<ffffffffa04f7df0>] ? process_deferred_bios+0x230/0x230 [dm_thin_pool]
[<ffffffffa04f7e42>] do_worker+0x52/0x60 [dm_thin_pool]
[<ffffffff81056fa9>] process_one_work+0x129/0x450
[<ffffffff81059b9c>] worker_thread+0x17c/0x3c0
[<ffffffff81059a20>] ? manage_workers+0x120/0x120
[<ffffffff8105eabe>] kthread+0x9e/0xb0
[<ffffffff814ceda4>] kernel_thread_helper+0x4/0x10
[<ffffffff8105ea20>] ? kthread_freezable_should_stop+0x70/0x70
[<ffffffff814ceda0>] ? gs_change+0x13/0x13
---[ end trace 7e0a523bc5e52692 ]---
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Fix a significant memory leak inadvertently introduced during
simplification of cell_release_singleton() in commit
6f94a4c45a ("dm thin: fix stacked bi_next
usage").
A cell's hlist_del() must be accompanied by a mempool_free().
Use __cell_release() to do this, like before.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
In 2009 Philip Reiser notied that a few users of netlink connector
interface needed a capability check and added the idiom
cap_raised(nsp->eff_cap, CAP_SYS_ADMIN) to a few of them, on the premise
that netlink was asynchronous.
In 2011 Patrick McHardy noticed we were being silly because netlink is
synchronous and removed eff_cap from the netlink_skb_params and changed
the idiom to cap_raised(current_cap(), CAP_SYS_ADMIN).
Looking at those spots with a fresh eye we should be calling
capable(CAP_SYS_ADMIN). The only reason I can see for not calling capable
is that it once appeared we were not in the same task as the caller which
would have made calling capable() impossible.
In the initial user_namespace the only difference between between
cap_raised(current_cap(), CAP_SYS_ADMIN) and capable(CAP_SYS_ADMIN) are a
few sanity checks and the fact that capable(CAP_SYS_ADMIN) sets
PF_SUPERPRIV if we use the capability.
Since we are going to be using root privilege setting PF_SUPERPRIV seems
the right thing to do.
The motivation for this that patch is that in a child user namespace
cap_raised(current_cap(),...) tests your capabilities with respect to that
child user namespace not capabilities in the initial user namespace and
thus will allow processes that should be unprivielged to use the kernel
services that are only protected with cap_raised(current_cap(),..).
To fix possible user_namespace issues and to just clean up the code
replace cap_raised(current_cap(), CAP_SYS_ADMIN) with
capable(CAP_SYS_ADMIN).
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: Patrick McHardy <kaber@trash.net>
Cc: Philipp Reisner <philipp.reisner@linbit.com>
Acked-by: Serge E. Hallyn <serge.hallyn@canonical.com>
Acked-by: Andrew G. Morgan <morgan@kernel.org>
Cc: Vasiliy Kulikov <segoon@openwall.com>
Cc: David Howells <dhowells@redhat.com>
Reviewed-by: James Morris <james.l.morris@oracle.com>
Cc: David Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit 61a0d80c "md/bitmap: discard CHUNK_BLOCK_SHIFT macro"
replaced CHUNK_BLOCK_RATIO() by the same text that was
replacing CHUNK_BLOCK_SHIFT() - which is clearly wrong.
The result is that 'chunks' is often too small by 1,
which can sometimes result in a crash (not sure how).
So use the correct replacement, and get rid of CHUNK_BLOCK_RATIO
which is no longe used.
Reported-by: Karl Newman <siliconfiend@gmail.com>
Tested-by: Karl Newman <siliconfiend@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
commit c744a65c1e
md: don't set md arrays to readonly on shutdown.
removed the possibility of a 'BUG' when data is written to an array
that has just been switched to read-only, but also introduced the
possibility that the array metadata could be corrupted.
If, when md_notify_reboot gets the mddev lock, the array is
in a state where it is assembled but hasn't been started (as can
happen if the personality module is not available, or in other unusual
situations), then incorrect metadata will be written out making it
impossible to re-assemble the array.
So only call __md_stop_writes() if the array has actually been
activated.
This patch is needed for any stable kernel which has had the above
commit applied.
Cc: stable@vger.kernel.org
Reported-by: Christoph Nelles <evilazrael@evilazrael.de>
Signed-off-by: NeilBrown <neilb@suse.de>
Commit 7bfec5f35c
md/raid5: If there is a spare and a want_replacement device, start replacement.
cause md_check_recovery to call ->add_disk much more often.
Instead of only when the array is degraded, it is now called whenever
md_check_recovery finds anything useful to do, which includes
updating the metadata for clean<->dirty transition.
This causes unnecessary work, and causes info messages from ->add_disk
to be reported much too often.
So refine md_check_recovery to only do any actual recovery checking
(including ->add_disk) if MD_RECOVERY_NEEDED is set.
This fix is suitable for 3.3.y:
Cc: stable@vger.kernel.org
Reported-by: Jan Ceuleers <jan.ceuleers@computer.org>
Signed-off-by: NeilBrown <neilb@suse.de>
Fix segfault caused by using rdev_for_each instead of rdev_for_each_safe
Commit dafb20fa34 mistakenly replaced a safe
iterator with an unsafe one when making some macro changes.
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
If a bitmap is added while the array is active, it is possible
for bitmap_daemon_work to run while the bitmap is being
initialised.
This is particularly a problem if bitmap_daemon_work sees
bitmap->filemap as non-NULL before it has been filled in properly.
So hold bitmap_info.mutex while filling in ->filemap
to prevent problems.
This patch is suitable for any -stable kernel, though it might not
apply cleanly before about 3.1.
Cc: stable@vger.kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
If r1bio->sectors % 8 != 0,then the memcmp and a later
memcpy will omit the last bio_vec.
This is suitable for any stable kernel since 3.1 when bad-block
management was introduced.
Cc: stable@vger.kernel.org
Signed-off-by: majianpeng <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
bitmap_new_disk_sb() would still create V3 bitmap superblock
with host-endian layout.
Perhaps I'm confused, but shouldn't bitmap_new_disk_sb() be
creating a V4 bitmap superblock instead, that is portable,
as per comment in bitmap.h?
Signed-off-by: Andrei Warkentin <andrey.warkentin@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
When comparing two pages read from different legs of a mirror, only
compare the bytes that were read, not the whole page.
In most cases we read a whole page, but in some cases with
bad blocks or odd sizes devices we might read fewer than that.
This bug has been present "forever" but at worst it might cause
a report of two many mismatches and generate a little bit
extra resync IO, so there is no need to back-port to -stable
kernels.
Reported-by: majianpeng <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
When create a raid5 using assume-clean and echo check or repair to
sync_action.Then component disks did not operated IO but the raid
check/resync faster than normal.
Because the judgement in function analyse_stripe():
if (do_recovery ||
sh->sector >= conf->mddev->recovery_cp)
s->syncing = 1;
else
s->replacing = 1;
When check or repair,the recovery_cp == MaxSectore,so syncing equal zero
not one.
This bug was introduced by commit 9a3e1101b8
md/raid5: detect and handle replacements during recovery.
so this patch is suitable for 3.3-stable.
Cc: stable@vger.kernel.org
Signed-off-by: majianpeng <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Because rde->nr_pending > 0,so can not remove this disk.
And in any case, we aren't holding rcu_read_lock()
Signed-off-by: majianpeng <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
raid1 arrays do not have the notion of chunk size. Calculate the
largest chunk sector size we can use to avoid a divide by zero OOPS
when aligning the size of the new array to the chunk size.
Signed-off-by: Jes Sorensen <Jes.Sorensen@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
1/ We can only treat a known-bad-block like a read-error if we
have the data that belongs in that block. So fix that test.
2/ If we cannot recovery a stripe due to insufficient data,
don't tell "md_done_sync" that the sync failed unless we really
did fail something. If we successfully record bad blocks,
that is success.
Reported-by: "majianpeng" <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
This device-mapper target creates a read-only device that transparently
validates the data on one underlying device against a pre-generated tree
of cryptographic checksums stored on a second device.
Two checksum device formats are supported: version 0 which is already
shipping in Chromium OS and version 1 which incorporates some
improvements.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mandeep Singh Baines <msb@chromium.org>
Signed-off-by: Will Drewry <wad@chromium.org>
Signed-off-by: Elly Jones <ellyjones@chromium.org>
Cc: Milan Broz <mbroz@redhat.com>
Cc: Olof Johansson <olofj@chromium.org>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
This patch introduces a new function dm_bufio_prefetch. It prefetches
the specified range of blocks into dm-bufio cache without waiting
for i/o completion.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Add dm thin target arguments to control discard support.
ignore_discard: Disables discard support
no_discard_passdown: Don't pass discards down to the underlying data
device, but just remove the mapping within the thin provisioning target.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Support discards in the thin target.
On discard the corresponding mapping(s) are removed from the thin
device. If the associated block(s) are no longer shared the discard
is passed to the underlying device.
All bios other than discards now have an associated deferred_entry
that is saved to the 'all_io_entry' in endio_hook. When non-discard
IO completes and associated mappings are quiesced any discards that
were deferred, via ds_add_work() in process_discard(), will be queued
for processing by the worker thread.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
drivers/md/dm-thin.c | 173 ++++++++++++++++++++++++++++++++++++++++++++++----
drivers/md/dm-thin.c | 172 ++++++++++++++++++++++++++++++++++++++++++++++-----
1 file changed, 158 insertions(+), 14 deletions(-)
This patch contains the ground work needed for dm-thin to support discard.
- Adds endio function that replaces shared_read_endio.
- Introduce an explicit 'quiesced' flag into the new_mapping structure.
Before, this was implicitly indicated by m->list being empty.
- The map_info->ptr remains constant for the duration of a bio's trip
through the thin target. Make it easier to reason about it.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Support the use of an external _read only_ device as an origin for a thin
device.
Any read to an unprovisioned area of the thin device will be passed
through to the origin. Writes trigger allocation of new blocks as
usual.
One possible use case for this would be VM hosts that want to run
guests on thinly-provisioned volumes but have the base image on another
device (possibly shared between many VMs).
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
The thin metadata format can only make use of a device that is <=
THIN_METADATA_MAX_SECTORS (currently 15.9375 GB). Therefore, there is no
practical benefit to using a larger device.
However, it may be that other factors impose a certain granularity for
the space that is allocated to a device (E.g. lvm2 can impose a coarse
granularity through the use of large, >= 1 GB, physical extents).
Rather than reject a larger metadata device, during thin-pool device
construction, switch to allowing it but issue a warning if a device
larger than THIN_METADATA_MAX_SECTORS_WARNING (16 GB) is
provided. Any space over 15.9375 GB will not be used.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Save space by removing entries from the space map ref_count tree if
they're no longer needed.
Ref counts are stored in two places: a bitmap if the ref_count is
below 3, or a btree of uint32_t if 3 or above.
When a ref_count that was above 3 drops below we can remove it from
the tree and save some metadata space. This removal was commented out
before because I was unsure why this was causing under-populated btree
nodes. Earlier patches have fixed this issue.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Commit unwritten data every second to prevent too much building up.
Released blocks don't become available until after the next commit
(for crash resilience). Prior to this patch commits were only
triggered by a message to the target or a REQ_{FLUSH,FUA} bio. This
allowed far too big a position to build up.
The interval is hard-coded to 1 second. This is a sensible setting.
I'm not making this user configurable, since there isn't much to be
gained by tweaking this - and a lot lost by setting it far too high.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Device mapper uses sscanf to convert arguments to numbers. The problem is that
the way we use it ignores additional unmatched characters in the scanned string.
For example, this `if (sscanf(string, "%d", &number) == 1)' will match a number,
but also it will match number with some garbage appended, like "123abc".
As a result, device mapper accepts garbage after some numbers. For example
the command `dmsetup create vg1-new --table "0 16384 linear 254:1bla 34816bla"'
will pass without an error.
This patch fixes all sscanf uses in device mapper. It appends "%c" with
a pointer to a dummy character variable to every sscanf statement.
The construct `if (sscanf(string, "%d%c", &number, &dummy) == 1)' succeeds
only if string is a null-terminated number (optionally preceded by some
whitespace characters). If there is some character appended after the number,
sscanf matches "%c", writes the character to the dummy variable and returns 2.
We check the return value for 1 and consequently reject numbers with some
garbage appended.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
The dm-raid code currently fails to create a RAID array if any of the
superblocks cannot be read. This was an oversight as there is already
code to handle this case if the values ('- -') were provided for the
failed array position.
With this patch, if a superblock cannot be read, the array position's
fields are initialized as though '- -' was set in the table. That is,
the device is failed and the position should not be used, but if there
is sufficient redundancy, the array should still be activated.
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Fix a harmless typo.
The root is a chunk of data that gets written to the superblock. This
data is used to recreate the space map when opening a metadata area.
We have two space maps; one tracking space on the metadata device and
one of the data device. Both of these use the same format for their
root, so this typo was harmless.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Now that the value_size is held within every node of the btrees we can
remove this argument from value_ptr().
For the last few months a BUG_ON has been checking this argument is
the same as that held in the node. No issues were reported. So this
is a safe change.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
The map_context pointer should always be set. However, we have reports
that upon requeuing it is not set correctly. So add set and clear
functions with a BUG_ON() to track the issue properly.
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Acked-by: Hannes Reinecke <hare@suse.de>
Tested-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Acked-by: Dave Wysochanski <dwysocha@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
As a precaution, set bi_end_io to NULL when failing to remap.
Signed-off-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
free_devices in dm_table.c already uses list_for_each(), so we don't
need to check if the list is empty.
Signed-off-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Remove documentation for unimplemented 'trim' message.
I'd planned a 'trim' target message for shrinking thin devices, but
this is better handled via the discard ioctl.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
The dm raid module (using md) is becoming the preferred way of creating long-lived
mirrors through userspace LVM so remove the EXPERIMENTAL tag.
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Drop EXPERIMENTAL tag from dm-uevent.
It's not changed for a while and some userspace tools are relying upon it.
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
When we remove an entry from a node we sometimes rebalance with it's
two neighbours. This wasn't being done correctly; in some cases
entries have to move all the way from the right neighbour to the left
neighbour, or vice versa. This patch pretty much re-writes the
balancing code to fix it.
This code is barely used currently; only when you delete a thin
device, and then only if you have hundreds of them in the same pool.
Once we have discard support, which removes mappings, this will be used
much more heavily.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Cc: stable@kernel.org
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Avoid using the bi_next field for the holder of a cell when deferring
bios because a stacked device below might change it. Store the
holder in a new field in struct cell instead.
When a cell is created, the bio that triggered creation (the holder) was
added to the same bio list as subsequent bios. In some cases we pass
this holder bio directly to devices underneath. If those devices use
the bi_next field there will be trouble...
This also simplifies some code that had to work out which bio was the
holder.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Cc: stable@kernel.org
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Always set io->error to -EIO when an error is detected in dm-crypt.
There were cases where an error code would be set only if we finish
processing the last sector. If there were other encryption operations in
flight, the error would be ignored and bio would be returned with
success as if no error happened.
This bug is present in kcryptd_crypt_write_convert, kcryptd_crypt_read_convert
and kcryptd_async_done.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@kernel.org
Reviewed-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
This patch fixes a possible deadlock in dm-crypt's mempool use.
Currently, dm-crypt reserves a mempool of MIN_BIO_PAGES reserved pages.
It allocates first MIN_BIO_PAGES with non-failing allocation (the allocation
cannot fail and waits until the mempool is refilled). Further pages are
allocated with different gfp flags that allow failing.
Because allocations may be done in parallel, this code can deadlock. Example:
There are two processes, each tries to allocate MIN_BIO_PAGES and the processes
run simultaneously.
It may end up in a situation where each process allocates (MIN_BIO_PAGES / 2)
pages. The mempool is exhausted. Each process waits for more pages to be freed
to the mempool, which never happens.
To avoid this deadlock scenario, this patch changes the code so that only
the first page is allocated with non-failing gfp mask. Allocation of further
pages may fail.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@kernel.org
Signed-off-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Call the correct exit function on failure in dm_exception_store_init.
Signed-off-by: Andrei Warkentin <andrey.warkentin@gmail.com>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@kernel.org
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Mostly tidying up code in preparation for some bigger changes
next time.
A few bug fixes tagged for -stable.
Main functionality change is that some RAID10 arrays can now
grow to use extra space that may have been made available on the
individual devices.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.18 (GNU/Linux)
iQIVAwUAT2bLBjnsnt1WYoG5AQKN3xAAv1UlR5Kem5WN7Ex4lmR9xj3lr9dbURYT
TtvrUuCy3pYYWdTuijb+IBqkbODF0kPDHIhUiBx9fXUfMavkp/b9heXS/vJ3pcH4
1j99NUbOGL/AylD1TPRV9TQxGTKhEjK3n26bY0t/amLc92bWJaytMO1B9cz38LN+
qx6ufpIepz4DPXXtPYpnkBR4cZ6L4/ZXQvjf5BqG6WfKwc+0Nyncg8ipYEqhBWy7
R7ztF5yPo0yl96Wopa2KG91OroWflmyZo1DNYcbUbKtbNGGtYC92GFadOH+wNupM
FnmXv10ivfVGU5w4SpshAwOg+4OSUqmWNsBxUhpYbf8ChbN+lOl0VZdH6UBxo19D
3SqZWT/yz4I4HYd5rtr35MXFdOeBNM++CHQs4F68BLA0B6OcHfWsA9bvly2tnBVx
iEBFPd277qWztUr8m6yz7AFf/0dgyXuIhuB3d7IkVrG5yG3FX6hPi2T0FSA33qMx
Lwi5w6O4DREg5tG09xEYEnXgXe+PnB8HsKb1U/m76XMQ0UScvX6dLA6934Vg+DCv
xf+AYqob0Tc/Op5I7h2PbVXq7DciNXwlX1WvM0m+TEaV+3fl1FB0VsCcANAV6JVn
uRLmvtePQRt0hxAog2p7OsumVnxMhbuEo5h8rJMKWM7IbhueKNoz+gBwpcFLzBmY
ygWc4peLQpE=
=MGuM
-----END PGP SIGNATURE-----
Merge tag 'md-3.4' of git://neil.brown.name/md
Pull md updates for 3.4 from Neil Brown:
"Mostly tidying up code in preparation for some bigger changes next
time.
A few bug fixes tagged for -stable.
Main functionality change is that some RAID10 arrays can now grow to
use extra space that may have been made available on the individual
devices."
Fixed up trivial conflicts with the k[un]map_atomic() cleanups in
drivers/md/bitmap.c.
* tag 'md-3.4' of git://neil.brown.name/md: (22 commits)
md: Add judgement bb->unacked_exist in function md_ack_all_badblocks().
md: fix clearing of the 'changed' flags for the bad blocks list.
md/bitmap: discard CHUNK_BLOCK_SHIFT macro
md/bitmap: remove unnecessary indirection when allocating.
md/bitmap: remove some pointless locking.
md/bitmap: change a 'goto' to a normal 'if' construct.
md/bitmap: move printing of bitmap status to bitmap.c
md/bitmap: remove some unused noise from bitmap.h
md/raid10 - support resizing some RAID10 arrays.
md/raid1: handle merge_bvec_fn in member devices.
md/raid10: handle merge_bvec_fn in member devices.
md: add proper merge_bvec handling to RAID0 and Linear.
md: tidy up rdev_for_each usage.
md/raid1,raid10: avoid deadlock during resync/recovery.
md/bitmap: ensure to load bitmap when creating via sysfs.
md: don't set md arrays to readonly on shutdown.
md: allow re-add to failed arrays.
md/raid5: use atomic_dec_return() instead of atomic_dec() and atomic_read().
md: Use existed macros instead of numbers
md/raid5: removed unused 'added_devices' variable.
...
Pull kmap_atomic cleanup from Cong Wang.
It's been in -next for a long time, and it gets rid of the (no longer
used) second argument to k[un]map_atomic().
Fix up a few trivial conflicts in various drivers, and do an "evil
merge" to catch some new uses that have come in since Cong's tree.
* 'kmap_atomic' of git://github.com/congwang/linux: (59 commits)
feature-removal-schedule.txt: schedule the deprecated form of kmap_atomic() for removal
highmem: kill all __kmap_atomic() [swarren@nvidia.com: highmem: Fix ARM build break due to __kmap_atomic rename]
drbd: remove the second argument of k[un]map_atomic()
zcache: remove the second argument of k[un]map_atomic()
gma500: remove the second argument of k[un]map_atomic()
dm: remove the second argument of k[un]map_atomic()
tomoyo: remove the second argument of k[un]map_atomic()
sunrpc: remove the second argument of k[un]map_atomic()
rds: remove the second argument of k[un]map_atomic()
net: remove the second argument of k[un]map_atomic()
mm: remove the second argument of k[un]map_atomic()
lib: remove the second argument of k[un]map_atomic()
power: remove the second argument of k[un]map_atomic()
kdb: remove the second argument of k[un]map_atomic()
udf: remove the second argument of k[un]map_atomic()
ubifs: remove the second argument of k[un]map_atomic()
squashfs: remove the second argument of k[un]map_atomic()
reiserfs: remove the second argument of k[un]map_atomic()
ocfs2: remove the second argument of k[un]map_atomic()
ntfs: remove the second argument of k[un]map_atomic()
...
Pull trivial tree from Jiri Kosina:
"It's indeed trivial -- mostly documentation updates and a bunch of
typo fixes from Masanari.
There are also several linux/version.h include removals from Jesper."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (101 commits)
kcore: fix spelling in read_kcore() comment
constify struct pci_dev * in obvious cases
Revert "char: Fix typo in viotape.c"
init: fix wording error in mm_init comment
usb: gadget: Kconfig: fix typo for 'different'
Revert "power, max8998: Include linux/module.h just once in drivers/power/max8998_charger.c"
writeback: fix fn name in writeback_inodes_sb_nr_if_idle() comment header
writeback: fix typo in the writeback_control comment
Documentation: Fix multiple typo in Documentation
tpm_tis: fix tis_lock with respect to RCU
Revert "media: Fix typo in mixer_drv.c and hdmi_drv.c"
Doc: Update numastat.txt
qla4xxx: Add missing spaces to error messages
compiler.h: Fix typo
security: struct security_operations kerneldoc fix
Documentation: broken URL in libata.tmpl
Documentation: broken URL in filesystems.tmpl
mtd: simplify return logic in do_map_probe()
mm: fix comment typo of truncate_inode_pages_range
power: bq27x00: Fix typos in comment
...
If there are no unacked bad blocks, then there is no point searching
for them to acknowledge them.
Signed-off-by: majianpeng <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
In super_1_sync (the first hunk) we need to clear 'changed' before
checking read_seqretry(), otherwise we might race with other code
adding a bad block and so won't retry later.
In md_update_sb (the second hunk), in the case where there is no
metadata (neither persistent nor external), we treat any bad blocks as
an error. However we need to clear the 'changed' flag before calling
md_ack_all_badblocks, else it won't do anything.
This patch is suitable for -stable release 3.0 and later.
Cc: stable@vger.kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
Be redefining ->chunkshift as the shift from sectors to chunks rather
than bytes to chunks, we can just use "bitmap->chunkshift" which is
shorter than the macro call, and less indirect.
Signed-off-by: NeilBrown <neilb@suse.de>
These funcitons don't add anything useful except possibly the trace
points, and I don't think they are worth the extra indirection.
So remove them.
Signed-off-by: NeilBrown <neilb@suse.de>
There is nothing gained by holding a lock while we check if a pointer
is NULL or not. If there could be a race, then it could become NULL
immediately after the unlock - but there is no race here.
So just remove the locking.
Signed-off-by: NeilBrown <neilb@suse.de>
The use of a goto makes the control flow more obscure here.
So make it a normal:
if (x) {
Y;
}
No functional change.
Signed-off-by: NeilBrown <neilb@suse.de>
The part of /proc/mdstat which describes the bitmap should really
be generated by code in bitmap.c. So move it there.
Signed-off-by: NeilBrown <neilb@suse.de>
'resizing' an array in this context means making use of extra
space that has become available in component devices, not adding new
devices.
It also includes shrinking the array to take up less space of
component devices.
This is not supported for array with a 'far' layout. However
for 'near' and 'offset' layout arrays, adding and removing space at
the end of the devices is easy to support, and this patch provides
that support.
Signed-off-by: NeilBrown <neilb@suse.de>
Currently we don't honour merge_bvec_fn in member devices so if there
is one, we force all requests to be single-page at most.
This is not ideal.
So create a raid1 merge_bvec_fn to check that function in children
as well.
This introduces a small problem. There is no locking around calls
the ->merge_bvec_fn and subsequent calls to ->make_request. So a
device added between these could end up getting a request which
violates its merge_bvec_fn.
Currently the best we can do is synchronize_sched(). This will work
providing no preemption happens. If there is is preemption, we just
have to hope that new devices are largely consistent with old devices.
Signed-off-by: NeilBrown <neilb@suse.de>
Currently we don't honour merge_bvec_fn in member devices so if there
is one, we force all requests to be single-page at most.
This is not ideal.
So enhance the raid10 merge_bvec_fn to check that function in children
as well.
This introduces a small problem. There is no locking around calls
the ->merge_bvec_fn and subsequent calls to ->make_request. So a
device added between these could end up getting a request which
violates its merge_bvec_fn.
Currently the best we can do is synchronize_sched(). This will work
providing no preemption happens. If there is preemption, we just
have to hope that new devices are largely consistent with old devices.
Signed-off-by: NeilBrown <neilb@suse.de>
These personalities currently set a max request size of one page
when any member device has a merge_bvec_fn because they don't
bother to call that function.
This causes extra works in splitting and combining requests.
So make the extra effort to call the merge_bvec_fn when it exists
so that we end up with larger requests out the bottom.
Signed-off-by: NeilBrown <neilb@suse.de>