Commit Graph

3536 Commits

Author SHA1 Message Date
NeilBrown f97fcad38f md: remove need for mddev_lock() in md_seq_show()
The only access in md_seq_show that could suffer from races
not protected by ->lock is walking the rdev list.
This can receive sufficient protection from 'rcu'.

So use rdev_for_each_rcu() and get rid of mddev_lock().

Now reading /proc/mdstat will never block in md_seq_show.

Signed-off-by: NeilBrown <neilb@suse.de>
2015-02-06 09:32:55 +11:00
NeilBrown 978a7a47ca md/bitmap: protect clearing of ->bitmap by mddev->lock
This makes it safe to inspect the struct while holding only
the spinlock.

Signed-off-by: NeilBrown <neilb@suse.de>
2015-02-06 09:32:55 +11:00
Linus Torvalds 59acf65776 Two fixes for md
1/ Another live lock, needs backporting
 2/ work-around false positive with new warnings.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIVAwUAVNE+TDnsnt1WYoG5AQIX8w/9EWD9hTM3HEatrdZFfOgQrG4fafpZoOHT
 +fxMTyvIbIr7ppL3lZVA6KmyDS15/BIt0JhwMy7pzaPqvxSCK/qqGOdE8h1nVaN1
 /TbARZCCOn62PsRxKQDHCsU8lsRt3VNH4fGvm0RBTry/RtvGrxcqIBLGnwWseCQq
 SGVj1uKb+PI5FL8c4GvyVCdBD+uO8idpY6D6Rd2WQbuskOPoJhIEZRh0wPHEYvWw
 rJ+gzzWkalFOjPgejS54ZrTGxOgvZ0NiAaFuEQaDG2zRc27luDxF/eyCR9G12juC
 YH8M2IxNp0i20iaoNp8A+D8ksMbNE3OEFOZx2gtFwItQ3aye455Lv+C0ZnbxlWD/
 R+399E0wKtFp8onW+KALoJvgZHjlanj3uIjSPltlCxDDQ3F5Any6h6uGIEOAVYx2
 uruUmjp0JsxHio52R1Ai26VT+Ssc49GVEfBwcFej/ZGs8a0XxvYWuk1lllh9AL0w
 8THt9yVQMR8NmUYrNnceRK6BJN4PdFHi/jxoLzeQfW2OHpmuug2Q0M/raYZGOIx6
 xI92XPIGKN/kzRhBua75KhQkX5HBGJFP0kutIHj58AHacMFbiiJl9lzSIjGOJzjS
 sBxyvvnOYUV4QW2Kb3KNfJWu2dDbLx/z4xzzkiG22d+LSW03FaPPnqSXXT59FIhQ
 OzNfUxdNLJc=
 =qYoP
 -----END PGP SIGNATURE-----

Merge tag 'md/3.19-fixes' of git://neil.brown.name/md

Pull two fixes for md from Neil Brown:

 - Another live lock, needs backporting

 - work-around false positive with new warnings.

* tag 'md/3.19-fixes' of git://neil.brown.name/md:
  md/bitmap: fix a might_sleep() warning.
  md/raid5: fix another livelock caused by non-aligned writes.
2015-02-03 19:54:57 -08:00
NeilBrown 36d091f475 md: protect ->pers changes with mddev->lock
->pers is already protected by ->reconfig_mutex, and
cannot possibly change when there are threads running or
outstanding IO.

However there are some places where we access ->pers
not in a thread or IO context, and where ->reconfig_mutex
is unnecessarily heavy-weight:  level_show and md_seq_show().

So protect all changes, and those accesses, with ->lock.
This is a step toward taking those accesses out from under
reconfig_mutex.

[Fixed missing "mddev->pers" -> "pers" conversion, thanks to
 Dan Carpenter <dan.carpenter@oracle.com>]

Signed-off-by: NeilBrown <neilb@suse.de>
2015-02-04 08:35:53 +11:00
NeilBrown db721d32b7 md: level_store: group all important changes into one place.
Gather all the changes that can happen atomically and might
be relevant to other code into one place.  This will
make it easier to refine the locking.

Note that this puts quite a few things between mddev_detach()
and ->free().  Enabling this was the point of some recent patches.

Signed-off-by: NeilBrown <neilb@suse.de>
2015-02-04 08:35:53 +11:00
NeilBrown afa0f557cb md: rename ->stop to ->free
Now that the ->stop function only frees the private data,
rename is accordingly.

Also pass in the private pointer as an arg rather than using
mddev->private.  This flexibility will be useful in level_store().

Finally, don't clear ->private.  It doesn't make sense to clear
it seeing that isn't what we free, and it is no longer necessary
to clear ->private (it was some time ago before  ->to_remove was
introduced).

Setting ->to_remove in ->free() is a bit of a wart, but not a
big problem at the moment.

Signed-off-by: NeilBrown <neilb@suse.de>
2015-02-04 08:35:52 +11:00
NeilBrown 5aa61f427e md: split detach operation out from ->stop.
Each md personality has a 'stop' operation which does two
things:
 1/ it finalizes some aspects of the array to ensure nothing
    is accessing the ->private data
 2/ it frees the ->private data.

All the steps in '1' can apply to all arrays and so can be
performed in common code.

This is useful as in the case where we change the personality which
manages an array (in level_store()), it would be helpful to do
step 1 early, and step 2 later.

So split the 'step 1' functionality out into a new mddev_detach().

Signed-off-by: NeilBrown <neilb@suse.de>
2015-02-04 08:35:52 +11:00
NeilBrown 3be260cc18 md/linear: remove rcu protections in favour of suspend/resume
The use of 'rcu' to protect accesses to ->private_data so that
the ->private_data could be updated predates the introduction
of mddev_suspend/mddev_resume.
These are a cleaner mechanism for providing stability while
swapping in a new ->private data - it is used by level_store()
to support changing of raid levels.

So get rid of the RCU stuff and just use mddev_suspend, mddev_resume.

As these function call ->quiesce(), we add an empty function for
linear just like for raid0.

Signed-off-by: NeilBrown <neilb@suse.de>
2015-02-04 08:35:52 +11:00
NeilBrown 64590f45dd md: make merge_bvec_fn more robust in face of personality changes.
There is no locking around calls to merge_bvec_fn(), so
it is possible that calls which coincide with a level (or personality)
change could go wrong.

So create a central dispatch point for these functions and use
rcu_read_lock().
If the array is suspended, reject any merge that can be rejected.
If not, we know it is safe to call the function.

Signed-off-by: NeilBrown <neilb@suse.de>
2015-02-04 08:35:52 +11:00
NeilBrown 5c675f83c6 md: make ->congested robust against personality changes.
There is currently no locking around calls to the 'congested'
bdi function.  If called at an awkward time while an array is
being converted from one level (or personality) to another, there
is a tiny chance of running code in an unreferenced module etc.

So add a 'congested' function to the md_personality operations
structure, and call it with appropriate locking from a central
'mddev_congested'.

When the array personality is changing the array will be 'suspended'
so no IO is processed.
If mddev_congested detects this, it simply reports that the
array is congested, which is a safe guess.
As mddev_suspend calls synchronize_rcu(), mddev_congested can
avoid races by included the whole call inside an rcu_read_lock()
region.
This require that the congested functions for all subordinate devices
can be run under rcu_lock.  Fortunately this is the case.

Signed-off-by: NeilBrown <neilb@suse.de>
2015-02-04 08:35:52 +11:00
NeilBrown 85572d7c75 md: rename mddev->write_lock to mddev->lock
This lock is used for (slightly) more than helping with writing
superblocks, and it will soon be extended further.  So the
name is inappropriate.

Also, the _irq variant hasn't been needed since 2.6.37 as it is
never taking from interrupt or bh context.

So:
  -rename write_lock to lock
  -document what it protects
  -remove _irq ... except in md_flush_request() as there
     is no wait_event_lock() (with no _irq).  This can be
     cleaned up after appropriate changes to wait.h.

Signed-off-by: NeilBrown <neilb@suse.de>
2015-02-04 08:35:52 +11:00
NeilBrown ea664c8245 md/raid5: need_this_block: tidy/fix last condition.
That last condition is unclear and over cautious.

There are two related issues here.

If a partial write is destined for a missing device, then
either RMW or RCW can work.  We must read all the available
block.  Only then can the missing blocks be calculated, and
then the parity update performed.

If RMW is not an option, then there is a complication even
without partial writes.  If we would need to read a missing
device to perform the reconstruction, then we must first read every
block so the missing device data can be computed.
This is the case for RAID6 (Which currently does not support
RMW) and for times when we don't trust the parity (after a crash)
and so are in the process of resyncing it.

So make these two cases more clear and separate, and perform
the relevant tests more  thoroughly.

Signed-off-by: NeilBrown <neilb@suse.de>
2015-02-04 08:35:51 +11:00
NeilBrown a9d56950f7 md/raid5: need_this_block: start simplifying the last two conditions.
Both the last two cases are only relevant if something has failed and
something needs to be written (but not over-written), and if it is OK
to pre-read blocks at this point.  So factor out those tests and
explain them.

Signed-off-by: NeilBrown <neilb@suse.de>
2015-02-04 08:35:51 +11:00
NeilBrown a79cfe12c6 md/raid5: separate out the easy conditions in need_this_block.
Some of the conditions in need_this_block have very straight
forward motivation.  Separate those out and document them.

Signed-off-by: NeilBrown <neilb@suse.de>
2015-02-04 08:35:51 +11:00
NeilBrown 2c58f06e6f md/raid5: separate large if clause out of fetch_block().
fetch_block() has a very large and hard to read 'if' condition.

Separate it into its own function so that it can be
made more readable.

Signed-off-by: NeilBrown <neilb@suse.de>
2015-02-04 08:35:51 +11:00
Jes Sorensen ad3ab8b608 md: do_release_stripe(): No need to call md_wakeup_thread() twice
67f455486d introduced a call to
md_wakeup_thread() when adding to the delayed_list. However the md
thread is woken up unconditionally just below.

Remove the unnecessary wakeup call.

Signed-off-by: Jes Sorensen <Jes.Sorensen@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2015-02-04 08:35:51 +11:00
NeilBrown d959014334 md/bitmap: fix a might_sleep() warning.
commit 8eb23b9f35
    sched: Debug nested sleeps

causes false-positive warnings in RAID5 code.

This annotation removes them and adds a comment
explaining why there is no real problem.

Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2015-02-02 17:08:03 +11:00
NeilBrown b1b02fe97f md/raid5: fix another livelock caused by non-aligned writes.
If a non-page-aligned write is destined for a device which
is missing/faulty, we can deadlock.

As the target device is missing, a read-modify-write cycle
is not possible.
As the write is not for a full-page, a recontruct-write cycle
is not possible.

This should be handled by logic in fetch_block() which notices
there is a non-R5_OVERWRITE write to a missing device, and so
loads all blocks.

However since commit 67f455486d, that code requires
STRIPE_PREREAD_ACTIVE before it will active, and those circumstances
never set STRIPE_PREREAD_ACTIVE.

So: in handle_stripe_dirtying, if neither rmw or rcw was possible,
set STRIPE_DELAYED, which will cause STRIPE_PREREAD_ACTIVE be set
after a suitable delay.

Fixes: 67f455486d
Cc: stable@vger.kernel.org (v3.16+)
Reported-by: Mikulas Patocka <mpatocka@redhat.com>
Tested-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2015-02-02 16:57:17 +11:00
Keith Busch febf71588c block: require blk_rq_prep_clone() be given an initialized clone request
Prepare to allow blk_rq_prep_clone() to accept clone requests that were
allocated from blk-mq request queues.  As such the blk_rq_prep_clone()
caller must first initialize the clone request.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-01-28 09:44:11 -07:00
Joe Thornber 2a7eaea02b dm thin: don't allow messages to be sent to a pool target in READ_ONLY or FAIL mode
You can't modify the metadata in these modes.  It's better to fail these
messages immediately than let the block-manager deny write locks on
metadata blocks.  Otherwise these failed metadata changes will trigger
'needs_check' to get set in the metadata superblock -- requiring repair
using the thin_check utility.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
2015-01-28 10:00:34 -05:00
Joe Thornber 766a78882d dm cache: fix missing ERR_PTR returns and handling
Commit 9b1cc9f251 ("dm cache: share cache-metadata object across
inactive and active DM tables") mistakenly ignored the use of ERR_PTR
returns.  Restore missing IS_ERR checks and ERR_PTR returns where
appropriate.

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
2015-01-28 09:59:20 -05:00
Mikulas Patocka 96b26c8c64 dm: fix handling of multiple internal suspends
Commit ffcc393641 ("dm: enhance internal suspend and resume interface")
attempted to handle multiple internal suspends on the same device, but
it did that incorrectly.  When these functions are called in this order
on the same device the device is no longer suspended, but it should be:
	dm_internal_suspend_noflush
	dm_internal_suspend_noflush
	dm_internal_resume

Fix this bug by maintaining an 'internal_suspend_count' and resuming
the device when this count drops to zero.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-01-24 14:50:08 -05:00
Joe Thornber a59db67656 dm cache: fix problematic dual use of a single migration count variable
Introduce a new variable to count the number of allocated migration
structures.  The existing variable cache->nr_migrations became
overloaded.  It was used to:

 i) track of the number of migrations in flight for the purposes of
    quiescing during suspend.

 ii) to estimate the amount of background IO occuring.

Recent discard changes meant that REQ_DISCARD bios are processed with
a migration.  Discards are not background IO so nr_migrations was not
incremented.  However this could cause quiescing to complete early.

(i) is now handled with a new variable cache->nr_allocated_migrations.
cache->nr_migrations has been renamed cache->nr_io_migrations.
cleanup_migration() is now called free_io_migration(), since it
decrements that variable.

Also, remove the unused cache->next_migration variable that got replaced
with with prealloc_structs a while ago.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
2015-01-23 11:06:08 -05:00
Joe Thornber 9b1cc9f251 dm cache: share cache-metadata object across inactive and active DM tables
If a DM table is reloaded with an inactive table when the device is not
suspended (normal procedure for LVM2), then there will be two dm-bufio
objects that can diverge.  This can lead to a situation where the
inactive table uses bufio to read metadata at the same time the active
table writes metadata -- resulting in the inactive table having stale
metadata buffers once it is promoted to the active table slot.

Fix this by using reference counting and a global list of cache metadata
objects to ensure there is only one metadata object per metadata device.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
2015-01-23 10:57:15 -05:00
Ingo Molnar f49028292c Merge branch 'for-mingo' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu
Pull RCU updates from Paul E. McKenney:

  - Documentation updates.

  - Miscellaneous fixes.

  - Preemptible-RCU fixes, including fixing an old bug in the
    interaction of RCU priority boosting and CPU hotplug.

  - SRCU updates.

  - RCU CPU stall-warning updates.

  - RCU torture-test updates.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-01-21 06:12:21 +01:00
Christoph Jaeger 6341e62b21 kconfig: use bool instead of boolean for type definition attributes
Support for keyword 'boolean' will be dropped later on.

No functional change.

Reference: http://lkml.kernel.org/r/cover.1418003065.git.cj@linux.com
Signed-off-by: Christoph Jaeger <cj@linux.com>
Signed-off-by: Michal Marek <mmarek@suse.cz>
2015-01-07 13:08:04 +01:00
Pranith Kumar 83fe27ea53 rcu: Make SRCU optional by using CONFIG_SRCU
SRCU is not necessary to be compiled by default in all cases. For tinification
efforts not compiling SRCU unless necessary is desirable.

The current patch tries to make compiling SRCU optional by introducing a new
Kconfig option CONFIG_SRCU which is selected when any of the components making
use of SRCU are selected.

If we do not select CONFIG_SRCU, srcu.o will not be compiled at all.

   text    data     bss     dec     hex filename
   2007       0       0    2007     7d7 kernel/rcu/srcu.o

Size of arch/powerpc/boot/zImage changes from

   text    data     bss     dec     hex filename
 831552   64180   23944  919676   e087c arch/powerpc/boot/zImage : before
 829504   64180   23952  917636   e0084 arch/powerpc/boot/zImage : after

so the savings are about ~2000 bytes.

Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
CC: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
CC: Josh Triplett <josh@joshtriplett.org>
CC: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
[ paulmck: resolve conflict due to removal of arch/ia64/kvm/Kconfig. ]
2015-01-06 11:04:29 -08:00
zhendong chen 5164bece16 dm: fix missed error code if .end_io isn't implemented by target_type
In bio-based DM's clone_endio(), when target_type doesn't implement
.end_io (e.g. linear) r will be always be initialized 0.  So if a
WRITE SAME bio fails WRITE SAME will not be disabled as intended.

Fix this by initializing r to error, rather than 0, in clone_endio().

Signed-off-by: Alex Chen <alex.chen@huawei.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Fixes: 7eee4ae2db ("dm: disable WRITE SAME if it fails")
Cc: stable@vger.kernel.org
2014-12-17 12:31:13 -05:00
Marc Dionne 2b94e8960c dm thin: fix crash by initializing thin device's refcount and completion earlier
Commit 80e96c5484 ("dm thin: do not allow thin device activation
while pool is suspended") delayed the initialization of a new thin
device's refcount and completion until after this new thin was added
to the pool's active_thins list and the pool lock is released.  This
opens a race with a worker thread that walks the list and calls
thin_get/put, noticing that the refcount goes to 0 and calling
complete, freezing up the system and giving the oops below:

 kernel: BUG: unable to handle kernel NULL pointer dereference at           (null)
 kernel: IP: [<ffffffff810d360b>] __wake_up_common+0x2b/0x90

 kernel: Call Trace:
 kernel: [<ffffffff810d3683>] __wake_up_locked+0x13/0x20
 kernel: [<ffffffff810d3dc7>] complete+0x37/0x50
 kernel: [<ffffffffa0595c50>] thin_put+0x20/0x30 [dm_thin_pool]
 kernel: [<ffffffffa059aab7>] do_worker+0x667/0x870 [dm_thin_pool]
 kernel: [<ffffffff816a8a4c>] ? __schedule+0x3ac/0x9a0
 kernel: [<ffffffff810b1aef>] process_one_work+0x14f/0x400
 kernel: [<ffffffff810b206b>] worker_thread+0x6b/0x490
 kernel: [<ffffffff810b2000>] ? rescuer_thread+0x260/0x260
 kernel: [<ffffffff810b6a7b>] kthread+0xdb/0x100
 kernel: [<ffffffff810b69a0>] ? kthread_create_on_node+0x170/0x170
 kernel: [<ffffffff816ad7ec>] ret_from_fork+0x7c/0xb0
 kernel: [<ffffffff810b69a0>] ? kthread_create_on_node+0x170/0x170

Set the thin device's initial refcount and initialize the completion
before adding it to the pool's active_thins list in thin_ctr().

Signed-off-by: Marc Dionne <marc.dionne@your-file-system.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-12-17 12:06:25 -05:00
Joe Thornber 2c43fd26e4 dm thin: fix missing out-of-data-space to write mode transition if blocks are released
Discard bios and thin device deletion have the potential to release data
blocks.  If the thin-pool is in out-of-data-space mode, and blocks were
released, transition the thin-pool back to full write mode.

The correct time to do this is just after the thin-pool metadata commit.
It cannot be done before the commit because the space maps will not
allow immediate reuse of the data blocks in case there's a rollback
following power failure.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
2014-12-17 11:59:36 -05:00
Joe Thornber 45ec9bd0fd dm thin: fix inability to discard blocks when in out-of-data-space mode
When the pool was in PM_OUT_OF_SPACE mode its process_prepared_discard
function pointer was incorrectly being set to
process_prepared_discard_passdown rather than process_prepared_discard.

This incorrect function pointer meant the discard was being passed down,
but not effecting the mapping.  As such any discard that was issued, in
an attempt to reclaim blocks, would not successfully free data space.

Reported-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
2014-12-17 11:59:36 -05:00
Linus Torvalds 8fd9589ced Three fixes for md
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIVAwUAVIzozznsnt1WYoG5AQKEOQ/9G31KYGhe3/Tn+Hhyy7o16+6fjOBSUN2z
 6GCGtOuHhom0Q+y4cdeAohw+eW9bJPqc9tRyRrLnVCuAkeymPP1t7fIRLGAuOHq5
 QjjX7ZwKK8urmPAFIxRFgJwrJctFEoaCQiZHk4Pcx9YcYm4paJP2liqqQ0khGPcl
 KGGjWgD3KMAG0moJzNmLMWBZevG3SLCTirPxOcDJuBkQ7KfP0e4UGVn/UYXdgF+f
 73yForADVIE/Jq/CXnsOEZa/nPTM7DIsXZAQqqFCnkUiuCvwhgfkowxhwCExx2d6
 w3+I69anD1WII1Z27bNDBgnJGR5xvdiAl6NTJ+GY1uMCo3lnV9OYpQrJxAEJP4dC
 nXILm6SVm7WLfLvVY1lJm3wZndIyWNmTcDxODGyMZhAY4RmjDHG+AZ9tcnY1pM+9
 rt0jHbmaj/ZZ5jKbGxPKBlQ3/U7EP8d4oiruOaDWCGT4sN+e2dbZVLpaWoNaBFUx
 eyUft+DEvY82ryz2Ktn/Exsx1aWk2Cy3cMl0ELq2L/2WAcjhA6gsikNmKG7tfafn
 Bd6WIGzAygATaFs4hOtFANyXqWubWi8TgLUHXNYUkJ2JooaMuSIfWrpeMNJyHHU7
 5bWTaSZqZ86Ygb6upvJqaA3olePMo1RPBsOev46TnuFT9vtfpzPRy6FlvbXfx8sK
 M5YCqNpTDTs=
 =Ar2y
 -----END PGP SIGNATURE-----

Merge tag 'md/3.19' of git://neil.brown.name/md

Pull md updates from Neil Brown:
 "Three fixes for md.

   I did have a largish set of locking changes queued, but late testing
  showed they weren't quite as stable as I thought and while I fixed
  what I found, I decided it safer to delay them a release ...
  particularly as I'll be AFK for a few weeks.  So expect a larger batch
  next time :-)"

* tag 'md/3.19' of git://neil.brown.name/md:
  md: Check MD_RECOVERY_RUNNING as well as ->sync_thread.
  md: fix semicolon.cocci warnings
  md/raid5: fetch_block must fetch all the blocks handle_stripe_dirtying wants.
2014-12-14 12:13:05 -08:00
Linus Torvalds 9ea18f8cab Merge branch 'for-3.19/drivers' of git://git.kernel.dk/linux-block
Pull block layer driver updates from Jens Axboe:

 - NVMe updates:
        - The blk-mq conversion from Matias (and others)

        - A stack of NVMe bug fixes from the nvme tree, mostly from Keith.

        - Various bug fixes from me, fixing issues in both the blk-mq
          conversion and generic bugs.

        - Abort and CPU online fix from Sam.

        - Hot add/remove fix from Indraneel.

 - A couple of drbd fixes from the drbd team (Andreas, Lars, Philipp)

 - With the generic IO stat accounting from 3.19/core, converting md,
   bcache, and rsxx to use those.  From Gu Zheng.

 - Boundary check for queue/irq mode for null_blk from Matias.  Fixes
   cases where invalid values could be given, causing the device to hang.

 - The xen blkfront pull request, with two bug fixes from Vitaly.

* 'for-3.19/drivers' of git://git.kernel.dk/linux-block: (56 commits)
  NVMe: fix race condition in nvme_submit_sync_cmd()
  NVMe: fix retry/error logic in nvme_queue_rq()
  NVMe: Fix FS mount issue (hot-remove followed by hot-add)
  NVMe: fix error return checking from blk_mq_alloc_request()
  NVMe: fix freeing of wrong request in abort path
  xen/blkfront: remove redundant flush_op
  xen/blkfront: improve protection against issuing unsupported REQ_FUA
  NVMe: Fix command setup on IO retry
  null_blk: boundary check queue_mode and irqmode
  block/rsxx: use generic io stats accounting functions to simplify io stat accounting
  md: use generic io stats accounting functions to simplify io stat accounting
  drbd: use generic io stats accounting functions to simplify io stat accounting
  md/bcache: use generic io stats accounting functions to simplify io stat accounting
  NVMe: Update module version major number
  NVMe: fail pci initialization if the device doesn't have any BARs
  NVMe: add ->exit_hctx() hook
  NVMe: make setup work for devices that don't do INTx
  NVMe: enable IO stats by default
  NVMe: nvme_submit_async_admin_req() must use atomic rq allocation
  NVMe: replace blk_put_request() with blk_mq_free_request()
  ...
2014-12-13 14:22:26 -08:00
NeilBrown f851b60db0 md: Check MD_RECOVERY_RUNNING as well as ->sync_thread.
A recent change to md started the ->sync_thread from a asynchronously
from a work_queue rather than synchronously.  This means that there
can be a small window between the time when MD_RECOVERY_RUNNING is set
and when ->sync_thread is set.

So code that checks ->sync_thread might now conclude that the thread
has not been started and (because a lock is held) will not be started.
That is no longer the case.

Most of those places are best fixed by testing MD_RECOVERY_RUNNING
as well.  To make this completely reliable, we wake_up(&resync_wait)
after clearing that flag as well as after clearing ->sync_thread.

Other places are better served by flushing the relevant workqueue
to ensure that that if the sync thread was starting, it has now
started.  This is particularly best if we are about to stop the
sync thread.

Fixes: ac05f25669
Signed-off-by: NeilBrown <neilb@suse.de>
2014-12-11 10:02:10 +11:00
Linus Torvalds 140dfc9299 - Significant DM thin-provisioning performance improvements to meet
performance requirements that were requested by the Gluster
   distributed filesystem.  Specifically, dm-thinp now takes care to
   aggregate IO that will be issued to the same thinp block before
   issuing IO to the underlying devices.  This really helps improve
   performance on HW RAID6 devices that have a writeback cache because it
   avoids RMW in the HW RAID controller.
 
 - Some stable fixes: fix leak in DM bufio if integrity profiles were
   enabled, use memzero_explicit in DM crypt to avoid any potential for
   information leak, and a DM cache fix to properly mark a cache block
   dirty if it was promoted to the cache via the overwrite optimization.
 
 - A few simple DM persistent data library fixes
 
 - DM cache multiqueue policy block promotion improvements.
 
 - DM cache discard improvements that take advantage of range
   (multiblock) discard support in the DM bio-prison.  This allows for
   much more efficient bulk discard processing (e.g. when mkfs.xfs
   discards the entire device).
 
 - Some small optimizations in DM core and RCU deference cleanups
 
 - DM core changes to suspend/resume code to introduce the new internal
   suspend/resume interface that the DM thin-pool target now uses to
   suspend/resume active thin devices when the thin-pool must
   suspend/resume.  This avoids forcing userspace to track all active
   thin volumes in a thin-pool when the thin-pool is suspended for the
   purposes of metadata or data space resize.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJUhcvVAAoJEMUj8QotnQNaB78H+wSA6sDJGOhc6e1KlWoFh4Hx
 hTmwm/O8Fxrp9StO3NPlcv9l+l9FX9pGzN/lo3OsxgWMTs/vLTKZ5SAe3/YT3/b9
 6SFC7pC70+glakgMhhXWRvoeSEQC1OWb5BuvOF8irl2n+7B9NAn/zHd9pgpmyWHp
 nBXK2GJBMzVSiI47NMjo2n6007LgQq0xxSJ9luwdrpwjDqD1d406DrhzbHou5H2Y
 b8XJGQzUy0GZCX8ycwPVXo9svp2Bc+XajVcgOj5Qg7s2uV5car8NN7TxhSOKSXn2
 VpiSyEa2MLHAbFuWtGs8XO98z/m5JlGf1eIgRO4s7w59URgpzdxOHXLlAoyqIGw=
 =opXi
 -----END PGP SIGNATURE-----

Merge tag 'dm-3.19-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull device mapper updates from Mike Snitzer:

 - Significant DM thin-provisioning performance improvements to meet
   performance requirements that were requested by the Gluster
   distributed filesystem.

   Specifically, dm-thinp now takes care to aggregate IO that will be
   issued to the same thinp block before issuing IO to the underlying
   devices.  This really helps improve performance on HW RAID6 devices
   that have a writeback cache because it avoids RMW in the HW RAID
   controller.

 - Some stable fixes: fix leak in DM bufio if integrity profiles were
   enabled, use memzero_explicit in DM crypt to avoid any potential for
   information leak, and a DM cache fix to properly mark a cache block
   dirty if it was promoted to the cache via the overwrite optimization.

 - A few simple DM persistent data library fixes

 - DM cache multiqueue policy block promotion improvements.

 - DM cache discard improvements that take advantage of range
   (multiblock) discard support in the DM bio-prison.  This allows for
   much more efficient bulk discard processing (e.g.  when mkfs.xfs
   discards the entire device).

 - Some small optimizations in DM core and RCU deference cleanups

 - DM core changes to suspend/resume code to introduce the new internal
   suspend/resume interface that the DM thin-pool target now uses to
   suspend/resume active thin devices when the thin-pool must
   suspend/resume.

   This avoids forcing userspace to track all active thin volumes in a
   thin-pool when the thin-pool is suspended for the purposes of
   metadata or data space resize.

* tag 'dm-3.19-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (49 commits)
  dm crypt: use memzero_explicit for on-stack buffer
  dm space map metadata: fix sm_bootstrap_get_count()
  dm space map metadata: fix sm_bootstrap_get_nr_blocks()
  dm bufio: fix memleak when using a dm_buffer's inline bio
  dm cache: fix spurious cell_defer when dealing with partial block at end of device
  dm cache: dirty flag was mistakenly being cleared when promoting via overwrite
  dm cache: only use overwrite optimisation for promotion when in writeback mode
  dm cache: discard block size must be a multiple of cache block size
  dm cache: fix a harmless race when working out if a block is discarded
  dm cache: when reloading a discard bitset allow for a different discard block size
  dm cache: fix some issues with the new discard range support
  dm array: if resizing the array is a noop set the new root to the old one
  dm: use rcu_dereference_protected instead of rcu_dereference
  dm thin: fix pool_io_hints to avoid looking at max_hw_sectors
  dm thin: suspend/resume active thin devices when reloading thin-pool
  dm: enhance internal suspend and resume interface
  dm thin: do not allow thin device activation while pool is suspended
  dm: add presuspend_undo hook to target_type
  dm: return earlier from dm_blk_ioctl if target doesn't implement .ioctl
  dm thin: remove stale 'trim' message in block comment above pool_message
  ...
2014-12-08 21:10:03 -08:00
kbuild test robot 7d7e64f2ec md: fix semicolon.cocci warnings
drivers/md/md.c:7175:43-44: Unneeded semicolon

 Removes unneeded semicolon.

Generated by: scripts/coccinelle/misc/semicolon.cocci

Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2014-12-03 16:07:59 +11:00
NeilBrown 108cef3aa4 md/raid5: fetch_block must fetch all the blocks handle_stripe_dirtying wants.
It is critical that fetch_block() and handle_stripe_dirtying()
are consistent in their analysis of what needs to be loaded.
Otherwise raid5 can wait forever for a block that won't be loaded.

Currently when writing to a RAID5 that is resyncing, to a location
beyond the resync offset, handle_stripe_dirtying chooses a
reconstruct-write cycle, but fetch_block() assumes a
read-modify-write, and a lockup can happen.

So treat that case just like RAID6, just as we do in
handle_stripe_dirtying.  RAID6 always does reconstruct-write.

This bug was introduced when the behaviour of handle_stripe_dirtying
was changed in 3.7, so the patch is suitable for any kernel since,
though it will need careful merging for some versions.

Cc: stable@vger.kernel.org (v3.7+)
Fixes: a7854487cd
Reported-by: Henry Cai <henryplusplus@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2014-12-03 16:07:58 +11:00
Milan Broz 1a71d6ffe1 dm crypt: use memzero_explicit for on-stack buffer
Use memzero_explicit to cleanup sensitive data allocated on stack
to prevent the compiler from optimizing and removing memset() calls.

Signed-off-by: Milan Broz <gmazyland@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
2014-12-02 10:25:07 -05:00
Joe Thornber 02717d9855 dm space map metadata: fix sm_bootstrap_get_count()
Must set 'result' accordingly rather than return it.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-12-02 10:25:06 -05:00
Dan Carpenter c1c6156fe4 dm space map metadata: fix sm_bootstrap_get_nr_blocks()
This function isn't right and it causes a static checker warning:

	drivers/md/dm-thin.c:3016 maybe_resize_data_dev()
	error: potentially using uninitialized 'sb_data_size'.

It should set "*count" and return zero on success the same as the
sm_metadata_get_nr_blocks() function does earlier.

Fixes: 3241b1d3e0 ('dm: add persistent data library')
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-12-01 11:31:58 -05:00
Darrick J. Wong 445559cdcb dm bufio: fix memleak when using a dm_buffer's inline bio
When dm-bufio sets out to use the bio built into a struct dm_buffer to
issue an IO, it needs to call bio_reset after it's done with the bio
so that we can free things attached to the bio such as the integrity
payload.  Therefore, inject our own endio callback to take care of
the bio_reset after calling submit_io's end_io callback.

Test case:
1. modprobe scsi_debug delay=0 dif=1 dix=199 ato=1 dev_size_mb=300
2. Set up a dm-bufio client, e.g. dm-verity, on the scsi_debug device
3. Repeatedly read metadata and watch kmalloc-192 leak!

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
2014-12-01 11:31:17 -05:00
Joe Thornber f824a2af3d dm cache: fix spurious cell_defer when dealing with partial block at end of device
We never bother caching a partial block that is at the back end of the
origin device.  No cell ever gets locked, but the calling code was
assuming it was and trying to release it.

Now the code only releases if the cell has been set to a non NULL
value.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
2014-12-01 11:30:13 -05:00
Joe Thornber 1e32134a5a dm cache: dirty flag was mistakenly being cleared when promoting via overwrite
If the incoming bio is a WRITE and completely covers a block then we
don't bother to do any copying for a promotion operation.  Once this is
done the cache block and origin block will be different, so we need to
set it to 'dirty'.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
2014-12-01 11:30:12 -05:00
Joe Thornber f29a3147e2 dm cache: only use overwrite optimisation for promotion when in writeback mode
Overwrite causes the cache block and origin blocks to diverge, which
is only allowed in writeback mode.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
2014-12-01 11:30:12 -05:00
Joe Thornber 2bb812df63 dm cache: discard block size must be a multiple of cache block size
Otherwise the cache blocks may span two discard blocks, which we don't
handle when doing the discard lookup.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-12-01 11:30:11 -05:00
Joe Thornber 43c32bf2b0 dm cache: fix a harmless race when working out if a block is discarded
It is more correct to hold the cell before checking the discard state.
These flags are only used as hints to the policy so this change will
have negligable effect.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-12-01 11:30:11 -05:00
Joe Thornber 3e2e1c3098 dm cache: when reloading a discard bitset allow for a different discard block size
The discard block size can change if the origin changes size or if an
old DM cache is upgraded from using a discard block size that was equal
to cache block size.

To fix this an extent of discarded blocks is established for the purpose
of translating the old discard block size to the new in-core discard
block size and set bits.  The old (potentially huge) discard bitset is
left ondisk until it is re-written using the new in-core information on
the next successful DM cache shutdown.

Fixes: 7ae34e7778 ("dm cache: improve discard support")
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-12-01 11:30:10 -05:00
Joe Thornber 2572629a13 dm cache: fix some issues with the new discard range support
Commit 7ae34e777 ("dm cache: improve discard support") needed to also:
- discontinue having DM core split the discard bios on cache block
  boundaries
- calculate the cache's discard_nr_blocks relative to the determined
  discard_block_size rather than using oblock_to_dblock()

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-12-01 11:30:09 -05:00
Joe Thornber 8001e87d0e dm array: if resizing the array is a noop set the new root to the old one
This could've been quite bad (to return success but not update the new
root to point at the old) but in practice the only known consumer of the
dm array code is the DM cache target.  And the DM cache target passes in
the same old root to array_resize() anyway.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-12-01 11:30:07 -05:00
Gu Zheng 18c0b223cf md: use generic io stats accounting functions to simplify io stat accounting
Use generic io stats accounting help functions (generic_{start,end}_io_acct)
to simplify io stat accounting.

Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-11-24 08:05:16 -07:00
Gu Zheng aae4933da9 md/bcache: use generic io stats accounting functions to simplify io stat accounting
Use generic io stats accounting help functions (generic_{start,end}_io_acct)
to simplify io stat accounting.

Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Acked-by: Kent Overstreet <kmo@datera.io>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-11-24 08:05:12 -07:00
Eric Dumazet a12f5d48bd dm: use rcu_dereference_protected instead of rcu_dereference
rcu_dereference() should be used in sections protected by rcu_read_lock.

For writers, holding some kind of mutex or lock,
rcu_dereference_protected() is the way to go, adding explicit lockdep
bits.

In __unbind(), we are the last user of this mapped device, so can use
the constant '1' instead of a lockdep_is_held(), not consistent with
other uses of rcu_dereference_protected() which use md->suspend_lock
mutex.

Reported-by: Kirill A. Shutemov <kirill@shutemov.name>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Fixes: 33423974bf ("dm: Use rcu_dereference() for accessing rcu pointer")
Cc: Pranith Kumar <bobby.prani@gmail.com>
[snitzer: allow lines longer than 80 columns, refine subject]
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-11-23 20:32:45 -05:00
Mike Snitzer d200c30ef0 dm thin: fix pool_io_hints to avoid looking at max_hw_sectors
Simplify the pool_io_hints code that works to establish a max_sectors
value that is a power-of-2 factor of the thin-pool's blocksize.  The
biggest associated improvement is that the DM thin-pool is no longer
concerning itself with the data device's max_hw_sectors when adjusting
max_sectors.

This fixes the relative fragility of the original "dm thin: adjust
max_sectors_kb based on thinp blocksize" commit that only became
apparent when testing was performed using a DM thin-pool ontop of a
virtio_blk device.  One proposed upstream patch detailed the problems
inherent in virtio_blk: https://lkml.org/lkml/2014/11/20/611

So even though virtio_blk incorrectly set its max_hw_sectors it actually
helped make it clear that we need DM thinp to be tolerant of any future
Linux driver that incorrectly sets max_hw_sectors.

We only need to be concerned with modifying the thin-pool device's
max_sectors limit if it is smaller than the thin-pool's blocksize.  In
this case the value of max_sectors does become a limiting factor when
upper layers (e.g. filesystems) construct their bios.  But if the
hardware can support IOs larger than the thin-pool's blocksize the user
is encouraged to adjust the thin-pool's data device's max_sectors
accordingly -- doing so will enable the thin-pool to inherit the
established user-defined max_sectors.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-11-21 12:54:23 -05:00
Mike Snitzer 583024d248 dm thin: suspend/resume active thin devices when reloading thin-pool
Before this change it was expected that userspace would first suspend
all active thin devices, reload/resize the thin-pool target, then resume
all active thin devices.  Now the thin-pool suspend/resume will trigger
the suspend/resume of all active thins via appropriate calls to
dm_internal_suspend and dm_internal_resume.

Store the mapped_device for each thin device in struct thin_c to make
these calls possible.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
2014-11-19 12:34:08 -05:00
Mike Snitzer ffcc393641 dm: enhance internal suspend and resume interface
Rename dm_internal_{suspend,resume} to dm_internal_{suspend,resume}_fast
-- dm-stats will continue using these methods to avoid all the extra
suspend/resume logic that is not needed in order to quickly flush IO.

Introduce dm_internal_suspend_noflush() variant that actually calls the
mapped_device's target callbacks -- otherwise target-specific hooks are
avoided (e.g. dm-thin's thin_presuspend and thin_postsuspend).  Common
code between dm_internal_{suspend_noflush,resume} and
dm_{suspend,resume} was factored out as __dm_{suspend,resume}.

Update dm_internal_{suspend_noflush,resume} to always take and release
the mapped_device's suspend_lock.  Also update dm_{suspend,resume} to be
aware of potential for DM_INTERNAL_SUSPEND_FLAG to be set and respond
accordingly by interruptibly waiting for the DM_INTERNAL_SUSPEND_FLAG to
be cleared.  Add lockdep annotation to dm_suspend() and dm_resume().

The existing DM_SUSPEND_FLAG remains unchanged.
DM_INTERNAL_SUSPEND_FLAG is set by dm_internal_suspend_noflush() and
cleared by dm_internal_resume().

Both DM_SUSPEND_FLAG and DM_INTERNAL_SUSPEND_FLAG may be set if a device
was already suspended when dm_internal_suspend_noflush() was called --
this can be thought of as a "nested suspend".  A "nested suspend" can
occur with legacy userspace dm-thin code that might suspend all active
thin volumes before suspending the pool for resize.

But otherwise, in the normal dm-thin-pool suspend case moving forward:
the thin-pool will have DM_SUSPEND_FLAG set and all active thins from
that thin-pool will have DM_INTERNAL_SUSPEND_FLAG set.

Also add DM_INTERNAL_SUSPEND_FLAG to status report.  This new
DM_INTERNAL_SUSPEND_FLAG state is being reported to assist with
debugging (e.g. 'dmsetup info' will report an internally suspended
device accordingly).

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
2014-11-19 12:31:17 -05:00
Mike Snitzer 80e96c5484 dm thin: do not allow thin device activation while pool is suspended
Otherwise IO could be issued to the pool while it is suspended.

Care was taken to properly interlock between the thin and thin-pool
targets when accessing the pool's 'suspended' flag.  The thin_ctr will
not add a new thin device to the pool's active_thins list if the pool is
susepended.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
2014-11-19 11:25:36 -05:00
Mike Snitzer d67ee213fa dm: add presuspend_undo hook to target_type
The DM thin-pool target now must undo the changes performed during
pool_presuspend() so introduce presuspend_undo hook in target_type.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
2014-11-19 11:24:59 -05:00
Mike Snitzer 4d341d8216 dm: return earlier from dm_blk_ioctl if target doesn't implement .ioctl
No point checking if the device is suspended if the current target
doesn't even implement .ioctl

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-11-19 11:24:56 -05:00
Linus Torvalds 0fbae13642 One fix for md for 3.18.
This fixes a regression introduced in 3.13.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIVAwUAVGkmSjnsnt1WYoG5AQKu8xAAkRYFm9FDLIU5W4AVnNhtsHfWBtNphi1X
 myHi75jRO5XUVVAYZ2x7EsGBvjDC3iOmkB++b0qLJ4MCf3yq07P2Y6osd+5pq1gD
 XOzGefzi9kXkF+7KGKNrQY+xN++q5jcqMWtiSa+ef2j5YqGt06tqgvFz9YtBozrF
 TEbe73yAVsFdy8XAGi3Z3ceYLaECbTjXRIMhksBqX4YByNVM9N7XT5Gk5L5ykYya
 90EV5nDrfQPTicsL5/8Nb9qKczoRT7I6yubNgUpazdd8g3+wWJycew7I+CiVb44I
 wGQbh5FaJT9KkTYrfkNOwT6N+fAEj4y9GxVMvSW80tmk9VKpv2MkCGrdwwWxp0/q
 XXI3hSIRjBszvkMlLCANg7VFFvNeehVhYrn1ml3fGiZ548STdsCVewP7cOwhuQFp
 f3dniAj49zw9GxZNopLkIfI+HmNZCOTf+E5U1nLOKZKOKpsw9ksNJrvZV8ZZxMkK
 gZRAJwsd64Mob2ClRII9ZKzdRwygN1pDdtS5pa+rvzdRQplE4Flg4Ipv9w+5lsQh
 346ijrxim11NpO/nRV0pXDNDudMzpF0cJvzxMo5uTTsX+eLUBbsdm/qmb2rEAxM7
 JDdW8b7Vluz8fxq7+0Lc1O31CcEGJlBACtdRAXWIAhLZwIaps8+tn+yAjyMEb73H
 jBJ9UAfmdCU=
 =Fs49
 -----END PGP SIGNATURE-----

Merge tag 'md/3.18-fix' of git://neil.brown.name/md

Pull md bugfix from Neil Brown:
 "One fix for md for 3.18.

  This fixes a regression introduced in 3.13"

* tag 'md/3.18-fix' of git://neil.brown.name/md:
  md: Always set RECOVERY_NEEDED when clearing RECOVERY_FROZEN
2014-11-16 15:34:31 -08:00
NeilBrown 45eaf45dfa md: Always set RECOVERY_NEEDED when clearing RECOVERY_FROZEN
md_check_recovery will skip any recovery and also clear
MD_RECOVERY_NEEDED if MD_RECOVERY_FROZEN is set.
So when we clear _FROZEN, we must set _NEEDED and ensure that
md_check_recovery gets run.
Otherwise we could miss out on something that is needed.

In particular, this can make it impossible to remove a
failed device from an array is the  'recovery-needed' processing
didn't happen.
Suitable for stable kernels since 3.13.

Cc: stable@vger.kernel.org (3.13+)
Reported-and-tested-by: Joe Lawrence <joe.lawrence@stratus.com>
Fixes: 30b8feb730
Signed-off-by: NeilBrown <neilb@suse.de>
2014-11-17 09:17:46 +11:00
Linus Torvalds 5a7a662cc6 . stable fix for dm-thin that avoids normal IO racing with discard
. stable fix for a dm-cache related bug in dm-btree walking code that
   results from using very large fast device (e.g. 4T) with a very small
   cache blocksize (e.g. 32K) -- this is a very uncommon configuration
 
 . a couple fixes for dm-raid (one for stable and the other addresses a
   crash in 3.18-rc1 code)
 
 . stable fix for dm-thinp that addresses a very rare dm-bufio bug having
   to do with memory reclaimation (via shrinker) when using dm-thinp
   ontop of loopback devices
 
 . fix a leak in dm-stripe target constructor's error path
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJUY/p7AAoJEMUj8QotnQNaxPEIAJsmJC5ujQAIdm5yUxsOWruU
 Y/36HbPvlmV8fgWqGyjaubBrzqgWry/yW/u/Sv9+9rE3Zh6JSVLVrCA6uZZ3Yr+j
 HKYEPjm/O0zVJepfEDKtjG6dxeaql47+luwU1iP1bAYeZE3zmKn1oFT2GW5gTbxO
 2n3MiN/dyX8v0cTw6r0O69luIAu93CSY0XDk+1ynfKlKKVmgcAUPvKuobF+yHXoF
 Rd7KTqFoK6HgRhdUHvUQnCGDandZ9MHjt3oW9p3dv3ezvW1cNUARoVHMRGG6Awfu
 WZkQ/VORDeaJT+bhjGfPIla1HbgxEKJrgzTUlpj+P6K2uPK2f6ECEyBpDLWKy9g=
 =lkSu
 -----END PGP SIGNATURE-----

Merge tag 'dm-3.18-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull device mapper fixes from Mike Snitzer:

 - stable fix for dm-thin that avoids normal IO racing with discard

 - stable fix for a dm-cache related bug in dm-btree walking code that
   results from using very large fast device (eg 4T) with a very small
   cache blocksize (eg 32K) -- this is a very uncommon configuration

 - a couple fixes for dm-raid (one for stable and the other addresses a
   crash in 3.18-rc1 code)

 - stable fix for dm-thinp that addresses a very rare dm-bufio bug
   having to do with memory reclaimation (via shrinker) when using
   dm-thinp ontop of loopback devices

 - fix a leak in dm-stripe target constructor's error path

* tag 'dm-3.18-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
  dm btree: fix a recursion depth bug in btree walking code
  dm thin: grab a virtual cell before looking up the mapping
  dm raid: fix inaccessible superblocks causing oops in configure_discard_support
  dm raid: ensure superblock's size matches device's logical block size
  dm bufio: change __GFP_IO to __GFP_FS in shrinker callbacks
  dm stripe: fix potential for leak in stripe_ctr error path
2014-11-13 09:19:20 -08:00
Mike Snitzer 5ec02084f6 dm thin: remove stale 'trim' message in block comment above pool_message
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-11-12 20:15:05 -05:00
Mikulas Patocka 17181fb7a0 dm thin: fix a race in thin_dtr
As long as struct thin_c is in the list, anyone can grab a reference of
it.  Consequently, we must wait for the reference count to drop to zero
*after* we remove the structure from the list, not before.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-11-12 20:15:04 -05:00
Joe Thornber d1d9220cba dm cache: emit a warning message if there are a lot of cache blocks
Loading and saving millions of block mappings takes time.  We may as
well explain what's going on, and encourage people to use a larger
cache block size.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-11-12 20:14:59 -05:00
Joe Thornber 7ae34e7778 dm cache: improve discard support
Safely allow the discard blocksize to be larger than the cache blocksize
by using the bio prison's range locking support.  This also improves
discard performance considerly because larger discards are issued to the
dm-cache device.  The discard blocksize was always intended to be
greater than the cache blocksize.  But until now it wasn't implemented
safely.

Also, by safely restoring the ability to have discard blocksize larger
than cache blocksize we're able to significantly reduce the memory used
for the cache's discard bitset.  Before, with a small discard blocksize,
the discard bitset could get quite large because its size is a function
of the discard blocksize and the origin device's size.  For example,
previously, using a 32KB cache blocksize with a 40TB origin resulted in
1280MB of incore memory use for the discard bitset!  Now, the discard
blocksize is scaled up accordingly to ensure the discard bitset is
capped at 2**14 bits, or 16KB.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-11-10 15:25:30 -05:00
Joe Thornber 08b184514f dm cache: revert "prevent corruption caused by discard_block_size > cache_block_size"
This reverts commit d132cc6d9e because we
actually do want to allow the discard blocksize to be larger than the
cache blocksize.  Further dm-cache discard changes will make this
possible.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-11-10 15:25:30 -05:00
Joe Thornber 1bad9bc4ee dm cache: revert "remove remainder of distinct discard block size"
This reverts commit 64ab346a36 because we
actually do want to allow the discard blocksize to be larger than the
cache blocksize.  Further dm-cache discard changes will make this
possible.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-11-10 15:25:30 -05:00
Joe Thornber 5f274d8865 dm bio prison: introduce support for locking ranges of blocks
Ranges will be placed in the same cell if they overlap.

Range locking is a prerequisite for more efficient multi-block discard
support in both the cache and thin-provisioning targets.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-11-10 15:25:30 -05:00
Mike Snitzer f1afb36a61 dm cache policy mq: simplify ability to promote sequential IO to the cache
Before, if the user wanted sequential IO to be promoted to the cache
they'd have to set sequential_threshold to some nebulous large value.

Now, the user may easily disable sequential IO detection (and sequential
IO's implicit bypass of the cache) by setting sequential_threshold to 0.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-11-10 15:25:30 -05:00
Joe Thornber b155aa0e5a dm cache policy mq: tweak algorithm that decides when to promote a block
Rather than maintaining a separate promote_threshold variable that we
periodically update we now use the hit count of the oldest clean
block.  Also add a fudge factor to discourage demoting dirty blocks.

With some tests this has a sizeable difference, because the old code
was too eager to demote blocks.  For example, device-mapper-test-suite's
git_extract_cache_quick test goes from taking 190 seconds, to 142
(linear on spindle takes 250).

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-11-10 15:25:29 -05:00
Hannes Reinecke 41abc4e1af dm: do not call dm_sync_table() when creating new devices
When creating new devices dm_sync_table() calls
synchronize_rcu_expedited(), causing _all_ pending RCU pointers to be
flushed. This causes a latency overhead that is especially noticeable
when creating lots of devices.

And all of this is pointless as there are no old maps to be
disconnected, and hence no stale pointers which would need to be
cleared up.

Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-11-10 15:25:29 -05:00
Pranith Kumar 6fa9952097 dm: sparse: Annotate field with __rcu for checking
Annotate the map field with __rcu since this is a rcu pointer which is checked
by sparse.

Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-11-10 15:25:29 -05:00
Pranith Kumar 33423974bf dm: Use rcu_dereference() for accessing rcu pointer
The map field in 'struct mapped_device' is an rcu pointer. Use rcu_dereference()
while accessing it.

Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-11-10 15:25:29 -05:00
Mike Snitzer 42d6a8ce3c dm thin: refactor requeue_io to eliminate spinlock bouncing
Also refactor some other bio_list erroring helpers.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-11-10 15:25:29 -05:00
Mike Snitzer 9d094eebd7 dm thin: optimize retry_bios_on_resume
Eliminate redundant should_error_unserviceable_bio check and error
loop.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-11-10 15:25:28 -05:00
Joe Thornber ac4c3f34a9 dm thin: sort the deferred cells
Sort the cells in logical block order before processing each cell in
process_thin_deferred_cells().  This significantly improves the ondisk
layout on rotational storage, whereby improving read performance.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-11-10 15:25:28 -05:00
Joe Thornber 23ca2bb6c6 dm thin: direct dispatch when breaking sharing
This use of direct submission in process_shared_bio() reduces latency
for submitting bios in the shared cell by avoiding adding those bios to
the deferred list and waiting for the next iteration of the worker.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-11-10 15:25:28 -05:00
Joe Thornber 2d759a46b4 dm thin: remap the bios in a cell immediately
This use of direct submission in process_prepared_mapping() reduces
latency for submitting bios in a cell by avoiding adding those bios to
the deferred list and waiting for the next iteration of the worker.

But this direct submission exposes the potential for a race between
releasing a cell and incrementing deferred set.  Fix this by introducing
dm_cell_visit_release() and refactoring inc_remap_and_issue_cell()
accordingly.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-11-10 15:25:28 -05:00
Joe Thornber a374bb217b dm thin: defer whole cells rather than individual bios
This avoids dropping the cell, so increases the probability that other
bios will collect within the cell, rather than being passed individually
to the worker.

Also add required process_cell and process_discard_cell error handling
wrappers and set associated pool-mode function pointers accordingly.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-11-10 15:25:28 -05:00
Mike Snitzer 452d7a620d dm thin: factor out remap_and_issue_overwrite
Purely cleanup of duplicated code, no functional change.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-11-10 15:25:28 -05:00
Joe Thornber 7a7e97ca58 dm thin: performance improvement to discard processing
When processing a discard bio, if the block is already quiesced do the
discard immediately rather than adding the mapping to a list for the
next iteration of the worker thread.

Discarding a fully provisioned 100G thin volume with 64k block size goes
from 860s to 95s with this change.

Clearly there's something wrong with the worker architecture, more
investigation needed.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-11-10 15:25:27 -05:00
Mike Snitzer 36f12aeb71 dm thin: implement thin_merge
Introduce thin_merge so that any additional constraints from the data
volume may be taken into account when determing the maximum number of
sectors that can be issued relative to the specified logical offset.

This is particularly important if/when the data volume is layered ontop
of a more sophisticated device (e.g. dm-raid or some other DM target).

Reviewed-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-11-10 15:25:27 -05:00
Mike Snitzer 148e51baf8 dm: improve documentation and code clarity in dm_merge_bvec
These code changes do not introduce a functional change.

But bio_add_page() will never attempt to build up a bio larger than
queue_max_sectors().  Similarly, bio_get_nr_vecs() is also bound by
queue_max_sectors().  Therefore, there is no point in allowing
dm_merge_bvec() to answer "how many sectors can a bio have at this
offset?" with anything larger than queue_max_sectors().  Using
queue_max_sectors() rather than BIO_MAX_SECTORS serves to more
accurately convey the limits that are being imposed.

Also, use unlikely() to clarify the fact that the defensive code in
dm_merge_bvec() relative to max_size going negative shouldn't ever
happen -- if it does happen there is a bug in the block layer for
requesting larger than dm_merge_bvec()'s initial response for a given
offset.  Also, update a comment in dm_merge_bvec() relative to
max_hw_sectors_kb.  And fix empty newline whitespace.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-11-10 15:25:27 -05:00
Mike Snitzer 604ea90641 dm thin: adjust max_sectors_kb based on thinp blocksize
Allows for filesystems to submit bios that are a factor of the thinp
blocksize, improving dm-thinp efficiency (particularly when the data
volume is RAID).

Also set io_min to max_sectors_kb if it is a factor of the thinp
blocksize.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-11-10 15:25:27 -05:00
Joe Thornber 7d327fe051 dm thin: throttle incoming IO
Throttle IO based on the time it's taking the worker to do one loop.
There were reports of hung task timeouts occuring and it was observed
that the excessively long avgqu-sz (as reported by iostat) was
contributing to these hung tasks.

Throttling definitely helps dm-thinp perform better under heavy IO load
(without being detremental by being overzealous).  It reduces avgqu-sz
drastically, e.g.: from 60K to ~6K, and even as low as 150 once metadata
is cached by bufio, when dirty_ratio=5, dirty_background_ratio=2.  And
avgqu-sz stays at or below 30K even with dirty_ratio=20,
dirty_background_ratio=10.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-11-10 15:25:27 -05:00
Joe Thornber 8a01a6af75 dm thin: prefetch missing metadata pages
Prefetch metadata at the start of the worker thread and then again every
128th bio processed from the deferred list.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-11-10 15:25:27 -05:00
Joe Thornber 4646015d7e dm transaction manager: add support for prefetching blocks of metadata
Introduce the dm_tm_issue_prefetches interface.  If you're using a
non-blocking clone the tm will build up a list of requested blocks that
weren't in core.  dm_tm_issue_prefetches will request those blocks to be
prefetched.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-11-10 15:25:26 -05:00
Joe Thornber e5cfc69a51 dm thin metadata: change dm_thin_find_block to allow blocking, but not issuing, IO
This change is a prerequisite for allowing metadata to be prefetched.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-11-10 15:25:26 -05:00
Joe Thornber a195db2d29 dm bio prison: switch to using a red black tree
Previously it was using a fixed sized hash table.  There are times
when very many concurrent cells are held (such as when processing a very
large discard).  When this happens the hash table performance becomes
very poor.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-11-10 15:25:26 -05:00
Joe Thornber 33096a7822 dm bufio: evict buffers that are past the max age but retain some buffers
These changes help keep metadata backed by dm-bufio in-core longer which
fixes reports of metadata churn in the face of heavy random IO workloads.

Before, bufio evicted all buffers older than DM_BUFIO_DEFAULT_AGE_SECS.
Having a device (e.g. dm-thinp or dm-cache) lose all metadata just
because associated buffers had been idle for some time is unfriendly.

Now, the user may now configure the number of bytes that bufio retains
using the 'retain_bytes' module parameter.  The default is 256K.

Also, the DM_BUFIO_WORK_TIMER_SECS and DM_BUFIO_DEFAULT_AGE_SECS
defaults were quite low so increase them (to 30 and 300 respectively).

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-11-10 15:25:26 -05:00
Joe Thornber 4e420c452b dm bufio: switch from a huge hash table to an rbtree
Converting over to using an rbtree eliminates a fixed 8MB allocation
from vmalloc space for the hash table.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-11-10 15:25:26 -05:00
Joe Thornber 9b460d3699 dm btree: fix a recursion depth bug in btree walking code
The walk code was using a 'ro_spine' to hold it's locked btree nodes.
But this data structure is designed for the rolling lock scheme, and
as such automatically unlocks blocks that are two steps up the call
chain.  This is not suitable for the simple recursive walk algorithm,
which retraces its steps.

This code is only used by the persistent array code, which in turn is
only used by dm-cache.  In order to trigger it you need to have a
mapping tree that is more than 2 levels deep; which equates to 8-16
million cache blocks.  For instance a 4T ssd with a very small block
size of 32k only just triggers this bug.

The fix just places the locked blocks on the stack, and stops using
the ro_spine altogether.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
2014-11-10 15:23:58 -05:00
Joe Thornber c822ed967c dm thin: grab a virtual cell before looking up the mapping
Avoids normal IO racing with discard.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
2014-11-04 13:05:53 -05:00
Heinz Mauelshagen d20c4b08be dm raid: fix inaccessible superblocks causing oops in configure_discard_support
Commit 48cf06bc5f ("dm raid: add discard support for RAID levels 4, 5
and 6") did not properly handle missing metadata device(s).  A failing
read of the superblock causes the metadata and data devices to be
removed from the dev array in struct raid_set, setting references to
both devices to NULL.  configure_discard_support() nonetheless tries to
access the data dev unconditionally causing an oops.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-10-29 14:53:27 -04:00
Heinz Mauelshagen 40d43c4b4c dm raid: ensure superblock's size matches device's logical block size
The dm-raid superblock (struct dm_raid_superblock) is padded to 512
bytes and that size is being used to read it in from the metadata
device into one preallocated page.

Reading or writing this on a 512-byte sector device works fine but on
a 4096-byte sector device this fails.

Set the dm-raid superblock's size to the logical block size of the
metadata device, because IO at that size is guaranteed too work.  Also
add a size check to avoid silent partial metadata loss in case the
superblock should ever grow past the logical block size or PAGE_SIZE.

[includes pointer math fix from Dan Carpenter]
Reported-by: "Liuhua Wang" <lwang@suse.com>
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
2014-10-21 09:32:15 -04:00
Linus Torvalds 929254d8da . fix DM's long-standing excessive use of memory by leveraging the new
bioset_create_nobvec() interface when creating the DM's bioset
 
 . fix a few bugs in dm-bufio and dm-log-userspace
 
 . add DM core support for a DM multipath use-case that requires loading
   DM tables that contain devices that have failed (by allowing active
   and inactive DM tables to share dm_devs)
 
 . add discard support to the DM raid target; like MD raid456 the user
   must opt-in to raid456 discard support be specifying the
   devices_handle_discard_safely=Y module param.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJUQWcdAAoJEMUj8QotnQNaHaQH/RD0xf54AnaQ0tEGuNQXFwtx
 Gc/3s+VEcKlmTvk9nm2FWNvVagPn8uBQ0O2eid4UJk9AyfPJnPwGUoVqxbKhKK9i
 G5/O5s8opLlItk14h/btw/zB8RNC1bg8NGnBrGYDudiwHm+Gv4jlnHErp2JMHv9F
 nonb+QoG23wlEJkBafzBNYhthkNDq1ZFrDjhqG7dNySkXh8VZAW8VcZ/ZfskkhOa
 C8CDl3TKL1BBJHQKesvqHQbCSqh8Ujzs63bLA3heaSMExkhmUgdfpnbHK4hzPNJP
 rtmVEW57mVI+O5Cfva1p9RClT5EjiO+5VufHkpRJSIsfsH5PMaQ7vW8gKmwd5JA=
 =z+Yz
 -----END PGP SIGNATURE-----

Merge tag 'dm-3.18' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull device-mapper updates from Mike Snitzer:
 "I rebased the DM tree ontop of linux-block.git's 'for-3.18/core' at
  the beginning of October because DM core now depends on the newly
  introduced bioset_create_nobvec() interface.

  Summary:

   - fix DM's long-standing excessive use of memory by leveraging the
     new bioset_create_nobvec() interface when creating the DM's bioset

   - fix a few bugs in dm-bufio and dm-log-userspace

   - add DM core support for a DM multipath use-case that requires
     loading DM tables that contain devices that have failed (by
     allowing active and inactive DM tables to share dm_devs)

   - add discard support to the DM raid target; like MD raid456 the user
     must opt-in to raid456 discard support be specifying the
     devices_handle_discard_safely=Y module param"

* tag 'dm-3.18' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
  dm log userspace: fix memory leak in dm_ulog_tfr_init failure path
  dm bufio: when done scanning return from __scan immediately
  dm bufio: update last_accessed when relinking a buffer
  dm raid: add discard support for RAID levels 4, 5 and 6
  dm raid: add discard support for RAID levels 1 and 10
  dm: allow active and inactive tables to share dm_devs
  dm mpath: stop queueing IO when no valid paths exist
  dm: use bioset_create_nobvec()
  dm: remove nr_iovecs parameter from alloc_tio()
2014-10-18 12:25:30 -07:00
Linus Torvalds e75437fb93 Merge branch 'for-3.18/drivers' of git://git.kernel.dk/linux-block
Pull block layer driver update from Jens Axboe:
 "This is the block driver pull request for 3.18.  Not a lot in there
  this round, and nothing earth shattering.

   - A round of drbd fixes from the linbit team, and an improvement in
     asender performance.

   - Removal of deprecated (and unused) IRQF_DISABLED flag in rsxx and
     hd from Michael Opdenacker.

   - Disable entropy collection from flash devices by default, from Mike
     Snitzer.

   - A small collection of xen blkfront/back fixes from Roger Pau Monné
     and Vitaly Kuznetsov"

* 'for-3.18/drivers' of git://git.kernel.dk/linux-block:
  block: disable entropy contributions for nonrot devices
  xen, blkfront: factor out flush-related checks from do_blkif_request()
  xen-blkback: fix leak on grant map error path
  xen/blkback: unmap all persistent grants when frontend gets disconnected
  rsxx: Remove deprecated IRQF_DISABLED
  block: hd: remove deprecated IRQF_DISABLED
  drbd: use RB_DECLARE_CALLBACKS() to define augment callbacks
  drbd: compute the end before rb_insert_augmented()
  drbd: Add missing newline in resync progress display in /proc/drbd
  drbd: reduce lock contention in drbd_worker
  drbd: Improve asender performance
  drbd: Get rid of the WORK_PENDING macro
  drbd: Get rid of the __no_warn and __cond_lock macros
  drbd: Avoid inconsistent locking warning
  drbd: Remove superfluous newline from "resync_extents" debugfs entry.
  drbd: Use consistent names for all the bi_end_io callbacks
  drbd: Use better variable names
2014-10-18 12:12:45 -07:00
Linus Torvalds 88ed806abb md updates for 3.18
- a few minor bug fixes
 - quite a lot of code tidy-up and simplification
 - remove PRINT_RAID_DEBUG ioctl.  I'm fairly sure
   it is unused, and it isn't particularly useful.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQIVAwUAVD9k1jnsnt1WYoG5AQKCaw/9F7jE9PDYtEfJ8ShEWMM0CWNsCKmgqfpV
 i4RaeVKe1IoA5JOurYk+wvdbWSXGfz5XQ9GX8ptRl9X7ZoG4aJ65v9GHpBamPLrc
 mB2Lz3zR9AZVYrMDeZym9cSZ6FpZNzzXpJE2O2RslXq3gFI03MObyM8xyeh8ybOD
 45nhH+CJ17OFNn5OzzFLhYEAoYDeOS97zAwInWeFlUp14Jl403xnZ3srF2YJ78TR
 PjcCpxo1IhGEnYE8rDYqH/UjDPzEfAdYrqM5k3NEnuPiqn+KxYNSsbAQGdeMrGUc
 DO0H8dt6U1U2tq/t/qN8n01uQ7AJ3S3JrTsQxSW/UC1SVfgpztK/a78eA/YSy/zs
 iZzPP7CpLfF4T945jaQionevZOBFRM+gbrMqeoQTPO2QtfrSGe4Awoht7Z+no3RR
 dCX0ScO16kHkcAcSXXGZkGtC1DcteEwUfufSdako12exo1k3efc98DsyMw2VfzOM
 EJcQD1JGYVW+czZM58EEue92TT5jvWnhU5s3PEUMTZrDgSWwTVQC3oNCgDGFKI4X
 eebpWlG3gEjNrnL5givbBwC2LCfI59R70gpnGhavdKtt9AtpfsMJnj8E3cCqHE9I
 xR6YPF161KSmKGOG47RK/VJJnq5SmZbxxeShT101uq3SVeqImit6ql3JfAM9HoMD
 RI2iWG9yma4=
 =2QEJ
 -----END PGP SIGNATURE-----

Merge tag 'md/3.18' of git://neil.brown.name/md

Pull md updates from Neil Brown:
 - a few minor bug fixes
 - quite a lot of code tidy-up and simplification
 - remove PRINT_RAID_DEBUG ioctl.  I'm fairly sure it is unused, and it
   isn't particularly useful.

* tag 'md/3.18' of git://neil.brown.name/md: (21 commits)
  lib/raid6: Add log level to printks
  md: move EXPORT_SYMBOL to after function in md.c
  md: discard PRINT_RAID_DEBUG ioctl
  md: remove MD_BUG()
  md: clean up 'exit' labels in md_ioctl().
  md: remove unnecessary test for MD_MAJOR in md_ioctl()
  md: don't allow "-sync" to be set for device in an active array.
  md: remove unwanted white space from md.c
  md: don't start resync thread directly from md thread.
  md: Just use RCU when checking for overlap between arrays.
  md: avoid potential long delay under pers_lock
  md: simplify export_array()
  md: discard find_rdev_nr in favour of find_rdev_nr_rcu
  md: use wait_event() to simplify md_super_wait()
  md: be more relaxed about stopping an array which isn't started.
  md/raid1: process_checks doesn't use its return value.
  md/raid5: fix init_stripe() inconsistencies
  md/raid10: another memory leak due to reshape.
  md: use set_bit/clear_bit instead of shift/mask for bi_flags changes.
  md/raid1: minor typos and reformatting.
  ...
2014-10-18 11:39:52 -07:00
Mikulas Patocka 9d28eb1244 dm bufio: change __GFP_IO to __GFP_FS in shrinker callbacks
The shrinker uses gfp flags to indicate what kind of operation can the
driver wait for. If __GFP_IO flag is present, the driver can wait for
block I/O operations, if __GFP_FS flag is present, the driver can wait on
operations involving the filesystem.

dm-bufio tested for __GFP_IO. However, dm-bufio can run on a loop block
device that makes calls into the filesystem. If __GFP_IO is present and
__GFP_FS isn't, dm-bufio could still block on filesystem operations if it
runs on a loop block device.

The change from __GFP_IO to __GFP_FS supposedly fixes one observed (though
unreproducible) deadlock involving dm-bufio and loop device.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
2014-10-17 01:40:23 -04:00
Linus Torvalds 0429fbc0bd Merge branch 'for-3.18-consistent-ops' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu
Pull percpu consistent-ops changes from Tejun Heo:
 "Way back, before the current percpu allocator was implemented, static
  and dynamic percpu memory areas were allocated and handled separately
  and had their own accessors.  The distinction has been gone for many
  years now; however, the now duplicate two sets of accessors remained
  with the pointer based ones - this_cpu_*() - evolving various other
  operations over time.  During the process, we also accumulated other
  inconsistent operations.

  This pull request contains Christoph's patches to clean up the
  duplicate accessor situation.  __get_cpu_var() uses are replaced with
  with this_cpu_ptr() and __this_cpu_ptr() with raw_cpu_ptr().

  Unfortunately, the former sometimes is tricky thanks to C being a bit
  messy with the distinction between lvalues and pointers, which led to
  a rather ugly solution for cpumask_var_t involving the introduction of
  this_cpu_cpumask_var_ptr().

  This converts most of the uses but not all.  Christoph will follow up
  with the remaining conversions in this merge window and hopefully
  remove the obsolete accessors"

* 'for-3.18-consistent-ops' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: (38 commits)
  irqchip: Properly fetch the per cpu offset
  percpu: Resolve ambiguities in __get_cpu_var/cpumask_var_t -fix
  ia64: sn_nodepda cannot be assigned to after this_cpu conversion. Use __this_cpu_write.
  percpu: Resolve ambiguities in __get_cpu_var/cpumask_var_t
  Revert "powerpc: Replace __get_cpu_var uses"
  percpu: Remove __this_cpu_ptr
  clocksource: Replace __this_cpu_ptr with raw_cpu_ptr
  sparc: Replace __get_cpu_var uses
  avr32: Replace __get_cpu_var with __this_cpu_write
  blackfin: Replace __get_cpu_var uses
  tile: Use this_cpu_ptr() for hardware counters
  tile: Replace __get_cpu_var uses
  powerpc: Replace __get_cpu_var uses
  alpha: Replace __get_cpu_var
  ia64: Replace __get_cpu_var uses
  s390: cio driver &__get_cpu_var replacements
  s390: Replace __get_cpu_var uses
  mips: Replace __get_cpu_var uses
  MIPS: Replace __get_cpu_var uses in FPU emulator.
  arm: Replace __this_cpu_ptr with raw_cpu_ptr
  ...
2014-10-15 07:48:18 +02:00